Description	Windows & Linux	Mac
Show Keyboard Shortcut Reference	Alt+Shift+K	Option+Shift+K
Attempt completion / Indent	Tab or Ctrl+Space	Tab or Command+Space
Run current line/selection	Ctrl+Enter	Command+Enter
Comment/uncomment current line/selection	Ctrl+Shift+C	Command+Shift+C
Save active document	Ctrl+S	Command+S
Reindent lines	Ctrl+I	Command+I

Load package

library
- Load the namespace of the package with name package and attach it on the search list.
- Return error if package not found

library(argparser)
library("argparser")

require: Just like library but return warning if package not found

require(mansun)

## Loading required package: mansun

## Warning in library(package, lib.loc = lib.loc, character.only = TRUE,
## logical.return = TRUE, : there is no package called 'mansun'

Namespace

Sometimes functions in different packages may have the same name.

pkg::name: Export specific function in a package
pkg:::name: Export internal variable/function in a package. Don’t modify these items if you know what you are doing.

lubridate::date("2016-05-14") # access without load whole package

## [1] "2016-05-14"

library(lubridate, quietly = TRUE)

## 
## Attaching package: 'lubridate'

## The following object is masked from 'package:base':
## 
##     date

date("2016-05-14")  # convert date. base::date is masked

## [1] "2016-05-14"

base::date()    # return system data and time

## [1] "Mon May 16 10:29:24 2016"

Location of libraries

User specific library: R_LIBS_USER
- Libraries under user’s home directory
- Only can be accessed by the user in most situation.
Common library: libraries for all users. Need administrator permissoion.
- System library: R_HOME/library
  - Comes with R installation
- Site library:
  - Configured in R_HOME/etc/Rprofile.site

Sys.getenv(c("R_LIBS_USER", "R_HOME"))

##                           R_LIBS_USER 
## "~/R/x86_64-pc-linux-gnu-library/3.3" 
##                                R_HOME 
##                          "/usr/lib/R"

libPaths

Libraries in R are loaded in order:

.libPaths()

## [1] "/usr/local/lib/R/site-library" "/usr/lib/R/site-library"      
## [3] "/usr/lib/R/library"

You can also add an additional library path:

.libPaths(c("your/library/path", .libPaths()))

Or change the order of library search path:

lpath <- .libPaths()
.libPaths(lpath[c(length(lpath), 1:max(1, length(lpath)-1))])
.libPaths()

## [1] "/usr/lib/R/library"            "/usr/local/lib/R/site-library"
## [3] "/usr/lib/R/site-library"

Change Privilege(Windows)

Change Privilege(Linux/Mac)

Use chown and chmod to change owner and write permission of your install path.

chown: Change owner

chown mansun:mansun /usr/local/lib/R/site-library

chmod: Change permission of access

chmod u+w /usr/local/lib/R/site-library

Where can you find a R package

CRAN: Official R package
GitHub: The most popular web-based git repository hosting service
Bitbucket: Also a git repository hosting service
R-Forge: Another CRAN-like website to access package
Bioconductor: Specific for biostatistician

Install from CRAN

install.packages
- Download and install packages from CRAN-like repositories or from local files
- Type ?install.packages to get further information
- Packages are installed in the first element of .libpaths() if lib ins missing
- In RStudio: Tools -> Install Packages…
- Be careful of system dependency
  - Rtools(Windows), Java, XML, MySQL, …

Install multiple packages:

install.packages(c("argparser", "jsonlite"), 
    lib = "/usr/local/lib/R/site-library")

Install development version of data.table:

install.packages("data.table", type = "source",
    repos = "https://Rdatatable.github.io/data.table")

Install package with RStudio

Yun can even using RStudio to install a package into system library or site library.

Exercise

Try to install package argparser in system library or site library.

Install a local files

You can also install a local tarball:

install.packages("/tmp/Rtmp99mvkt/downloaded_packages/argparser_0.4.tar.gz", 
    lib = "/usr/local/lib/R/site-library", repo = NULL, type = "source")

Install from github

Using devtools::install_github can easily install any R package on GitHub:

To install devtools make sure you have:

Rtools on Windows
Xcode command line tools for Mac
r-devel or r-base-dev on Linux

Then:

install.packages("devtools")

Now you can install any package on github with single command:

devtools::install_github('Rdatatable/data.table')

Just take a look what happened when you execute above command.

References

Useful data I/O in R

Text:
- Plain text: writeLines, readLines, scan, …
- JSON: jsonlite::fromJSON, jsonlite::toJSON
Table:
- data.table::fwrite, data.table::fread, write.csv, read.csv, …
Binary file: writeBin, readBin, writeRDS, readRDS, …
Database:
- RSQLite, RODBC, RJDBC, RMySQL, …

Quoting in R

Beside standard characters, there are several escaped characters that have special meaning. For example:

\n: newline
\t: tab

A character with newline and tab:

greeting = "Hi!\n\tHow are you? \n\tI'm fine, thank you!"
cat(greeting)

## Hi!
##  How are you? 
##  I'm fine, thank you!

About Encoding

The most frustrating part for using R (in Windows)
Use Linux or Mac can avoid a lot of encoding issue of R
You may need to assign appropriate encoding when reading data
Useful functions to deal with encoding issues
- iconv: Convert a character vector between encodings
- iconvlist: list all available encoding
- Sys.getlocale: Get your system’s locale
- Encoding: Read or set the declared encodings for a character vector

Highlevel I/O for a text file

writeLines: Write text lines to a connection
readLines: Read text lines from a connection

writeLines(greeting, con = "dataio/greeting.txt")
readLines(con = "dataio/greeting.txt")

## [1] "Hi!"                    "\tHow are you? "       
## [3] "\tI'm fine, thank you!"

JSON

The most popular format to exchange unstructure data. For example:

{
  "teacher": "Mansun",
  "isFullTime": false,
  "hobby": ["Music", "Game"],
  "license": null,
  "lecture": [
    {
      "name": "BeyondRBasic",
      "date": "2016-05-14",
      "hour": 3
    },
    {
      "name": "RCrawler101-2",
      "date": "2016-01-09",
      "hour": 3
    }
  ]
}

jsonlite

Provide JSON parseing/generating utilities in R

fromJSON: Parse a JSON string into a R object
- name-value pair -> list
- JSON array -> vector, matrix or data.frame
toJSON: Convert a R object into JSON string

library(jsonlite)
intro = fromJSON("dataio/intro.json", simplifyVector = TRUE)
str(intro)

## List of 5
##  $ teacher   : chr "Mansun"
##  $ isFullTime: logi FALSE
##  $ hobby     : chr [1:2] "Music" "Game"
##  $ license   : NULL
##  $ lecture   :'data.frame':  2 obs. of  3 variables:
##   ..$ name: chr [1:2] "BeyondRBasic" "RCrawler101-2"
##   ..$ date: chr [1:2] "2016-05-14" "2016-01-09"
##   ..$ hour: int [1:2] 3 3

jsonlite - 2

intro$license = NULL
toJSON(intro, pretty = TRUE)  # convert a list as a pretty JSON

## {
##   "teacher": ["Mansun"],
##   "isFullTime": [false],
##   "hobby": ["Music", "Game"],
##   "lecture": [
##     {
##       "name": "BeyondRBasic",
##       "date": "2016-05-14",
##       "hour": 3
##     },
##     {
##       "name": "RCrawler101-2",
##       "date": "2016-01-09",
##       "hour": 3
##     }
##   ]
## }

Table

In most situation, you can use data.table::fwrite and data.table::fread to do such tasks much easier and faster.

Here is an example to read a file generated by excel on Windows(Traditional Chinese):

library(data.table)

## 
## Attaching package: 'data.table'

## The following objects are masked from 'package:lubridate':
## 
##     hour, mday, month, quarter, wday, week, yday, year

from_cp950 = fread("dataio/cp950.csv", encoding = "UTF-8")
from_cp950$name = iconv(from_cp950$name, from = "CP950", to = "UTF-8")
from_cp950

##      name age
## 1: 王先生  34
## 2: 白小姐  23

Table -2

Write the CP950 file into another file with UTF-8 encoding. Read the UTF-8 file again.

fwrite(from_cp950, "dataio/utf8.csv", sep = "\t") # default write as UTF-8
from_utf8 = fread("dataio/utf8.csv", encoding = "UTF-8")
from_utf8

##      name age
## 1: 王先生  34
## 2: 白小姐  23

identical(from_cp950, from_utf8)

## [1] TRUE

Database

In most production environment, you won’t load all of your company’s data into R
R works well with most of popular databases
- Database management
Relation databases are wonderful choices for analysts

A typical SQL query string:

SELECT <columns you want to retrieve>
FROM <table>
WHERE <comparison of columns>
GROUP BY <columns you want to compute the aggregation result>
ORDER BY <columns you want to order>
HAVING <conditions to filter the result of GROUP BY>

RSQLite

A light-weight database engine interface in R

PChome Example

References

Aims of this lecture

An overall introduction for most-used feature of data.table with minimum examples
A case study of using data.table for data-preprocessing tasks
Common mistakes when using data.table

What is data.table

An enhanced version of data.frames
Optimize most heavy works in C
Minimize redundant copies
Fast and friendly file I/O
Build-in data manipulation toolchain

Why data.table

Speed
Automatic optimization
Total solution for ETL
Concise syntax

Benchmarks

Benchmarks : Grouping

Installation

Here we use data.table v1.9.7

Install development version of data.table:

install.packages("data.table", type = "source",
    repos = "https://Rdatatable.github.io/data.table")

devtools::install_github("Rdatatable/data.table")

Load package:

library(data.table)

General Form

DT[i, j, by]

Compare with SQL:

R:	i	j	by
SQL:	where	select/update	group by

Say it loud:

Take DT, subset rows using i, then calculate j grouped by by.

Your first data.table

You can create a data.table just like the way you create a data.frame. It don’t convert character to factor by default.

library(data.table)
team = data.table(name = c("Wendy", "Alice", "James", "Bill"),
                  gender = c("f", "f", "M", "M"),
                  age = c(26, 24, 28, 32))

or convert a data.frame into data.table:

teamdf = data.frame(name = c("Wendy", "Alice", "James", "Bill"),
                    gender = c("f", "f", "M", "M"),
                    age = c(26, 24, 28, 32),
                    stringsAsFactors = FALSE)
teamdt = data.table(teamdf)
identical(team, teamdt)

## [1] TRUE

data.table is a data.frame, too

class(team)

## [1] "data.table" "data.frame"

You can use any functions that suitable for data.frame with data.table. Note that the behavior of [ operator is slightly different than data.frame.

team$gender

## [1] "f" "f" "M" "M"

team[[2]]

## [1] "f" "f" "M" "M"

team[2]

##     name gender age
## 1: Alice      f  24

teamdf$gender

## [1] "f" "f" "M" "M"

teamdf[[2]]

## [1] "f" "f" "M" "M"

teamdf[2]

##   gender
## 1      f
## 2      f
## 3      M
## 4      M

Subset with i

Use i to subset a data.table is similar to data.frame. Furthermore, expression is evaluated within the frame of the data.table

team[1, ]

##     name gender age
## 1: Wendy      f  26

team[c(FALSE, FALSE, FALSE, TRUE), ]

##    name gender age
## 1: Bill      M  32

team[name == "Wendy", ]

##     name gender age
## 1: Wendy      f  26

team[name %in% c("Wendy", "Alice"), ]

##     name gender age
## 1: Wendy      f  26
## 2: Alice      f  24

Select with j

. is an abbreviation of list within data.table
j accepts a list of expression to select columns

team[, .(gender)]

##    gender
## 1:      f
## 2:      f
## 3:      M
## 4:      M

team[1, .(sex = gender)]

##    sex
## 1:   f

team[1, .(name, gender)]

##     name gender
## 1: Wendy      f

Degeneration

If you don’ wrap a single expression with list in j, the output will degenerate into a vector.

team[, gender]

## [1] "f" "f" "M" "M"

If you want to know how old is Wendy:

team[name == "Wendy", age]

## [1] 26

With

Because j handle expressions, you need to set with = FALSE when slicing with logical, interger or character vector

team[1, c("gender")]    # not works

## [1] "gender"

team[1, c(TRUE, FALSE, TRUE), with = FALSE]

##     name age
## 1: Wendy  26

team[1, c(1, 3), with = FALSE]

##     name age
## 1: Wendy  26

team[1, c("name", "age"), with = FALSE]

##     name age
## 1: Wendy  26

Assignment by reference

:= add/remove/update subsets of column by reference.

LHS := RHS form

team[, id := 1:4]   # add a column 
team[name == "James", age := 38]  # UPDATE age for james
team[name == "Bill", c("name", "age") := .("Billy", 34)] # update 2 columns 
team[, id := NULL]  # remove id

Functional form

team[gender == "f", `:=`(gender = "F", age = age + 5)]
team

##     name gender age
## 1: Wendy      F  31
## 2: Alice      F  29
## 3: James      M  38
## 4: Billy      M  34

More calculation with j

Use a list of expression in j to compute with data.table

stat = team[, .(mean_age = mean(age), sd_age = sd(age))]
stat

##    mean_age  sd_age
## 1:       33 3.91578

Add a new column:

team[, mean_age := mean(age)]
team

##     name gender age mean_age
## 1: Wendy      F  31       33
## 2: Alice      F  29       33
## 3: James      M  38       33
## 4: Billy      M  34       33

Note that putting := inside .() doesn’t work:

stat = team[, .(mean_age := mean(age), sd_age := sd(age))] # not works

Group by with by

You can use by to summarize by each group. keyby will add key automatically.

Generate a data.table of mean and stand deviation by gender:

team[, .(mean_age = mean(age), sd_age = sd(age)), by = "gender"]

##    gender mean_age   sd_age
## 1:      F       30 1.414214
## 2:      M       36 2.828427

Add a new column of mean age by gender:

team[, gender_mean_age := mean(age), by = "gender"]
team

##     name gender age mean_age gender_mean_age
## 1: Wendy      F  31       33              30
## 2: Alice      F  29       33              30
## 3: James      M  38       33              36
## 4: Billy      M  34       33              36

.N

Returns the number of rows in the subset.
Can be used in both i and j

team[.N, ] # subset last column

##     name gender age mean_age gender_mean_age
## 1: Billy      M  34       33              36

team[, .(count = .N), by = gender]  # a new data.table, count by gender

##    gender count
## 1:      F     2
## 2:      M     2

team[, count := .N, by = gender]    # add a column in data.table
team

##     name gender age mean_age gender_mean_age count
## 1: Wendy      F  31       33              30     2
## 2: Alice      F  29       33              30     2
## 3: James      M  38       33              36     2
## 4: Billy      M  34       33              36     2

Expressions in by

You can use expressions in by to separate groups. Here we divide groups with gender and age great equal 30:

team[, .(count = .N, mean_age = mean(age)), by = .(old = age >= 30, gender)]

##      old gender count mean_age
## 1:  TRUE      F     1       31
## 2: FALSE      F     1       29
## 3:  TRUE      M     2       36

Exercise

Here is a data.table contains four pets’ name, species and weight:

library(data.table)
pets = data.table(name = c("Senor", "Cookie", "Lucky", "Meow Meow"),
                  species = c("dog", "dog", "bird", "cat"),
                  weight = c(26, 13, 0.5, 6))

Please:

Create a vector called pet_weight contains all pets’ weight
Create a unit vector senor_weight contains Senor’s weight
Create a data.table called dogs_weight that contains name and weight of all dogs in pets
Modify Cookie’s weight as 15
Create a data.table called pet_stat that contains species, avg_weight and count

Answer

Answers of Pet Exercise

Subset of data.table

.SD: Subset of Data.table. It’s a data table that holds the data for the current group defined using by.
.SDcols specifies the columns that returned in .SD.

Get last row of each gender:

team[, .SD[.N, ], by = gender]

##    gender  name age mean_age gender_mean_age count
## 1:      F Alice  29       33              30     2
## 2:      M Billy  34       33              36     2

Get last row of each gender:

team[, .SD[age == max(age), ], by = gender]

##    gender  name age mean_age gender_mean_age count
## 1:      F Wendy  31       33              30     2
## 2:      M James  38       33              36     2

Subset of data.table - 2

Do some computation on age, mean_age and gender_mean_age with .SDcols:

age_cols = c("age", "mean_age", "gender_mean_age")
team[, c(age_cols) := lapply(.SD, function(x) x + 2), .SDcols = age_cols]

It’s ok to use j inside .SD to do the same thing:

team[, c(age_cols) := lapply(.SD[, age_cols, with = FALSE], 
                             function(x) x +2)]
team

##     name gender age mean_age gender_mean_age count
## 1: Wendy      F  35       37              34     2
## 2: Alice      F  33       37              34     2
## 3: James      M  42       37              40     2
## 4: Billy      M  38       37              40     2

Chaining

DT[ … ][ … ][ … ]

Chaining of data.table can avoid unnecessary intermediate assignment.

age_comparator = function(dt) {
    ifelse(dt[gender == "M", mean_age] >=  dt[gender == "M", mean_age],
           "Males are older", "Females are older")
}
gender_mean_age = team[, .(mean_age = mean(age)), by = "gender"]
gender_mean_age[, comment := age_comparator(.SD)]
# Use Chaining
gender_mean_age2 = team[, .(mean_age = mean(age)), by = "gender",
                      ][, comment := age_comparator(.SD)]
gender_mean_age

##    gender mean_age         comment
## 1:      F       34 Males are older
## 2:      M       40 Males are older

identical(gender_mean_age, gender_mean_age2)

## [1] TRUE

Key

You can set key in a data.table to:

Add/update/delete a data.table with row name
Make your sorting within data.table faster

setkey(team, gender, name)  # set name as key
setkeyv(team, c("gender", "name"))   # set name as key using character
key(team)   # show keys in a data.table

## [1] "gender" "name"

team[.("F", "Wendy"),]  # now you can slicing with key

##     name gender age mean_age gender_mean_age count
## 1: Wendy      F  35       37              34     2

team[.("F"), ]  # prefix is ok

##     name gender age mean_age gender_mean_age count
## 1: Alice      F  33       37              34     2
## 2: Wendy      F  35       37              34     2

team[.("Wendy"),]  # keys have orders. This example does not work

##    name gender age mean_age gender_mean_age count
## 1:   NA  Wendy  NA       NA              NA    NA

Sort a data.table with order

order in data.table acts just like base::order, but faster.
- means decending.

Ordering with keys is faster:

team[order(gender, -name)]

##     name gender age mean_age gender_mean_age count
## 1: Wendy      F  35       37              34     2
## 2: Alice      F  33       37              34     2
## 3: James      M  42       37              40     2
## 4: Billy      M  38       37              40     2

You still can ordering without keys:

team[order(gender, -age)]

##     name gender age mean_age gender_mean_age count
## 1: Wendy      F  35       37              34     2
## 2: Alice      F  33       37              34     2
## 3: James      M  42       37              40     2
## 4: Billy      M  38       37              40     2

set* family

All set* functions change their input by reference.

setattr: Set attributes of data.table
setnames: Set column names of data.table
set: Repeatedly update rows and columns. Use for looping
setcolorder: Column reordering of a data.table

Set row names as the first letter of name:

attributes(team)$row.names

## [1] 1 2 3 4

setattr(team, "row.names", substr(team$name, 1, 1))
attributes(team)$row.names

## [1] "A" "W" "B" "J"

set* family - 2

Set column names as upper case:

setnames(team, "name", "Name")  # set name to Name
colnames(team)

## [1] "Name"            "gender"          "age"             "mean_age"       
## [5] "gender_mean_age" "count"

setnames(team, colnames(team), tolower(colnames(team))) # set all lower
colnames(team)

## [1] "name"            "gender"          "age"             "mean_age"       
## [5] "gender_mean_age" "count"

set* family - 3

Use set to update the name of first person:

set(team, 1L, "name", "Alison")
team

##      name gender age mean_age gender_mean_age count
## 1: Alison      F  33       37              34     2
## 2:  Wendy      F  35       37              34     2
## 3:  Billy      M  38       37              40     2
## 4:  James      M  42       37              40     2

Set genderas the first column:

new_order = c("gender", colnames(team)[!colnames(team) %in% "gender"])
setcolorder(team, new_order)
team

##    gender   name age mean_age gender_mean_age count
## 1:      F Alison  33       37              34     2
## 2:      F  Wendy  35       37              34     2
## 3:      M  Billy  38       37              40     2
## 4:      M  James  42       37              40     2

Copy

:= and set* modify the data.table by references. You can use copy to do a deep copy to avoid side effect (modify a object that doesn’t return by the function).

deepf = function(dt) {
    tt = copy(dt)
    tt[, name := substr(name, 1, 1)]
}
deepf(team)
team$name

## [1] "Alison" "Wendy"  "Billy"  "James"

shallowf = function(dt) {
    tt = dt
    tt[, name := substr(name, 1, 1)]
}
shallowf(team)
team$name

## [1] "A" "W" "B" "J"

Notice that we modify tt’s name inside shallowf but the data.table team is also changed.

Feel good with side effect

It is always a good idea letting others(or you in the future) know what objects will be modified if your function have side effect.

A well-organized document or comment
Return it when using function that have side effect. For example:

team = shallowf(team)

Benckmark: order

Note that in R 3.3.0, base::order has a new argument method to choose the sorting method and the radix sorting comes from data.table (Ref). Default sorting method of base::order is still shell sorting.

Display all data.table’s

Use tables to list all data.table’s in memory.

all_tables = tables(silent=TRUE, width = 60)
all_tables[, COLS := NULL]
all_tables  # KEY column disappear in slides for unknown reason

##                NAME NROW NCOL MB KEY
## 1:       from_cp950    2    2  1    
## 2:        from_utf8    2    2  1    
## 3:  gender_mean_age    2    3  1    
## 4: gender_mean_age2    2    3  1    
## 5:             pets    4    3  1    
## 6:             stat    1    2  1    
## 7:             team    4    6  1    
## 8:           teamdt    4    3  1

Case Study

Using hourly weather data from 2012-08-15 to 2015-08-14 to illustrate features of data.table.

Special symbols in this data:

T: trace amount of precipitation
x: instrument error
V: wind direction is uncertain

Read data

fread: read text file into data.table

weather = fread("data/weather.csv")

## Warning in fread("data/weather.csv"): Bumped column 1 to type character on
## data row 51314, field contains '觀測時間(LST)ObsTime'. Coercing previously
## read values in this column from logical, integer or numeric back to
## character which may not be lossless; e.g., if '00' and '000' occurred
## before they will now be just '0', and there may be inconsistencies with
## treatment of ',,' and ',NA,' too (if they occurred in this column before
## the bump). If this matters please rerun and set 'colClasses' to 'character'
## for this column. Please note that column type detection uses the first
## 5 rows, the middle 5 rows and the last 5 rows, so hopefully this message
## should be very rare. If reporting to datatable-help, please rerun and
## include the output from verbose=TRUE.

Read multiple data

rbindlist: Same as do.call(“rbind”, ), but faster.

filepaths = list.files("data/daily", full.names = TRUE)
locations = gsub("data\\/daily\\/|_....-..-..\\.csv", "", filepaths)
dates = gsub("data\\/daily\\/.*_|\\.csv", "", filepaths)
weather = list()
for (i in 1:length(filepaths)) {
    weather[[i]] = fread(filepaths[i])
    weather[[i]][, city := locations[i]]
    weather[[i]][, date := dates[i]]
}
weather = rbindlist(weather)

Write data

A fast data writer fwrite was introduced in data.table 1.9.7

quote( = “auto”): When quote = “auto”. All column names will only be surrounded by double quotes if the field contains the separator
sep( = “,”): The separator between columns.

# output a comma-separated file
fwrite(weather, "data/weather.csv")

# output a tab-separated file
fwrite(weather[1:24, ], "data/weather.txt", sep = "\t")

Benckmark: data I/O

Generate datetime

Here we use lubridate::ymd_h to generate datetime. The warning message is due to the missing data of Kaohsiung in 2015-06-24

library(lubridate, quietly = TRUE)
weather[, ObsTime2 := ObsTime]

# For datetime convertion
weather[ObsTime2 == "24", ObsTime2 := "0"] 

# Convert 
weather[,datetime := ymd_h(paste(date, ObsTime2), tz = "CST")]

## Warning: 2 failed to parse.

# Plus one day for ObsTime = "24"
weather[ObsTime == "24", datetime := datetime + days(1)]

# Check for rows that don't convert successful
weather[is.na(datetime)]

##                 ObsTime              StnPres                SeaPres
## 1:                                                                 
## 2: 觀測時間(LST)ObsTime 測站氣壓(hPa)StnPres 海平面氣壓(hPa)SeaPres
##           Temperature              Tddewpoint            RH          WS
## 1:                                                                     
## 2: 氣溫(℃)Temperature 露點溫度(℃)Td dew point 相對溼度(%)RH 風速(m/s)WS
##                             WD              WSGust
## 1:                                                
## 2: 風向(最多風向)(360degree)WD 最大陣風(m/s)WSGust
##                           WDGust           Precp             PrecpHour
## 1:                                                                    
## 2: 最大陣風風向(360degree)WDGust 降水量(mm)Precp 降水時數(hr)PrecpHour
##                SunShine                    GloblRad           Visb
## 1:                                                                
## 2: 日照時數(hr)SunShine 全天空日射量(MJ/㎡)GloblRad 能見度(km)Visb
##         city       date             ObsTime2 datetime
## 1: Kaohsiung 2015-06-24                          <NA>
## 2: Kaohsiung 2015-06-24 觀測時間(LST)ObsTime     <NA>

# Drop rows that don't convert successful
weather = weather[!is.na(datetime),]

str(weather)

## Classes 'data.table' and 'data.frame':   105096 obs. of  19 variables:
##  $ ObsTime    : chr  "1" "2" "3" "4" ...
##  $ StnPres    : chr  "1002.9 " "1001.5 " "1002.6 " "1002.3 " ...
##  $ SeaPres    : chr  "1005.1 " "1003.7 " "1004.8 " "1004.5 " ...
##  $ Temperature: chr  "28.6 " "28.3 " "28.3 " "28.1 " ...
##  $ Tddewpoint : chr  "21.1 " "22.1 " "21.6 " "21.7 " ...
##  $ RH         : chr  "64 " "69 " "67 " "68 " ...
##  $ WS         : chr  "1.3 " "3.2 " "2.5 " "1.1 " ...
##  $ WD         : chr  "340.0 " "20.0 " "330.0 " "330.0 " ...
##  $ WSGust     : chr  "2.7 " "5.8 " "4.6 " "4.2 " ...
##  $ WDGust     : chr  "310.0 " "40.0 " "20.0 " "320.0 " ...
##  $ Precp      : chr  "0.0 " "0.0 " "0.0 " "0.0 " ...
##  $ PrecpHour  : chr  "0.0 " "0.0 " "0.0 " "0.0 " ...
##  $ SunShine   : chr  " " " " " " " " ...
##  $ GloblRad   : chr  "0.00 " "0.00 " "0.00 " "0.00 " ...
##  $ Visb       : chr  " " "10.0 " " " " " ...
##  $ city       : chr  "Hualien" "Hualien" "Hualien" "Hualien" ...
##  $ date       : chr  "2012-08-15" "2012-08-15" "2012-08-15" "2012-08-15" ...
##  $ ObsTime2   : chr  "1" "2" "3" "4" ...
##  $ datetime   : POSIXct, format: "2012-08-15 01:00:00" "2012-08-15 02:00:00" ...
##  - attr(*, ".internal.selfref")=<externalptr>

Strip numeric columns

Remove special symbols in our data.

cnames = colnames(weather)
num_c = cnames[!cnames %in% c("city", "date", "datetime")]
weather[, c(num_c) := (lapply(.SD, function(x){gsub("[^0-9.TxV]", "", x)})), 
        .SDcols = num_c]

Deal with special symbols

You can update partial cells in a data.table.

# Check which numeric columns contain special symbols
weather[, lapply(.SD, function(x){sum(grepl("^[TxV]$", x), na.rm = FALSE)}), 
        .SDcol = num_c]

##    ObsTime StnPres SeaPres Temperature Tddewpoint RH WS  WD WSGust WDGust
## 1:       0       0       0           0          0  0  0 209      0      0
##    Precp PrecpHour SunShine GloblRad Visb ObsTime2
## 1:  3884         0        0        0    0        0

weather[WD == "V", WD := ""]
weather[Precp == "T", Precp := "0"]

More data manipulation

weather[, `:=`(month = substr(date, 6, 7), isRain = Precp > 0)]
weather[, c(num_c) := (lapply(.SD, as.numeric)), .SDcols = num_c]
weather[, date := as.Date(date)]

Hottest date in each city

Find hottest date in each city

hottest = weather[, .(datetime, Temperature, T.Max = max(Temperature)), 
                  keyby = city
                ][, count := .N
                ][Temperature == T.Max, 
                  head(.SD[, .(T.Max, datetime)], 3), # output top 3 if ties
                  keyby = city]

Get Complete dataset

CJ: Cross Join. Generate a data.table from the cross product of the vectors.

cj = CJ(weather$city, as.character(weather$datetime), unique = TRUE)[
    , V2 := ymd_hms(V2, tz = "CST")]
setnames(cj, names(cj), c("city", "datetime"))
weather = merge(cj, weather, all.x = TRUE, all.y = FALSE, by = c("city", "datetime"))
weather[, date := as.character(datetime)
      ][,`:=`(date = as.Date(substr(date, 1, 10)),
              year = substr(date, 1, 4),
              month = substr(date, 6, 7),
              hour = substr(date, 12, 13))]

Get Complete dataset - 2

weather[city == "Kaohsiung" & date == ymd("2015-06-24"), 
        .(city, date, Temperature, Tddewpoint)]

##          city       date Temperature Tddewpoint
##  1: Kaohsiung 2015-06-24        29.4       24.7
##  2: Kaohsiung 2015-06-24          NA         NA
##  3: Kaohsiung 2015-06-24          NA         NA
##  4: Kaohsiung 2015-06-24          NA         NA
##  5: Kaohsiung 2015-06-24          NA         NA
##  6: Kaohsiung 2015-06-24          NA         NA
##  7: Kaohsiung 2015-06-24          NA         NA
##  8: Kaohsiung 2015-06-24          NA         NA
##  9: Kaohsiung 2015-06-24          NA         NA
## 10: Kaohsiung 2015-06-24          NA         NA
## 11: Kaohsiung 2015-06-24          NA         NA
## 12: Kaohsiung 2015-06-24          NA         NA
## 13: Kaohsiung 2015-06-24          NA         NA
## 14: Kaohsiung 2015-06-24          NA         NA
## 15: Kaohsiung 2015-06-24          NA         NA
## 16: Kaohsiung 2015-06-24          NA         NA
## 17: Kaohsiung 2015-06-24          NA         NA
## 18: Kaohsiung 2015-06-24          NA         NA
## 19: Kaohsiung 2015-06-24          NA         NA
## 20: Kaohsiung 2015-06-24          NA         NA
## 21: Kaohsiung 2015-06-24          NA         NA
## 22: Kaohsiung 2015-06-24          NA         NA
## 23: Kaohsiung 2015-06-24          NA         NA
## 24: Kaohsiung 2015-06-24          NA         NA
##          city       date Temperature Tddewpoint

Interpolate NA with mean

We can subset row by a particular columns after assiggn it as a key with setkey.

inter_mean = weather[year == "2015" & month == "06" & city == "Kaohsiung",
                     .(meanT = mean(Temperature, na.rm = TRUE),
                       meanTd = mean(Tddewpoint, na.rm = TRUE)),
                     by = hour]
setkey(inter_mean, hour)
Tj = which(colnames(weather) == "Temperature")
Tdj = which(colnames(weather) == "Tddewpoint")
for (i in which(is.na(weather$Temperature))) {
    set(weather, i, Tj, inter_mean[weather[i, hour], meanT])
    set(weather, i, Tdj, inter_mean[weather[i, hour], meanTd])
}

Interpolate NA with mean - 2

weather[city == "Kaohsiung" & date == ymd("2015-06-24"), 
        .(city, date, Temperature, Tddewpoint)]

##          city       date Temperature Tddewpoint
##  1: Kaohsiung 2015-06-24    29.40000   24.70000
##  2: Kaohsiung 2015-06-24    29.40345   25.12069
##  3: Kaohsiung 2015-06-24    29.24138   24.95172
##  4: Kaohsiung 2015-06-24    28.99655   24.77586
##  5: Kaohsiung 2015-06-24    28.80690   24.67586
##  6: Kaohsiung 2015-06-24    28.57241   24.51379
##  7: Kaohsiung 2015-06-24    28.65517   24.47586
##  8: Kaohsiung 2015-06-24    29.74483   25.23103
##  9: Kaohsiung 2015-06-24    30.87586   25.63448
## 10: Kaohsiung 2015-06-24    31.35172   25.98276
## 11: Kaohsiung 2015-06-24    31.82414   26.42069
## 12: Kaohsiung 2015-06-24    32.22069   26.47586
## 13: Kaohsiung 2015-06-24    32.45172   26.61034
## 14: Kaohsiung 2015-06-24    32.60000   26.65172
## 15: Kaohsiung 2015-06-24    32.62759   26.67241
## 16: Kaohsiung 2015-06-24    32.14828   26.61379
## 17: Kaohsiung 2015-06-24    32.03448   26.45862
## 18: Kaohsiung 2015-06-24    31.53793   26.16897
## 19: Kaohsiung 2015-06-24    30.85517   25.68276
## 20: Kaohsiung 2015-06-24    30.33793   25.39310
## 21: Kaohsiung 2015-06-24    30.11034   25.28276
## 22: Kaohsiung 2015-06-24    29.98966   25.26207
## 23: Kaohsiung 2015-06-24    29.79310   25.23448
## 24: Kaohsiung 2015-06-24    29.74483   25.30345
##          city       date Temperature Tddewpoint

Regression coefficient

Now we would like to compute regression coefficient by each city. In this example, all values are concatenate into single column.

weather[, .(coef = coef(lm(Tddewpoint ~ Temperature))), by = city]

##         city       coef
## 1:   Hualien -2.7831813
## 2:   Hualien  0.9292042
## 3: Kaohsiung -5.8939668
## 4: Kaohsiung  1.0290631
## 5:  Taichung -1.2627112
## 6:  Taichung  0.8169720
## 7:    Taipei -2.0428652
## 8:    Taipei  0.8500687

Regression coefficient - 2

As long as j returns a list, each element of the list will become a column in the resulting data.table.

weather[, as.list(coef(lm(Tddewpoint ~ Temperature))), by = city]

##         city (Intercept) Temperature
## 1:   Hualien   -2.783181   0.9292042
## 2: Kaohsiung   -5.893967   1.0290631
## 3:  Taichung   -1.262711   0.8169720
## 4:    Taipei   -2.042865   0.8500687

melt.data.table

data.table implement melt in reshape2. Additionally, you can use regular expression with patterns to melt multiple columns.

melted = melt.data.table(weather, 
                         id.vars = c("city", "datetime"),
                         measure = patterns(".*Pres", "Precp"),
                         value.name = c("Pres", "Precp")
                         )
melted[c(1,2,.N-1,.N),]

##       city            datetime variable   Pres Precp
## 1: Hualien 2012-08-15 01:00:00        1 1002.9     0
## 2: Hualien 2012-08-15 02:00:00        1 1001.5     0
## 3:  Taipei 2015-08-14 23:00:00        2 1008.7     0
## 4:  Taipei 2015-08-15 00:00:00        2 1008.4     0

dcast.data.table

You can also use dcast with data.table. Furthermore, you can cast multiple value.vars simultaneously.

dcasted = dcast.data.table(melted, city + datetime ~ variable,
                           value.var = c("Pres", "Precp"))
dcasted[c(1,2,.N-1,.N),]

##       city            datetime Pres_1 Pres_2 Precp_1 Precp_2
## 1: Hualien 2012-08-15 01:00:00 1002.9 1005.1       0       0
## 2: Hualien 2012-08-15 02:00:00 1001.5 1003.7       0       0
## 3:  Taipei 2015-08-14 23:00:00 1005.2 1008.7       0       0
## 4:  Taipei 2015-08-15 00:00:00 1004.9 1008.4       0       0

References

Task Schedualing in R

Sometime you may need to execute your R script periodically and automatically:

Regular Report: Generate monthly/daily report automatically
Crawler: Get latest data every hour
Forecasting System: Rebuild your preditive model with latest data

Steps to Schedualing

Invoking a batch job of R
- RScript
- R CMD BATCH
Passing arguments to R
A schedualer:
- Windows: Task Schedualer
- Linux/Mac: Crontab

Open shell with RStudio

Environment variable of R and RScript will be exported in this way.

Invoke a batch job

Rscript
- A binary front-end to execute a R script
- Recommended
- Execute a R script: Just like other programming language, you can pass arguments to your R script

Rscript <PATH OF YOUR R SCRIPT> <ARG1> <ARG2> ...

R CMD BATCH
- Run R in batch mode
- I never use this one since I use Rscript

Shebang

If you use Linux or Mac, you can execute a R script directly with shebang and permission of execution.

#!/usr/bin/env Rscript

Add permission of execution:

chmod u+x schedualing/now.R

Check permission:

ls -l schedualing/now.R

## -rwxrw-rw- 1 mansun mansun 1168 May 14 02:35 schedualing/now.R

Passing arguments

Function:
- commandArgs: a function to access command line arguments
Package:
- argparser: Command-line argument parser written purely in R
- ArgParser: a command line argument parser for R from Kyle
- argparse: Analog to Python’s argparse
- optparse: Analog to Python’s optparse

In most situation, I will use a package to dealing with arguments parsing.

Exercise

Please download our github repository under github directory. Let’s try to pass some arguments to schedualing/now.R:

Get help:

Rscript schedualing/now.R -h

Pass a positional argument:

Rscript schedualing/now.R Mansun

Add a flag -c:

Rscript schedualing/now.R Mansun -c

Add a optional argument:

Rscript schedualing/now.R Mansun -c -g "How are you?"

now.R

#!/usr/bin/env Rscript

# Use argparser
library(argparser, quietly = TRUE)

# Create an arg.parser object.
p <- arg_parser("Hi! What time it is?")

# Add a positional argument
p <- add_argument(p, arg = "who", help = "Who are you")

# Add an optional argument
# Rscript will raise a warning message if you pass with -g
# but it doenn't matter
p <- add_argument(p, arg = "--greeting", short = "-g", 
                  default = "How's going?", type = "character",
                  help = "Greeting word")

# Add a flag, default value is FALSE
p <- add_argument(p, arg = "--chat", short = "-c", flag = TRUE,
                  help = "Whether or not to have a greeting")

# Parse commandArgs(trailingOnly = TRUE) into args
args <- parse_args(p)

print(args)
str(args)

# The original arguments
command_args <- commandArgs(trailingOnly = TRUE)
cat("args from command_args:")
print(command_args)

# Get system time
now <- as.character(Sys.time())

# Construct greeting string
greeting <- sprintf("Hi %s! It is %s.", args$who, now)

if (args$chat) {
    greeting <-  paste(greeting, args$greeting)
}

print(greeting)

# Write to a text file
writeLines(greeting, "now.txt")

Task Schedualer(Windows)

Launch task scheduler on Windows:

Press Windows Logo+R to run dialog box
Enter control schedtasks

Launch task schedualer within RStudio:

# Execute a system conmmand to launch task schedualer
system("control schedtasks")

Crontab(Linux/Mac)

Maintain crontab files to execute scheduled commands in Unix-like OS for individual users.

crontab [-u user] file  
crontab [ -u user ] [ -i ] { -e | -l | -r }  
    -e      (edit user's crontab)  
    -l      (list user's crontab)  
    -r      (delete user's crontab)  
    -i      (prompt before deleting user's crontab)

You may need to use some command line editor like vim when using crontab -e to edit your crontab. Type select-editor in terminal to choose your favorite editor

Your first crontab

Download our github repository under github directory
Make sure your library argparser are installed systemwide.
Open crontab.txt. Edit it as your full path and save it. Always change directory to your working directory to prevent unexpected error.

# Execute every minutes 
* * * * * cd /home/mansun/github/BeyondBasicR/schedualing; ./now.R Mansun

Open your terminal and enter following command

crontab schedualing/crontab.txt

Type crontab -l to check your crontab:

crontab -l

## # Execute every minutes 
## * * * * * cd /home/mansun/github/BeyondBasicR/schedualing; ./now.R Mansun

Configure your cron job

┌───────────── min (0 - 59) 
│ ┌────────────── hour (0 - 23)
│ │ ┌─────────────── day of month (1 - 31)
│ │ │ ┌──────────────── month (1 - 12)
│ │ │ │ ┌───────────────── day of week (0 - 6) (0 to 6 are Sunday to Saturday)
│ │ │ │ │
* * * * *  command to execute

Comments begin with a comment mark #
Separate items of a list with Comma(,).

# Execute at 00:00 and 12:00 everyday
0 0,12 * * *  command to execute

Define ranges with Hyphen(-)

# Execute at 06:00 every Monday to Friday
0 6 * * 1-5  command to execute

Please refer to Cron for further information.

Beyond Basic R

What is in this lecture

Outline

Keyboard shortcuts of RStudio

Package

What is inside a R package

How to find a package you want

Load package

Namespace

Location of libraries

libPaths

Change Privilege(Windows)

Change Privilege(Linux/Mac)

Where can you find a R package

Install from CRAN

Install package with RStudio

Exercise

Install a local files

Install from github

References

Data I/O

Useful data I/O in R

Quoting in R

About Encoding

Highlevel I/O for a text file

JSON

jsonlite

jsonlite - 2

Table

Table -2

Database

RSQLite

References

data.table

Aims of this lecture

What is data.table

Why data.table

Benchmarks

Installation

General Form

Your first data.table

data.table is a data.frame, too

Subset with i

Select with j

Degeneration

With

Assignment by reference

More calculation with j

Group by with by

.N

Expressions in by

Exercise

Answer

Subset of data.table

Subset of data.table - 2

Chaining

Key

Sort a data.table with order

set* family

set* family - 2

set* family - 3

Copy

Feel good with side effect

Benckmark: order

Display all data.table’s

Case Study

Read data

Read multiple data

Write data

Benckmark: data I/O

Generate datetime

Strip numeric columns

Deal with special symbols

More data manipulation

Hottest date in each city

Get Complete dataset

Get Complete dataset - 2

Interpolate NA with mean

Interpolate NA with mean - 2

Regression coefficient