Beyond Basic R

2016-05-16

Mansun Kuo

Beyond Basic R

2016-05-16

Mansun Kuo

What is in this lecture

Some additional knowledge and handy packages for using R as your data manipulation and web scraping tool

Outline

Keyboard shortcuts of RStudio

All RStudio keyboard shortcuts

Description Windows & Linux Mac
Show Keyboard Shortcut Reference Alt+Shift+K Option+Shift+K
Attempt completion / Indent Tab or Ctrl+Space Tab or Command+Space
Run current line/selection Ctrl+Enter Command+Enter
Comment/uncomment current line/selection Ctrl+Shift+C Command+Shift+C
Save active document Ctrl+S Command+S
Reindent lines Ctrl+I Command+I

Package

What is inside a R package

A typical R package may have following components:

How to find a package you want

Load package

library(argparser)
library("argparser")
require(mansun)
## Loading required package: mansun
## Warning in library(package, lib.loc = lib.loc, character.only = TRUE,
## logical.return = TRUE, : there is no package called 'mansun'

Namespace

Sometimes functions in different packages may have the same name.

lubridate::date("2016-05-14") # access without load whole package    
## [1] "2016-05-14"
library(lubridate, quietly = TRUE)
## 
## Attaching package: 'lubridate'
## The following object is masked from 'package:base':
## 
##     date
date("2016-05-14")  # convert date. base::date is masked
## [1] "2016-05-14"
base::date()    # return system data and time
## [1] "Mon May 16 10:29:24 2016"

Location of libraries

Sys.getenv(c("R_LIBS_USER", "R_HOME"))
##                           R_LIBS_USER 
## "~/R/x86_64-pc-linux-gnu-library/3.3" 
##                                R_HOME 
##                          "/usr/lib/R"

libPaths

Libraries in R are loaded in order:

.libPaths()
## [1] "/usr/local/lib/R/site-library" "/usr/lib/R/site-library"      
## [3] "/usr/lib/R/library"

You can also add an additional library path:

.libPaths(c("your/library/path", .libPaths()))

Or change the order of library search path:

lpath <- .libPaths()
.libPaths(lpath[c(length(lpath), 1:max(1, length(lpath)-1))])
.libPaths()
## [1] "/usr/lib/R/library"            "/usr/local/lib/R/site-library"
## [3] "/usr/lib/R/site-library"

Change Privilege(Windows)

Change Privilege(Linux/Mac)

Use chown and chmod to change owner and write permission of your install path.

chown mansun:mansun /usr/local/lib/R/site-library
chmod u+w /usr/local/lib/R/site-library

Where can you find a R package

Install from CRAN

Install multiple packages:

install.packages(c("argparser", "jsonlite"), 
    lib = "/usr/local/lib/R/site-library")

Install development version of data.table:

install.packages("data.table", type = "source",
    repos = "https://Rdatatable.github.io/data.table")

Install package with RStudio

Yun can even using RStudio to install a package into system library or site library.

Exercise

Try to install package argparser in system library or site library.

Install a local files

You can also install a local tarball:

install.packages("/tmp/Rtmp99mvkt/downloaded_packages/argparser_0.4.tar.gz", 
    lib = "/usr/local/lib/R/site-library", repo = NULL, type = "source")

Install from github

Using devtools::install_github can easily install any R package on GitHub:

To install devtools make sure you have:

Then:

install.packages("devtools")

Now you can install any package on github with single command:

devtools::install_github('Rdatatable/data.table')

Just take a look what happened when you execute above command.

References

Data I/O

Useful data I/O in R

Quoting in R

Beside standard characters, there are several escaped characters that have special meaning. For example:

A character with newline and tab:

greeting = "Hi!\n\tHow are you? \n\tI'm fine, thank you!"
cat(greeting)
## Hi!
##  How are you? 
##  I'm fine, thank you!

About Encoding

Highlevel I/O for a text file

writeLines(greeting, con = "dataio/greeting.txt")
readLines(con = "dataio/greeting.txt")
## [1] "Hi!"                    "\tHow are you? "       
## [3] "\tI'm fine, thank you!"

JSON

The most popular format to exchange unstructure data. For example:

{
  "teacher": "Mansun",
  "isFullTime": false,
  "hobby": ["Music", "Game"],
  "license": null,
  "lecture": [
    {
      "name": "BeyondRBasic",
      "date": "2016-05-14",
      "hour": 3
    },
    {
      "name": "RCrawler101-2",
      "date": "2016-01-09",
      "hour": 3
    }
  ]
}

jsonlite

Provide JSON parseing/generating utilities in R

library(jsonlite)
intro = fromJSON("dataio/intro.json", simplifyVector = TRUE)
str(intro)
## List of 5
##  $ teacher   : chr "Mansun"
##  $ isFullTime: logi FALSE
##  $ hobby     : chr [1:2] "Music" "Game"
##  $ license   : NULL
##  $ lecture   :'data.frame':  2 obs. of  3 variables:
##   ..$ name: chr [1:2] "BeyondRBasic" "RCrawler101-2"
##   ..$ date: chr [1:2] "2016-05-14" "2016-01-09"
##   ..$ hour: int [1:2] 3 3

jsonlite - 2

intro$license = NULL
toJSON(intro, pretty = TRUE)  # convert a list as a pretty JSON  
## {
##   "teacher": ["Mansun"],
##   "isFullTime": [false],
##   "hobby": ["Music", "Game"],
##   "lecture": [
##     {
##       "name": "BeyondRBasic",
##       "date": "2016-05-14",
##       "hour": 3
##     },
##     {
##       "name": "RCrawler101-2",
##       "date": "2016-01-09",
##       "hour": 3
##     }
##   ]
## }

Table

In most situation, you can use data.table::fwrite and data.table::fread to do such tasks much easier and faster.

Here is an example to read a file generated by excel on Windows(Traditional Chinese):

library(data.table)
## 
## Attaching package: 'data.table'
## The following objects are masked from 'package:lubridate':
## 
##     hour, mday, month, quarter, wday, week, yday, year
from_cp950 = fread("dataio/cp950.csv", encoding = "UTF-8")
from_cp950$name = iconv(from_cp950$name, from = "CP950", to = "UTF-8")
from_cp950
##      name age
## 1: 王先生  34
## 2: 白小姐  23

Table -2

Write the CP950 file into another file with UTF-8 encoding. Read the UTF-8 file again.

fwrite(from_cp950, "dataio/utf8.csv", sep = "\t") # default write as UTF-8
from_utf8 = fread("dataio/utf8.csv", encoding = "UTF-8")
from_utf8
##      name age
## 1: 王先生  34
## 2: 白小姐  23
identical(from_cp950, from_utf8)
## [1] TRUE

Database

A typical SQL query string:

SELECT <columns you want to retrieve>
FROM <table>
WHERE <comparison of columns>
GROUP BY <columns you want to compute the aggregation result>
ORDER BY <columns you want to order>
HAVING <conditions to filter the result of GROUP BY>

RSQLite

A light-weight database engine interface in R

References

data.table

Aims of this lecture

What is data.table

Why data.table

Benchmarks

Benchmarks : Grouping

Installation

Here we use data.table v1.9.7

Install development version of data.table:

install.packages("data.table", type = "source",
    repos = "https://Rdatatable.github.io/data.table")

or

devtools::install_github("Rdatatable/data.table")

Load package:

library(data.table)

General Form

DT[i, j, by]


Compare with SQL:

R: i j by
SQL: where select/update group by


Say it loud:

Take DT, subset rows using i, then calculate j grouped by by.

Your first data.table

You can create a data.table just like the way you create a data.frame. It don’t convert character to factor by default.

library(data.table)
team = data.table(name = c("Wendy", "Alice", "James", "Bill"),
                  gender = c("f", "f", "M", "M"),
                  age = c(26, 24, 28, 32))

or convert a data.frame into data.table:

teamdf = data.frame(name = c("Wendy", "Alice", "James", "Bill"),
                    gender = c("f", "f", "M", "M"),
                    age = c(26, 24, 28, 32),
                    stringsAsFactors = FALSE)
teamdt = data.table(teamdf)
identical(team, teamdt)
## [1] TRUE

data.table is a data.frame, too

class(team)
## [1] "data.table" "data.frame"

You can use any functions that suitable for data.frame with data.table. Note that the behavior of [ operator is slightly different than data.frame.

team$gender
## [1] "f" "f" "M" "M"
team[[2]]
## [1] "f" "f" "M" "M"
team[2]
##     name gender age
## 1: Alice      f  24




teamdf$gender
## [1] "f" "f" "M" "M"
teamdf[[2]]
## [1] "f" "f" "M" "M"
teamdf[2]
##   gender
## 1      f
## 2      f
## 3      M
## 4      M

Subset with i

Use i to subset a data.table is similar to data.frame. Furthermore, expression is evaluated within the frame of the data.table

team[1, ]
##     name gender age
## 1: Wendy      f  26
team[c(FALSE, FALSE, FALSE, TRUE), ]
##    name gender age
## 1: Bill      M  32
team[name == "Wendy", ]
##     name gender age
## 1: Wendy      f  26
team[name %in% c("Wendy", "Alice"), ]
##     name gender age
## 1: Wendy      f  26
## 2: Alice      f  24

Select with j

team[, .(gender)]
##    gender
## 1:      f
## 2:      f
## 3:      M
## 4:      M
team[1, .(sex = gender)]
##    sex
## 1:   f
team[1, .(name, gender)]
##     name gender
## 1: Wendy      f

Degeneration

If you don’ wrap a single expression with list in j, the output will degenerate into a vector.

team[, gender]
## [1] "f" "f" "M" "M"

If you want to know how old is Wendy:

team[name == "Wendy", age]
## [1] 26

With

Because j handle expressions, you need to set with = FALSE when slicing with logical, interger or character vector

team[1, c("gender")]    # not works
## [1] "gender"
team[1, c(TRUE, FALSE, TRUE), with = FALSE]
##     name age
## 1: Wendy  26
team[1, c(1, 3), with = FALSE]
##     name age
## 1: Wendy  26
team[1, c("name", "age"), with = FALSE]
##     name age
## 1: Wendy  26

Assignment by reference

:= add/remove/update subsets of column by reference.

team[, id := 1:4]   # add a column 
team[name == "James", age := 38]  # UPDATE age for james
team[name == "Bill", c("name", "age") := .("Billy", 34)] # update 2 columns 
team[, id := NULL]  # remove id
team[gender == "f", `:=`(gender = "F", age = age + 5)]
team
##     name gender age
## 1: Wendy      F  31
## 2: Alice      F  29
## 3: James      M  38
## 4: Billy      M  34

More calculation with j

Use a list of expression in j to compute with data.table

stat = team[, .(mean_age = mean(age), sd_age = sd(age))]
stat
##    mean_age  sd_age
## 1:       33 3.91578

Add a new column:

team[, mean_age := mean(age)]
team
##     name gender age mean_age
## 1: Wendy      F  31       33
## 2: Alice      F  29       33
## 3: James      M  38       33
## 4: Billy      M  34       33

Note that putting := inside .() doesn’t work:

stat = team[, .(mean_age := mean(age), sd_age := sd(age))] # not works

Group by with by

You can use by to summarize by each group. keyby will add key automatically.

Generate a data.table of mean and stand deviation by gender:

team[, .(mean_age = mean(age), sd_age = sd(age)), by = "gender"]
##    gender mean_age   sd_age
## 1:      F       30 1.414214
## 2:      M       36 2.828427

Add a new column of mean age by gender:

team[, gender_mean_age := mean(age), by = "gender"]
team
##     name gender age mean_age gender_mean_age
## 1: Wendy      F  31       33              30
## 2: Alice      F  29       33              30
## 3: James      M  38       33              36
## 4: Billy      M  34       33              36

.N

team[.N, ] # subset last column
##     name gender age mean_age gender_mean_age
## 1: Billy      M  34       33              36
team[, .(count = .N), by = gender]  # a new data.table, count by gender
##    gender count
## 1:      F     2
## 2:      M     2
team[, count := .N, by = gender]    # add a column in data.table
team
##     name gender age mean_age gender_mean_age count
## 1: Wendy      F  31       33              30     2
## 2: Alice      F  29       33              30     2
## 3: James      M  38       33              36     2
## 4: Billy      M  34       33              36     2

Expressions in by

You can use expressions in by to separate groups. Here we divide groups with gender and age great equal 30:

team[, .(count = .N, mean_age = mean(age)), by = .(old = age >= 30, gender)]
##      old gender count mean_age
## 1:  TRUE      F     1       31
## 2: FALSE      F     1       29
## 3:  TRUE      M     2       36

Exercise

Here is a data.table contains four pets’ name, species and weight:

library(data.table)
pets = data.table(name = c("Senor", "Cookie", "Lucky", "Meow Meow"),
                  species = c("dog", "dog", "bird", "cat"),
                  weight = c(26, 13, 0.5, 6))

Please:

  1. Create a vector called pet_weight contains all pets’ weight
  2. Create a unit vector senor_weight contains Senor’s weight
  3. Create a data.table called dogs_weight that contains name and weight of all dogs in pets
  4. Modify Cookie’s weight as 15
  5. Create a data.table called pet_stat that contains species, avg_weight and count

Answer

Answers of Pet Exercise

Subset of data.table

Get last row of each gender:

team[, .SD[.N, ], by = gender]     
##    gender  name age mean_age gender_mean_age count
## 1:      F Alice  29       33              30     2
## 2:      M Billy  34       33              36     2

Get last row of each gender:

team[, .SD[age == max(age), ], by = gender]     
##    gender  name age mean_age gender_mean_age count
## 1:      F Wendy  31       33              30     2
## 2:      M James  38       33              36     2

Subset of data.table - 2

Do some computation on age, mean_age and gender_mean_age with .SDcols:

age_cols = c("age", "mean_age", "gender_mean_age")
team[, c(age_cols) := lapply(.SD, function(x) x + 2), .SDcols = age_cols]

It’s ok to use j inside .SD to do the same thing:

team[, c(age_cols) := lapply(.SD[, age_cols, with = FALSE], 
                             function(x) x +2)]
team
##     name gender age mean_age gender_mean_age count
## 1: Wendy      F  35       37              34     2
## 2: Alice      F  33       37              34     2
## 3: James      M  42       37              40     2
## 4: Billy      M  38       37              40     2

Chaining

DT[ … ][ … ][ … ]

Chaining of data.table can avoid unnecessary intermediate assignment.

age_comparator = function(dt) {
    ifelse(dt[gender == "M", mean_age] >=  dt[gender == "M", mean_age],
           "Males are older", "Females are older")
}
gender_mean_age = team[, .(mean_age = mean(age)), by = "gender"]
gender_mean_age[, comment := age_comparator(.SD)]
# Use Chaining
gender_mean_age2 = team[, .(mean_age = mean(age)), by = "gender",
                      ][, comment := age_comparator(.SD)]
gender_mean_age
##    gender mean_age         comment
## 1:      F       34 Males are older
## 2:      M       40 Males are older
identical(gender_mean_age, gender_mean_age2)
## [1] TRUE

Key

You can set key in a data.table to:

setkey(team, gender, name)  # set name as key
setkeyv(team, c("gender", "name"))   # set name as key using character
key(team)   # show keys in a data.table
## [1] "gender" "name"
team[.("F", "Wendy"),]  # now you can slicing with key
##     name gender age mean_age gender_mean_age count
## 1: Wendy      F  35       37              34     2
team[.("F"), ]  # prefix is ok
##     name gender age mean_age gender_mean_age count
## 1: Alice      F  33       37              34     2
## 2: Wendy      F  35       37              34     2
team[.("Wendy"),]  # keys have orders. This example does not work
##    name gender age mean_age gender_mean_age count
## 1:   NA  Wendy  NA       NA              NA    NA

Sort a data.table with order

Ordering with keys is faster:

team[order(gender, -name)]
##     name gender age mean_age gender_mean_age count
## 1: Wendy      F  35       37              34     2
## 2: Alice      F  33       37              34     2
## 3: James      M  42       37              40     2
## 4: Billy      M  38       37              40     2

You still can ordering without keys:

team[order(gender, -age)]
##     name gender age mean_age gender_mean_age count
## 1: Wendy      F  35       37              34     2
## 2: Alice      F  33       37              34     2
## 3: James      M  42       37              40     2
## 4: Billy      M  38       37              40     2

set* family

All set* functions change their input by reference.

Set row names as the first letter of name:

attributes(team)$row.names
## [1] 1 2 3 4
setattr(team, "row.names", substr(team$name, 1, 1))
attributes(team)$row.names
## [1] "A" "W" "B" "J"

set* family - 2

Set column names as upper case:

setnames(team, "name", "Name")  # set name to Name
colnames(team)
## [1] "Name"            "gender"          "age"             "mean_age"       
## [5] "gender_mean_age" "count"
setnames(team, colnames(team), tolower(colnames(team))) # set all lower
colnames(team)
## [1] "name"            "gender"          "age"             "mean_age"       
## [5] "gender_mean_age" "count"

set* family - 3

Use set to update the name of first person:

set(team, 1L, "name", "Alison")
team
##      name gender age mean_age gender_mean_age count
## 1: Alison      F  33       37              34     2
## 2:  Wendy      F  35       37              34     2
## 3:  Billy      M  38       37              40     2
## 4:  James      M  42       37              40     2

Set genderas the first column:

new_order = c("gender", colnames(team)[!colnames(team) %in% "gender"])
setcolorder(team, new_order)
team
##    gender   name age mean_age gender_mean_age count
## 1:      F Alison  33       37              34     2
## 2:      F  Wendy  35       37              34     2
## 3:      M  Billy  38       37              40     2
## 4:      M  James  42       37              40     2

Copy

:= and set* modify the data.table by references. You can use copy to do a deep copy to avoid side effect (modify a object that doesn’t return by the function).

deepf = function(dt) {
    tt = copy(dt)
    tt[, name := substr(name, 1, 1)]
}
deepf(team)
team$name
## [1] "Alison" "Wendy"  "Billy"  "James"
shallowf = function(dt) {
    tt = dt
    tt[, name := substr(name, 1, 1)]
}
shallowf(team)
team$name
## [1] "A" "W" "B" "J"

Notice that we modify tt’s name inside shallowf but the data.table team is also changed.

Feel good with side effect

It is always a good idea letting others(or you in the future) know what objects will be modified if your function have side effect.

team = shallowf(team)

Benckmark: order

Benckmark: order

Note that in R 3.3.0, base::order has a new argument method to choose the sorting method and the radix sorting comes from data.table (Ref). Default sorting method of base::order is still shell sorting.

Display all data.table’s

Use tables to list all data.table’s in memory.

all_tables = tables(silent=TRUE, width = 60)
all_tables[, COLS := NULL]
all_tables  # KEY column disappear in slides for unknown reason
##                NAME NROW NCOL MB KEY
## 1:       from_cp950    2    2  1    
## 2:        from_utf8    2    2  1    
## 3:  gender_mean_age    2    3  1    
## 4: gender_mean_age2    2    3  1    
## 5:             pets    4    3  1    
## 6:             stat    1    2  1    
## 7:             team    4    6  1    
## 8:           teamdt    4    3  1

Case Study

Using hourly weather data from 2012-08-15 to 2015-08-14 to illustrate features of data.table.

Special symbols in this data:

  • T: trace amount of precipitation
  • x: instrument error
  • V: wind direction is uncertain

Read data

weather = fread("data/weather.csv")
## Warning in fread("data/weather.csv"): Bumped column 1 to type character on
## data row 51314, field contains '觀測時間(LST)ObsTime'. Coercing previously
## read values in this column from logical, integer or numeric back to
## character which may not be lossless; e.g., if '00' and '000' occurred
## before they will now be just '0', and there may be inconsistencies with
## treatment of ',,' and ',NA,' too (if they occurred in this column before
## the bump). If this matters please rerun and set 'colClasses' to 'character'
## for this column. Please note that column type detection uses the first
## 5 rows, the middle 5 rows and the last 5 rows, so hopefully this message
## should be very rare. If reporting to datatable-help, please rerun and
## include the output from verbose=TRUE.

Read multiple data

filepaths = list.files("data/daily", full.names = TRUE)
locations = gsub("data\\/daily\\/|_....-..-..\\.csv", "", filepaths)
dates = gsub("data\\/daily\\/.*_|\\.csv", "", filepaths)
weather = list()
for (i in 1:length(filepaths)) {
    weather[[i]] = fread(filepaths[i])
    weather[[i]][, city := locations[i]]
    weather[[i]][, date := dates[i]]
}
weather = rbindlist(weather)

Write data

A fast data writer fwrite was introduced in data.table 1.9.7

# output a comma-separated file
fwrite(weather, "data/weather.csv")

# output a tab-separated file
fwrite(weather[1:24, ], "data/weather.txt", sep = "\t")

Benckmark: data I/O

Benckmark: data I/O

Generate datetime

Here we use lubridate::ymd_h to generate datetime. The warning message is due to the missing data of Kaohsiung in 2015-06-24

library(lubridate, quietly = TRUE)
weather[, ObsTime2 := ObsTime]

# For datetime convertion
weather[ObsTime2 == "24", ObsTime2 := "0"] 

# Convert 
weather[,datetime := ymd_h(paste(date, ObsTime2), tz = "CST")]
## Warning: 2 failed to parse.
# Plus one day for ObsTime = "24"
weather[ObsTime == "24", datetime := datetime + days(1)]

# Check for rows that don't convert successful
weather[is.na(datetime)]
##                 ObsTime              StnPres                SeaPres
## 1:                                                                 
## 2: 觀測時間(LST)ObsTime 測站氣壓(hPa)StnPres 海平面氣壓(hPa)SeaPres
##           Temperature              Tddewpoint            RH          WS
## 1:                                                                     
## 2: 氣溫(℃)Temperature 露點溫度(℃)Td dew point 相對溼度(%)RH 風速(m/s)WS
##                             WD              WSGust
## 1:                                                
## 2: 風向(最多風向)(360degree)WD 最大陣風(m/s)WSGust
##                           WDGust           Precp             PrecpHour
## 1:                                                                    
## 2: 最大陣風風向(360degree)WDGust 降水量(mm)Precp 降水時數(hr)PrecpHour
##                SunShine                    GloblRad           Visb
## 1:                                                                
## 2: 日照時數(hr)SunShine 全天空日射量(MJ/㎡)GloblRad 能見度(km)Visb
##         city       date             ObsTime2 datetime
## 1: Kaohsiung 2015-06-24                          <NA>
## 2: Kaohsiung 2015-06-24 觀測時間(LST)ObsTime     <NA>
# Drop rows that don't convert successful
weather = weather[!is.na(datetime),]

str(weather)
## Classes 'data.table' and 'data.frame':   105096 obs. of  19 variables:
##  $ ObsTime    : chr  "1" "2" "3" "4" ...
##  $ StnPres    : chr  "1002.9 " "1001.5 " "1002.6 " "1002.3 " ...
##  $ SeaPres    : chr  "1005.1 " "1003.7 " "1004.8 " "1004.5 " ...
##  $ Temperature: chr  "28.6 " "28.3 " "28.3 " "28.1 " ...
##  $ Tddewpoint : chr  "21.1 " "22.1 " "21.6 " "21.7 " ...
##  $ RH         : chr  "64 " "69 " "67 " "68 " ...
##  $ WS         : chr  "1.3 " "3.2 " "2.5 " "1.1 " ...
##  $ WD         : chr  "340.0 " "20.0 " "330.0 " "330.0 " ...
##  $ WSGust     : chr  "2.7 " "5.8 " "4.6 " "4.2 " ...
##  $ WDGust     : chr  "310.0 " "40.0 " "20.0 " "320.0 " ...
##  $ Precp      : chr  "0.0 " "0.0 " "0.0 " "0.0 " ...
##  $ PrecpHour  : chr  "0.0 " "0.0 " "0.0 " "0.0 " ...
##  $ SunShine   : chr  " " " " " " " " ...
##  $ GloblRad   : chr  "0.00 " "0.00 " "0.00 " "0.00 " ...
##  $ Visb       : chr  " " "10.0 " " " " " ...
##  $ city       : chr  "Hualien" "Hualien" "Hualien" "Hualien" ...
##  $ date       : chr  "2012-08-15" "2012-08-15" "2012-08-15" "2012-08-15" ...
##  $ ObsTime2   : chr  "1" "2" "3" "4" ...
##  $ datetime   : POSIXct, format: "2012-08-15 01:00:00" "2012-08-15 02:00:00" ...
##  - attr(*, ".internal.selfref")=<externalptr>

Strip numeric columns

Remove special symbols in our data.

cnames = colnames(weather)
num_c = cnames[!cnames %in% c("city", "date", "datetime")]
weather[, c(num_c) := (lapply(.SD, function(x){gsub("[^0-9.TxV]", "", x)})), 
        .SDcols = num_c]

Deal with special symbols

You can update partial cells in a data.table.

# Check which numeric columns contain special symbols
weather[, lapply(.SD, function(x){sum(grepl("^[TxV]$", x), na.rm = FALSE)}), 
        .SDcol = num_c]
##    ObsTime StnPres SeaPres Temperature Tddewpoint RH WS  WD WSGust WDGust
## 1:       0       0       0           0          0  0  0 209      0      0
##    Precp PrecpHour SunShine GloblRad Visb ObsTime2
## 1:  3884         0        0        0    0        0
weather[WD == "V", WD := ""]
weather[Precp == "T", Precp := "0"]

More data manipulation

weather[, `:=`(month = substr(date, 6, 7), isRain = Precp > 0)]
weather[, c(num_c) := (lapply(.SD, as.numeric)), .SDcols = num_c]
weather[, date := as.Date(date)]

Hottest date in each city

Find hottest date in each city

hottest = weather[, .(datetime, Temperature, T.Max = max(Temperature)), 
                  keyby = city
                ][, count := .N
                ][Temperature == T.Max, 
                  head(.SD[, .(T.Max, datetime)], 3), # output top 3 if ties
                  keyby = city]

Get Complete dataset

cj = CJ(weather$city, as.character(weather$datetime), unique = TRUE)[
    , V2 := ymd_hms(V2, tz = "CST")]
setnames(cj, names(cj), c("city", "datetime"))
weather = merge(cj, weather, all.x = TRUE, all.y = FALSE, by = c("city", "datetime"))
weather[, date := as.character(datetime)
      ][,`:=`(date = as.Date(substr(date, 1, 10)),
              year = substr(date, 1, 4),
              month = substr(date, 6, 7),
              hour = substr(date, 12, 13))]

Get Complete dataset - 2

weather[city == "Kaohsiung" & date == ymd("2015-06-24"), 
        .(city, date, Temperature, Tddewpoint)]
##          city       date Temperature Tddewpoint
##  1: Kaohsiung 2015-06-24        29.4       24.7
##  2: Kaohsiung 2015-06-24          NA         NA
##  3: Kaohsiung 2015-06-24          NA         NA
##  4: Kaohsiung 2015-06-24          NA         NA
##  5: Kaohsiung 2015-06-24          NA         NA
##  6: Kaohsiung 2015-06-24          NA         NA
##  7: Kaohsiung 2015-06-24          NA         NA
##  8: Kaohsiung 2015-06-24          NA         NA
##  9: Kaohsiung 2015-06-24          NA         NA
## 10: Kaohsiung 2015-06-24          NA         NA
## 11: Kaohsiung 2015-06-24          NA         NA
## 12: Kaohsiung 2015-06-24          NA         NA
## 13: Kaohsiung 2015-06-24          NA         NA
## 14: Kaohsiung 2015-06-24          NA         NA
## 15: Kaohsiung 2015-06-24          NA         NA
## 16: Kaohsiung 2015-06-24          NA         NA
## 17: Kaohsiung 2015-06-24          NA         NA
## 18: Kaohsiung 2015-06-24          NA         NA
## 19: Kaohsiung 2015-06-24          NA         NA
## 20: Kaohsiung 2015-06-24          NA         NA
## 21: Kaohsiung 2015-06-24          NA         NA
## 22: Kaohsiung 2015-06-24          NA         NA
## 23: Kaohsiung 2015-06-24          NA         NA
## 24: Kaohsiung 2015-06-24          NA         NA
##          city       date Temperature Tddewpoint

Interpolate NA with mean

We can subset row by a particular columns after assiggn it as a key with setkey.

inter_mean = weather[year == "2015" & month == "06" & city == "Kaohsiung",
                     .(meanT = mean(Temperature, na.rm = TRUE),
                       meanTd = mean(Tddewpoint, na.rm = TRUE)),
                     by = hour]
setkey(inter_mean, hour)
Tj = which(colnames(weather) == "Temperature")
Tdj = which(colnames(weather) == "Tddewpoint")
for (i in which(is.na(weather$Temperature))) {
    set(weather, i, Tj, inter_mean[weather[i, hour], meanT])
    set(weather, i, Tdj, inter_mean[weather[i, hour], meanTd])
}

Interpolate NA with mean - 2

weather[city == "Kaohsiung" & date == ymd("2015-06-24"), 
        .(city, date, Temperature, Tddewpoint)]
##          city       date Temperature Tddewpoint
##  1: Kaohsiung 2015-06-24    29.40000   24.70000
##  2: Kaohsiung 2015-06-24    29.40345   25.12069
##  3: Kaohsiung 2015-06-24    29.24138   24.95172
##  4: Kaohsiung 2015-06-24    28.99655   24.77586
##  5: Kaohsiung 2015-06-24    28.80690   24.67586
##  6: Kaohsiung 2015-06-24    28.57241   24.51379
##  7: Kaohsiung 2015-06-24    28.65517   24.47586
##  8: Kaohsiung 2015-06-24    29.74483   25.23103
##  9: Kaohsiung 2015-06-24    30.87586   25.63448
## 10: Kaohsiung 2015-06-24    31.35172   25.98276
## 11: Kaohsiung 2015-06-24    31.82414   26.42069
## 12: Kaohsiung 2015-06-24    32.22069   26.47586
## 13: Kaohsiung 2015-06-24    32.45172   26.61034
## 14: Kaohsiung 2015-06-24    32.60000   26.65172
## 15: Kaohsiung 2015-06-24    32.62759   26.67241
## 16: Kaohsiung 2015-06-24    32.14828   26.61379
## 17: Kaohsiung 2015-06-24    32.03448   26.45862
## 18: Kaohsiung 2015-06-24    31.53793   26.16897
## 19: Kaohsiung 2015-06-24    30.85517   25.68276
## 20: Kaohsiung 2015-06-24    30.33793   25.39310
## 21: Kaohsiung 2015-06-24    30.11034   25.28276
## 22: Kaohsiung 2015-06-24    29.98966   25.26207
## 23: Kaohsiung 2015-06-24    29.79310   25.23448
## 24: Kaohsiung 2015-06-24    29.74483   25.30345
##          city       date Temperature Tddewpoint

Regression coefficient

Now we would like to compute regression coefficient by each city. In this example, all values are concatenate into single column.

weather[, .(coef = coef(lm(Tddewpoint ~ Temperature))), by = city]
##         city       coef
## 1:   Hualien -2.7831813
## 2:   Hualien  0.9292042
## 3: Kaohsiung -5.8939668
## 4: Kaohsiung  1.0290631
## 5:  Taichung -1.2627112
## 6:  Taichung  0.8169720
## 7:    Taipei -2.0428652
## 8:    Taipei  0.8500687

Regression coefficient - 2

As long as j returns a list, each element of the list will become a column in the resulting data.table.

weather[, as.list(coef(lm(Tddewpoint ~ Temperature))), by = city]
##         city (Intercept) Temperature
## 1:   Hualien   -2.783181   0.9292042
## 2: Kaohsiung   -5.893967   1.0290631
## 3:  Taichung   -1.262711   0.8169720
## 4:    Taipei   -2.042865   0.8500687

melt.data.table

data.table implement melt in reshape2. Additionally, you can use regular expression with patterns to melt multiple columns.

melted = melt.data.table(weather, 
                         id.vars = c("city", "datetime"),
                         measure = patterns(".*Pres", "Precp"),
                         value.name = c("Pres", "Precp")
                         )
melted[c(1,2,.N-1,.N),]
##       city            datetime variable   Pres Precp
## 1: Hualien 2012-08-15 01:00:00        1 1002.9     0
## 2: Hualien 2012-08-15 02:00:00        1 1001.5     0
## 3:  Taipei 2015-08-14 23:00:00        2 1008.7     0
## 4:  Taipei 2015-08-15 00:00:00        2 1008.4     0

dcast.data.table

You can also use dcast with data.table. Furthermore, you can cast multiple value.vars simultaneously.

dcasted = dcast.data.table(melted, city + datetime ~ variable,
                           value.var = c("Pres", "Precp"))
dcasted[c(1,2,.N-1,.N),]
##       city            datetime Pres_1 Pres_2 Precp_1 Precp_2
## 1: Hualien 2012-08-15 01:00:00 1002.9 1005.1       0       0
## 2: Hualien 2012-08-15 02:00:00 1001.5 1003.7       0       0
## 3:  Taipei 2015-08-14 23:00:00 1005.2 1008.7       0       0
## 4:  Taipei 2015-08-15 00:00:00 1004.9 1008.4       0       0

References

Schedualing

Task Schedualing in R

Sometime you may need to execute your R script periodically and automatically:

Steps to Schedualing

  1. Invoking a batch job of R
    • RScript
    • R CMD BATCH
  2. Passing arguments to R
  3. A schedualer:
    • Windows: Task Schedualer
    • Linux/Mac: Crontab

Open shell with RStudio

Environment variable of R and RScript will be exported in this way.

Invoke a batch job

Rscript <PATH OF YOUR R SCRIPT> <ARG1> <ARG2> ...

Shebang

If you use Linux or Mac, you can execute a R script directly with shebang and permission of execution.

#!/usr/bin/env Rscript

Add permission of execution:

chmod u+x schedualing/now.R

Check permission:

ls -l schedualing/now.R
## -rwxrw-rw- 1 mansun mansun 1168 May 14 02:35 schedualing/now.R

Passing arguments

In most situation, I will use a package to dealing with arguments parsing.

Exercise

Please download our github repository under github directory. Let’s try to pass some arguments to schedualing/now.R:

Get help:

Rscript schedualing/now.R -h

Pass a positional argument:

Rscript schedualing/now.R Mansun

Add a flag -c:

Rscript schedualing/now.R Mansun -c

Add a optional argument:

Rscript schedualing/now.R Mansun -c -g "How are you?"

now.R

#!/usr/bin/env Rscript

# Use argparser
library(argparser, quietly = TRUE)

# Create an arg.parser object.
p <- arg_parser("Hi! What time it is?")

# Add a positional argument
p <- add_argument(p, arg = "who", help = "Who are you")

# Add an optional argument
# Rscript will raise a warning message if you pass with -g
# but it doenn't matter
p <- add_argument(p, arg = "--greeting", short = "-g", 
                  default = "How's going?", type = "character",
                  help = "Greeting word")

# Add a flag, default value is FALSE
p <- add_argument(p, arg = "--chat", short = "-c", flag = TRUE,
                  help = "Whether or not to have a greeting")

# Parse commandArgs(trailingOnly = TRUE) into args
args <- parse_args(p)

print(args)
str(args)

# The original arguments
command_args <- commandArgs(trailingOnly = TRUE)
cat("args from command_args:")
print(command_args)

# Get system time
now <- as.character(Sys.time())

# Construct greeting string
greeting <- sprintf("Hi %s! It is %s.", args$who, now)

if (args$chat) {
    greeting <-  paste(greeting, args$greeting)
}

print(greeting)

# Write to a text file
writeLines(greeting, "now.txt")

Task Schedualer(Windows)

Launch task scheduler on Windows:

  1. Press Windows Logo+R to run dialog box
  2. Enter control schedtasks

Launch task schedualer within RStudio:

# Execute a system conmmand to launch task schedualer
system("control schedtasks")

Set trigger

Set job

Crontab(Linux/Mac)

Maintain crontab files to execute scheduled commands in Unix-like OS for individual users.

crontab [-u user] file  
crontab [ -u user ] [ -i ] { -e | -l | -r }  
    -e      (edit user's crontab)  
    -l      (list user's crontab)  
    -r      (delete user's crontab)  
    -i      (prompt before deleting user's crontab)  

You may need to use some command line editor like vim when using crontab -e to edit your crontab. Type select-editor in terminal to choose your favorite editor

Your first crontab

# Execute every minutes 
* * * * * cd /home/mansun/github/BeyondBasicR/schedualing; ./now.R Mansun
crontab schedualing/crontab.txt
crontab -l
## # Execute every minutes 
## * * * * * cd /home/mansun/github/BeyondBasicR/schedualing; ./now.R Mansun

Configure your cron job

┌───────────── min (0 - 59) 
│ ┌────────────── hour (0 - 23)
│ │ ┌─────────────── day of month (1 - 31)
│ │ │ ┌──────────────── month (1 - 12)
│ │ │ │ ┌───────────────── day of week (0 - 6) (0 to 6 are Sunday to Saturday)
│ │ │ │ │
* * * * *  command to execute
# Execute at 00:00 and 12:00 everyday
0 0,12 * * *  command to execute
# Execute at 06:00 every Monday to Friday
0 6 * * 1-5  command to execute

Please refer to Cron for further information.

Exercise

Set a schedualing job on your favorite OS.

References