2016-05-16
2016-05-16
Some additional knowledge and handy packages for using R as your data manipulation and web scraping tool
All RStudio keyboard shortcuts
Description | Windows & Linux | Mac |
---|---|---|
Show Keyboard Shortcut Reference | Alt+Shift+K | Option+Shift+K |
Attempt completion / Indent | Tab or Ctrl+Space | Tab or Command+Space |
Run current line/selection | Ctrl+Enter | Command+Enter |
Comment/uncomment current line/selection | Ctrl+Shift+C | Command+Shift+C |
Save active document | Ctrl+S | Command+S |
Reindent lines | Ctrl+I | Command+I |
A typical R package may have following components:
library(argparser)
library("argparser")
require(mansun)
## Loading required package: mansun
## Warning in library(package, lib.loc = lib.loc, character.only = TRUE,
## logical.return = TRUE, : there is no package called 'mansun'
Sometimes functions in different packages may have the same name.
lubridate::date("2016-05-14") # access without load whole package
## [1] "2016-05-14"
library(lubridate, quietly = TRUE)
##
## Attaching package: 'lubridate'
## The following object is masked from 'package:base':
##
## date
date("2016-05-14") # convert date. base::date is masked
## [1] "2016-05-14"
base::date() # return system data and time
## [1] "Mon May 16 10:29:24 2016"
Sys.getenv(c("R_LIBS_USER", "R_HOME"))
## R_LIBS_USER
## "~/R/x86_64-pc-linux-gnu-library/3.3"
## R_HOME
## "/usr/lib/R"
Libraries in R are loaded in order:
.libPaths()
## [1] "/usr/local/lib/R/site-library" "/usr/lib/R/site-library"
## [3] "/usr/lib/R/library"
You can also add an additional library path:
.libPaths(c("your/library/path", .libPaths()))
Or change the order of library search path:
lpath <- .libPaths()
.libPaths(lpath[c(length(lpath), 1:max(1, length(lpath)-1))])
.libPaths()
## [1] "/usr/lib/R/library" "/usr/local/lib/R/site-library"
## [3] "/usr/lib/R/site-library"
Use chown and chmod to change owner and write permission of your install path.
chown mansun:mansun /usr/local/lib/R/site-library
chmod u+w /usr/local/lib/R/site-library
Install multiple packages:
install.packages(c("argparser", "jsonlite"),
lib = "/usr/local/lib/R/site-library")
Install development version of data.table:
install.packages("data.table", type = "source",
repos = "https://Rdatatable.github.io/data.table")
Yun can even using RStudio to install a package into system library or site library.
Try to install package argparser in system library or site library.
You can also install a local tarball:
install.packages("/tmp/Rtmp99mvkt/downloaded_packages/argparser_0.4.tar.gz",
lib = "/usr/local/lib/R/site-library", repo = NULL, type = "source")
Using devtools::install_github can easily install any R package on GitHub:
To install devtools make sure you have:
Then:
install.packages("devtools")
Now you can install any package on github with single command:
devtools::install_github('Rdatatable/data.table')
Just take a look what happened when you execute above command.
Beside standard characters, there are several escaped characters that have special meaning. For example:
A character with newline and tab:
greeting = "Hi!\n\tHow are you? \n\tI'm fine, thank you!"
cat(greeting)
## Hi!
## How are you?
## I'm fine, thank you!
writeLines(greeting, con = "dataio/greeting.txt")
readLines(con = "dataio/greeting.txt")
## [1] "Hi!" "\tHow are you? "
## [3] "\tI'm fine, thank you!"
The most popular format to exchange unstructure data. For example:
{
"teacher": "Mansun",
"isFullTime": false,
"hobby": ["Music", "Game"],
"license": null,
"lecture": [
{
"name": "BeyondRBasic",
"date": "2016-05-14",
"hour": 3
},
{
"name": "RCrawler101-2",
"date": "2016-01-09",
"hour": 3
}
]
}
Provide JSON parseing/generating utilities in R
library(jsonlite)
intro = fromJSON("dataio/intro.json", simplifyVector = TRUE)
str(intro)
## List of 5
## $ teacher : chr "Mansun"
## $ isFullTime: logi FALSE
## $ hobby : chr [1:2] "Music" "Game"
## $ license : NULL
## $ lecture :'data.frame': 2 obs. of 3 variables:
## ..$ name: chr [1:2] "BeyondRBasic" "RCrawler101-2"
## ..$ date: chr [1:2] "2016-05-14" "2016-01-09"
## ..$ hour: int [1:2] 3 3
intro$license = NULL
toJSON(intro, pretty = TRUE) # convert a list as a pretty JSON
## {
## "teacher": ["Mansun"],
## "isFullTime": [false],
## "hobby": ["Music", "Game"],
## "lecture": [
## {
## "name": "BeyondRBasic",
## "date": "2016-05-14",
## "hour": 3
## },
## {
## "name": "RCrawler101-2",
## "date": "2016-01-09",
## "hour": 3
## }
## ]
## }
In most situation, you can use data.table::fwrite and data.table::fread to do such tasks much easier and faster.
Here is an example to read a file generated by excel on Windows(Traditional Chinese):
library(data.table)
##
## Attaching package: 'data.table'
## The following objects are masked from 'package:lubridate':
##
## hour, mday, month, quarter, wday, week, yday, year
from_cp950 = fread("dataio/cp950.csv", encoding = "UTF-8")
from_cp950$name = iconv(from_cp950$name, from = "CP950", to = "UTF-8")
from_cp950
## name age
## 1: 王先生 34
## 2: 白小姐 23
Write the CP950 file into another file with UTF-8 encoding. Read the UTF-8 file again.
fwrite(from_cp950, "dataio/utf8.csv", sep = "\t") # default write as UTF-8
from_utf8 = fread("dataio/utf8.csv", encoding = "UTF-8")
from_utf8
## name age
## 1: 王先生 34
## 2: 白小姐 23
identical(from_cp950, from_utf8)
## [1] TRUE
A typical SQL query string:
SELECT <columns you want to retrieve>
FROM <table>
WHERE <comparison of columns>
GROUP BY <columns you want to compute the aggregation result>
ORDER BY <columns you want to order>
HAVING <conditions to filter the result of GROUP BY>
A light-weight database engine interface in R
Here we use data.table v1.9.7
Install development version of data.table:
install.packages("data.table", type = "source",
repos = "https://Rdatatable.github.io/data.table")
or
devtools::install_github("Rdatatable/data.table")
Load package:
library(data.table)
DT[i, j, by]
Compare with SQL:
R: | i | j | by |
---|---|---|---|
SQL: | where | select/update | group by |
Say it loud:
Take DT, subset rows using i, then calculate j grouped by by.
You can create a data.table just like the way you create a data.frame. It don’t convert character to factor by default.
library(data.table)
team = data.table(name = c("Wendy", "Alice", "James", "Bill"),
gender = c("f", "f", "M", "M"),
age = c(26, 24, 28, 32))
or convert a data.frame into data.table:
teamdf = data.frame(name = c("Wendy", "Alice", "James", "Bill"),
gender = c("f", "f", "M", "M"),
age = c(26, 24, 28, 32),
stringsAsFactors = FALSE)
teamdt = data.table(teamdf)
identical(team, teamdt)
## [1] TRUE
class(team)
## [1] "data.table" "data.frame"
You can use any functions that suitable for data.frame with data.table. Note that the behavior of [
operator is slightly different than data.frame.
team$gender
## [1] "f" "f" "M" "M"
team[[2]]
## [1] "f" "f" "M" "M"
team[2]
## name gender age
## 1: Alice f 24
teamdf$gender
## [1] "f" "f" "M" "M"
teamdf[[2]]
## [1] "f" "f" "M" "M"
teamdf[2]
## gender
## 1 f
## 2 f
## 3 M
## 4 M
Use i to subset a data.table is similar to data.frame. Furthermore, expression is evaluated within the frame of the data.table
team[1, ]
## name gender age
## 1: Wendy f 26
team[c(FALSE, FALSE, FALSE, TRUE), ]
## name gender age
## 1: Bill M 32
team[name == "Wendy", ]
## name gender age
## 1: Wendy f 26
team[name %in% c("Wendy", "Alice"), ]
## name gender age
## 1: Wendy f 26
## 2: Alice f 24
team[, .(gender)]
## gender
## 1: f
## 2: f
## 3: M
## 4: M
team[1, .(sex = gender)]
## sex
## 1: f
team[1, .(name, gender)]
## name gender
## 1: Wendy f
If you don’ wrap a single expression with list in j, the output will degenerate into a vector.
team[, gender]
## [1] "f" "f" "M" "M"
If you want to know how old is Wendy:
team[name == "Wendy", age]
## [1] 26
Because j handle expressions, you need to set with = FALSE when slicing with logical, interger or character vector
team[1, c("gender")] # not works
## [1] "gender"
team[1, c(TRUE, FALSE, TRUE), with = FALSE]
## name age
## 1: Wendy 26
team[1, c(1, 3), with = FALSE]
## name age
## 1: Wendy 26
team[1, c("name", "age"), with = FALSE]
## name age
## 1: Wendy 26
:= add/remove/update subsets of column by reference.
team[, id := 1:4] # add a column
team[name == "James", age := 38] # UPDATE age for james
team[name == "Bill", c("name", "age") := .("Billy", 34)] # update 2 columns
team[, id := NULL] # remove id
team[gender == "f", `:=`(gender = "F", age = age + 5)]
team
## name gender age
## 1: Wendy F 31
## 2: Alice F 29
## 3: James M 38
## 4: Billy M 34
Use a list of expression in j to compute with data.table
stat = team[, .(mean_age = mean(age), sd_age = sd(age))]
stat
## mean_age sd_age
## 1: 33 3.91578
Add a new column:
team[, mean_age := mean(age)]
team
## name gender age mean_age
## 1: Wendy F 31 33
## 2: Alice F 29 33
## 3: James M 38 33
## 4: Billy M 34 33
Note that putting := inside .() doesn’t work:
stat = team[, .(mean_age := mean(age), sd_age := sd(age))] # not works
You can use by to summarize by each group. keyby will add key automatically.
Generate a data.table of mean and stand deviation by gender:
team[, .(mean_age = mean(age), sd_age = sd(age)), by = "gender"]
## gender mean_age sd_age
## 1: F 30 1.414214
## 2: M 36 2.828427
Add a new column of mean age by gender:
team[, gender_mean_age := mean(age), by = "gender"]
team
## name gender age mean_age gender_mean_age
## 1: Wendy F 31 33 30
## 2: Alice F 29 33 30
## 3: James M 38 33 36
## 4: Billy M 34 33 36
team[.N, ] # subset last column
## name gender age mean_age gender_mean_age
## 1: Billy M 34 33 36
team[, .(count = .N), by = gender] # a new data.table, count by gender
## gender count
## 1: F 2
## 2: M 2
team[, count := .N, by = gender] # add a column in data.table
team
## name gender age mean_age gender_mean_age count
## 1: Wendy F 31 33 30 2
## 2: Alice F 29 33 30 2
## 3: James M 38 33 36 2
## 4: Billy M 34 33 36 2
You can use expressions in by to separate groups. Here we divide groups with gender and age great equal 30:
team[, .(count = .N, mean_age = mean(age)), by = .(old = age >= 30, gender)]
## old gender count mean_age
## 1: TRUE F 1 31
## 2: FALSE F 1 29
## 3: TRUE M 2 36
Here is a data.table contains four pets’ name, species and weight:
library(data.table)
pets = data.table(name = c("Senor", "Cookie", "Lucky", "Meow Meow"),
species = c("dog", "dog", "bird", "cat"),
weight = c(26, 13, 0.5, 6))
Please:
Get last row of each gender:
team[, .SD[.N, ], by = gender]
## gender name age mean_age gender_mean_age count
## 1: F Alice 29 33 30 2
## 2: M Billy 34 33 36 2
Get last row of each gender:
team[, .SD[age == max(age), ], by = gender]
## gender name age mean_age gender_mean_age count
## 1: F Wendy 31 33 30 2
## 2: M James 38 33 36 2
Do some computation on age, mean_age and gender_mean_age with .SDcols:
age_cols = c("age", "mean_age", "gender_mean_age")
team[, c(age_cols) := lapply(.SD, function(x) x + 2), .SDcols = age_cols]
It’s ok to use j inside .SD to do the same thing:
team[, c(age_cols) := lapply(.SD[, age_cols, with = FALSE],
function(x) x +2)]
team
## name gender age mean_age gender_mean_age count
## 1: Wendy F 35 37 34 2
## 2: Alice F 33 37 34 2
## 3: James M 42 37 40 2
## 4: Billy M 38 37 40 2
DT[ … ][ … ][ … ]
Chaining of data.table can avoid unnecessary intermediate assignment.
age_comparator = function(dt) {
ifelse(dt[gender == "M", mean_age] >= dt[gender == "M", mean_age],
"Males are older", "Females are older")
}
gender_mean_age = team[, .(mean_age = mean(age)), by = "gender"]
gender_mean_age[, comment := age_comparator(.SD)]
# Use Chaining
gender_mean_age2 = team[, .(mean_age = mean(age)), by = "gender",
][, comment := age_comparator(.SD)]
gender_mean_age
## gender mean_age comment
## 1: F 34 Males are older
## 2: M 40 Males are older
identical(gender_mean_age, gender_mean_age2)
## [1] TRUE
You can set key in a data.table to:
setkey(team, gender, name) # set name as key
setkeyv(team, c("gender", "name")) # set name as key using character
key(team) # show keys in a data.table
## [1] "gender" "name"
team[.("F", "Wendy"),] # now you can slicing with key
## name gender age mean_age gender_mean_age count
## 1: Wendy F 35 37 34 2
team[.("F"), ] # prefix is ok
## name gender age mean_age gender_mean_age count
## 1: Alice F 33 37 34 2
## 2: Wendy F 35 37 34 2
team[.("Wendy"),] # keys have orders. This example does not work
## name gender age mean_age gender_mean_age count
## 1: NA Wendy NA NA NA NA
Ordering with keys is faster:
team[order(gender, -name)]
## name gender age mean_age gender_mean_age count
## 1: Wendy F 35 37 34 2
## 2: Alice F 33 37 34 2
## 3: James M 42 37 40 2
## 4: Billy M 38 37 40 2
You still can ordering without keys:
team[order(gender, -age)]
## name gender age mean_age gender_mean_age count
## 1: Wendy F 35 37 34 2
## 2: Alice F 33 37 34 2
## 3: James M 42 37 40 2
## 4: Billy M 38 37 40 2
All set* functions change their input by reference.
Set row names as the first letter of name:
attributes(team)$row.names
## [1] 1 2 3 4
setattr(team, "row.names", substr(team$name, 1, 1))
attributes(team)$row.names
## [1] "A" "W" "B" "J"
Set column names as upper case:
setnames(team, "name", "Name") # set name to Name
colnames(team)
## [1] "Name" "gender" "age" "mean_age"
## [5] "gender_mean_age" "count"
setnames(team, colnames(team), tolower(colnames(team))) # set all lower
colnames(team)
## [1] "name" "gender" "age" "mean_age"
## [5] "gender_mean_age" "count"
Use set to update the name of first person:
set(team, 1L, "name", "Alison")
team
## name gender age mean_age gender_mean_age count
## 1: Alison F 33 37 34 2
## 2: Wendy F 35 37 34 2
## 3: Billy M 38 37 40 2
## 4: James M 42 37 40 2
Set genderas the first column:
new_order = c("gender", colnames(team)[!colnames(team) %in% "gender"])
setcolorder(team, new_order)
team
## gender name age mean_age gender_mean_age count
## 1: F Alison 33 37 34 2
## 2: F Wendy 35 37 34 2
## 3: M Billy 38 37 40 2
## 4: M James 42 37 40 2
:= and set* modify the data.table by references. You can use copy to do a deep copy to avoid side effect (modify a object that doesn’t return by the function).
deepf = function(dt) {
tt = copy(dt)
tt[, name := substr(name, 1, 1)]
}
deepf(team)
team$name
## [1] "Alison" "Wendy" "Billy" "James"
shallowf = function(dt) {
tt = dt
tt[, name := substr(name, 1, 1)]
}
shallowf(team)
team$name
## [1] "A" "W" "B" "J"
Notice that we modify tt’s name inside shallowf but the data.table team is also changed.
It is always a good idea letting others(or you in the future) know what objects will be modified if your function have side effect.
team = shallowf(team)
Note that in R 3.3.0, base::order has a new argument method to choose the sorting method and the radix sorting comes from data.table (Ref). Default sorting method of base::order is still shell sorting.
Use tables to list all data.table’s in memory.
all_tables = tables(silent=TRUE, width = 60)
all_tables[, COLS := NULL]
all_tables # KEY column disappear in slides for unknown reason
## NAME NROW NCOL MB KEY
## 1: from_cp950 2 2 1
## 2: from_utf8 2 2 1
## 3: gender_mean_age 2 3 1
## 4: gender_mean_age2 2 3 1
## 5: pets 4 3 1
## 6: stat 1 2 1
## 7: team 4 6 1
## 8: teamdt 4 3 1
Using hourly weather data from 2012-08-15 to 2015-08-14 to illustrate features of data.table.
Special symbols in this data:
weather = fread("data/weather.csv")
## Warning in fread("data/weather.csv"): Bumped column 1 to type character on
## data row 51314, field contains '觀測時間(LST)ObsTime'. Coercing previously
## read values in this column from logical, integer or numeric back to
## character which may not be lossless; e.g., if '00' and '000' occurred
## before they will now be just '0', and there may be inconsistencies with
## treatment of ',,' and ',NA,' too (if they occurred in this column before
## the bump). If this matters please rerun and set 'colClasses' to 'character'
## for this column. Please note that column type detection uses the first
## 5 rows, the middle 5 rows and the last 5 rows, so hopefully this message
## should be very rare. If reporting to datatable-help, please rerun and
## include the output from verbose=TRUE.
filepaths = list.files("data/daily", full.names = TRUE)
locations = gsub("data\\/daily\\/|_....-..-..\\.csv", "", filepaths)
dates = gsub("data\\/daily\\/.*_|\\.csv", "", filepaths)
weather = list()
for (i in 1:length(filepaths)) {
weather[[i]] = fread(filepaths[i])
weather[[i]][, city := locations[i]]
weather[[i]][, date := dates[i]]
}
weather = rbindlist(weather)
A fast data writer fwrite was introduced in data.table 1.9.7
# output a comma-separated file
fwrite(weather, "data/weather.csv")
# output a tab-separated file
fwrite(weather[1:24, ], "data/weather.txt", sep = "\t")
Here we use lubridate::ymd_h to generate datetime. The warning message is due to the missing data of Kaohsiung in 2015-06-24
library(lubridate, quietly = TRUE)
weather[, ObsTime2 := ObsTime]
# For datetime convertion
weather[ObsTime2 == "24", ObsTime2 := "0"]
# Convert
weather[,datetime := ymd_h(paste(date, ObsTime2), tz = "CST")]
## Warning: 2 failed to parse.
# Plus one day for ObsTime = "24"
weather[ObsTime == "24", datetime := datetime + days(1)]
# Check for rows that don't convert successful
weather[is.na(datetime)]
## ObsTime StnPres SeaPres
## 1:
## 2: 觀測時間(LST)ObsTime 測站氣壓(hPa)StnPres 海平面氣壓(hPa)SeaPres
## Temperature Tddewpoint RH WS
## 1:
## 2: 氣溫(℃)Temperature 露點溫度(℃)Td dew point 相對溼度(%)RH 風速(m/s)WS
## WD WSGust
## 1:
## 2: 風向(最多風向)(360degree)WD 最大陣風(m/s)WSGust
## WDGust Precp PrecpHour
## 1:
## 2: 最大陣風風向(360degree)WDGust 降水量(mm)Precp 降水時數(hr)PrecpHour
## SunShine GloblRad Visb
## 1:
## 2: 日照時數(hr)SunShine 全天空日射量(MJ/㎡)GloblRad 能見度(km)Visb
## city date ObsTime2 datetime
## 1: Kaohsiung 2015-06-24 <NA>
## 2: Kaohsiung 2015-06-24 觀測時間(LST)ObsTime <NA>
# Drop rows that don't convert successful
weather = weather[!is.na(datetime),]
str(weather)
## Classes 'data.table' and 'data.frame': 105096 obs. of 19 variables:
## $ ObsTime : chr "1" "2" "3" "4" ...
## $ StnPres : chr "1002.9 " "1001.5 " "1002.6 " "1002.3 " ...
## $ SeaPres : chr "1005.1 " "1003.7 " "1004.8 " "1004.5 " ...
## $ Temperature: chr "28.6 " "28.3 " "28.3 " "28.1 " ...
## $ Tddewpoint : chr "21.1 " "22.1 " "21.6 " "21.7 " ...
## $ RH : chr "64 " "69 " "67 " "68 " ...
## $ WS : chr "1.3 " "3.2 " "2.5 " "1.1 " ...
## $ WD : chr "340.0 " "20.0 " "330.0 " "330.0 " ...
## $ WSGust : chr "2.7 " "5.8 " "4.6 " "4.2 " ...
## $ WDGust : chr "310.0 " "40.0 " "20.0 " "320.0 " ...
## $ Precp : chr "0.0 " "0.0 " "0.0 " "0.0 " ...
## $ PrecpHour : chr "0.0 " "0.0 " "0.0 " "0.0 " ...
## $ SunShine : chr " " " " " " " " ...
## $ GloblRad : chr "0.00 " "0.00 " "0.00 " "0.00 " ...
## $ Visb : chr " " "10.0 " " " " " ...
## $ city : chr "Hualien" "Hualien" "Hualien" "Hualien" ...
## $ date : chr "2012-08-15" "2012-08-15" "2012-08-15" "2012-08-15" ...
## $ ObsTime2 : chr "1" "2" "3" "4" ...
## $ datetime : POSIXct, format: "2012-08-15 01:00:00" "2012-08-15 02:00:00" ...
## - attr(*, ".internal.selfref")=<externalptr>
Remove special symbols in our data.
cnames = colnames(weather)
num_c = cnames[!cnames %in% c("city", "date", "datetime")]
weather[, c(num_c) := (lapply(.SD, function(x){gsub("[^0-9.TxV]", "", x)})),
.SDcols = num_c]
You can update partial cells in a data.table.
# Check which numeric columns contain special symbols
weather[, lapply(.SD, function(x){sum(grepl("^[TxV]$", x), na.rm = FALSE)}),
.SDcol = num_c]
## ObsTime StnPres SeaPres Temperature Tddewpoint RH WS WD WSGust WDGust
## 1: 0 0 0 0 0 0 0 209 0 0
## Precp PrecpHour SunShine GloblRad Visb ObsTime2
## 1: 3884 0 0 0 0 0
weather[WD == "V", WD := ""]
weather[Precp == "T", Precp := "0"]
weather[, `:=`(month = substr(date, 6, 7), isRain = Precp > 0)]
weather[, c(num_c) := (lapply(.SD, as.numeric)), .SDcols = num_c]
weather[, date := as.Date(date)]
Find hottest date in each city
hottest = weather[, .(datetime, Temperature, T.Max = max(Temperature)),
keyby = city
][, count := .N
][Temperature == T.Max,
head(.SD[, .(T.Max, datetime)], 3), # output top 3 if ties
keyby = city]
cj = CJ(weather$city, as.character(weather$datetime), unique = TRUE)[
, V2 := ymd_hms(V2, tz = "CST")]
setnames(cj, names(cj), c("city", "datetime"))
weather = merge(cj, weather, all.x = TRUE, all.y = FALSE, by = c("city", "datetime"))
weather[, date := as.character(datetime)
][,`:=`(date = as.Date(substr(date, 1, 10)),
year = substr(date, 1, 4),
month = substr(date, 6, 7),
hour = substr(date, 12, 13))]
weather[city == "Kaohsiung" & date == ymd("2015-06-24"),
.(city, date, Temperature, Tddewpoint)]
## city date Temperature Tddewpoint
## 1: Kaohsiung 2015-06-24 29.4 24.7
## 2: Kaohsiung 2015-06-24 NA NA
## 3: Kaohsiung 2015-06-24 NA NA
## 4: Kaohsiung 2015-06-24 NA NA
## 5: Kaohsiung 2015-06-24 NA NA
## 6: Kaohsiung 2015-06-24 NA NA
## 7: Kaohsiung 2015-06-24 NA NA
## 8: Kaohsiung 2015-06-24 NA NA
## 9: Kaohsiung 2015-06-24 NA NA
## 10: Kaohsiung 2015-06-24 NA NA
## 11: Kaohsiung 2015-06-24 NA NA
## 12: Kaohsiung 2015-06-24 NA NA
## 13: Kaohsiung 2015-06-24 NA NA
## 14: Kaohsiung 2015-06-24 NA NA
## 15: Kaohsiung 2015-06-24 NA NA
## 16: Kaohsiung 2015-06-24 NA NA
## 17: Kaohsiung 2015-06-24 NA NA
## 18: Kaohsiung 2015-06-24 NA NA
## 19: Kaohsiung 2015-06-24 NA NA
## 20: Kaohsiung 2015-06-24 NA NA
## 21: Kaohsiung 2015-06-24 NA NA
## 22: Kaohsiung 2015-06-24 NA NA
## 23: Kaohsiung 2015-06-24 NA NA
## 24: Kaohsiung 2015-06-24 NA NA
## city date Temperature Tddewpoint
We can subset row by a particular columns after assiggn it as a key with setkey.
inter_mean = weather[year == "2015" & month == "06" & city == "Kaohsiung",
.(meanT = mean(Temperature, na.rm = TRUE),
meanTd = mean(Tddewpoint, na.rm = TRUE)),
by = hour]
setkey(inter_mean, hour)
Tj = which(colnames(weather) == "Temperature")
Tdj = which(colnames(weather) == "Tddewpoint")
for (i in which(is.na(weather$Temperature))) {
set(weather, i, Tj, inter_mean[weather[i, hour], meanT])
set(weather, i, Tdj, inter_mean[weather[i, hour], meanTd])
}
weather[city == "Kaohsiung" & date == ymd("2015-06-24"),
.(city, date, Temperature, Tddewpoint)]
## city date Temperature Tddewpoint
## 1: Kaohsiung 2015-06-24 29.40000 24.70000
## 2: Kaohsiung 2015-06-24 29.40345 25.12069
## 3: Kaohsiung 2015-06-24 29.24138 24.95172
## 4: Kaohsiung 2015-06-24 28.99655 24.77586
## 5: Kaohsiung 2015-06-24 28.80690 24.67586
## 6: Kaohsiung 2015-06-24 28.57241 24.51379
## 7: Kaohsiung 2015-06-24 28.65517 24.47586
## 8: Kaohsiung 2015-06-24 29.74483 25.23103
## 9: Kaohsiung 2015-06-24 30.87586 25.63448
## 10: Kaohsiung 2015-06-24 31.35172 25.98276
## 11: Kaohsiung 2015-06-24 31.82414 26.42069
## 12: Kaohsiung 2015-06-24 32.22069 26.47586
## 13: Kaohsiung 2015-06-24 32.45172 26.61034
## 14: Kaohsiung 2015-06-24 32.60000 26.65172
## 15: Kaohsiung 2015-06-24 32.62759 26.67241
## 16: Kaohsiung 2015-06-24 32.14828 26.61379
## 17: Kaohsiung 2015-06-24 32.03448 26.45862
## 18: Kaohsiung 2015-06-24 31.53793 26.16897
## 19: Kaohsiung 2015-06-24 30.85517 25.68276
## 20: Kaohsiung 2015-06-24 30.33793 25.39310
## 21: Kaohsiung 2015-06-24 30.11034 25.28276
## 22: Kaohsiung 2015-06-24 29.98966 25.26207
## 23: Kaohsiung 2015-06-24 29.79310 25.23448
## 24: Kaohsiung 2015-06-24 29.74483 25.30345
## city date Temperature Tddewpoint
Now we would like to compute regression coefficient by each city. In this example, all values are concatenate into single column.
weather[, .(coef = coef(lm(Tddewpoint ~ Temperature))), by = city]
## city coef
## 1: Hualien -2.7831813
## 2: Hualien 0.9292042
## 3: Kaohsiung -5.8939668
## 4: Kaohsiung 1.0290631
## 5: Taichung -1.2627112
## 6: Taichung 0.8169720
## 7: Taipei -2.0428652
## 8: Taipei 0.8500687
As long as j returns a list, each element of the list will become a column in the resulting data.table.
weather[, as.list(coef(lm(Tddewpoint ~ Temperature))), by = city]
## city (Intercept) Temperature
## 1: Hualien -2.783181 0.9292042
## 2: Kaohsiung -5.893967 1.0290631
## 3: Taichung -1.262711 0.8169720
## 4: Taipei -2.042865 0.8500687
data.table implement melt in reshape2. Additionally, you can use regular expression with patterns to melt multiple columns.
melted = melt.data.table(weather,
id.vars = c("city", "datetime"),
measure = patterns(".*Pres", "Precp"),
value.name = c("Pres", "Precp")
)
melted[c(1,2,.N-1,.N),]
## city datetime variable Pres Precp
## 1: Hualien 2012-08-15 01:00:00 1 1002.9 0
## 2: Hualien 2012-08-15 02:00:00 1 1001.5 0
## 3: Taipei 2015-08-14 23:00:00 2 1008.7 0
## 4: Taipei 2015-08-15 00:00:00 2 1008.4 0
You can also use dcast with data.table. Furthermore, you can cast multiple value.vars simultaneously.
dcasted = dcast.data.table(melted, city + datetime ~ variable,
value.var = c("Pres", "Precp"))
dcasted[c(1,2,.N-1,.N),]
## city datetime Pres_1 Pres_2 Precp_1 Precp_2
## 1: Hualien 2012-08-15 01:00:00 1002.9 1005.1 0 0
## 2: Hualien 2012-08-15 02:00:00 1001.5 1003.7 0 0
## 3: Taipei 2015-08-14 23:00:00 1005.2 1008.7 0 0
## 4: Taipei 2015-08-15 00:00:00 1004.9 1008.4 0 0
Sometime you may need to execute your R script periodically and automatically:
Environment variable of R and RScript will be exported in this way.
Rscript <PATH OF YOUR R SCRIPT> <ARG1> <ARG2> ...
If you use Linux or Mac, you can execute a R script directly with shebang and permission of execution.
#!/usr/bin/env Rscript
Add permission of execution:
chmod u+x schedualing/now.R
Check permission:
ls -l schedualing/now.R
## -rwxrw-rw- 1 mansun mansun 1168 May 14 02:35 schedualing/now.R
In most situation, I will use a package to dealing with arguments parsing.
Please download our github repository under github directory. Let’s try to pass some arguments to schedualing/now.R:
Get help:
Rscript schedualing/now.R -h
Pass a positional argument:
Rscript schedualing/now.R Mansun
Add a flag -c:
Rscript schedualing/now.R Mansun -c
Add a optional argument:
Rscript schedualing/now.R Mansun -c -g "How are you?"
#!/usr/bin/env Rscript
# Use argparser
library(argparser, quietly = TRUE)
# Create an arg.parser object.
p <- arg_parser("Hi! What time it is?")
# Add a positional argument
p <- add_argument(p, arg = "who", help = "Who are you")
# Add an optional argument
# Rscript will raise a warning message if you pass with -g
# but it doenn't matter
p <- add_argument(p, arg = "--greeting", short = "-g",
default = "How's going?", type = "character",
help = "Greeting word")
# Add a flag, default value is FALSE
p <- add_argument(p, arg = "--chat", short = "-c", flag = TRUE,
help = "Whether or not to have a greeting")
# Parse commandArgs(trailingOnly = TRUE) into args
args <- parse_args(p)
print(args)
str(args)
# The original arguments
command_args <- commandArgs(trailingOnly = TRUE)
cat("args from command_args:")
print(command_args)
# Get system time
now <- as.character(Sys.time())
# Construct greeting string
greeting <- sprintf("Hi %s! It is %s.", args$who, now)
if (args$chat) {
greeting <- paste(greeting, args$greeting)
}
print(greeting)
# Write to a text file
writeLines(greeting, "now.txt")
Launch task scheduler on Windows:
Launch task schedualer within RStudio:
# Execute a system conmmand to launch task schedualer
system("control schedtasks")
Maintain crontab files to execute scheduled commands in Unix-like OS for individual users.
crontab [-u user] file
crontab [ -u user ] [ -i ] { -e | -l | -r }
-e (edit user's crontab)
-l (list user's crontab)
-r (delete user's crontab)
-i (prompt before deleting user's crontab)
You may need to use some command line editor like vim when using crontab -e to edit your crontab. Type select-editor in terminal to choose your favorite editor
# Execute every minutes
* * * * * cd /home/mansun/github/BeyondBasicR/schedualing; ./now.R Mansun
crontab schedualing/crontab.txt
crontab -l
## # Execute every minutes
## * * * * * cd /home/mansun/github/BeyondBasicR/schedualing; ./now.R Mansun
┌───────────── min (0 - 59)
│ ┌────────────── hour (0 - 23)
│ │ ┌─────────────── day of month (1 - 31)
│ │ │ ┌──────────────── month (1 - 12)
│ │ │ │ ┌───────────────── day of week (0 - 6) (0 to 6 are Sunday to Saturday)
│ │ │ │ │
* * * * * command to execute
# Execute at 00:00 and 12:00 everyday
0 0,12 * * * command to execute
# Execute at 06:00 every Monday to Friday
0 6 * * 1-5 command to execute
Please refer to Cron for further information.
Set a schedualing job on your favorite OS.