data.table: Swiss Army Knife for Data Wrangling

While data.table is mostly praised for its speed (which is undeniable), I was impressed with its versatility as well. The ability to manipulate the data in variety of ways using an extremely compact syntax is second to none. Sure it needs a bit of getting used to but the payoffs are phenomenal. I implemented one of my recent project using data.table which helped me learn some of its really cool features. The package’s documentation does an impressive job too communicating the philosophy as well as usage. Here I document some of its features I have learned so far. I plan to update this regularly as I learn more.

I am going to use ggplot2::mpg data set for demonstration of its data wrangling features. So let’s first create a data.table from this data:

library(data.table)
mpg <- data.table(ggplot2::mpg)
mpg
#>      manufacturer  model displ year cyl      trans drv cty hwy fl   class
#>   1:         audi     a4   1.8 1999   4   auto(l5)   f  18  29  p compact
#>   2:         audi     a4   1.8 1999   4 manual(m5)   f  21  29  p compact
#>   3:         audi     a4   2.0 2008   4 manual(m6)   f  20  31  p compact
#>   4:         audi     a4   2.0 2008   4   auto(av)   f  21  30  p compact
#>   5:         audi     a4   2.8 1999   6   auto(l5)   f  16  26  p compact
#>  ---                                                                     
#> 230:   volkswagen passat   2.0 2008   4   auto(s6)   f  19  28  p midsize
#> 231:   volkswagen passat   2.0 2008   4 manual(m6)   f  21  29  p midsize
#> 232:   volkswagen passat   2.8 1999   6   auto(l5)   f  16  26  p midsize
#> 233:   volkswagen passat   2.8 1999   6 manual(m5)   f  18  26  p midsize
#> 234:   volkswagen passat   3.6 2008   6   auto(s6)   f  17  26  p midsize

Row Selection

Select first row

mpg[1]

Select first three rows

mpg[1:3]

Select all but first three rows

mpg[-(1:3)]

Select rows five through seven but with order reversed

mpg[7:5]

Select the last row

.N is a special symbol to get the number of rows in the data table:

mpg[.N]

Select all rows except the last row

mpg[-.N]

Random 10 observations

mpg[sample(.N, 10)]

Random 10% observations

mpg[sample(.N, (.N * 0.1))]

Top 10 rows

Any data.frame functions can directly be used because data.table is also a data.frame

head(mpg, 10)

Bottom 10 rows

tail(mpg, 10)

Select all a4 model observations

mpg[model == "a4"]

Select all 6 cylinder a4 models

mpg[model == "a4" & cyl == 6]

Select all non-a4 model observations

mpg[model != "a4"]

Select all models except a4 and passat

mpg[!model %in% c("a4", "passat")]

Select models with city fuel economy more than 25 mpg

mpg[cty > 25]

Select every other row

mpg[rep(c(TRUE, FALSE), length = .N)]

Select compact vehicles for 2008 model year

Multiple variables can be used with on argument

mpg[.(2008, "compact"), on = c("year", "class")]

Amongst subcompact and midsize vehicles of 1999 model year, what is the best highway fuel economy

mpg[.(1999, c("subcompact", "midsize")), min(hwy), on = c("year", "class")]

Select all observations with model name starting with c

mpg[grep("^c", model)]

First row of 2008 model year compact vehicles

mult argument can be used to get first/last observation in a sub-group

mpg[.(2008, "compact"), mult = "first", on = c("year", "class")]

Last row of 1999 model year SUVs

mpg[.(1999, "suv"), mult = "last", on = c("year", "class")]

First row of all three subgroups: 4, 5 and 6 cylinder rear wheel drive cars

mpg[.(c(4, 5, 6), "r"), mult = "first", on = c("cyl", "drv")]
#>    manufacturer   model displ year cyl      trans drv cty hwy   fl
#> 1:         <NA>    <NA>    NA   NA   4       <NA>   r  NA  NA <NA>
#> 2:         <NA>    <NA>    NA   NA   5       <NA>   r  NA  NA <NA>
#> 3:         ford mustang   3.8 1999   6 manual(m5)   r  18  26    r
#>         class
#> 1:       <NA>
#> 2:       <NA>
#> 3: subcompact

First row of all three subgroups: 4, 5 and 6 cylinder rear wheel drive cars, but matching rows only

nomatch = 0L can be used to show only matching rows

mpg[.(c(4, 5, 6), "r"), mult = "first", nomatch = 0L, on = c("cyl", "drv")]
#>    manufacturer   model displ year cyl      trans drv cty hwy fl
#> 1:         ford mustang   3.8 1999   6 manual(m5)   r  18  26  r
#>         class
#> 1: subcompact

First row of each class

.SD can be used to select all columns

mpg[, head(.SD, 1), by = class]

Last row of each class

mpg[, tail(.SD, 1), by = class]

Five worst highway fuel economy vehicles

head(mpg[order(hwy)], 5)

Five best highway fuel economy vehicles

head(mpg[order(-hwy)], 5)

Best fuel economy vehicle in each class

mpg[order(hwy), head(.SD, 1), by = class]

Column (variable) Selection

Select model, cyl, drv, hwy

mpg[, .(model, cyl, drv, hwy)]

Select first three variables

mpg[, 1:3]

Select all but first three variables

mpg[, -(1:3)]

Select variables by first storing in a character vector

cols <- c("model", "cyl", "drv", "hwy")
mpg[, ..cols] # option 1
mpg[, cols, with = FALSE] # option 2

Select all but variables stored in a character vector

cols <- c("model", "cyl", "drv", "hwy")
mpg[, -..cols] # option 1
mpg[, -cols, with = FALSE] # option 2

Select all variables in the range displ:drv

mpg[, displ:drv]

Select all variables except those in the range displ:drv

mpg[, -(displ:drv)]

Select all variables starting with m

cols <- grep("^m", names(mpg))
mpg[, ..cols]

Select all variables not starting with m

cols <- grep("^m", names(mpg))
mpg[, -..cols]

Select all numeric columns

cols <- mpg[, sapply(.SD, is.numeric)]
mpg[, ..cols]

Select all non-numeric columns

cols <- mpg[, sapply(.SD, is.numeric)]
mpg[, -..cols]

Computing on J

Number of observations

mpg[, .N]

Number of observations in each class

mpg[, .N, by = class]

Number of observations in each class, ordered alphabetically by class

mpg[, .N, keyby = class]

Mean city fuel economy for each class

mpg[, mean(cty), by = class]

Mean city fuel economy for each class and assign column name

mpg[, .(mean_city = mean(cty)), by = class]

Mean city fuel economy for each class in increasing order

mpg[, .(mean_city = mean(cty)), keyby = class]

Mean highway fuel economy for honda and toyota vehicles

mpg[manufacturer %in% c("honda", "toyota"), mean(hwy), by = manufacturer]

Class of all variables, return as data table

mpg[, lapply(.SD, class)]

Class of all variables, return as vector

mpg[, sapply(.SD, class)]

Fraction of NA values for each variable

mpg[, lapply(.SD, function(x) mean(is.na(x)))]

Mean of all numeric variables

.SDcols can be used to apply a function to only subset of columns

cols <- mpg[, sapply(.SD, is.numeric)]
mpg[, lapply(.SD, mean), .SDcols = cols]

Change all non-numeric columns to upper case

cols <- mpg[, sapply(.SD, is.numeric)]
mpg[, lapply(.SD, toupper), .SDcols = -cols]

Number of vehicles with average fuel economy more than 25

mpg[, sum(((cty + hwy) / 2) > 25)]

More Expressions in J

Scatterplot of city fuel economy versus displacement

mpg[, lattice::xyplot(cty ~ displ)]

Density plot of highway fuel economy

mpg[, lattice::densityplot(~ hwy)]

Update by Reference

:= is used to update data.table by reference

Add a column for average fuel economy

mpg[, avg_mpg := (cty + hwy) / 2]

Add a column with mean of city fuel economy

mpg[, avecty := mean(cty)]

Add a factor variable (values = good, bad) based on city fuel economy being more or less than average city fuel economy

mpg[, mpgrating := ifelse(cty < mean(cty), "good", "bad")]

Change ‘f’ to ‘Front’ for drv values

mpg["f", drv := "Front", on = "drv"]

Remove Columns

Remove fl column

mpg[, fl := NULL]

Remove first three columns

mpg[, c(1:3) := NULL]

Remove last column

mpg[, length(mpg) := NULL]

Remove all columns starting with m

mpg[, grep("^m", names(mpg)) := NULL]

Remove manufacturer and year columns

mpg[, `:=`(manufacturer = NULL,
           year = NULL)]