# data.table: Swiss Army Knife for Data Wrangling

While data.table is mostly praised for its speed (which is undeniable), I was impressed with its versatility as well. The ability to manipulate the data in variety of ways using an extremely compact syntax is second to none. Sure it needs a bit of getting used to but the payoffs are phenomenal. I implemented one of my recent project using data.table which helped me learn some of its really cool features. The package’s documentation does an impressive job too communicating the philosophy as well as usage. Here I document some of its features I have learned so far. I plan to update this regularly as I learn more.

I am going to use `ggplot2::mpg`

data set for demonstration of its data wrangling features. So let’s first create a data.table from this data:

```
library(data.table)
mpg <- data.table(ggplot2::mpg)
mpg
#> manufacturer model displ year cyl trans drv cty hwy fl class
#> 1: audi a4 1.8 1999 4 auto(l5) f 18 29 p compact
#> 2: audi a4 1.8 1999 4 manual(m5) f 21 29 p compact
#> 3: audi a4 2.0 2008 4 manual(m6) f 20 31 p compact
#> 4: audi a4 2.0 2008 4 auto(av) f 21 30 p compact
#> 5: audi a4 2.8 1999 6 auto(l5) f 16 26 p compact
#> ---
#> 230: volkswagen passat 2.0 2008 4 auto(s6) f 19 28 p midsize
#> 231: volkswagen passat 2.0 2008 4 manual(m6) f 21 29 p midsize
#> 232: volkswagen passat 2.8 1999 6 auto(l5) f 16 26 p midsize
#> 233: volkswagen passat 2.8 1999 6 manual(m5) f 18 26 p midsize
#> 234: volkswagen passat 3.6 2008 6 auto(s6) f 17 26 p midsize
```

### Row Selection

**Select first row**

`mpg[1]`

**Select first three rows**

`mpg[1:3]`

**Select all but first three rows**

`mpg[-(1:3)]`

**Select rows five through seven but with order reversed**

`mpg[7:5]`

**Select the last row**

`.N`

is a special symbol to get the number of rows in the data table:

`mpg[.N]`

**Select all rows except the last row**

`mpg[-.N]`

**Random 10 observations**

`mpg[sample(.N, 10)]`

**Random 10% observations**

`mpg[sample(.N, (.N * 0.1))]`

**Top 10 rows**

Any data.frame functions can directly be used because data.table is also a data.frame

`head(mpg, 10)`

**Bottom 10 rows**

`tail(mpg, 10)`

**Select all a4 model observations**

`mpg[model == "a4"]`

**Select all 6 cylinder a4 models**

`mpg[model == "a4" & cyl == 6]`

**Select all non- a4 model observations**

`mpg[model != "a4"]`

**Select all models except a4 and passat**

`mpg[!model %in% c("a4", "passat")]`

**Select models with city fuel economy more than 25 mpg**

`mpg[cty > 25]`

**Select every other row**

`mpg[rep(c(TRUE, FALSE), length = .N)]`

**Select compact vehicles for 2008 model year**

Multiple variables can be used with `on`

argument

`mpg[.(2008, "compact"), on = c("year", "class")]`

**Amongst subcompact and midsize vehicles of 1999 model year, what is the best highway fuel economy**

`mpg[.(1999, c("subcompact", "midsize")), min(hwy), on = c("year", "class")]`

**Select all observations with model name starting with c **

`mpg[grep("^c", model)]`

**First row of 2008 model year compact vehicles**

`mult`

argument can be used to get first/last observation in a sub-group

`mpg[.(2008, "compact"), mult = "first", on = c("year", "class")]`

**Last row of 1999 model year SUVs**

`mpg[.(1999, "suv"), mult = "last", on = c("year", "class")]`

**First row of all three subgroups: 4, 5 and 6 cylinder rear wheel drive cars**

```
mpg[.(c(4, 5, 6), "r"), mult = "first", on = c("cyl", "drv")]
#> manufacturer model displ year cyl trans drv cty hwy fl
#> 1: <NA> <NA> NA NA 4 <NA> r NA NA <NA>
#> 2: <NA> <NA> NA NA 5 <NA> r NA NA <NA>
#> 3: ford mustang 3.8 1999 6 manual(m5) r 18 26 r
#> class
#> 1: <NA>
#> 2: <NA>
#> 3: subcompact
```

**First row of all three subgroups: 4, 5 and 6 cylinder rear wheel drive cars, but matching rows only**

`nomatch = 0L`

can be used to show only matching rows

```
mpg[.(c(4, 5, 6), "r"), mult = "first", nomatch = 0L, on = c("cyl", "drv")]
#> manufacturer model displ year cyl trans drv cty hwy fl
#> 1: ford mustang 3.8 1999 6 manual(m5) r 18 26 r
#> class
#> 1: subcompact
```

**First row of each class**

`.SD`

can be used to select all columns

`mpg[, head(.SD, 1), by = class]`

**Last row of each class**

`mpg[, tail(.SD, 1), by = class]`

**Five worst highway fuel economy vehicles**

`head(mpg[order(hwy)], 5)`

**Five best highway fuel economy vehicles**

`head(mpg[order(-hwy)], 5)`

**Best fuel economy vehicle in each class**

`mpg[order(hwy), head(.SD, 1), by = class]`

### Column (variable) Selection

**Select model, cyl, drv, hwy**

`mpg[, .(model, cyl, drv, hwy)]`

**Select first three variables**

`mpg[, 1:3]`

**Select all but first three variables**

`mpg[, -(1:3)]`

**Select variables by first storing in a character vector**

```
cols <- c("model", "cyl", "drv", "hwy")
mpg[, ..cols] # option 1
mpg[, cols, with = FALSE] # option 2
```

**Select all but variables stored in a character vector**

```
cols <- c("model", "cyl", "drv", "hwy")
mpg[, -..cols] # option 1
mpg[, -cols, with = FALSE] # option 2
```

**Select all variables in the range displ:drv**

`mpg[, displ:drv]`

**Select all variables except those in the range displ:drv**

`mpg[, -(displ:drv)]`

**Select all variables starting with m **

```
cols <- grep("^m", names(mpg))
mpg[, ..cols]
```

**Select all variables not starting with m **

```
cols <- grep("^m", names(mpg))
mpg[, -..cols]
```

**Select all numeric columns **

```
cols <- mpg[, sapply(.SD, is.numeric)]
mpg[, ..cols]
```

**Select all non-numeric columns **

```
cols <- mpg[, sapply(.SD, is.numeric)]
mpg[, -..cols]
```

### Computing on J

**Number of observations**

`mpg[, .N]`

**Number of observations in each class**

`mpg[, .N, by = class]`

**Number of observations in each class, ordered alphabetically by class**

`mpg[, .N, keyby = class]`

**Mean city fuel economy for each class**

`mpg[, mean(cty), by = class]`

**Mean city fuel economy for each class and assign column name**

`mpg[, .(mean_city = mean(cty)), by = class]`

**Mean city fuel economy for each class in increasing order**

`mpg[, .(mean_city = mean(cty)), keyby = class]`

**Mean highway fuel economy for honda and toyota vehicles**

`mpg[manufacturer %in% c("honda", "toyota"), mean(hwy), by = manufacturer]`

**Class of all variables, return as data table**

`mpg[, lapply(.SD, class)]`

**Class of all variables, return as vector**

`mpg[, sapply(.SD, class)]`

**Fraction of NA values for each variable**

`mpg[, lapply(.SD, function(x) mean(is.na(x)))]`

**Mean of all numeric variables**

`.SDcols`

can be used to apply a function to only subset of columns

```
cols <- mpg[, sapply(.SD, is.numeric)]
mpg[, lapply(.SD, mean), .SDcols = cols]
```

**Change all non-numeric columns to upper case**

```
cols <- mpg[, sapply(.SD, is.numeric)]
mpg[, lapply(.SD, toupper), .SDcols = -cols]
```

**Number of vehicles with average fuel economy more than 25**

`mpg[, sum(((cty + hwy) / 2) > 25)]`

### More Expressions in J

**Scatterplot of city fuel economy versus displacement**

`mpg[, lattice::xyplot(cty ~ displ)]`

**Density plot of highway fuel economy**

`mpg[, lattice::densityplot(~ hwy)]`

### Update by Reference

`:=`

is used to update data.table by reference

**Add a column for average fuel economy**

`mpg[, avg_mpg := (cty + hwy) / 2]`

**Add a column with mean of city fuel economy**

`mpg[, avecty := mean(cty)]`

**Add a factor variable (values = good, bad) based on city fuel economy being more or less than average city fuel economy**

`mpg[, mpgrating := ifelse(cty < mean(cty), "good", "bad")]`

**Change ‘f’ to ‘Front’ for drv values**

`mpg["f", drv := "Front", on = "drv"]`

### Remove Columns

**Remove fl column**

`mpg[, fl := NULL]`

**Remove first three columns**

`mpg[, c(1:3) := NULL]`

**Remove last column**

`mpg[, length(mpg) := NULL]`

**Remove all columns starting with m**

`mpg[, grep("^m", names(mpg)) := NULL]`

**Remove manufacturer and year columns**

```
mpg[, `:=`(manufacturer = NULL,
year = NULL)]
```