dtplyr provides a data.table backend for dplyr. The goal of dtplyr is to allow you to write dplyr code that is automatically translated to the equivalent, but usually much faster, data.table code.
See vignette("translation") for details of the current translations, and table.express and rqdatatable for related work.
You can install from CRAN with:
install.packages("dtplyr")
Or try the development version from GitHub with:
# install.packages("devtools")
devtools::install_github("tidyverse/dtplyr")
To use dtplyr, you must at least load dtplyr and dplyr. You may also want to load data.table so you can access the other goodies that it provides:
library(data.table)
library(dtplyr)
library(dplyr, warn.conflicts = FALSE)
Then use lazy_dt() to create a “lazy” data table that tracks the operations performed on it.
mtcars2 <- lazy_dt(mtcars)
You can preview the transformation (including the generated data.table code) by printing the result:
mtcars2 %>%
  filter(wt < 5) %>%
  mutate(l100k = 235.21 / mpg) %>% # liters / 100 km
  group_by(cyl) %>%
  summarise(l100k = mean(l100k))
#> Source: local data table [3 x 2]
#> Call: `_DT1`[wt < 5][, `:=`(l100k = 235.21/mpg)][, .(l100k = mean(l100k)),
#> keyby = .(cyl)]
#>
#> cyl l100k
#> <dbl> <dbl>
#> 1 4 9.05
#> 2 6 12.0
#> 3 8 14.9
#>
#> # Use as.data.table()/as.data.frame()/as_tibble() to access results
But generally you should reserve this only for debugging, and use as.data.table(), as.data.frame(), or as_tibble() to indicate that you’re done with the transformation and want to access the results:
mtcars2 %>%
  filter(wt < 5) %>%
  mutate(l100k = 235.21 / mpg) %>% # liters / 100 km
  group_by(cyl) %>%
  summarise(l100k = mean(l100k)) %>%
  as_tibble()
#> # A tibble: 3 × 2
#> cyl l100k
#> <dbl> <dbl>
#> 1 4 9.05
#> 2 6 12.0
#> 3 8 14.9
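If you only want to see the generated data.table code while debugging, without printing a preview of the data, show_query() can be applied to the lazy pipeline. The following is a minimal sketch; the variable names query and result are illustrative and not part of dtplyr:

```r
library(dtplyr)
library(dplyr, warn.conflicts = FALSE)

mtcars2 <- lazy_dt(mtcars)

# Build the pipeline; this only records the operations
query <- mtcars2 %>%
  filter(wt < 5) %>%
  group_by(cyl) %>%
  summarise(mpg = mean(mpg))

# Show the translated data.table call instead of a data preview
show_query(query)

# The computation runs when the result is collected
result <- as_tibble(query)
```

Keeping the lazy pipeline in a variable like this makes it easy to inspect the translation first and collect the result afterwards.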
There are two primary reasons that dtplyr will always be somewhat slower than data.table:
Each dplyr verb must do some work to convert dplyr syntax to data.table syntax. This takes time proportional to the complexity of the input code, not the input data, so should be a negligible overhead for large datasets. Initial benchmarks suggest that the overhead should be under 1ms per dplyr call.
To match dplyr semantics, mutate() does not modify in place by default. This means that most expressions involving mutate() must make a copy that would not be necessary if you were using data.table directly. (You can opt out of this behaviour in lazy_dt() with immutable = FALSE).
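The effect of the second point can be seen with a small sketch. It assumes two throwaway data tables (dt_default and dt_inplace, names chosen here for illustration) and compares the default behaviour with immutable = FALSE:

```r
library(data.table)
library(dtplyr)
library(dplyr, warn.conflicts = FALSE)

dt_default <- data.table(x = 1:3)
dt_inplace <- data.table(x = 1:3)

# Default (immutable = TRUE): mutate() operates on a copy,
# so the input table should not gain the new column
res_default <- lazy_dt(dt_default) %>%
  mutate(y = x * 2) %>%
  as.data.table()
names(dt_default)

# immutable = FALSE: the generated := call is allowed to modify
# the input table by reference, avoiding the copy
res_inplace <- lazy_dt(dt_inplace, immutable = FALSE) %>%
  mutate(y = x * 2) %>%
  as.data.table()
names(dt_inplace)
```

Opting out of immutability trades dplyr's usual guarantee that inputs are left untouched for data.table's copy-free updates, so it is best reserved for tables you do not need to keep in their original form.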
Please note that the dtplyr project is released with a Contributor Code of Conduct. By contributing to this project, you agree to abide by its terms.