Ecosyste.ms: Awesome

An open API service indexing awesome lists of open source software.

Awesome Lists | Featured Topics | Projects

https://github.com/tidyverse/dtplyr

Data table backend for dplyr
https://github.com/tidyverse/dtplyr

datatable dplyr r

Last synced: about 2 months ago
JSON representation

Data table backend for dplyr

Awesome Lists containing this project

README

        

---
output: github_document
---

```{r, include = FALSE}
knitr::opts_chunk$set(
collapse = TRUE,
comment = "#>",
fig.path = "man/figures/README-",
out.width = "100%"
)
```

# dtplyr

[![CRAN status](https://www.r-pkg.org/badges/version/dtplyr)](https://cran.r-project.org/package=dtplyr)
[![R-CMD-check](https://github.com/tidyverse/dtplyr/workflows/R-CMD-check/badge.svg)](https://github.com/tidyverse/dtplyr/actions)
[![Codecov test coverage](https://codecov.io/gh/tidyverse/dtplyr/branch/main/graph/badge.svg)](https://app.codecov.io/gh/tidyverse/dtplyr?branch=main)

## Overview

dtplyr provides a [data.table](http://r-datatable.com/) backend for dplyr. The goal of dtplyr is to allow you to write dplyr code that is automatically translated to the equivalent, but usually much faster, data.table code.

See `vignette("translation")` for details of the current translations, and [table.express](https://github.com/asardaes/table.express) and [rqdatatable](https://github.com/WinVector/rqdatatable/) for related work.

## Installation

You can install from CRAN with:

```R
install.packages("dtplyr")
```

Or try the development version from GitHub with:

```R
# install.packages("pak")
pak::pak("tidyverse/dtplyr")
```

## Usage

To use dtplyr, you must at least load dtplyr and dplyr. You may also want to load [data.table](http://r-datatable.com/) so you can access the other goodies that it provides:

```{r setup}
library(data.table)
library(dtplyr)
library(dplyr, warn.conflicts = FALSE)
```

Then use `lazy_dt()` to create a "lazy" data table that tracks the operations performed on it.

```{r}
mtcars2 <- lazy_dt(mtcars)
```

You can preview the transformation (including the generated data.table code) by printing the result:

```{r}
mtcars2 %>%
filter(wt < 5) %>%
mutate(l100k = 235.21 / mpg) %>% # liters / 100 km
group_by(cyl) %>%
summarise(l100k = mean(l100k))
```

But generally you should reserve this only for debugging, and use `as.data.table()`, `as.data.frame()`, or `as_tibble()` to indicate that you're done with the transformation and want to access the results:

```{r}
mtcars2 %>%
filter(wt < 5) %>%
mutate(l100k = 235.21 / mpg) %>% # liters / 100 km
group_by(cyl) %>%
summarise(l100k = mean(l100k)) %>%
as_tibble()
```

## Why is dtplyr slower than data.table?

There are two primary reasons that dtplyr will always be somewhat slower than data.table:

* Each dplyr verb must do some work to convert dplyr syntax to data.table
syntax. This takes time proportional to the complexity of the input code,
not the input _data_, so should be a negligible overhead for large datasets.
[Initial benchmarks][benchmark] suggest that the overhead should be under
1ms per dplyr call.

* To match dplyr semantics, `mutate()` does not modify in place by default.
This means that most expressions involving `mutate()` must make a copy
that would not be necessary if you were using data.table directly.
(You can opt out of this behaviour in `lazy_dt()` with `immutable = FALSE`).

[benchmark]: https://dtplyr.tidyverse.org/articles/translation.html#performance

## Code of Conduct

Please note that the dtplyr project is released with a [Contributor Code of Conduct](https://dtplyr.tidyverse.org/CODE_OF_CONDUCT.html). By contributing to this project, you agree to abide by its terms.