https://github.com/TysonStanley/tidyfast
Fast and efficient alternatives to tidyr functions built on data.table #rdatatable #rstats
https://github.com/TysonStanley/tidyfast
Last synced: about 1 month ago
JSON representation
Fast and efficient alternatives to tidyr functions built on data.table #rdatatable #rstats
- Host: GitHub
- URL: https://github.com/TysonStanley/tidyfast
- Owner: TysonStanley
- Created: 2019-10-24T06:46:24.000Z (over 5 years ago)
- Default Branch: master
- Last Pushed: 2024-04-08T20:49:17.000Z (about 1 year ago)
- Last Synced: 2024-10-13T14:14:17.795Z (6 months ago)
- Language: C++
- Homepage: https://tysonbarrett.com/tidyfast/
- Size: 16 MB
- Stars: 187
- Watchers: 5
- Forks: 4
- Open Issues: 4
-
Metadata Files:
- Readme: README.Rmd
- Changelog: NEWS.md
- Code of conduct: .github/CODE_OF_CONDUCT.md
Awesome Lists containing this project
- jimsghstars - TysonStanley/tidyfast - Fast and efficient alternatives to tidyr functions built on data.table #rdatatable #rstats (C++)
README
---
output: github_document
editor_options:
chunk_output_type: console
---```{r, include = FALSE}
knitr::opts_chunk$set(
collapse = TRUE,
comment = "#>",
fig.path = "man/figures/README-",
out.width = "70%"
)
set.seed(843)
```# `tidyfast v0.4.0`
[](https://CRAN.R-project.org/package=tidyfast)
[](https://lifecycle.r-lib.org/articles/stages.html#maturing)
[](https://app.codecov.io/gh/TysonStanley/tidyfast?branch=master)

[](https://github.com/TysonStanley/tidyfast/actions/workflows/R-CMD-check.yaml)**Note: The expansion of `dtplyr` has made some of the functionality in `tidyfast` redundant. See `dtplyr` for a list of functions that are handled within that framework.**
The goal of `tidyfast` is to provide fast and efficient alternatives to some `tidyr` (and a few `dplyr`) functions using `data.table` under the hood. Each have the prefix of `dt_` to allow for autocomplete in IDEs such as RStudio. These should compliment some of the current functionality in `dtplyr` (but notably does not use the `lazy_dt()` framework of `dtplyr`). This package imports `data.table` and `cpp11` (no other dependencies).
These are, in essence, translations from a more `tidyverse` grammar to `data.table`. Most functions herein are in places where, in my opinion, the `data.table` syntax is not obvious or clear. As such, these functions can translate a simple function call into the fast, efficient, and concise syntax of `data.table`.
The current functions include:
**Nesting and unnesting** (similar to `dplyr::group_nest()` and `tidyr::unnest()`):
- `dt_nest()` for nesting data tables
- `dt_unnest()` for unnesting data tables
- `dt_hoist()` for unnesting vectors in a list-column in a data table**Pivoting** (similar to `tidyr::pivot_longer()` and `tidyr::pivot_wider()`)
- `dt_pivot_longer()` for fast pivoting using `data.table::melt()`
- `dt_pivot_wider()` for fast pivoting using `data.table::dcast()`**If Else** (similar to `dplyr::case_when()`):
- `dt_case_when()` for `dplyr::case_when()` syntax with the speed of `data.table::fifelse()`
**Fill** (similar to `tidyr::fill()`)
- `dt_fill()` for filling `NA` values with values before it, after it, or both. This can be done by a grouping variable (e.g. fill in `NA` values with values within an individual).
**Count** and **Uncount** (similar to `tidyr::uncount()` and `dplyr::count()`)
- `dt_count()` for fast counting by group(s)
- `dt_uncount()` for creating full data from a count table**Separate** (similar to `tidyr::separate()`)
- `dt_separate()` for splitting a single column into multiple based on a match within the column (e.g., column with values like "A.B" could be split into two columns by using the period as the separator where column 1 would have "A" and 2 would have "B"). It is built on `data.table::tstrsplit()`. This is not well tested yet and lacks some functionality of `tidyr::separate()`.
**Adjust `data.table` print options**
- `dt_print_options()` for adjusting the options for `print.data.table()`
## General API
`tidyfast` attempts to convert syntax from `tidyr` with its accompanying grammar to `data.table` function calls. As such, we have tried to maintain the `tidyr` syntax as closely as possible without hurting speed and efficiency. Some more advanced use cases in `tidyr` may not translate yet. We try to be transparent about the shortcomings in syntax and behavior where known.
Each function that takes data (labeled as `dt_` in the package docs) as its first argument automatically coerces it to a data table with `as.data.table()` if it isn't already a data table. Each of these functions will return a data table.
## Installation
You can install the stable version from CRAN with:
``` r
install.packages("tidyfast")
```or you can install the development version from [GitHub](https://github.com/) with:
``` r
# install.packages("remotes")
remotes::install_github("TysonStanley/tidyfast")
``````{r, echo=FALSE}
devtools::load_all(here::here())
```## Examples
The initial versions of the nesting and unnesting functions were shown in a [preprint](https://osf.io/preprints/psyarxiv/u8ekc/). Herein is shown some simple applications and the functions' speed/efficiency.
```{r, eval=FALSE}
library(tidyfast)
```### Nesting and Unnesting
The following data table will be used for the nesting/unnesting examples.
```{r, message = FALSE, warning = FALSE}
set.seed(84322)library(data.table)
library(dplyr) # to compare with case_when()
library(tidyr) # to compare with fill() and separate()
library(ggplot2) # figures
library(ggbeeswarm) # figuresdt <- data.table(
x = rnorm(1e5),
y = runif(1e5),
grp = sample(1L:5L, 1e5, replace = TRUE),
nested1 = lapply(1:10, sample, 10, replace = TRUE),
nested2 = lapply(c("thing1", "thing2"), sample, 10, replace = TRUE),
id = 1:1e5)
```To make all the comparisons herein more equal, we will set the number of threads that `data.table` will use to 1.
```{r}
setDTthreads(1)
```We can nest this data using `dt_nest()`:
```{r}
nested <- dt_nest(dt, grp)
nested
```We can also unnest this with `dt_unnest()`:
```{r}
dt_unnest(nested, col = data)
```When our list columns don't have data tables (as output from `dt_nest()`) we can use the `dt_hoist()` function, that will unnest vectors. It keeps all the other variables that are not list-columns as well.
```{r}
dt_hoist(dt, nested1, nested2)
```Speed comparisons (similar to those shown in the preprint) are highlighted below. Notably, the timings are without the `nested1` and `nested2` columns of the original `dt` object from above. Also, all `dplyr` and `tidyr` functions use a `tbl` version of the `dt` table.
```{r, echo = FALSE, fig.width=4, fig.height=8, dpi=300, warning=FALSE, message=FALSE}
tbl <- as_tibble(dt) %>% select(x, y, id, grp)
dt2 <- dt[, .(x,y,id,grp)]
nesting <- bench::mark(
nested1 <- dt_nest(dt2, grp),
group_nest(tbl, grp),
check = FALSE,
iterations = 50) %>%
mutate(expression = c("dt_nest", "group_nest"))
nested_tbl <- as_tibble(nested1)
unnesting <- bench::mark(
dt_unnest(nested1, data),
unnest(nested_tbl, data),
check = FALSE,
iterations = 50) %>%
mutate(expression = c("dt_unnest", "unnest"))nest_unnest <- bind_rows(nesting, unnesting)
theme_set(
theme_minimal() +
theme(panel.grid.minor = element_blank(),
panel.grid.major.y = element_line(linetype = "dashed"),
panel.grid.major.x = element_blank(),
legend.position = "none")
)library(ggplot2)
as.data.table(nest_unnest$time) %>%
setNames(c("dt_nest", "group_nest", "dt_unnest", "unnest")) %>%
dt_pivot_longer(cols = dt_nest:unnest, names_to = "expression", values_to = "time") %>%
.[, type := dt_case_when(stringr::str_detect(expression, "unnest") ~ "Unnesting",
TRUE ~ "Nesting")] %>%
.[, time := as.numeric(time)] %>%
ggplot(aes(expression, time, color = expression)) +
ggbeeswarm::geom_beeswarm(alpha = .6) +
labs(x = "",
y = "Time (seconds)") +
facet_wrap(type~., scales = "free", ncol = 1) +
scale_color_viridis_d(option = "plasma", end = .8) +
scale_y_log10() +
NULLselect(nesting, expression, median, mem_alloc)
select(unnesting, expression, median, mem_alloc)
```## Pivoting
Thanks to [@markfairbanks](https://github.com/markfairbanks), we now have pivoting translations to `data.table::melt()` and `data.table::dcast()`. Consider the following example (similar to the example in `tidyr::pivot_longer()` and `tidyr::pivot_wider()`):
```{r}
billboard <- tidyr::billboard# note the warning - melt is telling us what
# it did with the various data types---logical (where there were just NAs
# and numeric
longer <- billboard %>%
dt_pivot_longer(
cols = c(-artist, -track, -date.entered),
names_to = "week",
values_to = "rank"
)
longerwider <- longer %>%
dt_pivot_wider(
names_from = week,
values_from = rank
)
wider[, .(artist, track, wk1, wk2)]
```Notably, there are some current limitations to these: 1) `tidyselect` techniques do not work across the board (e.g. cannot use `start_with()` and friends) and 2) the functions are new and likely prone to edge-case bugs.
But let's compare some basic speed and efficiency. Because of the `data.table` functions, these are extremely fast and efficient.
```{r first_pivot, echo = FALSE, warning=FALSE, message=FALSE}
bill_dt <- as.data.table(billboard)
longer_timings <- bench::mark(
dt_pivot_longer = dt_pivot_longer(bill_dt, cols = c(-artist, -track, -date.entered),
names_to = "week", names_prefix = "wk", values_to = "rank"),
pivot_longer = pivot_longer(billboard, cols = c(-artist, -track, -date.entered),
names_to = "week", names_prefix = "wk", values_to = "rank"),
check = FALSE,
iterations = 40)
``````{r second_pivot, echo = FALSE, fig.width=5, fig.height=4, dpi=300, warning=FALSE, message=FALSE}
longer_tbl <- as_tibble(longer)
wider_timings <- bench::mark(
dt_pivot_wider = dt_pivot_wider(longer, names_from = week, values_from = rank),
pivot_wider = pivot_wider(longer_tbl, names_from = week, values_from = rank),
check = FALSE,
iterations = 40)
``````{r third_pivot, echo = FALSE, fig.width=5, fig.height=4, dpi=300, warning=FALSE, message=FALSE}
pivot_timings <- rbind(longer_timings, wider_timings) %>%
mutate(type = c("longer", "longer", "wider", "wider")) %>%
mutate(expression = as.character(expression))pivot_timings %>%
dt_hoist(time) %>%
mutate(time = lubridate::seconds(time)) %>%
filter(type == "longer") %>%
ggplot(aes(x = expression,
y = time,
color = expression)) +
ggbeeswarm::geom_beeswarm() +
labs(x = "",
y = "Time (seconds)") +
scale_color_viridis_d(option = "plasma", end = .8) +
scale_y_log10() +
facet_grid(~type, space = "free", scales = "free")pivot_timings %>%
dt_hoist(time) %>%
mutate(time = lubridate::seconds(time)) %>%
filter(type == "wider") %>%
ggplot(aes(x = expression,
y = time,
color = expression)) +
ggbeeswarm::geom_beeswarm() +
labs(x = "",
y = "Time (seconds)") +
scale_color_viridis_d(option = "plasma", end = .8) +
scale_y_log10() +
facet_grid(~type, space = "free", scales = "free")pivot_timings %>%
select(expression, median, mem_alloc)
```### If Else
Also, the new `dt_case_when()` function is built on the very fast `data.table::fiflese()` but has syntax like unto `dplyr::case_when()`. That is, it looks like:
```{r, eval = FALSE}
dt_case_when(condition1 ~ label1,
condition2 ~ label2,
...)
```To show that each method, `dt_case_when()`, `dplyr::case_when()`, and `data.table::fifelse()` produce the same result, consider the following example.
```{r}
x <- rnorm(1e6)medianx <- median(x)
x_cat <-
dt_case_when(x < medianx ~ "low",
x >= medianx ~ "high",
is.na(x) ~ "unknown")
x_cat_dplyr <-
case_when(x < medianx ~ "low",
x >= medianx ~ "high",
is.na(x) ~ "unknown")
x_cat_fif <-
fifelse(x < medianx, "low",
fifelse(x >= medianx, "high",
fifelse(is.na(x), "unknown", NA_character_)))identical(x_cat, x_cat_dplyr)
identical(x_cat, x_cat_fif)
```Notably, `dt_case_when()` is very fast and memory efficient, given it is built on `data.table::fifelse()`.
```{r, echo = FALSE, warning = FALSE, message = FALSE, fig.width=6, fig.height=5, dpi=300}
marks <-
bench::mark(dt_case_when(x < medianx ~ "low",
x >= medianx ~ "high",
is.na(x) ~ "unknown"),
case_when(x < medianx ~ "low",
x >= medianx ~ "high",
is.na(x) ~ "unknown"),
fifelse(x < medianx, "low",
fifelse(x >= medianx, "high",
fifelse(is.na(x), "unknown", NA_character_))),
iterations = 50)library(ggbeeswarm) # for the speed comparison plot
marks$time %>%
setNames(c("dt_case_when", "case_when", "fifelse")) %>%
data.frame() %>%
tidyr::gather() %>%
mutate(value = as.numeric(value)) %>%
ggplot(aes(key, value, color = key)) +
ggbeeswarm::geom_beeswarm() +
labs(x = "",
y = "Time (seconds)") +
scale_color_viridis_d(option = "plasma", end = .8) +
scale_y_log10()marks %>%
select(expression, median, mem_alloc) %>%
mutate(expression = c("dt_case_when", "case_when", "fifelse")) %>%
arrange(expression)
```## Fill
A new function is `dt_fill()`, which fulfills the role of `tidyr::fill()` to fill in `NA` values with values around it (either the value above, below, or trying both). This currently relies on the efficient `C++` code from `tidyr` (`fillUp()` and `fillDown()`).
```{r}
x = 1:10
dt_with_nas <- data.table(
x = x,
y = shift(x, 2L),
z = shift(x, -2L),
a = sample(c(rep(NA, 10), x), 10),
id = sample(1:3, 10, replace = TRUE)
)# Original
dt_with_nas# All defaults
dt_fill(dt_with_nas, y, z, a, immutable = FALSE)# by id variable called `grp`
dt_fill(dt_with_nas,
y, z, a,
id = list(id))# both down and then up filling by group
dt_fill(dt_with_nas,
y, z, a,
id = list(id),
.direction = "downup")
```In its current form, `dt_fill()` is faster than `tidyr::fill()` and uses slightly less memory. Below are the results of filling in the `NA`s within each `id` on a 19 MB data set.
```{r}
x = 1:1e6
dt3 <- data.table(
x = x,
y = shift(x, 10L),
z = shift(x, -10L),
a = sample(c(rep(NA, 10), x), 10),
id = sample(1:3, 10, replace = TRUE))
df3 <- data.frame(dt3)marks3 <-
bench::mark(
tidyr::fill(dplyr::group_by(df3, id), x, y),
tidyfast::dt_fill(dt3, x, y, id = list(id)),
check = FALSE,
iterations = 50
)
``````{r, echo = FALSE, fig.width=6, fig.height=5, dpi=300}
marks3$time %>%
setNames(c("fill", "dt_fill")) %>%
data.frame() %>%
tidyr::gather() %>%
mutate(value = as.numeric(value)) %>%
ggplot(aes(key, value, color = key)) +
ggbeeswarm::geom_beeswarm() +
labs(x = "",
y = "Time (seconds)") +
scale_color_viridis_d(end = .8) +
scale_y_log10()marks3 %>%
select(expression, median, mem_alloc)
```## Separate
The `dt_separate()` function is still under heavy development. Its behavior is similar to `tidyr::separate()` but is lacking some functionality currently. For example, `into` needs to be supplied the maximum number of possible columns to separate.
```{r, eval = FALSE}
dt_separate(data.table(col = "A.B.C"), col, into = c("A", "B"))
#> Error in `[.data.table`(dt, , eval(split_it)) :
#> Supplied 2 columns to be assigned 3 items. Please see NEWS for v1.12.2.
```For current functionality, consider the following example.
```{r}
dt_to_split <- data.table(
x = paste(letters, LETTERS, sep = ".")
)dt_separate(dt_to_split, x, into = c("lower", "upper"))
``````{r, echo = FALSE}
head(dt_separate(dt_to_split, x, into = c("lower", "upper")))
```Testing with a 4 MB data set with one variable that has columns of "A.B" repeatedly, shows that `dt_separate()` is fast and far more memory efficient compared to `tidyr::separate()`.
```{r, echo = FALSE, warning = FALSE}
dt4 <- data.table(
col = paste(rep("A", 5e5), rep("B", 5e5), sep = ".")
)
df4 <- data.frame(dt4)marks4 <-
bench::mark(
tidyr::separate(df4, col, into = c("first", "second"), sep = "\\."),
tidyfast::dt_separate(dt4, col, into = c("first", "second"), sep = ".", remove = FALSE),
tidyfast::dt_separate(dt4, col, into = c("first", "second"), sep = ".", immutable = FALSE, remove = FALSE),
check = FALSE,
iterations = 25
)
``````{r, echo = FALSE, fig.width=6, fig.height=5, dpi=300}
marks4$time %>%
setNames(c("separate", "dt_separate", "dt_separate-mutable")) %>%
data.frame() %>%
tidyr::gather() %>%
mutate(value = as.numeric(value)) %>%
ggplot(aes(key, value, color = key)) +
ggbeeswarm::geom_beeswarm() +
labs(x = "",
y = "Time (seconds)") +
scale_color_viridis_d(end = .8) +
scale_y_log10()marks4 %>%
mutate(expression = c("separate", "dt_separate", "dt_separate-mutable")) %>%
select(expression, median, mem_alloc)
```## Count and Uncount
The `dt_count()` function does essentially what `dplyr::count()` does. Notably, this, unlike the majority of other `dt_` functions, wraps a very simple statement in `data.table`. That is, `data.table` makes getting counts very simple and concise. Nonetheless, `dt_count()` fits the general API of `tidyfast`. To some degree, `dt_uncount()` is also a fairly simple wrapper, although the approach may not be as straightforward as that for `dt_count()`.
The following examples show how count and uncount can work. We'll use the `dt` data table from the nesting examples.
```{r}
counted <- dt_count(dt, grp)
counted
``````{r}
uncounted <- dt_uncount(counted, N)
uncounted[]
```These are also quick (not that the `tidyverse` functions were at all slow here).
```{r}
dt5 <- copy(dt)
df5 <- data.frame(dt5)marks5 <-
bench::mark(
counted_tbl <- dplyr::count(df5, grp),
counted_dt <- tidyfast::dt_count(dt5, grp),
tidyr::uncount(counted_tbl, n),
tidyfast::dt_uncount(counted_dt, N),
check = FALSE,
iterations = 25
)
``````{r, echo = FALSE, fig.width=6, fig.height=5, dpi=300}
marks5$time %>%
setNames(c("count", "dt_count", "uncount", "dt_uncount")) %>%
data.frame() %>%
tidyr::gather() %>%
mutate(value = as.numeric(value)) %>%
mutate(type = dt_case_when(stringr::str_detect(key, "uncount") ~ "Uncounting",
TRUE ~ "Counting")) %>%
ggplot(aes(key, value, color = key)) +
ggbeeswarm::geom_beeswarm() +
labs(x = "",
y = "Time (seconds)") +
scale_color_viridis_d(option = "plasma", end = .8) +
facet_grid(~type, space = "free", scales = "free") +
scale_y_log10()
```## Notes
Please note that the `tidyfast` project is released with a [Contributor Code of Conduct](https://github.com/TysonStanley/tidyfast/blob/master/.github/CODE_OF_CONDUCT.md). By contributing to this project, you agree to abide by its terms.
We want to thank our wonderful contributors:
- [markfairbanks](https://github.com/markfairbanks) for PR #6 providing initial the pivoting functions. Note the [`tidytable`](https://github.com/markfairbanks/tidytable) package that compliments some of `tidyfast`s functionality.
**Complementary Packages:**
- [`dtplyr`](https://dtplyr.tidyverse.org)
- [`tidytable`](https://github.com/markfairbanks/tidytable)
- [`maditr`](https://github.com/gdemin/maditr)