---
output: github_document
editor_options:
  chunk_output_type: console
---

```{r, include = FALSE}
knitr::opts_chunk$set(
collapse = TRUE,
comment = "#>",
fig.path = "man/figures/README-",
out.width = "70%"
)
set.seed(843)
```

# `tidyfast v0.4.0`

[![CRAN status](https://www.r-pkg.org/badges/version/tidyfast)](https://CRAN.R-project.org/package=tidyfast)
[![Lifecycle: maturing](https://img.shields.io/badge/lifecycle-maturing-blue.svg)](https://lifecycle.r-lib.org/articles/stages.html#maturing)
[![Codecov test coverage](https://codecov.io/gh/TysonStanley/tidyfast/branch/master/graph/badge.svg)](https://app.codecov.io/gh/TysonStanley/tidyfast?branch=master)
![Downloads](https://cranlogs.r-pkg.org/badges/grand-total/tidyfast)
[![R-CMD-check](https://github.com/TysonStanley/tidyfast/actions/workflows/R-CMD-check.yaml/badge.svg)](https://github.com/TysonStanley/tidyfast/actions/workflows/R-CMD-check.yaml)

**Note: The expansion of `dtplyr` has made some of the functionality in `tidyfast` redundant. See `dtplyr` for a list of functions that are handled within that framework.**

The goal of `tidyfast` is to provide fast and efficient alternatives to some `tidyr` (and a few `dplyr`) functions using `data.table` under the hood. Each function has the prefix `dt_` to allow for autocomplete in IDEs such as RStudio. These functions should complement some of the current functionality in `dtplyr` (but notably do not use the `lazy_dt()` framework of `dtplyr`). This package imports `data.table` and `cpp11` (no other dependencies).

These are, in essence, translations from a more `tidyverse` grammar to `data.table`. Most functions here cover cases where, in my opinion, the `data.table` syntax is not obvious or clear. As such, these functions can translate a simple function call into the fast, efficient, and concise syntax of `data.table`.

The current functions include:

**Nesting and unnesting** (similar to `dplyr::group_nest()` and `tidyr::unnest()`):

- `dt_nest()` for nesting data tables
- `dt_unnest()` for unnesting data tables
- `dt_hoist()` for unnesting vectors in a list-column in a data table

**Pivoting** (similar to `tidyr::pivot_longer()` and `tidyr::pivot_wider()`):

- `dt_pivot_longer()` for fast pivoting using `data.table::melt()`
- `dt_pivot_wider()` for fast pivoting using `data.table::dcast()`

**If Else** (similar to `dplyr::case_when()`):

- `dt_case_when()` for `dplyr::case_when()` syntax with the speed of `data.table::fifelse()`

**Fill** (similar to `tidyr::fill()`):

- `dt_fill()` for filling `NA` values with the values before them, after them, or both. This can be done by a grouping variable (e.g., fill in `NA` values with values within an individual).

**Count** and **Uncount** (similar to `dplyr::count()` and `tidyr::uncount()`):

- `dt_count()` for fast counting by group(s)
- `dt_uncount()` for creating full data from a count table

**Separate** (similar to `tidyr::separate()`):

- `dt_separate()` for splitting a single column into multiple based on a match within the column (e.g., column with values like "A.B" could be split into two columns by using the period as the separator where column 1 would have "A" and 2 would have "B"). It is built on `data.table::tstrsplit()`. This is not well tested yet and lacks some functionality of `tidyr::separate()`.

**Adjust `data.table` print options**

- `dt_print_options()` for adjusting the options for `print.data.table()`

## General API

`tidyfast` attempts to convert syntax from `tidyr` with its accompanying grammar to `data.table` function calls. As such, we have tried to maintain the `tidyr` syntax as closely as possible without hurting speed and efficiency. Some more advanced use cases in `tidyr` may not translate yet. We try to be transparent about the shortcomings in syntax and behavior where known.

Each function that takes data (labeled as `dt_` in the package docs) as its first argument automatically coerces it to a data table with `as.data.table()` if it isn't already a data table. Each of these functions will return a data table.
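The coercion step described above can be sketched roughly as follows. The helper `as_dt_sketch()` and the object `df_toy` are hypothetical names used only for illustration; this is not the package's actual internal code.

```r
library(data.table)

# Hypothetical helper, for illustration only (not part of tidyfast)
as_dt_sketch <- function(dt_) {
  # Convert only when the input is not already a data.table
  if (!is.data.table(dt_)) dt_ <- as.data.table(dt_)
  dt_
}

# A plain data.frame goes in, a data.table comes out
df_toy <- data.frame(a = 1:3, b = c("x", "y", "z"))
is.data.table(as_dt_sketch(df_toy))
```

The practical consequence is that a `data.frame` or tibble can be passed directly, but the result should be expected to be a data table rather than the input class.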

## Installation

You can install the stable version from CRAN with:

``` r
install.packages("tidyfast")
```

or you can install the development version from [GitHub](https://github.com/) with:

``` r
# install.packages("remotes")
remotes::install_github("TysonStanley/tidyfast")
```

```{r, echo=FALSE}
devtools::load_all(here::here())
```

## Examples

The initial versions of the nesting and unnesting functions were shown in a [preprint](https://osf.io/preprints/psyarxiv/u8ekc/). Shown here are some simple applications and the functions' speed and efficiency.

```{r, eval=FALSE}
library(tidyfast)
```

### Nesting and Unnesting

The following data table will be used for the nesting/unnesting examples.

```{r, message = FALSE, warning = FALSE}
set.seed(84322)

library(data.table)
library(dplyr) # to compare with case_when()
library(tidyr) # to compare with fill() and separate()
library(ggplot2) # figures
library(ggbeeswarm) # figures

dt <- data.table(
  x = rnorm(1e5),
  y = runif(1e5),
  grp = sample(1L:5L, 1e5, replace = TRUE),
  nested1 = lapply(1:10, sample, 10, replace = TRUE),
  nested2 = lapply(c("thing1", "thing2"), sample, 10, replace = TRUE),
  id = 1:1e5
)
```

To make all the comparisons herein more equal, we will set the number of threads that `data.table` will use to 1.

```{r}
setDTthreads(1)
```
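As a small sketch (the name `old_threads` is only illustrative), the setting can be checked with `getDTthreads()`. `setDTthreads()` invisibly returns the previous value, so it can be stored and restored after the comparisons.

```r
library(data.table)

# setDTthreads() invisibly returns the previous thread count
old_threads <- setDTthreads(1)

# Confirm that data.table is now limited to a single thread
getDTthreads()

# Restore the earlier setting once the comparisons are done
# setDTthreads(old_threads)
```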

We can nest this data using `dt_nest()`:

```{r}
nested <- dt_nest(dt, grp)
nested
```

We can also unnest this with `dt_unnest()`:

```{r}
dt_unnest(nested, col = data)
```
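For readers curious about the kind of `data.table` syntax these two functions stand in for, a minimal sketch on a small, made-up table is shown below. The `toy*` names are hypothetical, and the package's internals may differ in detail.

```r
library(data.table)

# Small illustrative table (made-up data, not the `dt` object above)
toy <- data.table(grp = c(1L, 1L, 2L, 2L, 2L), x = 1:5, y = letters[1:5])

# Nest: one row per group, with the remaining columns stored as a
# data.table inside the list-column `data`
toy_nested <- toy[, .(data = list(.SD)), by = grp]
toy_nested

# Unnest: bind each group's nested table back together by group
toy_unnested <- toy_nested[, rbindlist(data), by = grp]
toy_unnested
```

The `list(.SD)` idiom is concise but not obvious to newcomers, which is the sort of case the `dt_` wrappers are meant to cover.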

When our list-columns hold vectors rather than data tables (the latter being the output of `dt_nest()`), we can use the `dt_hoist()` function, which unnests vectors. It also keeps all the other variables that are not list-columns.

```{r}
dt_hoist(dt, nested1, nested2)
```
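A rough plain-`data.table` equivalent for a single list-column of vectors is sketched below, assuming a made-up table `hoist_toy`. It illustrates the general idea only and is not necessarily how `dt_hoist()` is implemented.

```r
library(data.table)

# Made-up table with one list-column holding vectors of unequal length
hoist_toy <- data.table(
  id = 1:2,
  vals = list(c(1, 2), c(3, 4, 5))
)

# Unlist the vectors within each id so that every element gets its own row
hoisted_toy <- hoist_toy[, .(vals = unlist(vals)), by = id]
hoisted_toy
```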

Speed comparisons (similar to those shown in the preprint) are highlighted below. Notably, the timings are without the `nested1` and `nested2` columns of the original `dt` object from above. Also, all `dplyr` and `tidyr` functions use a `tbl` version of the `dt` table.

```{r, echo = FALSE, fig.width=4, fig.height=8, dpi=300, warning=FALSE, message=FALSE}
tbl <- as_tibble(dt) %>% select(x, y, id, grp)
dt2 <- dt[, .(x,y,id,grp)]
nesting <- bench::mark(
nested1 <- dt_nest(dt2, grp),
group_nest(tbl, grp),
check = FALSE,
iterations = 50) %>%
mutate(expression = c("dt_nest", "group_nest"))
nested_tbl <- as_tibble(nested1)
unnesting <- bench::mark(
dt_unnest(nested1, data),
unnest(nested_tbl, data),
check = FALSE,
iterations = 50) %>%
mutate(expression = c("dt_unnest", "unnest"))

nest_unnest <- bind_rows(nesting, unnesting)

theme_set(
theme_minimal() +
theme(panel.grid.minor = element_blank(),
panel.grid.major.y = element_line(linetype = "dashed"),
panel.grid.major.x = element_blank(),
legend.position = "none")
)

library(ggplot2)
as.data.table(nest_unnest$time) %>%
setNames(c("dt_nest", "group_nest", "dt_unnest", "unnest")) %>%
dt_pivot_longer(cols = dt_nest:unnest, names_to = "expression", values_to = "time") %>%
.[, type := dt_case_when(stringr::str_detect(expression, "unnest") ~ "Unnesting",
TRUE ~ "Nesting")] %>%
.[, time := as.numeric(time)] %>%
ggplot(aes(expression, time, color = expression)) +
ggbeeswarm::geom_beeswarm(alpha = .6) +
labs(x = "",
y = "Time (seconds)") +
facet_wrap(type~., scales = "free", ncol = 1) +
scale_color_viridis_d(option = "plasma", end = .8) +
scale_y_log10() +
NULL

select(nesting, expression, median, mem_alloc)
select(unnesting, expression, median, mem_alloc)
```

### Pivoting

Thanks to [@markfairbanks](https://github.com/markfairbanks), we now have pivoting translations to `data.table::melt()` and `data.table::dcast()`. Consider the following example (similar to the example in `tidyr::pivot_longer()` and `tidyr::pivot_wider()`):

```{r}
billboard <- tidyr::billboard

# Note the warning: melt is telling us what it did with the
# various data types, logical (where there were just NAs)
# and numeric
longer <- billboard %>%
  dt_pivot_longer(
    cols = c(-artist, -track, -date.entered),
    names_to = "week",
    values_to = "rank"
  )
longer

wider <- longer %>%
  dt_pivot_wider(
    names_from = week,
    values_from = rank
  )
wider[, .(artist, track, wk1, wk2)]
wider[, .(artist, track, wk1, wk2)]
```

Notably, there are some current limitations to these: 1) `tidyselect` techniques do not work across the board (e.g., you cannot use `starts_with()` and friends) and 2) the functions are new and likely prone to edge-case bugs.

But let's compare some basic speed and efficiency. Because of the `data.table` functions, these are extremely fast and efficient.
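For reference, the underlying `data.table` calls look roughly like the sketch below. The table `wide_toy` and its values are made up, and the argument mapping (`names_to` to `variable.name`, `values_to` to `value.name`) is an illustrative assumption about how the translation could be expressed.

```r
library(data.table)

# Made-up wide table: one row per id, one column per week
wide_toy <- data.table(id = 1:2, wk1 = c(10, 20), wk2 = c(11, NA))

# Wide to long with melt(): week columns are stacked into name/value pairs
long_toy <- melt(
  wide_toy,
  id.vars = "id",
  variable.name = "week",
  value.name = "rank"
)
long_toy

# Long to wide with dcast(): the formula puts ids in rows and weeks in columns
back_toy <- dcast(long_toy, id ~ week, value.var = "rank")
back_toy
```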

```{r first_pivot, echo = FALSE, warning=FALSE, message=FALSE}
bill_dt <- as.data.table(billboard)
longer_timings <- bench::mark(
dt_pivot_longer = dt_pivot_longer(bill_dt, cols = c(-artist, -track, -date.entered),
names_to = "week", names_prefix = "wk", values_to = "rank"),
pivot_longer = pivot_longer(billboard, cols = c(-artist, -track, -date.entered),
names_to = "week", names_prefix = "wk", values_to = "rank"),
check = FALSE,
iterations = 40)
```

```{r second_pivot, echo = FALSE, fig.width=5, fig.height=4, dpi=300, warning=FALSE, message=FALSE}
longer_tbl <- as_tibble(longer)
wider_timings <- bench::mark(
dt_pivot_wider = dt_pivot_wider(longer, names_from = week, values_from = rank),
pivot_wider = pivot_wider(longer_tbl, names_from = week, values_from = rank),
check = FALSE,
iterations = 40)
```

```{r third_pivot, echo = FALSE, fig.width=5, fig.height=4, dpi=300, warning=FALSE, message=FALSE}
pivot_timings <- rbind(longer_timings, wider_timings) %>%
mutate(type = c("longer", "longer", "wider", "wider")) %>%
mutate(expression = as.character(expression))

pivot_timings %>%
dt_hoist(time) %>%
mutate(time = lubridate::seconds(time)) %>%
filter(type == "longer") %>%
ggplot(aes(x = expression,
y = time,
color = expression)) +
ggbeeswarm::geom_beeswarm() +
labs(x = "",
y = "Time (seconds)") +
scale_color_viridis_d(option = "plasma", end = .8) +
scale_y_log10() +
facet_grid(~type, space = "free", scales = "free")

pivot_timings %>%
dt_hoist(time) %>%
mutate(time = lubridate::seconds(time)) %>%
filter(type == "wider") %>%
ggplot(aes(x = expression,
y = time,
color = expression)) +
ggbeeswarm::geom_beeswarm() +
labs(x = "",
y = "Time (seconds)") +
scale_color_viridis_d(option = "plasma", end = .8) +
scale_y_log10() +
facet_grid(~type, space = "free", scales = "free")

pivot_timings %>%
select(expression, median, mem_alloc)
```

### If Else

Also, the new `dt_case_when()` function is built on the very fast `data.table::fifelse()` but has syntax like that of `dplyr::case_when()`. That is, it looks like:

```{r, eval = FALSE}
dt_case_when(condition1 ~ label1,
condition2 ~ label2,
...)
```

To show that the three methods, `dt_case_when()`, `dplyr::case_when()`, and `data.table::fifelse()`, all produce the same result, consider the following example.

```{r}
x <- rnorm(1e6)

medianx <- median(x)
x_cat <-
dt_case_when(x < medianx ~ "low",
x >= medianx ~ "high",
is.na(x) ~ "unknown")
x_cat_dplyr <-
case_when(x < medianx ~ "low",
x >= medianx ~ "high",
is.na(x) ~ "unknown")
x_cat_fif <-
fifelse(x < medianx, "low",
fifelse(x >= medianx, "high",
fifelse(is.na(x), "unknown", NA_character_)))

identical(x_cat, x_cat_dplyr)
identical(x_cat, x_cat_fif)
```

Notably, `dt_case_when()` is very fast and memory efficient, given it is built on `data.table::fifelse()`.

```{r, echo = FALSE, warning = FALSE, message = FALSE, fig.width=6, fig.height=5, dpi=300}
marks <-
bench::mark(dt_case_when(x < medianx ~ "low",
x >= medianx ~ "high",
is.na(x) ~ "unknown"),
case_when(x < medianx ~ "low",
x >= medianx ~ "high",
is.na(x) ~ "unknown"),
fifelse(x < medianx, "low",
fifelse(x >= medianx, "high",
fifelse(is.na(x), "unknown", NA_character_))),
iterations = 50)

library(ggbeeswarm) # for the speed comparison plot

marks$time %>%
setNames(c("dt_case_when", "case_when", "fifelse")) %>%
data.frame() %>%
tidyr::gather() %>%
mutate(value = as.numeric(value)) %>%
ggplot(aes(key, value, color = key)) +
ggbeeswarm::geom_beeswarm() +
labs(x = "",
y = "Time (seconds)") +
scale_color_viridis_d(option = "plasma", end = .8) +
scale_y_log10()

marks %>%
select(expression, median, mem_alloc) %>%
mutate(expression = c("dt_case_when", "case_when", "fifelse")) %>%
arrange(expression)
```

### Fill

A new function is `dt_fill()`, which fulfills the role of `tidyr::fill()` by filling in `NA` values with the values around them (either the value above, the value below, or trying both). This currently relies on the efficient `C++` code from `tidyr` (`fillUp()` and `fillDown()`).

```{r}
x = 1:10
dt_with_nas <- data.table(
x = x,
y = shift(x, 2L),
z = shift(x, -2L),
a = sample(c(rep(NA, 10), x), 10),
id = sample(1:3, 10, replace = TRUE)
)

# Original
dt_with_nas

# Default direction ("down"); `immutable = FALSE` modifies `dt_with_nas` in place
dt_fill(dt_with_nas, y, z, a, immutable = FALSE)

# by the grouping variable `id`
dt_fill(dt_with_nas,
y, z, a,
id = list(id))

# both down and then up filling by group
dt_fill(dt_with_nas,
y, z, a,
id = list(id),
.direction = "downup")
```
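For comparison only, a similar grouped fill can be sketched with `data.table::nafill()`. This is a different technique from the `tidyr`-derived `C++` code that `dt_fill()` relies on, it is shown here on numeric columns only, and the table `fill_toy` is made up.

```r
library(data.table)

# Made-up table with missing values inside two groups
fill_toy <- data.table(
  id = c(1L, 1L, 1L, 2L, 2L),
  y  = c(1, NA, NA, NA, 5)
)

# Fill down within each id (last observation carried forward)
fill_toy[, y_down := nafill(y, type = "locf"), by = id]

# Fill down and then up within each id
fill_toy[, y_downup := nafill(nafill(y, type = "locf"), type = "nocb"), by = id]
fill_toy
```

Note that a leading `NA` in a group stays `NA` after a down-only fill, which is why the second group only becomes complete with the down-then-up variant.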

In its current form, `dt_fill()` is faster than `tidyr::fill()` and uses slightly less memory. Below are the results of filling in the `NA`s within each `id` on a 19 MB data set.

```{r}
x = 1:1e6
dt3 <- data.table(
x = x,
y = shift(x, 10L),
z = shift(x, -10L),
a = sample(c(rep(NA, 10), x), 10),
id = sample(1:3, 10, replace = TRUE))
df3 <- data.frame(dt3)

marks3 <-
bench::mark(
tidyr::fill(dplyr::group_by(df3, id), x, y),
tidyfast::dt_fill(dt3, x, y, id = list(id)),
check = FALSE,
iterations = 50
)
```

```{r, echo = FALSE, fig.width=6, fig.height=5, dpi=300}
marks3$time %>%
setNames(c("fill", "dt_fill")) %>%
data.frame() %>%
tidyr::gather() %>%
mutate(value = as.numeric(value)) %>%
ggplot(aes(key, value, color = key)) +
ggbeeswarm::geom_beeswarm() +
labs(x = "",
y = "Time (seconds)") +
scale_color_viridis_d(end = .8) +
scale_y_log10()

marks3 %>%
select(expression, median, mem_alloc)
```

### Separate

The `dt_separate()` function is still under heavy development. Its behavior is similar to `tidyr::separate()` but is lacking some functionality currently. For example, `into` needs to be supplied the maximum number of possible columns to separate.

```{r, eval = FALSE}
dt_separate(data.table(col = "A.B.C"), col, into = c("A", "B"))
#> Error in `[.data.table`(dt, , eval(split_it)) :
#> Supplied 2 columns to be assigned 3 items. Please see NEWS for v1.12.2.
```

For current functionality, consider the following example.

```{r}
dt_to_split <- data.table(
x = paste(letters, LETTERS, sep = ".")
)

dt_separate(dt_to_split, x, into = c("lower", "upper"))
```

```{r, echo = FALSE}
head(dt_separate(dt_to_split, x, into = c("lower", "upper")))
```
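Since `dt_separate()` is built on `data.table::tstrsplit()`, the equivalent direct call can be sketched as below. The table `sep_toy` is made up, and treating the separator as a literal string via `fixed = TRUE` is an assumption made for this illustration.

```r
library(data.table)

# Made-up table with values of the form "lower.UPPER"
sep_toy <- data.table(x = c("a.A", "b.B", "c.C"))

# Split on a literal period and assign the pieces to two new columns
sep_toy[, c("lower", "upper") := tstrsplit(x, ".", fixed = TRUE)]
sep_toy
```

The number of names on the left-hand side has to match the number of pieces produced by the split, which mirrors the `into` limitation described above.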

Testing with a 4 MB data set containing a single column in which every value is "A.B" shows that `dt_separate()` is fast and far more memory efficient than `tidyr::separate()`.

```{r, echo = FALSE, warning = FALSE}
dt4 <- data.table(
col = paste(rep("A", 5e5), rep("B", 5e5), sep = ".")
)
df4 <- data.frame(dt4)

marks4 <-
bench::mark(
tidyr::separate(df4, col, into = c("first", "second"), sep = "\\."),
tidyfast::dt_separate(dt4, col, into = c("first", "second"), sep = ".", remove = FALSE),
tidyfast::dt_separate(dt4, col, into = c("first", "second"), sep = ".", immutable = FALSE, remove = FALSE),
check = FALSE,
iterations = 25
)
```

```{r, echo = FALSE, fig.width=6, fig.height=5, dpi=300}
marks4$time %>%
setNames(c("separate", "dt_separate", "dt_separate-mutable")) %>%
data.frame() %>%
tidyr::gather() %>%
mutate(value = as.numeric(value)) %>%
ggplot(aes(key, value, color = key)) +
ggbeeswarm::geom_beeswarm() +
labs(x = "",
y = "Time (seconds)") +
scale_color_viridis_d(end = .8) +
scale_y_log10()

marks4 %>%
mutate(expression = c("separate", "dt_separate", "dt_separate-mutable")) %>%
select(expression, median, mem_alloc)
```

### Count and Uncount

The `dt_count()` function does essentially what `dplyr::count()` does. Notably, this, unlike the majority of other `dt_` functions, wraps a very simple statement in `data.table`. That is, `data.table` makes getting counts very simple and concise. Nonetheless, `dt_count()` fits the general API of `tidyfast`. To some degree, `dt_uncount()` is also a fairly simple wrapper, although the approach may not be as straightforward as that for `dt_count()`.
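As a rough illustration of how concise the plain `data.table` statements can be, the sketch below counts and then uncounts a small made-up table. The `*_toy` names are hypothetical and the uncount approach shown is only one possible way to do it.

```r
library(data.table)

# Made-up table with a single grouping column
cnt_toy <- data.table(grp = c("a", "b", "a", "a", "b"))

# Count: .N gives the number of rows in each group
counted_toy <- cnt_toy[, .N, by = grp]
counted_toy

# Uncount: repeat each row index N times, then keep only the group column
uncounted_toy <- counted_toy[
  rep(seq_len(nrow(counted_toy)), times = counted_toy$N),
  .(grp)
]
uncounted_toy
```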

The following examples show how count and uncount can work. We'll use the `dt` data table from the nesting examples.

```{r}
counted <- dt_count(dt, grp)
counted
```

```{r}
uncounted <- dt_uncount(counted, N)
uncounted[]
```

These are also quick (not that the `tidyverse` functions were at all slow here).

```{r}
dt5 <- copy(dt)
df5 <- data.frame(dt5)

marks5 <-
bench::mark(
counted_tbl <- dplyr::count(df5, grp),
counted_dt <- tidyfast::dt_count(dt5, grp),
tidyr::uncount(counted_tbl, n),
tidyfast::dt_uncount(counted_dt, N),
check = FALSE,
iterations = 25
)
```

```{r, echo = FALSE, fig.width=6, fig.height=5, dpi=300}
marks5$time %>%
setNames(c("count", "dt_count", "uncount", "dt_uncount")) %>%
data.frame() %>%
tidyr::gather() %>%
mutate(value = as.numeric(value)) %>%
mutate(type = dt_case_when(stringr::str_detect(key, "uncount") ~ "Uncounting",
TRUE ~ "Counting")) %>%
ggplot(aes(key, value, color = key)) +
ggbeeswarm::geom_beeswarm() +
labs(x = "",
y = "Time (seconds)") +
scale_color_viridis_d(option = "plasma", end = .8) +
facet_grid(~type, space = "free", scales = "free") +
scale_y_log10()
```

## Notes

Please note that the `tidyfast` project is released with a [Contributor Code of Conduct](https://github.com/TysonStanley/tidyfast/blob/master/.github/CODE_OF_CONDUCT.md). By contributing to this project, you agree to abide by its terms.

We want to thank our wonderful contributors:

- [markfairbanks](https://github.com/markfairbanks) for PR #6, which provided the initial pivoting functions. Note the [`tidytable`](https://github.com/markfairbanks/tidytable) package, which complements some of `tidyfast`'s functionality.

**Complementary Packages:**

- [`dtplyr`](https://dtplyr.tidyverse.org)
- [`tidytable`](https://github.com/markfairbanks/tidytable)
- [`maditr`](https://github.com/gdemin/maditr)