https://github.com/TysonStanley/tidyfast

Fast and efficient alternatives to tidyr functions built on data.table #rdatatable #rstats
https://github.com/TysonStanley/tidyfast
Last synced: 4 months ago
JSON representation
Fast and efficient alternatives to tidyr functions built on data.table #rdatatable #rstats
Host: GitHub
URL: https://github.com/TysonStanley/tidyfast
Owner: TysonStanley
Created: 2019-10-24T06:46:24.000Z (over 5 years ago)
Default Branch: master
Last Pushed: 2024-04-08T20:49:17.000Z (over 1 year ago)
Last Synced: 2024-10-13T14:14:17.795Z (9 months ago)
Language: C++
Homepage: https://tysonbarrett.com/tidyfast/
Size: 16 MB
Stars: 187
Watchers: 5
Forks: 4
Open Issues: 4
Metadata Files:
- Readme: README.Rmd
- Changelog: NEWS.md
- Code of conduct: .github/CODE_OF_CONDUCT.md
Awesome Lists containing this project

jimsghstars - TysonStanley/tidyfast - Fast and efficient alternatives to tidyr functions built on data.table #rdatatable #rstats (C++)
README

        ---

output: github_document

editor_options: 

  chunk_output_type: console

---

```{r, include = FALSE}

knitr::opts_chunk$set(

  collapse = TRUE,

  comment = "#>",

  fig.path = "man/figures/README-",

  out.width = "70%"

)

set.seed(843)

```

# `tidyfast v0.4.0` 

[![CRAN status](https://www.r-pkg.org/badges/version/tidyfast)](https://CRAN.R-project.org/package=tidyfast)

[![Lifecycle: maturing](https://img.shields.io/badge/lifecycle-maturing-blue.svg)](https://lifecycle.r-lib.org/articles/stages.html#maturing)

[![Codecov test coverage](https://codecov.io/gh/TysonStanley/tidyfast/branch/master/graph/badge.svg)](https://app.codecov.io/gh/TysonStanley/tidyfast?branch=master)

![Downloads](https://cranlogs.r-pkg.org/badges/grand-total/tidyfast)

[![R-CMD-check](https://github.com/TysonStanley/tidyfast/actions/workflows/R-CMD-check.yaml/badge.svg)](https://github.com/TysonStanley/tidyfast/actions/workflows/R-CMD-check.yaml)

**Note: The expansion of `dtplyr` has made some of the functionality in `tidyfast` redundant. See `dtplyr` for a list of functions that are handled within that framework.**

The goal of `tidyfast` is to provide fast and efficient alternatives to some `tidyr` (and a few `dplyr`) functions using `data.table` under the hood. Each have the prefix of `dt_` to allow for autocomplete in IDEs such as RStudio. These should compliment some of the current functionality in `dtplyr` (but notably does not use the `lazy_dt()` framework of `dtplyr`). This package imports `data.table` and `cpp11` (no other dependencies).

These are, in essence, translations from a more `tidyverse` grammar to `data.table`. Most functions herein are in places where, in my opinion, the `data.table` syntax is not obvious or clear. As such, these functions can translate a simple function call into the fast, efficient, and concise syntax of `data.table`.

The current functions include:

**Nesting and unnesting** (similar to `dplyr::group_nest()` and `tidyr::unnest()`):

- `dt_nest()` for nesting data tables

- `dt_unnest()` for unnesting data tables

- `dt_hoist()` for unnesting vectors in a list-column in a data table

**Pivoting** (similar to `tidyr::pivot_longer()` and `tidyr::pivot_wider()`)

- `dt_pivot_longer()` for fast pivoting using `data.table::melt()`

- `dt_pivot_wider()` for fast pivoting using `data.table::dcast()`

**If Else** (similar to `dplyr::case_when()`):

- `dt_case_when()` for `dplyr::case_when()` syntax with the speed of `data.table::fifelse()`

**Fill** (similar to `tidyr::fill()`)

- `dt_fill()` for filling `NA` values with values before it, after it, or both. This can be done by a grouping variable (e.g. fill in `NA` values with values within an individual).

**Count** and **Uncount** (similar to `tidyr::uncount()` and `dplyr::count()`)

- `dt_count()` for fast counting by group(s)

- `dt_uncount()` for creating full data from a count table

**Separate** (similar to `tidyr::separate()`)

- `dt_separate()` for splitting a single column into multiple based on a match within the column (e.g., column with values like "A.B" could be split into two columns by using the period as the separator where column 1 would have "A" and 2 would have "B"). It is built on `data.table::tstrsplit()`. This is not well tested yet and lacks some functionality of `tidyr::separate()`.

**Adjust `data.table` print options**

- `dt_print_options()` for adjusting the options for `print.data.table()`

## General API

`tidyfast` attempts to convert syntax from `tidyr` with its accompanying grammar to `data.table` function calls. As such, we have tried to maintain the `tidyr` syntax as closely as possible without hurting speed and efficiency. Some more advanced use cases in `tidyr` may not translate yet. We try to be transparent about the shortcomings in syntax and behavior where known.

Each function that takes data (labeled as `dt_` in the package docs) as its first argument automatically coerces it to a data table with `as.data.table()` if it isn't already a data table. Each of these functions will return a data table.

## Installation

You can install the stable version from CRAN with:

``` r

install.packages("tidyfast")

```

or you can install the development version from [GitHub](https://github.com/) with:

``` r

# install.packages("remotes")

remotes::install_github("TysonStanley/tidyfast")

```

```{r, echo=FALSE}

devtools::load_all(here::here())

```

## Examples

The initial versions of the nesting and unnesting functions were shown in a [preprint](https://osf.io/preprints/psyarxiv/u8ekc/). Herein is shown some simple applications and the functions' speed/efficiency.

```{r, eval=FALSE}

library(tidyfast)

```

### Nesting and Unnesting

The following data table will be used for the nesting/unnesting examples.

```{r, message = FALSE, warning = FALSE}

set.seed(84322)

library(data.table)

library(dplyr)       # to compare with case_when()

library(tidyr)       # to compare with fill() and separate()

library(ggplot2)     # figures

library(ggbeeswarm)  # figures

dt <- data.table(

   x = rnorm(1e5),

   y = runif(1e5),

   grp = sample(1L:5L, 1e5, replace = TRUE),

   nested1 = lapply(1:10, sample, 10, replace = TRUE),

   nested2 = lapply(c("thing1", "thing2"), sample, 10, replace = TRUE),

   id = 1:1e5)

```

To make all the comparisons herein more equal, we will set the number of threads that `data.table` will use to 1.

```{r}

setDTthreads(1)

```

We can nest this data using `dt_nest()`:

```{r}

nested <- dt_nest(dt, grp)

nested

```

We can also unnest this with `dt_unnest()`:

```{r}

dt_unnest(nested, col = data)

```

When our list columns don't have data tables (as output from `dt_nest()`) we can use the `dt_hoist()` function, that will unnest vectors. It keeps all the other variables that are not list-columns as well.

```{r}

dt_hoist(dt, nested1, nested2)

```

Speed comparisons (similar to those shown in the preprint) are highlighted below. Notably, the timings are without the `nested1` and `nested2` columns of the original `dt` object from above. Also, all `dplyr` and `tidyr` functions use a `tbl` version of the `dt` table.

```{r, echo = FALSE, fig.width=4, fig.height=8, dpi=300, warning=FALSE, message=FALSE}

tbl <- as_tibble(dt) %>% select(x, y, id, grp)

dt2 <- dt[, .(x,y,id,grp)]

nesting <- bench::mark(

  nested1 <- dt_nest(dt2, grp),

  group_nest(tbl, grp),

  check = FALSE,

  iterations = 50) %>% 

  mutate(expression = c("dt_nest", "group_nest"))

nested_tbl <- as_tibble(nested1)

unnesting <- bench::mark(

  dt_unnest(nested1, data),

  unnest(nested_tbl, data),

  check = FALSE,

  iterations = 50) %>% 

  mutate(expression = c("dt_unnest", "unnest"))

nest_unnest <- bind_rows(nesting, unnesting)

theme_set(

  theme_minimal() +

  theme(panel.grid.minor = element_blank(),

        panel.grid.major.y = element_line(linetype = "dashed"),

        panel.grid.major.x = element_blank(),

        legend.position = "none")

)

library(ggplot2)

as.data.table(nest_unnest$time) %>% 

  setNames(c("dt_nest", "group_nest", "dt_unnest", "unnest")) %>% 

  dt_pivot_longer(cols = dt_nest:unnest, names_to = "expression", values_to = "time") %>% 

  .[, type := dt_case_when(stringr::str_detect(expression, "unnest") ~ "Unnesting",

                           TRUE ~ "Nesting")] %>% 

  .[, time := as.numeric(time)] %>% 

  ggplot(aes(expression, time, color = expression)) +

  ggbeeswarm::geom_beeswarm(alpha = .6) +

  labs(x = "",

       y = "Time (seconds)") +

  facet_wrap(type~.,  scales = "free", ncol = 1) +

  scale_color_viridis_d(option = "plasma", end = .8) +

  scale_y_log10() +

  NULL

select(nesting, expression, median, mem_alloc)

select(unnesting, expression, median, mem_alloc)

```

## Pivoting

Thanks to [@markfairbanks](https://github.com/markfairbanks), we now have pivoting translations to `data.table::melt()` and `data.table::dcast()`. Consider the following example (similar to the example in `tidyr::pivot_longer()` and `tidyr::pivot_wider()`):

```{r}

billboard <- tidyr::billboard

# note the warning - melt is telling us what 

#   it did with the various data types---logical (where there were just NAs

#   and numeric

longer <- billboard %>%

  dt_pivot_longer(

     cols = c(-artist, -track, -date.entered),

     names_to = "week",

     values_to = "rank"

  )

longer

wider <- longer %>% 

  dt_pivot_wider(

    names_from = week,

    values_from = rank

  )

wider[, .(artist, track, wk1, wk2)]

```

Notably, there are some current limitations to these: 1) `tidyselect` techniques do not work across the board (e.g. cannot use `start_with()` and friends) and 2) the functions are new and likely prone to edge-case bugs.

But let's compare some basic speed and efficiency. Because of the `data.table` functions, these are extremely fast and efficient.

```{r first_pivot, echo = FALSE, warning=FALSE, message=FALSE}

bill_dt <- as.data.table(billboard)

longer_timings <- bench::mark(

  dt_pivot_longer = dt_pivot_longer(bill_dt, cols = c(-artist, -track, -date.entered), 

                                    names_to = "week", names_prefix = "wk", values_to = "rank"),

  pivot_longer = pivot_longer(billboard, cols = c(-artist, -track, -date.entered), 

                              names_to = "week", names_prefix = "wk", values_to = "rank"),

  check = FALSE,

  iterations = 40)

```

```{r second_pivot, echo = FALSE, fig.width=5, fig.height=4, dpi=300, warning=FALSE, message=FALSE}

longer_tbl <- as_tibble(longer)

wider_timings <- bench::mark(

  dt_pivot_wider = dt_pivot_wider(longer, names_from = week, values_from = rank),

  pivot_wider = pivot_wider(longer_tbl, names_from = week, values_from = rank),

  check = FALSE,

  iterations = 40)

```

```{r third_pivot, echo = FALSE, fig.width=5, fig.height=4, dpi=300, warning=FALSE, message=FALSE}

pivot_timings <- rbind(longer_timings, wider_timings) %>% 

  mutate(type = c("longer", "longer", "wider", "wider")) %>% 

  mutate(expression = as.character(expression))

pivot_timings %>%

  dt_hoist(time) %>% 

  mutate(time = lubridate::seconds(time)) %>% 

  filter(type == "longer") %>% 

  ggplot(aes(x = expression, 

             y = time,

             color = expression)) +

  ggbeeswarm::geom_beeswarm() +

  labs(x = "",

       y = "Time (seconds)") +

  scale_color_viridis_d(option = "plasma", end = .8) +

  scale_y_log10() +

  facet_grid(~type, space = "free", scales = "free")

pivot_timings %>%

  dt_hoist(time) %>% 

  mutate(time = lubridate::seconds(time)) %>% 

  filter(type == "wider") %>% 

  ggplot(aes(x = expression, 

             y = time,

             color = expression)) +

  ggbeeswarm::geom_beeswarm() +

  labs(x = "",

       y = "Time (seconds)") +

  scale_color_viridis_d(option = "plasma", end = .8) +

  scale_y_log10() +

  facet_grid(~type, space = "free", scales = "free")

pivot_timings %>% 

  select(expression, median, mem_alloc)

```

### If Else

Also, the new `dt_case_when()` function is built on the very fast `data.table::fiflese()` but has syntax like unto `dplyr::case_when()`. That is, it looks like:

```{r, eval = FALSE}

dt_case_when(condition1 ~ label1,

             condition2 ~ label2,

             ...)

```

To show that each method, `dt_case_when()`, `dplyr::case_when()`, and `data.table::fifelse()` produce the same result, consider the following example. 

```{r}

x <- rnorm(1e6)

medianx <- median(x)

x_cat <-

  dt_case_when(x < medianx ~ "low",

               x >= medianx ~ "high",

               is.na(x) ~ "unknown")

x_cat_dplyr <-

  case_when(x < medianx ~ "low",

            x >= medianx ~ "high",

            is.na(x) ~ "unknown")

x_cat_fif <-

  fifelse(x < medianx, "low",

  fifelse(x >= medianx, "high",

  fifelse(is.na(x), "unknown", NA_character_)))

identical(x_cat, x_cat_dplyr)

identical(x_cat, x_cat_fif)

```

Notably, `dt_case_when()` is very fast and memory efficient, given it is built on `data.table::fifelse()`.

```{r, echo = FALSE, warning = FALSE, message = FALSE, fig.width=6, fig.height=5, dpi=300}

marks <-

  bench::mark(dt_case_when(x < medianx ~ "low",

                           x >= medianx ~ "high",

                           is.na(x) ~ "unknown"),

              case_when(x < medianx ~ "low",

                        x >= medianx ~ "high",

                        is.na(x) ~ "unknown"),

              fifelse(x < medianx, "low",

              fifelse(x >= medianx, "high",

              fifelse(is.na(x), "unknown", NA_character_))),

              iterations = 50)

library(ggbeeswarm)  # for the speed comparison plot

marks$time %>% 

  setNames(c("dt_case_when", "case_when", "fifelse")) %>% 

  data.frame() %>% 

  tidyr::gather() %>% 

  mutate(value = as.numeric(value)) %>% 

  ggplot(aes(key, value, color = key)) +

  ggbeeswarm::geom_beeswarm() +

  labs(x = "",

       y = "Time (seconds)") +

    scale_color_viridis_d(option = "plasma", end = .8) +

    scale_y_log10()

marks %>% 

  select(expression, median, mem_alloc) %>% 

  mutate(expression = c("dt_case_when", "case_when", "fifelse")) %>% 

  arrange(expression)

```

## Fill

A new function is `dt_fill()`, which fulfills the role of `tidyr::fill()` to fill in `NA` values with values around it (either the value above, below, or trying both). This currently relies on the efficient `C++` code from `tidyr` (`fillUp()` and `fillDown()`).

```{r}

x = 1:10

dt_with_nas <- data.table(

  x = x,

  y = shift(x, 2L),

  z = shift(x, -2L),

  a = sample(c(rep(NA, 10), x), 10),

  id = sample(1:3, 10, replace = TRUE)

)

# Original

dt_with_nas

# All defaults

dt_fill(dt_with_nas, y, z, a, immutable = FALSE)

# by id variable called `grp`

dt_fill(dt_with_nas, 

        y, z, a, 

        id = list(id))

# both down and then up filling by group

dt_fill(dt_with_nas, 

        y, z, a, 

        id = list(id), 

        .direction = "downup")

```

In its current form, `dt_fill()` is faster than `tidyr::fill()` and uses slightly less memory. Below are the results of filling in the `NA`s within each `id` on a 19 MB data set.

```{r}

x = 1:1e6

dt3 <- data.table(

  x = x,

  y = shift(x, 10L),

  z = shift(x, -10L),

  a = sample(c(rep(NA, 10), x), 10),

  id = sample(1:3, 10, replace = TRUE))

df3 <- data.frame(dt3)

marks3 <-

  bench::mark(

    tidyr::fill(dplyr::group_by(df3, id), x, y),

    tidyfast::dt_fill(dt3, x, y, id = list(id)),

    check = FALSE,

    iterations = 50

  )

```

```{r, echo = FALSE, fig.width=6, fig.height=5, dpi=300}

marks3$time %>% 

  setNames(c("fill", "dt_fill")) %>% 

  data.frame() %>% 

  tidyr::gather() %>% 

  mutate(value = as.numeric(value)) %>% 

  ggplot(aes(key, value, color = key)) +

  ggbeeswarm::geom_beeswarm() +

  labs(x = "",

       y = "Time (seconds)") +

  scale_color_viridis_d(end = .8) +

    scale_y_log10()

marks3 %>% 

  select(expression, median, mem_alloc)

```

## Separate

The `dt_separate()` function is still under heavy development. Its behavior is similar to `tidyr::separate()` but is lacking some functionality currently. For example, `into` needs to be supplied the maximum number of possible columns to separate.

```{r, eval = FALSE}

dt_separate(data.table(col = "A.B.C"), col, into = c("A", "B"))

#> Error in `[.data.table`(dt, , eval(split_it)) : 

#>   Supplied 2 columns to be assigned 3 items. Please see NEWS for v1.12.2.

```

For current functionality, consider the following example. 

```{r}

dt_to_split <- data.table(

  x = paste(letters, LETTERS, sep = ".")

)

dt_separate(dt_to_split, x, into = c("lower", "upper"))

```

```{r, echo = FALSE}

head(dt_separate(dt_to_split, x, into = c("lower", "upper")))

```

Testing with a 4 MB data set with one variable that has columns of "A.B" repeatedly, shows that `dt_separate()` is fast and far more memory efficient compared to `tidyr::separate()`.

```{r, echo = FALSE, warning = FALSE}

dt4 <- data.table(

  col = paste(rep("A", 5e5), rep("B", 5e5), sep = ".")

)

df4 <- data.frame(dt4)

marks4 <-

  bench::mark(

    tidyr::separate(df4, col, into = c("first", "second"), sep = "\\."),

    tidyfast::dt_separate(dt4, col, into = c("first", "second"), sep = ".", remove = FALSE),

    tidyfast::dt_separate(dt4, col, into = c("first", "second"), sep = ".", immutable = FALSE, remove = FALSE),

    check = FALSE,

    iterations = 25

  )

```

```{r, echo = FALSE, fig.width=6, fig.height=5, dpi=300}

marks4$time %>% 

  setNames(c("separate", "dt_separate", "dt_separate-mutable")) %>% 

  data.frame() %>% 

  tidyr::gather() %>% 

  mutate(value = as.numeric(value)) %>% 

  ggplot(aes(key, value, color = key)) +

  ggbeeswarm::geom_beeswarm() +

  labs(x = "",

       y = "Time (seconds)") +

  scale_color_viridis_d(end = .8) +

  scale_y_log10()

marks4 %>% 

  mutate(expression = c("separate", "dt_separate", "dt_separate-mutable")) %>% 

  select(expression, median, mem_alloc)

```

## Count and Uncount

The `dt_count()` function does essentially what `dplyr::count()` does. Notably, this, unlike the majority of other `dt_` functions, wraps a very simple statement in `data.table`. That is, `data.table` makes getting counts very simple and concise. Nonetheless, `dt_count()` fits the general API of `tidyfast`. To some degree, `dt_uncount()` is also a fairly simple wrapper, although the approach may not be as straightforward as that for `dt_count()`.

The following examples show how count and uncount can work. We'll use the `dt` data table from the nesting examples.

```{r}

counted <- dt_count(dt, grp)

counted

```

```{r}

uncounted <- dt_uncount(counted, N)

uncounted[]

```

These are also quick (not that the `tidyverse` functions were at all slow here).

```{r}

dt5 <- copy(dt)

df5 <- data.frame(dt5)

marks5 <-

  bench::mark(

    counted_tbl <- dplyr::count(df5, grp),

    counted_dt <- tidyfast::dt_count(dt5, grp),

    tidyr::uncount(counted_tbl, n),

    tidyfast::dt_uncount(counted_dt, N),

    check = FALSE,

    iterations = 25

  )

```

```{r, echo = FALSE, fig.width=6, fig.height=5, dpi=300}

marks5$time %>% 

  setNames(c("count", "dt_count", "uncount", "dt_uncount")) %>% 

  data.frame() %>% 

  tidyr::gather() %>% 

  mutate(value = as.numeric(value)) %>% 

  mutate(type = dt_case_when(stringr::str_detect(key, "uncount") ~ "Uncounting",

                             TRUE ~ "Counting")) %>% 

  ggplot(aes(key, value, color = key)) +

  ggbeeswarm::geom_beeswarm() +

  labs(x = "",

       y = "Time (seconds)") +

  scale_color_viridis_d(option = "plasma", end = .8) +

  facet_grid(~type, space = "free", scales = "free") +

  scale_y_log10()

```

## Notes

Please note that the `tidyfast` project is released with a [Contributor Code of Conduct](https://github.com/TysonStanley/tidyfast/blob/master/.github/CODE_OF_CONDUCT.md). By contributing to this project, you agree to abide by its terms.

We want to thank our wonderful contributors:

- [markfairbanks](https://github.com/markfairbanks) for PR #6 providing initial the pivoting functions. Note the [`tidytable`](https://github.com/markfairbanks/tidytable) package that compliments some of `tidyfast`s functionality.

**Complementary Packages:**

- [`dtplyr`](https://dtplyr.tidyverse.org)

- [`tidytable`](https://github.com/markfairbanks/tidytable)

- [`maditr`](https://github.com/gdemin/maditr)
ecosyste.ms

Data

Tools

Indexes

Applications

Experiments

Awesome

https://github.com/TysonStanley/tidyfast

Awesome Lists containing this project

README