Ecosyste.ms: Awesome

An open API service indexing awesome lists of open source software.

Awesome Lists | Featured Topics | Projects

https://github.com/TimTeaFan/dplyover

Create columns by applying functions to vectors and/or columns in 'dplyr'.
https://github.com/TimTeaFan/dplyover

dplyr r

Last synced: about 1 month ago
JSON representation

Create columns by applying functions to vectors and/or columns in 'dplyr'.

Awesome Lists containing this project

README

        

---
output: github_document
---

```{r, child = "man/rmd/setup.Rmd"}
```

# dplyover

![Release status](https://img.shields.io/badge/status-first%20release-yellow)
[![Lifecycle](man/figures/lifecycle-experimental.svg)](man/figures/lifecycle-experimental.svg)
[![R-CMD-check](https://github.com/TimTeaFan/dplyover/workflows/R-CMD-check/badge.svg)](https://github.com/TimTeaFan/dplyover/actions)
[![Codecov test coverage](https://codecov.io/gh/TimTeaFan/dplyover/branch/main/graph/badge.svg)](https://codecov.io/gh/TimTeaFan/dplyover?branch=main)
[![CodeFactor](https://www.codefactor.io/repository/github/timteafan/dplyover/badge)](https://www.codefactor.io/repository/github/timteafan/dplyover)
[![CRAN status](https://www.r-pkg.org/badges/version/dplyover)](https://cran.r-project.org/package=dplyover)

## Overview

dplyover logo

{dplyover} extends {dplyr}'s functionality by building a function family
around `dplyr::across()`.

The goal of this *over-across function family* is to provide a concise and
uniform syntax which can be used to create columns by applying functions to
vectors and/or sets of columns in {dplyr}. Ideally, this will:

- **reduce the amount of code** to create variables derived from existing colums,
which is especially helpful when doing exploratory data analysis (e.g. lagging,
collapsing, recoding etc. many variables in a similar way).
- **provide a clean {dplyr} approach** to create many variables which are
calculated based on two or more variables.
- **improve our mental model** so that it is easier to tackle problems where the
solution is based on creating new columns.

The functions in the *over-apply function family* create columns by applying
one or several functions to:

- `dplyr::across()` a set of columns (not part of dplyover)
- `over()` a vector (list or atomic vector)
- `over2()` two vectors of the same length (sequentially^#^)
- `over2x()` two vectors (nested^+^)
- `across2()` two sets of columns (sequentially^#^)
- `across2x()` two sets of columns (nested^+^)
- `crossover()` a set of columns and a vector (nested^+^)

# "sequentially" means that the function is sequentially applied to the
first two elements of `x[[1]]` and `y[[1]]`, then to the second pair of elements
and so on.


+ "nested" means that the function is applied to all combinations
between elements in `x` and `y` similar to a nested loop.

## Installation

{dplyover} is not on CRAN. You can install the latest version from
[GitHub](https://github.com/) with:

```{r, eval = FALSE}
# install.packages("remotes")
remotes::install_github("TimTeaFan/dplyover")
```

## Getting started

Below are a few examples of the {dplyover}'s *over-across function family*. More
functions and workarounds of how to tackle the problems below without {dplyover}
can be found in the vignette "Why dplyover?".

```{r, setup, warning = FALSE, message = FALSE}
# dplyover is an extention of dplyr on won't work without it
library(dplyr)
library(dplyover)

# For better printing:
iris <- as_tibble(iris)
```

#### Apply functions to a vector

`over()` applies one or several functions to a vector. We can use it inside
`dplyr::mutate()` to create several similar variables that we derive from an
existing column. This is helpful in cases where we want to create a batch of
similar variables with only slightly changes in the argument values of the
calling function. A good example are `lag` and `lead` variables. Below we use
column 'a' to create lag and lead variables by `1`, `2` and `3` positions.
`over()`'s `.names` argument lets us put nice names on the output columns.

```{r}
tibble(a = 1:25) %>%
mutate(over(c(1:3),
list(lag = ~ lag(a, .x),
lead = ~ lead(a, .x)),
.names = "a_{fn}{x}"))
```

#### Apply functions to a set of columns and a vector simultaniously

`crossover()` applies the functions in `.fns` to every combination of colums in
`.xcols` with elements in `.y`. This is similar to the example above, but this time,
we use a set of columns. Below we create five lagged variables for each
'Sepal.Length' and 'Sepal.Width'. Again, we use a named list as argument in `.fns`
to create nice names by specifying the glue syntax in `.names.`

```{r}
iris %>%
transmute(
crossover(starts_with("sepal"),
1:5,
list(lag = ~ lag(.x, .y)),
.names = "{xcol}_{fn}{y}")) %>%
glimpse
```

#### Apply functions to a set of variable pairs

`across2()` can be used to transform pairs of variables in one or more functions.
In the example below we want to calculate the product and the sum of all pairs
of 'Length' and 'Width' variables in the `iris` data set. We can use `{pre}` in
the glue specification in `.names` to extract the common prefix of each pair of
variables. We can further transform the names, in the example setting them
`tolower`, by specifying the `.names_fn` argument:

```{r}
iris %>%
transmute(across2(ends_with("Length"),
ends_with("Width"),
.fns = list(product = ~ .x * .y,
sum = ~ .x + .y),
.names = "{pre}_{fn}",
.names_fn = tolower))
```

## Performance and Compability

This is an experimental package which I started developing with my own use cases
in mind. I tried to keep the effort low, which is why this package *does not*
internalize (read: copy) internal {dplyr} functions (especially the 'context
internals'). This made it relatively easy to develop the package without:

1. copying tons of {dplyr} code,
1. having to figure out which dplyr-functions use the copied internals and
1. finally overwritting these functions (like `mutate` and other one-table verbs),
which would eventually lead to conflicts with other add-on packages, like for
example {tidylog}.

However, the downside is that not relying on {dplyr} internals has some negative
effects in terms of performance and compability.

In a nutshell this means:

- The *over-across function family* in {dplyover} is slower than the
original `dplyr::across`. Up until {dplyr} 1.0.3 the overhead was not too big,
but `dplyr::across` got much faster with {dplyr} 1.0.4 which is why the gap has
widend a lot.
- Although {dplyover} is designed to work in {dplyr}, some features and
edge cases will not work correctly.

The good news is that even without relying on {dplyr} internals most of the
original functionality can be replicated and although being less performant,
the current setup is optimized and falls not too far behind in terms of speed -
at least when compared to the pre v1.0.4 `dplyr::across`.

Regarding compability, I have spent quite some time testing the package and
I was able to replicate most of the tests for `dplyr::across` successfully.

For more information on the performance and compability of {dplyover} see the
vignette "Performance and Compability".

## History

I originally opened a
[feature request on GitHub](https://github.com/tidyverse/dplyr/issues/4834) to
include a very special case version of `over` (or to that time `mutate_over`)
into {dplyr}. The adivse then was to make this kind of functionality available
in a separate package. While I was working on this very special case version of
`over`, I realized that the more general use case resembles a `purrr::map`
function for inside {dplyr} verbs with different variants, which led me to the
*over-across function family*.

## Acknowledgements and Disclaimer

This package is not only an extention of {dplyr}. The main functions in
{dplyover} are directly derived and based on `dplyr::across()` (dplyr's license
and copyrights apply!). So if this package is working correctly, all the credit
should go to the dplyr team.

My own "contribution" (if you want to call it like that) merely consists of:

1. removing the dependencies on {dplyr}'s internal functions, and
2. slightly changing `across`' logic to make it work for vectors and a
combination of two vectors and/or sets of columns.

By this I most definitely introduced some bugs and edge cases which won't work,
and in which case I am the only one to blame.