https://github.com/mjskay/by_row

A proposal for an alternative to rowwise() and mutate() + map()
https://github.com/mjskay/by_row

Last synced: 4 months ago
JSON representation

A proposal for an alternative to rowwise() and mutate() + map()

Host: GitHub
URL: https://github.com/mjskay/by_row
Owner: mjskay
Created: 2021-03-03T22:15:07.000Z (over 4 years ago)
Default Branch: main
Last Pushed: 2021-03-04T00:11:07.000Z (over 4 years ago)
Last Synced: 2025-01-11T00:44:55.870Z (5 months ago)
Size: 2.93 KB
Stars: 9
Watchers: 3
Forks: 1
Open Issues: 0
Metadata Files:
- Readme: README.Rmd

Awesome Lists containing this project

jimsghstars - mjskay/by_row - A proposal for an alternative to rowwise() and mutate() + map() (Others)

README

        ---

title: "by_row(): An alternative to map() or rowwise()"

author: "Matthew Kay"

date: "3/3/2021"

output: github_document

---

This is a short proposal for a new alternative to `rowwise()` or `map()` for rowwise operations on data frames inside mutate.

```{r message=FALSE}

library(tidyverse)

library(rlang)

```

It's useful to be able to perform rowwise operations on data frames. As a trivial

example (obviously these shouldn't be list columns in practice, but ignore that for now), this fails:

```{r error=TRUE}

tibble(a = list(10, 11), b = list(3, 4)) %>%

  mutate(c = a + 2 * b)

```

One way to do rowwise operations is to use `lapply()` or `map()` inside `mutate()`, but

this requires you to pass through all the variables you want to use in an awkward way:

```{r}

tibble(a = list(10, 11), b = list(3, 4)) %>%

  mutate(c = map2_dbl(a, b, ~ .x + 2 * .y))

```

With `rowwise()` we can avoid managing all these variables:

```{r}

tibble(a = list(10, 11), b = list(3, 4)) %>%

  rowwise() %>%

  mutate(c = a + 2 * b)

```

But `rowwise()` has significant drawbacks. It changes the meaning of 

`mutate()` in a way that I have found confuses students---and, frankly, 

experts (if I can be called an expert). If you don't know the data frame 

is in a `rowwise()` state, you can write `mutate()` calls that will raise

errors that don't make sense. Or if you insert a `rowwise()` call into

a pipe for a `mutate()` and forget to ungroup the data frame afterwards,

it can break existing `mutate()` calls downstream. 

The root of the problem, I think, is that `rowwise()` forces people to keep track

of state in a way that humans aren't great at: we have to remember if a data frame

is in "rowwise" or "not rowwise" state for every `mutate()` call, and the

call that puts you in `rowwise()` mode can be arbitrarily far from the corresponding

mutate call. In human-computer interaction we

call errors due to thinking a system is in a different state a [mode error](https://en.wikipedia.org/wiki/Mode_(user_interface)#Mode_errors); it's

the reason why people hate the CapsLock key---we're not good at keeping track

of modes!

I think it would be better to explicitly "mark out" expressions intended to be

executed in a row-wise manner (similar to how `map()` works), but not have to

deal with managing variables (similar to how `rowwise()` works). The best

of both worlds!

My modest proposal is a function like this (which no doubt needs work to handle

some NSE corner cases I haven't thought of):

```{r}

by_row = function(expr) {

  .expr = enquo(expr)

  .data = cur_data_all()

  .env = parent.frame()

  simplify(lapply(seq_len(nrow(.data)), function(i) {

    eval_tidy(.expr, data = lapply(.data[i,], `[[`, 1), env = .env)

  }))

}

```

Which can be used like this:

```{r}

tibble(a = list(10, 11), b = list(3, 4)) %>%

  mutate(c = by_row(a + 2 * b))

```

The other bonus here is that this can be intermixed with non-rowwise calls in the

same `mutate()` call. This is important, because *rowwise operations are slow and you should

try to avoid them at all costs*! Putting the entire dataframe into a rowwise mode

to perform these operations encourages people to do other operations rowwise that

they don't need to do.

ecosyste.ms

Data

Tools

Indexes

Applications

Experiments

Awesome

https://github.com/mjskay/by_row

Awesome Lists containing this project

README