An open API service indexing awesome lists of open source software.

https://github.com/moodymudskipper/powerjoin

Extensions of 'dplyr' and 'fuzzyjoin' Join Functions
https://github.com/moodymudskipper/powerjoin

Last synced: 11 days ago
JSON representation

Extensions of 'dplyr' and 'fuzzyjoin' Join Functions

Awesome Lists containing this project

README

        

---
output: github_document
---

```{r, include = FALSE}
knitr::opts_chunk$set(
collapse = TRUE,
comment = "#>",
fig.path = "man/figures/README-",
out.width = "100%",
tidy.opts = list(blank = FALSE)
)
options(tidyverse.quiet = TRUE)
```

# powerjoin

{powerjoin} extends {dplyr}'s join functions.

* Make your joins safer with the `check` argument and the `check_specs()`function
* Deal with conflicting column names by combining, coalescing them etc using the `conflict` argument
* Preprocess input, for instance to select columns to join without having to repeat
key columns in the selection
* Do painless fuzzy joins thanks to a generalized `by` argument accepting formulas
* Fill unmatched values using the `fill` argument
* Operate recursive joins by providing lists of data frames to `x` and `y`
* Keep or drop key columns with more flexibility thanks to an enhanced `keep`argument

## Installation

Install CRAN version with:
``` r
install.packages("powerjoin")
```

Or development version with:

``` r
remotes::install_github("moodymudskipper/powerjoin")
```

## Now let's match penguins

```{r}
library(powerjoin)
library(tidyverse)

# toy dataset built from Allison Horst's {palmerpenguins} package and
# Hadley Wickham's {babynames}

male_penguins <- tribble(
~name, ~species, ~island, ~flipper_length_mm, ~body_mass_g,
"Giordan", "Gentoo", "Biscoe", 222L, 5250L,
"Lynden", "Adelie", "Torgersen", 190L, 3900L,
"Reiner", "Adelie", "Dream", 185L, 3650L
)

female_penguins <- tribble(
~name, ~species, ~island, ~flipper_length_mm, ~body_mass_g,
"Alonda", "Gentoo", "Biscoe", 211, 4500L,
"Ola", "Adelie", "Dream", 190, 3600L,
"Mishayla", "Gentoo", "Biscoe", 215, 4750L,
)
```

## Safer joins

The `check` argument receives an object created by the `check_specs()` function,
which provides ways to handle specific input properties, its arguments
can be :

* `"ignore"` : stay silent (default except for `implicit_keys`)
* `"inform"`
* `"warn"`
* `"abort"`

We can print these defaults :

```{r}
check_specs()
```

By default it works like {dplyr}, informing in case of implicit keys, and no
further checks :

```{r, error = TRUE}
power_inner_join(
male_penguins[c("species", "island")],
female_penguins[c("species", "island")]
)
```

We can silence the implicit key detection and check that we have unique keys in
the right table

```{r}
check_specs(implicit_keys = "ignore", duplicate_keys_right = "abort")
```

```{r, error = TRUE}
power_inner_join(
male_penguins[c("species", "island")],
female_penguins[c("species", "island")],
check = check_specs(implicit_keys = "ignore", duplicate_keys_right = "abort")
)
```

The `column_conflict` argument guarantees that you won't have columns renamed without you
knowing, you might need it most of the time, we could setup some development and
production specs for our most common joins:

```{r}
dev_specs <- check_specs(
column_conflict = "abort",
inconsistent_factor_levels = "inform",
inconsistent_type = "inform"
)

prod_specs <- check_specs(
column_conflict = "abort",
implicit_keys = "abort"
)
```

This will save some typing :

```{r, error = TRUE, eval = FALSE}
power_inner_join(
male_penguins,
female_penguins,
by = c("species", "island"),
check = dev_specs
)
#> Error: The following columns are conflicted and their conflicts are not handled:
#> 'name', 'flipper_length_mm', 'body_mass_g'
```

## Handle column conflict

We saw above how to fail when encountering column conflict, here we show how to
handle it.

To resolve conflicts between identically named join columns, set the `conflict`
argument to a 2 argument function (or formula) that will take as arguments the 2 conflicting
joined columns after the join.

```{r}
df1 <- tibble(id = 1:3, value = c(10, NA, 30))
df2 <- tibble(id = 2:4, value = c(22, 32, 42))

power_left_join(df1, df2, by = "id", conflict = `+`)
```

Coalescing is the most common use case and we provide the functions `coalesce_xy()`
and `coalesce_yx()` to ease this task (both wrapped around `dplyr::coalesce()`).

```{r}
power_left_join(df1, df2, by = "id", conflict = coalesce_xy)

power_left_join(df1, df2, by = "id", conflict = coalesce_yx)
```

Note that the function is operating on vectors by default, not rowwise, however
we can make it work rowwise by using `rw` in the lhs of the formula.

```{r}
power_left_join(df1, df2, by = "id", conflict = ~ sum(.x, .y, na.rm = TRUE))

power_left_join(df1, df2, by = "id", conflict = rw ~ sum(.x, .y, na.rm = TRUE))
```

If you need finer control, `conflict` can also be a named list of such functions,
formulas or special values, each to be applied on the relevant pair of conflicted
columns.

## Preprocess inputs

Traditionally key columns need to be repeated when preprocessing inputs
before a join, which is an annoyance and an opportunity for mistakes.
With {powerjoin} we can do :

```{r}
power_inner_join(
male_penguins %>% select_keys_and(name),
female_penguins %>% select_keys_and(female_name = name),
by = c("species", "island")
)
```

For semi joins, just omit arguments to `select_keys_and()`:

```{r}
power_inner_join(
male_penguins,
female_penguins %>% select_keys_and(),
by = c("species", "island")
)
```

We could also aggregate on keys before the join, without the need for any
`group_by()`/`ungroup()` gymnastics :

```{r}
power_left_join(
male_penguins %>% summarize_by_keys(male_weight = mean(body_mass_g)),
female_penguins %>% summarize_by_keys(female_weight = mean(body_mass_g)),
by = c("species", "island")
)
```

`pack_along_keys()` packs given columns, or all non key columns by default, into
a data frame column named by the `name` argument, it's useful to namespace the
data and avoid conflicts

```{r}
power_left_join(
male_penguins %>% pack_along_keys(name = "m"),
female_penguins %>% pack_along_keys(name = "f"),
by = c("species", "island")
)
```

We have more of these, all variants of tidyverse functions :

* `nest_by_keys()` nests given columns, or all by default, if `name` is given
a single list column of data frames is created
* `complete_keys()` expands the key columns, so all combinations are present,
filling the rest of the new rows with `NA`s. Absent factor levels are expanded
as well.

These functions do not modify the data but add an attribute that will be processed
by the join function later on, so no function should be used on top of them.

## Fuzzy joins

To do fuzzy joins we use formulas in the `by` argument, in this formula we use,
`.x` and `.y` to describe the left and right tables. This is very flexible
but can be costly since a cartesian product is computed.

```{r}
power_inner_join(
male_penguins %>% select_keys_and(male_name = name),
female_penguins %>% select_keys_and(female_name = name),
by = c(~.x$flipper_length_mm < .y$flipper_length_mm, ~.x$body_mass_g > .y$body_mass_g)
)
```

We might also mix fuzzy joins with regular joins :

```{r}
power_inner_join(
male_penguins %>% select_keys_and(male_name = name),
female_penguins %>% select_keys_and(female_name = name),
by = c("island", ~.x$flipper_length_mm > .y$flipper_length_mm)
)
```

Finally we might want to create a column with a value used in the comparison,
in that case we will use `<-` in the formula (several times if needed)`:

```{r}
power_inner_join(
male_penguins %>% select_keys_and(male_name = name),
female_penguins %>% select_keys_and(female_name = name),
by = ~ (mass_ratio <- .y$body_mass_g / .x$body_mass_g) > 1.2
)
```

## Fill unmatched values

The `fill` argument is used to specify what to fill unmatched values with,
note that missing values resulting from matches are not replaced.

```{r}
df1 <- tibble(id = 1:3)
df2 <- tibble(id = 1:2, value2 = c(2, NA), value3 = c(NA, 3))

power_left_join(df1, df2, by = "id", fill = 0)

power_left_join(df1, df2, by = "id", fill = list(value2 = 0))
```

## Join recursively

The `x` and `y` arguments accept lists of data frames so one can do :

```{r}
df1 <- tibble(id = 1, a = "foo")
df2 <- tibble(id = 1, b = "bar")
df3 <- tibble(id = 1, c = "baz")

power_left_join(list(df1, df2, df3), by = "id")

power_left_join(df1, list(df2, df3), by = "id")
```

## Enhanced `keep` argument

By default, as in *{dplyr}*, key columns are merged and given names from the
left table. In case of a fuzzy join columns that participate in a fuzzy join are
kept from both sides.

We provide additional values `"left"`, `"right"`, `"both"` and `"none"` to choose
which keys to keep or drop.

## Notes

This package supersedes the {safejoin} package which had an unfortunate homonym on CRAN and
had a suboptimal interface and implementation.

Hadley Wickham, Romain François and David Robinson are credited for their work
in {dplyr} and {fuzzyjoin} since this package contains some code copied from these packages.