https://github.com/moodymudskipper/powerjoin

Extensions of 'dplyr' and 'fuzzyjoin' Join Functions
Host: GitHub
URL: https://github.com/moodymudskipper/powerjoin
Owner: moodymudskipper
License: other
Created: 2021-10-20T12:13:00.000Z (over 3 years ago)
Default Branch: master
Last Pushed: 2024-12-06T08:10:05.000Z (7 months ago)
Last Synced: 2025-03-28T16:10:01.913Z (3 months ago)
Language: R
Homepage:
Size: 2.07 MB
Stars: 104
Watchers: 3
Forks: 1
Open Issues: 13
Metadata Files:
- Readme: README.Rmd
- Changelog: NEWS.md
- License: LICENSE
Awesome Lists containing this project

jimsghstars - moodymudskipper/powerjoin - Extensions of 'dplyr' and 'fuzzyjoin' Join Functions (R)
README

---
output: github_document
---

```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  fig.path = "man/figures/README-",
  out.width = "100%",
  tidy.opts = list(blank = FALSE)
)
options(tidyverse.quiet = TRUE)
```

# powerjoin 

{powerjoin} extends {dplyr}'s join functions.

* Make your joins safer with the `check` argument and the `check_specs()` function
* Deal with conflicting column names by combining them, coalescing them, etc., using the `conflict` argument
* Preprocess inputs, for instance to select columns to join without having to repeat key columns in the selection
* Do painless fuzzy joins thanks to a generalized `by` argument accepting formulas
* Fill unmatched values using the `fill` argument
* Operate recursive joins by providing lists of data frames to `x` and `y`
* Keep or drop key columns with more flexibility thanks to an enhanced `keep` argument

## Installation

Install the CRAN version with:

``` r
install.packages("powerjoin")
```

Or the development version with:

``` r
remotes::install_github("moodymudskipper/powerjoin")
```

## Now let's match penguins

```{r}
library(powerjoin)
library(tidyverse)

# toy dataset built from Allison Horst's {palmerpenguins} package and
# Hadley Wickham's {babynames}
male_penguins <- tribble(
     ~name,    ~species,     ~island, ~flipper_length_mm, ~body_mass_g,
 "Giordan",    "Gentoo",    "Biscoe",               222L,        5250L,
  "Lynden",    "Adelie", "Torgersen",               190L,        3900L,
  "Reiner",    "Adelie",     "Dream",               185L,        3650L
)

female_penguins <- tribble(
     ~name,    ~species,  ~island, ~flipper_length_mm, ~body_mass_g,
  "Alonda",    "Gentoo", "Biscoe",               211,        4500L,
     "Ola",    "Adelie",  "Dream",               190,        3600L,
"Mishayla",    "Gentoo", "Biscoe",               215,        4750L,
)
```

## Safer joins

The `check` argument receives an object created by the `check_specs()` function, which provides ways to handle specific input properties. Each of its arguments can be one of:

* `"ignore"`: stay silent (default except for `implicit_keys`)
* `"inform"`: print a message
* `"warn"`: raise a warning
* `"abort"`: fail with an error

We can print the defaults:

```{r}
check_specs()
```

By default it works like {dplyr}, informing in case of implicit keys and performing no further checks:

```{r, error = TRUE}
power_inner_join(
  male_penguins[c("species", "island")],
  female_penguins[c("species", "island")]
)
```

We can silence the implicit key detection and check that we have unique keys in the right table:

```{r}
check_specs(implicit_keys = "ignore", duplicate_keys_right = "abort")
```

```{r, error = TRUE}
power_inner_join(
  male_penguins[c("species", "island")],
  female_penguins[c("species", "island")],
  check = check_specs(implicit_keys = "ignore", duplicate_keys_right = "abort")
)
```
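A check set to `"abort"` makes the join fail with a regular R error, so it can be handled with base R's `tryCatch()`. The sketch below is one possible pattern rather than something the package prescribes; `males`, `females`, `strict` and `outcome` are hypothetical names, and the two tables are trimmed-down copies of the toy data that keep only the key columns.

```r
library(powerjoin)

# Trimmed-down copies of the toy tables (key columns only).
males <- data.frame(
  species = c("Gentoo", "Adelie", "Adelie"),
  island  = c("Biscoe", "Torgersen", "Dream")
)
females <- data.frame(
  species = c("Gentoo", "Adelie", "Gentoo"),
  island  = c("Biscoe", "Dream", "Biscoe")
)

strict <- check_specs(implicit_keys = "ignore", duplicate_keys_right = "abort")

# The right table holds ("Gentoo", "Biscoe") twice, so the check aborts;
# tryCatch() turns the error into a value we can inspect.
outcome <- tryCatch(
  power_inner_join(males, females, by = c("species", "island"), check = strict),
  error = function(e) conditionMessage(e)
)

# `outcome` is the joined data frame on success, or the error message on failure.
is.character(outcome)
```

In a script, the error handler could instead log the message and stop, or fall back to a deduplicated right table.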

Setting the `column_conflict` argument to `"abort"` guarantees that columns won't be renamed without you knowing. You might want this most of the time, so we could set up some development and production specs for our most common joins:

```{r}
dev_specs <- check_specs(
  column_conflict = "abort",
  inconsistent_factor_levels = "inform",
  inconsistent_type = "inform"
)

prod_specs <- check_specs(
  column_conflict = "abort",
  implicit_keys = "abort"
)
```

This will save some typing:

```{r, error = TRUE, eval = FALSE}
power_inner_join(
  male_penguins,
  female_penguins,
  by = c("species", "island"),
  check = dev_specs
)
#> Error: The following columns are conflicted and their conflicts are not handled: 
#> 'name', 'flipper_length_mm', 'body_mass_g'
```

## Handle column conflict

We saw above how to fail when encountering column conflicts; here we show how to handle them.

To resolve conflicts between identically named columns of the two tables, set the `conflict` argument to a two-argument function (or formula) that receives the two conflicting columns after the join.

```{r}
df1 <- tibble(id = 1:3, value = c(10, NA, 30))
df2 <- tibble(id = 2:4, value = c(22, 32, 42))
power_left_join(df1, df2, by = "id", conflict = `+`)
```

Coalescing is the most common use case and we provide the functions `coalesce_xy()` and `coalesce_yx()` to ease this task (both wrapped around `dplyr::coalesce()`).

```{r}
power_left_join(df1, df2, by = "id", conflict = coalesce_xy)
power_left_join(df1, df2, by = "id", conflict = coalesce_yx)
```

Note that the function operates on whole vectors by default, not rowwise. We can make it work rowwise by using `rw` on the left-hand side of the formula.

```{r}
power_left_join(df1, df2, by = "id", conflict = ~ sum(.x, .y, na.rm = TRUE))
power_left_join(df1, df2, by = "id", conflict = rw ~ sum(.x, .y, na.rm = TRUE))
```
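The difference between the two calls comes from how `sum()` treats its inputs. A minimal base R sketch follows; it assumes the two `value` columns look as written below once the left join has aligned them on `id`, and the names `x`, `y`, `vectorised` and `rowwise` are only illustrative.

```r
# The two `value` columns as aligned by the left join on `id` (ids 1, 2, 3).
x <- c(10, NA, 30)  # from df1
y <- c(NA, 22, 32)  # from df2; id 1 has no match in df2

# Vectorised call: sum() collapses both vectors into a single number,
# which is then recycled over all rows.
vectorised <- sum(x, y, na.rm = TRUE)

# Rowwise call: sum() is applied to one pair of values at a time.
rowwise <- mapply(function(a, b) sum(a, b, na.rm = TRUE), x, y)

vectorised  # 94
rowwise     # 10 22 62
```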

If you need finer control, `conflict` can also be a named list of such functions, formulas or special values, each to be applied on the relevant pair of conflicted columns.
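As a hedged illustration of the named-list form, the sketch below resolves two conflicted columns with different rules. The tables `left` and `right` and the extra column `n` are made up for this example, and every conflicted column is given a rule so that nothing depends on how unlisted columns are treated.

```r
library(powerjoin)

left  <- data.frame(id = 1:3, value = c(10, NA, 30), n = c(1, 1, 1))
right <- data.frame(id = 2:4, value = c(22, 32, 42), n = c(5, 5, 5))

# Each conflicted column gets its own rule:
# `value` is coalesced (left table first), `n` is added up.
res <- power_left_join(
  left, right,
  by = "id",
  conflict = list(value = coalesce_xy, n = `+`)
)
res
```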

## Preprocess inputs

Traditionally, key columns need to be repeated when preprocessing inputs before a join, which is an annoyance and an opportunity for mistakes. With {powerjoin} we can do:

```{r}
power_inner_join(
  male_penguins %>% select_keys_and(name),
  female_penguins %>% select_keys_and(female_name = name),
  by = c("species", "island")
)
```

For semi joins, just omit arguments to `select_keys_and()`:

```{r}
power_inner_join(
  male_penguins,
  female_penguins %>% select_keys_and(),
  by = c("species", "island")
)
```

We could also aggregate on keys before the join, without the need for any `group_by()`/`ungroup()` gymnastics:

```{r}
power_left_join(
  male_penguins %>% summarize_by_keys(male_weight = mean(body_mass_g)),
  female_penguins %>% summarize_by_keys(female_weight = mean(body_mass_g)),
  by = c("species", "island")
)
```
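For comparison, roughly the same table can be built with plain {dplyr}; the key columns then have to be named both in the grouping and in the join. This is only a sketch on trimmed-down copies of the toy tables (the names `keys`, `male_summary`, `female_summary` and `plain` are illustrative), and the row order may differ from the {powerjoin} output.

```r
library(dplyr)

male_penguins <- tibble(
  species     = c("Gentoo", "Adelie", "Adelie"),
  island      = c("Biscoe", "Torgersen", "Dream"),
  body_mass_g = c(5250L, 3900L, 3650L)
)
female_penguins <- tibble(
  species     = c("Gentoo", "Adelie", "Gentoo"),
  island      = c("Biscoe", "Dream", "Biscoe"),
  body_mass_g = c(4500L, 3600L, 4750L)
)

keys <- c("species", "island")

# Aggregate each table by the key columns, dropping the grouping afterwards.
male_summary <- male_penguins %>%
  group_by(across(all_of(keys))) %>%
  summarize(male_weight = mean(body_mass_g), .groups = "drop")

female_summary <- female_penguins %>%
  group_by(across(all_of(keys))) %>%
  summarize(female_weight = mean(body_mass_g), .groups = "drop")

# Join the two summaries on the same keys.
plain <- left_join(male_summary, female_summary, by = keys)
plain
```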

`pack_along_keys()` packs given columns, or all non-key columns by default, into a data frame column named by the `name` argument. It's useful to namespace the data and avoid conflicts:

```{r}
power_left_join(
  male_penguins %>% pack_along_keys(name = "m"),
  female_penguins %>% pack_along_keys(name = "f"),
  by = c("species", "island")
)
```

We have more of these, all variants of tidyverse functions:

* `nest_by_keys()` nests given columns, or all by default; if `name` is given, a single list column of data frames is created
* `complete_keys()` expands the key columns so that all combinations are present, filling the rest of the new rows with `NA`s. Absent factor levels are expanded as well.

These functions do not modify the data but add an attribute that will be processed by the join function later on, so no other function should be applied on top of them.

## Fuzzy joins

To do fuzzy joins we use formulas in the `by` argument. In these formulas we use `.x` and `.y` to refer to the left and right tables. This is very flexible but can be costly, since a Cartesian product is computed.

```{r}
power_inner_join(
    male_penguins %>% select_keys_and(male_name = name),
    female_penguins %>% select_keys_and(female_name = name),
    by = c(~.x$flipper_length_mm < .y$flipper_length_mm, ~.x$body_mass_g > .y$body_mass_g)
)
```
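The cost of the Cartesian product comes from comparing every row of one table with every row of the other. A rough base R sketch of the same computation is given below, purely to illustrate the idea; it is not a description of the package's internals, and `males`, `females`, `pairs` and `matches` are hypothetical names for trimmed-down copies of the toy tables.

```r
# Trimmed-down copies of the toy tables.
males <- data.frame(
  male_name         = c("Giordan", "Lynden", "Reiner"),
  flipper_length_mm = c(222, 190, 185),
  body_mass_g       = c(5250, 3900, 3650)
)
females <- data.frame(
  female_name       = c("Alonda", "Ola", "Mishayla"),
  flipper_length_mm = c(211, 190, 215),
  body_mass_g       = c(4500, 3600, 4750)
)

# Cartesian product: every male row paired with every female row (3 x 3 = 9).
# Shared column names receive the ".x" / ".y" suffixes.
pairs <- merge(males, females, by = NULL, suffixes = c(".x", ".y"))

# Keep only the pairs that satisfy both conditions.
matches <- subset(
  pairs,
  flipper_length_mm.x < flipper_length_mm.y & body_mass_g.x > body_mass_g.y
)

nrow(pairs)  # 9
matches[, c("male_name", "female_name")]
```

With tables of `n` and `m` rows the intermediate product has `n * m` rows, which is why fuzzy conditions are best combined with regular keys when possible.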

We might also mix fuzzy joins with regular joins:

```{r}
power_inner_join(
    male_penguins %>% select_keys_and(male_name = name),
    female_penguins %>% select_keys_and(female_name = name),
    by = c("island", ~.x$flipper_length_mm > .y$flipper_length_mm)
)
```

Finally, we might want to create a column with a value used in the comparison. In that case we use `<-` in the formula (several times if needed):

```{r}
power_inner_join(
    male_penguins %>% select_keys_and(male_name = name),
    female_penguins %>% select_keys_and(female_name = name),
    by = ~ (mass_ratio <- .y$body_mass_g / .x$body_mass_g) > 1.2
)
```

## Fill unmatched values

The `fill` argument specifies what to fill unmatched values with. Note that missing values resulting from matches are not replaced.

```{r}
df1 <- tibble(id = 1:3)
df2 <- tibble(id = 1:2, value2 = c(2, NA), value3 = c(NA, 3))
power_left_join(df1, df2, by = "id", fill = 0)
power_left_join(df1, df2, by = "id", fill = list(value2 = 0))
```
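The distinction between unmatched rows and matched rows that happen to contain `NA` is easy to lose with a plain `replace_na()` after the join. The sketch below reproduces the `fill = 0` behaviour described above with plain {dplyr}, under the assumption that a helper flag column (here called `matched`, a made-up name) is acceptable; it is shown only to make the rule explicit.

```r
library(dplyr)

df1 <- tibble(id = 1:3)
df2 <- tibble(id = 1:2, value2 = c(2, NA), value3 = c(NA, 3))

filled <- df1 %>%
  # Flag the rows of df2 so matched rows can be told apart after the join.
  left_join(mutate(df2, matched = TRUE), by = "id") %>%
  # Replace values only where no match was found (the flag is NA there).
  mutate(across(c(value2, value3), ~ if_else(is.na(matched), 0, .x))) %>%
  select(-matched)

filled
# value2 is 2, NA, 0 and value3 is NA, 3, 0:
# the NAs that came from df2 itself are left untouched.
```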

## Join recursively

The `x` and `y` arguments accept lists of data frames, so one can do:

```{r}
df1 <- tibble(id = 1, a = "foo")
df2 <- tibble(id = 1, b = "bar")
df3 <- tibble(id = 1, c = "baz")
power_left_join(list(df1, df2, df3), by = "id")
power_left_join(df1, list(df2, df3), by = "id")
```
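Conceptually, passing a list amounts to folding a join over its elements. A rough plain {dplyr} equivalent of the first call is sketched below to illustrate the idea; it is not a description of how the package implements it, and `folded` is just an illustrative name.

```r
library(dplyr)

df1 <- tibble(id = 1, a = "foo")
df2 <- tibble(id = 1, b = "bar")
df3 <- tibble(id = 1, c = "baz")

# Join the tables one after the other, carrying the accumulated result along.
folded <- Reduce(
  function(acc, df) left_join(acc, df, by = "id"),
  list(df1, df2, df3)
)

folded
names(folded)  # "id" "a" "b" "c"
```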

## Enhanced `keep` argument

By default, as in *{dplyr}*, key columns are merged and given names from the left table. In a fuzzy join, columns that participate in the fuzzy conditions are kept from both sides.

We provide additional values `"left"`, `"right"`, `"both"` and `"none"` to choose which keys to keep or drop.
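A small sketch of how these values might be used on an equi-join follows. The tables `left` and `right` are made up for illustration, and the exact names given to the kept key columns are best checked by printing the results rather than assumed.

```r
library(powerjoin)

left  <- data.frame(id = 1:2, a = c("foo", "bar"))
right <- data.frame(id = 2:3, b = c("baz", "qux"))

# Default: the key columns are merged into a single `id` column.
merged <- power_left_join(left, right, by = "id")

# Keep the key columns of both tables.
both <- power_left_join(left, right, by = "id", keep = "both")

# Drop the key columns altogether.
none <- power_left_join(left, right, by = "id", keep = "none")

names(merged)
names(both)
names(none)
```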

## Notes

This package supersedes the {safejoin} package, which had an unfortunate homonym on CRAN and a suboptimal interface and implementation.

Hadley Wickham, Romain François and David Robinson are credited for their work on {dplyr} and {fuzzyjoin}, since this package contains some code copied from these packages.