{"id":14066997,"url":"https://github.com/moodymudskipper/powerjoin","last_synced_at":"2025-04-04T17:09:34.195Z","repository":{"id":46527167,"uuid":"419315258","full_name":"moodymudskipper/powerjoin","owner":"moodymudskipper","description":"Extensions of 'dplyr' and 'fuzzyjoin' Join Functions","archived":false,"fork":false,"pushed_at":"2024-12-06T08:10:05.000Z","size":2169,"stargazers_count":104,"open_issues_count":13,"forks_count":1,"subscribers_count":3,"default_branch":"master","last_synced_at":"2025-03-28T16:10:01.913Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":"","language":"R","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"other","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/moodymudskipper.png","metadata":{"files":{"readme":"README.Rmd","changelog":"NEWS.md","contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2021-10-20T12:13:00.000Z","updated_at":"2025-03-25T12:13:02.000Z","dependencies_parsed_at":"2024-01-13T07:02:50.611Z","dependency_job_id":"16677f6b-263f-4020-b0e5-766a8f3dcfb3","html_url":"https://github.com/moodymudskipper/powerjoin","commit_stats":{"total_commits":156,"total_committers":8,"mean_commits":19.5,"dds":0.25,"last_synced_commit":"aada0c0c9be353d3da69fbc6299b070eb2fa6755"},"previous_names":[],"tags_count":2,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/moodymudskipper%2Fpowerjoin","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/moodymudskipper%2Fpowerjoin/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/moodymudskipper%2Fpowerjoin/releases","manifests_url":"https://repos.ecosyste.
ms/api/v1/hosts/GitHub/repositories/moodymudskipper%2Fpowerjoin/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/moodymudskipper","download_url":"https://codeload.github.com/moodymudskipper/powerjoin/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":247217221,"owners_count":20903009,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2024-08-13T07:05:22.398Z","updated_at":"2025-04-04T17:09:34.160Z","avatar_url":"https://github.com/moodymudskipper.png","language":"R","funding_links":[],"categories":["R"],"sub_categories":[],"readme":"---\noutput: github_document\n---\n\n\u003c!-- README.md is generated from README.Rmd. 
Please edit that file --\u003e\n\n```{r, include = FALSE}\nknitr::opts_chunk$set(\n  collapse = TRUE,\n  comment = \"#\u003e\",\n  fig.path = \"man/figures/README-\",\n  out.width = \"100%\",\n  tidy.opts = list(blank = FALSE)\n)\noptions(tidyverse.quiet = TRUE)\n```\n\n# powerjoin \u003cimg src='man/figures/logo.png' align=\"right\" height=\"139\" /\u003e\n\n{powerjoin} extends {dplyr}'s join functions.\n\n* Make your joins safer with the `check` argument and the `check_specs()` function\n* Deal with conflicting column names by combining or coalescing them, for example, using the `conflict` argument\n* Preprocess input, for instance to select columns to join without having to repeat\nkey columns in the selection\n* Do painless fuzzy joins thanks to a generalized `by` argument accepting formulas\n* Fill unmatched values using the `fill` argument\n* Perform recursive joins by providing lists of data frames to `x` and `y`\n* Keep or drop key columns with more flexibility thanks to an enhanced `keep` argument\n\n## Installation\n\nInstall the CRAN version with:\n``` r\ninstall.packages(\"powerjoin\")\n```\n\nOr the development version with:\n\n``` r\nremotes::install_github(\"moodymudskipper/powerjoin\")\n```\n\n## Now let's match penguins\n\n```{r}\nlibrary(powerjoin)\nlibrary(tidyverse)\n\n# toy dataset built from Allison Horst's {palmerpenguins} package and \n# Hadley Wickham's {babynames}\n\nmale_penguins \u003c- tribble(\n     ~name,    ~species,     ~island, ~flipper_length_mm, ~body_mass_g,\n \"Giordan\",    \"Gentoo\",    \"Biscoe\",               222L,        5250L,\n  \"Lynden\",    \"Adelie\", \"Torgersen\",               190L,        3900L,\n  \"Reiner\",    \"Adelie\",     \"Dream\",               185L,        3650L\n)\n\nfemale_penguins \u003c- tribble(\n     ~name,    ~species,  ~island, ~flipper_length_mm, ~body_mass_g,\n  \"Alonda\",    \"Gentoo\", \"Biscoe\",               211,        4500L,\n     \"Ola\",    \"Adelie\",  \"Dream\",               190,        
3600L,\n\"Mishayla\",    \"Gentoo\", \"Biscoe\",               215,        4750L,\n)\n```\n\n## Safer joins\n\nThe `check` argument receives an object created by the `check_specs()` function,\nwhich provides ways to handle specific input properties. Each of its arguments\ncan be set to one of:\n\n* `\"ignore\"`: stay silent (default except for `implicit_keys`)\n* `\"inform\"`\n* `\"warn\"`\n* `\"abort\"`\n\nWe can print these defaults:\n\n```{r}\ncheck_specs()\n```\n\nBy default it behaves like {dplyr}: it informs about implicit keys and performs no\nfurther checks:\n\n```{r, error = TRUE}\npower_inner_join(\n  male_penguins[c(\"species\", \"island\")],\n  female_penguins[c(\"species\", \"island\")]\n)\n```\n\nWe can silence the implicit key detection and check that we have unique keys in\nthe right table:\n\n\n```{r}\ncheck_specs(implicit_keys = \"ignore\", duplicate_keys_right = \"abort\")\n```\n\n\n```{r, error = TRUE}\npower_inner_join(\n  male_penguins[c(\"species\", \"island\")],\n  female_penguins[c(\"species\", \"island\")],\n  check = check_specs(implicit_keys = \"ignore\", duplicate_keys_right = \"abort\")\n)\n```\n\nThe `column_conflict` argument guarantees that columns won't be renamed without your\nknowing. You will probably want it most of the time, so we could set up some development and\nproduction specs for our most common joins:\n\n```{r}\ndev_specs \u003c- check_specs(\n  column_conflict = \"abort\",\n  inconsistent_factor_levels = \"inform\",\n  inconsistent_type = \"inform\"\n)\n\nprod_specs \u003c- check_specs(\n  column_conflict = \"abort\",\n  implicit_keys = \"abort\"\n)\n```\n\nThis will save some typing:\n\n\u003c!-- For some reason this chunk makes markdown bug, so dirty fix --\u003e\n\n```{r, error = TRUE, eval = FALSE}\npower_inner_join(\n  male_penguins,\n  female_penguins,\n  by = c(\"species\", \"island\"),\n  check = dev_specs\n)\n#\u003e Error: The following columns are conflicted and their conflicts are not handled: \n#\u003e 'name', 'flipper_length_mm', 
'body_mass_g'\n```\n\n## Handle column conflict\n\nWe saw above how to fail when encountering a column conflict; here we show how to\nhandle it.\n\nTo resolve conflicts between identically named columns of the two tables, set the `conflict`\nargument to a two-argument function (or formula) that takes the two conflicting\ncolumns as arguments after the join.\n\n```{r}\ndf1 \u003c- tibble(id = 1:3, value = c(10, NA, 30))\ndf2 \u003c- tibble(id = 2:4, value = c(22, 32, 42))\n\npower_left_join(df1, df2, by = \"id\", conflict = `+`)\n```\n\nCoalescing is the most common use case, and we provide the functions `coalesce_xy()`\nand `coalesce_yx()` to ease this task (both are wrappers around `dplyr::coalesce()`).\n\n```{r}\npower_left_join(df1, df2, by = \"id\", conflict = coalesce_xy)\n\npower_left_join(df1, df2, by = \"id\", conflict = coalesce_yx)\n```\n\nNote that the function operates on whole vectors by default, not rowwise. However,\nwe can make it work rowwise by using `rw` on the left-hand side of the formula.\n\n```{r}\npower_left_join(df1, df2, by = \"id\", conflict = ~ sum(.x, .y, na.rm = TRUE))\n\npower_left_join(df1, df2, by = \"id\", conflict = rw ~ sum(.x, .y, na.rm = TRUE))\n```\n\nIf you need finer control, `conflict` can also be a named list of such functions,\nformulas or special values, each to be applied to the relevant pair of conflicted\ncolumns.\n\n\n## Preprocess inputs\n\nTraditionally, key columns need to be repeated when preprocessing inputs\nbefore a join, which is an annoyance and an opportunity for mistakes.\nWith {powerjoin} we can do:\n\n```{r}\npower_inner_join(\n  male_penguins %\u003e% select_keys_and(name),\n  female_penguins %\u003e% select_keys_and(female_name = name),\n  by = c(\"species\", \"island\")\n)\n```\n\nFor semi joins, just omit arguments to `select_keys_and()`:\n\n```{r}\npower_inner_join(\n  male_penguins,\n  female_penguins %\u003e% select_keys_and(),\n  by = c(\"species\", \"island\")\n)\n```\n\nWe could also aggregate on keys before the join, 
without the need for any\n`group_by()`/`ungroup()` gymnastics:\n\n```{r}\npower_left_join(\n  male_penguins %\u003e% summarize_by_keys(male_weight = mean(body_mass_g)),\n  female_penguins %\u003e% summarize_by_keys(female_weight = mean(body_mass_g)),\n  by = c(\"species\", \"island\")\n)\n```\n\n`pack_along_keys()` packs the given columns, or all non-key columns by default, into\na data frame column named by the `name` argument. This is useful to namespace the\ndata and avoid conflicts:\n\n```{r}\npower_left_join(\n  male_penguins %\u003e% pack_along_keys(name = \"m\"),\n  female_penguins %\u003e% pack_along_keys(name = \"f\"),\n  by = c(\"species\", \"island\")\n)\n```\n\nWe have more of these, all variants of tidyverse functions:\n\n* `nest_by_keys()` nests the given columns, or all by default. If `name` is given,\na single list column of data frames is created.\n* `complete_keys()` expands the key columns, so all combinations are present,\nfilling the rest of the new rows with `NA`s. Absent factor levels are expanded\nas well.\n\n\u003c!-- * `pivot_wider_by_keys()` and `pivot_longer_by_keys()` assume the \"id\" columns are the keys --\u003e\n\nThese functions do not modify the data but add an attribute that will be processed\nby the join function later on, so no other function should be applied on top of them.\n\n## Fuzzy joins\n\nTo do fuzzy joins we use formulas in the `by` argument. In these formulas we use\n`.x` and `.y` to refer to the left and right tables. 
This is very flexible\nbut can be costly since a Cartesian product is computed.\n\n```{r}\npower_inner_join(\n    male_penguins %\u003e% select_keys_and(male_name = name),\n    female_penguins %\u003e% select_keys_and(female_name = name),\n    by = c(~.x$flipper_length_mm \u003c .y$flipper_length_mm, ~.x$body_mass_g \u003e .y$body_mass_g)\n)\n```\n\nWe might also mix fuzzy joins with regular joins:\n\n```{r}\npower_inner_join(\n    male_penguins %\u003e% select_keys_and(male_name = name),\n    female_penguins %\u003e% select_keys_and(female_name = name),\n    by = c(\"island\", ~.x$flipper_length_mm \u003e .y$flipper_length_mm)\n)\n```\n\nFinally, we might want to create a column with a value used in the comparison.\nIn that case we use `\u003c-` in the formula (several times if needed):\n\n```{r}\npower_inner_join(\n    male_penguins %\u003e% select_keys_and(male_name = name),\n    female_penguins %\u003e% select_keys_and(female_name = name),\n    by = ~ (mass_ratio \u003c- .y$body_mass_g / .x$body_mass_g) \u003e 1.2\n)\n```\n\n## Fill unmatched values\n\nThe `fill` argument specifies what to fill unmatched values with.\nNote that missing values resulting from matches are not replaced.\n\n```{r}\ndf1 \u003c- tibble(id = 1:3)\ndf2 \u003c- tibble(id = 1:2, value2 = c(2, NA), value3 = c(NA, 3))\n\npower_left_join(df1, df2, by = \"id\", fill = 0)\n\npower_left_join(df1, df2, by = \"id\", fill = list(value2 = 0))\n```\n\n## Join recursively\n\nThe `x` and `y` arguments accept lists of data frames, so one can do:\n\n```{r}\ndf1 \u003c- tibble(id = 1, a = \"foo\")\ndf2 \u003c- tibble(id = 1, b = \"bar\")\ndf3 \u003c- tibble(id = 1, c = \"baz\")\n\npower_left_join(list(df1, df2, df3), by = \"id\")\n\npower_left_join(df1, list(df2, df3), by = \"id\")\n```\n\n## Enhanced `keep` argument\n\nBy default, as in *{dplyr}*, key columns are merged and given names from the\nleft table. 
In a fuzzy join, the columns that participate in the fuzzy comparison are\nkept from both sides.\n\nWe provide additional values `\"left\"`, `\"right\"`, `\"both\"` and `\"none\"` to choose\nwhich keys to keep or drop.\n\n## Notes\n\nThis package supersedes the {safejoin} package, which had an unfortunate homonym on CRAN and\nhad a suboptimal interface and implementation.\n\nHadley Wickham, Romain François and David Robinson are credited for their work\nin {dplyr} and {fuzzyjoin}, since this package contains some code copied from these packages.\n\n\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fmoodymudskipper%2Fpowerjoin","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fmoodymudskipper%2Fpowerjoin","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fmoodymudskipper%2Fpowerjoin/lists"}