{"id":13400903,"url":"https://github.com/moodymudskipper/safejoin","last_synced_at":"2025-04-12T20:52:20.324Z","repository":{"id":45825200,"uuid":"171823235","full_name":"moodymudskipper/safejoin","owner":"moodymudskipper","description":"Wrappers around dplyr functions to join safely     using various checks","archived":false,"fork":false,"pushed_at":"2020-08-19T11:45:21.000Z","size":126,"stargazers_count":42,"open_issues_count":1,"forks_count":7,"subscribers_count":4,"default_branch":"master","last_synced_at":"2025-04-12T20:52:15.270Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":"","language":"R","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"gpl-3.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/moodymudskipper.png","metadata":{"files":{"readme":"README.Rmd","changelog":null,"contributing":null,"funding":null,"license":"LICENSE.md","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2019-02-21T07:35:17.000Z","updated_at":"2024-04-24T03:55:19.000Z","dependencies_parsed_at":"2022-09-04T22:00:49.492Z","dependency_job_id":null,"html_url":"https://github.com/moodymudskipper/safejoin","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/moodymudskipper%2Fsafejoin","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/moodymudskipper%2Fsafejoin/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/moodymudskipper%2Fsafejoin/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/moodymudskipper%2Fsafejoin/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/moodymudskipper","download_url":"https://codeload.github.com/moodymudskipper/safejoin/tar.gz/refs/heads/master
","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":248631728,"owners_count":21136560,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2024-07-30T19:00:56.788Z","updated_at":"2025-04-12T20:52:20.306Z","avatar_url":"https://github.com/moodymudskipper.png","language":"R","funding_links":[],"categories":["R"],"sub_categories":[],"readme":"---\noutput: github_document\n---\n\n\u003c!-- badges: start --\u003e\n[![Travis build status](https://travis-ci.com/moodymudskipper/safejoin.svg?branch=master)](https://travis-ci.com/moodymudskipper/safejoin)\n[![Codecov test coverage](https://codecov.io/gh/moodymudskipper/safejoin/branch/master/graph/badge.svg)](https://codecov.io/gh/moodymudskipper/safejoin?branch=master)\n\u003c!-- badges: end --\u003e\n\n```{r, echo = FALSE}\nknitr::opts_chunk$set(\n  collapse = TRUE,\n  comment = \"#\u003e\",\n  fig.path = \"README-\"\n)\n```\n\n## safejoin\n\nThe package *safejoin* features wrappers around packages *dplyr* and\n*fuzzyjoin*'s functions to join safely using various checks. It also comes\npacked with features to select columns, rename them, operate on conflicting \nones (coalesce for example), or aggregate the rhs on the joining columns\nbefore joining.\n\nInstall package with:\n\n```{r, eval = FALSE}\n# install.packages(devtools)\ndevtools::install_github(\"moodymudskipper/safejoin\")\n```\n\nJoining operations often come with tests, one might want to check that:\n\n 1. `by` columns are given explicitly (*dplyr* displays a message if \n   they're not)\n 2. 
Factor columns used for the join have the same levels (*dplyr* displays a\n   warning if they don't)\n 3. No columns are repeated in both data.frames apart from `by` columns\n   (*dplyr* keeps them both and suffixes them silently)\n 4. Join columns form a unique key on one or both tables\n 5. All rows of one or both tables will be matched\n 6. All combinations of values of join columns are present on one or both sides\n 7. Columns used for joins have the same class and type\n \nThis package lets you ignore, inform,\nwarn or abort for any combination of these cases.\n\nThese checks are handled by a single string parameter, i.e. a sequence of\ncharacters where uppercase letters trigger failures, lowercase letters trigger\nwarnings, and letters prefixed with `~` trigger messages. The codes are as follows:\n\n* `\"c\"` to check *c*onflicts of *c*olumns\n* `\"b\"` like *\"by\"* checks if `by` parameter was given explicitly\n* `\"u\"` like *unique* to check that the join columns form a unique key on `x`\n* `\"v\"` to check that the join columns form a unique key on `y`\n* `\"m\"` like *match* to check that all rows of `x` have a _match_\n* `\"n\"` to check that all rows of `y` have a _match_\n* `\"e\"` like *expand* to check that all combinations of joining columns are \n  present in `x`\n* `\"f\"`  to check that all combinations of joining columns are present in `y`\n* `\"l\"`  like *levels* to check that join columns are consistent in terms of \n  factor levels\n* `\"t\"`  like *type* to check that joining columns have the same class and type\n\nFor example, `check = \"MN\"` will ensure that all rows of both tables are matched.\n\nAdditionally, when identically named columns are present on both\nsides, we can aggregate them into one in flexible ways (including coalesce or\njust keeping one of them). 
This is done through the `conflict` parameter.\n\nThe package features functions `safe_left_join`, `safe_right_join`, \n`safe_inner_join`,  `safe_full_join`, `safe_nest_join`,  `safe_semi_join`, \n`safe_anti_join`, and `eat`.\n\nThe additional function, `eat`\nis designed to be an improved join in the cases where one is growing a \ndata frame. In addition to the features above :\n\n* It uses the `...` argument to select columns from `.y` and leverages the select helpers from *dplyr*, allowing also things like renaming, negative selection, quasi-quotation...\n* It can prefix new columns or rename them in a flexible way\n* It can summarize `.y` on the fly along joining columns for more concise and\nreadable code\n* It can join recursively to a list of tables\n\nThe support of `fuzzyjoin` functions is done in two ways, `fuzzyjoin` functions \nwill be used instead of `dplyr`'s functions if :\n\n* The argument `match_fun` is filled. Then the standard `fuzzyjoin` interface\nis leveraged, except that `safejoin` supports formula notation for this argument.\n* A formula argument is provided to the `by` argument. It should use a notation\nlike `~ X(\"var1\") \u003e Y(\"var2\") \u0026 X(\"var3\") \u003c Y(\"var4\")`. 
This was introduced to\navoid using the arguments `multi_by` and `multi_match_fun` from \n`fuzzyjoin::fuzzy_join` which I felt were confusing, and have a single readable\nargument instead.\n\n## safe_left_join\n\n*safejoin* offers the same features for all `safe_*_join` functions so we'll\nonly review `safe_left_join` here, we also limit ourselves to checks of the form\n`~*`\n\nWe'll use *dplyr*'s data sets `band_members` and `band_instruments` along with\nextended versions.\n\n```{r}\nlibrary(safejoin)\nlibrary(dplyr,quietly = TRUE,warn.conflicts = FALSE)\nband_members_extended \u003c- band_members %\u003e%\n  mutate(cooks = factor(c(\"pasta\",\"pizza\",\"spaghetti\"),\n                        levels = c(\"pasta\",\"pizza\",\"spaghetti\"))) %\u003e%\n  add_row(name = \"John\",band = \"The Who\", cooks = \"pizza\")\n\nband_instruments_extended \u003c- band_instruments %\u003e%\n  mutate(cooks = factor(c(\"pizza\",\"pasta\",\"pizza\")))\n\nband_members\nband_instruments\nband_members_extended\nband_instruments_extended\n```\n\nNot applying any check :\n\n```{r}\nsafe_left_join(band_members,\n               band_instruments,\n               check = \"\")\n```\n\nDisplaying \"Joining, by...\" like in default *dplyr* behavior:\n\n```{r}\nsafe_left_join(band_members,\n               band_instruments,\n               check = \"~b\")\n```\n\nCheck column conflict when joining extended datasets by name:\n\n```{r}\ntry(safe_left_join(band_members_extended,\n                   band_instruments_extended,\n                   by = \"name\",\n                   check = \"~c\"))\n```\n\nCheck if `x` has unmatched combinations:\n\n```{r}\nsafe_left_join(band_members_extended,\n               band_instruments_extended,\n               by = c(\"name\",\"cooks\"),\n               check = \"~m\")\n```\n\nCheck if `y` has unmatched combinations:\n\n```{r}\nsafe_left_join(band_members_extended,\n               band_instruments_extended,\n               by = c(\"name\",\"cooks\"),\n      
         check = \"~n\")\n```\n\nCheck if `x` has absent combinations:\n\n```{r}\nsafe_left_join(band_members_extended,\n               band_instruments_extended,\n               by = c(\"name\",\"cooks\"),\n               check = \"~e\")\n```\n\nCheck if `y` has absent combinations:\n\n```{r}\nsafe_left_join(band_members_extended,\n               band_instruments_extended,\n               by = c(\"name\",\"cooks\"),\n               check = \"~f\")\n```\n\nCheck if `x` is unique on joining columns:\n\n```{r}\nsafe_left_join(band_members_extended,\n               band_instruments_extended,\n               by = c(\"name\",\"cooks\"),\n               check = \"~u\")\n```\n\nCheck if `y` is unique on joining columns (it is):\n\n```{r}\nsafe_left_join(band_members_extended,\n               band_instruments_extended,\n               by = c(\"name\",\"cooks\"),\n               check = \"~v\")\n```\n\nCheck if levels are compatible between joining columns:\n\n```{r}\nsafe_left_join(band_members_extended,\n               band_instruments_extended,\n               by = c(\"name\",\"cooks\"),\n               check = \"~l\")\n```\n\nIn case of conflict, choose either the column from `x` or from `y`:\n\n```{r}\nsafe_left_join(band_members_extended,\n               band_instruments_extended, by = \"name\",\n               conflict = ~.x)\n\nsafe_left_join(band_members_extended,\n               band_instruments_extended, \n               by = \"name\", \n               conflict = ~.y)\n```\n\nOr coalesce them :\n\n```{r}\nsafe_left_join(band_members_extended, \n               band_instruments_extended, \n               by = \"name\", conflict = coalesce)\nsafe_left_join(band_members_extended, \n               band_instruments_extended, \n               by = \"name\", conflict = ~coalesce(.y,.x))\n```\n\nOr do any custom transformation :\n\n```{r}\nsafe_left_join(band_members_extended, \n               band_instruments_extended, \n               by = \"name\", conflict = 
paste)\n```\n\nSome common use cases for numerics would be ``conflict = `+` ``, `conflict = pmin`,\n`conflict = pmax`, `conflict = ~(.x+.y)/2`.\n\n`conflict = \"patch\"` is a special value where matches found in `y`\noverwrite the values in `x`, and other values are kept. It's different from\n`conflict = ~coalesce(.y,.x)` because some values in `x` might be overwritten\nby `NA`.\n\n```{r}\nsafe_left_join(band_members_extended, \n               band_instruments_extended,\n               by = \"name\", conflict = \"patch\")\n```\n\n\n## eat\n\nAll the checks above are still relevant for `eat`; we'll silence them below\nwith `.check = \"\"` to focus on the additional features.\n\nSame as `safe_left_join` :\n\n```{r eat1}\nband_members_extended %\u003e% \n  eat(band_instruments_extended)\nband_members_extended %\u003e% \n  eat(band_instruments_extended, .by = \"name\", .check = \"\")\n```\n\nThe names of `eat`'s parameters start with a dot to minimize the risk of conflict\nwhen naming the arguments fed to the `...`. The `...` are usually used to pass\ncolumns to be eaten, but they are passed to `select` so more features are\navailable.\n\nSelect which column to eat:\n\n```{r}\nband_members_extended %\u003e% \n  eat(band_instruments_extended, plays, .by = \"name\", .check = \"\")\nband_members_extended %\u003e% \n  eat(band_instruments_extended, -cooks, .by = \"name\", .check = \"\")\nband_members_extended %\u003e% \n  eat(band_instruments_extended, starts_with(\"p\"), .by = \"name\", .check = \"\")\n```\n\nRename eaten columns :\n\n```{r}\nband_members_extended %\u003e% \n  eat(band_instruments_extended, .prefix = \"NEW\", .check = \"\")\nband_members_extended %\u003e% \n  eat(band_instruments_extended, PLAYS = plays, .check = \"\")\n```\n\nWe can check if the dot argument was used by using the character `\"d\"` in the check string:\n\n```{r}\nband_members_extended %\u003e% \n  eat(band_instruments_extended, .check = \"~d\")\n```\n\nIn cases of matching to many (i.e. 
the join columns don't form a unique key for\n`y`), we can use the parameter `.agg` to aggregate them on the fly:\n\n```{r}\nband_instruments_extended %\u003e% \n  eat(band_members_extended, .check = \"\")\nband_instruments_extended %\u003e% \n  eat(band_members_extended, .agg = ~paste(.,collapse=\"/\"), .check = \"\")\n```\n\n\nFinally we can eat a list of data frames at once, and optionally override\nthe `.prefix` argument by providing names to the elements.\n\n```{r}\nX \u003c- data.frame(a = 1:2,b = 1:2)\nY1 \u003c- list(data.frame(a = 1:2,c = 3:4), data.frame(a = 1:2,d = 5:6))\neat(X, Y1)\n\nY2 \u003c- list(data.frame(a = 1:2,c = c(3,NA)), data.frame(a = 1:2,c = c(NA,4)))\neat(X, Y2, .by = \"a\", .conflict = coalesce)\n\nY3 \u003c- list(FOO = data.frame(a = 1:2,c = 3:4), BAR = data.frame(a = 1:2,d = 5:6))\neat(X, Y3)\n\nY4 \u003c- list(FOO = data.frame(a = 1:2, c = 3:4, d = 5:6), \n           BAR = data.frame(a = 1:2, c = 3:4, e = 7:8))\neat(X, Y4)\neat(X, Y4, c)\n```\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fmoodymudskipper%2Fsafejoin","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fmoodymudskipper%2Fsafejoin","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fmoodymudskipper%2Fsafejoin/lists"}