---
output: github_document
editor_options:
  markdown:
    wrap: 72
---

```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  fig.path = "man/figures/README-",
  out.width = "100%"
)
```

# joyn

`r badger::badge_cran_checks("joyn")` `r badger::badge_cran_release("joyn", "orange")` `r badger::badge_devel("randrescastaneda/joyn", "blue")` `r badger::badge_codecov("randrescastaneda/joyn")` `r badger::badge_lifecycle("maturing", "green")`

`joyn` empowers you to assess the results of joining data frames, making it easier and more efficient to combine your tables. Similar in philosophy to the `merge` command in `Stata`, `joyn` offers matching key variables and detailed join reports to ensure accurate and insightful results.

## Motivation

Merging tables in R can be tricky. Ensuring accuracy and understanding the joined data fully can be tedious tasks. That's where `joyn` comes in. Inspired by Stata's informative approach to merging, `joyn` makes the process smoother and more insightful.

While standard R merge functions are powerful, they often lack features like assessing join accuracy, detecting potential issues, and providing detailed reports. `joyn` fills this gap by offering:

* **Intuitive join handling:** Whether you're dealing with one-to-one, one-to-many, or many-to-many relationships, `joyn` helps you navigate them confidently.
* **Informative reports:** Get clear insights into the join process with helpful reports that identify duplicate observations, missing values, and potential inconsistencies.

## What makes `joyn` special?

While standard R merge functions offer basic functionality, `joyn` goes above and beyond by providing comprehensive tools and features tailored to your data joining needs:

**1. Flexibility in join types:** Choose your ideal join type ("left", "right", or "inner") with the `keep` argument. Unlike base R's `merge()`, which performs an inner join by default, `joyn` performs a full join by default so that all observations are included, but you have full control to tailor the results.

**2. Seamless variable handling:** No more wrestling with duplicate variable names! `joyn` offers multiple options:

* **Update values:** Use `update_values` or `update_NAs` to automatically update conflicting variables in the left table with values from the right table.

* **Keep both (with different names):** Enable `keep_common_vars = TRUE` to retain both variables, each with a unique suffix.

* **Selective inclusion:** Choose specific variables from the right table with `y_vars_to_keep`, ensuring you get only the data you need.

**3. Relationship awareness:** `joyn` recognizes one-to-one, one-to-many, many-to-one, and many-to-many relationships between tables. While it defaults to many-to-many for compatibility, **remember this is often not ideal**. **Always specify the correct relationship using the `match_type` argument** (for example, `match_type = "m:1"`) for accurate and meaningful results.

**4. Join success at a glance:** Get instant feedback on your join with the automatically generated reporting variable. Identify potential issues like unmatched observations or missing values to ensure data integrity and informed decision-making.

By addressing these common pain points and offering enhanced flexibility, `joyn` empowers you to confidently and effectively join your data frames, paving the way for deeper insights and data-driven success.
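
As a rough illustration of points 1 and 4, the sketch below runs the same join with two `keep` settings and tabulates the reporting variable. The toy tables `x` and `y` are invented for this example, and the reporting variable is explicitly named `report` through `reportvar` so the code does not rely on the package default name.

```r
library(joyn)
library(data.table)

# Hypothetical toy tables, made up for this illustration
x <- data.table(id = c(1L, 2L, 3L), a = c("p", "q", "r"))
y <- data.table(id = c(2L, 3L, 4L), b = c(10, 20, 30))

# Default behaviour: full join, every id from either table is kept
full <- joyn(x, y, by = "id", match_type = "1:1", reportvar = "report")

# Keep only the observations of the left table
left <- joyn(x, y, by = "id", match_type = "1:1", keep = "left",
             reportvar = "report")

# The reporting variable records which table each row came from
table(full$report)

nrow(full)  # 4 rows: ids 1, 2, 3, 4
nrow(left)  # 3 rows: ids 1, 2, 3
```

Tabulating the reporting variable is usually the quickest way to see how many observations matched and how many came from only one table.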

## Performance and flexibility

### The cost of reliability

While raw speed is essential, understanding your joins every step of the way is equally crucial. `joyn` prioritizes providing **insightful information** and preventing errors over solely focusing on speed. Unlike other functions, it adds:

* **Meticulous checks:** `joyn` performs comprehensive checks to ensure your join is accurate and avoids potential missteps, like unmatched observations or missing values.
* **Detailed reporting:** Get a clear picture of your join with a dedicated report, highlighting any issues you should be aware of.
* **User-friendly summary:** Quickly grasp the join's outcome with a concise overview presented in a clear table.

These valuable features contribute to a slightly slower performance compared to functions like `data.table::merge.data.table()` or `collapse::join()`. However, the benefits of **preventing errors and gaining invaluable insights** far outweigh the minor speed difference.

### Know your needs, choose your tool

* **Speed is your top priority for massive datasets?** Consider using `data.table` or `collapse` directly.
* **Seek clear understanding and error prevention for your joins?** `joyn` is your trusted guide.

### Protective by design

`joyn` intentionally restricts certain actions and provides clear messages when encountering unexpected data configurations. This might seem **opinionated**, but it's designed to **protect you from accidentally creating inaccurate or misleading joins**. This "safety net" empowers you to confidently merge your data, knowing `joyn` has your back.
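
A minimal sketch of this protective behaviour, assuming `joyn` stops with an error when the declared `match_type` does not hold in the data. The toy tables are hypothetical: `id` is duplicated in `x`, so a one-to-one join is not valid.

```r
library(joyn)
library(data.table)

# Hypothetical toy tables: `id` is duplicated in x, so x is not
# uniquely identified by `id`
x <- data.table(id = c(1L, 1L, 2L), v = 1:3)
y <- data.table(id = c(1L, 2L), w = c("a", "b"))

# Declare a one-to-one relationship that the data do not satisfy
# and capture the resulting error instead of halting the script
res <- tryCatch(
  joyn(x, y, by = "id", match_type = "1:1"),
  error = function(e) {
    message("joyn refused the join: ", conditionMessage(e))
    NULL
  }
)

is.null(res)  # TRUE when joyn stopped with an error

# Declaring the relationship that actually holds lets the join proceed
ok <- joyn(x, y, by = "id", match_type = "m:1")
```

Wrapping the call in `tryCatch()` is only for demonstration; in interactive work you would simply read the message and correct `match_type` or the keys.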

### Flexibility

Currently, `joyn` focuses on the most common and valuable join types. Future development might explore expanding its flexibility based on user needs and feedback.

## `joyn` as a wrapper: Familiar Syntax, Familiar Power

While `joyn::joyn()` offers the core functionality and Stata-inspired arguments, you might prefer a syntax more aligned with your existing workflow. `joyn` has you covered!

**Embrace base R and `data.table`:**

* `joyn::merge()`: Leverage familiar base R and `data.table` syntax for seamless integration with your existing code.

**Join with flair using `dplyr`:**

* `joyn::{dplyr verbs}()`: Enjoy the intuitive [verb-based](https://dplyr.tidyverse.org/reference/mutate-joins.html) syntax of `dplyr` for a powerful and expressive way to perform joins.

**Dive deeper:** Explore the corresponding vignettes to unlock the full potential of these alternative interfaces and find the perfect fit for your data manipulation style.
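
The sketch below shows roughly what the two alternative interfaces look like on the same hypothetical toy tables. The argument names used here (`match_type` in `joyn::merge()`, `relationship` in `joyn::left_join()`) are assumptions based on the base R and `dplyr` interfaces they mirror; check `?joyn::merge` and `?joyn::left_join` for the exact signatures.

```r
library(data.table)

# Hypothetical toy tables: many rows per id in x, one row per id in y
x <- data.table(id = c(1L, 1L, 2L, 3L), v = 1:4)
y <- data.table(id = c(1L, 2L, 4L), w = c("a", "b", "d"))

# Base R / data.table style: all.x = TRUE keeps every row of x
m <- joyn::merge(x, y, by = "id", all.x = TRUE, match_type = "m:1")

# dplyr style: the relationship is spelled out in words
l <- joyn::left_join(x, y, by = "id", relationship = "many-to-one")

nrow(m)  # 4, one row per observation in x
nrow(l)  # 4
```

The functions are called with the `joyn::` prefix so that they are not confused with `base::merge()` or `dplyr::left_join()` when those are also available in the session.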

## Installation

You can install the stable version of `joyn` from
[CRAN](https://CRAN.R-project.org) with:

``` r
install.packages("joyn")
```

You can install the development version from [GitHub](https://github.com/) with:

``` r
# install.packages("devtools")
devtools::install_github("randrescastaneda/joyn")
```

## Examples

```{r example}
library(joyn)
library(data.table)

x1 = data.table(id = c(1L, 1L, 2L, 3L, NA_integer_),
                t  = c(1L, 2L, 1L, 2L, NA_integer_),
                x  = 11:15)

y1 = data.table(id = c(1, 2, 4),
                y  = c(11L, 15L, 16))

x2 = data.table(id = c(1, 4, 2, 3, NA),
                t  = c(1L, 2L, 1L, 2L, NA_integer_),
                x  = c(16, 12, NA, NA, 15))

y2 = data.table(id = c(1, 2, 5, 6, 3),
                yd = c(1, 2, 5, 6, 3),
                y  = c(11L, 15L, 20L, 13L, 10L),
                x  = c(16:20))

# using common variable `id` as key.
joyn(x = x1,
     y = y1,
     match_type = "m:1")

# keep just those observations that match
joyn(x = x1,
     y = y1,
     match_type = "m:1",
     keep = "inner")

# Bad merge for not specifying the `by` argument
joyn(x = x2,
     y = y2,
     match_type = "1:1")

# good merge, ignoring variable x from y
joyn(x = x2,
     y = y2,
     by = "id",
     match_type = "1:1")

# update NAs in var x in table x from var x in y
joyn(x = x2,
     y = y2,
     by = "id",
     update_NAs = TRUE)

# update values in var x in table x from var x in y
joyn(x = x2,
     y = y2,
     by = "id",
     update_values = TRUE)

# do not bring any variable from y into x, just the report
joyn(x = x2,
     y = y2,
     by = "id",
     y_vars_to_keep = NULL)

```