https://github.com/beniaminogreen/zoomerjoin

Superlatively-fast fuzzy-joins in R
https://github.com/beniaminogreen/zoomerjoin

blazinglyfast fuzzyjoin join r r-package rstats rust zoomer

Last synced: 3 months ago
JSON representation

Superlatively-fast fuzzy-joins in R

Host: GitHub
URL: https://github.com/beniaminogreen/zoomerjoin
Owner: beniaminogreen
License: gpl-3.0
Created: 2023-01-18T01:22:48.000Z (over 2 years ago)
Default Branch: main
Last Pushed: 2024-09-23T23:07:15.000Z (about 1 year ago)
Last Synced: 2024-11-28T21:07:28.237Z (11 months ago)
Topics: blazinglyfast, fuzzyjoin, join, r, r-package, rstats, rust, zoomer
Language: R
Homepage: https://beniamino.org/zoomerjoin/
Size: 70.9 MB
Stars: 103
Watchers: 6
Forks: 5
Open Issues: 9
Metadata Files:
- Readme: README.Rmd
- Contributing: CONTRIBUTING.md
- License: LICENSE.md
- Citation: CITATION.cff

Awesome Lists containing this project

jimsghstars - beniaminogreen/zoomerjoin - Superlatively-fast fuzzy-joins in R (R)

README

          ---

output: github_document

always_allow_html: true

---

```{r, include=F}

library(tidyverse)

library(microbenchmark)

library(fuzzyjoin)

# rextendr::document()

devtools::load_all()

```

# zoomerjoin 

[![DOI](https://joss.theoj.org/papers/10.21105/joss.05693/status.svg)](https://doi.org/10.21105/joss.05693)

[![Lifecycle: experimental](https://img.shields.io/badge/lifecycle-experimental-orange.svg)](https://lifecycle.r-lib.org/articles/stages.html#experimental)

[![Codecov test coverage](https://codecov.io/gh/beniaminogreen/zoomerjoin/branch/main/graph/badge.svg)](https://app.codecov.io/gh/beniaminogreen/zoomerjoin?branch=main)

zoomerjoin is an R package that empowers you to fuzzy-join massive datasets

rapidly, and with little memory consumption. It is powered by high-performance

implementations of [Locality Sensitive

Hashing](https://en.wikipedia.org/wiki/Locality-sensitive_hashing), an

algorithm that finds the matches records between two datasets without having to

compare all possible pairs of observations. In practice, this means zoomerjoin

can fuzzily-join datasets days, or even years faster than other matching

packages. zoomerjoin has been used in-production to join datasets of hundreds

of millions of names or vectors in a matter of hours.

## Installation

### Installing from CRAN:

You can install from the CRAN as you would with any other package. Please be

aware that you will have to have Cargo (the rust toolchain and compiler) installed to build

the package from source.

```r

install.packages(zoomerjoin)

```

### Installing Rust

If your operating system or version of R is not installed, you must have the

[Rust compiler](https://www.rust-lang.org/tools/install) installed to compile

this package from sources. After the package is compiled, Rust is no longer

required, and can be safely uninstalled.

#### Installing Rust on Linux or Mac:

To install Rust on Linux or Mac, you can simply run the following snippet in

your terminal.

``` sh

curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh

```

#### Installing Rust on Windows:

To install Rust on windows, you can use the Rust installation wizard,

`rustup-init.exe`, found [at this

site](https://forge.rust-lang.org/infra/other-installation-methods.html).

Depending on your version of Windows, you may see an error that looks something like this:

```

error: toolchain 'stable-x86_64-pc-windows-gnu' is not installed

```

In this case, you should run `rustup install stable-x86_64-pc_windows-gnu` to

install the missing toolchain. If you're missing another toolchain, simply type

this in the place of `stable-x86_64-pc_windows-gnu` in the command above.

### Installing Package from Github:

Once you have rust installed Rust, you should be able to install the package

with either the install.packages function as above, or using the

`install_github` function from the `devtools` package or with the `pkg_install`

function from the `pak` package.

``` r

## Install with devtools

# install.packages("devtools")

devtools::install_github("beniaminogreen/zoomerjoin")

## Install with pak

# install.packages("pak")

pak::pkg_install("beniaminogreen/zoomerjoin")

```

### Loading The Package

Once the package is installed, you can load it into memory as usual by typing:

```{r, warning = FALSE, message = FALSE}

library(zoomerjoin)

```

## Usage:

The flagship feature of zoomerjoins are the jaccard_join and euclidean family

of functions, which are designed to be near drop-ins for the corresponding

dplyr/fuzzyjoin commands:

* `jaccard_left_join()`

* `jaccard_right_join()`

* `jaccard_inner_join()`

* `jaccard_full_join()`

* `euclidean_left_join()`

* `euclidean_right_join()`

* `euclidean_inner_join()`

* `euclidean_full_join()`

The `jaccard_join` family of functions provide fast fuzzy-joins for strings

using the Jaccard distance while the `euclidean_join` family provides

fuzzy-joins for points or vectors using the Euclidean distance.

### Example: Joining rows of the Database on Ideology, Money in Politics, and Elections

(DIME)

Here's a snippet showing off how to use the `jaccard_inner_join()` merge two

lists of political donors in the [Database on Ideology, Money in Politics,

and Elections (DIME)](https://data.stanford.edu/dime). You can see a more

detailed example of this vignette in the [introductory vignette](https://beniamino.org/zoomerjoin/articles/guided_tour.html).

I start with two corpuses I would like to combine, `corpus_1`:

```{r}

corpus_1 <- dime_data %>%

  head(500)

names(corpus_1) <- c("a", "field")

corpus_1

```

And `corpus_2`:

```{r}

corpus_2 <- dime_data %>%

  tail(500)

names(corpus_2) <- c("b", "field")

corpus_2

```

Both corpuses have an observation ID column, and a donor name column. We would

like to join the two datasets on the donor names column, but the two can't be

directly joined because of misspellings. Because of this, we will use the

jaccard_inner_join function to fuzzily join the two on the donor name column.

Importantly, Locality Sensitive Hashing is a [probabilistic

algorithm](https://en.wikipedia.org/wiki/Randomized_algorithm), so it may fail

to identify some matches by random chance. I adjust the hyperparameters

`n_bands` and `band_width` until the chance of true matches being dropped is

negligible. By default, the package will issue a warning if the chance of a

true match being discovered is less than 95%. You can use the

`jaccard_probability` and `jaccard_hyper_grid_search` to help understand the

probability any true matches will be discarded by the algorithm.

More details and a more thorough description of how to tune the hyperparameters

can be can be found in the [guided tour

vignette](https://beniamino.org/zoomerjoin/articles/guided_tour.html).

```{r}

set.seed(1)

start_time <- Sys.time()

join_out <- jaccard_inner_join(corpus_1, corpus_2, n_gram_width = 6, n_bands = 20, band_width = 6)

print(Sys.time() - start_time)

print(join_out)

```

Zoomerjoin is able to quickly find the matching columns without comparing all

pairs of records. This saves more and more time as the size of each list

increases, so it can scale to join datasets with millions or hundreds of

millions of rows.

# Contributing

Thanks for your interest in contributing to Zoomerjoin!

I am using a gitub-centric workflow to manage the package; You can file a bug report, request a new feature, or ask a question about the package by [filing

an issue on the issues page](https://github.com/beniaminogreen/zoomerjoin/issues), where you will also

find a range of templates to help you out. If you'd like to make changes to the code, you can write and file a [pull

request](https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/proposing-changes-to-your-work-with-pull-requests/about-pull-requests)

on [this page](https://github.com/beniaminogreen/zoomerjoin/pulls). I'll try to

respond to all of these in a timely manner (within a week) although

occasionally I may take longer to respond to a complicated question or issue.

Please also be aware of the [contributor code of

conduct](https://github.com/beniaminogreen/zoomerjoin/blob/main/CONTRIBUTING.md)

for contributing to the repository.

## Acknowledgments:

The Zoomerjoin was made using [this SQL join

illustration](https://commons.wikimedia.org/wiki/File:SQL_Join_-_08_A_Cross_Join_B.svg)

by [Germanx](https://commons.wikimedia.org/wiki/User:GermanX) and [this speed

limit sign](https://commons.wikimedia.org/wiki/File:Speed_limit_75_sign.svg) from the

Federal Highway Administration - MUTCD.

## References:

Bonica, Adam. 2016. Database on Ideology, Money in Politics, and Elections: Public version 2.0 [Computer file]. Stanford, CA: Stanford University Libraries.

Jure Leskovec, Anand Rajaraman, and Jeffrey David Ullman. 2014. Mining of Massive Datasets (2nd. ed.). Cambridge University Press, USA.

Broder, Andrei Z. (1997), "On the resemblance and containment of documents", Compression and Complexity of Sequences: Proceedings. Positano, Salerno, Italy

ecosyste.ms

Data

Tools

Indexes

Applications

Experiments

Awesome

https://github.com/beniaminogreen/zoomerjoin

Awesome Lists containing this project

README