{"id":14065970,"url":"https://github.com/beniaminogreen/zoomerjoin","last_synced_at":"2025-07-21T20:02:41.879Z","repository":{"id":65634373,"uuid":"590253276","full_name":"beniaminogreen/zoomerjoin","owner":"beniaminogreen","description":"Superlatively-fast fuzzy-joins in R","archived":false,"fork":false,"pushed_at":"2024-09-23T23:07:15.000Z","size":74365,"stargazers_count":103,"open_issues_count":9,"forks_count":5,"subscribers_count":6,"default_branch":"main","last_synced_at":"2024-11-28T21:07:28.237Z","etag":null,"topics":["blazinglyfast","fuzzyjoin","join","r","r-package","rstats","rust","zoomer"],"latest_commit_sha":null,"homepage":"https://beniamino.org/zoomerjoin/","language":"R","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"gpl-3.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/beniaminogreen.png","metadata":{"files":{"readme":"README.Rmd","changelog":null,"contributing":"CONTRIBUTING.md","funding":null,"license":"LICENSE.md","code_of_conduct":null,"threat_model":null,"audit":null,"citation":"CITATION.cff","codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null}},"created_at":"2023-01-18T01:22:48.000Z","updated_at":"2024-10-18T16:09:44.000Z","dependencies_parsed_at":"2024-02-17T18:28:57.676Z","dependency_job_id":"bd8f3fac-91d3-4173-88b0-fc5ed9d189b3","html_url":"https://github.com/beniaminogreen/zoomerjoin","commit_stats":null,"previous_names":[],"tags_count":1,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/beniaminogreen%2Fzoomerjoin","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/beniaminogreen%2Fzoomerjoin/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/beniaminogreen%2Fzoomerjoin/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/beniaminogreen%2Fzoom
erjoin/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/beniaminogreen","download_url":"https://codeload.github.com/beniaminogreen/zoomerjoin/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":228052661,"owners_count":17862105,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["blazinglyfast","fuzzyjoin","join","r","r-package","rstats","rust","zoomer"],"created_at":"2024-08-13T07:04:52.710Z","updated_at":"2024-12-04T05:31:14.396Z","avatar_url":"https://github.com/beniaminogreen.png","language":"R","funding_links":[],"categories":["R"],"sub_categories":[],"readme":"---\noutput: github_document\nalways_allow_html: true\n---\n\n```{r, include=F}\nlibrary(tidyverse)\nlibrary(microbenchmark)\nlibrary(fuzzyjoin)\n\n# rextendr::document()\ndevtools::load_all()\n```\n\n\n# zoomerjoin \u003cimg src='man/figures/logo.png' align=\"right\" height=\"139\" /\u003e\n\u003c!-- badges: start --\u003e\n[![DOI](https://joss.theoj.org/papers/10.21105/joss.05693/status.svg)](https://doi.org/10.21105/joss.05693)\n[![Lifecycle: experimental](https://img.shields.io/badge/lifecycle-experimental-orange.svg)](https://lifecycle.r-lib.org/articles/stages.html#experimental)\n[![Codecov test coverage](https://codecov.io/gh/beniaminogreen/zoomerjoin/branch/main/graph/badge.svg)](https://app.codecov.io/gh/beniaminogreen/zoomerjoin?branch=main)\n\u003c!-- badges: end --\u003e\n\nzoomerjoin is an R package that empowers you to fuzzy-join massive datasets\nrapidly, and with little memory consumption. 
It is powered by high-performance\nimplementations of [Locality Sensitive\nHashing](https://en.wikipedia.org/wiki/Locality-sensitive_hashing), an\nalgorithm that finds the matching records between two datasets without having to\ncompare all possible pairs of observations. In practice, this means zoomerjoin\ncan fuzzily join datasets days, or even years, faster than other matching\npackages. zoomerjoin has been used in production to join datasets of hundreds\nof millions of names or vectors in a matter of hours.\n\n## Installation\n\n### Installing from CRAN:\n\nYou can install zoomerjoin from CRAN as you would any other package. Please be\naware that you will need Cargo (the Rust build tool) and the Rust compiler installed to build\nthe package from source.\n\n```r\ninstall.packages(\"zoomerjoin\")\n```\n\n\n### Installing Rust\n\nIf no pre-built binary is available for your operating system or version of R, you must have the\n[Rust compiler](https://www.rust-lang.org/tools/install) installed to compile\nthis package from source. After the package is compiled, Rust is no longer\nrequired, and can be safely uninstalled.\n\n#### Installing Rust on Linux or Mac:\n\nTo install Rust on Linux or Mac, you can simply run the following snippet in\nyour terminal.\n\n``` sh\ncurl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh\n```\n\n#### Installing Rust on Windows:\n\nTo install Rust on Windows, you can use the Rust installation wizard,\n`rustup-init.exe`, found [at this\nsite](https://forge.rust-lang.org/infra/other-installation-methods.html).\nDepending on your version of Windows, you may see an error that looks something like this:\n\n```\nerror: toolchain 'stable-x86_64-pc-windows-gnu' is not installed\n```\n\nIn this case, you should run `rustup install stable-x86_64-pc-windows-gnu` to\ninstall the missing toolchain. 
If you're missing another toolchain, simply substitute\nits name for the toolchain name in the command above.\n\n### Installing the Package from GitHub:\n\nOnce you have installed Rust, you should be able to install the package\nwith the `install.packages` function as above, with the\n`install_github` function from the `devtools` package, or with the `pkg_install`\nfunction from the `pak` package.\n\n``` r\n## Install with devtools\n# install.packages(\"devtools\")\ndevtools::install_github(\"beniaminogreen/zoomerjoin\")\n\n## Install with pak\n# install.packages(\"pak\")\npak::pkg_install(\"beniaminogreen/zoomerjoin\")\n```\n\n### Loading The Package\n\nOnce the package is installed, you can load it into memory as usual by typing:\n\n```{r, warning = FALSE, message = FALSE}\nlibrary(zoomerjoin)\n```\n\n## Usage:\n\nThe flagship features of zoomerjoin are the `jaccard_join` and `euclidean_join` families\nof functions, which are designed to be near drop-ins for the corresponding\ndplyr/fuzzyjoin commands:\n\n* `jaccard_left_join()`\n* `jaccard_right_join()`\n* `jaccard_inner_join()`\n* `jaccard_full_join()`\n* `euclidean_left_join()`\n* `euclidean_right_join()`\n* `euclidean_inner_join()`\n* `euclidean_full_join()`\n\nThe `jaccard_join` family of functions provides fast fuzzy-joins for strings\nusing the Jaccard distance, while the `euclidean_join` family provides\nfuzzy-joins for points or vectors using the Euclidean distance.\n\n### Example: Joining rows of the Database on Ideology, Money in Politics, and Elections (DIME)\n\nHere's a snippet showing how to use `jaccard_inner_join()` to merge two\nlists of political donors in the [Database on Ideology, Money in Politics,\nand Elections (DIME)](https://data.stanford.edu/dime). 
You can see a more\ndetailed version of this example in the [introductory vignette](https://beniamino.org/zoomerjoin/articles/guided_tour.html).\n\nI start with two corpuses I would like to combine, `corpus_1`:\n\n```{r}\ncorpus_1 \u003c- dime_data %\u003e%\n  head(500)\nnames(corpus_1) \u003c- c(\"a\", \"field\")\ncorpus_1\n```\n\nAnd `corpus_2`:\n\n```{r}\ncorpus_2 \u003c- dime_data %\u003e%\n  tail(500)\nnames(corpus_2) \u003c- c(\"b\", \"field\")\ncorpus_2\n```\n\nBoth corpuses have an observation ID column and a donor name column. We would\nlike to join the two datasets on the donor name column, but the two can't be\ndirectly joined because of misspellings. Because of this, we will use the\n`jaccard_inner_join()` function to fuzzily join the two on the donor name column.\n\nImportantly, Locality Sensitive Hashing is a [probabilistic\nalgorithm](https://en.wikipedia.org/wiki/Randomized_algorithm), so it may fail\nto identify some matches by random chance. I adjust the hyperparameters\n`n_bands` and `band_width` until the chance of true matches being dropped is\nnegligible. By default, the package will issue a warning if the chance of a\ntrue match being discovered is less than 95%. You can use the\n`jaccard_probability` and `jaccard_hyper_grid_search` functions to help understand the\nprobability that any true matches will be discarded by the algorithm.\n\nMore details and a more thorough description of how to tune the hyperparameters\ncan be found in the [guided tour\nvignette](https://beniamino.org/zoomerjoin/articles/guided_tour.html).\n\n```{r}\nset.seed(1)\nstart_time \u003c- Sys.time()\njoin_out \u003c- jaccard_inner_join(corpus_1, corpus_2, n_gram_width = 6, n_bands = 20, band_width = 6)\nprint(Sys.time() - start_time)\nprint(join_out)\n```\n\nZoomerjoin is able to quickly find the matching records without comparing all\npairs of records. 
This saves more and more time as the size of each list\nincreases, so it can scale to join datasets with millions or hundreds of\nmillions of rows.\n\n# Contributing\n\nThanks for your interest in contributing to Zoomerjoin!\n\nI am using a GitHub-centric workflow to manage the package. You can file a bug report, request a new feature, or ask a question about the package by [filing\nan issue on the issues page](https://github.com/beniaminogreen/zoomerjoin/issues), where you will also\nfind a range of templates to help you out. If you'd like to make changes to the code, you can write and file a [pull\nrequest](https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/proposing-changes-to-your-work-with-pull-requests/about-pull-requests)\non [this page](https://github.com/beniaminogreen/zoomerjoin/pulls). I'll try to\nrespond to all of these in a timely manner (within a week), although\noccasionally I may take longer to respond to a complicated question or issue.\n\nPlease also be aware of the [contributor code of\nconduct](https://github.com/beniaminogreen/zoomerjoin/blob/main/CONTRIBUTING.md)\nfor contributing to the repository.\n\n## Acknowledgments:\n\n\nThe Zoomerjoin logo was made using [this SQL join\nillustration](https://commons.wikimedia.org/wiki/File:SQL_Join_-_08_A_Cross_Join_B.svg)\nby [GermanX](https://commons.wikimedia.org/wiki/User:GermanX) and [this speed\nlimit sign](https://commons.wikimedia.org/wiki/File:Speed_limit_75_sign.svg) from the\nFederal Highway Administration - MUTCD.\n\n## References:\n\nBonica, Adam. 2016. Database on Ideology, Money in Politics, and Elections: Public version 2.0 [Computer file]. Stanford, CA: Stanford University Libraries.\n\nJure Leskovec, Anand Rajaraman, and Jeffrey David Ullman. 2014. Mining of Massive Datasets (2nd. ed.). Cambridge University Press, USA.\n\nBroder, Andrei Z. (1997), \"On the resemblance and containment of documents\", Compression and Complexity of Sequences: Proceedings. 
Positano, Salerno, Italy\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fbeniaminogreen%2Fzoomerjoin","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fbeniaminogreen%2Fzoomerjoin","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fbeniaminogreen%2Fzoomerjoin/lists"}