https://github.com/ngmarchant/comparator
Similarity and distance measures for clustering and record linkage applications in R
clustering distance-measures distance-metrics entity-resolution r-package record-linkage similarity-measures string-similarity
- Host: GitHub
- URL: https://github.com/ngmarchant/comparator
- Owner: ngmarchant
- License: gpl-2.0
- Created: 2020-11-28T06:17:15.000Z (over 5 years ago)
- Default Branch: main
- Last Pushed: 2025-09-23T01:56:14.000Z (6 months ago)
- Last Synced: 2025-12-09T06:10:53.077Z (3 months ago)
- Topics: clustering, distance-measures, distance-metrics, entity-resolution, r-package, record-linkage, similarity-measures, string-similarity
- Language: R
- Homepage:
- Size: 282 KB
- Stars: 18
- Watchers: 3
- Forks: 0
- Open Issues: 3
Metadata Files:
- Readme: README.Rmd
- License: LICENSE
Awesome Lists containing this project
- awesome-entity-resolution - Comparator - Efficient string comparison functions in R. (Open-Source Software / String Comparison)
README
---
output: github_document
---
```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  fig.path = "man/figures/README-",
  out.width = "100%"
)
library(comparator)
```
# comparator: Comparison Functions for Clustering and Record Linkage
comparator implements comparison functions for clustering and record linkage
applications. It includes functions for comparing strings, sequences and
numeric vectors. Where possible, comparators are implemented in C/C++ to
ensure fast performance.
## Supported comparators
### String comparators:
#### Edit-based:
* `Levenshtein()`: Levenshtein distance/similarity
* `DamerauLevenshtein()`: Damerau-Levenshtein distance/similarity
* `Hamming()`: Hamming distance/similarity
* `OSA()`: Optimal String Alignment distance/similarity
* `LCS()`: Longest Common Subsequence distance/similarity
* `Jaro()`: Jaro distance/similarity
* `JaroWinkler()`: Jaro-Winkler distance/similarity
#### Token-based:
Not yet implemented.
#### Hybrid token-character:
* `MongeElkan()`: Monge-Elkan similarity
* `FuzzyTokenSet()`: Fuzzy Token Set distance
#### Other:
* `InVocabulary()`: Compares strings using a reference vocabulary. Useful for
  comparing names.
* `Lookup()`: Retrieves distances/similarities from a lookup table
* `BinaryComp()`: Compares strings based on whether they agree/disagree
  exactly.
### Numeric comparators:
* `Euclidean()`: Euclidean (L-2) distance
* `Manhattan()`: Manhattan (L-1) distance
* `Chebyshev()`: Chebyshev (L-∞) distance
* `Minkowski()`: Minkowski (L-p) distance
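As a reminder of what these four distances compute, here is a minimal base-R
sketch that does not use the package. The helper `lp_distance()` is
hypothetical and written only for illustration; it is not part of comparator's
API and says nothing about how the package implements these comparators.

```r
# Hypothetical helper (not part of comparator): L-p distance between two
# numeric vectors of equal length. p = Inf gives the Chebyshev distance.
lp_distance <- function(x, y, p = 2) {
  stopifnot(length(x) == length(y), p >= 1)
  d <- abs(x - y)
  if (is.infinite(p)) max(d) else sum(d^p)^(1 / p)
}

x <- c(1, 2, 3)
y <- c(4, 6, 3)

lp_distance(x, y, p = 1)    # Manhattan: 3 + 4 + 0 = 7
lp_distance(x, y, p = 2)    # Euclidean: sqrt(9 + 16) = 5
lp_distance(x, y, p = Inf)  # Chebyshev: max(3, 4, 0) = 4
lp_distance(x, y, p = 3)    # Minkowski with p = 3
```

Manhattan, Euclidean and Chebyshev are the special cases p = 1, p = 2 and
p = ∞ of the Minkowski distance.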
## Installation
You can install the latest release from [CRAN](https://CRAN.R-project.org)
by entering:
``` r
install.packages("comparator")
```
The development version can be installed from GitHub using `devtools`:
``` r
# install.packages("devtools")
devtools::install_github("ngmarchant/comparator")
```
## Example
A comparator is instantiated by calling its constructor function.
For example, we can instantiate a Levenshtein similarity comparator that
ignores differences in upper/lowercase characters as follows:
```{r lev}
comparator <- Levenshtein(similarity = TRUE, normalize = TRUE, ignore_case = TRUE)
```
We can apply the comparator to character vectors element-wise as follows:
```{r elementwise-str}
x <- c("John Doe", "Jane Doe")
y <- c("jonathon doe", "jane doe")
elementwise(comparator, x, y)
# shorthand for above
comparator(x, y)
```
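To see the edit distances behind these scores before any normalization or
conversion to a similarity, we can cross-check the same pairs with base R's
`utils::adist()`. This is an independent sketch, not comparator's
implementation; lowercasing with `tolower()` is assumed here as a stand-in for
`ignore_case = TRUE`.

```r
x <- c("John Doe", "Jane Doe")
y <- c("jonathon doe", "jane doe")

# Raw Levenshtein distance for each pair (x[i], y[i]), ignoring case
raw_dist <- mapply(
  function(a, b) utils::adist(a, b)[1, 1],
  tolower(x), tolower(y),
  USE.NAMES = FALSE
)
raw_dist
#> [1] 4 0
```

The first pair differs by four insertions ("john" to "jonathon"), and the
second pair is identical once case is ignored.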
This comparator is also defined on sequences:
```{r elementwise-seq}
x_seq <- list(c(1, 2, 1, 1), c(1, 2, 3, 4))
y_seq <- list(c(4, 3, 2, 1), c(1, 2, 3, 1))
elementwise(comparator, x_seq, y_seq)
# shorthand for above
comparator(x_seq, y_seq)
```
Pairwise comparisons are also supported using the following syntax:
```{r pairwise}
# compare each string in x with each string in y and return a similarity matrix
pairwise(comparator, x, y, return_matrix = TRUE)
# compare the strings in x pairwise and return a similarity matrix
pairwise(comparator, x, return_matrix = TRUE)
```
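The difference between element-wise and pairwise comparison is mainly one of
output shape. As a rough illustration that does not use the package, base R's
`utils::adist()` is itself pairwise: it returns a matrix with one row per
element of its first argument and one column per element of its second. The
sketch below works with raw edit distances rather than the normalized
similarities used above, and the variable names are chosen only for this
example.

```r
x <- c("John Doe", "Jane Doe")
y <- c("jonathon doe", "jane doe")

# Pairwise: every string in x against every string in y
d_xy <- utils::adist(tolower(x), tolower(y))
dim(d_xy)
#> [1] 2 2

# The diagonal holds the element-wise pairs (x[i], y[i])
diag(d_xy)
#> [1] 4 0

# Pairwise within x: a symmetric matrix with a zero diagonal
d_xx <- utils::adist(tolower(x))
isSymmetric(d_xx)
#> [1] TRUE
```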