An open API service indexing awesome lists of open source software.

https://github.com/OlivierBinette/StringCompare

Efficient String Comparison Functions and Fuzzy String Matching
https://github.com/OlivierBinette/StringCompare

damerau-levenshtein edit-distance fuzzy-matching jaro-winkler levenshtein-distance pybind11 python string-matching

Last synced: about 2 months ago
JSON representation

Efficient String Comparison Functions and Fuzzy String Matching

Awesome Lists containing this project

README

          

{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
" \n",
"[![Python package](https://github.com/OlivierBinette/StringCompare/actions/workflows/python-package-conda.yml/badge.svg)](https://github.com/OlivierBinette/StringCompare/actions/workflows/python-package-conda.yml) \n",
"[![codecov](https://codecov.io/gh/OlivierBinette/StringCompare/branch/main/graph/badge.svg?token=F8ASD5R051)](https://codecov.io/gh/OlivierBinette/StringCompare)\n",
"[![CodeFactor](https://www.codefactor.io/repository/github/olivierbinette/stringcompare/badge)](https://www.codefactor.io/repository/github/olivierbinette/stringcompare)\n",
"[![Lifecycle Maturing](https://img.shields.io/badge/lifecycle-maturing-blue.svg)](https://lifecycle.r-lib.org/articles/stages.html)\n",
"[![Release version](https://img.shields.io/github/v/release/olivierbinette/stringcompare)](https://github.com/OlivierBinette/StringCompare/releases) \n",
"[![Sponsors](https://img.shields.io/github/sponsors/OlivierBinette)](https://github.com/sponsors/OlivierBinette) \n",
"\n",
" \n",
"# ⚡ **StringCompare**: Efficient String Comparison Functions\n",
"\n",
"**StringCompare** is a Python package for efficient string similarity computation and approximate string matching. It is inspired by the excellent [*comparator*](https://github.com/ngmarchant/comparator) and [*stringdist*](https://github.com/markvanderloo/stringdist) R packages, and from the equally excellent [*py_stringmatching*](https://github.com/anhaidgroup/py_stringmatching), [*jellyfish*](https://github.com/jamesturk/jellyfish), and [*textdistance*](https://github.com/life4/textdistance) Python packages.\n",
"\n",
"The key feature of **StringCompare** is a focus on speed, extensibility and maintainability through its [*pybind11* ](https://github.com/pybind/pybind11) C++ implementation. **StringCompare** is faster than most other Python libraries (see benchmark below) and much more memory efficient when dealing with long strings.\n",
"\n",
"The [complete API documentation](https://olivierbinette.github.io/StringCompare/source/stringcompare.html) is available on the project website [olivierbinette.github.io/StringCompare](https://olivierbinette.github.io/StringCompare)."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Installation\n",
"\n",
"Install the released version from github using the following commands:\n",
"\n",
"```bash\n",
" pip install git+https://github.com/OlivierBinette/StringCompare.git@release\n",
"```"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Project Roadmap\n",
"\n",
"**StringCompare** currently implements [edit distances](https://en.wikipedia.org/wiki/Edit_distance) and similarity functions, such as the Levenshtein, Damerau-Levenshtein, Jaro, and Jaro-Winkler distances. This is *stage 1* of the following development roadmap: \n",
"\n",
"| Stage | Goals | Status|\n",
"| :-------------: | ------------- | :-------------: |\n",
"| 1 | pybind11 framework and edit-based distances (Levenshtein, Damerau-Levenshtein, Jaro, and Jaro-Winkler) | ✔️ |\n",
"| 2 | Token-based and hybrid distances (tf-idf similarity, LSH, Monge-Elkan, ...) | ... |\n",
"| 3 | Vocabulary optimizations and metric trees | ... |\n",
"| 4 | Embeddings and string distance learning | ... |\n",
"\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Examples\n",
"\n",
"Comparison algorithms are instanciated as `Comparator` object, which provides the `compare()` method (equivalent to `__call__()`) for string comparison."
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"0.14285714285714285"
]
},
"execution_count": 1,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"from stringcompare import Levenshtein, Jaro, JaroWinkler, DamerauLevenshtein, LCSDistance\n",
"\n",
"lev = Levenshtein(normalize=True, similarity=False)\n",
"\n",
"lev(\"Olivier\", \"Oliver\") # same as lev.compare(\"Olivier\", \"Oliver\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Comparator objects also provide the `elementwise()` function for elementwise comparison between lists"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"array([0.14285714, 0.26666667])"
]
},
"execution_count": 2,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"lev.elementwise([\"Olivier\", \"Olivier\"], [\"Oliver\", \"Olivia\"])"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"and the `pairwise()` function for pairwise comparisons."
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"array([[0. , 0.26666667],\n",
" [0.14285714, 0.28571429]])"
]
},
"execution_count": 3,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"lev.pairwise([\"Olivier\", \"Oliver\"], [\"Olivier\", \"Olivia\"])"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Benchmark\n",
"\n",
"Comparison of the Damerau-Levenshtein implementation speed for different Python packages, when comparing the strings \"Olivier Binette\" and \"Oilvier Benet\":"
]
},
{
"cell_type": "code",
"execution_count": 11,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Package avg runtime (ns)\n",
"------------- ------------------\n",
"StringCompare 746.446\n",
"jellyfish 997.866\n",
"textdistance 4205.98\n"
]
}
],
"source": [
"from timeit import timeit\n",
"from tabulate import tabulate\n",
"\n",
"# Comparison functions\n",
"from stringcompare import DamerauLevenshtein\n",
"cmp = DamerauLevenshtein()\n",
"from jellyfish import damerau_levenshtein_distance\n",
"from textdistance import damerau_levenshtein\n",
"\n",
"functions = {\n",
" \"StringCompare\": cmp.compare,\n",
" \"jellyfish\": damerau_levenshtein_distance,\n",
" \"textdistance\": damerau_levenshtein,\n",
"}\n",
"\n",
"table = [\n",
" [name, timeit(lambda: fun(\"Olivier Binette\", \"Oilvier Benet\"), number=1000000) * 1000]\n",
" for name, fun in functions.items()\n",
"]\n",
"print(tabulate(table, headers=[\"Package\", \"avg runtime (ns)\"]))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Performance notes"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The use of pybind11 comes with a small performance overhead. We could be faster if we directly interfaced with CPython.\n",
"\n",
"However, the use of pybind11 allows the library to be easily extensible and maintainable. The C++ implementation has little to worry about Python, excepted for the use of a pybind11 numpy wrapper in some places. Pybind11 takes care of the details of exposing the C++ API to Python."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Known Bugs\n",
"\n",
"*pybind11* has compatibility issues with gcc 11 (e.g. on Ubuntu 21.10). If running Linux and `gcc --version` is 11, then use the following commands to configure your environment before (re)installing:\n",
"```bash\n",
" sudo apt install g++-9 gcc-9\n",
" export CC=gcc-9 CXX=g++-9\n",
"```\n",
"If this is unsuccessful, you might want to use **StringCompare** within a [Docker](https://www.docker.com/) container. I recommend using the python:3.7.9 base image. For example, after installing docker, you can launch an interactive bash session and install **StringCompare** as follows:\n",
"```bash\n",
" sudo docker run -it python:3.7.9 bash\n",
" pip install git+https://github.com/OlivierBinette/StringCompare.git\n",
" python\n",
" >>> import stringcompare\n",
"```\n",
"\n",
"Please report installation issues [here](https://github.com/OlivierBinette/StringCompare/issues)."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Contribute\n",
"\n",
"**StringCompare** is currently in early development stage and contributions are welcome! See the [contributing](https://olivierbinette.github.io/StringCompare/contributing.html) page for more information. "
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Acknowledgements\n",
"\n",
"This project is made possible by the support of the [Natural Sciences and Engineering Research Council of Canada (NSERC)](www.nserc-crsng.gc.ca) and by the support of a [G-Research](https://www.gresearch.co.uk/) grant.\n",
"\n",
"\n",
"\n",
"I would also like to thank the support of my individual [Github sponsors](https://github.com/sponsors/olivierbinette)."
]
}
],
"metadata": {
"interpreter": {
"hash": "b582ae4d77d18d658cc55812e32328158e2f45884933450b1021a6ea5c0413ef"
},
"kernelspec": {
"display_name": "Python 3.9.7 64-bit ('groupbyrule': conda)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.7.9"
},
"orig_nbformat": 4
},
"nbformat": 4,
"nbformat_minor": 2
}