https://github.com/rapidfuzz/JaroWinkler
Python library for fast approximate string matching using Jaro and Jaro-Winkler similarity
- Host: GitHub
- URL: https://github.com/rapidfuzz/JaroWinkler
- Owner: rapidfuzz
- License: mit
- Created: 2022-01-07T16:58:07.000Z (almost 3 years ago)
- Default Branch: main
- Last Pushed: 2024-01-08T23:02:00.000Z (10 months ago)
- Last Synced: 2024-06-28T05:38:02.139Z (5 months ago)
- Topics: cpp, hacktoberfest, jaro, jaro-winkler, python, string-comparison, string-matching, string-similarity
- Language: Python
- Homepage:
- Size: 105 KB
- Stars: 58
- Watchers: 4
- Forks: 5
- Open Issues: 0
Metadata Files:
- Readme: README.md
- Changelog: CHANGELOG.md
- Funding: .github/FUNDING.yml
- License: LICENSE
- Security: SECURITY.md
# JaroWinkler
JaroWinkler is a library for calculating the Jaro and Jaro-Winkler similarity. It is easy to use, far more performant than all alternatives, and designed to integrate seamlessly with RapidFuzz.
## ⚡ Quickstart
```python
>>> from jarowinkler import *
>>> jaro_similarity("Johnathan", "Jonathan")
0.8796296296296297
>>> jarowinkler_similarity("Johnathan", "Jonathan")
0.9037037037037037
```

## 🚀 Benchmarks
The implementation is based on a novel approach that calculates the Jaro-Winkler similarity using bit-parallelism. This is significantly faster than the original approach used in other libraries. The project's benchmark compares its performance with jellyfish and python-Levenshtein.
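For comparison, the classic (non-bit-parallel) way of computing these scores can be sketched in pure Python. This is a minimal reference sketch, not the library's implementation: the names `jaro_reference` and `jaro_winkler_reference`, the 0.7 boost threshold, and the choice to score two empty sequences as 1.0 are assumptions made for illustration.

```python
from typing import Hashable, Sequence


def jaro_reference(s1: Sequence[Hashable], s2: Sequence[Hashable]) -> float:
    """Classic Jaro similarity with nested loops; a reference sketch only."""
    len1, len2 = len(s1), len(s2)
    if len1 == 0 and len2 == 0:
        return 1.0  # convention chosen for this sketch
    if len1 == 0 or len2 == 0:
        return 0.0

    # Elements match only if they are equal and at most `window` positions apart.
    window = max(max(len1, len2) // 2 - 1, 0)
    matched1 = [False] * len1
    matched2 = [False] * len2
    matches = 0
    for i, item in enumerate(s1):
        lo = max(0, i - window)
        hi = min(len2, i + window + 1)
        for j in range(lo, hi):
            if not matched2[j] and s2[j] == item:
                matched1[i] = matched2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0

    # Matched elements that line up in a different order count as half a
    # transposition each.
    seq1 = [s1[i] for i in range(len1) if matched1[i]]
    seq2 = [s2[j] for j in range(len2) if matched2[j]]
    transpositions = sum(a != b for a, b in zip(seq1, seq2)) // 2

    return (
        matches / len1
        + matches / len2
        + (matches - transpositions) / matches
    ) / 3


def jaro_winkler_reference(
    s1: Sequence[Hashable],
    s2: Sequence[Hashable],
    prefix_weight: float = 0.1,
    boost_threshold: float = 0.7,  # assumed threshold; implementations differ
) -> float:
    """Jaro-Winkler: boost the Jaro score for a shared prefix of up to 4 elements."""
    score = jaro_reference(s1, s2)
    if score <= boost_threshold:
        return score
    prefix = 0
    for a, b in zip(s1[:4], s2[:4]):
        if a != b:
            break
        prefix += 1
    return score + prefix * prefix_weight * (1.0 - score)
```

With these definitions, `jaro_reference("Johnathan", "Jonathan")` and `jaro_winkler_reference("Johnathan", "Jonathan")` should agree with the Quickstart values up to floating-point rounding. The nested loops do work proportional to the sequence length times the match window, which is the cost a bit-parallel implementation is designed to reduce.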
## ⚙️ Installation
You can install this library from [PyPI](https://pypi.org/project/jarowinkler/) with pip:
```
pip install jarowinkler
```
JaroWinkler provides binary wheels for all common platforms.

### Source builds
For a source build (for example, from an sdist package), you only need a C++14-compatible compiler. You can also install directly from GitHub if you like.
```
pip install git+https://github.com/rapidfuzz/JaroWinkler.git@main
```

## 📖 Usage
The algorithms in JaroWinkler work not only with strings but with arbitrary sequences of hashable objects:
```python
from jarowinkler import jarowinkler_similarity

jarowinkler_similarity("this is an example".split(), ["this", "is", "a", "example"])
# 0.8666666666666667
```

As long as two objects have the same hash, they are treated as equal. You can provide a `__hash__` method for your own object instances.
```python
class MyObject:
    def __init__(self, hash):
        self.hash = hash

    def __hash__(self):
        return self.hash

jarowinkler_similarity([MyObject(1), MyObject(2)], [MyObject(1), MyObject(2), MyObject(3)])
# 0.9111111111111111
```

All algorithms accept a `score_cutoff` parameter for filtering out bad matches: a similarity below the cutoff is returned as 0.0. Internally, this lets JaroWinkler select faster implementations in some places:
```python
jaro_similarity("Johnathan", "Jonathan", score_cutoff=0.9)
# 0.0

jaro_similarity("Johnathan", "Jonathan", score_cutoff=0.85)
# 0.8796296296296297
```

JaroWinkler can be used with RapidFuzz, which provides multiple methods for computing string metrics on collections of inputs. JaroWinkler implements the RapidFuzz C-API, which allows RapidFuzz to call its functions without the usual overhead of Python, making this even faster.
```python
from jarowinkler import jarowinkler_similarity
from rapidfuzz import process

process.cdist(["Johnathan", "Jonathan"], ["Johnathan", "Jonathan"], scorer=jarowinkler_similarity)
# array([[1.       , 0.9037037],
#        [0.9037037, 1.       ]], dtype=float32)
```

## 👍 Contributing
PRs are welcome!
- Found a bug? Report it as an [issue](https://github.com/rapidfuzz/JaroWinkler/issues) or, even better, fix it!
- Can make something faster? Great! Just avoid external dependencies and remember that existing functionality should still work.
- Something else that you think is good? Do it! Just make sure that CI passes and everything from the README is still applicable (interface, features, and so on).
- Have no time to code? Tell your friends and subscribers about JaroWinkler. More users, more contributions, more amazing features.

Thank you :heart:
## ⚠️ License
Copyright 2021 - present [maxbachmann](https://github.com/maxbachmann). `JaroWinkler` is free and open-source software licensed under the [MIT License](https://github.com/rapidfuzz/JaroWinkler/blob/main/LICENSE).