https://github.com/yougov/Fuzzy
- Host: GitHub
- URL: https://github.com/yougov/Fuzzy
- Owner: yougov
- License: mit
- Created: 2017-06-04T23:34:38.000Z (over 7 years ago)
- Default Branch: master
- Last Pushed: 2023-07-24T16:55:00.000Z (over 1 year ago)
- Last Synced: 2024-09-19T01:12:07.523Z (about 2 months ago)
- Language: C
- Size: 106 KB
- Stars: 50
- Watchers: 8
- Forks: 11
- Open Issues: 13
Metadata Files:
- Readme: README.rst
- Changelog: CHANGES.rst
- License: LICENSE
Awesome Lists containing this project
- awesome-python-data-science - Fuzzy - Soundex, NYSIIS, Double Metaphone. (Feature Extraction / Text/NLP)
README
.. image:: https://img.shields.io/pypi/v/Fuzzy.svg
   :target: https://pypi.org/project/Fuzzy

.. image:: https://img.shields.io/pypi/pyversions/Fuzzy.svg

.. image:: https://img.shields.io/travis/yougov/fuzzy/master.svg
   :target: http://travis-ci.org/yougov/fuzzy

Fuzzy is a Python library implementing common phonetic algorithms quickly.
These are typically used in string similarity exercises, but they're pretty versatile.

It uses C Extensions (via Cython) for speed.

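To give a feel for what a phonetic algorithm computes, here is a minimal pure-Python sketch of the textbook American Soundex rules. It is not Fuzzy's C implementation, and its output may differ from the library's on edge cases. The function name ``soundex`` and the ``size`` parameter are illustrative choices; ``size`` is assumed to play the same role as the key length passed to ``fuzzy.Soundex(4)``.

```python
# Illustrative sketch only: textbook American Soundex, not Fuzzy's C code.
_CODES = {}
for _letters, _digit in (
    ("BFPV", "1"),
    ("CGJKQSXZ", "2"),
    ("DT", "3"),
    ("L", "4"),
    ("MN", "5"),
    ("R", "6"),
):
    for _ch in _letters:
        _CODES[_ch] = _digit


def soundex(word: str, size: int = 4) -> str:
    """Return a Soundex-style key of exactly `size` characters."""
    letters = [c for c in word.upper() if c.isalpha()]
    if not letters:
        return ""
    key = letters[0]                     # the first letter is kept as-is
    prev = _CODES.get(letters[0], "")    # its code still suppresses a repeat
    for ch in letters[1:]:
        if ch in "HW":
            continue                     # H and W do not separate equal codes
        code = _CODES.get(ch, "")        # vowels and Y map to the empty code
        if code and code != prev:
            key += code
        prev = code                      # a vowel resets the "previous" code
    return (key + "0" * size)[:size]     # pad with zeros, then truncate
```

With this sketch, ``soundex('Robert')`` and ``soundex('Rupert')`` both yield ``'R163'``, which is the kind of collision that makes phonetic keys useful for matching similar-sounding names.
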
The algorithms are:

* Soundex
* NYSIIS
* Double Metaphone, based on Maurice Aubrey's C code from his Perl
  implementation.

Usage
=====

The functions are quite easy to use!

>>> import fuzzy
>>> soundex = fuzzy.Soundex(4)
>>> soundex('fuzzy')
'F200'
>>> dmeta = fuzzy.DMetaphone()
>>> dmeta('fuzzy')
['FS', None]
>>> fuzzy.nysiis('fuzzy')
'FASY'

Performance
===========

Fuzzy's Double Metaphone was ~10 times faster than the pure Python
implementation by Andrew Collins in some recent testing.
Soundex and NYSIIS should see similar speedups. Using IPython's timeit::

    In [3]: timeit soundex('fuzzy')
    1000000 loops, best of 3: 326 ns per loop

    In [4]: timeit dmeta('fuzzy')
    100000 loops, best of 3: 2.18 us per loop

    In [5]: timeit fuzzy.nysiis('fuzzy')
    100000 loops, best of 3: 13.7 us per loop

Distance Metrics
================

We recommend the Python-Levenshtein
module for fast, C-based string distance/similarity metrics. Among other
functions, it includes:

* Levenshtein edit distance
* Jaro distance
* Jaro-Winkler distance
* Hamming distance

In testing it's been several times faster than comparable pure Python
implementations of those algorithms.
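
For reference, two of the metrics listed above can be sketched in pure Python as follows. This only illustrates what the metrics measure. It does not use or reproduce the Python-Levenshtein API, the function names ``levenshtein`` and ``hamming`` are hypothetical, and a C-based module would be expected to run considerably faster.

```python
# Illustrative pure-Python sketches; not the Python-Levenshtein module's API.
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character inserts, deletes, or substitutions."""
    if len(a) < len(b):
        a, b = b, a                      # keep the working row as short as possible
    previous = list(range(len(b) + 1))   # distances from "" to each prefix of b
    for i, ca in enumerate(a, start=1):
        current = [i]                    # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,               # delete ca
                current[j - 1] + 1,            # insert cb
                previous[j - 1] + (ca != cb),  # substitute (free on a match)
            ))
        previous = current
    return previous[-1]


def hamming(a: str, b: str) -> int:
    """Number of positions at which two equal-length strings differ."""
    if len(a) != len(b):
        raise ValueError("hamming distance requires strings of equal length")
    return sum(x != y for x, y in zip(a, b))
```

The Levenshtein sketch keeps only two rows of the dynamic-programming table at a time, so memory use grows with the shorter string rather than with the product of both lengths.
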