https://github.com/yougov/Fuzzy


Host: GitHub
URL: https://github.com/yougov/Fuzzy
Owner: yougov
License: mit
Created: 2017-06-04T23:34:38.000Z (about 9 years ago)
Default Branch: master
Last Pushed: 2023-07-24T16:55:00.000Z (almost 3 years ago)
Last Synced: 2025-03-10T06:35:18.434Z (over 1 year ago)
Language: C
Size: 106 KB
Stars: 50
Watchers: 7
Forks: 13
Open Issues: 14
Metadata Files:
- Readme: README.rst
- Changelog: CHANGES.rst
- License: LICENSE

Awesome Lists containing this project

awesome-python-data-science - Fuzzy - Soundex, NYSIIS, Double Metaphone. (Feature Extraction / Text/NLP)

README

.. image:: https://img.shields.io/pypi/v/Fuzzy.svg
   :target: https://pypi.org/project/Fuzzy

.. image:: https://img.shields.io/pypi/pyversions/Fuzzy.svg

.. image:: https://img.shields.io/travis/yougov/fuzzy/master.svg
   :target: http://travis-ci.org/yougov/fuzzy

Fuzzy is a Python library that provides fast implementations of common
phonetic algorithms. These are typically used in string similarity
exercises, but they're pretty versatile.
It uses C extensions (via Cython) for speed.

The algorithms are:

* Soundex
* NYSIIS
* Double Metaphone, based on Maurice Aubrey's C code from his Perl
  implementation.
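To give a feel for what a phonetic code is, here is a minimal pure-Python
sketch of the classic American Soundex rules. It is for illustration only:
``soundex_sketch`` is a hypothetical name, and the sketch is not Fuzzy's
C implementation and may differ from it on edge cases.

```python
# Digit classes used by American Soundex; vowels, H, W and Y get no digit.
_CODES = {}
for _letters, _digit in (("BFPV", "1"), ("CGJKQSXZ", "2"), ("DT", "3"),
                         ("L", "4"), ("MN", "5"), ("R", "6")):
    for _ch in _letters:
        _CODES[_ch] = _digit


def soundex_sketch(word, size=4):
    """Return a Soundex-style code of `size` characters (illustrative only)."""
    # Keep ASCII letters only; everything else is ignored.
    letters = [ch for ch in word.upper() if "A" <= ch <= "Z"]
    if not letters:
        return ""
    code = letters[0]
    previous = _CODES.get(letters[0], "")
    for ch in letters[1:]:
        digit = _CODES.get(ch, "")
        # Emit a digit only when it differs from the previous letter's digit.
        if digit and digit != previous:
            code += digit
        # H and W do not separate letters with the same digit; vowels do.
        if ch not in "HW":
            previous = digit
    # Pad with zeros, then truncate to the requested length.
    return (code + "0" * size)[:size]
```

With this sketch, ``soundex_sketch('Robert')`` and
``soundex_sketch('Rupert')`` both give ``'R163'``, which is the point of a
phonetic code: similar-sounding names collapse to the same key.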

Usage
=====

The functions are quite easy to use!

>>> import fuzzy
>>> soundex = fuzzy.Soundex(4)
>>> soundex('fuzzy')
'F200'
>>> dmeta = fuzzy.DMetaphone()
>>> dmeta('fuzzy')
['FS', None]
>>> fuzzy.nysiis('fuzzy')
'FASY'
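A common string similarity exercise is to bucket names by their phonetic
code. The sketch below shows one way to do that with any encoder callable.
``group_by_code`` and ``crude_key`` are hypothetical names, and
``crude_key`` is only a stand-in encoder so the example runs without Fuzzy
installed.

```python
from collections import defaultdict


def group_by_code(names, encode):
    """Map each phonetic code to the names that produce it."""
    groups = defaultdict(list)
    for name in names:
        groups[encode(name)].append(name)
    return dict(groups)


def crude_key(name):
    """Hypothetical stand-in encoder: first letter plus later consonants."""
    upper = name.upper()
    return upper[:1] + "".join(ch for ch in upper[1:] if ch not in "AEIOUY")


# "Smith" and "Smyth" share a key, so they land in the same bucket.
groups = group_by_code(["Smith", "Smyth", "Jones"], crude_key)
print(groups)
```

In practice the encoder would presumably be ``fuzzy.Soundex(4)`` or
``fuzzy.nysiis`` from the examples above. ``DMetaphone`` returns a list, so
its result would need converting to a tuple before it can serve as a
dictionary key.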

Performance
===========

In some recent testing, Fuzzy's Double Metaphone was ~10 times faster than
the pure Python implementation by Andrew Collins.
Soundex and NYSIIS should show similar speedups. Using IPython's timeit::

  In [3]: timeit soundex('fuzzy')
  1000000 loops, best of 3: 326 ns per loop

  In [4]: timeit dmeta('fuzzy')
  100000 loops, best of 3: 2.18 us per loop

  In [5]: timeit fuzzy.nysiis('fuzzy')
  100000 loops, best of 3: 13.7 us per loop
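Outside IPython, a comparable measurement can be sketched with the standard
library's ``timeit`` module. ``best_ns_per_call`` is a hypothetical helper,
and ``str.upper`` is only a stand-in for the function under test so the
sketch runs without Fuzzy installed. The wrapping lambda adds a small
overhead, so numbers will not match IPython's exactly.

```python
import timeit


def best_ns_per_call(func, arg, number=100_000, repeat=3):
    """Return the best observed time per call, in nanoseconds."""
    # Each entry in `timings` is the total seconds for `number` calls.
    timings = timeit.repeat(lambda: func(arg), number=number, repeat=repeat)
    return min(timings) / number * 1e9


# str.upper is a stand-in; swap in the encoder you want to measure.
print(f"{best_ns_per_call(str.upper, 'fuzzy'):.0f} ns per call")
```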

Distance Metrics
================

We recommend the Python-Levenshtein module for fast, C-based string
distance/similarity metrics. Among other functions, it includes:

* Levenshtein edit distance
* Jaro distance
* Jaro-Winkler distance
* Hamming distance

In testing, it has been several times faster than comparable pure Python
implementations of those algorithms.
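To illustrate what two of these metrics compute, here is a small pure-Python
reference sketch. It is not how the C module is implemented, the function
names are hypothetical, and it will be much slower; it is meant only to make
the definitions concrete.

```python
def hamming(a, b):
    """Count positions at which two equal-length strings differ."""
    if len(a) != len(b):
        raise ValueError("Hamming distance needs strings of equal length")
    return sum(x != y for x, y in zip(a, b))


def levenshtein(a, b):
    """Minimum number of single-character insertions, deletions, substitutions."""
    # previous[j] holds the distance between the processed prefix of a and b[:j].
    previous = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        current = [i]
        for j, y in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,             # delete x
                current[j - 1] + 1,          # insert y
                previous[j - 1] + (x != y),  # substitute, free if equal
            ))
        previous = current
    return previous[-1]
```

For example, ``levenshtein('kitten', 'sitting')`` is 3 and
``hamming('karolin', 'kathrin')`` is 3.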
