{"id":13419515,"url":"https://github.com/life4/textdistance","last_synced_at":"2025-12-11T21:05:07.140Z","repository":{"id":37431185,"uuid":"90356012","full_name":"life4/textdistance","owner":"life4","description":"📐 Compute distance between sequences. 30+ algorithms, pure python implementation, common interface, optional external libs usage.","archived":false,"fork":false,"pushed_at":"2024-09-09T06:24:01.000Z","size":460,"stargazers_count":3459,"open_issues_count":9,"forks_count":252,"subscribers_count":64,"default_branch":"master","last_synced_at":"2025-04-10T04:53:48.422Z","etag":null,"topics":["algorithm","algorithms","damerau-levenshtein","damerau-levenshtein-distance","diff","distance","distance-calculation","hamming-distance","jellyfish","levenshtein","levenshtein-distance","python","textdistance"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":false,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/life4.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2017-05-05T08:46:10.000Z","updated_at":"2025-04-09T06:07:55.000Z","dependencies_parsed_at":"2023-12-26T17:28:15.128Z","dependency_job_id":"54668f0c-350c-4495-a788-e67a6ccf8365","html_url":"https://github.com/life4/textdistance","commit_stats":{"total_commits":339,"total_committers":16,"mean_commits":21.1875,"dds":0.2802359882005899,"last_synced_commit":"65c5e8416355a93b476d647945b2cafedc56af2a"},"previous_names":["orsinium/textdistance"],"tags_count":14,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/
life4%2Ftextdistance","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/life4%2Ftextdistance/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/life4%2Ftextdistance/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/life4%2Ftextdistance/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/life4","download_url":"https://codeload.github.com/life4/textdistance/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":254198514,"owners_count":22030965,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["algorithm","algorithms","damerau-levenshtein","damerau-levenshtein-distance","diff","distance","distance-calculation","hamming-distance","jellyfish","levenshtein","levenshtein-distance","python","textdistance"],"created_at":"2024-07-30T22:01:17.084Z","updated_at":"2025-12-11T21:05:02.101Z","avatar_url":"https://github.com/life4.png","language":"Python","funding_links":[],"categories":["Python","Data Processing","Similarity / Distance Measures","语言资源库","文本处理","文本数据和NLP","Text Processing","Open-Source Software"],"sub_categories":["Data Similarity","Hybird","python","String Comparison"],"readme":"# TextDistance\n\n![TextDistance logo](logo.png)\n\n[![Build Status](https://travis-ci.org/life4/textdistance.svg?branch=master)](https://travis-ci.org/life4/textdistance) [![PyPI version](https://img.shields.io/pypi/v/textdistance.svg)](https://pypi.python.org/pypi/textdistance) 
[![Status](https://img.shields.io/pypi/status/textdistance.svg)](https://pypi.python.org/pypi/textdistance) [![License](https://img.shields.io/pypi/l/textdistance.svg)](LICENSE)\n\n**TextDistance** -- python library for comparing distance between two or more sequences by many algorithms.\n\nFeatures:\n\n- 30+ algorithms\n- Pure python implementation\n- Simple usage\n- More than two sequences comparing\n- Some algorithms have more than one implementation in one class.\n- Optional numpy usage for maximum speed.\n\n## Algorithms\n\n### Edit based\n\n| Algorithm                                                                                 | Class                | Functions              |\n|-------------------------------------------------------------------------------------------|----------------------|------------------------|\n| [Hamming](https://en.wikipedia.org/wiki/Hamming_distance)                                 | `Hamming`            | `hamming`              |\n| [MLIPNS](http://www.sial.iias.spb.su/files/386-386-1-PB.pdf)                              | `MLIPNS`             | `mlipns`               |\n| [Levenshtein](https://en.wikipedia.org/wiki/Levenshtein_distance)                         | `Levenshtein`        | `levenshtein`          |\n| [Damerau-Levenshtein](https://en.wikipedia.org/wiki/Damerau%E2%80%93Levenshtein_distance) | `DamerauLevenshtein` | `damerau_levenshtein`  |\n| [Jaro-Winkler](https://en.wikipedia.org/wiki/Jaro%E2%80%93Winkler_distance)               | `JaroWinkler`        | `jaro_winkler`, `jaro` |\n| [Strcmp95](http://cpansearch.perl.org/src/SCW/Text-JaroWinkler-0.1/strcmp95.c)            | `StrCmp95`           | `strcmp95`             |\n| [Needleman-Wunsch](https://en.wikipedia.org/wiki/Needleman%E2%80%93Wunsch_algorithm)      | `NeedlemanWunsch`    | `needleman_wunsch`     |\n| [Gotoh](http://bioinfo.ict.ac.cn/~dbu/AlgorithmCourses/Lectures/LOA/Lec6-Sequence-Alignment-Affine-Gaps-Gotoh1982.pdf) | `Gotoh`              | `gotoh`       
         |\n| [Smith-Waterman](https://en.wikipedia.org/wiki/Smith%E2%80%93Waterman_algorithm)          | `SmithWaterman`      | `smith_waterman`       |\n\n### Token based\n\n| Algorithm                                                                                 | Class                | Functions     |\n|-------------------------------------------------------------------------------------------|----------------------|---------------|\n| [Jaccard index](https://en.wikipedia.org/wiki/Jaccard_index)                              | `Jaccard`            | `jaccard`     |\n| [Sørensen–Dice coefficient](https://en.wikipedia.org/wiki/S%C3%B8rensen%E2%80%93Dice_coefficient) | `Sorensen`   | `sorensen`, `sorensen_dice`, `dice` |\n| [Tversky index](https://en.wikipedia.org/wiki/Tversky_index)                              | `Tversky`            | `tversky`    |\n| [Overlap coefficient](https://en.wikipedia.org/wiki/Overlap_coefficient)                  | `Overlap`            | `overlap`    |\n| [Tanimoto distance](https://en.wikipedia.org/wiki/Jaccard_index#Tanimoto_similarity_and_distance) | `Tanimoto`   | `tanimoto`   |\n| [Cosine similarity](https://en.wikipedia.org/wiki/Cosine_similarity)                      | `Cosine`             | `cosine`     |\n| [Monge-Elkan](https://www.academia.edu/200314/Generalized_Monge-Elkan_Method_for_Approximate_Text_String_Comparison) | `MongeElkan` | `monge_elkan` |\n| [Bag distance](https://github.com/Yomguithereal/talisman/blob/master/src/metrics/bag.js) | `Bag`        | `bag`        |\n\n### Sequence based\n\n| Algorithm | Class | Functions |\n|-----------|-------|-----------|\n| [longest common subsequence similarity](https://en.wikipedia.org/wiki/Longest_common_subsequence_problem)          | `LCSSeq` | `lcsseq` |\n| [longest common substring similarity](https://docs.python.org/2/library/difflib.html#difflib.SequenceMatcher)      | `LCSStr` | `lcsstr` |\n| [Ratcliff-Obershelp 
similarity](https://en.wikipedia.org/wiki/Gestalt_Pattern_Matching) | `RatcliffObershelp` | `ratcliff_obershelp` |\n\n### Compression based\n\n[Normalized compression distance](https://en.wikipedia.org/wiki/Normalized_compression_distance#Normalized_compression_distance) with different compression algorithms.\n\nClassic compression algorithms:\n\n| Algorithm                                                                  | Class       | Function     |\n|----------------------------------------------------------------------------|-------------|--------------|\n| [Arithmetic coding](https://en.wikipedia.org/wiki/Arithmetic_coding)       | `ArithNCD`  | `arith_ncd`  |\n| [RLE](https://en.wikipedia.org/wiki/Run-length_encoding)                   | `RLENCD`    | `rle_ncd`    |\n| [BWT RLE](https://en.wikipedia.org/wiki/Burrows%E2%80%93Wheeler_transform) | `BWTRLENCD` | `bwtrle_ncd` |\n\nNormal compression algorithms:\n\n| Algorithm                                                                  | Class        | Function      |\n|----------------------------------------------------------------------------|--------------|---------------|\n| Square Root                                                                | `SqrtNCD`    | `sqrt_ncd`    |\n| [Entropy](https://en.wikipedia.org/wiki/Entropy_(information_theory))      | `EntropyNCD` | `entropy_ncd` |\n\nWork in progress algorithms that compare two strings as array of bits:\n\n| Algorithm                                  | Class     | Function   |\n|--------------------------------------------|-----------|------------|\n| [BZ2](https://en.wikipedia.org/wiki/Bzip2) | `BZ2NCD`  | `bz2_ncd`  |\n| [LZMA](https://en.wikipedia.org/wiki/LZMA) | `LZMANCD` | `lzma_ncd` |\n| [ZLib](https://en.wikipedia.org/wiki/Zlib) | `ZLIBNCD` | `zlib_ncd` |\n\nSee [blog post](https://articles.life4web.ru/other/ncd/) for more details about NCD.\n\n### Phonetic\n\n| Algorithm                                                                    
| Class    | Functions |\n|------------------------------------------------------------------------------|----------|-----------|\n| [MRA](https://en.wikipedia.org/wiki/Match_rating_approach)                   | `MRA`    | `mra`     |\n| [Editex](https://anhaidgroup.github.io/py_stringmatching/v0.3.x/Editex.html) | `Editex` | `editex`  |\n\n### Simple\n\n| Algorithm           | Class      | Functions  |\n|---------------------|------------|------------|\n| Prefix similarity   | `Prefix`   | `prefix`   |\n| Postfix similarity  | `Postfix`  | `postfix`  |\n| Length distance     | `Length`   | `length`   |\n| Identity similarity | `Identity` | `identity` |\n| Matrix similarity   | `Matrix`   | `matrix`   |\n\n## Installation\n\n### Stable\n\nOnly pure python implementation:\n\n```bash\npip install textdistance\n```\n\nWith extra libraries for maximum speed:\n\n```bash\npip install \"textdistance[extras]\"\n```\n\nWith all libraries (required for [benchmarking](#benchmarks) and [testing](#running-tests)):\n\n```bash\npip install \"textdistance[benchmark]\"\n```\n\nWith algorithm specific extras:\n\n```bash\npip install \"textdistance[Hamming]\"\n```\n\nAlgorithms with available extras: `DamerauLevenshtein`, `Hamming`, `Jaro`, `JaroWinkler`, `Levenshtein`.\n\n### Dev\n\nVia pip:\n\n```bash\npip install -e git+https://github.com/life4/textdistance.git#egg=textdistance\n```\n\nOr clone repo and install with some extras:\n\n```bash\ngit clone https://github.com/life4/textdistance.git\npip install -e \".[benchmark]\"\n```\n\n## Usage\n\nAll algorithms have 2 interfaces:\n\n1. Class with algorithm-specific params for customizing.\n1. Class instance with default params for quick and simple usage.\n\nAll algorithms have some common methods:\n\n1. `.distance(*sequences)` -- calculate distance between sequences.\n1. `.similarity(*sequences)` -- calculate similarity for sequences.\n1. `.maximum(*sequences)` -- maximum possible value for distance and similarity. 
For any sequence: `distance + similarity == maximum`.\n1. `.normalized_distance(*sequences)` -- normalized distance between sequences. The return value is a float between 0 and 1, where 0 means the sequences are equal and 1 means they are totally different.\n1. `.normalized_similarity(*sequences)` -- normalized similarity for sequences. The return value is a float between 0 and 1, where 0 means the sequences are totally different and 1 means they are equal.\n\nMost common init arguments:\n\n1. `qval` -- q-value for splitting sequences into q-grams. Possible values:\n    - 1 (default) -- compare sequences by chars.\n    - 2 or more -- transform sequences to q-grams.\n    - None -- split sequences by words.\n1. `as_set` -- for token-based algorithms:\n    - True -- `t` and `ttt` are equal.\n    - False (default) -- `t` and `ttt` are different.\n\n## Examples\n\nFor example, [Hamming distance](https://en.wikipedia.org/wiki/Hamming_distance):\n\n```python\nimport textdistance\n\ntextdistance.hamming('test', 'text')\n# 1\n\ntextdistance.hamming.distance('test', 'text')\n# 1\n\ntextdistance.hamming.similarity('test', 'text')\n# 3\n\ntextdistance.hamming.normalized_distance('test', 'text')\n# 0.25\n\ntextdistance.hamming.normalized_similarity('test', 'text')\n# 0.75\n\ntextdistance.Hamming(qval=2).distance('test', 'text')\n# 2\n\n```\n\nAll other algorithms have the same interface.\n\n## Articles\n\nA few articles with examples of how to use textdistance in the real world:\n\n- [Guide to Fuzzy Matching with Python](http://theautomatic.net/2019/11/13/guide-to-fuzzy-matching-with-python/)\n- [String similarity — the basic know your algorithms guide!](https://itnext.io/string-similarity-the-basic-know-your-algorithms-guide-3de3d7346227)\n- [Normalized compression distance](https://articles.life4web.ru/other/ncd/)\n\n## Extra libraries\n\nFor the main algorithms, textdistance tries to call known external libraries (fastest first) if they are available (installed on your system) and applicable (the external implementation can compare this type of sequence). 
[Install](#installation) textdistance with extras for this feature.\n\nYou can disable this by passing `external=False` argument on init:\n\n```python3\nimport textdistance\nhamming = textdistance.Hamming(external=False)\nhamming('text', 'testit')\n# 3\n```\n\nSupported libraries:\n\n1. [jellyfish](https://github.com/jamesturk/jellyfish)\n1. [py_stringmatching](https://github.com/anhaidgroup/py_stringmatching)\n1. [pylev](https://github.com/toastdriven/pylev)\n1. [Levenshtein](https://github.com/maxbachmann/Levenshtein)\n1. [pyxDamerauLevenshtein](https://github.com/gfairchild/pyxDamerauLevenshtein)\n\nAlgorithms:\n\n1. DamerauLevenshtein\n1. Hamming\n1. Jaro\n1. JaroWinkler\n1. Levenshtein\n\n## Benchmarks\n\nWithout extras installation:\n\n| algorithm          | library               |    time |\n|--------------------|-----------------------|---------|\n| DamerauLevenshtein | rapidfuzz             | 0.00312 |\n| DamerauLevenshtein | jellyfish             | 0.00591 |\n| DamerauLevenshtein | pyxdameraulevenshtein | 0.03335 |\n| DamerauLevenshtein | **textdistance**      | 0.83524 |\n| Hamming            | Levenshtein           | 0.00038 |\n| Hamming            | rapidfuzz             | 0.00044 |\n| Hamming            | jellyfish             | 0.00091 |\n| Hamming            | **textdistance**      | 0.03531 |\n| Jaro               | rapidfuzz             | 0.00092 |\n| Jaro               | jellyfish             | 0.00191 |\n| Jaro               | **textdistance**      | 0.07365 |\n| JaroWinkler        | rapidfuzz             | 0.00094 |\n| JaroWinkler        | jellyfish             | 0.00195 |\n| JaroWinkler        | **textdistance**      | 0.07501 |\n| Levenshtein        | rapidfuzz             | 0.00099 |\n| Levenshtein        | Levenshtein           | 0.00122 |\n| Levenshtein        | jellyfish             | 0.00254 |\n| Levenshtein        | pylev                 | 0.15688 |\n| Levenshtein        | **textdistance**      | 0.53902 |\n\nTotal: 24 libs.\n\nYeah, so 
slow. Use TextDistance in production only with extras.\n\nTextDistance uses the benchmark results to optimize algorithm selection and tries to call the fastest external library first (if possible).\n\nYou can run the benchmark manually on your system:\n\n```bash\npip install \"textdistance[benchmark]\"\npython3 -m textdistance.benchmark\n```\n\nTextDistance shows the benchmark results table for your system and saves the library priorities into the `libraries.json` file in TextDistance's folder. textdistance then uses this file to call the fastest algorithm implementation. A default [libraries.json](textdistance/libraries.json) is already included in the package.\n\n## Running tests\n\nAll you need is [task](https://taskfile.dev/). See [Taskfile.yml](./Taskfile.yml) for the list of available commands. For example, to run tests including third-party libraries usage, execute `task pytest-external:run`.\n\n## Contributing\n\nPRs are welcome!\n\n- Found a bug? Fix it!\n- Want to add more algorithms? Sure! Just make it with the same interface as other algorithms in the lib and add some tests.\n- Can make something faster? Great! Just avoid external dependencies and remember that everything should work not only with strings.\n- Something else that you think is good? Do it! Just make sure that CI passes and everything from the README is still applicable (interface, features, and so on).\n- Have no time to code? Tell your friends and subscribers about `textdistance`. More users, more contributions, more amazing features.\n\nThank you :heart:\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Flife4%2Ftextdistance","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Flife4%2Ftextdistance","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Flife4%2Ftextdistance/lists"}