{"id":16058076,"url":"https://github.com/chinnichaitanya/spellwise","last_synced_at":"2025-03-17T16:32:05.409Z","repository":{"id":57469827,"uuid":"320058358","full_name":"chinnichaitanya/spellwise","owner":"chinnichaitanya","description":"🚀 Extremely fast fuzzy matcher \u0026 spelling checker in Python!","archived":false,"fork":false,"pushed_at":"2021-05-30T08:17:12.000Z","size":447,"stargazers_count":21,"open_issues_count":0,"forks_count":4,"subscribers_count":2,"default_branch":"master","last_synced_at":"2024-10-10T03:06:12.509Z","etag":null,"topics":["caverphone","editex","levenshtein","natural-language-processing","nlp","spellcheck","spelling-correction","trie","typox"],"latest_commit_sha":null,"homepage":"https://pypi.org/project/spellwise/","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/chinnichaitanya.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2020-12-09T19:26:54.000Z","updated_at":"2024-06-27T14:50:02.000Z","dependencies_parsed_at":"2022-09-19T15:00:45.126Z","dependency_job_id":null,"html_url":"https://github.com/chinnichaitanya/spellwise","commit_stats":null,"previous_names":[],"tags_count":13,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/chinnichaitanya%2Fspellwise","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/chinnichaitanya%2Fspellwise/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/chinnichaitanya%2Fspellwise/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/chinnichaitanya%2Fspellwise/manifests","owner_url":"https://
repos.ecosyste.ms/api/v1/hosts/GitHub/owners/chinnichaitanya","download_url":"https://codeload.github.com/chinnichaitanya/spellwise/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":221697009,"owners_count":16865523,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["caverphone","editex","levenshtein","natural-language-processing","nlp","spellcheck","spelling-correction","trie","typox"],"created_at":"2024-10-09T03:06:16.877Z","updated_at":"2024-10-27T15:18:16.551Z","avatar_url":"https://github.com/chinnichaitanya.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# Spellwise\n\n🚀 Extremely fast spelling checker and suggester in Python!\n\n\u003ca href=\"https://pypi.org/project/spellwise/\"\u003e\u003cimg alt=\"PyPI - Python Version\" src=\"https://img.shields.io/pypi/pyversions/spellwise\"\u003e\u003c/a\u003e\n[![PyPI version](https://badge.fury.io/py/spellwise.svg)](https://badge.fury.io/py/spellwise)\n\u003ca href=\"https://pepy.tech/project/spellwise\"\u003e\u003cimg alt=\"Downloads\" src=\"https://static.pepy.tech/badge/spellwise\"\u003e\u003c/a\u003e\n\u003ca href=\"https://pypi.org/project/spellwise/#files\"\u003e\u003cimg alt=\"PyPI - Wheel\" src=\"https://img.shields.io/pypi/wheel/spellwise\"\u003e\u003c/a\u003e\n[![License: MIT](https://img.shields.io/pypi/l/spellwise)](https://github.com/chinnichaitanya/spellwise/blob/master/LICENSE)\n[![Code style: black](https://img.shields.io/badge/code%20style-black-000000.svg)](https://github.com/python/black)\n\nThe following 
algorithms are currently supported:\n\n- Edit-distance, [Hall and Dowling (1980)](https://dl.acm.org/doi/10.1145/356827.356830) (based on [Levenshtein distance](https://en.wikipedia.org/wiki/Levenshtein_distance) algorithm)\n- Editex, [Zobel and Dart (1996)](https://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.18.2138\u0026rep=rep1\u0026type=pdf) (for suggesting phonetically similar words)\n- Soundex (https://nlp.stanford.edu/IR-book/html/htmledition/phonetic-correction-1.html) (for identifying phonetically similar words)\n- Caverphone 1.0 and Caverphone 2.0, [David Hood (2002)](https://caversham.otago.ac.nz/files/working/ctp060902.pdf) (to identify English names which sound phonetically similar)\n- QWERTY Keyboard layout Typographic based correction algorithm (Typox), inspired by [Ahmad, Indrayana, Wibisono, and Ijtihadie (2017)](https://ieeexplore.ieee.org/document/8257147). This implementation might not be the exact one specified in the paper since it is not available to read for free\n\nAll the above algorithms use an underlying [Trie](https://en.wikipedia.org/wiki/Trie) based dictionary for efficient storage and fast computation! Implementations of all the algorithms are inspired by the amazing article [Fast and Easy Levenshtein distance using a Trie, by Steve Hanov](http://stevehanov.ca/blog/?id=114).\n\n## 📦 Installation\n\nThe easiest way to install `spellwise` is through `pip`.\n\n```shell\npip install spellwise\n\n```\n\n## 🧑‍💻 Usage\n\nCurrently, there are six algorithms available for use with the following class names,\n\n- `Levenshtein`\n- `Editex`\n- `Soundex`\n- `CaverphoneOne`\n- `CaverphoneTwo`\n- `Typox`\n\nPlease check the [`examples/`](https://github.com/chinnichaitanya/python-spell-checker/tree/master/examples) folder for specific usage of each algorithm. 
In general, though, each algorithm is used in three steps,\n\n- Initialization (initialize the class object for the algorithm to use)\n- Index correct words/names (add correct words or names to the dictionary)\n- Fetch suggestions (inference)\n\n```python\nfrom spellwise import (CaverphoneOne, CaverphoneTwo, Editex, Levenshtein,\n                       Soundex, Typox)\n\n# (1) Initialize the desired algorithm\nalgorithm = Editex() # this can be CaverphoneOne, CaverphoneTwo, Levenshtein, Soundex or Typox as well\n\n# (2) Index the words/names to the algorithm\n# Indexing can be done by adding words from a file\nalgorithm.add_from_path(\"\u003cpath-to-the-dictionary-file\u003e\")\n# or by adding them manually\nalgorithm.add_words([\"spell\", \"spelling\", \"check\"])\n\n# (3) Fetch the suggestions for the word\nsuggestions = algorithm.get_suggestions(\"spellin\")\n# `suggestions` is a list of dicts with fields `word` and `distance`\n# [{\"word\": ..., \"distance\": ...}, ...]\nprint(suggestions)\n\n# Output would be similar to the following,\n# [{'word': 'spelling', 'distance': 2}]\n\n```\n\nThe default maximum distance considered varies for different algorithms. 
It can be changed while fetching the suggestions,\n\n```python\n# Fetch suggestions with maximum distance 4\nsuggestions = algorithm.get_suggestions(\"spellin\", max_distance=4)\n# Print the suggestions\nprint(suggestions)\n\n# Output would be similar to the following,\n# [{'word': 'spelling', 'distance': 2}, {'word': 'spell', 'distance': 4}]\n\n```\n\n## 💡 Analysis of each algorithm\n\nThe package provides several algorithms, each suited to a different purpose.\nThe following sections discuss each algorithm in turn.\n\n### (1) Levenshtein\n\nThe `Levenshtein` algorithm is the baseline and most popular method for finding the closest correct words for a misspelt word. It ranks candidates by edit-distance: the number of insertions, deletions and replacements needed to turn the given word into a correctly spelt one.\n\n```python\nfrom spellwise import Levenshtein\n\n# Initialise the algorithm\nlevenshtein = Levenshtein()\n# Index the words from a dictionary\nlevenshtein.add_from_path(\"examples/data/american-english\")\n\n# Fetch suggestions\nsuggestions = levenshtein.get_suggestions(\"run\")\n# Print the top 10 suggestions\nprint(\"Word \\t Distance\")\nprint(\"=================\")\nfor suggestion in suggestions[0:10]:\n    print(\"{} \\t {}\".format(suggestion.get(\"word\"), suggestion.get(\"distance\")))\n\n```\n\nLevenshtein provides the following,\n\n```shell\nWord \t Distance\n=================\nrun \t 0\nbun \t 1\ndun \t 1\nfun \t 1\ngun \t 1\nhun \t 1\njun \t 1\njun \t 1\nmun \t 1\nnun \t 1\n\n```\n\n### (2) Editex\n\nThe `Editex` algorithm provides suggestions of words which are phonetically close to the given word. 
It also uses the edit-distance, but applies different replacement and deletion costs depending on whether the two letters belong to the same phonetic group or not.\n\n```python\nfrom spellwise import Editex\n\n# Initialise the algorithm\neditex = Editex()\n# Index the words from a dictionary\neditex.add_from_path(\"examples/data/american-english\")\n\n# Fetch suggestions\nsuggestions = editex.get_suggestions(\"run\")\n# Print the top 10 suggestions\nprint(\"Word \\t Distance\")\nprint(\"=================\")\nfor suggestion in suggestions[0:10]:\n    print(\"{} \\t {}\".format(suggestion.get(\"word\"), suggestion.get(\"distance\")))\n\n```\n\nEditex suggests the following,\n\n```shell\nWord \t Distance\n=================\nrun \t 0\nran \t 1\nron \t 1\nruin \t 1\nrum \t 1\nbun \t 2\ndun \t 2\ndunn \t 2\nfun \t 2\ngun \t 2\n\n```\n\nNotice that the `Levenshtein` algorithm computes the distance between `run` and `bun` as 1 (since only one replacement is necessary). The `Editex` algorithm, on the other hand, computes this distance as 2, since the two words are phonetically farther apart.\n\nAs mentioned above, the Editex algorithm uses different costs for replacement and deletion. 
These values can be modified for fetching different results.\n\n```python\nfrom spellwise import Editex\n\n# Initialise the algorithm\neditex = Editex(group_cost=0.5, non_group_cost=3) # configure the group cost and non-group cost\n# Index the words from a dictionary\neditex.add_from_path(\"examples/data/american-english\")\n\n# Fetch suggestions\nsuggestions = editex.get_suggestions(\"run\")\n# Print the top 10 suggestions\nprint(\"Word \\t Distance\")\nprint(\"=================\")\nfor suggestion in suggestions[0:10]:\n    print(\"{} \\t {}\".format(suggestion.get(\"word\"), suggestion.get(\"distance\")))\n\n```\n\nConfiguring `group_cost=0.5` and `non_group_cost=3` in the above example results in the following suggestions,\n\n```shell\nWord \t Distance\n=================\nrun \t 0\nran \t 0.5\nron \t 0.5\nruin \t 0.5\nrum \t 0.5\nlan \t 1.0\nlen \t 1.0\nlin \t 1.0\nlon \t 1.0\nloon \t 1.0\n\n```\n\n### (3) Soundex\n\nThe Soundex algorithm, like Editex, aims to provide phonetically similar words for the given word. 
It is one of the earliest phonetic matching algorithms, and many variations exist.\n\n```python\nfrom spellwise import Soundex\n\n# Initialise the algorithm\nsoundex = Soundex()\n# Index the words from a dictionary\nsoundex.add_from_path(\"examples/data/american-english\")\n\n# Fetch suggestions\nsuggestions = soundex.get_suggestions(\"run\")\n# Print the top 10 suggestions\nprint(\"Word \\t Distance\")\nprint(\"=================\")\nfor suggestion in suggestions[0:10]:\n    print(\"{} \\t {}\".format(suggestion.get(\"word\"), suggestion.get(\"distance\")))\n\n```\n\nSoundex suggests the following,\n\n```shell\nWord \t Distance\n=================\nrain \t 0\nrainy \t 0\nram \t 0\nram \t 0\nrama \t 0\nramie \t 0\nran \t 0\nranee \t 0\nrayon \t 0\nream \t 0\n\n```\n\n### (4) Caverphone 1.0 and Caverphone 2.0\n\nThe Caverphone algorithm was developed as a part of the Caversham project to phonetically identify the names of different instances of the same person from various sources. In other words, it is used for phonetically identifying duplicate entries of an entity or a word. 
The difference between v1 and v2 of the algorithm is in the pre-processing of words during indexing.\n\n```python\nfrom spellwise import CaverphoneTwo # or CaverphoneOne\n\n# Initialise the algorithm\ncaverphone = CaverphoneTwo()\n# Index the words from a dictionary\ncaverphone.add_from_path(\"examples/data/american-english\")\n\n# Fetch suggestions\nsuggestions = caverphone.get_suggestions(\"run\")\n# Print the top 10 suggestions\nprint(\"Word \\t Distance\")\nprint(\"=================\")\nfor suggestion in suggestions[0:10]:\n    print(\"{} \\t {}\".format(suggestion.get(\"word\"), suggestion.get(\"distance\")))\n\n```\n\nCaverphone v2 provides the following suggestions,\n\n```shell\nWord \t Distance\n=================\nrain \t 0\nran \t 0\nrein \t 0\nrene \t 0\nroan \t 0\nron \t 0\nruin \t 0\nrun \t 0\nrune \t 0\nwren \t 0\n\n```\n\n### (5) Typox\n\n`Typox` is a typography-based correction algorithm optimised for correcting typos made on a QWERTY keyboard. It is similar to the `Editex` algorithm, except that the letters are grouped based on their locations on the keyboard, instead of grouping them phonetically. 
The original paper is not available to read for free, and hence this might not be its exact implementation.\n\n```python\nfrom spellwise import Typox\n\n# Initialise the algorithm\ntypox = Typox()\n# Index the words from a dictionary\ntypox.add_from_path(\"examples/data/american-english\")\n\n# Fetch suggestions\nsuggestions = typox.get_suggestions(\"ohomr\")\n# Print the top 10 suggestions\nprint(\"Word \\t Distance\")\nprint(\"=================\")\nfor suggestion in suggestions[0:10]:\n    print(\"{} \\t {}\".format(suggestion.get(\"word\"), suggestion.get(\"distance\")))\n\n```\n\nTypox provides the following words,\n\n```shell\nWord \t Distance\n=================\nhome \t 2\nphone \t 2\n```\n\nNotice that `Typox` did not suggest words like `choke`, `come`, `chore`, `chose` etc. (which `Levenshtein` would suggest), even though they are at edit-distance 2 from the word `ohomr`. Instead, it suggests the closest words based on the QWERTY keyboard layout, which are `phone` and `home`.\n\nAs mentioned above, the Typox algorithm is similar to Editex and uses different costs for replacement and deletion. 
These values can be modified for fetching different results.\n\n```python\nfrom spellwise import Typox\n\n# Initialise the algorithm\ntypox = Typox(group_cost=0.5, non_group_cost=3) # configure the group cost and non-group cost\n# Index the words from a dictionary\ntypox.add_from_path(\"examples/data/american-english\")\n\n# Fetch suggestions\nsuggestions = typox.get_suggestions(\"ohomr\")\n# Print the top 10 suggestions\nprint(\"Word \\t Distance\")\nprint(\"=================\")\nfor suggestion in suggestions[0:10]:\n    print(\"{} \\t {}\".format(suggestion.get(\"word\"), suggestion.get(\"distance\")))\n\n```\n\nTypox provides the following suggestions for the word `ohomr` after setting `group_cost=0.5` and `non_group_cost=3`.\n\n```shell\nWord \t Distance\n=================\nphone \t 1.5\nphoned \t 2.0\nphones \t 2.0\n\n```\n\n## ⚡️ Memory and Time profiling\n\nThe following are the usage statistics on a MacBook Pro, 2.4 GHz Quad-Core Intel Core i5 with 16 GB RAM.\n\n\u003ctable\u003e\n    \u003ctr\u003e\n        \u003cth\u003eAlgorithm\u003c/th\u003e\n        \u003cth\u003eNo. 
of words\u003c/th\u003e\n        \u003cth\u003eCorpus size on disk\u003c/th\u003e\n        \u003cth\u003eMemory used\u003c/th\u003e\n        \u003cth\u003eTime to index\u003c/th\u003e\n        \u003cth\u003eTime to inference\u003c/th\u003e\n        \u003cth\u003eRemarks\u003c/th\u003e\n    \u003c/tr\u003e\n    \u003ctr\u003e\n        \u003ctd\u003eLevenshtein\u003c/td\u003e\n        \u003ctd\u003e119,095\u003c/td\u003e\n        \u003ctd\u003e1.1 MB\u003c/td\u003e\n        \u003ctd\u003e~ 127 MB\u003c/td\u003e\n        \u003ctd\u003e~ 1160 milliseconds\u003c/td\u003e\n        \u003ctd\u003e~ 36 milliseconds\u003c/td\u003e\n        \u003ctd\u003e\n            \u003cul\u003e\n                \u003cli\u003eFor word \"hallo\"\u003c/li\u003e\n                \u003cli\u003eWith max distance 2\u003c/li\u003e\n            \u003c/ul\u003e\n        \u003c/td\u003e\n    \u003c/tr\u003e\n    \u003ctr\u003e\n        \u003ctd\u003eEditex\u003c/td\u003e\n        \u003ctd\u003e119,095\u003c/td\u003e\n        \u003ctd\u003e1.1 MB\u003c/td\u003e\n        \u003ctd\u003e~ 127 MB\u003c/td\u003e\n        \u003ctd\u003e~ 1200 milliseconds\u003c/td\u003e\n        \u003ctd\u003e~ 90 milliseconds\u003c/td\u003e\n        \u003ctd\u003e\n            \u003cul\u003e\n                \u003cli\u003eFor word \"hallo\"\u003c/li\u003e\n                \u003cli\u003eWith max distance 2\u003c/li\u003e\n            \u003c/ul\u003e\n        \u003c/td\u003e\n    \u003c/tr\u003e\n    \u003ctr\u003e\n        \u003ctd\u003eSoundex\u003c/td\u003e\n        \u003ctd\u003e119,095\u003c/td\u003e\n        \u003ctd\u003e1.1 MB\u003c/td\u003e\n        \u003ctd\u003e~ 16 MB\u003c/td\u003e\n        \u003ctd\u003e~ 1130 milliseconds\u003c/td\u003e\n        \u003ctd\u003e~ 0.18 milliseconds (yes right!)\u003c/td\u003e\n        \u003ctd\u003e\n            \u003cul\u003e\n                \u003cli\u003eFor word \"hallo\"\u003c/li\u003e\n            \u003c/ul\u003e\n        \u003c/td\u003e\n    \u003c/tr\u003e\n    
\u003ctr\u003e\n        \u003ctd\u003eCaverphone 1.0\u003c/td\u003e\n        \u003ctd\u003e119,095\u003c/td\u003e\n        \u003ctd\u003e1.1 MB\u003c/td\u003e\n        \u003ctd\u003e~ 36.7 MB\u003c/td\u003e\n        \u003ctd\u003e~ 1700 milliseconds\u003c/td\u003e\n        \u003ctd\u003e~ 0.2 milliseconds (yes right!)\u003c/td\u003e\n        \u003ctd\u003e\n            \u003cul\u003e\n                \u003cli\u003eFor word \"hallo\"\u003c/li\u003e\n            \u003c/ul\u003e\n        \u003c/td\u003e\n    \u003c/tr\u003e\n    \u003ctr\u003e\n        \u003ctd\u003eCaverphone 2.0\u003c/td\u003e\n        \u003ctd\u003e119,095\u003c/td\u003e\n        \u003ctd\u003e1.1 MB\u003c/td\u003e\n        \u003ctd\u003e~ 99 MB\u003c/td\u003e\n        \u003ctd\u003e~ 2400 milliseconds\u003c/td\u003e\n        \u003ctd\u003e~ 0.4 milliseconds (yes right!)\u003c/td\u003e\n        \u003ctd\u003e\n            \u003cul\u003e\n                \u003cli\u003eFor word \"hallo\"\u003c/li\u003e\n            \u003c/ul\u003e\n        \u003c/td\u003e\n    \u003c/tr\u003e\n    \u003ctr\u003e\n        \u003ctd\u003eTypox\u003c/td\u003e\n        \u003ctd\u003e119,095\u003c/td\u003e\n        \u003ctd\u003e1.1 MB\u003c/td\u003e\n        \u003ctd\u003e~ 127 MB\u003c/td\u003e\n        \u003ctd\u003e~ 1360 milliseconds\u003c/td\u003e\n        \u003ctd\u003e~ 200 milliseconds\u003c/td\u003e\n        \u003ctd\u003e\n            \u003cul\u003e\n                \u003cli\u003eFor word \"hallo\"\u003c/li\u003e\n                \u003cli\u003eWith max distance 2\u003c/li\u003e\n            \u003c/ul\u003e\n        \u003c/td\u003e\n    \u003c/tr\u003e\n\u003c/table\u003e\n\n## 🙌 Contributing\n\nPlease feel free to raise PRs! 
😃\n\nThere are so many algorithms to be added and improvements to be made to this package.\nThis package is still at an early stage, and contributions are very welcome!\n\n## 📝 References\n\n- https://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.18.2138\u0026rep=rep1\u0026type=pdf\n- https://scholar.harvard.edu/jfeigenbaum/software/editex\n- https://github.com/J535D165/FEBRL-fork-v0.4.2/blob/master/stringcmp.py\n- https://caversham.otago.ac.nz/files/working/ctp060902.pdf\n- https://en.wikipedia.org/wiki/Caverphone\n- https://ieeexplore.ieee.org/document/8257147\n- https://www.semanticscholar.org/paper/Edit-distance-weighting-modification-using-phonetic-Ahmad-Indrayana/0d74db8a20f7b46b98c2c77750b9b973a3e4a7b2\n- https://nlp.stanford.edu/IR-book/html/htmledition/phonetic-correction-1.html\n- http://stevehanov.ca/blog/?id=114\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fchinnichaitanya%2Fspellwise","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fchinnichaitanya%2Fspellwise","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fchinnichaitanya%2Fspellwise/lists"}