{"id":15014117,"url":"https://github.com/r1j1t/contextualspellcheck","last_synced_at":"2025-04-13T18:38:15.564Z","repository":{"id":41178594,"uuid":"254703118","full_name":"R1j1t/contextualSpellCheck","owner":"R1j1t","description":"✔️Contextual word checker for better suggestions (not actively maintained)","archived":false,"fork":false,"pushed_at":"2025-01-31T22:43:26.000Z","size":2569,"stargazers_count":414,"open_issues_count":9,"forks_count":64,"subscribers_count":9,"default_branch":"master","last_synced_at":"2025-04-12T08:28:33.334Z","etag":null,"topics":["bert","chatbot","help-wanted","natural-language-processing","nlp","oov","preprocessing","python","python-spelling-corrector","spacy","spacy-extension","spellcheck","spellchecker","spelling-correction","spelling-corrections"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/R1j1t.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":"CITATION.cff","codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null}},"created_at":"2020-04-10T18:08:33.000Z","updated_at":"2025-04-12T02:07:28.000Z","dependencies_parsed_at":"2023-12-19T12:05:47.774Z","dependency_job_id":"37ea87e5-880c-4d09-b446-2402f202dd63","html_url":"https://github.com/R1j1t/contextualSpellCheck","commit_stats":{"total_commits":148,"total_committers":7,"mean_commits":"21.142857142857142","dds":0.1351351351351351,"last_synced_commit":"dfca557a71df7b1b93cdbd0dbb5ed29efb0b4e87"},"previous_names":[],"tags_count":16,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/R1j1t%
2FcontextualSpellCheck","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/R1j1t%2FcontextualSpellCheck/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/R1j1t%2FcontextualSpellCheck/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/R1j1t%2FcontextualSpellCheck/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/R1j1t","download_url":"https://codeload.github.com/R1j1t/contextualSpellCheck/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":248762719,"owners_count":21157775,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["bert","chatbot","help-wanted","natural-language-processing","nlp","oov","preprocessing","python","python-spelling-corrector","spacy","spacy-extension","spellcheck","spellchecker","spelling-correction","spelling-corrections"],"created_at":"2024-09-24T19:45:12.986Z","updated_at":"2025-04-13T18:38:15.503Z","avatar_url":"https://github.com/R1j1t.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# spellCheck\n\u003ca href=\"https://github.com/R1j1t/contextualSpellCheck\"\u003e\u003cimg src=\"https://user-images.githubusercontent.com/22280243/82138959-2852cd00-9842-11ea-918a-49b2a7873ef6.png\" width=\"276\" height=\"120\" align=\"right\" /\u003e\u003c/a\u003e\n\nContextual word checker for better 
suggestions\n\n[![license](https://img.shields.io/github/license/r1j1t/contextualSpellCheck)](https://github.com/R1j1t/contextualSpellCheck/blob/master/LICENSE) \n[![PyPI](https://img.shields.io/pypi/v/contextualSpellCheck?color=green)](https://pypi.org/project/contextualSpellCheck/) \n[![Python-Version](https://img.shields.io/badge/Python-3.6+-green)](https://github.com/R1j1t/contextualSpellCheck#install)\n[![Downloads](https://pepy.tech/badge/contextualspellcheck/week)](https://pepy.tech/project/contextualspellcheck)\n[![GitHub contributors](https://img.shields.io/github/contributors/r1j1t/contextualSpellCheck)](https://github.com/R1j1t/contextualSpellCheck/graphs/contributors)\n[![Help Wanted](https://img.shields.io/badge/Help%20Wanted-Task%20List-violet)](https://github.com/R1j1t/contextualSpellCheck#task-list)\n[![DOI](https://zenodo.org/badge/254703118.svg)](https://zenodo.org/badge/latestdoi/254703118)\n\n## Types of spelling mistakes\n\nIt is essential to understand that identifying whether a candidate is a spelling error is a big task.\n\n\u003e Spelling errors are broadly classified as non-word errors (NWE) and real word errors (RWE). If the misspelt string is a valid word in the language, then it is called an RWE, else it is an NWE.\n\u003e\n\u003e -- [Monojit Choudhury et al. (2007)][1]\n\nThis package currently focuses on Out of Vocabulary (OOV) word or non-word error (NWE) correction using a BERT model. The idea behind using BERT is to take the surrounding context into account when correcting OOV words. To improve this package, I would like to extend the functionality to identify RWE, optimise the package, and improve the documentation.\n\n## Install\n\nThe package can be installed using [pip](https://pypi.org/project/contextualSpellCheck/). 
You need Python 3.6+.\n\n```bash\npip install contextualSpellCheck\n```\n\n## Usage\n\n**Note:** For use in other languages, check the [`examples`](https://github.com/R1j1t/contextualSpellCheck/tree/master/examples) folder.\n\n### How to load the package in spacy pipeline\n\n```python\n\u003e\u003e\u003e import contextualSpellCheck\n\u003e\u003e\u003e import spacy\n\u003e\u003e\u003e nlp = spacy.load(\"en_core_web_sm\") \n\u003e\u003e\u003e \n\u003e\u003e\u003e ## We require NER to identify if a token is a PERSON\n\u003e\u003e\u003e ## also require parser because we use `Token.sent` for context\n\u003e\u003e\u003e nlp.pipe_names\n['tok2vec', 'tagger', 'parser', 'ner', 'attribute_ruler', 'lemmatizer']\n\u003e\u003e\u003e contextualSpellCheck.add_to_pipe(nlp)\n\u003e\u003e\u003e nlp.pipe_names\n['tok2vec', 'tagger', 'parser', 'ner', 'attribute_ruler', 'lemmatizer', 'contextual spellchecker']\n\u003e\u003e\u003e \n\u003e\u003e\u003e doc = nlp('Income was $9.4 milion compared to the prior year of $2.7 milion.')\n\u003e\u003e\u003e doc._.outcome_spellCheck\n'Income was $9.4 million compared to the prior year of $2.7 million.'\n```\n\nOr you can add it to the spaCy pipeline manually!\n\n```python\n\u003e\u003e\u003e import spacy\n\u003e\u003e\u003e import contextualSpellCheck\n\u003e\u003e\u003e \n\u003e\u003e\u003e nlp = spacy.load(\"en_core_web_sm\")\n\u003e\u003e\u003e nlp.pipe_names\n['tok2vec', 'tagger', 'parser', 'ner', 'attribute_ruler', 'lemmatizer']\n\u003e\u003e\u003e # You can pass the optional parameters to the contextualSpellCheck\n\u003e\u003e\u003e # e.g. 
pass the max edit distance with config={\"max_edit_dist\": 3}\n\u003e\u003e\u003e nlp.add_pipe(\"contextual spellchecker\")\n\u003ccontextualSpellCheck.contextualSpellCheck.ContextualSpellCheck object at 0x1049f82b0\u003e\n\u003e\u003e\u003e nlp.pipe_names\n['tok2vec', 'tagger', 'parser', 'ner', 'attribute_ruler', 'lemmatizer', 'contextual spellchecker']\n\u003e\u003e\u003e \n\u003e\u003e\u003e doc = nlp(\"Income was $9.4 milion compared to the prior year of $2.7 milion.\")\n\u003e\u003e\u003e print(doc._.performed_spellCheck)\nTrue\n\u003e\u003e\u003e print(doc._.outcome_spellCheck)\nIncome was $9.4 million compared to the prior year of $2.7 million.\n```\n\nAfter adding `contextual spellchecker` to the pipeline, you can use the pipeline as usual. The spell check suggestions and other data can be accessed using [extensions](#Extensions).\n\n### Using the pipeline\n\n```python\n\u003e\u003e\u003e doc = nlp(u'Income was $9.4 milion compared to the prior year of $2.7 milion.')\n\u003e\u003e\u003e \n\u003e\u003e\u003e # Doc Extension\n\u003e\u003e\u003e print(doc._.contextual_spellCheck)\nTrue\n\u003e\u003e\u003e print(doc._.performed_spellCheck)\nTrue\n\u003e\u003e\u003e print(doc._.suggestions_spellCheck)\n{milion: 'million', milion: 'million'}\n\u003e\u003e\u003e print(doc._.outcome_spellCheck)\nIncome was $9.4 million compared to the prior year of $2.7 million.\n\u003e\u003e\u003e print(doc._.score_spellCheck)\n{milion: [('million', 0.59422), ('billion', 0.24349), (',', 0.08809), ('trillion', 0.01835), ('Million', 0.00826), ('%', 0.00672), ('##M', 0.00591), ('annually', 0.0038), ('##B', 0.00205), ('USD', 0.00113)], milion: [('billion', 0.65934), ('million', 0.26185), ('trillion', 0.05391), ('##M', 0.0051), ('Million', 0.00425), ('##B', 0.00268), ('USD', 0.00153), ('##b', 0.00077), ('millions', 0.00059), ('%', 0.00041)]}\n\u003e\u003e\u003e \n\u003e\u003e\u003e # Token Extension\n\u003e\u003e\u003e print(doc[4]._.get_require_spellCheck)\nTrue\n\u003e\u003e\u003e 
print(doc[4]._.get_suggestion_spellCheck)\n'million'\n\u003e\u003e\u003e print(doc[4]._.score_spellCheck)\n[('million', 0.59422), ('billion', 0.24349), (',', 0.08809), ('trillion', 0.01835), ('Million', 0.00826), ('%', 0.00672), ('##M', 0.00591), ('annually', 0.0038), ('##B', 0.00205), ('USD', 0.00113)]\n\u003e\u003e\u003e \n\u003e\u003e\u003e # Span Extension\n\u003e\u003e\u003e print(doc[2:6]._.get_has_spellCheck)\nTrue\n\u003e\u003e\u003e print(doc[2:6]._.score_spellCheck)\n{$: [], 9.4: [], milion: [('million', 0.59422), ('billion', 0.24349), (',', 0.08809), ('trillion', 0.01835), ('Million', 0.00826), ('%', 0.00672), ('##M', 0.00591), ('annually', 0.0038), ('##B', 0.00205), ('USD', 0.00113)], compared: []}\n```\n\n## Extensions\n\nTo simplify usage, `contextual spellchecker` provides custom spaCy extensions that your code can consume, making it easier to retrieve the data you need. contextualSpellCheck provides extensions at the `doc`, `span` and `token` levels. The tables below summarise the extensions.\n\n### `spaCy.Doc` level extensions\n\n| Extension | Type | Description | Default |\n|------------------------------|---------------------------------|-------------------------------------------------------------------------------------------------------------------------------------------------------|---------|\n| doc._.contextual_spellCheck | `Boolean` | To check whether contextualSpellCheck is added as extension | `True` |\n| doc._.performed_spellCheck | `Boolean` | To check whether contextualSpellCheck identified any misspells and performed correction | `False` |\n| doc._.suggestions_spellCheck | `{Spacy.Token:str}` | if corrections are performed, it returns the mapping of misspell token (`spaCy.Token`) with suggested word(`str`) | `{}` |\n| doc._.outcome_spellCheck | `str` | corrected sentence(`str`) as output | `\"\"` |\n| doc._.score_spellCheck | `{Spacy.Token:List(str,float)}` | if corrections are identified, it returns the mapping of 
misspell token (`spaCy.Token`) with suggested words(`str`) and probability of that correction | `None` |\n\n### `spaCy.Span` level extensions\n| Extension | Type | Description | Default |\n|-------------------------------|---------------------------------|---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|----------------------|\n| span._.get_has_spellCheck | `Boolean` | To check whether contextualSpellCheck identified any misspells and performed correction in this span | `False` |\n| span._.score_spellCheck | `{Spacy.Token:List(str,float)}` | if corrections are identified, it returns the mapping of misspell token (`spaCy.Token`) with suggested words(`str`) and probability of that correction for tokens in this `span` | `{spaCy.Token: []}` |\n\n### `spaCy.Token` level extensions\n\n| Extension | Type | Description | Default |\n|-----------------------------------|-----------------|-------------------------------------------------------------------------------------------------------------|---------|\n| token._.get_require_spellCheck | `Boolean` | To check whether contextualSpellCheck identified any misspells and performed correction on this `token` | `False` |\n| token._.get_suggestion_spellCheck | `str` | if corrections are performed, it returns the suggested word(`str`) | `\"\"` |\n| token._.score_spellCheck | `[(str,float)]` | if corrections are identified, it returns suggested words(`str`) and probability(`float`) of that correction | `[]` |\n\n## API\n\nAt present, there is a simple GET API to get you started. 
You can run the app locally and experiment with it.\n\nQuery: You can use the endpoint http://127.0.0.1:5000/?query=YOUR-QUERY\nNote: Your browser can handle the text encoding\n\n```\nGET Request: http://localhost:5000/?query=Income%20was%20$9.4%20milion%20compared%20to%20the%20prior%20year%20of%20$2.7%20milion.\n```\n\nResponse:\n\n```json\n{\n    \"success\": true,\n    \"input\": \"Income was $9.4 milion compared to the prior year of $2.7 milion.\",\n    \"corrected\": \"Income was $9.4 milion compared to the prior year of $2.7 milion.\",\n    \"suggestion_score\": {\n        \"milion\": [\n            [\n                \"million\",\n                0.59422\n            ],\n            [\n                \"billion\",\n                0.24349\n            ],\n            ...\n        ],\n        \"milion:1\": [\n            [\n                \"billion\",\n                0.65934\n            ],\n            [\n                \"million\",\n                0.26185\n            ],\n            ...\n        ]\n    }\n}\n```\n\n## Task List\n\n- [ ] use cython for part of the code to improve performance ([#39](https://github.com/R1j1t/contextualSpellCheck/issues/39))\n- [ ] Improve metric for candidate selection ([#40](https://github.com/R1j1t/contextualSpellCheck/issues/40))\n- [ ] Add examples for other languages ([#41](https://github.com/R1j1t/contextualSpellCheck/issues/41))\n- [ ] Update the logic of misspell identification (OOV) ([#44](https://github.com/R1j1t/contextualSpellCheck/issues/44))\n- [ ] better candidate generation (solved by [#44](https://github.com/R1j1t/contextualSpellCheck/issues/44)?)\n- [ ] add metric by testing on datasets\n- [ ] Improve documentation\n- [ ] Improve logging in code\n- [ ] Add support for Real Word Error (RWE) (Big Task)\n- [ ] add multi mask out capability\n\n\u003cdetails\u003e\u003csummary\u003eCompleted Task\u003c/summary\u003e\n\u003cp\u003e\n\n- [x] specify maximum edit distance for `candidateRanking`\n- [x] allow user 
to specify the BERT model\n- [x] Include transformers deTokenizer to get better suggestions\n- [x] dependency version in setup.py ([#38](https://github.com/R1j1t/contextualSpellCheck/issues/38))\n\n\u003c/p\u003e\n\u003c/details\u003e\n\n## Support and contribution\n\nIf you like the project, please ⭑ the project and show your support! Also, if you feel the current behaviour is not as expected, please feel free to raise an [issue](https://github.com/R1j1t/contextualSpellCheck/issues). If you can help with any of the above tasks, please open a [PR](https://github.com/R1j1t/contextualSpellCheck/pulls) with necessary changes to documentation and tests.\n\n## Cite\n\nIf you are using contextualSpellCheck in your academic work, please consider citing the library using the below BibTeX entry:\n\n```bibtex\n@misc{Goel_Contextual_Spell_Check_2021,\nauthor = {Goel, Rajat},\ndoi = {10.5281/zenodo.4642379},\nmonth = {3},\ntitle = {{Contextual Spell Check}},\nurl = {https://github.com/R1j1t/contextualSpellCheck},\nyear = {2021}\n}\n```\n\n\n\n## Reference\n\nBelow are some of the projects/work I referred to while developing this package\n\n1. Explosion AI. Architecture. May 2020. url: https://spacy.io/api.\n2. Monojit Choudhury et al. “How difficult is it to develop a perfect spell-checker? A cross-linguistic analysis through complex network approach”. In: arXiv preprint physics/0703198 (2007).\n3. Jacob Devlin et al. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. 2019. arXiv:1810.04805 [cs.CL].\n4. Hugging Face. Fast Coreference Resolution in spaCy with Neural Networks. May 2020. url: https://github.com/huggingface/neuralcoref.\n5. Ines. Chapter 3: Processing Pipelines. May 2020. url: https://course.spacy.io/en/chapter3.\n6. Eric Mays, Fred J Damerau, and Robert L Mercer. “Context based spelling correction”. In: Information Processing \u0026 Management 27.5 (1991), pp. 517–522.\n7. Peter Norvig. How to Write a Spelling Corrector. May 2020. 
url: http://norvig.com/spell-correct.html.\n8. Yifu Sun and Haoming Jiang. Contextual Text Denoising with Masked Language Models. 2019. arXiv:1910.14080 [cs.CL].\n9. Thomas Wolf et al. “Transformers: State-of-the-Art Natural Language Processing”. In: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations. Online: Association for Computational Linguistics, Oct. 2020, pp. 38–45. url: https://www.aclweb.org/anthology/2020.emnlp-demos.6.\n\n[1]: \u003chttp://citeseerx.ist.psu.edu/viewdoc/download;jsessionid=52A3B869596656C9DA285DCE83A0339F?doi=10.1.1.146.4390\u0026rep=rep1\u0026type=pdf\u003e\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fr1j1t%2Fcontextualspellcheck","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fr1j1t%2Fcontextualspellcheck","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fr1j1t%2Fcontextualspellcheck/lists"}