{"id":13988747,"url":"https://github.com/filyp/autocorrect","last_synced_at":"2025-12-12T00:40:52.658Z","repository":{"id":37528454,"uuid":"267934308","full_name":"filyp/autocorrect","owner":"filyp","description":"Spelling corrector in python","archived":false,"fork":false,"pushed_at":"2025-07-04T10:24:04.000Z","size":3993,"stargazers_count":484,"open_issues_count":5,"forks_count":93,"subscribers_count":7,"default_branch":"master","last_synced_at":"2025-07-13T18:58:07.692Z","etag":null,"topics":["autocorrect","autocorrection","czech","english","languages","levenshtein-distance","multilanguage","multilingual","nlp","ocr","polish","portuguese","python","russian","spanish","spellchecker","spelling","spelling-corrector","turkish","ukrainian"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"lgpl-3.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/filyp.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":"CONTRIBUTING.md","funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null}},"created_at":"2020-05-29T19:03:39.000Z","updated_at":"2025-07-04T10:24:08.000Z","dependencies_parsed_at":"2024-01-15T16:52:30.272Z","dependency_job_id":null,"html_url":"https://github.com/filyp/autocorrect","commit_stats":{"total_commits":223,"total_committers":19,"mean_commits":"11.736842105263158","dds":"0.30493273542600896","last_synced_commit":"e49c4cdd01e3d482be17a0d03ed02dbc3d7b83cb"},"previous_names":["fsondej/autocorrect"],"tags_count":15,"template":false,"template_full_name":null,"purl":"pkg:github/filyp/autocorrect","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/filyp%2Fautocorrect","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/
repositories/filyp%2Fautocorrect/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/filyp%2Fautocorrect/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/filyp%2Fautocorrect/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/filyp","download_url":"https://codeload.github.com/filyp/autocorrect/tar.gz/refs/heads/master","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/filyp%2Fautocorrect/sbom","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":266465078,"owners_count":23933058,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","status":"online","status_checked_at":"2025-07-22T02:00:09.085Z","response_time":66,"last_error":null,"robots_txt_status":null,"robots_txt_updated_at":null,"robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["autocorrect","autocorrection","czech","english","languages","levenshtein-distance","multilanguage","multilingual","nlp","ocr","polish","portuguese","python","russian","spanish","spellchecker","spelling","spelling-corrector","turkish","ukrainian"],"created_at":"2024-08-09T13:01:20.376Z","updated_at":"2025-10-21T19:40:54.279Z","avatar_url":"https://github.com/filyp.png","language":"Python","readme":"# Autocorrect\n[![Downloads](https://pepy.tech/badge/autocorrect?label=PyPI%20downloads)](https://pepy.tech/project/autocorrect)\n[![Average time to resolve an 
issue](http://isitmaintained.com/badge/resolution/fsondej/autocorrect.svg)](http://isitmaintained.com/project/fsondej/autocorrect \"Average time to resolve an issue\")\n[![Code style: black](https://img.shields.io/badge/code%20style-black-000000.svg)](https://github.com/psf/black)\n\nSpelling corrector in Python. Currently supports English, Polish, Turkish, Russian, Ukrainian, Czech, Portuguese, Greek, Italian, Vietnamese, French and Spanish, but you can easily add new languages.\n\nBased on: https://github.com/phatpiglet/autocorrect and [Peter Norvig's spelling corrector](https://norvig.com/spell-correct.html).\n\n# Installation\n```bash\npip install autocorrect\n```\n\n# Examples\n\nAutocorrect full sentences:\n\n```python\n\u003e\u003e\u003e from autocorrect import Speller\n\u003e\u003e\u003e spell = Speller()\n\u003e\u003e\u003e spell(\"I'm not sleapy and tehre is no place I'm giong to.\")\n\"I'm not sleepy and there is no place I'm going to.\"\n```\n\nUse other languages:\n\n```python\n\u003e\u003e\u003e spell = Speller('pl')\n\u003e\u003e\u003e spell('ptaaki latatją kluczmm')\n'ptaki latają kluczem'\n```\n\nGet multiple correction candidates for a single word:\n\n```python\n\u003e\u003e\u003e spell.get_candidates(\"tehre\")\n[(5437024, 'there'), (5860, 'terre')]\n```\nThe numbers are word frequencies, so the higher the frequency, the more likely the correction.\n\n# Speed\n```python\n%timeit spell(\"I'm not sleapy and tehre is no place I'm giong to.\")\n373 µs ± 2.09 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)\n%timeit spell(\"There is no comin to consiousnes without pain.\")\n150 ms ± 2.02 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)\n```\n\nAs you can see, correcting some words can take ~150 ms. If speed is important for your use case (e.g. a chatbot), you may want to use the 'fast' option:\n```python\nspell = Speller(fast=True)\n%timeit spell(\"There is no comin to consiousnes without pain.\")\n344 µs ± 2.23 µs per loop (mean ± std. dev. 
of 7 runs, 1000 loops each)\n```\nNow, the correction should always run in microseconds, but words with double typos (like 'consiousnes') won't be corrected.\n\n# OCR\nWhen cleaning up OCR output, replacements make up the large majority of errors. If this is the case, you may want to use the option 'only_replacements':\n```python\nspell = Speller(only_replacements=True)\n```\n\n# Custom word sets\nIf you wish to use your own set of words for autocorrection, you can pass an `nlp_data` argument:\n\n```python\nspell = Speller(nlp_data=your_word_frequency_dict)\n```\nHere `your_word_frequency_dict` is a dictionary that maps words to their average frequencies in your text. If you only want to change the default word set a bit, you can simply edit the `spell.nlp_data` attribute after `spell` has been initialized.\n\n# Adding new languages\n\n## A simpler but untested way - wordfreq\n\nIt should be possible to get word frequencies from the [wordfreq](https://github.com/rspeer/wordfreq/) package. You should be able to provide this word frequency data to autocorrect through the `nlp_data` parameter. You will also need to generate an appropriate alphabet (see `constants.py`).\n\n## A more complicated but tested way - wikipedia text\n\n*Note: I will no longer accept PRs to add individual languages using this method. A more sensible approach would be to try the wordfreq method and add many languages at once in some general way. But I don't have time to implement this myself.*\n\nFirst, define the special letters by adding entries to the `word_regexes` and `alphabets` dicts in `autocorrect/constants.py`.\n\nNow, you need a bunch of text. 
The easiest way is to download Wikipedia.\nFor example, for Russian you would go to:\nhttps://dumps.wikimedia.org/ruwiki/latest/\nand download ruwiki-latest-pages-articles.xml.bz2\n\n```\nbzip2 -d ruwiki-latest-pages-articles.xml.bz2\n```\n\nAfter that:\n\nFirst, edit the `autocorrect.constants` dictionaries to accommodate the regexes and dictionaries for your language.\n\nThen:\n\n```python\n\u003e\u003e\u003e from autocorrect.word_count import count_words\n\u003e\u003e\u003e count_words('ruwiki-latest-pages-articles.xml', 'ru')\n```\n\n```\ntar -zcvf autocorrect/data/ru.tar.gz word_count.json\n```\n\nFor the correction to work well, you need to cut out rarely used words. First, in test_all.py, write test words for your language and add them to optional_language_tests the same way as is done for other languages. It's good to have at least 30 words. Now run:\n```\npython test_all.py find_threshold ru\n```\nand see which threshold value yields the fewest badly corrected words. After that, manually delete all the words with fewer occurrences than the threshold value you found from the file in ru.tar.gz (it's already sorted, so this should be easy).\n\nTo distribute this language support to others, you will need to upload your tar.gz file to IPFS (for example with Pinata, which will pin the file so it doesn't disappear), and then add its path to `ipfs_paths` in `constants.py`. (Tip: put the file inside a folder first and upload the folder to IPFS, so that the downloaded file has the correct filename.)\n\nGood luck!\n","funding_links":[],"categories":["Python"],"sub_categories":[],"project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Ffilyp%2Fautocorrect","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Ffilyp%2Fautocorrect","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Ffilyp%2Fautocorrect/lists"}