{"id":13699542,"url":"https://github.com/bakwc/JamSpell","last_synced_at":"2025-05-04T16:34:53.689Z","repository":{"id":27911290,"uuid":"110459622","full_name":"bakwc/JamSpell","owner":"bakwc","description":"Modern spell checking library - accurate, fast, multi-language","archived":false,"fork":false,"pushed_at":"2024-05-23T18:36:55.000Z","size":711,"stargazers_count":595,"open_issues_count":22,"forks_count":99,"subscribers_count":11,"default_branch":"master","last_synced_at":"2024-05-29T21:33:50.527Z","etag":null,"topics":["cpp","csharp","java","ngrams","nlp","python","ruby","spellcheck","spellchecker","spelling-correction"],"latest_commit_sha":null,"homepage":"https://jamspell.com/","language":"C++","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/bakwc.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2017-11-12T18:52:53.000Z","updated_at":"2024-06-18T19:47:16.518Z","dependencies_parsed_at":"2024-06-18T19:47:15.201Z","dependency_job_id":"9e5c37e4-48bb-4df1-8b8e-7cab3c88ab68","html_url":"https://github.com/bakwc/JamSpell","commit_stats":{"total_commits":206,"total_committers":17,"mean_commits":"12.117647058823529","dds":0.2184466019417476,"last_synced_commit":"1ece237a89b75a018b9e3093e7213bd0a112cb47"},"previous_names":["bakwc/openspell"],"tags_count":11,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/bakwc%2FJamSpell","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/bakwc%2FJamSpell/tags","releases_url":"https://repos.ecosyste.ms/api/v1/ho
sts/GitHub/repositories/bakwc%2FJamSpell/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/bakwc%2FJamSpell/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/bakwc","download_url":"https://codeload.github.com/bakwc/JamSpell/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":224398825,"owners_count":17304661,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["cpp","csharp","java","ngrams","nlp","python","ruby","spellcheck","spellchecker","spelling-correction"],"created_at":"2024-08-02T20:00:35.962Z","updated_at":"2024-11-13T05:31:31.772Z","avatar_url":"https://github.com/bakwc.png","language":"C++","readme":"# JamSpell\n\n[![Build Status][travis-image]][travis] [![Release][release-image]][releases]\n\n[travis-image]: https://travis-ci.org/bakwc/JamSpell.svg?branch=master\n[travis]: https://travis-ci.org/bakwc/JamSpell\n\n[release-image]: https://img.shields.io/badge/release-0.0.12-blue.svg?style=flat\n[releases]: https://github.com/bakwc/JamSpell/releases\n\nJamSpell is a spell checking library with following features:\n\n- **accurate** - it considers words surroundings (context) for better correction\n- **fast** - near 5K words per second\n- **multi-language** - it's written in C++ and available for many languages with swig bindings\n\n[Colab example](https://colab.research.google.com/drive/1aFk8-7nq3oAp402jjLGLpEb2Nzq210Eo)\n\n## JamSpellPro\n[jamspell.com](https://jamspell.com) - check out a new jamspell version with following features\n - Improved 
accuracy ([catboost](https://catboost.ai) gradient boosted decision trees candidates ranking model)\n - Splits merged words\n - Pre-trained models for many languages (small, medium, large) for:  \n`en, ru, de, fr, it, es, tr, uk, pl, nl, pt, hi, no`\n - Ability to add words / sentences at runtime\n - Fine-tuning / additional training\n - Memory optimization for training large models\n - Static dictionary support\n - Built-in `Java, C#, Ruby` support\n - Windows support\n\n## Content\n- [Benchmarks](#benchmarks)\n- [Usage](#usage)\n  - [Python](#python)\n  - [C++](#c)\n  - [Other languages](#other-languages)\n  - [HTTP API](#http-api)\n- [Train](#train)\n\n## Benchmarks\n\n\u003ctable\u003e\n  \u003ctr\u003e\n    \u003ctd\u003e\u003c/td\u003e\n    \u003ctd\u003eErrors\u003c/td\u003e\n    \u003ctd\u003eTop 7 Errors\u003c/td\u003e\n    \u003ctd\u003eFix Rate\u003c/td\u003e\n    \u003ctd\u003eTop 7 Fix Rate\u003c/td\u003e\n    \u003ctd\u003eBroken\u003c/td\u003e\n    \u003ctd\u003eSpeed\u003cbr\u003e\n(words/second)\u003c/td\u003e\n  \u003c/tr\u003e\n  \u003ctr\u003e\n    \u003ctd\u003eJamSpell\u003c/td\u003e\n    \u003ctd\u003e3.25%\u003c/td\u003e\n    \u003ctd\u003e1.27%\u003c/td\u003e\n    \u003ctd\u003e79.53%\u003c/td\u003e\n    \u003ctd\u003e84.10%\u003c/td\u003e\n    \u003ctd\u003e0.64%\u003c/td\u003e\n    \u003ctd\u003e4854\u003c/td\u003e\n  \u003c/tr\u003e\n  \u003ctr\u003e\n    \u003ctd\u003eNorvig\u003c/td\u003e\n    \u003ctd\u003e7.62%\u003c/td\u003e\n    \u003ctd\u003e5.00%\u003c/td\u003e\n    \u003ctd\u003e46.58%\u003c/td\u003e\n    \u003ctd\u003e66.51%\u003c/td\u003e\n    \u003ctd\u003e0.69%\u003c/td\u003e\n    \u003ctd\u003e395\u003c/td\u003e\n  \u003c/tr\u003e\n  \u003ctr\u003e\n    \u003ctd\u003eHunspell\u003c/td\u003e\n    \u003ctd\u003e13.10%\u003c/td\u003e\n    \u003ctd\u003e10.33%\u003c/td\u003e\n    \u003ctd\u003e47.52%\u003c/td\u003e\n    \u003ctd\u003e68.56%\u003c/td\u003e\n    \u003ctd\u003e7.14%\u003c/td\u003e\n    
\u003ctd\u003e163\u003c/td\u003e\n  \u003c/tr\u003e\n  \u003ctr\u003e\n    \u003ctd\u003eDummy\u003c/td\u003e\n    \u003ctd\u003e13.14%\u003c/td\u003e\n    \u003ctd\u003e13.14%\u003c/td\u003e\n    \u003ctd\u003e0.00%\u003c/td\u003e\n    \u003ctd\u003e0.00%\u003c/td\u003e\n    \u003ctd\u003e0.00%\u003c/td\u003e\n    \u003ctd\u003e-\u003c/td\u003e\n  \u003c/tr\u003e\n\u003c/table\u003e\n\nThe model was trained on [300K Wikipedia sentences + 300K news sentences (English)](http://wortschatz.uni-leipzig.de/en/download/). 95% was used for training and 5% for evaluation. An [errors model](https://github.com/bakwc/JamSpell/blob/master/evaluate/typo_model.py) was used to generate errored text from the original. The JamSpell corrector was compared with [Norvig's one](http://norvig.com/spell-correct.html), [Hunspell](http://hunspell.github.io/) and a dummy one (no corrections).\n\nWe used the following metrics:\n- **Errors** - percent of words with errors after spell checker processing\n- **Top 7 Errors** - percent of words missing from the top 7 candidates\n- **Fix Rate** - percent of errored words fixed by the spell checker\n- **Top 7 Fix Rate** - percent of errored words fixed by one of the top 7 candidates\n- **Broken** - percent of non-errored words broken by the spell checker\n- **Speed** - number of words per second\n\nTo ensure that our model is not too overfitted to wikipedia+news, we checked it on the text of \"The Adventures of Sherlock Holmes\":\n\n\u003ctable\u003e\n  \u003ctr\u003e\n    \u003ctd\u003e\u003c/td\u003e\n    \u003ctd\u003eErrors\u003c/td\u003e\n    \u003ctd\u003eTop 7 Errors\u003c/td\u003e\n    \u003ctd\u003eFix Rate\u003c/td\u003e\n    \u003ctd\u003eTop 7 Fix Rate\u003c/td\u003e\n    \u003ctd\u003eBroken\u003c/td\u003e\n    \u003ctd\u003eSpeed\u003cbr\u003e\n(words/second)\u003c/td\u003e\n  \u003c/tr\u003e\n  \u003ctr\u003e\n    \u003ctd\u003eJamSpell\u003c/td\u003e\n    \u003ctd\u003e3.56%\u003c/td\u003e\n    \u003ctd\u003e1.27%\u003c/td\u003e\n    \u003ctd\u003e72.03%\u003c/td\u003e\n    
\u003ctd\u003e79.73%\u003c/td\u003e\n    \u003ctd\u003e0.50%\u003c/td\u003e\n    \u003ctd\u003e5524\u003c/td\u003e\n  \u003c/tr\u003e\n  \u003ctr\u003e\n    \u003ctd\u003eNorvig\u003c/td\u003e\n    \u003ctd\u003e7.60%\u003c/td\u003e\n    \u003ctd\u003e5.30%\u003c/td\u003e\n    \u003ctd\u003e35.43%\u003c/td\u003e\n    \u003ctd\u003e56.06%\u003c/td\u003e\n    \u003ctd\u003e0.45%\u003c/td\u003e\n    \u003ctd\u003e647\u003c/td\u003e\n  \u003c/tr\u003e\n  \u003ctr\u003e\n    \u003ctd\u003eHunspell\u003c/td\u003e\n    \u003ctd\u003e9.36%\u003c/td\u003e\n    \u003ctd\u003e6.44%\u003c/td\u003e\n    \u003ctd\u003e39.61%\u003c/td\u003e\n    \u003ctd\u003e65.77%\u003c/td\u003e\n    \u003ctd\u003e2.95%\u003c/td\u003e\n    \u003ctd\u003e284\u003c/td\u003e\n  \u003c/tr\u003e\n  \u003ctr\u003e\n    \u003ctd\u003eDummy\u003c/td\u003e\n    \u003ctd\u003e11.16%\u003c/td\u003e\n    \u003ctd\u003e11.16%\u003c/td\u003e\n    \u003ctd\u003e0.00%\u003c/td\u003e\n    \u003ctd\u003e0.00%\u003c/td\u003e\n    \u003ctd\u003e0.00%\u003c/td\u003e\n    \u003ctd\u003e-\u003c/td\u003e\n  \u003c/tr\u003e\n\u003c/table\u003e\n\nMore details about reproducing these results are available in the \"[Train](#train)\" section.\n\n## Usage\n### Python\n1. Install ```swig3``` (usually available from your distro's package manager)\n\n2. Install ```jamspell```:\n```bash\npip install jamspell\n```\n3. [Download](#download-models) or [train](#train) a language model\n\n4. Use it:\n\n```python\nimport jamspell\n\ncorrector = jamspell.TSpellCorrector()\ncorrector.LoadLangModel('en.bin')\n\ncorrector.FixFragment('I am the begt spell cherken!')\n# u'I am the best spell checker!'\n\ncorrector.GetCandidates(['i', 'am', 'the', 'begt', 'spell', 'cherken'], 3)\n# (u'best', u'beat', u'belt', u'bet', u'bent', ... )\n\ncorrector.GetCandidates(['i', 'am', 'the', 'begt', 'spell', 'cherken'], 5)\n# (u'checker', u'chicken', u'checked', u'wherein', u'coherent', ...)\n```\n\n### C++\n1. Add the `jamspell` and `contrib` dirs to your project\n\n2. 
Use it:\n\n```cpp\n#include \u003cjamspell/spell_corrector.hpp\u003e\n\nint main(int argc, const char** argv) {\n\n    NJamSpell::TSpellCorrector corrector;\n    corrector.LoadLangModel(\"model.bin\");\n\n    corrector.FixFragment(L\"I am the begt spell cherken!\");\n    // \"I am the best spell checker!\"\n\n    // Candidates for the word at index 3 (\"begt\")\n    corrector.GetCandidates({L\"i\", L\"am\", L\"the\", L\"begt\", L\"spell\", L\"cherken\"}, 3);\n    // \"best\", \"beat\", \"belt\", \"bet\", \"bent\", ...\n\n    // Candidates for the word at index 5 (\"cherken\")\n    corrector.GetCandidates({L\"i\", L\"am\", L\"the\", L\"begt\", L\"spell\", L\"cherken\"}, 5);\n    // \"checker\", \"chicken\", \"checked\", \"wherein\", \"coherent\", ...\n    return 0;\n}\n```\n\n### Other languages\nYou can generate extensions for other languages by following the [SWIG tutorial](http://www.swig.org/tutorial.html). The SWIG interface file is `jamspell.i`. Pull requests with build scripts are welcome.\n\n## HTTP API\n* Install ```cmake```\n\n* Clone and build jamspell (it includes an HTTP server):\n```bash\ngit clone https://github.com/bakwc/JamSpell.git\ncd JamSpell\nmkdir build\ncd build\ncmake ..\nmake\n```\n* [Download](#download-models) or [train](#train) a language model\n* Run the HTTP server:\n```bash\n./web_server/web_server en.bin localhost 8080\n```\n* **GET** request example:\n```bash\n$ curl \"http://localhost:8080/fix?text=I am the begt spell cherken\"\nI am the best spell checker\n```\n* **POST** request example:\n```bash\n$ curl -d \"I am the begt spell cherken\" http://localhost:8080/fix\nI am the best spell checker\n```\n* Candidates example:\n```bash\ncurl \"http://localhost:8080/candidates?text=I am the begt spell cherken\"\n# or\ncurl -d \"I am the begt spell cherken\" http://localhost:8080/candidates\n```\n```javascript\n{\n    \"results\": [\n        {\n            \"candidates\": [\n                \"best\",\n                \"beat\",\n                \"belt\",\n                \"bet\",\n                \"bent\",\n                \"beet\",\n                \"beit\"\n         
   ],\n            \"len\": 4,\n            \"pos_from\": 9\n        },\n        {\n            \"candidates\": [\n                \"checker\",\n                \"chicken\",\n                \"checked\",\n                \"wherein\",\n                \"coherent\",\n                \"cheered\",\n                \"cherokee\"\n            ],\n            \"len\": 7,\n            \"pos_from\": 20\n        }\n    ]\n}\n```\nHere `pos_from` is the zero-based position of the misspelled word's first letter and `len` is the length of the misspelled word.\n\n## Train\nTo train a custom model:\n\n1. Install ```cmake```\n\n2. Clone and build jamspell:\n```bash\ngit clone https://github.com/bakwc/JamSpell.git\ncd JamSpell\nmkdir build\ncd build\ncmake ..\nmake\n```\n\n3. Prepare a UTF-8 text file with sentences to train on (e.g. [```sherlockholmes.txt```](https://github.com/bakwc/JamSpell/blob/master/test_data/sherlockholmes.txt)) and another file with the language alphabet (e.g. [```alphabet_en.txt```](https://github.com/bakwc/JamSpell/blob/master/test_data/alphabet_en.txt))\n\n4. Train the model:\n```bash\n./main/jamspell train ../test_data/alphabet_en.txt ../test_data/sherlockholmes.txt model_sherlock.bin\n```\n5. To evaluate the spell checker you can use the ```evaluate/evaluate.py``` script:\n```bash\npython evaluate/evaluate.py -a alphabet_file.txt -jsp your_model.bin -mx 50000 your_test_data.txt\n```\n6. You can use ```evaluate/generate_dataset.py``` to generate your train/test data. It supports txt files, the [Leipzig Corpora Collection](http://wortschatz.uni-leipzig.de/en/download/) format and fb2 books.\n\n## Download models\nHere are a few simple models. They were trained on 300K news + 300K wikipedia sentences. We strongly recommend training your own model on at least a few million sentences to achieve better quality. 
See [Train](#train) section above.\n\n - [en.tar.gz](https://github.com/bakwc/JamSpell-models/raw/master/en.tar.gz) (35Mb)\n - [fr.tar.gz](https://github.com/bakwc/JamSpell-models/raw/master/fr.tar.gz) (31Mb)\n - [ru.tar.gz](https://github.com/bakwc/JamSpell-models/raw/master/ru.tar.gz) (38Mb)\n","funding_links":[],"categories":["C++","Spelling correction"],"sub_categories":["Other"],"project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fbakwc%2FJamSpell","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fbakwc%2FJamSpell","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fbakwc%2FJamSpell/lists"}