https://github.com/bakwc/JamSpell

Modern spell checking library - accurate, fast, multi-language

Host: GitHub
URL: https://github.com/bakwc/JamSpell
Owner: bakwc
License: mit
Created: 2017-11-12T18:52:53.000Z (over 7 years ago)
Default Branch: master
Last Pushed: 2024-05-23T18:36:55.000Z (about 1 year ago)
Last Synced: 2024-05-29T21:33:50.527Z (about 1 year ago)
Topics: cpp, csharp, java, ngrams, nlp, python, ruby, spellcheck, spellchecker, spelling-correction
Language: C++
Homepage: https://jamspell.com/
Size: 694 KB
Stars: 595
Watchers: 11
Forks: 99
Open Issues: 22
Metadata Files:
- Readme: README.md
- License: LICENSE

# JamSpell

[![Build Status][travis-image]][travis] [![Release][release-image]][releases]

[travis-image]: https://travis-ci.org/bakwc/JamSpell.svg?branch=master

[travis]: https://travis-ci.org/bakwc/JamSpell

[release-image]: https://img.shields.io/badge/release-0.0.12-blue.svg?style=flat

[releases]: https://github.com/bakwc/JamSpell/releases

JamSpell is a spell checking library with the following features:

- **accurate** - it considers a word's surroundings (context) for better correction

- **fast** - about 5K words per second

- **multi-language** - it's written in C++ and available for many languages via SWIG bindings

[Colab example](https://colab.research.google.com/drive/1aFk8-7nq3oAp402jjLGLpEb2Nzq210Eo)

## JamSpellPro

[jamspell.com](https://jamspell.com) - check out the new JamSpell version with the following features:

 - Improved accuracy ([catboost](https://catboost.ai) gradient boosted decision trees candidates ranking model)

 - Splits merged words

 - Pre-trained models for many languages (small, medium, large) for:  

`en, ru, de, fr, it, es, tr, uk, pl, nl, pt, hi, no`

 - Ability to add words / sentences at runtime

 - Fine-tuning / additional training

 - Memory optimization for training large models

 - Static dictionary support

 - Built-in `Java, C#, Ruby` support

 - Windows support

## Content

- [Benchmarks](#benchmarks)

- [Usage](#usage)

  - [Python](#python)

  - [C++](#c)

  - [Other languages](#other-languages)

  - [HTTP API](#http-api)

- [Train](#train)

## Benchmarks

  

    

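|          | Errors | Top 7 Errors | Fix Rate | Top 7 Fix Rate | Broken | Speed (words/second) |
|----------|--------|--------------|----------|----------------|--------|----------------------|
| JamSpell | 3.25%  | 1.27%        | 79.53%   | 84.10%         | 0.64%  | 4854                 |
| Norvig   | 7.62%  | 5.00%        | 46.58%   | 66.51%         | 0.69%  | 395                  |
| Hunspell | 13.10% | 10.33%       | 47.52%   | 68.56%         | 7.14%  | 163                  |
| Dummy    | 13.14% | 13.14%       | 0.00%    | 0.00%          | 0.00%  | -                    |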

  

The model was trained on [300K Wikipedia sentences + 300K news sentences (English)](http://wortschatz.uni-leipzig.de/en/download/). 95% of the data was used for training and 5% for evaluation. The [errors model](https://github.com/bakwc/JamSpell/blob/master/evaluate/typo_model.py) was used to generate errored text from the original. The JamSpell corrector was compared with [Norvig's](http://norvig.com/spell-correct.html), [Hunspell](http://hunspell.github.io/) and a dummy corrector (no corrections).
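The evaluation setup described above (a 95/5 split plus synthetic errors) can be pictured with a small sketch. This is not the repository's `typo_model.py`: the names `add_typo`, `corrupt` and `split_train_test`, the uniform choice between four edit types, and the fixed lowercase English alphabet are all assumptions made for illustration.

```python
import random

ALPHABET = "abcdefghijklmnopqrstuvwxyz"  # assumption: lowercase English only


def add_typo(word, rng):
    """Apply one random edit (replace, insert, delete or swap) to a word."""
    if len(word) < 2:
        return word  # too short to edit meaningfully
    pos = rng.randrange(len(word))
    kind = rng.choice(["replace", "insert", "delete", "swap"])
    if kind == "replace":
        return word[:pos] + rng.choice(ALPHABET) + word[pos + 1:]
    if kind == "insert":
        return word[:pos] + rng.choice(ALPHABET) + word[pos:]
    if kind == "delete":
        return word[:pos] + word[pos + 1:]
    # swap: exchange the character with its right neighbour
    if pos == len(word) - 1:
        pos -= 1
    return word[:pos] + word[pos + 1] + word[pos] + word[pos + 2:]


def corrupt(words, error_rate, rng):
    """Attempt a typo on each word with probability error_rate."""
    return [add_typo(w, rng) if rng.random() < error_rate else w for w in words]


def split_train_test(sentences, train_percent=95):
    """Split a list of sentences into train and test parts."""
    cut = len(sentences) * train_percent // 100
    return sentences[:cut], sentences[cut:]
```

Passing a seeded `random.Random` instance as `rng` keeps the generated errors reproducible between runs.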

We used the following metrics:

- **Errors** - percent of words with errors after the spell checker has processed the text

- **Top 7 Errors** - percent of words whose correct form is missing from the top 7 candidates

- **Fix Rate** - percent of errored words fixed by the spell checker

- **Top 7 Fix Rate** - percent of errored words fixed by one of the top 7 candidates

- **Broken** - percent of non-errored words broken by the spell checker

- **Speed** - number of words per second
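As a rough illustration of how the Errors, Fix Rate and Broken metrics relate to each other, here is a minimal sketch that computes them from three aligned word lists. It is not the repository's `evaluate.py`; the function name `evaluate` and the dictionary keys are hypothetical, and the top 7 variants and speed are left out.

```python
def evaluate(original, errored, corrected):
    """Compute Errors, Fix Rate and Broken (in percent) over aligned word lists.

    original  - the words as they should be
    errored   - the same words after typos were injected
    corrected - the spell checker's output for the errored words
    """
    if not (len(original) == len(errored) == len(corrected)):
        raise ValueError("word lists must be aligned")

    still_wrong = had_error = fixed = clean = broken = 0
    for orig, err, fix in zip(original, errored, corrected):
        if fix != orig:
            still_wrong += 1      # counts towards Errors
        if err != orig:
            had_error += 1
            if fix == orig:
                fixed += 1        # counts towards Fix Rate
        else:
            clean += 1
            if fix != orig:
                broken += 1       # counts towards Broken

    def percent(part, whole):
        return 100.0 * part / whole if whole else 0.0

    return {
        "errors": percent(still_wrong, len(original)),
        "fix_rate": percent(fixed, had_error),
        "broken": percent(broken, clean),
    }
```

With these definitions a corrector that changes nothing has a Fix Rate and Broken of 0%, and its Errors value equals the share of errored words in the input, which matches the Dummy rows in the tables.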

To ensure that our model is not overfitted to Wikipedia + news, we also checked it on the text of "The Adventures of Sherlock Holmes":

  

    

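|          | Errors | Top 7 Errors | Fix Rate | Top 7 Fix Rate | Broken | Speed (words per second) |
|----------|--------|--------------|----------|----------------|--------|--------------------------|
| JamSpell | 3.56%  | 1.27%        | 72.03%   | 79.73%         | 0.50%  | 5524                     |
| Norvig   | 7.60%  | 5.30%        | 35.43%   | 56.06%         | 0.45%  | 647                      |
| Hunspell | 9.36%  | 6.44%        | 39.61%   | 65.77%         | 2.95%  | 284                      |
| Dummy    | 11.16% | 11.16%       | 0.00%    | 0.00%          | 0.00%  | -                        |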

  

More details on reproducing these results are available in the "[Train](#train)" section.

## Usage

### Python

1. Install ```swig3``` (usually it is in your distro package manager)

2. Install ```jamspell```:

```bash
pip install jamspell
```

3. [Download](#download-models) or [train](#train) a language model

4. Use it:

```python
import jamspell

corrector = jamspell.TSpellCorrector()
corrector.LoadLangModel('en.bin')

# Correct a whole text fragment
corrector.FixFragment('I am the begt spell cherken!')
# u'I am the best spell checker!'

# Candidates for the word at index 3 ('begt')
corrector.GetCandidates(['i', 'am', 'the', 'begt', 'spell', 'cherken'], 3)
# (u'best', u'beat', u'belt', u'bet', u'bent', ... )

# Candidates for the word at index 5 ('cherken')
corrector.GetCandidates(['i', 'am', 'the', 'begt', 'spell', 'cherken'], 5)
# (u'checker', u'chicken', u'checked', u'wherein', u'coherent', ...)
```
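To list suggestions for a whole sentence rather than one position, the `GetCandidates` call shown above can be wrapped in a loop. The sketch below is only an illustration: the helper name `suggest` is hypothetical, and it assumes that candidates come back best first and that a correctly spelled word is returned as its own first candidate.

```python
def suggest(corrector, words, limit=3):
    """Return {position: top candidates} for words the corrector would change.

    corrector - any object with GetCandidates(words, position),
                e.g. a jamspell.TSpellCorrector with a loaded model
    words     - list of tokens, as in the GetCandidates examples
    limit     - maximum number of candidates to keep per word
    """
    suggestions = {}
    for position, word in enumerate(words):
        candidates = list(corrector.GetCandidates(words, position))
        # Assumption: if the best candidate equals the word, nothing needs fixing.
        if candidates and candidates[0] != word:
            suggestions[position] = candidates[:limit]
    return suggestions
```

With a loaded model this would be called as `suggest(corrector, ['i', 'am', 'the', 'begt', 'spell', 'cherken'])`.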

### C++

1. Add `jamspell` and `contrib` dirs to your project

2. Use it:

```cpp
#include <jamspell/spell_corrector.hpp>

int main(int argc, const char** argv) {
    NJamSpell::TSpellCorrector corrector;
    corrector.LoadLangModel("model.bin");

    corrector.FixFragment(L"I am the begt spell cherken!");
    // "I am the best spell checker!"

    // Candidates for the word at index 3 ("begt")
    corrector.GetCandidates({L"i", L"am", L"the", L"begt", L"spell", L"cherken"}, 3);
    // "best", "beat", "belt", "bet", "bent", ...

    // Candidates for the word at index 5 ("cherken")
    corrector.GetCandidates({L"i", L"am", L"the", L"begt", L"spell", L"cherken"}, 5);
    // "checker", "chicken", "checked", "wherein", "coherent", ...

    return 0;
}
```

### Other languages

You can generate extensions for other languages by following the [SWIG tutorial](http://www.swig.org/tutorial.html). The SWIG interface file is `jamspell.i`. Pull requests with build scripts are welcome.

### HTTP API

* Install ```cmake```

* Clone and build jamspell (it includes http server):

```bash
git clone https://github.com/bakwc/JamSpell.git
cd JamSpell
mkdir build
cd build
cmake ..
make
```

* [Download](#download-models) or [train](#train) a language model

* Run the HTTP server:

```bash
./web_server/web_server en.bin localhost 8080
```

* **GET** Request example:

```bash
$ curl "http://localhost:8080/fix?text=I am the begt spell cherken"
I am the best spell checker
```

* **POST** Request example

```bash
$ curl -d "I am the begt spell cherken" http://localhost:8080/fix
I am the best spell checker
```

* Candidate example

```bash
curl "http://localhost:8080/candidates?text=I am the begt spell cherken"
# or
curl -d "I am the begt spell cherken" http://localhost:8080/candidates
```

```javascript
{
    "results": [
        {
            "candidates": [
                "best",
                "beat",
                "belt",
                "bet",
                "bent",
                "beet",
                "beit"
            ],
            "len": 4,
            "pos_from": 9
        },
        {
            "candidates": [
                "checker",
                "chicken",
                "checked",
                "wherein",
                "coherent",
                "cheered",
                "cherokee"
            ],
            "len": 7,
            "pos_from": 20
        }
    ]
}
```

Here `pos_from` is the zero-based position of the misspelled word's first letter and `len` is the length of the misspelled word.
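A client can use `pos_from` and `len` to patch the original text itself. The following sketch builds a percent-encoded request URL for the `candidates` endpoint and applies the first candidate of each result. The helper names are hypothetical, and two assumptions are made: the server accepts percent-encoded query strings, and the offsets count characters (which is what the ASCII sample above shows).

```python
from urllib.parse import quote, urlencode

BASE_URL = "http://localhost:8080"  # assumption: server started as shown above


def candidates_url(text):
    """Build the GET URL for the candidates endpoint, encoding spaces as %20."""
    return f"{BASE_URL}/candidates?{urlencode({'text': text}, quote_via=quote)}"


def apply_first_candidates(text, response):
    """Replace every flagged span in text with its first candidate.

    response is the parsed JSON object returned by the candidates endpoint.
    """
    fixed = text
    # Go right to left so earlier offsets stay valid after each replacement.
    for item in sorted(response["results"], key=lambda r: r["pos_from"], reverse=True):
        if not item["candidates"]:
            continue
        start = item["pos_from"]
        end = start + item["len"]
        fixed = fixed[:start] + item["candidates"][0] + fixed[end:]
    return fixed
```

The response body can be parsed with `json.loads` before being passed to `apply_first_candidates`. Picking a candidate other than the first one is a matter of changing the index.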

## Train

To train a custom model:

1. Install ```cmake```

2. Clone and build jamspell:

```bash
git clone https://github.com/bakwc/JamSpell.git
cd JamSpell
mkdir build
cd build
cmake ..
make
```

3. Prepare a UTF-8 text file with sentences to train on (e.g. [```sherlockholmes.txt```](https://github.com/bakwc/JamSpell/blob/master/test_data/sherlockholmes.txt)) and another file with the language alphabet (e.g. [```alphabet_en.txt```](https://github.com/bakwc/JamSpell/blob/master/test_data/alphabet_en.txt))

4. Train the model:

```bash
./main/jamspell train ../test_data/alphabet_en.txt ../test_data/sherlockholmes.txt model_sherlock.bin
```

5. To evaluate the spell checker, you can use the ```evaluate/evaluate.py``` script:

```bash
python evaluate/evaluate.py -a alphabet_file.txt -jsp your_model.bin -mx 50000 your_test_data.txt
```

6. You can use ```evaluate/generate_dataset.py``` to generate your train/test data. It supports txt files, the [Leipzig Corpora Collection](http://wortschatz.uni-leipzig.de/en/download/) format and fb2 books.

## Download models

Here are a few simple models. They were trained on 300K news + 300K Wikipedia sentences. We strongly recommend training your own model on at least a few million sentences to achieve better quality. See the [Train](#train) section above.

 - [en.tar.gz](https://github.com/bakwc/JamSpell-models/raw/master/en.tar.gz) (35Mb)

 - [fr.tar.gz](https://github.com/bakwc/JamSpell-models/raw/master/fr.tar.gz) (31Mb)

 - [ru.tar.gz](https://github.com/bakwc/JamSpell-models/raw/master/ru.tar.gz) (38Mb)
