https://github.com/adrianeboyd/brillmoorespellchecker

Spell checker using Brill and Moore's noisy channel error model
java spellchecker spelling-correction

Host: GitHub
URL: https://github.com/adrianeboyd/brillmoorespellchecker
Owner: adrianeboyd
License: apache-2.0
Created: 2017-08-14T10:22:04.000Z (about 8 years ago)
Default Branch: master
Last Pushed: 2019-01-09T10:02:58.000Z (almost 7 years ago)
Last Synced: 2025-04-23T04:13:31.553Z (6 months ago)
Topics: java, spellchecker, spelling-correction
Language: Java
Homepage:
Size: 1.34 MB
Stars: 11
Watchers: 2
Forks: 5
Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE

Brill and Moore Noisy Channel Spelling Correction
=================================================

This is a Java implementation of the noisy channel spell checking approach
presented in:

Brill and Moore (2000). [An Improved Error Model for Noisy Channel Spelling
Correction](http://www.aclweb.org/anthology/P00-1037). In _Proceedings of the
ACL 2000_.

The spell checker's error model is trained on a list of pairs of misspellings
with corrections, considering generic character edits up to a specified maximum
edit length (e.g., the edit `ant`→`ent` from the pair
`dependant`→`dependent`).
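
As a rough illustration of what such edits look like, the sketch below lists the substring edits around a single differing region of a pair, adding up to `maxContext` characters of shared context. It is a simplification under stated assumptions: it only handles pairs that differ in one contiguous region, and `EditWindowSketch`, `contextEdits`, and `maxContext` are hypothetical names that are not part of this library, which aligns misspellings and corrections more generally.

```java
import java.util.ArrayList;
import java.util.List;

class EditWindowSketch {

    /**
     * Lists "wrong->right" substring edits around the one differing region of a
     * pair, adding up to maxContext characters of shared context in total.
     * Hypothetical helper for illustration; not the library's alignment code.
     */
    static List<String> contextEdits(String misspelling, String target, int maxContext) {
        List<String> edits = new ArrayList<>();
        if (misspelling.equals(target)) {
            return edits; // nothing differs, so there is no edit to report
        }
        int shortest = Math.min(misspelling.length(), target.length());

        // Length of the shared prefix.
        int prefix = 0;
        while (prefix < shortest && misspelling.charAt(prefix) == target.charAt(prefix)) {
            prefix++;
        }

        // Length of the shared suffix that does not overlap the prefix.
        int suffix = 0;
        while (suffix < shortest - prefix
                && misspelling.charAt(misspelling.length() - 1 - suffix)
                        == target.charAt(target.length() - 1 - suffix)) {
            suffix++;
        }

        int wrongEnd = misspelling.length() - suffix;
        int rightEnd = target.length() - suffix;

        // Grow the differing region by `left` and `right` context characters.
        for (int left = 0; left <= Math.min(maxContext, prefix); left++) {
            for (int right = 0; right <= Math.min(maxContext - left, suffix); right++) {
                String from = misspelling.substring(prefix - left, wrongEnd + right);
                String to = target.substring(prefix - left, rightEnd + right);
                edits.add(from + "->" + to);
            }
        }
        return edits;
    }

    public static void main(String[] args) {
        // The resulting list includes "a->e" and "ant->ent".
        System.out.println(contextEdits("dependant", "dependent", 2));
    }
}
```

How the library's window option relates to `maxContext` here is not specified; the sketch only shows why a single-letter error can yield several longer edits.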

To use this spell checker you need:

- a list of misspellings with corrections
- a list of potential corrections (i.e., a dictionary of real words)

The spell checker does not know anything about morphology or sentence-initial
capitalization, so it expects all possible forms of a word (inflected,
capitalized, lowercase, mixed case, etc.) to appear in the list of potential
corrections. The command-line wrapper includes flags to expand a provided
dictionary with lowercase and capitalized versions of all words.
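
When using the Java API directly, a similar expansion can be done before constructing the spell checker. The following is a minimal sketch, not the wrapper's actual implementation: `DictionaryExpansionSketch` and `expand` are hypothetical names, and giving each added variant the probability of its source word is an assumption made for illustration.

```java
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;

class DictionaryExpansionSketch {

    /**
     * Returns a copy of dict with lowercase and/or capitalized variants added.
     * Assumption: a new variant inherits the probability of its source word,
     * and entries already present in the dictionary are never overwritten.
     */
    static Map<String, Double> expand(Map<String, Double> dict,
                                      boolean lowercase, boolean capitalized) {
        Map<String, Double> expanded = new LinkedHashMap<>(dict);
        for (Map.Entry<String, Double> entry : dict.entrySet()) {
            String word = entry.getKey();
            if (word.isEmpty()) {
                continue;
            }
            if (lowercase) {
                expanded.putIfAbsent(word.toLowerCase(Locale.ROOT), entry.getValue());
            }
            if (capitalized) {
                String cap = word.substring(0, 1).toUpperCase(Locale.ROOT) + word.substring(1);
                expanded.putIfAbsent(cap, entry.getValue());
            }
        }
        return expanded;
    }
}
```

For example, a dictionary containing only `the` and `Muslims` would gain `The` and `muslims`.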

Command Line Usage
------------------

### Compile and Package

```
$ mvn package
```

### Run

```
$ java -jar target/brillmoore-0.1-jar-with-dependencies.jar
```

### Usage

```
usage: java -jar brillmoore-0.1-jar-with-dependencies.jar
 -a,--minatoa       minimum a -> a probability (default 0.8)
 -c,--candidates    number of candidates to output (default 10)
 -d,--dict          dictionary file
 -h,--help               this help message
 -l,--lowercase          expand dictionary with lowercase versions of all
                         words
 -p,--train         training file
 -s,--single             add training instances for all single character
                         edits
 -t,--test          testing file
 -u,--capitalized        expand dictionary with capitalized versions of
                         all words
 -w,--window        window for expanding alignments (Brill and
                         Moore's N; default 3)
```

### Data Formats

Tab-separated values are used for input and output.

#### Training/Testing

- counts are optional, assumed to be 1 if no count provided
- the test counts are merely copied into the output for further use

```
misspelling TAB target TAB count
```
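
A minimal sketch of reading one line in this format, with the count defaulting to 1 when the third column is missing. `TrainingLineSketch` is a hypothetical class for illustration and is not the library's `Misspelling` class or its file reader.

```java
class TrainingLineSketch {
    final String misspelling;
    final String target;
    final int count;

    TrainingLineSketch(String misspelling, String target, int count) {
        this.misspelling = misspelling;
        this.target = target;
        this.count = count;
    }

    /** Parses "misspelling TAB target [TAB count]"; the count defaults to 1. */
    static TrainingLineSketch parse(String line) {
        String[] cols = line.split("\t");
        if (cols.length < 2) {
            throw new IllegalArgumentException("expected misspelling TAB target: " + line);
        }
        int count = 1;
        if (cols.length >= 3 && !cols[2].trim().isEmpty()) {
            count = Integer.parseInt(cols[2].trim());
        }
        return new TrainingLineSketch(cols[0], cols[1], count);
    }
}
```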

#### Dictionary

- without probabilities (one word per line, all words equally likely):

```
word
```

- with probabilities:

```
word TAB probability
```
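
Both dictionary variants can be read into the kind of word-to-probability map used in the Java usage example. The sketch below is an illustration only: `DictionaryFileSketch` and `parse` are hypothetical names, and assigning 1.0 to words without a probability is an assumption modeled on the Java usage example rather than a documented default.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class DictionaryFileSketch {

    /**
     * Parses lines of "word" or "word TAB probability" into a map.
     * Assumption: a word without a probability column gets 1.0.
     */
    static Map<String, Double> parse(List<String> lines) {
        Map<String, Double> dict = new LinkedHashMap<>();
        for (String line : lines) {
            if (line.trim().isEmpty()) {
                continue; // skip blank lines
            }
            String[] cols = line.split("\t");
            double prob = cols.length > 1 ? Double.parseDouble(cols[1].trim()) : 1.0;
            dict.put(cols[0], prob);
        }
        return dict;
    }
}
```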

#### Output

The output echoes the test input columns (misspelling, target, count) and
appends the ranked candidate corrections as pairs of columns containing the
candidate correction and the -log(prob) of the candidate.

```
misspelling TAB target TAB count TAB candidate1 TAB -log(prob1) TAB candidate2 TAB -log(prob2) ...
```
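
For further processing, the rank of the target among the candidates can be read straight from an output line. A minimal sketch, assuming the three echoed columns are always present and the candidates start in the fourth column; `OutputLineSketch` and `rankOfTarget` are hypothetical names, not part of this library.

```java
class OutputLineSketch {

    /**
     * Returns the 1-based rank of the target among the candidates on one
     * output line, or -1 if the target is not among them.
     */
    static int rankOfTarget(String line) {
        String[] cols = line.split("\t");
        if (cols.length < 2) {
            throw new IllegalArgumentException("expected at least misspelling TAB target: " + line);
        }
        String target = cols[1];
        // Candidates start at column index 3 and alternate with their scores.
        for (int i = 3; i + 1 < cols.length; i += 2) {
            if (cols[i].equals(target)) {
                return (i - 3) / 2 + 1;
            }
        }
        return -1;
    }
}
```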

### Example

Sample input files based on the [Aspell common misspellings test
data](http://aspell.net/test/common-all/) are provided in `data/`. See
`data/README.md` for details.

```
$ java -jar target/brillmoore-0.1-jar-with-dependencies.jar -d data/aspell-wordlist-en_USGBsGBz.70-1.txt -p data/aspell-common.train -t data/aspell-common.dev.first10 -c 3 > data/aspell-common.dev.first10.USGBsGBz.70-1.out
```

Sample output:

```
pumkin  pumpkin 1       pumpkin 4.38    pumpkin's       6.67    bumkin  7.32
reorganision    reorganisation  1       reorganisation  2.88    reorganisation's        5.20    reorganisations 7.09
gallaxies       galaxies        1       galaxies        4.01    galaxy's        13.26   galaxy  17.45
superceeded     superseded      1       superseded      7.91    supersede       14.46   succeeded       18.34
millenia        millennia       1       millennia       2.11    millennial      6.23    millennial's    8.52
pseudonyn       pseudonym       1       pseudonym       4.69    pseudonym's     6.98    pseudonyms      8.87
synonymns       synonyms        1       synonyms        6.46    synonym's       8.29    synonym 12.49
prominant       prominent       1       predominant     1.76    prominent       2.71    preeminent      10.01
manouver        maneuver        1       maneuver        1.93    manoeuvre       3.76    maneuver's      4.27
obediance       obedience       1       obedience       1.98    obedience's     4.33    obeisance       10.12
```

Evaluation for sample output:

```
$ data/eval.py data/aspell-common.dev.first10.USGBsGBz.70-1.out
```

```
NotFnd	Found	First	1-5	1-10	1-25	1-50	Any (Max: 3)
--------------------------------------------------------------------
0	10	90.0	100.0	100.0	100.0	100.0	100.0
```
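
The `First` and `1-5` columns can be reproduced from the sample output above, where the target is ranked first for nine of the ten misspellings and second for `prominant`. The sketch below is not a port of `data/eval.py`: `TopKSketch` and `percentWithin` are hypothetical names, and taking the percentage over all test items is an assumption about how the script computes its columns.

```java
class TopKSketch {

    /**
     * Percentage of test items whose target has a rank between 1 and k.
     * A rank of -1 marks a target that was not among the candidates.
     */
    static double percentWithin(int[] ranks, int k) {
        if (ranks.length == 0) {
            return 0.0;
        }
        int hits = 0;
        for (int rank : ranks) {
            if (rank >= 1 && rank <= k) {
                hits++;
            }
        }
        return 100.0 * hits / ranks.length;
    }

    public static void main(String[] args) {
        // Ranks of the targets in the ten-line sample output, in file order.
        int[] ranks = {1, 1, 1, 1, 1, 1, 1, 2, 1, 1};
        System.out.println("First: " + percentWithin(ranks, 1));
        System.out.println("1-5:   " + percentWithin(ranks, 5));
    }
}
```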

Evaluation for the whole dev set output in
`data/aspell-common.dev.USGBsGBz.70-1.out` considering the first 100
suggestions:

```
NotFnd	Found	First	1-5	1-10	1-25	1-50	Any (Max: 100)
----------------------------------------------------------------------
18	403	84.1	93.1	94.8	95.5	95.7	95.7
```

(Compare to: )

Evaluation with default parameters, training on all Aspell common misspellings
(`data/aspell-common.all`) and testing on the Aspell current test data
(`data/aspell-current.all`), which focuses on difficult misspellings:

```
NotFnd	Found	First	1-5	1-10	1-25	1-50	Any (Max: 100)
----------------------------------------------------------------------
43	504	56.3	78.4	83.7	88.8	91.2	92.1
```

(Compare to: )

_Note:_ some target corrections aren't found in the provided dictionary due to
capitalization (e.g., `The`, `muslims`) and run-on errors (`incase`). The flags
`-l` and `-u` could be used to expand the base word list with lowercase and
capitalized versions respectively.

Java Usage
----------

```
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
// Misspelling, SpellChecker, Candidate and ParseException must be imported
// as well; their packages are not shown here.

// create a list of pairs of misspellings and corrections
List<Misspelling> trainMisspellings = new ArrayList<>();
trainMisspellings.add(new Misspelling("Abril", "April", 1));

// create a dictionary mapping each word to its probability
Map<String, Double> dict = new HashMap<>();
dict.put("April", 1.0);
dict.put("Arzt", 1.0);
dict.put("Altstadt", 1.0);

// set the parameters
int window = 3;
double minAtoA = 0.8;

try {
    // train spell checker
    SpellChecker spellchecker = new SpellChecker(trainMisspellings, dict, window, minAtoA);

    // run spell checker
    List<Candidate> candidates = spellchecker.getRankedCandidates("Abril");

    // iterate over top ten candidates
    for (Candidate cand : candidates.subList(0, Math.min(candidates.size(), 10))) {
        System.out.println(cand.getTarget() + "\t" + cand.getProb());
    }
} catch (ParseException e) {
    System.err.println(e.getMessage());
}
```

### Output

```
April	1.6094379124341005
Altstadt	Infinity
Arzt	Infinity
```
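
The printed values behave like the -log(prob) scores of the command-line output: smaller is better, and `Infinity` corresponds to a probability of zero. A minimal sketch of converting such a score back to a probability, assuming the natural logarithm is used (the base is not stated here); `ScoreSketch` and `toProbability` are hypothetical names, not part of this library.

```java
class ScoreSketch {

    /**
     * Converts a -log(prob) score to a probability.
     * Assumption: the score uses the natural logarithm.
     */
    static double toProbability(double negLogProb) {
        return Math.exp(-negLogProb);
    }

    public static void main(String[] args) {
        // Score printed for "April" in the output above.
        System.out.println(toProbability(1.6094379124341005));
        // An infinite score maps to probability 0.0.
        System.out.println(toProbability(Double.POSITIVE_INFINITY));
    }
}
```

Under that assumption, the score for `April` corresponds to a probability of roughly 0.2.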

Using Maven
-----------

Install in the local Maven repository:

```
$ mvn install
```

Add the Maven dependency:

```
<dependency>
	<groupId>de.unituebingen.sfs</groupId>
	<artifactId>brillmoore</artifactId>
	<version>0.1</version>
</dependency>
```

Credits
-------

This code includes modified versions of:

- [Trie](https://gist.github.com/rgantt/5711830) by Ryan Gantt ([further documentation](http://code.ryangantt.com/articles/introduction-to-prefix-trees/))

- [Damerau Levenshtein Algorithm](https://github.com/KevinStern/software-and-algorithms/blob/master/src/main/java/blogspot/software_and_algorithms/stern_library/string/DamerauLevenshteinAlgorithm.java) by Kevin L. Stern
