https://github.com/adrianeboyd/brillmoorespellchecker
Spell checker using Brill and Moore's noisy channel error model
- Host: GitHub
- URL: https://github.com/adrianeboyd/brillmoorespellchecker
- Owner: adrianeboyd
- License: apache-2.0
- Created: 2017-08-14T10:22:04.000Z (about 8 years ago)
- Default Branch: master
- Last Pushed: 2019-01-09T10:02:58.000Z (almost 7 years ago)
- Last Synced: 2025-04-23T04:13:31.553Z (6 months ago)
- Topics: java, spellchecker, spelling-correction
- Language: Java
- Homepage:
- Size: 1.34 MB
- Stars: 11
- Watchers: 2
- Forks: 5
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
Brill and Moore Noisy Channel Spelling Correction
=================================================

This is a Java implementation of the noisy channel spell checking approach
presented in:

Brill and Moore (2000). [An Improved Error Model for Noisy Channel Spelling
Correction](http://www.aclweb.org/anthology/P00-1037). In _Proceedings of the
ACL 2000_.

The spell checker's error model is trained on a list of pairs of misspellings
with corrections, considering generic character edits up to a specified maximum
edit length (e.g., the edit `ant`→`ent` from the pair
`dependant`→`dependent`).

To use this spell checker you need:
- a list of misspellings with corrections
- a list of potential corrections (i.e., a dictionary of real words)

The spell checker does not know anything about morphology or sentence-initial
capitalization, so it expects all possible forms of a word (inflected,
capitalized, lowercase, mixed case, etc.) to appear in the list of potential
corrections. The command-line wrapper includes flags to expand a provided
dictionary with lowercase and capitalized versions of all words.

Command Line Usage
------------------

### Compile and Package
```
$ mvn package
```

### Run
```
$ java -jar target/brillmoore-0.1-jar-with-dependencies.jar
```

### Usage
```
usage: java -jar brillmoore-0.1-jar-with-dependencies.jar
 -a,--minatoa       minimum a -> a probability (default 0.8)
 -c,--candidates    number of candidates to output (default 10)
 -d,--dict          dictionary file
 -h,--help          this help message
 -l,--lowercase     expand dictionary with lowercase versions of all
                    words
 -p,--train         training file
 -s,--single        add training instances for all single character
                    edits
 -t,--test          testing file
 -u,--capitalized   expand dictionary with capitalized versions of
                    all words
 -w,--window        window for expanding alignments (Brill and
                    Moore's N; default 3)
```

### Data Formats
Tab-separated values are used for input and output.

#### Training/Testing
- counts are optional, assumed to be 1 if no count provided
- the test counts are merely copied into the output for further use

```
misspelling TAB target TAB count
```

#### Dictionary
- without probabilities (one word per line, all words equally likely):

```
word
```

- with probabilities:

```
word TAB probability
```

#### Output
The output echoes the test input columns (misspelling, target, count) and
appends the ranked candidate corrections as pairs of columns containing the
candidate correction and the -log(prob) of the candidate.

```
misspelling TAB target TAB count TAB candidate1 TAB -log(prob1) TAB candidate2 TAB -log(prob2) ...
```

### Example
Sample input files based on the [Aspell common misspellings test
data](http://aspell.net/test/common-all/) are provided in `data/`. See
`data/README.md` for details.

```
$ java -jar target/brillmoore-0.1-jar-with-dependencies.jar -d data/aspell-wordlist-en_USGBsGBz.70-1.txt -p data/aspell-common.train -t data/aspell-common.dev.first10 -c 3 > data/aspell-common.dev.first10.USGBsGBz.70-1.out
```

Sample output:
```
pumkin pumpkin 1 pumpkin 4.38 pumpkin's 6.67 bumkin 7.32
reorganision reorganisation 1 reorganisation 2.88 reorganisation's 5.20 reorganisations 7.09
gallaxies galaxies 1 galaxies 4.01 galaxy's 13.26 galaxy 17.45
superceeded superseded 1 superseded 7.91 supersede 14.46 succeeded 18.34
millenia millennia 1 millennia 2.11 millennial 6.23 millennial's 8.52
pseudonyn pseudonym 1 pseudonym 4.69 pseudonym's 6.98 pseudonyms 8.87
synonymns synonyms 1 synonyms 6.46 synonym's 8.29 synonym 12.49
prominant prominent 1 predominant 1.76 prominent 2.71 preeminent 10.01
manouver maneuver 1 maneuver 1.93 manoeuvre 3.76 maneuver's 4.27
obediance obedience 1 obedience 1.98 obedience's 4.33 obeisance 10.12
```

Evaluation for sample output:
```
$ data/eval.py data/aspell-common.dev.first10.USGBsGBz.70-1.out
```

```
NotFnd  Found  First    1-5   1-10   1-25   1-50    Any (Max: 3)
--------------------------------------------------------------------
     0     10   90.0  100.0  100.0  100.0  100.0  100.0
```

Evaluation for the whole dev set output in
`data/aspell-common.dev.USGBsGBz.70-1.out` considering the first 100
suggestions:

```
NotFnd  Found  First    1-5   1-10   1-25   1-50    Any (Max: 100)
----------------------------------------------------------------------
    18    403   84.1   93.1   94.8   95.5   95.7   95.7
```

(Compare to: )

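
To make the evaluation columns concrete, here is a minimal sketch of how rank-based scores could be derived from the output format described under Data Formats. It is not the project's `data/eval.py`: the class and method names are hypothetical, and it assumes every output line carries all three leading columns (misspelling, target, count). Judging from the figures above (403 of 421 test items found, 95.7 in the `Any` column), the percentages appear to be taken over all test items, which is what the sketch does.

```java
import java.util.ArrayList;
import java.util.List;

public class RankEval {
    /** Returns the 1-based rank of the target among the candidates on one output line, or 0 if absent. */
    static int rankOfTarget(String line) {
        String[] cols = line.split("\t");
        String target = cols[1];
        // Columns 0-2 hold misspelling, target, count; candidates sit at 3, 5, 7, ...
        // with their -log(prob) scores in between.
        for (int i = 3; i < cols.length; i += 2) {
            if (cols[i].equals(target)) {
                return (i - 3) / 2 + 1;
            }
        }
        return 0;
    }

    /** Percentage of lines whose target is ranked within the top k candidates. */
    static double percentWithin(List<String> lines, int k) {
        int hits = 0;
        for (String line : lines) {
            int rank = rankOfTarget(line);
            if (rank >= 1 && rank <= k) {
                hits++;
            }
        }
        return 100.0 * hits / lines.size();
    }

    public static void main(String[] args) {
        // Two lines taken from the sample output, with explicit tabs.
        List<String> lines = new ArrayList<>();
        lines.add("pumkin\tpumpkin\t1\tpumpkin\t4.38\tpumpkin's\t6.67\tbumkin\t7.32");
        lines.add("prominant\tprominent\t1\tpredominant\t1.76\tprominent\t2.71\tpreeminent\t10.01");
        System.out.println(percentWithin(lines, 1)); // prints 50.0
        System.out.println(percentWithin(lines, 5)); // prints 100.0
    }
}
```

In the sample output, `prominant` is the one item whose target is ranked second rather than first, which matches the 90.0 in the `First` column for the ten-line sample.
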
Evaluation with default parameters training on all Aspell common misspellings
(`data/aspell-common.all`) and testing on Aspell current test data
(`data/aspell-current.all`), which focuses on difficult misspellings:

```
NotFnd  Found  First    1-5   1-10   1-25   1-50    Any (Max: 100)
----------------------------------------------------------------------
    43    504   56.3   78.4   83.7   88.8   91.2   92.1
```

(Compare to: )

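
The `-l` and `-u` flags listed under Usage expand the dictionary with lowercase and capitalized versions of all words. The sketch below illustrates what such an expansion could look like; it is not the project's implementation. The class and method names are hypothetical, "capitalized" is assumed to mean an uppercased first character, and each new variant is assumed to inherit the probability of the word it was derived from.

```java
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;

public class DictExpansion {
    /**
     * Returns a copy of dict with lowercase and/or capitalized variants added.
     * Assumption: a new variant inherits the probability of its source word,
     * and entries already present are left unchanged.
     */
    static Map<String, Double> expand(Map<String, Double> dict,
                                      boolean lowercase, boolean capitalized) {
        Map<String, Double> expanded = new LinkedHashMap<>(dict);
        for (Map.Entry<String, Double> entry : dict.entrySet()) {
            String word = entry.getKey();
            if (word.isEmpty()) {
                continue;
            }
            if (lowercase) {
                expanded.putIfAbsent(word.toLowerCase(Locale.ROOT), entry.getValue());
            }
            if (capitalized) {
                String cap = word.substring(0, 1).toUpperCase(Locale.ROOT) + word.substring(1);
                expanded.putIfAbsent(cap, entry.getValue());
            }
        }
        return expanded;
    }

    public static void main(String[] args) {
        Map<String, Double> dict = new LinkedHashMap<>();
        dict.put("the", 1.0);
        dict.put("Muslims", 1.0);
        System.out.println(expand(dict, true, true).keySet());
        // prints [the, Muslims, The, muslims]
    }
}
```
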
_Note:_ some target corrections aren't found in the provided dictionary due to
capitalization (e.g., `The`, `muslims`) and run-on errors (`incase`). The flags
`-l` and `-u` could be used to expand the base word list with lowercase and
capitalized versions respectively.

Java Usage
----------

```
// assumes ArrayList, HashMap, List and Map from java.util plus the project's
// Misspelling, SpellChecker and Candidate classes are imported

// create a list of pairs of misspellings and corrections
List<Misspelling> trainMisspellings = new ArrayList<>();
trainMisspellings.add(new Misspelling("Abril", "April", 1));

// create a dictionary
Map<String, Double> dict = new HashMap<>();
dict.put("April", 1.0);
dict.put("Arzt", 1.0);
dict.put("Altstadt", 1.0);

// set the parameters
int window = 3;
double minAtoA = 0.8;

try {
    // train spell checker
    SpellChecker spellchecker = new SpellChecker(trainMisspellings, dict, window, minAtoA);

    // run spell checker
    List<Candidate> candidates = spellchecker.getRankedCandidates("Abril");

    // iterate over top ten candidates
    for (Candidate cand : candidates.subList(0, Math.min(candidates.size(), 10))) {
        System.out.println(cand.getTarget() + "\t" + cand.getProb());
    }
} catch (ParseException e) {
    System.err.println(e.getMessage());
}
```
### Output
```
April 1.6094379124341005
Altstadt Infinity
Arzt Infinity
```

Using Maven
-----------

Install in the local Maven repository:
```
$ mvn install
```

Add the Maven dependency:
```
<dependency>
  <groupId>de.unituebingen.sfs</groupId>
  <artifactId>brillmoore</artifactId>
  <version>0.1</version>
</dependency>
```

Credits
-------

This code includes modified versions of:
- [Trie](https://gist.github.com/rgantt/5711830) by Ryan Gantt ([further documentation](http://code.ryangantt.com/articles/introduction-to-prefix-trees/))
- [Damerau Levenshtein Algorithm](https://github.com/KevinStern/software-and-algorithms/blob/master/src/main/java/blogspot/software_and_algorithms/stern_library/string/DamerauLevenshteinAlgorithm.java) by Kevin L. Stern