{"id":15518742,"url":"https://github.com/adrianeboyd/brillmoorespellchecker","last_synced_at":"2025-04-23T04:13:36.828Z","repository":{"id":113205396,"uuid":"100255407","full_name":"adrianeboyd/BrillMooreSpellChecker","owner":"adrianeboyd","description":"Spell checker using Brill and Moore's noisy channel error model","archived":false,"fork":false,"pushed_at":"2019-01-09T10:02:58.000Z","size":1405,"stargazers_count":11,"open_issues_count":0,"forks_count":5,"subscribers_count":2,"default_branch":"master","last_synced_at":"2025-04-23T04:13:31.553Z","etag":null,"topics":["java","spellchecker","spelling-correction"],"latest_commit_sha":null,"homepage":"","language":"Java","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/adrianeboyd.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2017-08-14T10:22:04.000Z","updated_at":"2023-06-09T13:03:44.000Z","dependencies_parsed_at":null,"dependency_job_id":"39e3d047-781e-48fc-93eb-33ddedf7e560","html_url":"https://github.com/adrianeboyd/BrillMooreSpellChecker","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/adrianeboyd%2FBrillMooreSpellChecker","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/adrianeboyd%2FBrillMooreSpellChecker/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/adrianeboyd%2FBrillMooreSpellChecker/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/adrianeboyd%2FBril
lMooreSpellChecker/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/adrianeboyd","download_url":"https://codeload.github.com/adrianeboyd/BrillMooreSpellChecker/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":250366717,"owners_count":21418772,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["java","spellchecker","spelling-correction"],"created_at":"2024-10-02T10:19:13.473Z","updated_at":"2025-04-23T04:13:36.819Z","avatar_url":"https://github.com/adrianeboyd.png","language":"Java","funding_links":[],"categories":[],"sub_categories":[],"readme":"Brill and Moore Noisy Channel Spelling Correction\n=================================================\n\nThis is a Java implementation of the noisy channel spell checking approach\npresented in:\n\nBrill and Moore (2000). [An Improved Error Model for Noisy Channel Spelling\nCorrection](http://www.aclweb.org/anthology/P00-1037). 
In _Proceedings of the\nACL 2000_.\n\nThe spell checker's error model is trained on a list of pairs of misspellings\nwith corrections, considering generic character edits up to a specified maximum\nedit length (e.g., the edit `ant`\u0026rarr;`ent` from the pair\n`dependant`\u0026rarr;`dependent`).\n\nTo use this spell checker you need:\n\n- a list of misspellings with corrections\n- a list of potential corrections (i.e., a dictionary of real words)\n\nThe spell checker does not know anything about morphology or sentence-initial\ncapitalization, so it expects all possible forms of a word (inflected,\ncapitalized, lowercase, mixed case, etc.) to appear in the list of potential\ncorrections. The command-line wrapper includes flags to expand a provided\ndictionary with lowercase and capitalized versions of all words.\n\nCommand Line Usage\n------------------\n\n### Compile and Package\n\n```\n$ mvn package\n```\n\n### Run\n\n```\n$ java -jar target/brillmoore-0.1-jar-with-dependencies.jar\n```\n\n### Usage\n\n```\nusage: java -jar brillmoore-0.1-jar-with-dependencies.jar\n -a,--minatoa \u003carg\u003e      minimum a -\u003e a probability (default 0.8)\n -c,--candidates \u003carg\u003e   number of candidates to output (default 10)\n -d,--dict \u003carg\u003e         dictionary file\n -h,--help               this help message\n -l,--lowercase          expand dictionary with lowercase versions of all\n                         words\n -p,--train \u003carg\u003e        training file\n -s,--single             add training instances for all single character\n                         edits\n -t,--test \u003carg\u003e         testing file\n -u,--capitalized        expand dictionary with capitalized versions of\n                         all words\n -w,--window \u003carg\u003e       window for expanding alignments (Brill and\n                         Moore's N; default 3)\n```\n\n### Data Formats\n\nTab-separated values are used for input and output.\n\n#### Training/Testing\n\n- 
counts are optional, assumed to be 1 if no count provided\n- the test counts are merely copied into the output for further use\n\n```\nmisspelling TAB target TAB count\n```\n\n#### Dictionary\n\n- without probabilities (one word per line, all words equally likely):\n\n```\nword\n```\n\n- with probabilities:\n\n```\nword TAB probability\n```\n\n#### Output\n\nThe output echoes the test input columns (misspelling, target, count) and\nappends the ranked candidate corrections as pairs of columns containing the\ncandidate correction and the -log(prob) of the candidate.\n\n```\nmisspelling TAB target TAB count TAB candidate1 TAB -log(prob1) TAB candidate2 TAB -log(prob2) ...\n```\n\n### Example\n\nSample input files based on the [Aspell common misspellings test\ndata](http://aspell.net/test/common-all/) are provided in `data/`. See\n`data/README.md` for details.\n\n```\n$ java -jar target/brillmoore-0.1-jar-with-dependencies.jar -d data/aspell-wordlist-en_USGBsGBz.70-1.txt -p data/aspell-common.train -t data/aspell-common.dev.first10 -c 3 \u003e data/aspell-common.dev.first10.USGBsGBz.70-1.out\n```\n\nSample output:\n\n```\npumkin  pumpkin 1       pumpkin 4.38    pumpkin's       6.67    bumkin  7.32\nreorganision    reorganisation  1       reorganisation  2.88    reorganisation's        5.20    reorganisations 7.09\ngallaxies       galaxies        1       galaxies        4.01    galaxy's        13.26   galaxy  17.45\nsuperceeded     superseded      1       superseded      7.91    supersede       14.46   succeeded       18.34\nmillenia        millennia       1       millennia       2.11    millennial      6.23    millennial's    8.52\npseudonyn       pseudonym       1       pseudonym       4.69    pseudonym's     6.98    pseudonyms      8.87\nsynonymns       synonyms        1       synonyms        6.46    synonym's       8.29    synonym 12.49\nprominant       prominent       1       predominant     1.76    prominent       2.71    preeminent      10.01\nmanouver        
maneuver        1       maneuver        1.93    manoeuvre       3.76    maneuver's      4.27\nobediance       obedience       1       obedience       1.98    obedience's     4.33    obeisance       10.12\n```\n\nEvaluation for sample output:\n\n```\n$ data/eval.py data/aspell-common.dev.first10.USGBsGBz.70-1.out\n```\n\n```\nNotFnd\tFound\tFirst\t1-5\t1-10\t1-25\t1-50\tAny (Max: 3)\n--------------------------------------------------------------------\n0\t10\t90.0\t100.0\t100.0\t100.0\t100.0\t100.0\n```\n\nEvaluation for the whole dev set output in\n`data/aspell-common.dev.USGBsGBz.70-1.out` considering the first 100\nsuggestions:\n\n```\nNotFnd\tFound\tFirst\t1-5\t1-10\t1-25\t1-50\tAny (Max: 100)\n----------------------------------------------------------------------\n18\t403\t84.1\t93.1\t94.8\t95.5\t95.7\t95.7\n```\n\n(Compare to: \u003chttp://aspell.net/test/common-all/\u003e)\n\nEvaluation with default parameters, training on all Aspell common misspellings\n(`data/aspell-common.all`) and testing on Aspell current test data\n(`data/aspell-current.all`), which focuses on difficult misspellings:\n\n```\nNotFnd\tFound\tFirst\t1-5\t1-10\t1-25\t1-50\tAny (Max: 100)\n----------------------------------------------------------------------\n43\t504\t56.3\t78.4\t83.7\t88.8\t91.2\t92.1\n```\n\n(Compare to: \u003chttp://aspell.net/test/cur/\u003e)\n\n_Note:_ some target corrections aren't found in the provided dictionary due to\ncapitalization (e.g., `The`, `muslims`) and run-on errors (`incase`). 
The flags\n`-l` and `-u` could be used to expand the base word list with lowercase and\ncapitalized versions respectively.\n\nJava Usage\n----------\n\n```\n// create a list of pairs of misspellings and corrections\nList\u003cMisspelling\u003e trainMisspellings = new ArrayList\u003c\u003e();\ntrainMisspellings.add(new Misspelling(\"Abril\", \"April\", 1));\n\n// create a dictionary\nMap\u003cString, Double\u003e dict = new HashMap\u003c\u003e();\ndict.put(\"April\", 1.0);\ndict.put(\"Arzt\", 1.0);\ndict.put(\"Altstadt\", 1.0);\n\n// set the parameters\nint window = 3;\ndouble minAtoA = 0.8;\n\ntry {\n    // train spell checker\n    SpellChecker spellchecker = new SpellChecker(trainMisspellings, dict, window, minAtoA);\n\n    // run spell checker\n    List\u003cCandidate\u003e candidates = spellchecker.getRankedCandidates(\"Abril\");\n\n    // iterate over top ten candidates\n    for (Candidate cand : candidates.subList(0, Math.min(candidates.size(), 10))) {\n        System.out.println(cand.getTarget() + \"\\t\" + cand.getProb());\n    }\n} catch (ParseException e) {\n    System.err.println(e.getMessage());\n}\n\n```\n\n### Output\n\n```\nApril\t1.6094379124341005\nAltstadt\tInfinity\nArzt\tInfinity\n```\n\nUsing Maven\n-----------\n\nInstall in the local maven archive:\n\n```\n$ mvn install\n```\n\nAdd the maven dependency:\n\n```\n\u003cdependency\u003e\n\t\u003cgroupId\u003ede.unituebingen.sfs\u003c/groupId\u003e\n\t\u003cartifactId\u003ebrillmoore\u003c/artifactId\u003e\n\t\u003cversion\u003e0.1\u003c/version\u003e\n\u003c/dependency\u003e\n```\n\nCredits\n-------\n\nThis code includes modified versions of:\n\n- [Trie](https://gist.github.com/rgantt/5711830) by Ryan Gantt ([further documentation](http://code.ryangantt.com/articles/introduction-to-prefix-trees/))\n- [Damerau Levenshtein Algorithm](https://github.com/KevinStern/software-and-algorithms/blob/master/src/main/java/blogspot/software_and_algorithms/stern_library/string/DamerauLevenshteinAlgorithm.java) 
by Kevin L. Stern\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fadrianeboyd%2Fbrillmoorespellchecker","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fadrianeboyd%2Fbrillmoorespellchecker","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fadrianeboyd%2Fbrillmoorespellchecker/lists"}