{"id":13856977,"url":"https://github.com/clips/clinspell","last_synced_at":"2025-12-30T01:04:41.124Z","repository":{"id":20969053,"uuid":"91394796","full_name":"clips/clinspell","owner":"clips","description":"Clinical spelling correction with word and character n-gram embeddings.","archived":false,"fork":false,"pushed_at":"2022-06-21T21:14:12.000Z","size":2688,"stargazers_count":74,"open_issues_count":5,"forks_count":16,"subscribers_count":8,"default_branch":"master","last_synced_at":"2024-08-06T03:02:49.610Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/clips.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2017-05-15T23:54:04.000Z","updated_at":"2023-10-22T21:39:29.000Z","dependencies_parsed_at":"2022-09-11T07:52:15.597Z","dependency_job_id":null,"html_url":"https://github.com/clips/clinspell","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/clips%2Fclinspell","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/clips%2Fclinspell/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/clips%2Fclinspell/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/clips%2Fclinspell/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/clips","download_url":"https://codeload.github.com/clips/clinspell/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":225912457,"owners_co
unt":17544179,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2024-08-05T03:01:21.079Z","updated_at":"2025-12-30T01:04:41.090Z","avatar_url":"https://github.com/clips.png","language":"Python","funding_links":[],"categories":["Python"],"sub_categories":[],"readme":"This repository contains source code for the paper ['Unsupervised Context-Sensitive Spelling Correction of English and Dutch Clinical Free-Text with Word and Character N-gram Embeddings'](http://www.clinjournal.org/sites/clinjournal.org/files/03.unsupervised-context-sensitive_0.pdf), published in Volume 7 of [CLIN Journal](http://www.clinjournal.org/biblio/volume). A shorter paper, which focuses exclusively on our English experiments, was presented at the BioNLP 2017 workshop at ACL: ['Unsupervised Context-Sensitive Spelling Correction of Clinical Free-Text with Word and Character N-gram Embeddings.'](http://www.aclweb.org/anthology/W17-2317) The source code offered here contains scripts to extract our manually annotated MIMIC-III data and to\nrun the experiments described in our paper.\n\n# License\n\nMIT\n\n# Requirements\n\n* Python 3\n* Python 2.7 (used to run extract_test.py)\n* NumPy\n* [pyxdameraulevenshtein](https://github.com/gfairchild/pyxDamerauLevenshtein)\n* [Facebook fastText](https://github.com/facebookresearch/fastText)\n* [fasttext](https://github.com/salestock/fastText.py), a Python interface for Facebook fastText\n\nAll packages except ```fastText``` are available from pip. 
To install these requirements, run\n\n```pip install -r requirements.txt```\n\nfrom inside the cloned repository.\n\nTo build ```fastText```, run:\n\n```\n$ git clone https://github.com/facebookresearch/fastText.git\n$ cd fastText\n$ make\n```\n\nTo extract our manually annotated MIMIC-III test data, you need access to the [MIMIC-III database](https://mimic.physionet.org).\nYou need MIMIC-III v1.3 specifically: our extraction script only works with this version.\n\n# Usage\n\n## Demo\n\nTo demo the context-sensitive spelling correction model with the best parameters from the experiments,\ngo to the **demo** directory and follow the instructions in the README.\n\n## Extracting the English test data\n\nTo extract the annotated test data, run\n\n```python2.7 extract_test.py [path to NOTEEVENTS.csv file from the MIMIC-III v1.3 database]```\n\nThis script preprocesses the **NOTEEVENTS.csv** data and stores the preprocessed data in the file **mimic_preprocessed.txt**. It then extracts the annotated \ntest data, which is stored in the file **testcorpus.json** as four lists: correct replacements, misspellings, misspelling contexts, and line indices.\n\n## Extracting development data and other resources\n\n### Preprocessing\n\nTo generate development corpora as described in the paper, the data must first be preprocessed. To preprocess English data, run\n\n```python3 preprocess.py [path to raw data] [path to created preprocessed data]```\n\nThis script uses the source code of the English tokenizer from [Pattern](https://github.com/clips/pattern). 
\n\nTo preprocess Dutch data, you can use the [Ucto](https://languagemachines.github.io/ucto/) tokenizer and, for every line, retain every token that matches\n\n```r'(^[^\\d\\W])[^\\d\\W]*(-[^\\d\\W]*)*([^\\d\\W]$)'```\n\n### Generating frequency lists and neural embeddings\n\nTo extract a frequency list from the preprocessed data, run\n\n```python3 frequencies.py [path to preprocessed data] [language]```\n\nThe [language] argument must be **en** for English or **nl** for Dutch.\n\nTo train the fastText vectors as we do, place the preprocessed data in the cloned fastText directory and run\n\n```./fasttext skipgram -input [path to preprocessed data] -output ../data/embeddings_[language] -dim 300```\n\nThis creates the files embeddings_[language].vec and embeddings_[language].bin in the **data** directory.\nOnly the embeddings_[language].bin file is used by the code.\n\n### Generating development corpora\n\nTo create a development corpus from preprocessed data, run\n\n```python3 make_devcorpus.py [path to preprocessed data] [language] [path to created devcorpus] [window size] [allow oov] [samplesize]```\n\nThe [window size] argument specifies the minimum token window size on each side of a generated development instance.\nThe [allow oov] argument should be False for development setup 1 or 2 from the paper, and True for development setup 3. 
\nThe [samplesize] argument is the number of lines to sample from the data.\n\n## Conducting experiments\n\n### Generating candidates\n\nTo generate candidates for a created development corpus, run\n\n```python3 candidates.py [path to preprocessed data] 2 [name of output] [language]```\n\nTo generate candidates for our extracted test data or other empirically observed data, run\n\n```python3 candidates.py [path to preprocessed data] all [name of output] [language]```\n\n### Ranking experiments\n\nThe ```Development``` class in **ranking_experiments.py** contains all functions to conduct the experiments. \n\nExample:\n\n```\nimport json\n\nfrom ranking_experiments import Development\n\n# load devcorpus for setup 1, 2 and 3\n\nwith open('devcorpus_setup1.json', 'r') as f:\n    corpusfiles_setup1 = json.load(f)\ndevcorpus_setup1 = corpusfiles_setup1[:3]\n\nwith open('devcorpus_setup2.json', 'r') as f:\n    corpusfiles_setup2 = json.load(f)\ndevcorpus_setup2 = corpusfiles_setup2[:3]\n\nwith open('devcorpus_setup3.json', 'r') as f:\n    corpusfiles_setup3 = json.load(f)\ndevcorpus_setup3 = corpusfiles_setup3[:3]\n\n# load candidates for setup 1, 2 and 3\nwith open('candidates_devcorpus_setup1.json', 'r') as f:\n    candidates_setup1 = json.load(f)\nwith open('candidates_devcorpus_setup2.json', 'r') as f:\n    candidates_setup2 = json.load(f)\nwith open('candidates_devcorpus_setup3.json', 'r') as f:\n    candidates_setup3 = json.load(f)\n\n# perform grid search\nscores_setup1 = Development.grid_search(devcorpus_setup1, candidates_setup1, language='en')\nscores_setup2 = Development.grid_search(devcorpus_setup2, candidates_setup2, language='en')\n\n# search for best averaged parameters\nbest_parameters = Development.define_best_parameters(iv=[scores_setup1, scores_setup2])\n\n# perform grid search for oov penalty\noov_scores_setup1 = Development.tune_oov(devcorpus_setup1, candidates_setup1, best_parameters, language='en')\noov_scores_setup2 = 
Development.tune_oov(devcorpus_setup2, candidates_setup2, best_parameters, language='en')\noov_scores_setup3 = Development.tune_oov(devcorpus_setup3, candidates_setup3, best_parameters, language='en')\n\n# search for best averaged oov penalty\nbest_oov = Development.define_best_parameters(iv=[oov_scores_setup1, oov_scores_setup2], oov=oov_scores_setup3)\n\n# store best parameters\nbest_parameters['oov_penalty'] = best_oov\nwith open('parameters.json', 'w') as f:\n    json.dump(best_parameters, f)\n\n# conduct ranking experiments with best parameters on test data\n\nwith open('testcorpus.json', 'r') as f:\n    testfiles = json.load(f)\ntestcorpus = [testfiles[0], testfiles[1], testfiles[2]]\n\nwith open('testcandidates.json', 'r') as f:\n    testcandidates = json.load(f)\n\n# ranking experiment and analysis per frequency scenario for our context-sensitive model, noisy channel model, and majority frequency\n\nbest_parameters['ranking_method'] = 'context'\ndev = Development(best_parameters, language='en')\naccuracy_context, correction_list_context = dev.conduct_experiment(testcorpus, testcandidates)\nfrequency_analysis_context = dev.frequency_analysis()\n\nbest_parameters['ranking_method'] = 'noisy_channel'\ndev = Development(best_parameters, language='en')\naccuracy_noisychannel, correction_list_noisychannel = dev.conduct_experiment(testcorpus, testcandidates)\nfrequency_analysis_noisychannel = dev.frequency_analysis()\n\nbest_parameters['ranking_method'] = 'frequency'\ndev = Development(best_parameters, language='en')\naccuracy_frequency, correction_list_frequency = dev.conduct_experiment(testcorpus, testcandidates)\n```\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fclips%2Fclinspell","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fclips%2Fclinspell","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fclips%2Fclinspell/lists"}