Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://github.com/clips/clinspell
Clinical spelling correction with word and character n-gram embeddings.
Last synced: 3 months ago
- Host: GitHub
- URL: https://github.com/clips/clinspell
- Owner: clips
- Created: 2017-05-15T23:54:04.000Z (over 7 years ago)
- Default Branch: master
- Last Pushed: 2022-06-21T21:14:12.000Z (over 2 years ago)
- Last Synced: 2024-05-21T00:49:48.960Z (6 months ago)
- Language: Python
- Homepage:
- Size: 2.56 MB
- Stars: 74
- Watchers: 8
- Forks: 16
- Open Issues: 5
Metadata Files:
- Readme: README.md
README
This repository contains source code for the paper ['Unsupervised Context-Sensitive Spelling Correction of English and Dutch Clinical Free-Text with Word and Character N-gram Embeddings'](http://www.clinjournal.org/sites/clinjournal.org/files/03.unsupervised-context-sensitive_0.pdf), which is published in Volume 7 of [CLIN Journal](http://www.clinjournal.org/biblio/volume). A shorter paper, which focuses exclusively on our English experiments, was presented at the BioNLP 2017 workshop at ACL: ['Unsupervised Context-Sensitive Spelling Correction of Clinical Free-Text with Word and Character N-gram Embeddings.'](http://www.aclweb.org/anthology/W17-2317) The source code offered here contains scripts to extract our manually annotated MIMIC-III data, and to
run the experiments described in our paper.

# License
MIT
# Requirements
* Python 3
* Python 2.7 (used by the MIMIC-III extraction script)
* Numpy
* [pyxdameraulevenshtein](https://github.com/gfairchild/pyxDamerauLevenshtein)
* [Facebook fastText](https://github.com/facebookresearch/fastText)
* [fasttext](https://github.com/salestock/fastText.py), a Python interface for Facebook fastText

All packages are available from pip, except ```fastText```. To install these requirements, just run
```pip install -r requirements.txt```
from inside the cloned repository.
In order to build ```fastText```, use the following:
```
$ git clone https://github.com/facebookresearch/fastText.git
$ cd fastText
$ make
```

To extract our manually annotated MIMIC-III test data, you should have access to the [MIMIC-III database](https://mimic.physionet.org).
It's important that this is specifically MIMIC-III v1.3: our extraction script only works for this version.

# Usage
## Demo
To demo the context-sensitive spelling correction model with the best parameters from the experiments, go to the **demo** directory and follow the instructions in the README.

## Extracting the English test data
To extract the annotated test data, run
```python2.7 extract_test.py [path to NOTEEVENTS.csv file from the MIMIC-III v1.3 database]```
This script preprocesses the **NOTEEVENTS.csv** data and stores the preprocessed data in the file **mimic_preprocessed.txt**. It then extracts the annotated test data, which is stored in the file **testcorpus.json** as four lists: correct replacements, misspellings, misspelling contexts, and line indices.
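As a minimal sketch (assuming the JSON file stores the four lists in the order given above), the extracted test corpus can be loaded like this:
```
import json

# testcorpus.json holds four lists: correct replacements, misspellings,
# misspelling contexts, and line indices (order as described above)
with open('testcorpus.json', 'r') as f:
    corrections, misspellings, contexts, line_indices = json.load(f)

print(misspellings[0], '->', corrections[0])
```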
## Extracting development data and other resources

### Preprocessing
To generate development corpora as described in the paper, the data has to be preprocessed. To preprocess English data, run
```python3 preprocess.py [path to raw data] [path to created preprocessed data]```
This script uses the source code of the English tokenizer from [Pattern](https://github.com/clips/pattern).
To preprocess Dutch data, you can use the [Ucto](https://languagemachines.github.io/ucto/) tokenizer and, for every line, retain every token which matches
```r'(^[^\d\W])[^\d\W]*(-[^\d\W]*)*([^\d\W]$)'```
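A minimal sketch of that filtering step (assuming one whitespace-tokenized sentence per line; the function name is illustrative):
```
import re

# tokens made of word characters only (no digits), optionally hyphenated,
# and at least two characters long
TOKEN_PATTERN = re.compile(r'(^[^\d\W])[^\d\W]*(-[^\d\W]*)*([^\d\W]$)')

def filter_line(line):
    """Retain only the tokens that match the pattern above."""
    return ' '.join(tok for tok in line.split() if TOKEN_PATTERN.match(tok))
```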
### Generating frequency lists and neural embeddings
To extract a frequency list from the preprocessed data, run
```python3 frequencies.py [path to preprocessed data] [language]```
The [language] argument should be **en** if the language is English or **nl** if it is Dutch.
To train the fastText vectors as we do, place the preprocessed data in the cloned fastText directory and run
```./fasttext skipgram -input [path to preprocessed data] -output ../data/embeddings_[language] -dim 300```
This creates an embeddings_[language].vec and an embeddings_[language].bin file in the **data** directory.
Only the embeddings_[language].bin file is used by the code.
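For reference, a hedged sketch of loading the trained binary model with the [fasttext](https://github.com/salestock/fastText.py) Python interface listed above (the path and lookup word are illustrative):
```
import fasttext

# load the binary skipgram model trained above
model = fasttext.load_model('data/embeddings_en.bin')

# look up a 300-dimensional vector; thanks to character n-grams, fastText
# can also produce vectors for out-of-vocabulary (misspelled) words
vector = model['patient']
```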
### Generating development corpora

To create a development corpus from preprocessed data, run
```python3 make_devcorpus.py [path to preprocessed data] [language] [path to created devcorpus] [window size] [allow oov] [samplesize]```
The [window size] argument specifies the minimal token window size on each side of a generated development instance.
The [allow oov] argument should be False for development setup 1 or 2 from the paper, and True for development setup 3.
The [samplesize] argument should contain the number of lines to sample from the data.
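For example, a hypothetical invocation for development setup 1 (all argument values are illustrative):
```python3 make_devcorpus.py mimic_preprocessed.txt en devcorpus_setup1.json 2 False 5000```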
## Conducting experiments

### Generating candidates
To generate candidates for a created development corpus, run
```python3 candidates.py [path to preprocessed data] 2 [name of output] [language]```
To generate candidates for our extracted test data or other empirically observed data, run
```python3 candidates.py [path to preprocessed data] all [name of output] [language]```
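For instance, hypothetical invocations matching the file names used in the example below (paths and output names are illustrative):
```python3 candidates.py mimic_preprocessed.txt 2 candidates_devcorpus_setup1 en```
```python3 candidates.py mimic_preprocessed.txt all testcandidates en```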
### Ranking experiments
The ```Development``` class in **ranking_experiments.py** contains all functions to conduct the experiments.
Example:
```
import json
from ranking_experiments import Development

# load devcorpus for setups 1, 2 and 3
with open('devcorpus_setup1.json', 'r') as f:
    corpusfiles_setup1 = json.load(f)
devcorpus_setup1 = corpusfiles_setup1[:3]

with open('devcorpus_setup2.json', 'r') as f:
    corpusfiles_setup2 = json.load(f)
devcorpus_setup2 = corpusfiles_setup2[:3]

with open('devcorpus_setup3.json', 'r') as f:
    corpusfiles_setup3 = json.load(f)
devcorpus_setup3 = corpusfiles_setup3[:3]

# load candidates for setups 1, 2 and 3
with open('candidates_devcorpus_setup1.json', 'r') as f:
    candidates_setup1 = json.load(f)
with open('candidates_devcorpus_setup2.json', 'r') as f:
    candidates_setup2 = json.load(f)
with open('candidates_devcorpus_setup3.json', 'r') as f:
    candidates_setup3 = json.load(f)

# perform grid search
scores_setup1 = Development.grid_search(devcorpus_setup1, candidates_setup1, language='en')
scores_setup2 = Development.grid_search(devcorpus_setup2, candidates_setup2, language='en')

# search for best averaged parameters
best_parameters = Development.define_best_parameters(iv=[scores_setup1, scores_setup2])

# perform grid search for oov penalty
oov_scores_setup1 = Development.tune_oov(devcorpus_setup1, candidates_setup1, best_parameters, language='en')
oov_scores_setup2 = Development.tune_oov(devcorpus_setup2, candidates_setup2, best_parameters, language='en')
oov_scores_setup3 = Development.tune_oov(devcorpus_setup3, candidates_setup3, best_parameters, language='en')

# search for best averaged oov penalty
best_oov = Development.define_best_parameters(iv=[oov_scores_setup1, oov_scores_setup2], oov=oov_scores_setup3)

# store best parameters
best_parameters['oov_penalty'] = best_oov
with open('parameters.json', 'w') as f:
    json.dump(best_parameters, f)

# conduct ranking experiments with best parameters on test data
with open('testcorpus.json', 'r') as f:
    testfiles = json.load(f)
testcorpus = [testfiles[0], testfiles[1], testfiles[2]]

with open('testcandidates.json', 'r') as f:
    testcandidates = json.load(f)

# ranking experiment and analysis per frequency scenario for our
# context-sensitive model, noisy channel model, and majority frequency
best_parameters['ranking_method'] = 'context'
dev = Development(best_parameters, language='en')
accuracy_context, correction_list_context = dev.conduct_experiment(testcorpus, testcandidates)
frequency_analysis_context = dev.frequency_analysis()

best_parameters['ranking_method'] = 'noisy_channel'
dev = Development(best_parameters, language='en')
accuracy_noisychannel, correction_list_noisychannel = dev.conduct_experiment(testcorpus, testcandidates)
frequency_analysis_noisychannel = dev.frequency_analysis()

best_parameters['ranking_method'] = 'frequency'
dev = Development(best_parameters, language='en')
accuracy_frequency, correction_list_frequency = dev.conduct_experiment(testcorpus, testcandidates)
```