Ecosyste.ms: Awesome

An open API service indexing awesome lists of open source software.

Awesome Lists | Featured Topics | Projects

https://github.com/clips/clinspell

Clinical spelling correction with word and character n-gram embeddings.
https://github.com/clips/clinspell

Last synced: 3 months ago
JSON representation

Clinical spelling correction with word and character n-gram embeddings.

Awesome Lists containing this project

README

        

This repository contains source code for the paper ['Unsupervised Context-Sensitive Spelling Correction of English and Dutch Clinical Free-Text with Word and Character N-gram Embeddings'](http://www.clinjournal.org/sites/clinjournal.org/files/03.unsupervised-context-sensitive_0.pdf), which is published in Volume 7 of [CLIN Journal](http://www.clinjournal.org/biblio/volume). A shorter paper, which focuses exclusively on our English experiments, was presented at the BioNLP 2017 workshop at ACL: ['Unsupervised Context-Sensitive Spelling Correction of Clinical Free-Text with Word and Character N-gram Embeddings.'](http://www.aclweb.org/anthology/W17-2317) The source code offered here contains scripts to extract our manually annotated MIMIC-III data, and to
run the experiments described in our paper.

# License

MIT

# Requirements

* Python 3
* Python 2.7
* Numpy
* [pyxdameraulevenshtein](https://github.com/gfairchild/pyxDamerauLevenshtein)
* [Facebook fastText](https://github.com/facebookresearch/fastText)
* [fasttext](https://github.com/salestock/fastText.py), a Python interface for Facebook fastText

All packages are available from pip, except ```fastText```. To install these requirements, just run

```pip install -r requirements.txt```

from inside the cloned repository.

In order to build ```fastText```, use the following:

```
$ git clone https://github.com/facebookresearch/fastText.git
$ cd fastText
$ make
```

To extract our manually annotated MIMIC-III test data, you should have access to the [MIMIC-III database](https://mimic.physionet.org).
It's important that this is specifically MIMIC-III v1.3: our extraction script only works for this version.

# Usage

## Demo

To demo the context-sensitive spelling correction model with the best parameters from the experiments,
go to the **demo** directory and follow the instructions in the README.

## Extracting the English test data

To extract the annotated test data, run

```python2.7 extract_test.py [path to NOTEEVENTS.csv file from the MIMIC-III v1.3 database]```

This script preprocesses the **NOTEEVENTS.csv** data and stores the preprocessed data in the file **mimic_preprocessed.txt**. It then extracts the annotated
test data, which is stored to the file **testcorpus.json** in four lists: correct replacements, misspellings, misspelling contexts, and line indices.

## Extracting development data and other resources

### Preprocessing

To generate development corpora as described in the paper, the data has to be preprocessed. To preprocess English data, run

```python3 preprocess.py [path to raw data] [path to created preprocessed data]```

This script uses the source code of the English tokenizer from [Pattern](https://github.com/clips/pattern).

To preprocess Dutch data, you can use the [Ucto](https://languagemachines.github.io/ucto/) tokenizer and, for every line, retain every token which
matches

```r'(^[^\d\W])[^\d\W]*(-[^\d\W]*)*([^\d\W]$)'```

### Generating frequency lists and neural embeddings

To extract a frequency list from the preprocessed data, run

```python3 frequencies.py [path to preprocessed data] [language]```

The [language] argument should always either be **en** if the language is English or **nl** if the language is Dutch.

To train the fastText vectors as we do, place the preprocessed data in the cloned fastText directory and run

```./fasttext skipgram -input [path to preprocessed data] -output ../data/embeddings_[language] -dim 300```

This makes an embeddings_[language].vec and embeddings_[language].bin file in the data repository.
Only the embeddings_[language].bin file is used by the code.

### Generating development corpora

To create a development corpus from preprocessed data, run

```python3 make_devcorpus.py [path to preprocessed data] [language] [path to created devcorpus] [window size] [allow oov] [samplesize]```

The [window size] argument specifies the minimal token window size on each side of a generated development instance.
The [allow oov] argument should be False for development setup 1 or 2 from the paper, and True for development setup 3.
The [samplesize] argument should contain the number of lines to sample from the data.

## Conducting experiments

### Generating candidates

To generate candidates for a created development corpus, run

```python3 candidates.py [path to preprocessed data] 2 [name of output] [language]```

To generate candidates for our extracted test data or other empirically observed data, run

```python3 candidates.py [path to preprocessed data] all [name of output] [language]```

### Ranking experiments

The ```Development``` class in **ranking_experiments.py** contains all functions to conduct the experiments.

Example:

```
import ranking_experiments

# load devcorpus for setup 1, 2 and 3

with open('devcorpus_setup1.json', 'r') as f:
corpusfiles_setup1 = json.load(f)
devcorpus_setup1 = corpusfiles_setup1[:3]

with open('devcorpus_setup2.json', 'r') as f:
corpusfiles_setup2 = json.load(f)
devcorpus_setup2 = corpusfiles_setup2[:3]

with open('devcorpus_setup3.json', 'r') as f:
corpusfiles_setup3 = json.load(f)
devcorpus_setup3 = corpusfiles_setup3[:3]

# load candidates for setup 1, 2 and 3
with open('candidates_devcorpus_setup1.json', 'r') as f:
candidates_setup1 = json.load(f)
with open('candidates_devcorpus_setup2.json', 'r') as f:
candidates_setup2 = json.load(f)
with open('candidates_devcorpus_setup3.json', 'r') as f:
candidates_setup3 = json.load(f)

# perform grid search
scores_setup1 = Development.grid_search(devcorpus_setup1, candidates_setup1, language='en')
scores_setup2 = Development.grid_search(devcorpus_setup2, candidates_setup2, language='en')

# search for best averaged parameters
best_parameters = Development.define_best_parameters('iv'=[scores_setup1, scores_setup2])

# perform grid search for oov penalty
oov_scores_setup1 = Development.tune_oov(devcorpus_setup1, candidates_list, best_parameters, language='en')
oov_scores_setup2 = Development.tune_oov(devcorpus_setup2, candidates_list, best_parameters, language='en')
oov_scores_setup3 = Development.tune_oov(devcorpus_setup3, candidates_list, best_parameters, language='en')

# search for best averaged oov penalty
best_oov = Development.define_best_parameters('iv'=[oov_scores_setup1, oov_scores_setup2], 'oov'=oov_scores_setup3)

# store best parameters
best_parameters['oov_penalty'] = best_oov
with open('parameters.json', 'w') as f:
json.dump(best_parameters, f)

# conduct ranking experiments with best parameters on test data

with open('testcorpus.json', 'r') as f:
testfiles = json.load(f)
testcorpus = [testfiles[0], testfiles[1], testfiles[2]]

with open('testcandidates.json', 'r') as f:
testcandidates = json.load(f)

# ranking experiment and analysis per frequency scenario for our context-sensitive model, noisy channel model, and majority frequency

best_parameters['ranking_method'] = 'context'
dev = Development(best_parameters, language='en')
accuracy_context, correction_list_context = dev.conduct_experiment(testcorpus, testcandidates)
frequency_analysis_context = dev.frequency_analysis()

best_parameters['ranking_method'] = 'noisy_channel'
dev = Development(best_parameters, language='en')
accuracy_noisychannel, correction_list_noisychannel = dev.conduct_experiment(testcorpus, testcandidates)
frequency_analysis_noisychannel = dev.frequency_analysis()

best_parameters['ranking_method'] = 'frequency'
dev = Development(best_parameters, language='en')
accuracy_frequency, correction_list_frequency = dev.conduct_experiment(testcorpus, testcandidates)
```