Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://github.com/clips/clinspell
Clinical spelling correction with word and character n-gram embeddings.
Last synced: 3 months ago
- Host: GitHub
- URL: https://github.com/clips/clinspell
- Owner: clips
- Created: 2017-05-15T23:54:04.000Z (over 7 years ago)
- Default Branch: master
- Last Pushed: 2022-06-21T21:14:12.000Z (over 2 years ago)
- Last Synced: 2024-05-21T00:49:48.960Z (6 months ago)
- Language: Python
- Homepage:
- Size: 2.56 MB
- Stars: 74
- Watchers: 8
- Forks: 16
- Open Issues: 5
Metadata Files:
- Readme: README.md
README
This repository contains source code for the paper ['Unsupervised Context-Sensitive Spelling Correction of English and Dutch Clinical Free-Text with Word and Character N-gram Embeddings'](http://www.clinjournal.org/sites/clinjournal.org/files/03.unsupervised-context-sensitive_0.pdf), which is published in Volume 7 of [CLIN Journal](http://www.clinjournal.org/biblio/volume). A shorter paper, which focuses exclusively on our English experiments, was presented at the BioNLP 2017 workshop at ACL: ['Unsupervised Context-Sensitive Spelling Correction of Clinical Free-Text with Word and Character N-gram Embeddings.'](http://www.aclweb.org/anthology/W17-2317) The source code offered here contains scripts to extract our manually annotated MIMIC-III data, and to
run the experiments described in our paper.

# License
MIT
# Requirements
* Python 3
* Python 2.7 (used by the MIMIC-III extraction script)
* Numpy
* [pyxdameraulevenshtein](https://github.com/gfairchild/pyxDamerauLevenshtein)
* [Facebook fastText](https://github.com/facebookresearch/fastText)
* [fasttext](https://github.com/salestock/fastText.py), a Python interface for Facebook fastText

All packages are available from pip, except ```fastText```. To install these requirements, just run
```pip install -r requirements.txt```
from inside the cloned repository.
In order to build ```fastText```, use the following:
```
$ git clone https://github.com/facebookresearch/fastText.git
$ cd fastText
$ make
```

To extract our manually annotated MIMIC-III test data, you should have access to the [MIMIC-III database](https://mimic.physionet.org).
It's important that this is specifically MIMIC-III v1.3: our extraction script only works for this version.

# Usage
## Demo
To demo the context-sensitive spelling correction model with the best parameters from the experiments, go to the **demo** directory and follow the instructions in the README.

## Extracting the English test data
To extract the annotated test data, run
```python2.7 extract_test.py [path to NOTEEVENTS.csv file from the MIMIC-III v1.3 database]```
This script preprocesses the **NOTEEVENTS.csv** data and stores the preprocessed data in the file **mimic_preprocessed.txt**. It then extracts the annotated test data, which is stored in the file **testcorpus.json** as four lists: correct replacements, misspellings, misspelling contexts, and line indices.
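As a minimal sketch (assuming the JSON file stores the four lists in the order given above), the extracted test corpus can be loaded like this:
```
import json

# testcorpus.json holds four lists: correct replacements, misspellings,
# misspelling contexts, and line indices (order as described above)
with open('testcorpus.json', 'r') as f:
    corrections, misspellings, contexts, line_indices = json.load(f)

print(misspellings[0], '->', corrections[0])
```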
## Extracting development data and other resources

### Preprocessing
To generate development corpora as described in the paper, the data has to be preprocessed. To preprocess English data, run
```python3 preprocess.py [path to raw data] [path to created preprocessed data]```
This script uses the source code of the English tokenizer from [Pattern](https://github.com/clips/pattern).
To preprocess Dutch data, you can use the [Ucto](https://languagemachines.github.io/ucto/) tokenizer and, for every line, retain every token which matches
```r'(^[^\d\W])[^\d\W]*(-[^\d\W]*)*([^\d\W]$)'```
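A minimal sketch of that filtering step (assuming one whitespace-tokenized sentence per line; the function name is illustrative):
```
import re

# tokens made of word characters only (no digits), optionally hyphenated,
# and at least two characters long
TOKEN_PATTERN = re.compile(r'(^[^\d\W])[^\d\W]*(-[^\d\W]*)*([^\d\W]$)')

def filter_line(line):
    """Retain only the tokens that match the pattern above."""
    return ' '.join(tok for tok in line.split() if TOKEN_PATTERN.match(tok))
```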
### Generating frequency lists and neural embeddings
To extract a frequency list from the preprocessed data, run
```python3 frequencies.py [path to preprocessed data] [language]```
The [language] argument should be **en** if the language is English or **nl** if it is Dutch.
To train the fastText vectors as we do, place the preprocessed data in the cloned fastText directory and run
```./fasttext skipgram -input [path to preprocessed data] -output ../data/embeddings_[language] -dim 300```
This creates an embeddings_[language].vec and an embeddings_[language].bin file in the **data** directory.
Only the embeddings_[language].bin file is used by the code.
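For reference, a hedged sketch of loading the trained binary model with the [fasttext](https://github.com/salestock/fastText.py) Python interface listed above (the path and lookup word are illustrative):
```
import fasttext

# load the binary skipgram model trained above
model = fasttext.load_model('data/embeddings_en.bin')

# look up a 300-dimensional vector; thanks to character n-grams, fastText
# can also produce vectors for out-of-vocabulary (misspelled) words
vector = model['patient']
```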
### Generating development corpora

To create a development corpus from preprocessed data, run
```python3 make_devcorpus.py [path to preprocessed data] [language] [path to created devcorpus] [window size] [allow oov] [samplesize]```
The [window size] argument specifies the minimal token window size on each side of a generated development instance.
The [allow oov] argument should be False for development setup 1 or 2 from the paper, and True for development setup 3.
The [samplesize] argument should contain the number of lines to sample from the data.
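For example, a hypothetical invocation for development setup 1 (all argument values are illustrative):
```python3 make_devcorpus.py mimic_preprocessed.txt en devcorpus_setup1.json 2 False 5000```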
## Conducting experiments

### Generating candidates
To generate candidates for a created development corpus, run
```python3 candidates.py [path to preprocessed data] 2 [name of output] [language]```
To generate candidates for our extracted test data or other empirically observed data, run
```python3 candidates.py [path to preprocessed data] all [name of output] [language]```
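For instance, hypothetical invocations matching the file names used in the example below (paths and output names are illustrative):
```python3 candidates.py mimic_preprocessed.txt 2 candidates_devcorpus_setup1 en```
```python3 candidates.py mimic_preprocessed.txt all testcandidates en```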
### Ranking experiments
The ```Development``` class in **ranking_experiments.py** contains all functions to conduct the experiments.
Example:
```
import json
from ranking_experiments import Development

# load devcorpus for setups 1, 2 and 3
with open('devcorpus_setup1.json', 'r') as f:
    corpusfiles_setup1 = json.load(f)
devcorpus_setup1 = corpusfiles_setup1[:3]

with open('devcorpus_setup2.json', 'r') as f:
    corpusfiles_setup2 = json.load(f)
devcorpus_setup2 = corpusfiles_setup2[:3]

with open('devcorpus_setup3.json', 'r') as f:
    corpusfiles_setup3 = json.load(f)
devcorpus_setup3 = corpusfiles_setup3[:3]

# load candidates for setups 1, 2 and 3
with open('candidates_devcorpus_setup1.json', 'r') as f:
    candidates_setup1 = json.load(f)
with open('candidates_devcorpus_setup2.json', 'r') as f:
    candidates_setup2 = json.load(f)
with open('candidates_devcorpus_setup3.json', 'r') as f:
    candidates_setup3 = json.load(f)

# perform grid search
scores_setup1 = Development.grid_search(devcorpus_setup1, candidates_setup1, language='en')
scores_setup2 = Development.grid_search(devcorpus_setup2, candidates_setup2, language='en')

# search for best averaged parameters
best_parameters = Development.define_best_parameters(iv=[scores_setup1, scores_setup2])

# perform grid search for oov penalty
oov_scores_setup1 = Development.tune_oov(devcorpus_setup1, candidates_setup1, best_parameters, language='en')
oov_scores_setup2 = Development.tune_oov(devcorpus_setup2, candidates_setup2, best_parameters, language='en')
oov_scores_setup3 = Development.tune_oov(devcorpus_setup3, candidates_setup3, best_parameters, language='en')

# search for best averaged oov penalty
best_oov = Development.define_best_parameters(iv=[oov_scores_setup1, oov_scores_setup2], oov=oov_scores_setup3)

# store best parameters
best_parameters['oov_penalty'] = best_oov
with open('parameters.json', 'w') as f:
    json.dump(best_parameters, f)

# conduct ranking experiments with best parameters on test data
with open('testcorpus.json', 'r') as f:
    testfiles = json.load(f)
testcorpus = [testfiles[0], testfiles[1], testfiles[2]]

with open('testcandidates.json', 'r') as f:
    testcandidates = json.load(f)

# ranking experiment and analysis per frequency scenario for our
# context-sensitive model, noisy channel model, and majority frequency
best_parameters['ranking_method'] = 'context'
dev = Development(best_parameters, language='en')
accuracy_context, correction_list_context = dev.conduct_experiment(testcorpus, testcandidates)
frequency_analysis_context = dev.frequency_analysis()

best_parameters['ranking_method'] = 'noisy_channel'
dev = Development(best_parameters, language='en')
accuracy_noisychannel, correction_list_noisychannel = dev.conduct_experiment(testcorpus, testcandidates)
frequency_analysis_noisychannel = dev.frequency_analysis()

best_parameters['ranking_method'] = 'frequency'
dev = Development(best_parameters, language='en')
accuracy_frequency, correction_list_frequency = dev.conduct_experiment(testcorpus, testcandidates)
```