https://github.com/crherlihy/clinical_nli_artifacts

MedNLI Is Not Immune: Natural Language Inference Artifacts in the Clinical Domain (ACL-IJCNLP '21)

Host: GitHub
URL: https://github.com/crherlihy/clinical_nli_artifacts
Owner: crherlihy
Created: 2021-06-02T03:48:24.000Z
Default Branch: main
Last Pushed: 2021-08-06T20:41:12.000Z
Last Synced: 2025-02-07T17:18:08.444Z
Topics: adversarial-filtering, annotation-artifacts, clinical-nlp, clinical-notes, mednlp, mimic-iii, natural-language-inference, nlp-machine-learning
Language: Python
Homepage:
Size: 85 KB
Stars: 2
Watchers: 1
Forks: 0
Open Issues: 0
Metadata Files:
- Readme: README.md


## MedNLI Is Not Immune: Natural Language Inference Artifacts in the Clinical Domain

This repository contains the source code required to reproduce the analysis presented in the paper "MedNLI Is Not Immune: Natural Language Inference Artifacts in the Clinical Domain", appearing at [ACL-IJCNLP 2021](https://aclanthology.org/2021.acl-short.129/).

#### Data:

MedNLI can be downloaded from [PhysioNet](https://physionet.org/content/mednli/1.0.0/); credentialed access is required.
After downloading the data, place the resulting `mednli` directory directly under the project root directory, organized as follows:

```
.
├── mednli
│   └── 1.0.0
│       ├── LICENSE.txt
│       ├── README.txt
│       ├── SHA256SUMS.txt
│       ├── index.html
│       ├── mli_dev_v1.jsonl
│       ├── mli_test_v1.jsonl
│       └── mli_train_v1.jsonl
```
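
As a quick sanity check on this layout, the sketch below reads one split as JSON Lines and counts instances per label. The helper names `load_split` and `label_counts` are hypothetical (they are not part of this repository), and the `gold_label` field name is an assumption based on the SNLI-style schema; check it against the `README.txt` shipped with your download.

```python
import json
from collections import Counter
from pathlib import Path

# Assumed location, matching the directory tree shown above.
MEDNLI_DIR = Path("mednli") / "1.0.0"


def load_split(path):
    """Read a JSON Lines file: one JSON object per non-empty line."""
    records = []
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:
                records.append(json.loads(line))
    return records


def label_counts(records, label_field="gold_label"):
    """Count instances per label; `gold_label` is an assumed field name."""
    return Counter(record[label_field] for record in records)


if __name__ == "__main__":
    train_path = MEDNLI_DIR / "mli_train_v1.jsonl"
    if train_path.exists():
        train = load_split(train_path)
        print(len(train), dict(label_counts(train)))
    else:
        print(f"Expected MedNLI at {train_path}; see the layout above.")
```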
----
### Set-up:

#### Conda environment:
`conda env create -f environment.yml`

`conda activate clinical_nli`

#### scispaCy language model:
Models are installed with `pip install` followed by the model URL; `en_core_sci_sm` and `en_core_sci_lg` are both used in this pipeline:
- `pip install https://s3-us-west-2.amazonaws.com/ai2-s2-scispacy/releases/v0.4.0/en_core_sci_sm-0.4.0.tar.gz`
- `pip install https://s3-us-west-2.amazonaws.com/ai2-s2-scispacy/releases/v0.4.0/en_core_sci_lg-0.4.0.tar.gz`

#### fastText MIMIC-III embeddings:
Referenced in the [original MedNLI paper by Romanov and Shivade (2018)](https://arxiv.org/abs/1808.06752); available on the [associated repo](https://github.com/jgc128/mednli) or via:
- `wget https://mednli.blob.core.windows.net/shared/word_embeddings/wiki_en_mimic.fastText.no_clean.300d.pickled`

#### Configuration file:
`./example_cfg.ini`: Defines paths and task-specific hyper-parameters.
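
For readers unfamiliar with `.ini` files, here is a minimal sketch of reading one with Python's standard `configparser`. The section and key names (`paths`, `mednli_dir`, `aflite`, `n_iterations`) and their values are purely hypothetical placeholders; consult `example_cfg.ini` itself for the actual keys.

```python
import configparser

# Hypothetical contents: these section and key names are placeholders,
# NOT the actual keys defined in example_cfg.ini.
EXAMPLE_TEXT = """
[paths]
mednli_dir = ./mednli/1.0.0

[aflite]
n_iterations = 64
"""


def read_config(text):
    """Parse INI-formatted text into a ConfigParser object."""
    parser = configparser.ConfigParser()
    parser.read_string(text)
    return parser


cfg = read_config(EXAMPLE_TEXT)
mednli_dir = cfg.get("paths", "mednli_dir")          # values are strings by default
n_iterations = cfg.getint("aflite", "n_iterations")  # typed accessor for integers
print(mednli_dir, n_iterations)
```

To read a file on disk instead of a string, `parser.read("example_cfg.ini")` can be used in place of `read_string`.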

----

### Shell and Python scripts:

From the project root directory:
`cd ./scripts && sh parse_embeds_aflite.sh`

Note: `parse_embeds_aflite.sh` has 4 boolean flags:
- `fastText`: parse MedNLI input files (JSON) and create fastText-formatted `.txt` files
- `ftAllSubsets`: create a single fastText-formatted `.txt` file containing instances from all splits (e.g., train, dev, test). Useful for AFLite.
- `embeddings`: recovers embeddings for each instance in the corpus (language model is configurable)
- `aflite`: runs adversarial filtering algorithm `AfLite` adapted from [Sakaguchi et al. (2019)](https://arxiv.org/abs/1907.10641); yields `easy` and `difficult` partitions
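
To give a feel for what the adversarial filtering step does, here is a simplified, standard-library-only sketch of an AFLite-style loop. It is not this repository's implementation: Sakaguchi et al. use an ensemble of linear classifiers over precomputed embeddings, whereas this sketch swaps in a nearest-centroid classifier to stay dependency-free. All function names, parameter names, and default values are assumptions chosen for illustration.

```python
import random


def nearest_centroid_predict(train, test_vectors):
    """Stand-in classifier: assign each test vector to the closest class mean."""
    sums, counts = {}, {}
    for vector, label in train:
        acc = sums.setdefault(label, [0.0] * len(vector))
        for i, value in enumerate(vector):
            acc[i] += value
        counts[label] = counts.get(label, 0) + 1
    centroids = {lab: [v / counts[lab] for v in acc] for lab, acc in sums.items()}

    def closest(vector):
        return min(
            centroids,
            key=lambda lab: sum((a - b) ** 2 for a, b in zip(vector, centroids[lab])),
        )

    return [closest(vector) for vector in test_vectors]


def aflite_partition(vectors, labels, n_models=20, train_size=8,
                     cutoff=4, threshold=0.75, seed=0):
    """Return (easy_ids, difficult_ids) as sorted lists of instance indices."""
    rng = random.Random(seed)
    remaining = list(range(len(vectors)))
    easy = []
    while len(remaining) > train_size:
        correct = {i: 0 for i in remaining}
        seen = {i: 0 for i in remaining}
        for _ in range(n_models):
            # Train on a random subset, then score every held-out instance.
            train_ids = set(rng.sample(remaining, train_size))
            held_out = [i for i in remaining if i not in train_ids]
            train = [(vectors[i], labels[i]) for i in train_ids]
            preds = nearest_centroid_predict(train, [vectors[i] for i in held_out])
            for i, pred in zip(held_out, preds):
                seen[i] += 1
                correct[i] += int(pred == labels[i])
        # Predictability: fraction of held-out appearances predicted correctly.
        scores = {i: correct[i] / seen[i] for i in remaining if seen[i]}
        candidates = sorted(
            (i for i in scores if scores[i] >= threshold),
            key=lambda i: scores[i], reverse=True,
        )[:cutoff]
        easy.extend(candidates)  # filtered-out instances form the easy partition
        removed = set(candidates)
        remaining = [i for i in remaining if i not in removed]
        if len(candidates) < cutoff:
            break  # too few predictable instances left; stop filtering
    return sorted(easy), sorted(remaining)


if __name__ == "__main__":
    # Toy 2-d "embeddings": two clusters, with one flipped label in each.
    toy_vectors = [[0.0, 0.1 * i] for i in range(10)] + [[5.0, 0.1 * i] for i in range(10)]
    toy_labels = ["a"] * 10 + ["b"] * 10
    toy_labels[0], toy_labels[10] = "b", "a"
    easy_ids, difficult_ids = aflite_partition(toy_vectors, toy_labels)
    print("easy:", easy_ids)
    print("difficult:", difficult_ids)
```

Instances that many independently trained models classify correctly while held out are treated as predictable from the embedding alone and are moved to the easy partition; whatever survives filtering is the difficult partition.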

#### To replicate reported results, after running `sh parse_embeds_aflite.sh` with all flags set to `True`, run:

- `sh ft_baseline.sh`: computes fastText baseline results; if `evalAflite` flag is set to `True`, also computes fastText results for AfLite *easy* and *difficult* partitions.
- `sh lexical.sh`: computes n-gram counts, PMI, and mean/median hypothesis length by label.
- `sh semantic.sh`: uses `scispaCy` to link named entities to UMLS; conducts statistical hypothesis testing re: heuristics.
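
As an illustration of the PMI statistic mentioned above, the following sketch computes unsmoothed pointwise mutual information between hypothesis tokens and labels, PMI(w, l) = log[p(w, l) / (p(w) p(l))]. It is a simplified stand-in rather than the code behind `lexical.sh`: lowercased whitespace tokenization, counting each token once per hypothesis, and the absence of smoothing are all assumptions, and the function names are hypothetical.

```python
import math
from collections import Counter


def pmi_by_label(examples):
    """Compute PMI for every (token, label) pair observed in `examples`.

    `examples` is an iterable of (hypothesis_text, label) pairs.
    """
    joint = Counter()         # count of (token, label) events
    word_counts = Counter()   # count of events involving each token
    label_counts = Counter()  # count of events involving each label
    total = 0
    for text, label in examples:
        for token in set(text.lower().split()):  # each token once per hypothesis
            joint[(token, label)] += 1
            word_counts[token] += 1
            label_counts[label] += 1
            total += 1
    return {
        (token, label): math.log(
            (count * total) / (word_counts[token] * label_counts[label])
        )
        for (token, label), count in joint.items()
    }


def top_tokens(pmi, label, k=5):
    """Return the k tokens with the highest PMI for the given label."""
    scored = [(tok, score) for (tok, lab), score in pmi.items() if lab == label]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]


if __name__ == "__main__":
    toy = [
        ("no fever", "contradiction"),
        ("no pain", "contradiction"),
        ("has fever", "entailment"),
    ]
    for token, score in top_tokens(pmi_by_label(toy), "contradiction"):
        print(token, round(score, 3))
```

Tokens with high PMI for a label occur with that label more often than independence would predict, which is how label-specific lexical artifacts surface.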

From the project root directory, `cd ./src/utils` and run:
- `python get_hyp_len.py`: Computes hypothesis length for two versions of the corpus (multi-word entities merged and separate).
- `python get_partition_ids.py`: Creates 2 arrays with instance ids for the *easy* and *difficult* AfLite partitions.
  - instance ids will have the format ``
  - underlying text can be recovered by joining against the `./mednli/fastText/mli_all_w_premise_v1_sep.txt` file.
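
To illustrate the kind of per-label summary a hypothesis-length analysis produces, here is a minimal sketch. It is not the code in `get_hyp_len.py`: the function name is hypothetical, and whitespace tokenization is an assumption (the script's handling of merged multi-word entities is not reproduced here).

```python
from collections import defaultdict
from statistics import mean, median


def length_stats_by_label(examples):
    """Return mean and median hypothesis length (in tokens) for each label.

    `examples` is an iterable of (hypothesis_text, label) pairs. Tokens are
    split on whitespace, which is an assumption made for this sketch.
    """
    lengths = defaultdict(list)
    for hypothesis, label in examples:
        lengths[label].append(len(hypothesis.split()))
    return {
        label: {"mean": mean(values), "median": median(values)}
        for label, values in lengths.items()
    }


if __name__ == "__main__":
    toy = [
        ("the patient has a fever", "entailment"),
        ("no fever", "contradiction"),
        ("the patient is afebrile", "contradiction"),
    ]
    for label, stats in length_stats_by_label(toy).items():
        print(label, stats)
```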

----

If you find this code useful in your research, please consider citing:

```
@inproceedings{herlihy-rudinger-2021-mednli,
title = "{M}ed{NLI} Is Not Immune: {N}atural Language Inference Artifacts in the Clinical Domain",
author = "Herlihy, Christine and
Rudinger, Rachel",
booktitle = "Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 2: Short Papers)",
month = aug,
year = "2021",
address = "Online",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/2021.acl-short.129",
doi = "10.18653/v1/2021.acl-short.129",
pages = "1020--1027",
}
```
