https://github.com/crherlihy/clinical_nli_artifacts
MedNLI Is Not Immune: Natural Language Inference Artifacts in the Clinical Domain (ACL-IJCNLP '21)
https://github.com/crherlihy/clinical_nli_artifacts
adversarial-filtering annotation-artifacts clinical-nlp clinical-notes mednlp mimic-iii natural-language-inference nlp-machine-learning
Last synced: about 1 month ago
JSON representation
MedNLI Is Not Immune: Natural Language Inference Artifacts in the Clinical Domain (ACL-IJCNLP '21)
- Host: GitHub
- URL: https://github.com/crherlihy/clinical_nli_artifacts
- Owner: crherlihy
- Created: 2021-06-02T03:48:24.000Z (almost 4 years ago)
- Default Branch: main
- Last Pushed: 2021-08-06T20:41:12.000Z (almost 4 years ago)
- Last Synced: 2025-02-07T17:18:08.444Z (3 months ago)
- Topics: adversarial-filtering, annotation-artifacts, clinical-nlp, clinical-notes, mednlp, mimic-iii, natural-language-inference, nlp-machine-learning
- Language: Python
- Homepage:
- Size: 85 KB
- Stars: 2
- Watchers: 1
- Forks: 0
- Open Issues: 0
-
Metadata Files:
- Readme: README.md
Awesome Lists containing this project
README
## MedNLI Is Not Immune: Natural Language Inference Artifacts in the Clinical Domain
This repository contains the source code required to reproduce the analysis presented in the paper "MedNLI Is Not Immune: Natural Language Inference Artifacts in the Clinical Domain", appearing at [ACL-IJCNLP 2021](https://aclanthology.org/2021.acl-short.129/).
#### Data:
MedNLI can be downloaded from [PhysioNet](https://physionet.org/content/mednli/1.0.0/), though credentialed access is required.
After you have downloaded the data, put the resulting directory underneath the project root directory. Organization is as follows:```
.
├── mednli
│ └── 1.0.0
│ ├── LICENSE.txt
│ ├── README.txt
│ ├── SHA256SUMS.txt
│ ├── index.html
│ ├── mli_dev_v1.jsonl
│ ├── mli_test_v1.jsonl
│ └── mli_train_v1.jsonl
```
----
### Set-up:#### Conda environment:
`conda env create -f environment.yml``conda activate clinical_nli`
#### scispaCy language model:
General usage is: `pip install `; `en_core_sci_sm` and `en_core_sci_lg` are both used in this pipeline:
- `pip install https://s3-us-west-2.amazonaws.com/ai2-s2-scispacy/releases/v0.4.0/en_core_sci_sm-0.4.0.tar.gz`
- `pip install https://s3-us-west-2.amazonaws.com/ai2-s2-scispacy/releases/v0.4.0/en_core_sci_lg-0.4.0.tar.gz`#### fastText MIMIC-III embeddings:
Referenced in the [original MedNLI paper by Romanov and Shivade (2018)](https://arxiv.org/abs/1808.06752); available on the [associated repo](https://github.com/jgc128/mednli) or via:
- `wget https://mednli.blob.core.windows.net/shared/word_embeddings/wiki_en_mimic.fastText.no_clean.300d.pickled`#### Configuration file:
`./example_cfg.ini`: Defines paths and task-specific hyper-parameters.----
### Shell and python scripts:
From the project root directory:
`cd ./scripts && sh parse_embeds_aflite.sh`Note: `parse_embeds_aflite.sh` has 4 boolean flags:
- `fastText`: parse MedNLI input files (JSON) and create fastText-formatted `.txt` files
- `ftAllSubsets`: create a single fastText-formatted `.txt` file containing instances from all splits (eg, train, dev test). Useful for AFLite.
- `embeddings`: recovers embeddings for each instance in the corpus (language model is configurable)
- `aflite`: runs adversarial filtering algorithm `AfLite` adapted from [Sakaguchi et al. (2019)](https://arxiv.org/abs/1907.10641); yields `easy` and `difficult` partitions#### To replicate reported results, after running `sh parse_embeds_aflite.sh` with all flags set to `True`, run:
- `sh ft_baseline.sh`: computes fastText baseline results; if `evalAflite` flag is set to `True`, also computes fastText results for AfLite *easy* and *difficult* partitions.
- `sh lexical.sh`: computes ngram counts, PMI, and mean/median hypothesis length by label.
- `sh semantic.sh`: uses `scispaCy` to link named ents to UMLS; conducts statistical hypothesis testing re: heuristics.From the project root directory, `cd ./src/utils` and:
- `python get_hyp_len.py`: Computes hypothesis length for two versions of the corpus (multi-word entities merged and separate).
- `python get_partition_ids.py`: Creates 2 arrays with instance ids for the *easy* and *difficult* AfLite partitions.
- instance ids will have the format ``
- underlying text can be recovered by joining against the `./mednli/fastText/mli_all_w_premise_v1_sep.txt` file.----
If you find this code useful in your research, please consider citing:
```
@inproceedings{herlihy-rudinger-2021-mednli,
title = "{M}ed{NLI} Is Not Immune: {N}atural Language Inference Artifacts in the Clinical Domain",
author = "Herlihy, Christine and
Rudinger, Rachel",
booktitle = "Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 2: Short Papers)",
month = aug,
year = "2021",
address = "Online",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/2021.acl-short.129",
doi = "10.18653/v1/2021.acl-short.129",
pages = "1020--1027",
}
```