{"id":13754345,"url":"https://github.com/jgc128/mednli","last_synced_at":"2025-05-09T22:31:57.937Z","repository":{"id":33286089,"uuid":"145300422","full_name":"jgc128/mednli","owner":"jgc128","description":"MedNLI - A Natural Language Inference Dataset For The Clinical Domain","archived":false,"fork":false,"pushed_at":"2023-02-15T17:53:05.000Z","size":74,"stargazers_count":124,"open_issues_count":15,"forks_count":31,"subscribers_count":8,"default_branch":"master","last_synced_at":"2024-11-16T07:33:25.411Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":"https://jgc128.github.io/mednli/","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/jgc128.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2018-08-19T12:17:40.000Z","updated_at":"2024-11-13T20:47:45.000Z","dependencies_parsed_at":"2024-08-03T09:17:18.494Z","dependency_job_id":null,"html_url":"https://github.com/jgc128/mednli","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/jgc128%2Fmednli","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/jgc128%2Fmednli/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/jgc128%2Fmednli/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/jgc128%2Fmednli/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/jgc128","download_url":"https://codeload.github.com/jgc128/mednli/tar.gz/refs/heads/maste
r","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":253335823,"owners_count":21892744,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2024-08-03T09:01:55.638Z","updated_at":"2025-05-09T22:31:52.929Z","avatar_url":"https://github.com/jgc128.png","language":"Python","funding_links":[],"categories":["NLP corpora and datasets"],"sub_categories":["Other_text generation, text dialogue"],"readme":"MedNLI - Natural Language Inference in Clinical Texts\n=====================================================\n\n## Information\nThis repository contains the code to fully reproduce the experiments in the paper. \nAs such, it has quite a few dependencies and is not trivial to install.\nIf you just want a simple ready-to-use baseline with pre-trained models,\nplease have a look at our baselines repository:\nhttps://github.com/jgc128/mednli_baseline\n\n## Installation\n\n1. Clone this repo: `git clone ...`\n2. Install NumPy: `pip install numpy==1.13.3`\n3. Install PyTorch v0.2.0: `pip install http://download.pytorch.org/whl/cu80/torch-0.2.0.post3-cp36-cp36m-manylinux1_x86_64.whl` (see https://github.com/pytorch/pytorch#installation for details)\n4. Install requirements: `pip install -r requirements.txt`\n5. Install MetaMap: https://metamap.nlm.nih.gov/Installation.shtml\n   - Make sure to set `METAMAP_BINARY_PATH` in `config.py` to the path of your MetaMap binary installation\n6. Install PyMetaMap: https://github.com/AnthonyMRios/pymetamap\n7. 
Install UMLS Metathesaurus: https://www.nlm.nih.gov/research/umls/\n   - Make sure to set `UMLS_INSTALLATION_DIR` in `config.py` so that it points to your UMLS installation \n\n\n## Downloading the datasets\n\n1. Download SNLI: https://nlp.stanford.edu/projects/snli/\n2. Download MultiNLI: http://www.nyu.edu/projects/bowman/multinli/ (we experimented with MultiNLI v0.9)\n3. Download MedNLI: https://jgc128.github.io/mednli/\n\nPut all of the data inside the `./data/` dir so it has the following structure:\n```\n$ ls data/\nmednli_1.0  multinli_0.9  snli_1.0\n``` \n\n```\n$ ls data/snli_1.0/\nREADME.txt  snli_1.0_dev.jsonl  snli_1.0_dev.txt  snli_1.0_test.jsonl  snli_1.0_test.txt  snli_1.0_train.jsonl  snli_1.0_train.txt\n```\n\n## Downloading the word embeddings\n\n| Word Embedding  | Link |\n| ------------- | ------------- |\n|glove |  [glove.840B.300d.pickled](https://mednli.blob.core.windows.net/shared/word_embeddings/glove.840B.300d.pickled) |\n|mimic |  [mimic.fastText.no_clean.300d.pickled](https://mednli.blob.core.windows.net/shared/word_embeddings/mimic.fastText.no_clean.300d.pickled) |\n|bio_asq | [bio_asq.no_clean.300d.pickled](https://mednli.blob.core.windows.net/shared/word_embeddings/bio_asq.no_clean.300d.pickled) |\n|wiki_en | [wiki_en.fastText.300d.pickled](https://mednli.blob.core.windows.net/shared/word_embeddings/wiki_en.fastText.300d.pickled) |\n|wiki_en_mimic |  [wiki_en_mimic.fastText.no_clean.300d.pickled](https://mednli.blob.core.windows.net/shared/word_embeddings/wiki_en_mimic.fastText.no_clean.300d.pickled) |\n|glove_bio_asq |  [glove_bio_asq.no_clean.300d.pickled](https://mednli.blob.core.windows.net/shared/word_embeddings/glove_bio_asq.no_clean.300d.pickled) |\n|glove_bio_asq_mimic |[glove_bio_asq_mimic.no_clean.300d.pickled](https://mednli.blob.core.windows.net/shared/word_embeddings/glove_bio_asq_mimic.no_clean.300d.pickled) |\n\nPut all embeddings inside the `./data/word_embeddings/` dir so it has the following structure:\n\n```\n$ ls 
data/word_embeddings/\nglove.840B.300d.pickled\t\tglove_bio_asq_mimic.no_clean.300d.pickled \tmimic.fastText.no_clean.300d.pickled\n```\n\n\n## Running the code\nThe code was tested on Python 3.4 and Python 3.6.3.\n\n0. Configuration: `config.py`\n1. Preprocess the data: `python preprocess.py`\n   - This script will create files `genre_*.pkl` in the `./data/nli_processed/` directory\n   - Preprocess the test data: `python preprocess.py process_test`\n2. Extract concepts: `python metamap_extract_concepts.py`\n   - Make sure to start the MetaMap servers before executing this script \n   - The script above works only for the MedNLI dataset. Rename the files `genre_*.pkl` to `genre_concepts_*.pkl` for SNLI and all MultiNLI domains.\n   - Call `main_data_test` as the main function to process the test data\n3. Create word embeddings cache: `python pickle_word_vectors.py \u003cpath_to_glove/word2vec file\u003e ./data/word_embeddings/\u003cname\u003e`\n   - See `WORD_VECTORS_FILENAME` in `config.py` for the file naming\n4. Create UMLS graph cache: `python parse_umls_create_concepts_graph.py`\n5. Optional: to create input data for the [official retrofitting script](https://github.com/mfaruqui/retrofitting) run `python create_retorfitting_data.py`\n6. Train the model: `python train_model.py`\n   - You can change the parameters in the `config` function or in the command line: `python train_model.py with use_umls_attention=True use_token_level_attention=True` (see the [Sacred documentation](http://sacred.readthedocs.io/en/latest/) for details)\n \n \n### Using a pre-trained model\n\n 1. Download the model weights and the model-specific tokenizer and embeddings (see the table below).\n 2. Put the model weights into the `./data/saved_models/` dir.\n 3. Put the tokenizer and the embeddings into the `./data/` dir.\n 4. 
Create an input file that contains premises and hypotheses, delimited by the `\\t` character (see [example](https://mednli.blob.core.windows.net/shared/test_input.txt)).\n 5. Run the `predict.py` script and provide the input data on STDIN: `python predict.py \u003c data/input.txt`. The resulting probabilities of the `contradiction`, `neutral`, and `entailment` classes, respectively, will be printed to STDOUT. If you do not want to see the logging and wish to save the results to a file, redirect STDERR to /dev/null and STDOUT to a file: `python predict.py \u003c data/test_input.txt 2\u003e/dev/null \u003e data/test_input_probabilities.txt`\n \nYou can configure the model weights, tokenizer, and embeddings filenames using command-line arguments:\n```\npython train_model.py with model_class=PyTorchInferSentModel model_weights_filename=PyTorchInferSentModel_50_glove_bio_asq_mimic_clinical__.slysamwq.h5 tokenizer_filename=tokenizer_clinical_.pickled embeddings_filename=embeddings_clinical_.pickled\n``` \n\n| Model description | Model files and parameters |\n|  ------------- |  ------------- | \n|InferSent model, trained on MedNLI only using the glove_bio_asq_mimic word vectors |  model_class: `PyTorchInferSentModel`  \u003cbr/\u003e [model weights](https://mednli.blob.core.windows.net/shared/pretrained_models/PyTorchInferSentModel_50_glove_bio_asq_mimic_clinical__.slysamwq.h5) \u003cbr/\u003e [tokenizer](https://mednli.blob.core.windows.net/shared/pretrained_models/tokenizer_clinical_.pickled) \u003cbr/\u003e [embeddings](https://mednli.blob.core.windows.net/shared/pretrained_models/embeddings_clinical_.pickled) |\n\nMore models coming soon!\n\n\n### Configuration options\n```python\nmodel_class = 'PyTorchInferSentModel' # class name of the model to run. 
See the `create_model` function for the available models\nmax_len = 50 # max sentence length\nlowercase = False # lowercase input data or not\nclean = False # remove punctuation etc. or not\nstem = False # do stemming or not\nword_vectors_type = 'glove'  # word vectors - see the `WORD_VECTORS_FILENAME` in `config.py` for details\nword_vectors_replace_cui = ''  # filename with retrofitted embeddings for CUIs, e.g. cui.glove.cbow_most_common.CHD-PAR.SNOMEDCT_US.retrofitted.pkl\ndownsample_source = 0 # downsample the source domain data to the size of MedNLI\n\n# transfer learning settings\ngenre_source = 'clinical' # source domain for transfer learning. target='' and tune='' - no transfer\ngenre_target = '' # target domain - always MedNLI in the experiments in the paper\ngenre_tune = '' # fine-tuning domain\nlambda_multi_task = -1 # whether to use dynamically sampled batches from different domains or not.\nuniform_batches = True # a batch will contain samples from just one domain\n\nrnn_size = 300 # size of LSTM\nrnn_cell = 'LSTM' # LSTM is used in the experiments in the paper\nregularization = 0.000001 # regularization strength\ndropout = 0.5 # dropout\nhidden_size = 300 # size of the hidden fully-connected layers\ntrainable_embeddings = False # train embeddings or not\n\n# knowledge-based attention\n# set both to True to reproduce the token-level UMLS attention used in the paper\nuse_umls_attention = False # whether to use the knowledge-based attention or not\nuse_token_level_attention = False # use CUIs or separate tokens for attention\n\nbatch_size = 512 # batch size\nepochs = 40 # number of epochs for training\nlearning_rate = 0.001 # learning rate for the Adam optimizer\ntraining_loop_mode = 'best_loss'  # best_loss or best_acc - the model will be saved based on the best loss or accuracy on the validation set, respectively\n\n```\n\n\n## Experiments in the paper\n\n### Baselines\nTo run the BOW, InferSent, and ESIM models with default settings, use the 
following commands, respectively:\n\n```\npython train_model.py with model_class=PyTorchSimpleModel\npython train_model.py with model_class=PyTorchInferSentModel\npython train_model.py with model_class=PyTorchESIMModel\n```\n\n### Transfer learning\nTo pre-train the model on the `Slate` domain, fine-tune on MedNLI, and test on the MedNLI dev set (Sequential transfer in the paper), run the following command:\n\n`python train_model.py with genre_source=slate genre_tune=clinical genre_target=clinical`\n\nTo run multi-target transfer learning, specify the genres and use the corresponding versions of the models: `PyTorchMultiTargetSimpleModel`, `PyTorchMultiTargetInferSentModel`, and `PyTorchMultiTargetESIMModel`.\n\n\n### Word embeddings\nAll word embeddings have to be pickled first - see the `pickle_word_vectors.py` script.\nTo run the model with specific embeddings, use the `word_vectors_type` parameter:\n\n`python train_model.py with word_vectors_type=wiki_en_mimic`\n\n### Retrofitting\n\n - First, create the input data for retrofitting with the `create_retrofitting_data.py` script. \n - Second, run the official script from GitHub (https://github.com/mfaruqui/retrofitting).\n - Next, pickle the resulting word vectors with the `pickle_word_vectors.py` script.\n - Finally, set the `word_vectors_replace_cui` parameter to the pickled retrofitted vectors:\n   - `python train_model.py with word_vectors_replace_cui=cui.glove.cbow_most_common.CHD-PAR.SNOMEDCT_US.retrofitted.pkl`\n   \n   \n### Knowledge-directed attention\nSet both `use_umls_attention` and `use_token_level_attention` to `True` to reproduce the token-level UMLS attention experiments:\n\n`python train_model.py with use_umls_attention=True use_token_level_attention=True`\n\n\n# Reference\nThe paper was accepted to EMNLP 2018! Meanwhile, here is an extended arXiv version:\n\nRomanov, A., \u0026 Shivade, C. (2018). Lessons from Natural Language Inference in the Clinical Domain. 
arXiv preprint arXiv:1808.06752.  \nhttps://arxiv.org/abs/1808.06752\n\n\n```\n@article{romanov2018lessons,\n\ttitle = {Lessons from Natural Language Inference in the Clinical Domain},\n\turl = {http://arxiv.org/abs/1808.06752},\n\tabstract = {State of the art models using deep neural networks have become very good in learning an accurate mapping from inputs to outputs. However, they still lack generalization capabilities in conditions that differ from the ones encountered during training. This is even more challenging in specialized, and knowledge intensive domains, where training data is limited. To address this gap, we introduce {MedNLI} - a dataset annotated by doctors, performing a natural language inference task ({NLI}), grounded in the medical history of patients. We present strategies to: 1) leverage transfer learning using datasets from the open domain, (e.g. {SNLI}) and 2) incorporate domain knowledge from external data and lexical sources (e.g. medical terminologies). Our results demonstrate performance gains using both strategies.},\n\tjournaltitle = {{arXiv}:1808.06752 [cs]},\n\tauthor = {Romanov, Alexey and Shivade, Chaitanya},\n\turldate = {2018-08-27},\n\tdate = {2018-08-21},\n\teprinttype = {arxiv},\n\teprint = {1808.06752},\n}\n```\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fjgc128%2Fmednli","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fjgc128%2Fmednli","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fjgc128%2Fmednli/lists"}