{"id":13689373,"url":"https://github.com/dmis-lab/biobert","last_synced_at":"2025-05-15T11:06:17.410Z","repository":{"id":34062780,"uuid":"167415972","full_name":"dmis-lab/biobert","owner":"dmis-lab","description":"Bioinformatics'2020: BioBERT: a pre-trained biomedical language representation model for biomedical text mining","archived":false,"fork":false,"pushed_at":"2023-08-13T21:11:54.000Z","size":508,"stargazers_count":2038,"open_issues_count":57,"forks_count":468,"subscribers_count":62,"default_branch":"master","last_synced_at":"2025-04-14T19:57:08.833Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":"http://doi.org/10.1093/bioinformatics/btz682","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"other","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/dmis-lab.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null}},"created_at":"2019-01-24T18:27:35.000Z","updated_at":"2025-04-13T16:30:07.000Z","dependencies_parsed_at":"2023-01-15T04:19:09.599Z","dependency_job_id":"4955928d-fe94-4831-9618-14358c8dc67c","html_url":"https://github.com/dmis-lab/biobert","commit_stats":null,"previous_names":[],"tags_count":1,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/dmis-lab%2Fbiobert","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/dmis-lab%2Fbiobert/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/dmis-lab%2Fbiobert/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/dmis-lab%2Fbiobert/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/dmis-lab","download_url":"https://codel
oad.github.com/dmis-lab/biobert/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":254328385,"owners_count":22052632,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2024-08-02T15:01:45.083Z","updated_at":"2025-05-15T11:06:17.361Z","avatar_url":"https://github.com/dmis-lab.png","language":"Python","funding_links":[],"categories":["Uncategorized","其他_生物医药","Information Extraction and NLP","Python"],"sub_categories":["Uncategorized","网络服务_其他"],"readme":"# BioBERT\nThis repository provides the code for fine-tuning BioBERT, a biomedical language representation model designed for biomedical text mining tasks such as biomedical named entity recognition, relation extraction, question answering, etc.\nPlease refer to our paper [BioBERT: a pre-trained biomedical language representation model for biomedical text mining](http://doi.org/10.1093/bioinformatics/btz682) for more details.\nThis project is done by [DMIS-Lab](https://dmis.korea.ac.kr).\n\n## Download\nWe provide six versions of pre-trained weights. Pre-training was based on the [original BERT code](https://github.com/google-research/bert) provided by Google, and training details are described in our paper. 
Currently available versions of pre-trained weights are as follows ([SHA1SUM](http://nlp.dmis.korea.edu/projects/biobert-2020-checkpoints/sha1sum.html)):\n\n* **[BioBERT-Base v1.2 (+ PubMed 1M)](https://huggingface.co/dmis-lab/biobert-base-cased-v1.2)** - trained in the same way as BioBERT-Base v1.1 but includes LM head, which can be useful for probing (available in PyTorch)\n* **[BioBERT-Large v1.1 (+ PubMed 1M)](http://nlp.dmis.korea.edu/projects/biobert-2020-checkpoints/biobert_large_v1.1_pubmed.tar.gz)** - based on BERT-large-Cased (custom 30k vocabulary), [NER/QA Results](https://github.com/dmis-lab/biobert/wiki/BioBERT-Large-Results)\n* **[BioBERT-Base v1.1 (+ PubMed 1M)](http://nlp.dmis.korea.edu/projects/biobert-2020-checkpoints/biobert_v1.1_pubmed.tar.gz)** - based on BERT-base-Cased (same vocabulary), [Results in the Paper](http://doi.org/10.1093/bioinformatics/btz682)\n* **[BioBERT-Base v1.0 (+ PubMed 200K)](http://nlp.dmis.korea.edu/projects/biobert-2020-checkpoints/biobert_v1.0_pubmed.tar.gz)** - based on BERT-base-Cased (same vocabulary), [Results in the Paper](http://doi.org/10.1093/bioinformatics/btz682)\n* **[BioBERT-Base v1.0 (+ PMC 270K)](http://nlp.dmis.korea.edu/projects/biobert-2020-checkpoints/biobert_v1.0_pmc.tar.gz)** - based on BERT-base-Cased (same vocabulary), [Results in the Paper](http://doi.org/10.1093/bioinformatics/btz682)\n* **[BioBERT-Base v1.0 (+ PubMed 200K + PMC 270K)](http://nlp.dmis.korea.edu/projects/biobert-2020-checkpoints/biobert_v1.0_pubmed_pmc.tar.gz)** - based on BERT-base-Cased (same vocabulary), [Results in the Paper](http://doi.org/10.1093/bioinformatics/btz682)\n\nNote that the performances of v1.0 and v1.1 base models (BioBERT-Base v1.0, BioBERT-Base v1.1) are reported in the paper.\nAlternately, you can download pre-trained weights from [here](https://github.com/naver/biobert-pretrained/releases)\n\n## Installation\nSections below describe the installation and the fine-tuning process of BioBERT based on 
TensorFlow 1 (Python version \u003c= 3.7).\nFor the PyTorch version of BioBERT, you can check out [this repository](https://github.com/dmis-lab/biobert-pytorch).\nIf you are not familiar with coding and just want to recognize biomedical entities in your text using BioBERT, please use [this tool](https://bern.korea.ac.kr), which uses BioBERT for multi-type NER and normalization.\n\nTo fine-tune BioBERT, you need to download the [pre-trained weights of BioBERT](https://github.com/naver/biobert-pretrained).\nAfter downloading the pre-trained weights, use `requirements.txt` to install the required packages as follows:\n```bash\n$ git clone https://github.com/dmis-lab/biobert.git\n$ cd biobert; pip install -r requirements.txt\n```\nNote that this repository is based on the [BERT repository](https://github.com/google-research/bert) by Google.\nAll the fine-tuning experiments were conducted on a single TITAN Xp GPU machine which has 12GB of RAM.\nYou might want to install `java` to use the official evaluation script of BioASQ. 
See `requirements.txt` for other details.\n\n## Quick Links\nLink | Detail\n------------- | -------------\n[BioBERT-PyTorch](https://github.com/dmis-lab/biobert-pytorch) | PyTorch-based BioBERT implementation\n[BERN](http://bern.korea.ac.kr) | Web-based biomedical NER + normalization using BioBERT\n[BERN2](http://bern2.korea.ac.kr) | Advanced version of BERN (web-based biomedical NER) w/ NER from BioLM + NEN from PubMedBERT\n[covidAsk](https://covidask.korea.ac.kr) | BioBERT based real-time question answering model for COVID-19\n[7th BioASQ](https://github.com/dmis-lab/bioasq-biobert) | Code for the seventh BioASQ challenge winning model (factoid/yesno/list)\n[Paper](https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btz682/5566506) | Paper link with [BibTeX](https://github.com/dmis-lab/biobert#citation) (Bioinformatics)\n\n## FAQs\n*   [How can I use BioBERT with PyTorch?](https://github.com/dmis-lab/biobert-pytorch)\n*   [Can I get word/sentence embeddings using BioBERT?](https://github.com/dmis-lab/biobert/issues/23)\n*   [How can I pre-train QA models on SQuAD?](https://github.com/dmis-lab/biobert/issues/10)\n*   [What vocabulary does BioBERT use?](https://github.com/naver/biobert-pretrained/issues/1)\n\n## Datasets\nWe provide a pre-processed version of benchmark datasets for each task as follows:\n*   **[`Named Entity Recognition`](http://nlp.dmis.korea.edu/projects/biobert-2020-checkpoints/NERdata.zip)**: (17.3 MB), 8 datasets on biomedical named entity recognition\n*   **[`Relation Extraction`](http://nlp.dmis.korea.edu/projects/biobert-2020-checkpoints/REdata.zip)**: (2.5 MB), 2 datasets on biomedical relation extraction\n*   **[`Question Answering`](http://nlp.dmis.korea.edu/projects/biobert-2020-checkpoints/QA.zip)**: (5.23 MB), 3 datasets on biomedical question answering task.\n\nYou can simply run `download.sh` to download all the datasets at once.\n```bash\n$ ./download.sh\n```\nThis will download the datasets under the 
folder `datasets`.\nBecause of copyright restrictions on the other datasets, we provide links to them instead: **[`2010 i2b2/VA`](https://www.i2b2.org/NLP/DataSets/Main.php)**, **[`ChemProt`](http://www.biocreative.org/)**.\n\n## Fine-tuning BioBERT\nAfter downloading one of the pre-trained weights, unpack it to any directory you want; we will denote this directory as `$BIOBERT_DIR`.\nFor instance, when using BioBERT-Base v1.1 (+ PubMed 1M), set the `BIOBERT_DIR` environment variable as:\n```bash\n$ export BIOBERT_DIR=./biobert_v1.1_pubmed\n$ echo $BIOBERT_DIR\n\u003e\u003e\u003e ./biobert_v1.1_pubmed\n```\n\n### Named Entity Recognition (NER)\nLet `$NER_DIR` denote a folder for a single NER dataset that contains `train_dev.tsv`, `train.tsv`, `devel.tsv` and `test.tsv`. Also, set `$OUTPUT_DIR` as a directory for NER outputs (trained models, test predictions, etc.). For example, when fine-tuning on the NCBI disease corpus:\n```bash\n$ export NER_DIR=./datasets/NER/NCBI-disease\n$ export OUTPUT_DIR=./ner_outputs\n```\nThe following command runs the fine-tuning code on NER with default arguments.\n```bash\n$ mkdir -p $OUTPUT_DIR\n$ python run_ner.py --do_train=true --do_eval=true --vocab_file=$BIOBERT_DIR/vocab.txt --bert_config_file=$BIOBERT_DIR/bert_config.json --init_checkpoint=$BIOBERT_DIR/model.ckpt-1000000 --num_train_epochs=10.0 --data_dir=$NER_DIR --output_dir=$OUTPUT_DIR\n```\nYou can change the arguments as you want. 
Once you have trained your model, you can run it in inference mode with `--do_train=false --do_predict=true` to evaluate `test.tsv`.\nThe token-level evaluation result will be printed to stdout.\nFor example, the result for the NCBI-disease dataset will look like this:\n```\nINFO:tensorflow:***** token-level evaluation results *****\nINFO:tensorflow:  eval_f = 0.8972311\nINFO:tensorflow:  eval_precision = 0.88150835\nINFO:tensorflow:  eval_recall = 0.9136615\nINFO:tensorflow:  global_step = 2571\nINFO:tensorflow:  loss = 28.247158\n```\n(Tip: scroll up a few lines to find the result. It comes before `INFO:tensorflow:**** Trainable Variables ****`.)\n\nNote that this result is the token-level evaluation measure, while the official evaluation should use the entity-level evaluation measure.\nThe results of `python run_ner.py` will be recorded as two files: `token_test.txt` and `label_test.txt` in `$OUTPUT_DIR`.\nUse `./biocodes/ner_detokenize.py` to obtain the word-level prediction file.\n```bash\n$ python biocodes/ner_detokenize.py --token_test_path=$OUTPUT_DIR/token_test.txt --label_test_path=$OUTPUT_DIR/label_test.txt --answer_path=$NER_DIR/test.tsv --output_dir=$OUTPUT_DIR\n```\nThis will generate `NER_result_conll.txt` in `$OUTPUT_DIR`.\nUse `./biocodes/conlleval.pl` for entity-level exact match evaluation results.\n```bash\n$ perl biocodes/conlleval.pl \u003c $OUTPUT_DIR/NER_result_conll.txt\n```\n\nThe entity-level results for the NCBI disease corpus will look like this:\n```\nprocessed 24497 tokens with 960 phrases; found: 983 phrases; correct: 852.\naccuracy:  98.49%; precision:  86.67%; recall:  88.75%; FB1:  87.70\n             MISC: precision:  86.67%; recall:  88.75%; FB1:  87.70  983\n```\nNote that this is a sample run of an NER model.\nThe performance of NER models usually converges at more than 50 epochs (learning rate = 1e-5 is recommended).\n\n### Relation Extraction (RE)\nLet `$RE_DIR` indicate a folder for a single RE dataset, `$TASK_NAME` 
denote the name of the task (two possible options: {gad, euadr}), and `$OUTPUT_DIR` denote a directory for RE outputs:\n```bash\n$ export RE_DIR=./datasets/RE/GAD/1\n$ export TASK_NAME=gad\n$ export OUTPUT_DIR=./re_outputs_1\n```\nThe following command runs the fine-tuning code on RE with default arguments.\n```bash\n$ python run_re.py --task_name=$TASK_NAME --do_train=true --do_eval=true --do_predict=true --vocab_file=$BIOBERT_DIR/vocab.txt --bert_config_file=$BIOBERT_DIR/bert_config.json --init_checkpoint=$BIOBERT_DIR/model.ckpt-1000000 --max_seq_length=128 --train_batch_size=32 --learning_rate=2e-5 --num_train_epochs=3.0 --do_lower_case=false --data_dir=$RE_DIR --output_dir=$OUTPUT_DIR\n```\nThe predictions will be saved into a file called `test_results.tsv` in the `$OUTPUT_DIR`.\nUse `./biocodes/re_eval.py` for the evaluation.\nNote that the CHEMPROT dataset is a multi-class classification dataset, and to evaluate the CHEMPROT result you should run `re_eval.py` with the additional `--task=chemprot` flag.\n```bash\n$ python ./biocodes/re_eval.py --output_path=$OUTPUT_DIR/test_results.tsv --answer_path=$RE_DIR/test.tsv\n```\nThe result for the GAD dataset will look like this:\n```\nf1 score    : 83.74%\nrecall      : 90.75%\nprecision   : 77.74%\nspecificity : 71.15%\n```\nPlease be aware that you have to change `$OUTPUT_DIR` to train/test a new model.\nFor instance, as most RE datasets are split into 10 folds, you have to make a different output directory to train/test a model for a different fold (e.g., `$ export OUTPUT_DIR=./re_outputs_2`).\n\n### Question Answering (QA)\nTo use the BioASQ dataset, you need to register on the [BioASQ website](http://participants-area.bioasq.org/general_information/general_information_registration/), which authorizes the use of the dataset.\nPlease unpack the pre-processed BioASQ dataset provided above to a directory `$QA_DIR`.\nFor example, with `$OUTPUT_DIR` for QA outputs, set as:\n```bash\n$ export QA_DIR=./datasets/QA/BioASQ\n$ export 
OUTPUT_DIR=./qa_outputs\n```\nFiles named `BioASQ-*.json` are used for training and testing the model; they are in the pre-processed format for BioBERT.\nNote that we pre-trained our model on the SQuAD dataset to get state-of-the-art performance (see [here](https://github.com/dmis-lab/bioasq-biobert) to get BioBERT pre-trained on SQuAD), and you might have to change `$BIOBERT_DIR` accordingly.\nThe following command runs the fine-tuning code on QA with default arguments.\n```bash\n$ python run_qa.py --do_train=True --do_predict=True --vocab_file=$BIOBERT_DIR/vocab.txt --bert_config_file=$BIOBERT_DIR/bert_config.json --init_checkpoint=$BIOBERT_DIR/model.ckpt-1000000 --max_seq_length=384 --train_batch_size=12 --learning_rate=5e-6 --doc_stride=128 --num_train_epochs=5.0 --do_lower_case=False --train_file=$QA_DIR/BioASQ-train-factoid-4b.json --predict_file=$QA_DIR/BioASQ-test-factoid-4b-1.json --output_dir=$OUTPUT_DIR\n```\nThe predictions will be saved into the files `predictions.json` and `nbest_predictions.json` in `$OUTPUT_DIR`.\nRun `./biocodes/transform_nbset2bioasqform.py` to convert `nbest_predictions.json` to the BioASQ JSON format, which will be used for the official evaluation.\n```bash\n$ python ./biocodes/transform_nbset2bioasqform.py --nbest_path=$OUTPUT_DIR/nbest_predictions.json --output_path=$OUTPUT_DIR\n```\nThis will generate `BioASQform_BioASQ-answer.json` in `$OUTPUT_DIR`.\nClone the **[`evaluation code`](https://github.com/BioASQ/Evaluation-Measures)** from the BioASQ GitHub and run the evaluation code in the `Evaluation-Measures` directory. 
Please note that you should always pass 5 as the parameter for `-e`.\n```bash\n$ git clone https://github.com/BioASQ/Evaluation-Measures.git\n$ cd Evaluation-Measures\n$ java -Xmx10G -cp $CLASSPATH:./flat/BioASQEvaluation/dist/BioASQEvaluation.jar evaluation.EvaluatorTask1b -phaseB -e 5 ../$QA_DIR/4B1_golden.json ../$OUTPUT_DIR/BioASQform_BioASQ-answer.json\n```\nAs our model handles only factoid questions, the result will look like this:\n```\n0.0 0.3076923076923077 0.5384615384615384 0.394017094017094 0.0 0.0 0.0 0.0 0.0 0.0\n```\nwhere the second, third and fourth numbers are the SAcc, LAcc, and MRR of factoid questions, respectively.\nFor list and yes/no type questions, please refer to our repository for [BioBERT at the 7th BioASQ Challenge](https://github.com/dmis-lab/bioasq-biobert).\n\n## License and Disclaimer\nPlease see the LICENSE file for details. Downloading data indicates your acceptance of our disclaimer.\n\n## Citation\n```bibtex\n@article{lee2020biobert,\n  title={BioBERT: a pre-trained biomedical language representation model for biomedical text mining},\n  author={Lee, Jinhyuk and Yoon, Wonjin and Kim, Sungdong and Kim, Donghyeon and Kim, Sunkyu and So, Chan Ho and Kang, Jaewoo},\n  journal={Bioinformatics},\n  volume={36},\n  number={4},\n  pages={1234--1240},\n  year={2020},\n  publisher={Oxford University Press}\n}\n```\n\n## Contact Information\nFor help or issues using BioBERT, please submit a GitHub issue. Please contact Jinhyuk Lee\n(`lee.jnhk (at) gmail.com`), or Wonjin Yoon (`wonjin.info (at) gmail.com`) for communication related to BioBERT.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fdmis-lab%2Fbiobert","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fdmis-lab%2Fbiobert","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fdmis-lab%2Fbiobert/lists"}