{"id":21441768,"url":"https://github.com/dmis-lab/bioasq-biobert","last_synced_at":"2025-07-14T13:03:27.860Z","repository":{"id":36151143,"uuid":"199038422","full_name":"dmis-lab/bioasq-biobert","owner":"dmis-lab","description":"Pre-trained Language Model for Biomedical Question Answering","archived":false,"fork":false,"pushed_at":"2023-03-24T22:23:49.000Z","size":96,"stargazers_count":119,"open_issues_count":6,"forks_count":21,"subscribers_count":10,"default_branch":"v1.0","last_synced_at":"2024-05-14T00:23:17.045Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":"https://arxiv.org/abs/1909.08229","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"other","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/dmis-lab.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null}},"created_at":"2019-07-26T15:14:38.000Z","updated_at":"2024-04-25T23:14:29.000Z","dependencies_parsed_at":"2023-01-16T22:56:01.157Z","dependency_job_id":"ceefa35b-033f-4fff-b769-d827bfb0ccbf","html_url":"https://github.com/dmis-lab/bioasq-biobert","commit_stats":null,"previous_names":[],"tags_count":1,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/dmis-lab%2Fbioasq-biobert","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/dmis-lab%2Fbioasq-biobert/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/dmis-lab%2Fbioasq-biobert/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/dmis-lab%2Fbioasq-biobert/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/dmis-lab","download_url":"https://codeload.github.com/dmis-
lab/bioasq-biobert/tar.gz/refs/heads/v1.0","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":225990493,"owners_count":17556152,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2024-11-23T01:42:13.719Z","updated_at":"2024-11-23T01:42:14.237Z","avatar_url":"https://github.com/dmis-lab.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"## Pre-trained Language Model for Biomedical Question Answering \u003cbr\u003e _BioBERT at BioASQ 7b -Phase B_\nThis repository provides the source code and pre-processed datasets of our participating model for the BioASQ Challenge 7b. We utilized BioBERT, a language representation model for the biomedical domain, with minimum modifications for the challenge. 
\n\u003cbr\u003ePlease refer to our paper [Pre-trained Language Model for Biomedical Question Answering](https://arxiv.org/abs/1909.08229) for more details.\nThis paper was accepted for an oral presentation at the **BioASQ Workshop @ ECML PKDD 2019**.\n\n## Citation\n\nPlease cite [the published version of the paper](https://link.springer.com/chapter/10.1007/978-3-030-43887-6_64):\n```\n@InProceedings{10.1007/978-3-030-43887-6_64,\n  author=\"Yoon, Wonjin and Lee, Jinhyuk and Kim, Donghyeon and Jeong, Minbyul and Kang, Jaewoo\",\n  editor=\"Cellier, Peggy and Driessens, Kurt\",\n  title=\"Pre-trained Language Model for Biomedical Question Answering\",\n  booktitle=\"Machine Learning and Knowledge Discovery in Databases\",\n  year=\"2020\",\n  publisher=\"Springer International Publishing\",\n  address=\"Cham\",\n  pages=\"727--740\",\n  isbn=\"978-3-030-43887-6\"\n}\n```\n\nPlease also cite the [BioBERT paper](http://dx.doi.org/10.1093/bioinformatics/btz682), since our model is based on the BioBERT pre-trained weights. \n```\n@article{lee2019biobert,\n  title={BioBERT: a pre-trained biomedical language representation model for biomedical text mining},\n  author={Lee, Jinhyuk and Yoon, Wonjin and Kim, Sungdong and Kim, Donghyeon and Kim, Sunkyu and So, Chan Ho and Kang, Jaewoo},\n  doi = {10.1093/bioinformatics/btz682}, \n  journal={Bioinformatics},\n  year={2019}\n}\n```\n\n## Installation\nPlease note that this repository is based on the [BioBERT repository](https://github.com/dmis-lab/biobert).\n\n### Pre-trained weights (Pre-trained on SQuAD)\nWe are releasing the pre-trained weights for the BioBERT system described in the paper. The weights are pre-trained on the `SQuAD v1.1` or `SQuAD v2.0` dataset on top of `BioBERT v1.1` (1M steps pre-trained on the PubMed corpus).\nWe used only the training set of each SQuAD dataset. 
\n\u003cbr\u003eFor best performance, please use `BioBERT v1.1-SQuAD v1.1` for factoid and list questions and `BioBERT v1.1-SQuAD v2.0` for yesno questions.\n*   **[`BioBERT v1.1 - SQuAD v1.1`](https://drive.google.com/file/d/1-jIhuBOXv8ncXKAoFTWcN9fGiY97HCcN/view?usp=sharing)** : Recommended for factoid and list questions\n\u003cbr\u003e`SHA-1 Checksum : 408809150A23B4B99EFD21AF2B4ACEA52B31F3D9`\n*   **[`BioBERT v1.1 - SQuAD v2.0`](http://nlp.dmis.korea.edu/projects/bioasq-biobert-2019-checkpoints/BERT-pubmed-1000000-SQuAD2.tar.gz)** : Recommended for yesno questions\n\u003cbr\u003e`SHA-1 Checksum : 9A10621691BFEB834CBFD5F81E9D2C099247803A`\n*   **[`bert_config.json`](http://nlp.dmis.korea.edu/projects/bioasq-biobert-2019-checkpoints/bert_config.json) [`vocab.txt`](http://nlp.dmis.korea.edu/projects/bioasq-biobert-2019-checkpoints/vocab.txt)** : Essential files.\n\nAs an alternative, you may wish to pre-train on SQuAD yourself. In that case, please follow these steps:\n```\n1. Fine-tune BioBERT on SQuAD dataset\n2. Use the resulting ckpt of 1 as an initial checkpoint for fine-tuning BioASQ datasets. \n```\nBe sure to set the output folder of step 2 to a different folder from that of step 1.\n\n## Datasets\nWe provide a pre-processed version of the BioASQ 6b/7b - Phase B datasets for each task as follows:\n*   **[`BioASQ 6b/7b`](https://drive.google.com/file/d/1-KefyBWOaCuswy9LFwnq7NC0H1Ymkv05/view?usp=sharing)** (23 MB) Last update : 15th Oct. 2019 \n\nDue to copyright issues, we cannot provide golden answers for the BioASQ 6b test dataset at the moment. \n**However, you can extract golden answers for 6b from the original BioASQ 7b dataset.**\nTo use the original BioASQ datasets, you should register on the [BioASQ website](http://participants-area.bioasq.org). \n\nFor details on the datasets, please see **An overview of the BIOASQ large-scale biomedical semantic indexing and question answering competition (Tsatsaronis et al. 
2015)**.\n\n## Fine-tuning BioBERT\nAfter downloading one of the pre-trained models, unpack it to any directory you want, which we will denote as `$BIOBERT_DIR`.\nYou also need to download the other essential files ([`bert_config.json`](https://drive.google.com/open?id=17fX1-oChZ5rxu-e-JuaZl2I96q1dGJO4) and [`vocab.txt`](https://drive.google.com/open?id=1GQUvBbXvlI_PeUPsZTqh7xQDZMOXh7ko)) to `$BIOBERT_DIR`. \n\nPlease download our pre-processed version of the BioASQ 6b/7b datasets and unpack it to `$BIOASQ_DIR`.\n\n### Training and predicting\n\nPlease use `run_factoid.py`, `run_yesno.py` and `run_list.py` for factoid, yesno and list questions respectively.\nUse the `BioASQ-*.json` files as training and testing datasets; we pre-processed the original BioASQ data into the SQuAD dataset format. \nThis is necessary because the input data format of BioBERT differs from the BioASQ dataset format. \nAlso, please note that the `do_lower_case` flag should be set as `--do_lower_case=False` since the BioBERT model is based on the `BERT-BASE (CASED)` model. 
\n\nAs an example, the following command runs fine-tuning and predicting code on factoid questions (6b; _full abstract_ method) with default arguments.\n\u003cbr\u003ePlease see [examplecode.sh](examplecode.sh) for yesno and list questions.\n\n``` \nexport BIOBERT_DIR=$HOME/BioASQ/BERT-pubmed-1000000-SQuAD\nexport BIOASQ_DIR=$HOME/BioASQ/data-release\n\npython run_factoid.py \\\n     --do_train=True \\\n     --do_predict=True \\\n     --vocab_file=$BIOBERT_DIR/vocab.txt \\\n     --bert_config_file=$BIOBERT_DIR/bert_config.json \\\n     --init_checkpoint=$BIOBERT_DIR/model.ckpt-14599 \\\n     --max_seq_length=384 \\\n     --train_batch_size=12 \\\n     --learning_rate=5e-6 \\\n     --doc_stride=128 \\\n     --num_train_epochs=5.0 \\\n     --do_lower_case=False \\\n     --train_file=$BIOASQ_DIR/BioASQ-6b/train/Full-Abstract/BioASQ-train-factoid-6b-full-annotated.json \\\n     --predict_file=$BIOASQ_DIR/BioASQ-6b/test/Full-Abstract/BioASQ-test-factoid-6b-3.json \\\n     --output_dir=/tmp/factoid_output/\n```\nYou can change the arguments as you want. 
Once you have trained your model, you can use it in inference mode by passing `--do_train=false --do_predict=true` to evaluate another JSON file with an identical structure.\n\nThe predictions will be saved to `predictions.json` and `nbest_predictions.json` in the `output_dir`.\nRun the matching transform script (for example, `transform_n2b_factoid.py`) in the `./biocodes/` folder to convert `nbest_predictions.json` or `predictions.json` to the BioASQ JSON format, which is used for the official evaluation.\n```\npython ./biocodes/transform_n2b_factoid.py --nbest_path={QA_output_dir}/nbest_predictions.json --output_path={output_dir}\npython ./biocodes/transform_n2b_yesno.py --nbest_path={QA_output_dir}/predictions.json --output_path={output_dir}\npython ./biocodes/transform_n2b_list.py --nbest_path={QA_output_dir}/nbest_predictions.json --output_path={output_dir}\n```\nThis will generate `BioASQform_BioASQ-answer.json` in `{output_dir}`.\nClone the **[`evaluation code`](https://github.com/BioASQ/Evaluation-Measures)** from the BioASQ GitHub and run it from the `Evaluation-Measures` directory. Please note that you should pass 5 as the parameter for `-e` if you are evaluating the system on the BioASQ 5b/6b/7b datasets.\n```\ncd Evaluation-Measures\njava -Xmx10G -cp $CLASSPATH:./flat/BioASQEvaluation/dist/BioASQEvaluation.jar evaluation.EvaluatorTask1b -phaseB -e 5 \\\n    $BIOASQ_DIR/6B1_golden.json \\\n    RESULTS_PATH/BioASQform_BioASQ-answer.json\n```\nAs our example is on factoid questions, the result will look like\n``` \n0.0 0.4358974358974359 0.6153846153846154 0.5072649572649572 0.0 0.0 0.0 0.0 0.0 0.0\n```\nwhere the second, third and fourth numbers are the SAcc, LAcc and MRR of factoid questions respectively.\n\nPlease be advised that the performance on yesno questions has relatively high variance. \nThe following are our results from five independent experiments on yesno (6b) questions (we used the `Snippet as-is` dataset and the `BioBERT v1.1 - SQuAD v2.0` model. 
Please see [examplecode.sh](examplecode.sh) for details).\n\n\n|          |  1st  |  2nd  |  3rd  |  4th  |  5th  | Average |\n|----------|-------|-------|-------|-------|-------|---------|\n| Macro F1 | 74.11 | 78.46 | 80.89 | 71.57 | 81.25 | **77.256**  |\n\n\n**Be sure to clean `output_dir` in order to perform independent experiments. Otherwise, our code will skip training and reuse the existing model in the `output_dir` for prediction.**\n\n## Requirement\n* GPU (Our setting was a Titan Xp with 12 GB of graphics memory)\n* Python 3 (Not working on Python 2; encoding issues for run_yesno.py)\n* TensorFlow v1.11 (Not working on TF v2)\n* For other software requirement details, please check `requirements.txt` \n\n## License and Disclaimer\nPlease read and agree to the `LICENSE` file for details. Downloading data indicates your acceptance of our disclaimer.\n\n\n## Contact information\n\nFor help or issues using our model, or for communication related to the paper, please contact Wonjin Yoon (`wonjin.info {at} gmail.com`).\n\u003cbr\u003eWe welcome any suggestions regarding this repository.\n\u003cbr\u003e**Please mention the name of our paper when you contact me (Wonjin)** since I maintain BioBERT and other repositories using BioBERT.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fdmis-lab%2Fbioasq-biobert","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fdmis-lab%2Fbioasq-biobert","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fdmis-lab%2Fbioasq-biobert/lists"}