{"id":21441734,"url":"https://github.com/dmis-lab/biosyn","last_synced_at":"2025-06-24T19:46:47.628Z","repository":{"id":41898702,"uuid":"256985612","full_name":"dmis-lab/BioSyn","owner":"dmis-lab","description":"ACL'2020: Biomedical Entity Representations with Synonym Marginalization","archived":false,"fork":false,"pushed_at":"2023-07-07T02:18:17.000Z","size":19543,"stargazers_count":163,"open_issues_count":3,"forks_count":28,"subscribers_count":10,"default_branch":"master","last_synced_at":"2025-04-08T02:01:44.530Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":"https://arxiv.org/abs/2005.00239","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/dmis-lab.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2020-04-19T11:48:04.000Z","updated_at":"2025-02-19T16:31:53.000Z","dependencies_parsed_at":"2024-11-23T01:41:29.373Z","dependency_job_id":"0723c197-2610-4711-b75e-195d730dac9c","html_url":"https://github.com/dmis-lab/BioSyn","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/dmis-lab/BioSyn","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/dmis-lab%2FBioSyn","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/dmis-lab%2FBioSyn/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/dmis-lab%2FBioSyn/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/dmis-lab%2FBioSyn/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/Git
Hub/owners/dmis-lab","download_url":"https://codeload.github.com/dmis-lab/BioSyn/tar.gz/refs/heads/master","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/dmis-lab%2FBioSyn/sbom","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":261745502,"owners_count":23203399,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2024-11-23T01:41:20.593Z","updated_at":"2025-06-24T19:46:47.554Z","avatar_url":"https://github.com/dmis-lab.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"\u003ch3 align=\"center\"\u003e\n\u003cp\u003eBioSyn\n\u003ca href=\"https://github.com/dmis-lab/BioSyn/blob/master/LICENSE\"\u003e\n   \u003cimg alt=\"GitHub\" src=\"https://img.shields.io/badge/License-MIT-yellow.svg\"\u003e\n\u003c/a\u003e\n\u003c/h3\u003e\n\u003cdiv align=\"center\"\u003e\n    \u003cp\u003e\u003cb\u003eBio\u003c/b\u003emedical Entity Representations with \u003cb\u003eSyn\u003c/b\u003eonym Marginalization\n\u003c/div\u003e\n\n\u003cdiv align=\"center\"\u003e\n  \u003cimg alt=\"BioSyn Overview\" src=\"https://github.com/dmis-lab/BioSyn/blob/master/images/biosyn_demo.gif\" width=\"600px\"\u003e\n\u003c/div\u003e\n\nWe present BioSyn for learning biomedical entity representations. You can train BioSyn with the two main components described in our [paper](https://arxiv.org/abs/2005.00239): 1) synonym marginalization and 2) iterative candidate retrieval. 
Once you train BioSyn, you can easily normalize any biomedical mentions or represent them as entity embeddings.\n\n### Updates\n* \\[**Mar 17, 2022**\\] Checkpoints of BioSyn for normalizing [gene type](https://github.com/dmis-lab/BioSyn/edit/master/README.md#bc2gn) are released. The BC2GN data used for the gene type has been pre-processed by [Tutubalina et al., 2020](https://github.com/insilicomedicine/Fair-Evaluation-BERT).\n* \\[**Oct 25, 2021**\\] Trained models are uploaded to the Hugging Face Hub (please check out [here](#trained-models)). Other than BioBERT, we also train our model using another pre-trained model, [SapBERT](https://github.com/cambridgeltl/sapbert), and obtain better performance than reported in our paper.\n\n## Requirements\n```bash\n$ conda create -n BioSyn python=3.7\n$ conda activate BioSyn\n$ conda install numpy tqdm scikit-learn\n$ conda install pytorch=1.8.0 cudatoolkit=10.2 -c pytorch\n$ pip install transformers==4.11.3\n```\nNote that the PyTorch build you install must match your CUDA version.\n\n### Datasets\n\nDatasets consist of queries (train, dev, test, and traindev), and dictionaries (train_dictionary, dev_dictionary, and test_dictionary). Note that the only difference between the dictionaries is that test_dictionary includes train and dev mentions, and dev_dictionary includes train mentions to increase the coverage. The queries are pre-processed with lowercasing, removing punctuation, resolving composite mentions and resolving abbreviations ([Ab3P](https://github.com/ncbi-nlp/Ab3P)). 
The dictionaries are pre-processed with lowercasing and removing punctuation (if you need the pre-processing code, please let us know by opening an issue).\n\nNote that we use the development (dev) set to search for hyperparameters, and train on the traindev (train+dev) set to report the final performance.\n\n- [ncbi-disease](https://drive.google.com/file/d/1mmV7p33E1iF32RzAET3MsLHPz1PiF9vc/view?usp=sharing)\n- [bc5cdr-disease](https://drive.google.com/file/d/1moAqukbrdpAPseJc3UELEY6NLcNk22AA/view?usp=sharing)\n- [bc5cdr-chemical](https://drive.google.com/file/d/1mgQhjAjpqWLCkoxIreLnNBYcvjdsSoGi/view?usp=sharing)\n\nThe `TAC2017ADR` dataset cannot be shared because of licensing restrictions. Please visit the [website](https://bionlp.nlm.nih.gov/tac2017adversereactions/) or see [here](https://github.com/dmis-lab/BioSyn/tree/master/preprocess) for pre-processing scripts.\n\n## Train\n\nThe following example fine-tunes our model on the NCBI-Disease dataset (train+dev) with BioBERTv1.1.\n\n```bash\nMODEL_NAME_OR_PATH=dmis-lab/biobert-base-cased-v1.1\nOUTPUT_DIR=./tmp/biosyn-biobert-ncbi-disease\nDATA_DIR=./datasets/ncbi-disease\n\nCUDA_VISIBLE_DEVICES=1 python train.py \\\n    --model_name_or_path ${MODEL_NAME_OR_PATH} \\\n    --train_dictionary_path ${DATA_DIR}/train_dictionary.txt \\\n    --train_dir ${DATA_DIR}/processed_traindev \\\n    --output_dir ${OUTPUT_DIR} \\\n    --use_cuda \\\n    --topk 20 \\\n    --epoch 10 \\\n    --train_batch_size 16 \\\n    --initial_sparse_weight 0 \\\n    --learning_rate 1e-5 \\\n    --max_length 25 \\\n    --dense_ratio 0.5\n```\n\nNote that you can train the model on `processed_train` and evaluate it on `processed_dev` when you want to search for hyperparameters. (The argument `--save_checkpoint_all` can be helpful.)\n\n## Evaluation\n\nThe following example evaluates our trained model on the NCBI-Disease dataset (test). 
\n\n```bash\nMODEL_NAME_OR_PATH=./tmp/biosyn-biobert-ncbi-disease\nOUTPUT_DIR=./tmp/biosyn-biobert-ncbi-disease\nDATA_DIR=./datasets/ncbi-disease\n\npython eval.py \\\n    --model_name_or_path ${MODEL_NAME_OR_PATH} \\\n    --dictionary_path ${DATA_DIR}/test_dictionary.txt \\\n    --data_dir ${DATA_DIR}/processed_test \\\n    --output_dir ${OUTPUT_DIR} \\\n    --use_cuda \\\n    --topk 20 \\\n    --max_length 25 \\\n    --save_predictions \\\n    --score_mode hybrid\n```\n\n### Result\n\nThe predictions are saved in `predictions_eval.json` with mentions, candidates and accuracies (the argument `--save_predictions` has to be on).\nFollowing is an example.\n\n```\n{\n  \"queries\": [\n    {\n      \"mentions\": [\n        {\n          \"mention\": \"ataxia telangiectasia\",\n          \"golden_cui\": \"D001260\",\n          \"candidates\": [\n            {\n              \"name\": \"ataxia telangiectasia\",\n              \"cui\": \"D001260|208900\",\n              \"label\": 1\n            },\n            {\n              \"name\": \"ataxia telangiectasia syndrome\",\n              \"cui\": \"D001260|208900\",\n              \"label\": 1\n            },\n            {\n              \"name\": \"ataxia telangiectasia variant\",\n              \"cui\": \"C566865\",\n              \"label\": 0\n            },\n            {\n              \"name\": \"syndrome ataxia telangiectasia\",\n              \"cui\": \"D001260|208900\",\n              \"label\": 1\n            },\n            {\n              \"name\": \"telangiectasia\",\n              \"cui\": \"D013684\",\n              \"label\": 0\n            }]\n        }]\n    },\n    ...\n    ],\n    \"acc1\": 0.9114583333333334,\n    \"acc5\": 0.9385416666666667\n}\n```\n\n## Inference\nWe provide a simple script that can normalize a biomedical mention or represent the mention into an embedding vector with BioSyn. 
\n### Trained models\n\n#### NCBI-Disease\n|              Model                | Acc@1/Acc@5 | \n|:----------------------------------|:--------:|  \n| [biosyn-biobert-ncbi-disease](https://huggingface.co/dmis-lab/biosyn-biobert-ncbi-disease) | 91.1/93.9 | \n| [biosyn-sapbert-ncbi-disease](https://huggingface.co/dmis-lab/biosyn-sapbert-ncbi-disease) | **92.4**/**95.8** |\n\n#### BC5CDR-Disease\n|              Model                | Acc@1/Acc@5 |\n|:----------------------------------|:--------:|\n| [biosyn-biobert-bc5cdr-disease](https://huggingface.co/dmis-lab/biosyn-biobert-bc5cdr-disease) | 93.2/96.0 |\n| [biosyn-sapbert-bc5cdr-disease](https://huggingface.co/dmis-lab/biosyn-sapbert-bc5cdr-disease) | **93.5**/**96.4** |\n\n#### BC5CDR-Chemical\n|              Model                | Acc@1/Acc@5 | \n|:----------------------------------|:--------:|\n| [biosyn-biobert-bc5cdr-chemical](https://huggingface.co/dmis-lab/biosyn-biobert-bc5cdr-chemical) | **96.6**/97.2 |\n| [biosyn-sapbert-bc5cdr-chemical](https://huggingface.co/dmis-lab/biosyn-sapbert-bc5cdr-chemical) | **96.6**/**98.3** |\n\n#### BC2GN-Gene\n|              Model                | Acc@1/Acc@5 |\n|:----------------------------------|:--------:| \n| [biosyn-biobert-bc2gn](https://huggingface.co/dmis-lab/biosyn-biobert-bc2gn) | 90.6/95.6 |\n| [biosyn-sapbert-bc2gn](https://huggingface.co/dmis-lab/biosyn-sapbert-bc2gn) | 91.3/96.3  | \n\n### Predictions (Top 5)\n\nThe example below gives the top 5 predictions for a mention `ataxia telangiectasia`. Note that the initial run will take some time to embed the whole dictionary. 
You can download the dictionary file [here](https://github.com/dmis-lab/BioSyn#datasets).\n\n```bash\nMODEL_NAME_OR_PATH=dmis-lab/biosyn-biobert-ncbi-disease\nDATA_DIR=./datasets/ncbi-disease\n\npython inference.py \\\n    --model_name_or_path ${MODEL_NAME_OR_PATH} \\\n    --dictionary_path ${DATA_DIR}/test_dictionary.txt \\\n    --use_cuda \\\n    --mention \"ataxia telangiectasia\" \\\n    --show_predictions\n```\n\n#### Result\n```json\n{\n  \"mention\": \"ataxia telangiectasia\", \n  \"predictions\": [\n    {\"name\": \"ataxia telangiectasia\", \"id\": \"D001260|208900\"},\n    {\"name\": \"ataxia telangiectasia syndrome\", \"id\": \"D001260|208900\"}, \n    {\"name\": \"telangiectasia\", \"id\": \"D013684\"}, \n    {\"name\": \"telangiectasias\", \"id\": \"D013684\"}, \n    {\"name\": \"ataxia telangiectasia variant\", \"id\": \"C566865\"}\n  ]\n}\n```\n\n### Embeddings\nThe example below gives the embedding of the mention `ataxia telangiectasia`.\n\n```bash\nMODEL_NAME_OR_PATH=dmis-lab/biosyn-biobert-ncbi-disease\nDATA_DIR=./datasets/ncbi-disease\n\npython inference.py \\\n    --model_name_or_path ${MODEL_NAME_OR_PATH} \\\n    --use_cuda \\\n    --mention \"ataxia telangiectasia\" \\\n    --show_embeddings\n```\n\n#### Result\n```\n{\n  \"mention\": \"ataxia telangiectasia\", \n  \"mention_sparse_embeds\": array([0.05979538, 0., ..., 0., 0.], dtype=float32),\n  \"mention_dense_embeds\": array([-7.14258850e-02, ..., -4.03847933e-01,],dtype=float32)\n}\n```\n\n## Demo\n\n### How to run web demo\n\nThe web demo is implemented with the [Tornado](https://www.tornadoweb.org/) framework.\nIf a dictionary is not yet cached, it will take a couple of minutes to create the dictionary cache.\n\n```bash\nMODEL_NAME_OR_PATH=dmis-lab/biosyn-biobert-ncbi-disease\n\npython demo.py \\\n  --model_name_or_path ${MODEL_NAME_OR_PATH} \\\n  --use_cuda \\\n  --dictionary_path ./datasets/ncbi-disease/test_dictionary.txt\n```\n\n## Citations\n```bibtex\n@inproceedings{sung2020biomedical,\n    
title={Biomedical Entity Representations with Synonym Marginalization},\n    author={Sung, Mujeen and Jeon, Hwisang and Lee, Jinhyuk and Kang, Jaewoo},\n    booktitle={ACL},\n    year={2020},\n}\n```\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fdmis-lab%2Fbiosyn","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fdmis-lab%2Fbiosyn","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fdmis-lab%2Fbiosyn/lists"}