{"id":21361738,"url":"https://github.com/ganjinzero/coder","last_synced_at":"2025-10-25T14:10:07.697Z","repository":{"id":40460061,"uuid":"286878813","full_name":"GanjinZero/CODER","owner":"GanjinZero","description":"CODER: Knowledge infused cross-lingual medical term embedding for term normalization. [JBI, ACL-BioNLP 2022]","archived":false,"fork":false,"pushed_at":"2022-06-28T06:35:49.000Z","size":5901,"stargazers_count":49,"open_issues_count":1,"forks_count":5,"subscribers_count":2,"default_branch":"master","last_synced_at":"2023-03-05T10:19:09.710Z","etag":null,"topics":["embeddings","medical","multi-language","nlp","pretrained-language-model","umls"],"latest_commit_sha":null,"homepage":"https://www.sciencedirect.com/science/article/pii/S1532046421003129","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/GanjinZero.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2020-08-12T00:40:19.000Z","updated_at":"2023-03-04T17:58:34.000Z","dependencies_parsed_at":"2022-08-09T21:01:27.317Z","dependency_job_id":null,"html_url":"https://github.com/GanjinZero/CODER","commit_stats":null,"previous_names":[],"tags_count":null,"template":null,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/GanjinZero%2FCODER","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/GanjinZero%2FCODER/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/GanjinZero%2FCODER/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/GanjinZero%2FCODER/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owner
s/GanjinZero","download_url":"https://codeload.github.com/GanjinZero/CODER/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":225850218,"owners_count":17534067,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["embeddings","medical","multi-language","nlp","pretrained-language-model","umls"],"created_at":"2024-11-22T06:11:15.697Z","updated_at":"2025-10-25T14:10:07.632Z","avatar_url":"https://github.com/GanjinZero.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# CODER\n![CODER](img/1.png)\n\nCODER: Knowledge infused cross-lingual medical term embedding for term normalization. [Paper](http://arxiv.org/abs/2011.02947)\n\n![CODER++](img/coder++.png)\n\nCODER++: Automatic Biomedical Term Clustering by Learning Fine-grained Term Representations. [Paper](https://arxiv.org/abs/2204.00391)\n\n# Use the model with transformers\nThe models have been uploaded to the Hugging Face model hub.\n\n```python\nfrom transformers import AutoTokenizer, AutoModel\n\ntokenizer = AutoTokenizer.from_pretrained(\"GanjinZero/UMLSBert_ENG\")\nmodel = AutoModel.from_pretrained(\"GanjinZero/UMLSBert_ENG\")\n```\nEnglish checkpoint: **GanjinZero/coder_eng** or GanjinZero/UMLSBert_ENG (old name)\n\nEnglish checkpoint CODER++: **GanjinZero/coder_eng_pp** (with hard negative sampling)\n\u003c!-- Please try to use transformers 3.4.0 to load CODER++, we find the model loaded in transformers 4.12.0 behave differently! 
--\u003e\n\nMultilingual checkpoint: **GanjinZero/coder_all** ~~or GanjinZero/UMLSBert_ALL  (discarded old name)~~\n\n# Train your model\n```shell\ncd pretrain\npython train.py --umls_dir your_umls_dir --model_name_or_path monologg/biobert_v1.1_pubmed\n```\nyour_umls_dir should contain **MRCONSO.RRF**, **MRREL.RRF** and **MRSTY.RRF**.\nUMLS download path: [UMLS](https://www.nlm.nih.gov/research/umls/licensedcontent/umlsarchives04.html#2020AA).\n\n# A small tool for loading UMLS RRF\n```python\nfrom pretrain.load_umls import UMLS\numls = UMLS(your_umls_dir)\n```\n\n# Test CODER or other embeddings\n## CADEC\n```shell\ncd test\npython cadec/cadec_eval.py bert_model_name_or_path\npython cadec/cadec_eval.py word_embedding_path\n```\n\n## MANTRA GSC\nDownload [the Mantra GSC](https://files.ifi.uzh.ch/cl/mantra/gsc/GSC-v1.1.zip), unzip the XML files to /test/mantra/dataset, then run\n```shell\ncd test/mantra\npython test.py\n```\n\n## MCSM\n```shell\ncd test/embeddings_reimplement\npython mcsm.py\n```\n\n## DDBRC\nOnly sampled data is provided.\n```shell\ncd test/diseasedb\npython train.py your_embedding embedding_type freeze_or_not gpu_id\n```\n- embedding_type should be one of [bert, word, cui]\n- freeze_or_not should be one of [T, F]: T freezes the embedding and F fine-tunes it\n\n# Citation\n```bibtex\n@article{YUAN2022103983,\ntitle = {CODER: Knowledge-infused cross-lingual medical term embedding for term normalization},\njournal = {Journal of Biomedical Informatics},\npages = {103983},\nyear = {2022},\nissn = {1532-0464},\ndoi = {https://doi.org/10.1016/j.jbi.2021.103983},\nurl = {https://www.sciencedirect.com/science/article/pii/S1532046421003129},\nauthor = {Zheng Yuan and Zhengyun Zhao and Haixia Sun and Jiao Li and Fei Wang and Sheng Yu},\nkeywords = {medical term normalization, cross-lingual, medical term representation, knowledge graph embedding, contrastive learning}\n}\n```\n\n```bibtex\n@inproceedings{zeng-etal-2022-automatic,\n    title = \"Automatic 
Biomedical Term Clustering by Learning Fine-grained Term Representations\",\n    author = \"Zeng, Sihang  and\n      Yuan, Zheng  and\n      Yu, Sheng\",\n    booktitle = \"Proceedings of the 21st Workshop on Biomedical Language Processing\",\n    month = may,\n    year = \"2022\",\n    address = \"Dublin, Ireland\",\n    publisher = \"Association for Computational Linguistics\",\n    url = \"https://aclanthology.org/2022.bionlp-1.8\",\n    pages = \"91--96\",\n    abstract = \"Term clustering is important in biomedical knowledge graph construction. Using similarities between terms embedding is helpful for term clustering. State-of-the-art term embeddings leverage pretrained language models to encode terms, and use synonyms and relation knowledge from knowledge graphs to guide contrastive learning. These embeddings provide close embeddings for terms belonging to the same concept. However, from our probing experiments, these embeddings are not sensitive to minor textual differences which leads to failure for biomedical term clustering. To alleviate this problem, we adjust the sampling strategy in pretraining term embeddings by providing dynamic hard positive and negative samples during contrastive learning to learn fine-grained representations which result in better biomedical term clustering. We name our proposed method as CODER++, and it has been applied in clustering biomedical concepts in the newly released Biomedical Knowledge Graph named BIOS.\",\n}\n```\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fganjinzero%2Fcoder","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fganjinzero%2Fcoder","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fganjinzero%2Fcoder/lists"}