{"id":13525114,"url":"https://github.com/chaoyi-wu/PMC-LLaMA","last_synced_at":"2025-04-01T04:31:16.073Z","repository":{"id":159620944,"uuid":"632335292","full_name":"chaoyi-wu/PMC-LLaMA","owner":"chaoyi-wu","description":"The official codes for \"PMC-LLaMA: Towards Building Open-source Language Models for Medicine\"","archived":false,"fork":false,"pushed_at":"2024-07-08T12:45:23.000Z","size":8266,"stargazers_count":565,"open_issues_count":24,"forks_count":52,"subscribers_count":8,"default_branch":"main","last_synced_at":"2024-08-02T06:17:34.114Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/chaoyi-wu.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2023-04-25T07:42:28.000Z","updated_at":"2024-07-31T16:07:21.000Z","dependencies_parsed_at":null,"dependency_job_id":"4341bdde-ef5d-4a49-9ef0-7b91e4fcc9a2","html_url":"https://github.com/chaoyi-wu/PMC-LLaMA","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/chaoyi-wu%2FPMC-LLaMA","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/chaoyi-wu%2FPMC-LLaMA/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/chaoyi-wu%2FPMC-LLaMA/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/chaoyi-wu%2FPMC-LLaMA/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/chaoyi-wu","download_url":"https:/
/codeload.github.com/chaoyi-wu/PMC-LLaMA/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":222698219,"owners_count":17024879,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2024-08-01T06:01:16.096Z","updated_at":"2024-11-02T09:30:49.545Z","avatar_url":"https://github.com/chaoyi-wu.png","language":"Python","funding_links":[],"categories":["🤖 模型","Model","医学大语言模型与智能体","Python","A01_文本生成_文本对话","🤖 Scientific Models","Medical LLMs \u0026 Foundation Models"],"sub_categories":["🧩 领域模型","其他特征工具","大语言对话模型及数据","🧬 Life Sciences"],"readme":"# PMC-LLaMA\n\nThe official codes for \"PMC-LLaMA: Towards Building Open-source Language Models for Medicine\". 
\n\n\u003c!-- vim-markdown-toc GFM --\u003e\n\n* [Latest News](#latest-news)\n* [Environment](#environment)\n* [Quick Start](#quick-start)\n* [Training](#training)\n* [Results](#results)\n    * [QA Benchmark](#qa-benchmark)\n    * [Zero-shot Cases](#zero-shot-cases)\n* [Acknowledge](#acknowledge)\n* [Contact](#contact)\n\n\u003c!-- vim-markdown-toc --\u003e\n\n[**Arxiv Version**](https://arxiv.org/abs/2304.14454)\n\nWe show that a medical LLM should first be pretrained on a domain corpus and then tuned on an instruction-following dataset.\n\nWe have released the latest model, **PMC_LLaMA_13B**, finetuned on our instruction-following dataset.\nIt follows user instructions better than MedLLaMA_13B.\n\n\u003cimg src=./figures/teaser.png width=\"50%\"\u003e\n\nIt can be loaded with:\n\n```python\nimport transformers\nimport torch\ntokenizer = transformers.LlamaTokenizer.from_pretrained('axiong/PMC_LLaMA_13B')\nmodel = transformers.LlamaForCausalLM.from_pretrained('axiong/PMC_LLaMA_13B')\n```\nThe PMC_LLaMA versions are listed below with brief descriptions.\n\n[MedLLaMA_13B](https://huggingface.co/chaoyi-wu/MedLLaMA_13B) is pretrained on a medical corpus, and [PMC_LLaMA_13B](https://huggingface.co/axiong/PMC_LLaMA_13B) is further finetuned from it.\n\n| Version | Link | Brief | Release Date |\n| --- | --- | --- | --- |\n| MMed-Llama-3 ![](./figures/new.gif) | https://huggingface.co/Henrychur/MMed-Llama-3-8B | Latest Pretrained Multilingual LLM on Llama-3 | 2024/05/22 |\n| MMedLM  | https://github.com/MAGIC-AI4Med/MMedLM | Further Pretrained Multilingual LLM | 2024/02/21 |\n| PMC_LLaMA_13B | https://huggingface.co/axiong/PMC_LLaMA_13B | Instruction Tuned | 2023/09/01 |\n| MedLLaMA_13B | https://huggingface.co/chaoyi-wu/MedLLaMA_13B | LLaMA pretrained on 4.8M PubMed Central papers and medical books | 2023/05/01 |\n| PMC_LLaMA_7B_10_epoch | https://huggingface.co/chaoyi-wu/PMC_LLAMA_7B_10_epoch | Similar to PMC_LLaMA_7B but trained for 10 
epochs | 2023/05/01 |\n| PMC_LLaMA_7B | https://huggingface.co/chaoyi-wu/PMC_LLAMA_7B | LLaMA-7b finetuned with PMC papers for 5 epochs | 2023/04/25 |\n\n\n## Latest News\nWe have released a new report generation metric, [RaTEScore](https://arxiv.org/abs/2406.16845). We strongly believe that, to promote the development of generative medical foundation models, a robust and reliable metric is a critical and foundational step. \n\n## Environment\nSet up the required environment as follows:\n```bash\nconda install pytorch==1.13.0 torchvision==0.14.0 torchaudio==0.13.0 pytorch-cuda=11.6 -c pytorch -c nvidia\npip install transformers==4.28.1 sentencepiece datasets\n```\n\n## Quick Start\nSee `simple_test.py` for a quick way to use PMC-LLaMA, or follow the simple example below.\n\n```python\nimport transformers\nimport torch\ntokenizer = transformers.LlamaTokenizer.from_pretrained('axiong/PMC_LLaMA_13B')\nmodel = transformers.LlamaForCausalLM.from_pretrained('axiong/PMC_LLaMA_13B')\nmodel.cuda()  # move the model to GPU\n\nprompt_input = (\n    'Below is an instruction that describes a task, paired with an input that provides further context.'\n    'Write a response that appropriately completes the request.\\n\\n'\n    '### Instruction:\\n{instruction}\\n\\n### Input:\\n{input}\\n\\n### Response:'\n)\n\nexample = {\n    \"instruction\": \"You're a doctor, kindly address the medical queries according to the patient's account. Answer with the best option directly.\",\n    \"input\": (\n        \"###Question: A 23-year-old pregnant woman at 22 weeks gestation presents with burning upon urination. \"\n        \"She states it started 1 day ago and has been worsening despite drinking more water and taking cranberry extract. \"\n        \"She otherwise feels well and is followed by a doctor for her pregnancy. 
\"\n        \"Her temperature is 97.7°F (36.5°C), blood pressure is 122/77 mmHg, pulse is 80/min, respirations are 19/min, and oxygen saturation is 98% on room air.\"\n        \"Physical exam is notable for an absence of costovertebral angle tenderness and a gravid uterus. \"\n        \"Which of the following is the best treatment for this patient?\"\n        \"###Options: A. Ampicillin B. Ceftriaxone C. Doxycycline D. Nitrofurantoin\"\n    )\n}\ninput_str = [prompt_input.format_map(example)]\n\nmodel_inputs = tokenizer(\n    input_str,\n    return_tensors='pt',\n    padding=True,\n)\nprint( f\"\\033[32mmodel_inputs\\033[0m: { model_inputs }\" )\n\n\ntopk_output = model.generate(\n    model_inputs.input_ids.cuda(),\n    max_new_tokens=1000,\n    top_k=50\n)\noutput_str = tokenizer.batch_decode(topk_output)\nprint('model predict: ', output_str[0])\n```\n\n\n## Training\n\nThe training process is divided into two phases: pretraining and instruction tuning.\n\n**Pre-training**\n\nThe pretraining script is located at `Pretrain/training.sh`.\n\nOur pretraining dataset is sourced from [S2ORC](https://github.com/allenai/s2orc). Only papers with PubMed IDs are considered medical-related and used during pretraining.\n\u003c!-- The raw training data can be dowloaded from [S2ORC](https://github.com/allenai/s2orc), filter out the papers with PubmedCentral IDs, and you can get the training data we use.  --\u003e\n\nThe books are listed in this repo in [MedicalBook.xlsx](https://github.com/chaoyi-wu/PMC-LLaMA/blob/main/MedicalBook.xlsx). Due to licensing, we cannot release the raw content. 
To reproduce our results, please buy and process the books.\n\nFor more details on how to fine-tune LLaMA, refer to [Finetune_LLAMA](https://github.com/chaoyi-wu/Finetune_LLAMA).\n\n\n**Instruction Tuning**\n\nWe also provide an instruction-tuning script at `SFT/train.py`.\nYou can find our instruction dataset at [PMC LLaMA Instructions](https://huggingface.co/datasets/axiong/pmc_llama_instructions).\n\n\n## Results\n\n### QA Benchmark\n| Method              | Model Size          | USMLE | MedMCQA | PubMedQA |\n|---------------------|---------------------|------------------|--------------|------------------|\n| Human (pass)        | -                   | 50.0            | --            | 60.0           |\n| Human (expert)      | -                   | 87.0            | 90.0         | 78.0           |\n| ChatGPT             | 175B                | **57.0**        | 44.7         | 63.9           |\n| LLaMA-2             | 13B                 | 42.73           | 37.41        | 68.0           |\n| LLaMA-2             | 70B                 | 43.68           | 35.02        | 74.3           |\n| Med-Alpaca          | 13B                 | 30.85           | 31.13        | 53.2           |\n| Chat-Doctor         | 7B                  | 33.93           | 31.10        | 54.3           |\n| PMC_LLaMA_13B ![](./figures/new.gif) | 13B | **56.36**   | **56.04**  | **77.9**  |\n\n\nNote that the manual and zero-shot results marked with * are taken from [LMFlow](https://github.com/OptimalScale/LMFlow/tree/main/src/lmflow).\n\n\n### Zero-shot Cases\n\nWe demonstrate PMC_LLaMA_13B's responses to out-of-domain queries.\n\n\u003cimg src=./figures/pmc_llama_cases.png\u003e\n\nNote that, because it was trained on papers, MedLLaMA_13B may generate some citation numbers (LLaMA sometimes does this as well); we omit them in these cases to show the main content.\nFor PMC_LLaMA_13B, it is much easier to extract the correct answer because the output is structured.\n\n\n## Acknowledge\nMinimal LLaMA 
-- https://github.com/zphang/minimal-llama\n\nalpaca -- https://github.com/tatsu-lab/stanford_alpaca\n\nLMFlow -- https://github.com/OptimalScale/LMFlow/tree/main/src/lmflow\n\nLLaMA: Open and Efficient Foundation Language Models -- https://arxiv.org/abs/2302.13971\n\n## Contact\nIf you have any questions, please feel free to contact wtzxxxwcy02@sjtu.edu.cn.\n\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fchaoyi-wu%2FPMC-LLaMA","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fchaoyi-wu%2FPMC-LLaMA","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fchaoyi-wu%2FPMC-LLaMA/lists"}