{"id":13794768,"url":"https://github.com/starmpcc/CAMEL","last_synced_at":"2025-05-12T21:32:15.365Z","repository":{"id":163113848,"uuid":"638178366","full_name":"starmpcc/CAMEL","owner":"starmpcc","description":"Clinically Adapted Model Enhanced from LLaMA","archived":false,"fork":false,"pushed_at":"2023-09-01T06:32:02.000Z","size":1610,"stargazers_count":77,"open_issues_count":0,"forks_count":5,"subscribers_count":3,"default_branch":"main","last_synced_at":"2024-10-18T23:15:14.511Z","etag":null,"topics":["alpaca","camel","clinical","gpt","large-language-model","llama","llm","self-instruct"],"latest_commit_sha":null,"homepage":"https://starmpcc.github.io/CAMEL/","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/starmpcc.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null}},"created_at":"2023-05-09T08:44:29.000Z","updated_at":"2024-09-08T14:54:08.000Z","dependencies_parsed_at":"2024-01-07T06:23:03.936Z","dependency_job_id":"3bf6ddc3-d3da-41b3-b80d-78df35961e70","html_url":"https://github.com/starmpcc/CAMEL","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/starmpcc%2FCAMEL","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/starmpcc%2FCAMEL/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/starmpcc%2FCAMEL/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/starmpcc%2FCAMEL/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/starmpcc","download_url
":"https://codeload.github.com/starmpcc/CAMEL/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":225157000,"owners_count":17429698,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["alpaca","camel","clinical","gpt","large-language-model","llama","llm","self-instruct"],"created_at":"2024-08-03T23:00:47.562Z","updated_at":"2024-11-18T09:31:03.838Z","avatar_url":"https://github.com/starmpcc.png","language":"Python","funding_links":[],"categories":["Specialized Medical LLMs","文本生成","Medical LLMs \u0026 Foundation Models"],"sub_categories":["LLaMA以及扩展"],"readme":"# CAMEL: Clinically Adapted Model Enhanced from LLaMA\n\u003cp align='center'\u003e\n\u003cimg src=\"./resources/camel.png\"  width=\"400\" height=\"400\" center-align=\"true\"\u003e\n\u003cdiv align=\"center\"\u003e\u003cb\u003eCAMEL\u003c/b\u003e from Bing Image Creator\u003c/div\u003e\n\u003c/p\u003e\n\n[![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT)\n[![Python 3.9+](https://img.shields.io/badge/python-3.9+-blue.svg)](https://www.python.org/downloads/release/python-390/)\n[![Code style: black](https://img.shields.io/badge/code%20style-black-000000.svg)](https://github.com/psf/black)\n\n## UPDATE: NEW MODEL ANNOUNCEMENT\nWe are proud to introduce **Asclepius**, a more advanced clinical large language model.\nAs this model was trained on synthetic clinical notes, it is publicly accessible via Huggingface.\nIf you are considering using CAMEL, we highly recommend switching to Asclepius instead.\nFor more 
information, please visit this [link](https://github.com/starmpcc/Asclepius).\n\n----\n\n\n\n### [**Our Blog Post**](https://starmpcc.github.io/CAMEL)\n### [**Our Demo**](https://starmpcc-camel-demo-demo-i7ajms.streamlit.app/)\n\n\u003cbr/\u003e\n\nWe present **CAMEL**, Clinically Adapted Model Enhanced from LLaMA. With LLaMA as its foundation, **CAMEL** is further pre-trained on MIMIC-III and MIMIC-IV clinical notes, and finetuned on clinical instructions (Figure 2). Our preliminary GPT-4-based evaluation demonstrates that **CAMEL** achieves over 96% of the quality of OpenAI's GPT-3.5 (Figure 1). In accordance with the data usage policies of our source data, both our instruction dataset and model will be published on PhysioNet with credentialed access. To facilitate replication, we will also release all code, allowing individual healthcare institutions to reproduce our model using their own clinical notes.\nFor further details, please refer to our [**blog post**](https://starmpcc.github.io/CAMEL).\n\n\u003cp align='center'\u003e\n\u003cimg src=\"./resources/performance.png\" center-align=\"true\" width=\"70%\"\u003e\n\u003cdiv align=\"center\"\u003eFigure 1. Performance Comparison\u003c/div\u003e\n\u003c/p\u003e\n\n\n\u003cp align='center'\u003e\n\u003cimg src=\"./resources/pipeline.jpg\" center-align=\"true\"\u003e\n\u003cdiv align=\"center\"\u003eFigure 2. Model Pipeline\u003c/div\u003e\n\u003c/p\u003e\n\n## Reproducing Guide\nDue to the licensing restrictions of the [MIMIC](https://mimic.mit.edu) and [i2b2](https://i2b2.org) datasets, we cannot publish the instruction dataset and checkpoints. 
We will publish our model and data via PhysioNet within a few weeks.\n\n\u003cdetails\u003e\n\u003csummary\u003eEnvironment Setup\u003c/summary\u003e\n\n```\nconda create -n camel python=3.9 -y\nconda activate camel\nconda install pytorch torchvision torchaudio pytorch-cuda=11.8 -c pytorch -c nvidia -y\npip install pandarallel pandas jupyter numpy datasets sentencepiece openai fire\npip install git+https://github.com/huggingface/transformers.git@871598be552c38537bc047a409b4a6840ba1c1e4\n```\n\n\u003c/details\u003e\n\n\u003cdetails\u003e\n\u003csummary\u003e Pretraining \u003c/summary\u003e\n\n- Note Preprocessing\n  - For each note, we prepend the category to the text.\n  - To prevent test set leakage, we removed 404 MIMIC-III notes that overlap with the [RadQA](https://physionet.org/content/radqa/1.0.0/), [CLIP](https://github.com/asappresearch/clip), and [n2c2 2018](https://pubmed.ncbi.nlm.nih.gov/31584655/) datasets used for further evaluation.\n  - We concatenate all notes with `\u003ceos\u003e` tokens.\n  - `$ python pretraining_preprocess/mimiciii_preproc.py --mimiciii_note_path {MIMICIII_NOTE_PATH} --output_path {OUTPUT_PATH}`\n  - `$ python pretraining_preprocess/mimiciv_preproc.py --discharge_note_path {DISCHARGE_NOTE_PATH} --radiology_note_path {RADIOLOGY_NOTE_PATH} --output_path {OUTPUT_PATH}`\n  - `$ python pretraining_preprocess/tokenize_data.py --data_path {DATA_PATH} --save_path {SAVE_PATH}`\n- Run Pretraining\n  ```\n  $ torchrun --nproc_per_node=8 --master_port={YOUR_PORT} \\\n      src/train.py \\\n      --model_name_or_path \"decapoda-research/llama-7b-hf\" \\\n      --data_path  {DATA_FILE} \\\n      --bf16 True \\\n      --output_dir ./checkpoints \\\n      --num_train_epochs 1 \\\n      --per_device_train_batch_size 2 \\\n      --per_device_eval_batch_size 2 \\\n      --gradient_accumulation_steps 8 \\\n      --evaluation_strategy \"no\" \\\n      --save_strategy \"steps\" \\\n      --save_steps 1000 \\\n      --learning_rate 2e-5 \\\n      
--weight_decay 0. \\\n      --warmup_ratio 0.03 \\\n      --lr_scheduler_type \"cosine\" \\\n      --logging_steps 1 \\\n      --fsdp \"full_shard auto_wrap\" \\\n      --fsdp_transformer_layer_cls_to_wrap 'LlamaDecoderLayer' \\\n      --tf32 True \\\n      --model_max_length 2048 \\\n      --gradient_checkpointing True\n  ```\n\n\u003c/details\u003e\n\n\n\u003cdetails\u003e\n\u003csummary\u003eInstruction Finetuning\u003c/summary\u003e\n\n- NOTE: To generate instructions, [you should use the certified Azure OpenAI API](https://physionet.org/news/post/415).\n- Instruction Generation\n\n  - Set environment variables\n    - `OPENAI_API_KEY`\n    - `OPENAI_API_BASE`\n    - `OPENAI_DEPLOYMENT_NAME`\n  - Preprocess Notes\n    - `$ python instruction/preprocess_note.py`\n  - De-Identification instruction generation\n    - `$ python instruction/de_id_gen.py --input {PREPROCESSED_NOTES} --output {OUTPUT_FILE_1} --mode inst`\n    - `$ python instruction/de_id_postprocess.py --input {OUTPUT_FILE_1} --output {OUTPUT_FILE_2}`\n    - `$ python instruction/de_id_gen.py --input {OUTPUT_FILE_2} --output {inst_output/OUTPUT_FILE_deid} --mode ans`\n  - Other tasks instruction generation\n    - You can generate instructions selectively for each dataset.\n    - `$ python instruction/instructtion_gen.py --input {PREPROCESSED_NOTES} --output {inst_output/OUTPUT_FILE} --source {mimiciii, mimiciv, i2b2}`\n  - Merge and format files\n    - `$ python instruction/merge_data.py --data_path {inst_output} --output {OUTPUT_FILE_FINAL}` \n\n\n- Run Instruction Finetuning\n\n  - All of our experiments were performed with 8x A6000 GPUs.\n  - Adjust `nproc_per_node` and `gradient_accumulation_steps` to fit your hardware (global batch size=128).\n```\n    $ torchrun --nproc_per_node=8 --master_port={YOUR_PORT} \\\n        src/instruction_ft.py \\\n        --model_name_or_path \"decapoda-research/llama-7b-hf\" \\\n        --data_path  {OUTPUT_FILE_FINAL} \\\n        --bf16 True \\\n        
--output_dir ./checkpoints \\\n        --num_train_epochs 3 \\\n        --per_device_train_batch_size 2 \\\n        --per_device_eval_batch_size 2 \\\n        --gradient_accumulation_steps 8 \\\n        --evaluation_strategy \"no\" \\\n        --save_strategy \"epoch\" \\\n        --learning_rate 2e-5 \\\n        --weight_decay 0. \\\n        --warmup_ratio 0.03 \\\n        --lr_scheduler_type \"cosine\" \\\n        --logging_steps 1 \\\n        --fsdp \"full_shard auto_wrap\" \\\n        --fsdp_transformer_layer_cls_to_wrap 'LlamaDecoderLayer' \\\n        --tf32 True \\\n        --model_max_length 2048 \\\n        --gradient_checkpointing True \\\n        --ddp_timeout 18000\n```\n\n\u003c/details\u003e\n\n\n\u003cdetails\u003e\n\u003csummary\u003eEvaluation\u003c/summary\u003e\n\n- Run model on MTSamples\n  ```\n  CUDA_VISIBLE_DEVICES=0 python src/evaluate.py \\\n    --model_name {MODEL_PATH} \\\n    --data_path eval/mtsamples_instructions.json \\\n    --output_path {OUTPUT_PATH}\n  ```\n\n  - We provide the outputs generated by GPT-3.5, Alpaca, and CAMEL as `mtsamples_results.json` in the `eval` folder. 
\n\n- Run GPT-4 for evaluation\n \n  ```\n  python eval/gpt4_evaluate.py --input {INPUT_PATH} --output {OUTPUT_PATH} \n  ```\n\u003c/details\u003e\n\n## Citation\n```\n@misc{CAMEL,\n    title = {CAMEL : Clinically Adapted Model Enhanced from LLaMA},\n    author = {Sunjun Kweon and Junu Kim and Seongsu Bae and Eunbyeol Cho and Sujeong Im and Jiyoun Kim and Gyubok Lee and JongHak Moon and JeongWoo Oh and Edward Choi},\n    month = {May},\n    year = {2023},\n    publisher = {GitHub},\n    journal = {GitHub repository},\n    howpublished = {\\url{https://github.com/starmpcc/CAMEL}},\n}\n```\n## Code References\n- [Stanford Alpaca](https://github.com/tatsu-lab/stanford_alpaca)\n- [Minimal-LLaMA](https://github.com/zphang/minimal-llama)\n- [Vicuna](https://github.com/lm-sys/FastChat)\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fstarmpcc%2FCAMEL","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fstarmpcc%2FCAMEL","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fstarmpcc%2FCAMEL/lists"}