Ecosyste.ms: Awesome

An open API service indexing awesome lists of open source software.

https://github.com/FreedomIntelligence/Apollo

Multilingual Medicine: Model, Dataset, Benchmark, Code

llm medical open-source

Host: GitHub
URL: https://github.com/FreedomIntelligence/Apollo
Owner: FreedomIntelligence
License: apache-2.0
Created: 2024-01-22T11:31:15.000Z (5 months ago)
Default Branch: main
Last Pushed: 2024-03-09T12:52:55.000Z (4 months ago)
Last Synced: 2024-03-09T15:21:18.977Z (4 months ago)
Topics: llm, medical, open-source
Language: Python
Homepage: https://apollo.llmzoo.com/#/
Size: 12 MB
Stars: 72
Watchers: 9
Forks: 4
Open Issues: 0
Metadata Files:
- Readme: README.md

Lists

awesome-latest-LLM - Apollo - 7B) | ~7B | | | | | | multilingual | (Model)

README

# Multilingual Medicine: Model, Dataset, Benchmark, Code

Covering English, Chinese, French, Hindi, Spanish, and Arabic so far

![Python 3.10](https://img.shields.io/badge/Python-3.10-lightblue) ![Pytorch 2.1.2](https://img.shields.io/badge/PyTorch-2.1.2-lightblue) ![transformers](https://img.shields.io/badge/transformers-4.34.0.dev0%2B-lightblue) ![accelerate](https://img.shields.io/badge/accelerate-0.22-lightblue)



   📃 Paper • 🌐 Demo • 🤗 ApolloCorpus • 🤗 XMedBench 

   
   中文  |  English



![Apollo](assets/apollo_medium_final.png)

## 🌈 Update

* **[2024.03.07]** [Paper](https://arxiv.org/abs/2403.03640) released.

* **[2024.02.12]** ApolloCorpus and XMedBench are published! 🎉

* **[2024.01.23]** Apollo repo is published! 🎉

## Results

   🤗 Apollo-0.5B • 🤗 Apollo-1.8B • 🤗 Apollo-2B  • 🤗 Apollo-6B • 🤗 Apollo-7B 

   🤗 Apollo-0.5B-GGUF • 🤗 Apollo-2B-GGUF  • 🤗 Apollo-6B-GGUF • 🤗 Apollo-7B-GGUF 

   

   

   

   ![Apollo](assets/result.png)

      

   

  

## Dataset & Evaluation

- Dataset

  🤗 ApolloCorpus


    ![Apollo](assets/dataset.png)

    - [Zip File](https://huggingface.co/datasets/FreedomIntelligence/Medbase_data/blob/main/Medbase_data-datasets.zip)

    - [Data category](https://huggingface.co/datasets/FreedomIntelligence/Medbase_data/tree/main/train)

       - Pretrain:

         - data item:

            - json_name: {data_source}_{language}_{data_type}.json

            - data_source: medicalBook, medicalGuideline, medicalPaper, medicalWeb (from online forums), medicalWiki

            - language: en (English), zh (Chinese), es (Spanish), fr (French), hi (Hindi)

            - data_type: text, qa (QA pairs generated from text)

            - data_type==text: list of strings

              ```
              [
                "string1",
                "string2",
                ...
              ]
              ```

            - data_type==qa: list of QA pairs (each a list of strings)

              ```
              [
                [
                  "q1",
                  "a1",
                  "q2",
                  "a2",
                  ...
                ],
                ...
              ]
              ```

      - SFT:

          - json_name: {data_source}_{language}.json

          - data_source: code, general, math, medicalExam, medicalPatient

          - data item: list of QA pairs (each a list of strings)

            ```
            [
              [
                "q1",
                "a1",
                "q2",
                "a2",
                ...
              ],
              ...
            ]
            ```

   

   

- Evaluation

  🤗 XMedBench 

      

     - EN:

       - [MedQA-USMLE](https://huggingface.co/datasets/GBaker/MedQA-USMLE-4-options) 

       - [MedMCQA](https://huggingface.co/datasets/medmcqa/viewer/default/test)

       - [PubMedQA](https://huggingface.co/datasets/pubmed_qa): Not used in the paper because the results fluctuated too much.

       - [MMLU-Medical](https://huggingface.co/datasets/cais/mmlu)

         - Clinical knowledge, Medical genetics, Anatomy, Professional medicine, College biology, College medicine

     - ZH:

       - [MedQA-MCMLE](https://huggingface.co/datasets/bigbio/med_qa/viewer/med_qa_zh_4options_bigbio_qa/test)

       - [CMB-single](https://huggingface.co/datasets/FreedomIntelligence/CMB): Not used in the paper

         - 2,000 randomly sampled single-answer multiple-choice questions.

       - [CMMLU-Medical](https://huggingface.co/datasets/haonan-li/cmmlu)

         - Anatomy, Clinical_knowledge, College_medicine, Genetics, Nutrition, Traditional_chinese_medicine, Virology

       - [CMExam](https://github.com/williamliujl/CMExam): Not used in the paper

         - 2,000 randomly sampled multiple-choice questions.

     - ES: [Head_qa](https://huggingface.co/datasets/head_qa)

     - FR: [Frenchmedmcqa](https://github.com/qanastek/FrenchMedMCQA)

     - HI: [MMLU_HI](https://huggingface.co/datasets/FreedomIntelligence/MMLU_Hindi)

        - Clinical knowledge, Medical genetics, Anatomy, Professional medicine, College biology, College medicine

     - AR: [MMLU_Ara](https://huggingface.co/datasets/FreedomIntelligence/MMLU_Arabic)

        - Clinical knowledge, Medical genetics, Anatomy, Professional medicine, College biology, College medicine
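
The ApolloCorpus file-naming pattern and the flat question/answer layout described under Dataset can be handled with a few lines of Python. The sketch below is illustrative only and is not taken from the repository: `parse_pretrain_name`, `iter_qa_pairs`, and the file name `medicalBook_en_qa.json` are hypothetical, and it assumes that none of the three name fields contains an underscore.

```python
import json
from typing import Iterator, List, Tuple


def parse_pretrain_name(filename: str) -> Tuple[str, str, str]:
    """Split '{data_source}_{language}_{data_type}.json' into its three fields.

    Assumption: no field contains an underscore, so a plain split is enough.
    """
    stem = filename.rsplit(".", 1)[0]
    data_source, language, data_type = stem.split("_")
    return data_source, language, data_type


def iter_qa_pairs(dialogue: List[str]) -> Iterator[Tuple[str, str]]:
    """Yield (question, answer) tuples from a flat [q1, a1, q2, a2, ...] list."""
    if len(dialogue) % 2 != 0:
        raise ValueError("expected alternating question/answer strings")
    for i in range(0, len(dialogue), 2):
        yield dialogue[i], dialogue[i + 1]


# In-memory stand-in for the contents of a qa file (made-up data).
sample = json.loads('[["q1", "a1", "q2", "a2"], ["q3", "a3"]]')
pairs = [pair for dialogue in sample for pair in iter_qa_pairs(dialogue)]

# Hypothetical file name that follows the documented pattern.
fields = parse_pretrain_name("medicalBook_en_qa.json")
print(fields)
print(pairs)
```

For a real file, `json.load(open(path, encoding="utf-8"))` would replace the inline `json.loads` call; the pairing logic stays the same for both the pretraining qa files and the SFT files, since both use the same nested-list layout.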

   

   
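
Two of the Chinese benchmarks above are subsampled to 2,000 questions, and all of the benchmarks are multiple-choice. As a rough sketch of how such a subset could be drawn reproducibly and scored per language, consider the code below. It is not the repository's evaluation code (that is run through `4.eval.sh`); the seed value, the record keys `lang`, `gold`, and `pred`, and the choice of plain accuracy as the metric are all assumptions made for illustration.

```python
import random
from collections import defaultdict
from typing import Dict, List, Sequence


def sample_questions(questions: Sequence[dict], k: int = 2000, seed: int = 0) -> List[dict]:
    """Draw up to k questions without replacement; a fixed seed makes it repeatable."""
    rng = random.Random(seed)  # seed value is an assumption, not from the repo
    return rng.sample(list(questions), min(k, len(questions)))


def accuracy_by_language(records: Sequence[dict]) -> Dict[str, float]:
    """Compute accuracy per language from dicts with 'lang', 'gold', 'pred' keys."""
    correct: Dict[str, int] = defaultdict(int)
    total: Dict[str, int] = defaultdict(int)
    for record in records:
        total[record["lang"]] += 1
        correct[record["lang"]] += int(record["pred"] == record["gold"])
    return {lang: correct[lang] / total[lang] for lang in total}


# Made-up question pool and predictions, only to exercise the functions.
pool = [{"id": i} for i in range(5000)]
subset = sample_questions(pool)

records = [
    {"lang": "en", "gold": "A", "pred": "A"},
    {"lang": "en", "gold": "B", "pred": "C"},
    {"lang": "zh", "gold": "D", "pred": "D"},
]
scores = accuracy_by_language(records)
print(len(subset), scores)
```

Using a dedicated `random.Random` instance rather than the module-level functions keeps the sampling independent of any other code that touches the global random state.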

## Results reproduction

   

   We take Gemma-2b as an example.

   1. Download the dataset for the project:

      ```
      bash 0.download_data.sh
      ```

    

   2. Prepare test and dev data for the specific model:

      - Create test data with special tokens. You can use ./util/check.ipynb to check a model's special tokens.

      ```
      bash '1.data_process_test&dev.sh'
      ```

    

   3. Prepare training data for the specific model (create tokenized data in advance):

      - You can adjust the order of the training data and the number of training epochs in this step.

      ```
      bash 2.data_process_train.sh
      ```

    

   4. Train the model:

      - To train on multiple nodes, refer to ./scripts/multi_node_train_*.sh.

      ```
      bash 3.single_node_train_gemma.sh
      ```

   5. (Optional) Proxy-Tuning: Directly improve model capabilities without fine-tuning

      ```
      bash src/proxy-tuning/scripts/eval/proxy_tuning.sh
      ```

   6. Evaluate your model: generate scores for the benchmark

      ```
      bash 4.eval.sh
      ```

   7. Evaluate your model: chat with your checkpoints from the command line

      ```
      python ./src/evaluate/cli_demo.py --model_name='./ckpts/your/path/tfmr'
      ```
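
The numbered steps above can also be chained from one driver script. The following is a minimal sketch, not something shipped with the repository: `STEPS`, `build_commands`, and `run_pipeline` are hypothetical names, and it assumes the scripts are run from the repository root and take no arguments. Because each command is passed to `subprocess.run` as an argument list, no shell parses it, so the `&` in the step 2 script name is treated as a literal character.

```python
import subprocess
from typing import List, Sequence

# Script names copied from the steps above; the optional proxy-tuning step is left out.
STEPS: Sequence[str] = (
    "0.download_data.sh",
    "1.data_process_test&dev.sh",
    "2.data_process_train.sh",
    "3.single_node_train_gemma.sh",
    "4.eval.sh",
)


def build_commands(steps: Sequence[str] = STEPS) -> List[List[str]]:
    """Build one argv list per script; no shell is involved when these are run."""
    return [["bash", script] for script in steps]


def run_pipeline(dry_run: bool = True) -> None:
    """Print each command and, unless dry_run is set, execute it in order."""
    for cmd in build_commands():
        print(" ".join(cmd))
        if not dry_run:
            # check=True raises CalledProcessError and stops at the first failure.
            subprocess.run(cmd, check=True)


# Dry run by default so that nothing is downloaded or trained by accident.
run_pipeline(dry_run=True)
```

Calling `run_pipeline(dry_run=False)` would execute the scripts for real; stopping at the first failing step avoids training on partially processed data.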

   

   

## Acknowledgment

- [HuatuoGPT-II](https://github.com/FreedomIntelligence/HuatuoGPT-II)

- [proxy-tuning](https://github.com/alisawuffles/proxy-tuning)

## Citation

Please use the following citation if you intend to use our dataset for training or evaluation:

```
@misc{wang2024apollo,
   title={Apollo: Lightweight Multilingual Medical LLMs towards Democratizing Medical AI to 6B People},
   author={Xidong Wang and Nuo Chen and Junyin Chen and Yan Hu and Yidong Wang and Xiangbo Wu and Anningzhe Gao and Xiang Wan and Haizhou Li and Benyou Wang},
   year={2024},
   eprint={2403.03640},
   archivePrefix={arXiv},
   primaryClass={cs.CL}
}
```
