https://github.com/freedomintelligence/apollomoe

[ICLR'25] ApolloMoE: Efficiently Democratizing Medical LLMs for 50 Languages via a Mixture of Language Family Experts

Host: GitHub
URL: https://github.com/freedomintelligence/apollomoe
Owner: FreedomIntelligence
Created: 2024-10-09T06:42:49.000Z (over 1 year ago)
Default Branch: master
Last Pushed: 2024-11-20T03:37:06.000Z (over 1 year ago)
Last Synced: 2025-04-30T19:48:52.023Z (about 1 year ago)
Language: Python
Homepage:
Size: 1.26 MB
Stars: 40
Watchers: 3
Forks: 1
Open Issues: 0
Metadata Files:
- Readme: README.md


# Democratizing Medical LLMs For Much More Languages

Covering 12 major languages (English, Chinese, French, Hindi, Spanish, Arabic, Russian, Japanese, Korean, German, Italian, and Portuguese) and 38 minor languages so far.



   📃 Paper • 🌐 Demo • 🤗 ApolloMoEDataset • 🤗 ApolloMoEBench  • 🤗 Models • 🌐 Apollo



![Apollo](assets/apollo_medium_final.png)

## 🌈 Update

* **[2024.10.15]** ApolloMoE repo is published! 🎉

## Languages Coverage

12 Major Languages and 38 Minor Languages


   

   ![ApolloMoE](assets/languages.png)

## Architecture


  ![ApolloMoE](/assets/hybrid_routing.png)

## Results

### Dense

   🤗 Apollo2-0.5B • 🤗 Apollo2-1.5B • 🤗 Apollo2-2B  • 🤗 Apollo2-3.8B • 🤗 Apollo2-7B  • 🤗 Apollo2-9B  

   


   

   ![ApolloMoE](assets/dense_results.png)

### Post-MoE

   🤗 Apollo-MoE-0.5B  • 🤗 Apollo-MoE-1.5B  • 🤗 Apollo-MoE-7B  

   


   

   ![ApolloMoE](assets/post_moe_results.png)

## Usage Format

#### Apollo2

- 0.5B, 1.5B, 7B: User:{query}\nAssistant:{response}<|endoftext|>

- 2B, 9B: User:{query}\nAssistant:{response}\

- 3.8B: <|user|>\n{query}<|end|><|assistant|>\n{response}<|end|>

#### Apollo-MoE

- 0.5B, 1.5B, 7B: User:{query}\nAssistant:{response}<|endoftext|>
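
The formats above can be turned into a small helper that builds the prompt prefix, i.e. everything the model sees before it starts writing `{response}`. This is a minimal sketch rather than code from the repository: the function name `build_prompt` and the `size` argument are hypothetical, and the 3.8B branch assumes the role tag is spelled `<|assistant|>`. The end-of-sequence marker is left out because the model emits it itself at inference time.

```python
def build_prompt(query: str, size: str) -> str:
    """Return the prompt prefix for a given model size (hypothetical helper).

    The prefix stops right where the model's response should begin, so no
    end-of-sequence marker is appended here.
    """
    # Sizes that share the plain "User:/Assistant:" layout in the list above.
    plain_sizes = {"0.5B", "1.5B", "2B", "7B", "9B"}
    if size in plain_sizes:
        return f"User:{query}\nAssistant:"
    if size == "3.8B":
        # Assumed spelling of the assistant role tag.
        return f"<|user|>\n{query}<|end|><|assistant|>\n"
    raise ValueError(f"Unknown model size: {size!r}")


if __name__ == "__main__":
    print(build_prompt("What are common symptoms of anemia?", "0.5B"))
```

The returned string can be passed directly to the tokenizer in the inference example.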

## Dataset & Evaluation

- Dataset

  🤗 ApolloMoEDataset


    ![ApolloMoE](assets/Dataset.png)

    - [Data category](https://huggingface.co/datasets/FreedomIntelligence/ApolloCorpus)

   

   The complete data is stored in `ApolloMoEDataset.json`, and a sample is provided in `ApolloMoEDataset_sample.json`.

- Evaluation

  🤗 ApolloMoEBench 


      

     - EN:

       - [MedQA-USMLE](https://huggingface.co/datasets/GBaker/MedQA-USMLE-4-options) 

       - [MedMCQA](https://huggingface.co/datasets/medmcqa/viewer/default/test)

       - [PubMedQA](https://huggingface.co/datasets/pubmed_qa): Because the results fluctuated too much, they were not used in the paper.

       - [MMLU-Medical](https://huggingface.co/datasets/cais/mmlu)

         - Clinical knowledge, Medical genetics, Anatomy, Professional medicine, College biology, College medicine

     - ZH:

       - [MedQA-MCMLE](https://huggingface.co/datasets/bigbio/med_qa/viewer/med_qa_zh_4options_bigbio_qa/test)

       - [CMB-single](https://huggingface.co/datasets/FreedomIntelligence/CMB): Not used in the paper

         - Randomly sample 2,000 multiple-choice questions with a single answer.

       - [CMMLU-Medical](https://huggingface.co/datasets/haonan-li/cmmlu)

         - Anatomy, Clinical_knowledge, College_medicine, Genetics, Nutrition, Traditional_chinese_medicine, Virology

       - [CExam](https://github.com/williamliujl/CMExam): Not used in the paper

         - Randomly sample 2,000 multiple-choice questions

     - ES: [Head_qa](https://huggingface.co/datasets/head_qa)

     - FR:

       - [Frenchmedmcqa](https://github.com/qanastek/FrenchMedMCQA)

       - [MMLU_FR]

         - Clinical knowledge, Medical genetics, Anatomy, Professional medicine, College biology, College medicine

     - HI: [MMLU_HI](https://huggingface.co/datasets/FreedomIntelligence/MMLU_Hindi)

        - Clinical knowledge, Medical genetics, Anatomy, Professional medicine, College biology, College medicine

     - AR: [MMLU_AR](https://huggingface.co/datasets/FreedomIntelligence/MMLU_Arabic)

        - Clinical knowledge, Medical genetics, Anatomy, Professional medicine, College biology, College medicine

     - JA: [IgakuQA](https://github.com/jungokasai/IgakuQA)

     - KO: [KorMedMCQA](https://huggingface.co/datasets/sean0042/KorMedMCQA)

     - IT:

       - [MedExpQA](https://huggingface.co/datasets/HiTZ/MedExpQA)

       - [MMLU_IT]

         - Clinical knowledge, Medical genetics, Anatomy, Professional medicine, College biology, College medicine

     - DE: [BioInstructQA](https://huggingface.co/datasets/BioMistral/BioInstructQA): German part

     - PT: [BioInstructQA](https://huggingface.co/datasets/BioMistral/BioInstructQA): Portuguese part

     - RU: [RuMedBench](https://github.com/sb-ai-lab/MedBench)
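
Two of the benchmarks above (CMB-single and CExam) are random samples of 2,000 questions. The README does not describe how the sampling was done, so the following is only a sketch of one reproducible way to do it, assuming the questions are already loaded into a Python list. The function name `sample_questions` and the default seed are hypothetical.

```python
import random


def sample_questions(questions, k=2000, seed=0):
    """Sample up to k questions without replacement, using a fixed seed.

    A dedicated random.Random instance keeps the result reproducible
    without touching the global random state.
    """
    pool = list(questions)
    if len(pool) <= k:
        # Nothing to subsample: return the whole pool unchanged.
        return pool
    rng = random.Random(seed)
    return rng.sample(pool, k)


if __name__ == "__main__":
    subset = sample_questions(range(10_000))
    print(len(subset))
```

Fixing the seed means the same subset is evaluated on every run, which keeps scores comparable across models.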

      

      

   

## Model Download and Inference

   We take Apollo-MoE-0.5B as an example

   1. Log in to Hugging Face

      

       ```

       huggingface-cli login --token $HUGGINGFACE_TOKEN

       ```

       

   2. Download model to local dir

        

       ```python

       from huggingface_hub import snapshot_download

       import os

       local_model_dir=os.path.join('/path/to/models/dir','Apollo-MoE-0.5B')

       snapshot_download(repo_id="FreedomIntelligence/Apollo-MoE-0.5B", local_dir=local_model_dir)

       ```

       

   3. Inference Example

      ```python

      from transformers import AutoTokenizer, AutoModelForCausalLM, GenerationConfig

      import os

      

      local_model_dir=os.path.join('/path/to/models/dir','Apollo-MoE-0.5B')

      

      model=AutoModelForCausalLM.from_pretrained(local_model_dir,trust_remote_code=True)

      tokenizer = AutoTokenizer.from_pretrained(local_model_dir,trust_remote_code=True)

      generation_config = GenerationConfig.from_pretrained(local_model_dir, pad_token_id=tokenizer.pad_token_id, num_return_sequences=1, max_new_tokens=7, min_new_tokens=2, do_sample=False, temperature=1.0, top_k=50, top_p=1.0)

      

      inputs = tokenizer('Answer directly.\nThe capital of Mongolia is Ulaanbaatar.\nThe capital of Iceland is Reykjavik.\nThe capital of Australia is', return_tensors='pt')

      inputs = inputs.to(model.device)

      pred = model.generate(**inputs,generation_config=generation_config)

      print(tokenizer.decode(pred.cpu()[0], skip_special_tokens=True))

      ```

   

## Results reproduction

   

   (Optional) Custom Model as Base

   


      

   ```

      cp /path/to/your/configuration_upcycling_qwen2_moe.py /path/to/src/variants/moe_initilization/configuration_upcycling_qwen2_moe.py

      cp /path/to/your/modeling_upcycling_qwen2_moe.py /path/to/src/variants/moe_initilization/modeling_upcycling_qwen2_moe.py

      cd /path/to/src/variants/moe_initilization

      bash convert.sh

   ```

   

   Full-finetune on Base Model

   


   

   

   We take Apollo2-7B or Apollo-MoE-0.5B as examples

   

   1. Download and extract data:

      

      - Download the dataset and benchmark first

      - Extract major or minor data part according to your needs:

      ```

      bash 0.extract_data.sh

      ```   

    

   2. Prepare test and dev data for specific model:

      - Create test data with special tokens

        

       ```

       bash '1.data_process_test&dev.sh'

       ```

    

   3. Prepare train data for specific model (Create tokenized data in advance):

    

      - You can adjust the order of the training data and the number of training epochs in this step

       ```

       bash 2.data_process_train.sh

       ```

    

   4. Train the model

    

      - To train on multiple nodes, refer to `./src/sft/training_config/zero_multi.yaml`

       ```

       bash 3.single_node_train.sh

       ```

   5. Evaluate your model: generate scores for the benchmark

      

         ```

         bash 4.eval.sh

         ```
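
The five scripts above can also be chained from Python. This is a rough sketch, not part of the repository: it assumes all scripts sit in the current working directory and are meant to run in numeric order, and the names `STEPS`, `build_commands`, and `run_pipeline` are hypothetical. Passing each command as an argument list avoids shell parsing, so the `&` in the second script name is treated as part of the filename.

```python
import subprocess

# Script names as listed in the steps above, in numeric order.
STEPS = [
    "0.extract_data.sh",
    "1.data_process_test&dev.sh",
    "2.data_process_train.sh",
    "3.single_node_train.sh",
    "4.eval.sh",
]


def build_commands():
    """Return one argv list per step; no shell is involved."""
    return [["bash", name] for name in STEPS]


def run_pipeline(dry_run=True):
    """Print each command and, unless dry_run is set, execute it."""
    for cmd in build_commands():
        print("$", " ".join(cmd))
        if not dry_run:
            # check=True stops the pipeline at the first failing step.
            subprocess.run(cmd, check=True)


if __name__ == "__main__":
    run_pipeline(dry_run=True)
```

With `dry_run=True` the sketch only prints the commands, which is a cheap way to confirm the order before starting a long training run.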

   

##  Citation

Please use the following citation if you intend to use our dataset for training or evaluation:

```

@misc{zheng2024efficientlydemocratizingmedicalllms,

      title={Efficiently Democratizing Medical LLMs for 50 Languages via a Mixture of Language Family Experts}, 

      author={Guorui Zheng and Xidong Wang and Juhao Liang and Nuo Chen and Yuping Zheng and Benyou Wang},

      year={2024},

      eprint={2410.10626},

      archivePrefix={arXiv},

      primaryClass={cs.CL},

      url={https://arxiv.org/abs/2410.10626}, 

}

```
