https://github.com/freedomintelligence/multilingualsift

MultilingualSIFT: Multilingual Supervised Instruction Fine-tuning
https://github.com/freedomintelligence/multilingualsift

Last synced: about 1 year ago
JSON representation

MultilingualSIFT: Multilingual Supervised Instruction Fine-tuning

Host: GitHub
URL: https://github.com/freedomintelligence/multilingualsift
Owner: FreedomIntelligence
License: apache-2.0
Created: 2023-06-30T03:49:55.000Z (almost 3 years ago)
Default Branch: main
Last Pushed: 2023-08-15T17:58:46.000Z (almost 3 years ago)
Last Synced: 2025-03-30T19:22:13.248Z (about 1 year ago)
Language: Python
Homepage:
Size: 214 KB
Stars: 89
Watchers: 7
Forks: 5
Open Issues: 1
Metadata Files:
- Readme: README.md
- License: LICENSE

Awesome Lists containing this project

README

          # Multilingual Supervised Instruction Fine-tuning

This repo aims to provide the data, models, evaluation benchmark for multilingual instruction fine-tuning.

## 📚 Data

We translate Alpaca-GPT4 and Evol-Instruct from English to languages using GPT-3.5 Turbo, where

* For Alpaca-GPT4, we directly translate the instructions and responses.

* For Evol-Instruct, we translate the instructions and use to generate the responses using the translated instructions.

* For ShareGPT, we translate the English data from ShareGPT to other languages (Note: Due to the large scale of ShareGPT, we have yet to translate all the data).

| Language   | Alpaca-GPT4                                                                                 | Evol-instruct                                                                                 |ShareGPT                                                                                 |

|------------|---------------------------------------------------------------------------------------------|-----------------------------------------------------------------------------------------------|---------------------------------------------------------------------------------------------|

| Chinese    | [[huggingface]](https://huggingface.co/datasets/FreedomIntelligence/alpaca-gpt4-chinese)    | [[huggingface]](https://huggingface.co/datasets/FreedomIntelligence/evol-instruct-chinese)    |[[huggingface]](https://huggingface.co/datasets/FreedomIntelligence/sharegpt-chinese)|

| Japanese   | [[huggingface]](https://huggingface.co/datasets/FreedomIntelligence/alpaca-gpt4-japanese)   | [[huggingface]](https://huggingface.co/datasets/FreedomIntelligence/evol-instruct-japanese)   |[[huggingface]](https://huggingface.co/datasets/FreedomIntelligence/sharegpt-japanese)|

| Korean     | [[huggingface]](https://huggingface.co/datasets/FreedomIntelligence/alpaca-gpt4-korean)     | [[huggingface]](https://huggingface.co/datasets/FreedomIntelligence/evol-instruct-korean)     |[[huggingface]](https://huggingface.co/datasets/FreedomIntelligence/sharegpt-korean)|

| German     | [[huggingface]](https://huggingface.co/datasets/FreedomIntelligence/alpaca-gpt4-deutsch)    | [[huggingface]](https://huggingface.co/datasets/FreedomIntelligence/evol-instruct-deutsch)    |[[huggingface]](https://huggingface.co/datasets/FreedomIntelligence/sharegpt-deutsch)|

| French     | [[huggingface]](https://huggingface.co/datasets/FreedomIntelligence/alpaca-gpt4-french)     | [[huggingface]](https://huggingface.co/datasets/FreedomIntelligence/evol-instruct-french)     |[[huggingface]](https://huggingface.co/datasets/FreedomIntelligence/sharegpt-french)|

| Italian    | [[huggingface]](https://huggingface.co/datasets/FreedomIntelligence/alpaca-gpt4-italian)    | [[huggingface]](https://huggingface.co/datasets/FreedomIntelligence/evol-instruct-italian)    |[[huggingface]](https://huggingface.co/datasets/FreedomIntelligence/sharegpt-italian)| 

| Arabic     | [[huggingface]](https://huggingface.co/datasets/FreedomIntelligence/alpaca-gpt4-arabic)     | [[huggingface]](https://huggingface.co/datasets/FreedomIntelligence/evol-instruct-arabic)     |[[huggingface]](https://huggingface.co/datasets/FreedomIntelligence/sharegpt-arabic)|

| Portuguese | [[huggingface]](https://huggingface.co/datasets/FreedomIntelligence/alpaca-gpt4-portuguese) | [[huggingface]](https://huggingface.co/datasets/FreedomIntelligence/evol-instruct-portuguese) |[[huggingface]](https://huggingface.co/datasets/FreedomIntelligence/sharegpt-portuguese)|

| Spanish    | [[huggingface]](https://huggingface.co/datasets/FreedomIntelligence/alpaca-gpt4-spanish)    | [[huggingface]](https://huggingface.co/datasets/FreedomIntelligence/evol-instruct-spanish)    |[[huggingface]](https://huggingface.co/datasets/FreedomIntelligence/sharegpt-spanish)|

| Hindi      | [[huggingface]](https://huggingface.co/datasets/FreedomIntelligence/alpaca-gpt4-hindi)      | [[huggingface]](https://huggingface.co/datasets/FreedomIntelligence/evol-instruct-hindi)      |[[huggingface]](https://huggingface.co/datasets/FreedomIntelligence/sharegpt-hindi)|

| Indonesian | [[huggingface]](https://huggingface.co/datasets/FreedomIntelligence/alpaca-gpt4-indonesian) | [[huggingface]](https://huggingface.co/datasets/FreedomIntelligence/evol-instruct-indonesian) |[[huggingface]](https://huggingface.co/datasets/FreedomIntelligence/sharegpt-indonesian)|

## 🤖 Models

### CLI Interation

```bash

python -m src.deploy.cli --model-path /path/to/weights/

```

For example, you can use [FreedomIntelligence/phoenix-multiple-langs-v1](https://huggingface.co/FreedomIntelligence/phoenix-multiple-langs-v1) fine-tuned on eight languages (`English`, `Chinese`, `French`, `Spanish`, `Portuguese`, `Arabic`, `Indonesian`, `Hindi`):

```bash

python -m src.deploy.cli --model-path FreedomIntelligence/phoenix-multiple-langs-v1

```

### Deployment

1. Launch a controller

```shell

python -m src.deploy.webapp.controller

```

2. Launch a model worker

```shell

python -m src.deploy.webapp.model_worker --model-path /path/to/weights/

```

3. Launch a gradio web server

```shell

python -m src.deploy.webapp.gradio_web_server

```

Now, you can open your browser and chat with a model.

### Training

Specify the `train_data_path` and `val_data_path` and then run

```shell

bash scripts/train.sh

```

## 💯 Evaluation Benchmark

### Evaluation Data

* We translate [MMLU](https://github.com/hendrycks/test) and [Vicuna-80](https://github.com/lm-sys/FastChat/blob/main/fastchat/llm_judge/data/vicuna_bench/question.jsonl) to languages above for evaluation.

| Language   | MMLU                                                                                 |

|------------|--------------------------------------------------------------------------------------|

| Chinese    | [[huggingface]](https://huggingface.co/datasets/FreedomIntelligence/MMLU_Chinese)    |

| Japanese   | [[huggingface]](https://huggingface.co/datasets/FreedomIntelligence/MMLU_Japanese)   |

| Korean     | [[huggingface]](https://huggingface.co/datasets/FreedomIntelligence/MMLU_Korean)     |

| German     | [[huggingface]](https://huggingface.co/datasets/FreedomIntelligence/MMLU_Deutsch)    |

| French     | [[huggingface]](https://huggingface.co/datasets/FreedomIntelligence/MMLU_French)     |

| Italian    | [[huggingface]](https://huggingface.co/datasets/FreedomIntelligence/MMLU_Italian)    |

| Arabic     | [[huggingface]](https://huggingface.co/datasets/FreedomIntelligence/MMLU_Arabic)     |

| Portuguese | [[huggingface]](https://huggingface.co/datasets/FreedomIntelligence/MMLU_Portuguese) |

| Spanish    | [[huggingface]](https://huggingface.co/datasets/FreedomIntelligence/MMLU_Spanish)    |

| Hindi      | [[huggingface]](https://huggingface.co/datasets/FreedomIntelligence/MMLU_Hindi)      |

| Indonesian | [[huggingface]](https://huggingface.co/datasets/FreedomIntelligence/MMLU_Indonesian) |

### Evaluation

* For MMLU

```shell

bash scripts/eval_mmlu.sh ${LANGUAGE} ${MODEL_PATH} ${MODEL_ID}

```

* For Vicuna-80

```shell

bash scripts/eval_vicuna-80.sh ${LANGUAGE} ${MODEL_PATH} ${MODEL_ID}

```

## Citation

If you find this repository helpful, please cite the repository below.

```angular2

@software{Chen_MultilingualSIFT_Multilingual_Supervised_2023,

author = {Chen, Zhihong and Yan, Shuo and Liang, Juhao and Jiang, Feng and Wu, Xiangbo and Yu, Fei and Chen, Guiming Hardy and Chen, Junying and Zhang, Hongbo and Li Jianquan and Wan Xiang and Wang, Benyou},

month = jul,

title = {{MultilingualSIFT: Multilingual Supervised Instruction Fine-tuning}},

url = {https://github.com/FreedomIntelligence/MultilingualSIFT.git},

version = {0.1},

year = {2023}

}

```

ecosyste.ms

Data

Tools

Indexes

Applications

Experiments

Awesome

https://github.com/freedomintelligence/multilingualsift

Awesome Lists containing this project

README