**LLMBox** | [Training](training) | [Utilization](utilization)

# LLMBox

LLMBox is a comprehensive library for implementing LLMs, including **a unified training pipeline** and **comprehensive model evaluation**. LLMBox is designed to be a one-stop solution for training and utilizing LLMs. Through a practical library design, we achieve a high level of **flexibility** and **efficiency** in both the training and utilization stages.

## Key Features

**Training**

- **Diverse training strategies:** We support multiple training strategies, including Supervised Fine-tuning (`SFT`), Pre-training (`PT`), `PPO` and `DPO`.
- **Comprehensive SFT datasets:** We support 9 SFT datasets as input for training.
- **Tokenizer Vocabulary Merging:** We support the tokenizer merging function to expand the vocabulary.
- **Data Construction Strategies:** We currently support merging multiple datasets for training. `Self-Instruct` and `Evol-Instruct` are also available to process the dataset.
- **Parameter Efficient Fine-Tuning:** `LoRA` and `QLoRA` are supported in SFT or PT.
- **Efficient Training:** We support [`Flash Attention`](https://github.com/Dao-AILab/flash-attention) and `Deepspeed` for efficient training.

**Utilization**

- **Comprehensive Evaluation:** We support 51 commonly used datasets.
- **In-Context Learning:** We support various ICL strategies, including `KATE`, `GlobalE`, and `APE`.
- **Chain-of-Thought:** For some datasets, we support three types of CoT evaluation: `base`, `least-to-most`, and `pal`.
- **Evaluation Methods:** We currently support three evaluation methods for multiple-choice and generation questions.
- **Prefix Caching:** By caching the `past_key_values` of the prefix, we can speed up local inference by up to 6x.
- **vLLM and Flash Attention Support:** We also support [`vLLM`](https://github.com/vllm-project/vllm) and [`Flash Attention`](https://github.com/Dao-AILab/flash-attention) for efficient inference.
- **Quantization:** BitsAndBytes and GPTQ quantization are supported (see the sketch after this list).
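
As a quick illustration of the last two points, both backends are selected with command-line flags. The model path and dataset below are placeholders; the flags are the ones used in the commands later in this README:

```bash
# Placeholders for model and dataset; flags as used later in this README
python inference.py -m ../Llama-2-7b-hf -d copa --vllm True     # inference with the vLLM backend
python inference.py -m ../Llama-2-7b-hf -d copa --load_in_4bits # 4-bit BitsAndBytes quantization
```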

## Quick Start

### Install

```bash
git clone https://github.com/RUCAIBox/LLMBox.git && cd LLMBox
pip install -r requirements.txt # or `requirements-openai.txt`
```

If you are only evaluating OpenAI (or OpenAI-compatible) models, you can install the minimal requirements with `requirements-openai.txt`.
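
For example, an OpenAI-only setup might look like the following sketch (reading the key from the `OPENAI_API_KEY` environment variable is an assumption based on the standard OpenAI client convention):

```bash
# Minimal install for evaluating OpenAI (or OpenAI-compatible) models only
pip install -r requirements-openai.txt
# Assumed: the API key is picked up from the standard environment variable
export OPENAI_API_KEY=sk-...
```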

### Quick Start with Training

You can start by training an SFT model based on LLaMA-2 (7B) with DeepSpeed ZeRO-3:

```bash
cd training
bash download.sh
bash bash/run_7b_ds3.sh
```

### Quick Start with Utilization

To utilize your model, or evaluate an existing model, you can run the following command:

```bash
python inference.py -m gpt-3.5-turbo -d copa # --num_shot 0 --model_type instruction
```

By default, this runs the OpenAI GPT-3.5 Turbo model on the CoPA dataset in a zero-shot manner.
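
The commented flags in the command above can be set explicitly; for instance, a 5-shot run of the same evaluation:

```bash
# Same model and dataset as above, but 5-shot
python inference.py -m gpt-3.5-turbo -d copa --num_shot 5 --model_type instruction
```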

## Training

LLMBox Training supports various training strategies and dataset construction strategies, along with some efficiency-improving modules. You can train your model with the following command:

```bash
python train.py \
    --model_name_or_path meta-llama/Llama-2-7b-hf \
    --data_path data/ \
    --dataset alpaca_data_1k.json \
    --output_dir $OUTPUT_DIR \
    --num_train_epochs 2 \
    --per_device_train_batch_size 8 \
    --gradient_accumulation_steps 2 \
    --save_strategy "epoch" \
    --save_steps 2 \
    --save_total_limit 2 \
    --learning_rate 1e-5 \
    --lr_scheduler_type "constant"
```

Alternatively, you can use preset bash scripts to train your model.
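
For reference, the preset scripts mentioned in this README are collected below; the comments name the sections that introduce them:

```bash
bash bash/run_7b_ds3.sh     # SFT on LLaMA-2 (7B) with DeepSpeed ZeRO-3 (Quick Start)
bash bash/run_7b_pt.sh      # continual pre-training (Merging Tokenizer)
bash bash/run_7b_hybrid.sh  # training on a mix of datasets (Merging Datasets)
```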

### Merging Tokenizer

If you want to pre-train your models on corpora with languages or tokens not well supported by the original language model (e.g., LLaMA), we provide a tokenizer merging function to expand the vocabulary based on your corpora, using [sentencepiece](https://github.com/google/sentencepiece). See [merge_tokenizer.py](training/merge_tokenizer.py) for details, and follow the guide in [Pre-train](training/README.md#2-continual-pre-training-with-your-own-corpora).

```bash
bash bash/run_7b_pt.sh
```

### Merging Datasets

If you want to train your models on a mix of multiple datasets, you can pass a list of dataset files or names to LLMBox. LLMBox will convert each file or name into a PTDataset or SFTDataset and merge them into a combined dataset. You can also set the merging ratio of each dataset by passing a list of floats. Please follow the guide in [Merge Dataset](training/README.md#3-merging-different-datasets-with-designated-ratios-for-training).

```bash
bash bash/run_7b_hybrid.sh
```
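
As a minimal sketch of the command-line form (the second dataset file name is a placeholder, and the ratio flag is omitted here because this README does not show it; see the linked guide):

```bash
# Pass several files to --dataset to train on their combination;
# set per-dataset merge ratios as described in training/README.md
python train.py \
    --model_name_or_path meta-llama/Llama-2-7b-hf \
    --data_path data/ \
    --dataset alpaca_data_1k.json example_extra_data.json \
    --output_dir $OUTPUT_DIR
```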

### Self-Instruct and Evol-Instruct

Since manually creating high-quality instruction data to train a model is time-consuming and labor-intensive, Self-Instruct and Evol-Instruct were proposed to create large amounts of instruction data with varying levels of complexity using an LLM instead of humans. LLMBox supports both Self-Instruct and Evol-Instruct to augment or enhance the input data files. Please follow the guide in [Self-Instruct and Evol-Instruct](training/README.md#8-self-instruct-and-evol-instruct-for-generation-instructions).

```bash
python self_instruct/self_instruct.py --seed_tasks_path=seed_tasks.jsonl
```

For more details, view the [training](./training/README.md) documentation.

## Utilization

We provide broad support for Hugging Face models as well as OpenAI, Anthropic, Qwen, and other models for further utilization. A total of 51 commonly used datasets are currently supported, including `HellaSwag`, `MMLU`, `GSM8K`, `AGIEval`, `CEval`, and `CMMLU`. For a full list of supported models and datasets, view the [utilization](./utilization/README.md) documentation.

```bash
CUDA_VISIBLE_DEVICES=0 python inference.py \
    -m llama-2-7b-hf \
    -d mmlu agieval:[English] \
    --model_type instruction \
    --num_shot 5 \
    --ranking_type ppl_no_option
```
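
The `dataset[:subset]` syntax also accepts a comma-separated list of subsets, as in the vLLM example later in this README; for instance:

```bash
# Evaluate only two MMLU subsets, 5-shot
python inference.py -m llama-2-7b-hf -d mmlu:abstract_algebra,anatomy --num_shot 5
```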


### Performance

| Model | `get_ppl`: HellaSwag (0-shot) | `get_prob`: MMLU (5-shot) | `generation`: GSM (8-shot) |
|---|---|---|---|
| GPT-3.5 Turbo | 79.98 | 69.25 | 75.13 |
| LLaMA-2 (7B) | 76 | 45.95 | 14.63 |

### Efficient Evaluation

By default, we enable prefix caching for efficient evaluation. vLLM is also supported.


**Evaluation time (h:mm:ss):**

| Model | Efficient Method | `get_ppl`: HellaSwag (0-shot) | `get_prob`: MMLU (5-shot) | `generation`: GSM (8-shot) |
|---|---|---|---|---|
| LLaMA-2 (7B) | Vanilla | 0:05:32 | 0:18:30 | 2:10:27 |
| LLaMA-2 (7B) | vLLM | 0:06:37 | 0:14:55 | 0:03:36 |
| LLaMA-2 (7B) | Prefix Caching | 0:05:48 | 0:05:51 | 0:17:13 |

You can also use the following command to enable vLLM:

```bash
python inference.py -m ../Llama-2-7b-hf -d mmlu:abstract_algebra,anatomy --vllm True # --prefix_caching False --flash_attention False
```

To evaluate with quantization, you can use the following command:

```bash
python inference.py -m model -d dataset --load_in_4bits # --load_in_8_bits or --gptq
```

### Evaluation Method

Various types of evaluation methods are supported:


Dataset
Evaluation Method
Variants (Ranking Type)


GenerationDataset
generation


MultipleChoiceDataset
get_ppl
ppl_no_option, ppl


get_prob
prob

By default, we use the `get_ppl` method with the `ppl_no_option` ranking type for `MultipleChoiceDataset` and the `generation` method for `GenerationDataset`. You can also use the following command to use the `get_prob` method or the `ppl` variant of `get_ppl` for `MultipleChoiceDataset`:

```bash
python inference.py -m model -d dataset --ranking_type prob # or ppl
```

We also support In-Context Learning and Chain-of-Thought evaluation for some datasets:

```bash
python inference.py -m model -d dataset --kate # --globale or --ape
python inference.py -m model -d dataset --cot least_to_most # --base or --pal
```
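
For example, a `pal`-style CoT run on GSM8K might look like the sketch below; the lowercase dataset name is an assumption based on the `-d` values used elsewhere in this README, and the alternatives in the comment above are taken to substitute for the `--cot` value:

```bash
# Program-aided (pal) chain-of-thought evaluation on GSM8K
python inference.py -m gpt-3.5-turbo -d gsm8k --cot pal
```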

For a more detailed instruction on model utilization, view the [utilization](./utilization/README.md) documentation.

## Contributing

Please let us know if you encounter a bug or have any suggestions by [filing an issue](https://github.com/RUCAIBox/LLMBox/issues).

We welcome all contributions from bug fixes to new features and extensions.

We expect all contributions to be discussed in the issue tracker first and then submitted through PRs.

Make sure to format your code with `yapf --style style.cfg` and `isort` before submitting a PR.
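
A minimal pre-PR formatting pass might look like this (assuming `style.cfg` sits at the repository root):

```bash
# Format in place with the repo's yapf style, then sort imports
yapf --in-place --recursive --style style.cfg .
isort .
```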

## The Team

LLMBox is developed and maintained by [AI Box](http://aibox.ruc.edu.cn/).

## License

LLMBox is released under the [MIT License](./LICENSE).

## Reference

If you find LLMBox useful for your research or development, please cite the following papers:

```
```