Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
Align Anything: Training All-modality Models with Feedback
https://github.com/PKU-Alignment/align-anything
- Host: GitHub
- URL: https://github.com/PKU-Alignment/align-anything
- Owner: PKU-Alignment
- License: apache-2.0
- Created: 2024-07-14T11:05:19.000Z (6 months ago)
- Default Branch: main
- Last Pushed: 2024-09-13T11:57:39.000Z (4 months ago)
- Last Synced: 2024-09-14T13:35:13.004Z (4 months ago)
- Topics: chameleon, dpo, large-language-models, multimodal, rlhf, vision-language-model
- Language: Python
- Homepage:
- Size: 5.61 MB
- Stars: 94
- Watchers: 6
- Forks: 27
- Open Issues: 7
Metadata Files:
- Readme: README.md
- Contributing: .github/CONTRIBUTING.md
- License: LICENSE
- Code of conduct: .github/CODE_OF_CONDUCT.md
Awesome Lists containing this project
- StarryDivineSky - PKU-Alignment/align-anything
- awesome-production-machine-learning - Align-Anything - Align-Anything aims to align any modality large models (any-to-any models), including LLMs, VLMs, and others, with human intentions and values (Industry Strength NLP)
README
[![PyPI](https://img.shields.io/pypi/v/align-anything?logo=pypi)](https://pypi.org/project/align-anything)
[![License](https://img.shields.io/github/license/PKU-Alignment/align-anything?label=license)](#license)

[📘Documentation](https://pku-alignment.notion.site/Align-Anything-37a300fb5f774bb08e5b21fdeb476c64) |
[🆕Update News](#news) |
[🛠️Quick Start](#quick-start) |
[🚀Algorithms](#algorithms) |
[👀Evaluation](#evaluation) |
[🤔Reporting Issues](#report-issues)

[Our 100K Instruction-Following Datasets](https://huggingface.co/datasets/PKU-Alignment/Align-Anything-Instruction-100K)
Align-Anything aims to align any modality large models (any-to-any models), including LLMs, VLMs, and others, with human intentions and values. More details about the definition and milestones of alignment for Large Models can be found in [AI Alignment](https://alignmentsurvey.com). Overall, this framework has the following characteristics:
- **Highly Modular Framework.** Its versatility stems from the abstraction of different algorithm types and well-designed APIs, allowing users to easily modify and customize the code for different tasks.
- **Support for Various Model Fine-Tuning.** This framework includes fine-tuning capabilities for models such as LLaMA3.1, LLaVA, Gemma, Qwen, Baichuan, and others (see [Model Zoo](https://github.com/PKU-Alignment/align-anything/blob/main/Model-Zoo.md)).
- **Support Fine-Tuning across Any Modality.** It supports alignment fine-tuning for models of different modalities, including LLMs, VLMs, and others (see [Development Roadmap](#development-roadmap)).
- **Support Different Alignment Methods.** The framework supports different alignment algorithms, including SFT, DPO, PPO, and others.

| | prompt: Small white toilet sitting in a small corner next to a wall. | prompt: A close up of a neatly made bed with two night stands. | prompt: A pizza is sitting on a plate at a restaurant. | prompt: A girl in a dress next to a piece of luggage and flowers. |
| --- | --- | --- | --- | --- |
| Before Alignment ([Chameleon-7B](https://huggingface.co/facebook/chameleon-7b)) | | | | |
| **After Alignment ([Chameleon 7B Plus](https://huggingface.co/PKU-Alignment/AA-chameleon-7b-plus))** | | | | |

> Alignment fine-tuning can significantly enhance the instruction-following capabilities of large multimodal models. After fine-tuning, Chameleon 7B Plus generates images that are more relevant to the prompt.
## Algorithms
We support basic alignment algorithms for different modalities, each of which may involve additional algorithms. For instance, in the text modality, we have also implemented SimPO, KTO, and others.

| Modality | SFT | RM | DPO | PPO |
| ---------------------------------- | --- | --- | --- | --- |
| `Text -> Text (t2t)` | ✔️ | ✔️ | ✔️ | ✔️ |
| `Text+Image -> Text (ti2t)` | ✔️ | ✔️ | ✔️ | ✔️ |
| `Text+Image -> Text+Image (ti2ti)` | ✔️ | ✔️ | ✔️ | ✔️ |
| `Text+Audio -> Text (ta2t)` | ✔️ | ✔️ | ✔️ | ✔️ |
| `Text+Video -> Text (tv2t)` | ✔️ | ✔️ | ✔️ | ✔️ |
| `Text -> Image (t2i)` | ✔️ | ⚒️ | ✔️ | ⚒️ |
| `Text -> Video (t2v)` | ✔️ | ⚒️ | ✔️ | ⚒️ |
| `Text -> Audio (t2a)` | ✔️ | ⚒️ | ✔️ | ⚒️ |

- ⚒️ : coming soon.
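For instance, a `Text -> Text` trainer could be launched the same way as the `Text+Image -> Text` example in the Quick Start below. This is only a sketch: the `align_anything.trainers.text_to_text.dpo` module path is an assumption extrapolated from the documented `text_image_to_text.dpo` invocation, not a confirmed API.

```bash
# Hypothetical text-to-text DPO launch; the module path mirrors the documented
# text_image_to_text.dpo layout and is an assumption.
MODEL_NAME_OR_PATH="" # model path
TRAIN_DATASETS=""     # dataset path
OUTPUT_DIR=""         # output dir

source ./setup.sh              # source the setup script (assumed to set MASTER_PORT)
export CUDA_HOME=$CONDA_PREFIX # replace with your CUDA path

deepspeed \
  --master_port ${MASTER_PORT} \
  --module align_anything.trainers.text_to_text.dpo \
  --model_name_or_path ${MODEL_NAME_OR_PATH} \
  --train_datasets ${TRAIN_DATASETS} \
  --output_dir ${OUTPUT_DIR}
```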
## Evaluation

We support evaluation datasets for `Text -> Text`, `Text+Image -> Text`, `Text -> Image`, and the other modalities listed below.

| Modality | Supported Benchmarks |
| :-------------------- | :----------------------------------------------------------- |
| `t2t` | [ARC](https://huggingface.co/datasets/allenai/ai2_arc), [BBH](https://huggingface.co/datasets/lukaemon/bbh), [Belebele](https://huggingface.co/datasets/facebook/belebele), [CMMLU](https://huggingface.co/datasets/haonan-li/cmmlu), [GSM8K](https://huggingface.co/datasets/openai/gsm8k), [HumanEval](https://huggingface.co/datasets/openai/openai_humaneval), [MMLU](https://huggingface.co/datasets/cais/mmlu), [MMLU-Pro](https://huggingface.co/datasets/TIGER-Lab/MMLU-Pro), [MT-Bench](https://huggingface.co/datasets/HuggingFaceH4/mt_bench_prompts), [PAWS-X](https://huggingface.co/datasets/google-research-datasets/paws-x), [RACE](https://huggingface.co/datasets/ehovy/race), [TruthfulQA ](https://huggingface.co/datasets/truthfulqa/truthful_qa) |
| `ti2t` | [A-OKVQA](https://huggingface.co/datasets/HuggingFaceM4/A-OKVQA), [LLaVA-Bench(COCO)](https://huggingface.co/datasets/lmms-lab/llava-bench-coco), [LLaVA-Bench(wild)](https://huggingface.co/datasets/lmms-lab/llava-bench-in-the-wild), [MathVista](https://huggingface.co/datasets/AI4Math/MathVista), [MM-SafetyBench](https://huggingface.co/datasets/PKU-Alignment/MM-SafetyBench), [MMBench](https://huggingface.co/datasets/lmms-lab/MMBench), [MME](https://huggingface.co/datasets/lmms-lab/MME), [MMMU](https://huggingface.co/datasets/MMMU/MMMU), [MMStar](https://huggingface.co/datasets/Lin-Chen/MMStar), [MMVet](https://huggingface.co/datasets/lmms-lab/MMVet), [POPE](https://huggingface.co/datasets/lmms-lab/POPE), [ScienceQA](https://huggingface.co/datasets/derek-thomas/ScienceQA), [SPA-VL](https://huggingface.co/datasets/sqrti/SPA-VL), [TextVQA](https://huggingface.co/datasets/lmms-lab/textvqa), [VizWizVQA](https://huggingface.co/datasets/lmms-lab/VizWiz-VQA) |
|`tv2t` |[MVBench](https://huggingface.co/datasets/OpenGVLab/MVBench), [Video-MME](https://huggingface.co/datasets/lmms-lab/Video-MME) |
|`ta2t` |[AIR-Bench](https://huggingface.co/datasets/qyang1021/AIR-Bench-Dataset) |
| `t2i` | [ImageReward](https://huggingface.co/datasets/THUDM/ImageRewardDB), [HPSv2](https://huggingface.co/datasets/zhwang/HPDv2), [COCO-30k(FID)](https://huggingface.co/datasets/sayakpaul/coco-30-val-2014) |
| `t2v` | [ChronoMagic-Bench](https://huggingface.co/datasets/BestWishYsh/ChronoMagic-Bench) |
| `t2a` | [AudioCaps(FAD)](https://huggingface.co/datasets/AudioLLMs/audiocaps_test) |
# News
- 2024-10-10: We support SFT for the `Any -> Any` modality model Emu3.
- 2024-09-24: We support SFT, DPO, RM and PPO for `Text + Video -> Text` modality models.
- 2024-09-13: We support SFT, DPO, RM and PPO for `Text + Audio -> Text` modality models.
- 2024-08-17: We support DPO and PPO for `Text+Image -> Text+Image` modality models.
- 2024-08-15: We support a new function in the evaluation module: the `models_pk` script ([here](./scripts/models_pk.sh)), which enables comparing the performance of two models across different benchmarks.
- 2024-08-06: We restructure the framework to support any-modality evaluation; the supported benchmark list is [here](https://github.com/PKU-Alignment/align-anything/tree/main/align_anything/evaluation/benchmarks).
- 2024-08-06: We support `Text+Image -> Text+Image` modality for the SFT trainer and Chameleon models.
- 2024-07-23: We support `Text -> Image`, `Text -> Audio`, and `Text -> Video` modalities for the SFT trainer and DPO trainer.
- 2024-07-22: We support the **Chameleon** model for the SFT trainer and DPO trainer!
- 2024-07-17: We open-source the Align-Anything-Instruction-100K dataset for the text modality. This dataset is available in both [English](https://huggingface.co/datasets/PKU-Alignment/Align-Anything-Instruction-100K) and [Chinese](https://huggingface.co/datasets/PKU-Alignment/Align-Anything-Instruction-100K-zh) versions, each sourced from different datasets and meticulously refined for quality by GPT-4.
- 2024-07-14: We open-source the align-anything framework.

# Installation
```bash
# clone the repository
git clone [email protected]:PKU-Alignment/align-anything.git
cd align-anything

# create virtual env
conda create -n align-anything python==3.11
conda activate align-anything
```

- **`[Optional]`** We recommend installing [CUDA](https://anaconda.org/nvidia/cuda) in the conda environment and setting the environment variable.
```bash
# We tested on the H800 computing cluster, and this version of CUDA works well.
# You can adjust this version according to the actual situation of the computing cluster.
conda install nvidia/label/cuda-12.2.0::cuda
export CUDA_HOME=$CONDA_PREFIX
```

> If your CUDA is installed in a different location, such as `/usr/local/cuda/bin/nvcc`, you can set the environment variable as follows:
```bash
export CUDA_HOME="/usr/local/cuda"
```

Finally, install `align-anything` by:
```bash
pip install -e .
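
# Optionally verify the install (a suggested check, not from the original instructions);
# the module name matches the align_anything.* entry points used elsewhere in this README.
python -c "import align_anything"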
```

## Wandb Logger
We support `wandb` logging. By default, it is set to offline. If you need to view wandb logs online, you can set the `WANDB_API_KEY` environment variable before starting the training:
```bash
export WANDB_API_KEY="..." # your W&B API key here
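
# Optionally, the standard wandb WANDB_MODE variable can force online syncing
# (a general wandb feature, not specific to this framework; its config may override it):
# export WANDB_MODE=online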
```

# Quick Start
## Training Scripts
To prepare for training, all scripts are located in the `./scripts` directory, and parameters that require user input have been left empty. For example, the DPO script for the `Text + Image -> Text` modality is as follows:
```bash
MODEL_NAME_OR_PATH="" # model path
TRAIN_DATASETS="" # dataset path
TRAIN_TEMPLATE="" # dataset template
TRAIN_SPLIT="" # split the dataset
OUTPUT_DIR="" # output dirsource ./setup.sh # source the setup script
export CUDA_HOME=$CONDA_PREFIX # replace it with your CUDA path
deepspeed \
--master_port ${MASTER_PORT} \
--module align_anything.trainers.text_image_to_text.dpo \
--model_name_or_path ${MODEL_NAME_OR_PATH} \
--train_datasets ${TRAIN_DATASETS} \
--train_template ${TRAIN_TEMPLATE} \
--train_split ${TRAIN_SPLIT} \
--output_dir ${OUTPUT_DIR}
```

We can run DPO with [LLaVA-v1.5-7B](https://huggingface.co/llava-hf/llava-1.5-7b-hf) (HF format) and the [SPA-VL](https://huggingface.co/datasets/sqrti/SPA-VL) dataset using the following script:
```bash
MODEL_NAME_OR_PATH="llava-hf/llava-1.5-7b-hf" # model path
TRAIN_DATASETS="sqrti/SPA-VL" # dataset path
TRAIN_TEMPLATE="SPA_VL" # dataset template
TRAIN_SPLIT="train" # split the dataset
OUTPUT_DIR="../output/dpo" # output dir
export WANDB_API_KEY="YOUR_WANDB_KEY" # wandb logging

source ./setup.sh # source the setup script
export CUDA_HOME=$CONDA_PREFIX # replace it with your CUDA path
deepspeed \
--master_port ${MASTER_PORT} \
--module align_anything.trainers.text_image_to_text.dpo \
--model_name_or_path ${MODEL_NAME_OR_PATH} \
--train_datasets ${TRAIN_DATASETS} \
--train_template ${TRAIN_TEMPLATE} \
--train_split ${TRAIN_SPLIT} \
--output_dir ${OUTPUT_DIR}
```

## Evaluation
All evaluation scripts can be found in the `./scripts` directory. The `./scripts/evaluate.sh` script runs model evaluation on the benchmarks, and parameters that require user input have been left empty. The corresponding script is as follows:
```bash
SCRIPT_DIR="$( cd "$( dirname "${BASH_SOURCE[0]}" )" && pwd )"
cd "${SCRIPT_DIR}/../align_anything/evaluation" || exit 1BENCHMARKS=("") # evaluation benchmarks
OUTPUT_DIR="" # output dir
GENERATION_BACKEND="" # generation backend
MODEL_ID="" # model's unique id
MODEL_NAME_OR_PATH="" # model path
CHAT_TEMPLATE="" # model templatefor BENCHMARK in "${BENCHMARKS[@]}"; do
python __main__.py \
--benchmark ${BENCHMARK} \
--output_dir ${OUTPUT_DIR} \
--generation_backend ${GENERATION_BACKEND} \
--model_id ${MODEL_ID} \
--model_name_or_path ${MODEL_NAME_OR_PATH} \
--chat_template ${CHAT_TEMPLATE}
done
```

For example, you can evaluate [LLaVA-v1.5-7B](https://huggingface.co/llava-hf/llava-1.5-7b-hf) (HF format) on the [POPE](https://huggingface.co/datasets/lmms-lab/POPE) and [MM-SafetyBench](https://huggingface.co/datasets/PKU-Alignment/MM-SafetyBench) benchmarks using the following script:
```bash
SCRIPT_DIR="$( cd "$( dirname "${BASH_SOURCE[0]}" )" && pwd )"
cd "${SCRIPT_DIR}/../align_anything/evaluation" || exit 1BENCHMARKS=("POPE" "MM-SafetyBench") # evaluation benchmarks
OUTPUT_DIR="../output/evaluation" # output dir
GENERATION_BACKEND="vLLM" # generation backend
MODEL_ID="llava-1.5-7b-hf" # model's unique id
MODEL_NAME_OR_PATH="llava-hf/llava-1.5-7b-hf" # model path
CHAT_TEMPLATE="Llava" # model templatefor BENCHMARK in "${BENCHMARKS[@]}"; do
python __main__.py \
--benchmark ${BENCHMARK} \
--output_dir ${OUTPUT_DIR} \
--generation_backend ${GENERATION_BACKEND} \
--model_id ${MODEL_ID} \
--model_name_or_path ${MODEL_NAME_OR_PATH} \
--chat_template ${CHAT_TEMPLATE}
done
```

You can modify the configuration files for the benchmarks in [this directory](https://github.com/PKU-Alignment/align-anything/tree/main/align_anything/configs/evaluation/benchmarks) to suit specific evaluation tasks and models, and adjust inference parameters for [vLLM](https://github.com/PKU-Alignment/align-anything/tree/main/align_anything/configs/evaluation/vllm) or [DeepSpeed](https://github.com/PKU-Alignment/align-anything/tree/main/align_anything/configs/evaluation/deepspeed) based on your generation backend. For more details about the evaluation pipeline, refer to the documentation [here](https://github.com/PKU-Alignment/align-anything/blob/main/align_anything/evaluation/README.md).
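To see which configuration files are available in a local checkout, you can list the directories linked above (paths taken from those links):

```bash
# Benchmark-specific evaluation configs.
ls align_anything/configs/evaluation/benchmarks/
# Inference-parameter configs for each generation backend.
ls align_anything/configs/evaluation/vllm/
ls align_anything/configs/evaluation/deepspeed/
```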
# Inference
## Interactive Client
```bash
python3 -m align_anything.serve.cli --model_name_or_path your_model_name_or_path
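
# For example, to chat with the aligned Chameleon checkpoint referenced above (illustrative):
# python3 -m align_anything.serve.cli --model_name_or_path PKU-Alignment/AA-chameleon-7b-plus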
```

## Interactive Arena
```bash
python3 -m align_anything.serve.arena \
--red_corner_model_name_or_path your_red_model_name_or_path \
--blue_corner_model_name_or_path your_blue_model_name_or_path
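
# For example, to pit the base model against its aligned counterpart (illustrative;
# model ids taken from the alignment comparison table above):
# python3 -m align_anything.serve.arena \
#   --red_corner_model_name_or_path facebook/chameleon-7b \
#   --blue_corner_model_name_or_path PKU-Alignment/AA-chameleon-7b-plus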
```

## Report Issues
If you have any questions while using align-anything, don't hesitate to ask on [the GitHub issue page](https://github.com/PKU-Alignment/align-anything/issues/new/choose); we will reply within 2-3 working days.
# Citation
Please cite the repo if you use the data or code in this repo.
```bibtex
@misc{align_anything,
author = {PKU-Alignment Team},
title = {Align Anything: training all modality models to follow instructions with unified language feedback},
year = {2024},
publisher = {GitHub},
journal = {GitHub repository},
howpublished = {\url{https://github.com/PKU-Alignment/align-anything}},
}
```

# License
align-anything is released under Apache License 2.0.