An open API service indexing awesome lists of open source software.

https://github.com/2u1/llama3.2-vision-finetune

An open-source implementaion for fine-tuning Llama3.2-Vision series by Meta.
https://github.com/2u1/llama3.2-vision-finetune

llama3 multi-modal vision-language vision-language-model

Last synced: about 1 year ago
JSON representation

An open-source implementaion for fine-tuning Llama3.2-Vision series by Meta.

Awesome Lists containing this project

README

          

# Fine-tuning Llama3.2-Vision

This repository contains a script for training [Llama3.2-Vision](https://huggingface.co/meta-llama/Llama-3.2-11B-Vision-Instruct) with only using HuggingFace and [Liger-Kernel](https://github.com/linkedin/Liger-Kernel).

## Other projects

**[[Phi3-Vision Finetuning]](https://github.com/2U1/Phi3-Vision-Finetune)**

**[[Qwen2-VL Finetuning]](https://github.com/2U1/Qwen2-VL-Finetune)**

**[[Molmo Finetuning]](https://github.com/2U1/Molmo-Finetune)**

**[[Pixtral Finetune]](https://github.com/2U1/Pixtral-Finetune)**

**[[SmolVLM Finetune]](https://github.com/2U1/SmolVLM-Finetune)**

**[[Gemma3 Finetune]](https://github.com/2U1/Gemma3-Finetune)**

## Update

- [2025/01/24] Add option for using DoRA.
- [2025/01/24] Fix error in LoRA training.
- [2025/01/18] 🔥Supports mixed-modality data.
- [2025/01/11] Updated 8-bit training using ms_amp fp8 with opt_level O3.
- [2024/11/05] Add memory efficient 8-bit training.
- [2024/11/05] 🔥Supports training with liger-kernel.
- [2024/10/04] 🔥Supports text-only data.

## Table of Contents

- [Fine-tuning Llama3.2-Vision](#fine-tuning-llama32-vision)
- [Other projects](#other-projects)
- [Update](#update)
- [Table of Contents](#table-of-contents)
- [Supported Features](#supported-features)
- [Docker](#docker)
- [Installation](#installation)
- [Environments](#environments)
- [Using `environment.yaml`](#using-environmentyaml)
- [Dataset Preparation](#dataset-preparation)
- [Training](#training)
- [Full Finetuning](#full-finetuning)
- [Full Finetuning with 8-bit](#full-finetuning-with-8-bit)
- [Finetune with LoRA](#finetune-with-lora)
- [Train with video dataset](#train-with-video-dataset)
- [Merge LoRA Weights](#merge-lora-weights)
- [Issue for libcudnn error](#issue-for-libcudnn-error)
- [TODO](#todo)
- [Known Issues](#known-issues)
- [License](#license)
- [Citation](#citation)
- [Acknowledgement](#acknowledgement)

## Supported Features

- Deepspeed
- LoRA, QLoRA
- Full-finetuning
- Multi-image and video training

## Docker

To simplfy the setting process for training, you could use the provided pre-build environments.

The settings are done in the conda env named `train`.


You could find more information about the image [here](https://hub.docker.com/repository/docker/john119/vlm/general).

```
docker pull john119/vlm:v1
docker run --gpus all -it -v /host/path:/docker/path --name vlm --ipc=host john119/vlm:v1 /bin/bash
```

## Installation

### Environments

- Ubuntu 22.04
- Nvidia-Driver 550.120
- Cuda version 12.4

Install the required packages using `environment.yml`.

### Using `environment.yaml`

```bash
conda env create -f environment.yaml
conda activate llama
```

**Note:** Llama3.2-Vision does not support flash-attention2 for now.

## Dataset Preparation

The script requires a dataset formatted according to the LLaVA specification. The dataset should be a JSON file where each entry contains information about conversations and images. Ensure that the image paths in the dataset match the provided `--image_folder`.

**When using a multi-image dataset, the image tokens should all be ``, and the image file names should have been in a list.**
**Please see the example below and follow format your data.**

Example for single image dataset

```json
[
{
"id": "000000033471",
"image": "000000033471.jpg",
"conversations": [
{
"from": "human",
"value": "\nWhat are the colors of the bus in the image?"
},
{
"from": "gpt",
"value": "The bus in the image is white and red."
},
{
"from": "human",
"value": "What feature can be seen on the back of the bus?"
},
{
"from": "gpt",
"value": "The back of the bus features an advertisement."
},
{
"from": "human",
"value": "Is the bus driving down the street or pulled off to the side?"
},
{
"from": "gpt",
"value": "The bus is driving down the street, which is crowded with people and other vehicles."
}
]
}
...
]
```

Example for multi image dataset

```json
[
{
"id": "000000033471",
"image": ["000000033471.jpg", "000000033472.jpg"],
"conversations": [
{
"from": "human",
"value": "\n\nIs the perspective of the camera differnt?"
},
{
"from": "gpt",
"value": "Yes, It the perspective of the camera is different."
}
]
}
...
]
```

Example for video dataset

```json
[
{
"id": "sample1",
"video": "sample1.mp4",
"conversations": [
{
"from": "human",
"value": "\nWhat is going on in this video?"
},
{
"from": "gpt",
"value": "A man is walking down the road."
}
]
}
...
]
```

**Note:** Llama3.2-Vision uses a video as a sequential of images.

## Training

**Note:** Deepspeed zero2 is faster than zero3, however it consumes more memory. Also, most of the time zero2 is more stable than zero3.


**Tip:** You could use `adamw_bnb_8bit` for optimizer to save memory.

To run the training script, use the following command:

### Full Finetuning

```bash
bash scripts/finetune.sh
```

### Full Finetuning with 8-bit

```bash
bash scripts/finetune_8bit.sh
```

**You need to install [ms-amp](https://github.com/Azure/MS-AMP) to use this script.**

This script will finetune the model with fp8 model dtype. If you run out of vram, you could use this.

You could combine fp8 training with offloading.

### Finetune with LoRA

If you want to train only the language model with LoRA and perform full training for the vision model:

```bash
bash scripts/finetune_lora.sh
```

If you want to train both the language model and the vision model with LoRA:

```bash
bash scripts/finetune_lora_vision.sh
```

**IMPORTANT:** If you want to tune the `embed_token` with LoRA, You need to tune `lm_head` together.

Training arguments

- `--deepspeed` (str): Path to DeepSpeed config file (default: "scripts/zero2.json").
- `--data_path` (str): Path to the LLaVA formatted training data (a JSON file). **(Required)**
- `--image_folder` (str): Path to the images folder as referenced in the LLaVA formatted training data. **(Required)**
- `--model_id` (str): Path to the Llama3.2-Vision model. **(Required)**
- `--optim` (str): Optimizer when training (default: `adamw_torch`).
- `--output_dir` (str): Output directory for model checkpoints
- `--num_train_epochs` (int): Number of training epochs (default: 1).
- `--per_device_train_batch_size` (int): Training batch size per GPU per forwarding step.
- `--gradient_accumulation_steps` (int): Gradient accumulation steps (default: 4).
- `--freeze_vision_tower` (bool): Option to freeze vision_model (default: False).
- `--tune_merger` (bool): Option to tune projector (default: True).
- `--num_lora_modules` (int): Number of target modules to add LoRA (-1 means all layers).
- `--vision_lr` (float): Learning rate for vision_model.
- `--projector_lr` (float): Learning rate for projector.
- `--learning_rate` (float): Learning rate for language module.
- `--bf16` (bool): Option for using bfloat16.
- `--fp16` (bool): Option for using fp16.
- `--lora_enable` (bool): Option for enabling LoRA (default: False)
- `--vision_lora` (bool): Option for including vision_tower to the LoRA module. The `lora_enable` should be `True` to use this option. (default: False)
- `--use_dora` (bool): Option for using DoRA instead of LoRA. The `lora_enable` should be `True` to use this option. (default: False)
- `--lora_namespan_exclude` (str): Exclude modules with namespans to add LoRA.
- `--max_seq_length` (int): Maximum sequence length (default: 128K).
- `--bits` (int): Quantization bits (default: 16).
- `--disable_flash_attn2` (bool): Disable Flash Attention 2.
- `--report_to` (str): Reporting tool (choices: 'tensorboard', 'wandb', 'none') (default: 'tensorboard').
- `--logging_dir` (str): Logging directory (default: "./tf-logs").
- `--lora_rank` (int): LoRA rank (default: 128).
- `--lora_alpha` (int): LoRA alpha (default: 256).
- `--lora_dropout` (float): LoRA dropout (default: 0.05).
- `--logging_steps` (int): Logging steps (default: 1).
- `--dataloader_num_workers` (int): Number of data loader workers (default: 4).

**Note:** The learning rate of `vision_model` should be 10x ~ 5x smaller than the `language_model`.

### Train with video dataset

You can train the model using a video dataset. However, Llama3.2-Vision processes videos as a sequence of images, so you’ll need to select specific frames and treat them as multiple images for training. You can set LoRA configs and use for LoRA too.

```bash
bash scripts/finetune_video.sh
```

If you run out of vram, you can use [zero3_offload](./scripts/zero3_offload.json) instead of [zero3](./scripts/zero3_offload.json). However, using zero3 is preferred.

#### Merge LoRA Weights

```
bash scripts/merge_lora.sh
```

**Note:** Remember to replace the paths in `finetune.sh` or `finetune_lora.sh` with your specific paths. (Also in `merge_lora.sh` when using LoRA.)

#### Issue for libcudnn error

```
Could not load library libcudnn_cnn_train.so.8. Error: /usr/local/cuda-12.1/lib/libcudnn_cnn_train.so.8: undefined symbol: _ZN5cudnn3cnn34layerNormFwd_execute_internal_implERKNS_7backend11VariantPackEP11CUstream_stRNS0_18LayerNormFwdParamsERKNS1_20NormForwardOperationEmb, version libcudnn_cnn_infer.so.8
```

You could run `unset LD_LIBRARY_PATH` for this error.
You could see this [issue](https://github.com/andimarafioti/florence2-finetuning/issues/2)

## TODO

- [x] Support for multi-image & video data
- [x] Support for batch_size > 1
- [x] Handle mixed-modality data

## Known Issues

- [libcudnn issue](#issue-for-libcudnn-error)

## License

This project is licensed under the Apache-2.0 License. See the [LICENSE](LICENSE) file for details.

## Citation

If you find this repository useful in your project, please consider giving a :star: and citing:

```bibtex
@misc{Llama3.2-Vision-Finetuning,
author = {Yuwon Lee},
title = {Llama3.2-Vision-Finetune},
year = {2024},
publisher = {GitHub},
url = {https://github.com/2U1/Llama3.2-Vision-Ft}
}
```

## Acknowledgement

This project is based on

- [LLaVA-NeXT](https://github.com/LLaVA-VL/LLaVA-NeXT): An amazing open-source project of LMM.
- [Llama3.2-Vision](https://huggingface.co/meta-llama/Llama-3.2-11B-Vision-Instruct): Awesome pretrained MLLM by Meta.