Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://github.com/Yxxxb/VoCo-LLaMA
VoCo-LLaMA: This repo is the official implementation of "VoCo-LLaMA: Towards Vision Compression with Large Language Models".
- Host: GitHub
- URL: https://github.com/Yxxxb/VoCo-LLaMA
- Owner: Yxxxb
- License: apache-2.0
- Created: 2024-06-17T10:32:51.000Z (7 months ago)
- Default Branch: main
- Last Pushed: 2024-07-06T11:45:49.000Z (7 months ago)
- Last Synced: 2024-10-18T23:16:02.850Z (3 months ago)
- Topics: image-compression, llama, llava
- Language: Python
- Homepage: https://yxxxb.github.io/VoCo-LLaMA-page/
- Size: 817 KB
- Stars: 78
- Watchers: 4
- Forks: 4
- Open Issues: 2
Metadata Files:
- Readme: README.md
- License: LICENSE
Awesome Lists containing this project
- awesome-token-merge-for-mllms
README
# VoCo-LLaMA: Towards Vision Compression with Large Language Models
[Xubing Ye](https://github.com/Yxxxb), [Yukang Gan](https://scholar.google.com/citations?user=8rltp9AAAAAJ&hl=zh-CN), [Xiaoke Huang](https://xk-huang.github.io/), [Yixiao Ge](https://geyixiao.com/), [Ying Shan](https://scholar.google.com/citations?user=4oXBp9UAAAAJ&hl=en), [Yansong Tang](https://andytang15.github.io)
## TL;DR
We propose VoCo-LLaMA, the first approach to compress vision tokens using LLMs. By fully exploiting how LLMs understand vision tokens, our method compresses hundreds of vision tokens into a single VoCo token while minimizing visual information loss.
VoCo-LLaMA also demonstrates video understanding through continued training on time series of compressed token sequences of video frames.
VoCo-LLaMA presents a promising way to unlock the full potential of VLMs' context windows.
![image](https://i.imgur.com/wznshA6.jpeg)
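At a high level, the compression can be pictured as an attention-mask constraint: the VoCo token attends to the vision tokens, while the text tokens attend to the VoCo token rather than to the vision tokens themselves. The snippet below is a minimal, illustrative sketch of such a mask, assuming a `[vision | VoCo | text]` sequence layout; the `build_voco_attention_mask` helper is hypothetical and not part of this repository.

```python
# Illustrative sketch only (not the repository's implementation): a causal mask in
# which text tokens may attend to the VoCo token but not to the raw vision tokens,
# so visual information reaches the text only through the compressed VoCo token.
import torch

def build_voco_attention_mask(n_vision: int, n_voco: int, n_text: int) -> torch.Tensor:
    """Boolean mask of shape (seq, seq); True means attention is allowed."""
    n = n_vision + n_voco + n_text
    mask = torch.tril(torch.ones(n, n, dtype=torch.bool))  # standard causal mask
    # Assumed sequence layout: [vision | VoCo | text].
    text_start = n_vision + n_voco
    mask[text_start:, :n_vision] = False  # text cannot see vision tokens directly
    return mask

# e.g. 576 CLIP patch tokens (as in LLaVA-1.5), 1 VoCo token, 32 text tokens
mask = build_voco_attention_mask(576, 1, 32)
print(mask.shape)  # torch.Size([609, 609])
```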
## News
- [x] **[2024/06/17]** Upload paper and release vision compression code.
## Preparation
### Install
1. Clone this repository and navigate to the VoCo-LLaMA folder
```bash
git clone https://github.com/Yxxxb/VoCo-LLaMA.git
cd VoCo-LLaMA
```

2. Install Package
```Shell
conda create -n voco_llama python=3.10 -y
conda activate voco_llama
pip install --upgrade pip # enable PEP 660 support
pip install -e .
```

3. Install additional packages for training cases
```
pip install -e ".[train]"
pip install flash-attn --no-build-isolation
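# Patch the installed transformers package with the repo's modified attention-mask utilities.
# Note: the destination path below assumes a Miniconda env at /data/miniconda3; adjust the
# site-packages path to your environment, e.g. locate it with:
#   python -c "import transformers, os; print(os.path.dirname(transformers.__file__))"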
cp VoCo-LLaMA/llava/model/language_model/cache_py/modeling_attn_mask_utils.py /data/miniconda3/envs/voco_llama/lib/python3.10/site-packages/transformers/modeling_attn_mask_utils.py
```

### Data and Pre-trained weights
VoCo-LLaMA training requires only visual instruction fine-tuning. Please download the aligned LLaVA checkpoints ([base LLM and projection layers](https://github.com/haotian-liu/LLaVA/blob/main/docs/MODEL_ZOO.md)). Please also download the annotations for the LLaVA instruction-tuning data, [llava_v1_5_mix665k.json](https://huggingface.co/datasets/liuhaotian/LLaVA-Instruct-150K/blob/main/llava_v1_5_mix665k.json), and the images from the constituent datasets:
- COCO: [train2017](http://images.cocodataset.org/zips/train2017.zip)
- GQA: [images](https://downloads.cs.stanford.edu/nlp/data/gqa/images.zip)
- OCR-VQA: [download script](https://drive.google.com/drive/folders/1_GYPY5UkUy7HIcR0zq3ZCFgeZN7BAfm_?usp=sharing), we save all files as `.jpg`
- TextVQA: [train_val_images](https://dl.fbaipublicfiles.com/textvqa/images/train_val_images.zip)
- VisualGenome: [part1](https://cs.stanford.edu/people/rak248/VG_100K_2/images.zip), [part2](https://cs.stanford.edu/people/rak248/VG_100K_2/images2.zip)

After downloading all of them, organize the data as follows in `./playground/data`:
```
├── coco
│   └── train2017
├── gqa
│   └── images
├── ocr_vqa
│   └── images
├── textvqa
│   └── train_images
└── vg
    ├── VG_100K
    └── VG_100K_2
```
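As a quick sanity check before training, the snippet below simply verifies that the folders from the tree above exist under `./playground/data`; it is an illustrative helper, not part of this repository.

```python
# Sanity check (not part of this repository): verify the image folders from the
# tree above exist under ./playground/data before launching training.
from pathlib import Path

expected = [
    "coco/train2017",
    "gqa/images",
    "ocr_vqa/images",
    "textvqa/train_images",
    "vg/VG_100K",
    "vg/VG_100K_2",
]

root = Path("playground/data")
missing = [d for d in expected if not (root / d).is_dir()]
print("All image folders found." if not missing else f"Missing folders: {missing}")
```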
## Train

VoCo-LLaMA is trained on 8 A100 GPUs with 40GB of memory each. To train on fewer GPUs, you can reduce the `per_device_train_batch_size` and increase the `gradient_accumulation_steps` accordingly. Always keep the global batch size the same: `per_device_train_batch_size` x `gradient_accumulation_steps` x `num_gpus`.
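As a worked example of this arithmetic, the sketch below recomputes `gradient_accumulation_steps` for a smaller GPU count, assuming a baseline global batch size of 128 (8 GPUs x `per_device_train_batch_size` 16 x `gradient_accumulation_steps` 1, borrowed from LLaVA's recipe); consult `scripts/finetune.sh` for the values actually used here.

```python
# Worked example of keeping the global batch size fixed when changing GPU count.
# Baseline values are assumptions borrowed from LLaVA's fine-tuning recipe
# (8 GPUs x per_device_train_batch_size 16 x gradient_accumulation_steps 1 = 128);
# check scripts/finetune.sh for the values actually used by VoCo-LLaMA.
def grad_accum_steps(global_batch: int, per_device_batch: int, num_gpus: int) -> int:
    assert global_batch % (per_device_batch * num_gpus) == 0, "must divide evenly"
    return global_batch // (per_device_batch * num_gpus)

# Moving from 8 GPUs to 4 while keeping the global batch size at 128:
print(grad_accum_steps(global_batch=128, per_device_batch=16, num_gpus=4))  # -> 2
```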
Train VoCo-LLaMA with vision instruction tuning by running the following command:
```
bash scripts/finetune.sh
```

## Evaluation
For evaluations of visual understanding, we follow the relevant settings in LLaVA. Please refer to the official LLaVA [repository](https://github.com/haotian-liu/LLaVA/blob/main/docs/Evaluation.md) for details of data setup and testing.
## Citation
If you find this work useful, please consider citing our paper:
```bibtex
@article{ye2024voco,
  author  = {Ye, Xubing and Gan, Yukang and Huang, Xiaoke and Ge, Yixiao and Shan, Ying and Tang, Yansong},
  title   = {{VoCo-LLaMA: Towards Vision Compression with Large Language Models}},
  journal = {arXiv preprint arXiv:2406.12275},
  year    = {2024},
}
```
## Acknowledgement
- [LLaVA](https://github.com/haotian-liu/LLaVA): the codebase we built upon.
- [Vicuna](https://github.com/lm-sys/FastChat): our base model Vicuna-7B, which has amazing language capabilities!