Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://github.com/Yxxxb/VoCo-LLaMA
VoCo-LLaMA: This repo is the official implementation of "VoCo-LLaMA: Towards Vision Compression with Large Language Models".
- Host: GitHub
- URL: https://github.com/Yxxxb/VoCo-LLaMA
- Owner: Yxxxb
- License: apache-2.0
- Created: 2024-06-17T10:32:51.000Z (7 months ago)
- Default Branch: main
- Last Pushed: 2024-07-06T11:45:49.000Z (7 months ago)
- Last Synced: 2024-10-18T23:16:02.850Z (3 months ago)
- Topics: image-compression, llama, llava
- Language: Python
- Homepage: https://yxxxb.github.io/VoCo-LLaMA-page/
- Size: 817 KB
- Stars: 78
- Watchers: 4
- Forks: 4
- Open Issues: 2
Metadata Files:
- Readme: README.md
- License: LICENSE
Awesome Lists containing this project
- awesome-token-merge-for-mllms
README
# VoCo-LLaMA: Towards Vision Compression with Large Language Models
[Xubing Ye](https://github.com/Yxxxb), [Yukang Gan](https://scholar.google.com/citations?user=8rltp9AAAAAJ&hl=zh-CN), [Xiaoke Huang](https://xk-huang.github.io/), [Yixiao Ge](https://geyixiao.com/), [Ying Shan](https://scholar.google.com/citations?user=4oXBp9UAAAAJ&hl=en), [Yansong Tang](https://andytang15.github.io)
## TL;DR
We propose VoCo-LLaMA, the first approach to compress vision tokens using LLMs. By fully exploiting how LLMs understand vision tokens, our method compresses hundreds of vision tokens into a single VoCo token while minimizing visual information loss.
VoCo-LLaMA also demonstrates video understanding through continued training on time series of compressed token sequences of video frames.
VoCo-LLaMA presents a promising way to unlock the full potential of VLMs' context windows.
![image](https://i.imgur.com/wznshA6.jpeg)
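At a high level, the compression can be pictured as an attention-mask constraint: the VoCo token attends to the vision tokens, while the text tokens attend to the VoCo token rather than to the vision tokens themselves. The snippet below is a minimal, illustrative sketch of such a mask, assuming a `[vision | VoCo | text]` sequence layout; the `build_voco_attention_mask` helper is hypothetical and not part of this repository.

```python
# Illustrative sketch only (not the repository's implementation): a causal mask in
# which text tokens may attend to the VoCo token but not to the raw vision tokens,
# so visual information reaches the text only through the compressed VoCo token.
import torch

def build_voco_attention_mask(n_vision: int, n_voco: int, n_text: int) -> torch.Tensor:
    """Boolean mask of shape (seq, seq); True means attention is allowed."""
    n = n_vision + n_voco + n_text
    mask = torch.tril(torch.ones(n, n, dtype=torch.bool))  # standard causal mask
    # Assumed sequence layout: [vision | VoCo | text].
    text_start = n_vision + n_voco
    mask[text_start:, :n_vision] = False  # text cannot see vision tokens directly
    return mask

# e.g. 576 CLIP patch tokens (as in LLaVA-1.5), 1 VoCo token, 32 text tokens
mask = build_voco_attention_mask(576, 1, 32)
print(mask.shape)  # torch.Size([609, 609])
```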
## News
- [x] **[2024/06/17]** Upload paper and release vision compression code.
## Preparation
### Install
1. Clone this repository and navigate to the VoCo-LLaMA folder
```bash
git clone https://github.com/Yxxxb/VoCo-LLaMA.git
cd VoCo-LLaMA
```

2. Install Package
```Shell
conda create -n voco_llama python=3.10 -y
conda activate voco_llama
pip install --upgrade pip # enable PEP 660 support
pip install -e .
```

3. Install additional packages for training cases
```
pip install -e ".[train]"
pip install flash-attn --no-build-isolation
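# Patch the installed transformers package with the repo's modified attention-mask utilities.
# Note: the destination path below assumes a Miniconda env at /data/miniconda3; adjust the
# site-packages path to your environment, e.g. locate it with:
#   python -c "import transformers, os; print(os.path.dirname(transformers.__file__))"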
cp VoCo-LLaMA/llava/model/language_model/cache_py/modeling_attn_mask_utils.py /data/miniconda3/envs/voco_llama/lib/python3.10/site-packages/transformers/modeling_attn_mask_utils.py
```

### Data and Pre-trained weights
VoCo-LLaMA training requires only visual instruction fine-tuning. Please download the aligned LLaVA checkpoints ([base LLM and projection layers](https://github.com/haotian-liu/LLaVA/blob/main/docs/MODEL_ZOO.md)). Please also download the annotations for the LLaVA instruction-tuning data, [llava_v1_5_mix665k.json](https://huggingface.co/datasets/liuhaotian/LLaVA-Instruct-150K/blob/main/llava_v1_5_mix665k.json), and the images from the constituent datasets:
- COCO: [train2017](http://images.cocodataset.org/zips/train2017.zip)
- GQA: [images](https://downloads.cs.stanford.edu/nlp/data/gqa/images.zip)
- OCR-VQA: [download script](https://drive.google.com/drive/folders/1_GYPY5UkUy7HIcR0zq3ZCFgeZN7BAfm_?usp=sharing), we save all files as `.jpg`
- TextVQA: [train_val_images](https://dl.fbaipublicfiles.com/textvqa/images/train_val_images.zip)
- VisualGenome: [part1](https://cs.stanford.edu/people/rak248/VG_100K_2/images.zip), [part2](https://cs.stanford.edu/people/rak248/VG_100K_2/images2.zip)

After downloading all of them, organize the data as follows in `./playground/data`:
```
├── coco
│   └── train2017
├── gqa
│   └── images
├── ocr_vqa
│   └── images
├── textvqa
│   └── train_images
└── vg
    ├── VG_100K
    └── VG_100K_2
```
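As a quick sanity check before training, the snippet below simply verifies that the folders from the tree above exist under `./playground/data`; it is an illustrative helper, not part of this repository.

```python
# Sanity check (not part of this repository): verify the image folders from the
# tree above exist under ./playground/data before launching training.
from pathlib import Path

expected = [
    "coco/train2017",
    "gqa/images",
    "ocr_vqa/images",
    "textvqa/train_images",
    "vg/VG_100K",
    "vg/VG_100K_2",
]

root = Path("playground/data")
missing = [d for d in expected if not (root / d).is_dir()]
print("All image folders found." if not missing else f"Missing folders: {missing}")
```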
## Train

VoCo-LLaMA is trained on 8 A100 GPUs with 40GB of memory each. To train on fewer GPUs, you can reduce the `per_device_train_batch_size` and increase the `gradient_accumulation_steps` accordingly. Always keep the global batch size the same: `per_device_train_batch_size` x `gradient_accumulation_steps` x `num_gpus`.
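As a worked example of this arithmetic, the sketch below recomputes `gradient_accumulation_steps` for a smaller GPU count, assuming a baseline global batch size of 128 (8 GPUs x `per_device_train_batch_size` 16 x `gradient_accumulation_steps` 1, borrowed from LLaVA's recipe); consult `scripts/finetune.sh` for the values actually used here.

```python
# Worked example of keeping the global batch size fixed when changing GPU count.
# Baseline values are assumptions borrowed from LLaVA's fine-tuning recipe
# (8 GPUs x per_device_train_batch_size 16 x gradient_accumulation_steps 1 = 128);
# check scripts/finetune.sh for the values actually used by VoCo-LLaMA.
def grad_accum_steps(global_batch: int, per_device_batch: int, num_gpus: int) -> int:
    assert global_batch % (per_device_batch * num_gpus) == 0, "must divide evenly"
    return global_batch // (per_device_batch * num_gpus)

# Moving from 8 GPUs to 4 while keeping the global batch size at 128:
print(grad_accum_steps(global_batch=128, per_device_batch=16, num_gpus=4))  # -> 2
```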
Train VoCo-LLaMA with vision instruction tuning by running the following command:
```
bash scripts/finetune.sh
```

## Evaluation
For evaluations of visual understanding, we follow the relevant settings in LLaVA. Please refer to the official LLaVA [repository](https://github.com/haotian-liu/LLaVA/blob/main/docs/Evaluation.md) for details of data setup and testing.
## Citation
If you find this work useful, please consider citing our paper:
```bibtex
@article{ye2024voco,
  author  = {Ye, Xubing and Gan, Yukang and Huang, Xiaoke and Ge, Yixiao and Shan, Ying and Tang, Yansong},
  title   = {{VoCo-LLaMA: Towards Vision Compression with Large Language Models}},
  journal = {arXiv preprint arXiv:2406.12275},
  year    = {2024},
}
```
## Acknowledgement
- [LLaVA](https://github.com/haotian-liu/LLaVA): the codebase we built upon.
- [Vicuna](https://github.com/lm-sys/FastChat): our base model Vicuna-7B, which has amazing language capabilities!