Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://github.com/thu-nics/FrameFusion
Combining Similarity and Importance for Video Token Reduction on Large Visual Language Models
Last synced: 13 days ago
- Host: GitHub
- URL: https://github.com/thu-nics/FrameFusion
- Owner: thu-nics
- License: mit
- Created: 2024-12-30T03:29:34.000Z (29 days ago)
- Default Branch: main
- Last Pushed: 2025-01-06T12:51:30.000Z (21 days ago)
- Last Synced: 2025-01-06T13:42:54.306Z (21 days ago)
- Language: Python
- Size: 19.5 MB
- Stars: 9
- Watchers: 4
- Forks: 0
- Open Issues: 0
- Metadata Files:
- Readme: README.md
- License: LICENSE
Awesome Lists containing this project
- awesome-token-merge-for-mllms
README
# FrameFusion: Combining Similarity and Importance for Video Token Reduction on Large Visual Language Models
**[[arXiv](https://arxiv.org/abs/2501.01986)]** **[[Project Page](https://thu-nics.github.io/FrameFusion_Project_Page/)]**
FrameFusion reduces the number of tokens in Large Vision-Language Models (LVLMs) by combining similarity-based merging with importance-based pruning. It achieves a 70% vision token reduction, 3.4–4.4× LLM speedups, and 1.6–1.9× end-to-end speedups with minimal performance impact.
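To make the idea concrete, here is a minimal, self-contained sketch of the two steps on a dummy token tensor. The merging rule, the thresholds, and the importance scores are illustrative assumptions, not the library's implementation:

```python
import torch

def reduce_tokens(tokens, importance, similarity_threshold=0.6, keep_ratio=0.3):
    """tokens: (N, D) vision tokens; importance: (N,) per-token scores."""
    # Step 1: similarity-based merging -- fold each token into its predecessor
    # when their cosine similarity exceeds the threshold.
    sim = torch.nn.functional.cosine_similarity(tokens[1:], tokens[:-1], dim=-1)
    keep = torch.ones(tokens.size(0), dtype=torch.bool)
    for i in range(1, tokens.size(0)):
        if sim[i - 1] > similarity_threshold:
            tokens[i - 1] = (tokens[i - 1] + tokens[i]) / 2
            importance[i - 1] = torch.maximum(importance[i - 1], importance[i])
            keep[i] = False
    tokens, importance = tokens[keep], importance[keep]

    # Step 2: importance-based pruning -- keep only the top-scoring tokens,
    # preserving their original order.
    k = max(1, int(keep_ratio * tokens.size(0)))
    top = torch.topk(importance, k).indices.sort().values
    return tokens[top]

# 16 frames x 196 patch tokens of dimension 1024, with random importance scores
vision_tokens = torch.randn(16 * 196, 1024)
scores = torch.rand(16 * 196)
print(reduce_tokens(vision_tokens, scores).shape)  # far fewer than 3136 tokens
```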
## Environment Setup
Create a new environment:
```bash
conda create -n framefusion python=3.10
conda activate framefusion
```

Install the dependencies:
```bash
pip install -r requirements.txt
```

Install FrameFusion:
```bash
pip install -e .
```

To use the LLaVA-Video LVLM, you also need to install its dependencies. We recommend cloning the [official repository](https://github.com/LLaVA-VL/LLaVA-NeXT) and then installing it with `pip install -e .` inside the cloned repository.
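For reference, those steps typically look like the following (the clone location is your choice):

```bash
git clone https://github.com/LLaVA-VL/LLaVA-NeXT.git
cd LLaVA-NeXT
pip install -e .
```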
## How to
### Run an example
We provide an example in `script/playground/example_llava.py` that runs inference on a video with the LLaVA-Video-7B model, with or without FrameFusion.
```bash
python script/playground/example_llava.py
```

### Apply FrameFusion
You can apply FrameFusion in your own code to any Hugging Face model that supports the interface with a few lines of code. Here is an example:
```python
from llava.model.builder import load_pretrained_model
from framefusion.interface import apply_framefusion

# set attn_implementation to be sdpa
tokenizer, model, image_processor, max_length = load_pretrained_model("lmms-lab/LLaVA-Video-7B-Qwen2", None, "llava_qwen", torch_dtype="bfloat16", attn_implementation='sdpa', device_map="auto")

# apply FrameFusion
apply_framefusion(model, cost=0.3, similarity_lower_bound=0.6, ratio_lower_bound=0.1)

# use the model as usual
```

### Adapt to new models
#### Understand Code Structure
- `framefusion/`: The main package for FrameFusion.
  - `models/`: The adapters for different models.
  - `main.py`: The main implementation of FrameFusion.
  - `interface.py`: The interface for applying FrameFusion.
- `scripts/`: Scripts for running experiments.
  - `evaluate/`: Scripts for evaluating model performance.
  - `playground/`: Scripts for running misc experiments.
- `example/`: Example input videos.

#### Modify the code
1. Add a new model adapter in `framefusion/models/`, which applies FrameFusion after the attention module (see the toy sketch after this list).
> Three model functions are required: `llm_forward`, `decoder_forward`, and `attention_forward`. The forward functions are easily modified from the corresponding `modeling_*.py` files in Hugging Face Transformers. All modifications are marked with `###` comments. For the LLM, see `framefusion/models/qwen2/modeling_qwen2.py` as an example.
2. Register the model in `framefusion/interface.py`, which applies FrameFusion to the correct model class.
3. Add a new example in `script/playground/` that shows how to apply FrameFusion to the model.
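As a toy, self-contained illustration of step 1's idea (reducing tokens right after the attention module), the snippet below keeps only the tokens that receive the most attention inside a dummy decoder layer. The class, the reduction rule, and the hyper-parameters are assumptions for demonstration only; follow `framefusion/models/qwen2/modeling_qwen2.py` for the real adapter pattern.

```python
import torch
import torch.nn as nn

class ToyDecoderLayer(nn.Module):
    def __init__(self, dim=64, keep_ratio=0.7):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads=4, batch_first=True)
        self.norm = nn.LayerNorm(dim)
        self.keep_ratio = keep_ratio

    def forward(self, hidden_states):
        normed = self.norm(hidden_states)
        attn_out, attn_weights = self.attn(normed, normed, normed, need_weights=True)
        hidden_states = hidden_states + attn_out
        ### token reduction hook: keep the tokens that receive the most attention
        importance = attn_weights.mean(dim=1)  # (batch, seq): attention received per token
        k = max(1, int(self.keep_ratio * hidden_states.size(1)))
        idx = importance.topk(k, dim=-1).indices.sort(dim=-1).values
        hidden_states = torch.gather(
            hidden_states, 1, idx.unsqueeze(-1).expand(-1, -1, hidden_states.size(-1))
        )
        return hidden_states

layer = ToyDecoderLayer()
tokens = torch.randn(2, 100, 64)   # batch of 2 sequences, 100 tokens each
print(layer(tokens).shape)          # torch.Size([2, 70, 64])
```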
#### Happy to help
If you have any questions about applying FrameFusion to a new model, please feel free to open an issue. We are happy to help and to extend the adapters to more models.