https://github.com/FreedomIntelligence/LongLLaVA
LongLLaVA: Scaling Multi-modal LLMs to 1000 Images Efficiently via Hybrid Architecture
- Host: GitHub
- URL: https://github.com/FreedomIntelligence/LongLLaVA
- Owner: FreedomIntelligence
- Created: 2024-09-02T13:49:15.000Z (over 1 year ago)
- Default Branch: main
- Last Pushed: 2025-01-06T06:14:15.000Z (over 1 year ago)
- Last Synced: 2025-01-12T10:04:51.970Z (over 1 year ago)
- Language: Python
- Size: 3.83 MB
- Stars: 188
- Watchers: 13
- Forks: 13
- Open Issues: 1

📃 Paper • 🌐 Demo • 🤗 LongLLaVA-53B-A13B • 🤗 LongLLaVA-9B

## 🌈 Update
* **[2024.09.05]** LongLLaVA repo is published!🎉
* **[2024.10.12]** [LongLLaVA-53B-A13B](https://huggingface.co/FreedomIntelligence/LongLLaVA-53B-A13B), [LongLLaVA-9B](https://huggingface.co/FreedomIntelligence/LongLLaVA-9B) and [Jamba-9B-Instruct](https://huggingface.co/FreedomIntelligence/Jamba-9B-Instruct) are released!🎉
## Architecture
Click to view the architecture image

## Results
Click to view the Results
- Main Results

- Diagnostic Results

- Video-NIAH

## Results reproduction
### 1. Environment Setup
```bash
pip install -r requirements.txt
```
### 2. Data Download and Construction
Dataset Taxonomy

- Dataset Downloading and Construction
> Coming Soon.
### 3. Training
- Downloading Language Models
- Stage I: Single-image Alignment.
```bash
bash Align.sh
```
- Stage II: Single-image Instruction-tuning.
```bash
bash SingleImageSFT.sh
```
- Stage III: Multi-image Instruction-tuning.
```bash
bash MultiImageSFT.sh
```
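The three stages are listed in the order they are meant to run. A minimal sketch of a driver that launches them one after another and stops at the first failure is shown here; the script names come from this README, while `run_stages` and its `runner` parameter are hypothetical, and the scripts are assumed to sit in the current working directory:

```python
import subprocess
from typing import Callable, List, Sequence

# Script names as listed in this README; assumed to be in the working directory.
STAGES: List[str] = ["Align.sh", "SingleImageSFT.sh", "MultiImageSFT.sh"]


def run_stages(
    stages: Sequence[str] = STAGES,
    runner: Callable[[List[str]], int] = subprocess.call,
) -> List[str]:
    """Run each stage with bash, in order; stop at the first non-zero exit code.

    Returns the list of stages that completed successfully.
    """
    completed: List[str] = []
    for script in stages:
        exit_code = runner(["bash", script])
        if exit_code != 0:
            print(f"{script} failed with exit code {exit_code}; stopping.")
            break
        completed.append(script)
    return completed
```

Calling `run_stages()` from the repository root would then run all three scripts in sequence. The `runner` argument exists only so the control flow can be tried out without launching a real training job.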
### 4. Evaluation
- Command Line Interface
```bash
python cli.py --model_dir path-to-longllava
```
- Model Inference
```python
from cli import Chatbot

query = 'What does the picture show?'
image_paths = ['image_path1']  # image or video path

# Pass the model directory as a string
bot = Chatbot('path-to-longllava')
output = bot.chat(query, image_paths)
print(output)  # Prints the output of the model
```
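Because `image_paths` is a list, the same call can presumably take several images in one query. A minimal sketch that gathers the image files from a folder and passes them together follows; `collect_image_paths` and `describe_folder` are hypothetical helpers, the extension set is an assumption, and only `Chatbot(...)` and `bot.chat(query, image_paths)` are taken from the snippet above:

```python
from pathlib import Path
from typing import List

# Assumed set of image extensions; adjust to what your data uses.
IMAGE_EXTENSIONS = {".jpg", ".jpeg", ".png"}


def collect_image_paths(folder: str) -> List[str]:
    """Return the image file paths in `folder`, sorted for a stable order."""
    return sorted(
        str(p) for p in Path(folder).iterdir()
        if p.is_file() and p.suffix.lower() in IMAGE_EXTENSIONS
    )


def describe_folder(model_dir: str, folder: str, query: str):
    """Ask one question about all images in `folder` (assumes multi-image input works)."""
    # Imported lazily: this needs the LongLLaVA repository on the Python path.
    from cli import Chatbot

    bot = Chatbot(model_dir)
    return bot.chat(query, collect_image_paths(folder))
```

Sorting the paths keeps the image order reproducible between runs, which matters when the question refers to images by position.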
- Benchmarks
```bash
bash Eval.sh
```
### 5. Reproduce Other Results in the Paper
- FLOPs
```bash
python ./utils/cal_flops.py
```
- Prefill Time & Throughput & GPU Memory Usage
```bash
python ./benchmarks/Efficiency/evaluate.py
python ./benchmarks/Efficiency/evaluatevllm.py
```
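For orientation, throughput is usually reported as tokens produced per second of wall-clock time. A minimal, framework-agnostic sketch of that arithmetic is given here; `measure_throughput` and the dummy workload are hypothetical and do not reproduce the logic of the repository's `evaluate.py`:

```python
import time
from typing import Callable, Tuple


def measure_throughput(generate: Callable[[], int], repeats: int = 3) -> Tuple[float, float]:
    """Time `generate`, which returns the number of tokens it produced.

    Returns (mean seconds per call, tokens per second over all calls).
    """
    total_tokens = 0
    start = time.perf_counter()
    for _ in range(repeats):
        total_tokens += generate()
    elapsed = time.perf_counter() - start
    return elapsed / repeats, total_tokens / elapsed


def dummy_generate() -> int:
    """Stand-in workload: sleeps briefly and pretends to emit 100 tokens."""
    time.sleep(0.01)
    return 100


mean_latency, tokens_per_second = measure_throughput(dummy_generate)
print(f"{mean_latency:.4f} s/call, {tokens_per_second:.1f} tokens/s")
```

When timing a real model on a GPU, asynchronous kernels need to finish before the timer is read (for example via `torch.cuda.synchronize()`), and peak memory can be read with `torch.cuda.max_memory_allocated()`.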
- DownCycling
To convert Jamba-MoE to a dense model:
```bash
python ./utils/dense_downcycling.py
```
## TO DO
- [ ] Release Data Construction Code
## Acknowledgement
- [LLaVA](https://github.com/haotian-liu/LLaVA): Visual Instruction Tuning (LLaVA) built towards GPT-4V level capabilities and beyond.
## Citation
```
@misc{wang2024longllavascalingmultimodalllms,
      title={LongLLaVA: Scaling Multi-modal LLMs to 1000 Images Efficiently via Hybrid Architecture},
      author={Xidong Wang and Dingjie Song and Shunian Chen and Chen Zhang and Benyou Wang},
      year={2024},
      eprint={2409.02889},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2409.02889},
}
```