LongLLaVA: Scaling Multi-modal LLMs to 1000 Images Efficiently via Hybrid Architecture
https://github.com/freedomintelligence/longllava

Host: GitHub
URL: https://github.com/freedomintelligence/longllava
Owner: FreedomIntelligence
Created: 2024-09-02T13:49:15.000Z (over 1 year ago)
Default Branch: main
Last Pushed: 2025-01-06T06:14:15.000Z (over 1 year ago)
Last Synced: 2025-03-28T17:11:15.588Z (about 1 year ago)
Language: Python
Size: 3.83 MB
Stars: 200
Watchers: 13
Forks: 16
Open Issues: 1
Metadata Files:
- Readme: README.md


          ![header](./assets/header.png) 



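   📃 [Paper](https://arxiv.org/abs/2409.02889) • 🌐 Demo • 🤗 [LongLLaVA-53B-A13B](https://huggingface.co/FreedomIntelligence/LongLLaVA-53B-A13B) • 🤗 [LongLLaVA-9B](https://huggingface.co/FreedomIntelligence/LongLLaVA-9B)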



![efficiency](./assets/singleGPU.png) 

## 🌈 Update

* **[2024.09.05]** LongLLaVA repo is published！🎉

* **[2024.10.12]** [LongLLaVA-53B-A13B](https://huggingface.co/FreedomIntelligence/LongLLaVA-53B-A13B), [LongLLaVA-9B](https://huggingface.co/FreedomIntelligence/LongLLaVA-9B) and [Jamba-9B-Instruct](https://huggingface.co/FreedomIntelligence/Jamba-9B-Instruct) are released! 🎉

## Architecture

  ![Architecture Image](./assets/arch.png)

## Results

  - Main Results

      ![Main Results](./assets/result1.png) 

  - Diagnostic Results

      ![Diagnostic Results](./assets/diaresult.png)

  - Video-NIAH

      ![Video-NIAH](./assets/NIAH.png)

## Results reproduction

### 1. Environment Setup

  ```bash
  pip install -r requirements.txt
  ```

### 2. Data Download and Construction

  Dataset Taxonomy

  ![Dataset](./assets/dataset.png) 

- Dataset Downloading and Construction

  > Coming Soon.

### 3. Training

- Downloading Language Models

  


  🤗 [Jamba-9B-Instruct](https://huggingface.co/FreedomIntelligence/Jamba-9B-Instruct)

  


- Stage I: Single-image Alignment.

  ```bash
  bash Align.sh
  ```

- Stage II: Single-image Instruction-tuning.

  ```bash
  bash SingleImageSFT.sh
  ```

- Stage III: Multi-image Instruction-tuning. 

  ```bash
  bash MultiImageSFT.sh
  ```
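
The three stages above are listed in the order they are run. As a minimal sketch, a small driver can launch them one after another and stop at the first failure. The `run_stages` helper and its `runner` parameter are hypothetical and not part of the repository; only the script names come from the steps above.

```python
import subprocess

# Script names as given in the training steps above.
STAGE_SCRIPTS = ["Align.sh", "SingleImageSFT.sh", "MultiImageSFT.sh"]


def run_stages(scripts=STAGE_SCRIPTS, runner=subprocess.run):
    """Run each stage script with bash, in order; stop at the first failure.

    `runner` is a hypothetical hook (defaults to subprocess.run) so the
    function can be dry-run without launching any training.
    """
    completed = []
    for script in scripts:
        result = runner(["bash", script])
        if result.returncode != 0:
            raise RuntimeError(f"{script} exited with code {result.returncode}")
        completed.append(script)
    return completed
```

Calling `run_stages()` from the repository root would launch the three scripts in sequence, assuming each script is already configured with the right model and data paths.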

### 4. Evaluation

- Command Line Interface

```bash
python cli.py --model_dir path-to-longllava
```

- Model Inference

```python
from cli import Chatbot

model_dir = 'path-to-longllava'  # path to the LongLLaVA model directory
query = 'What does the picture show?'
image_paths = ['image_path1']  # image or video path

bot = Chatbot(model_dir)
output = bot.chat(query, image_paths)
print(output)  # prints the output of the model
```
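
Since `image_paths` is a list, it presumably accepts several images in one call. The sketch below gathers image files from a folder into such a list. The `collect_image_paths` helper and the extension set are illustrative assumptions, not part of the repository.

```python
from pathlib import Path

# Assumed set of image extensions; adjust to the files you actually have.
IMAGE_EXTENSIONS = {".jpg", ".jpeg", ".png"}


def collect_image_paths(folder, extensions=IMAGE_EXTENSIONS):
    """Return the sorted paths (as strings) of image files directly inside `folder`."""
    return sorted(
        str(p) for p in Path(folder).iterdir()
        if p.is_file() and p.suffix.lower() in extensions
    )
```

The result can then be passed as the second argument, for example `bot.chat(query, collect_image_paths('frames/'))`, where `frames/` is a placeholder folder name.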

- Benchmarks

```bash
bash Eval.sh
```

### 5. Reproduce Other Results in the Paper

- FLOPs

```bash
python ./utils/cal_flops.py
```

- Prefill Time & Throughput & GPU Memory Usage

```bash
python ./benchmarks/Efficiency/evaluate.py
python ./benchmarks/Efficiency/evaluatevllm.py
```
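
For orientation, prefill time and throughput are wall-clock measurements taken around a model call. The sketch below shows one generic way to take them. It is not the logic of `evaluate.py`, and `time_call` and `tokens_per_second` are hypothetical helpers introduced only for illustration.

```python
import time


def time_call(fn, *args, **kwargs):
    """Return (result, elapsed_seconds) for a single call to `fn`."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    return result, time.perf_counter() - start


def tokens_per_second(num_tokens, elapsed_seconds):
    """Throughput as tokens divided by elapsed wall-clock time."""
    if elapsed_seconds <= 0:
        raise ValueError("elapsed_seconds must be positive")
    return num_tokens / elapsed_seconds
```

When timing GPU work with PyTorch, call `torch.cuda.synchronize()` before reading the clock, since CUDA kernels launch asynchronously and the measurement would otherwise end too early.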

- DownCycling

To convert Jamba-MoE to a dense model:

```bash
python ./utils/dense_downcycling.py
```

## TO DO

- [ ] Release Data Construction Code

## Acknowledgement

- [LLaVA](https://github.com/haotian-liu/LLaVA): Visual Instruction Tuning (LLaVA) built towards GPT-4V level capabilities and beyond.

## Citation

```
@misc{wang2024longllavascalingmultimodalllms,
      title={LongLLaVA: Scaling Multi-modal LLMs to 1000 Images Efficiently via Hybrid Architecture},
      author={Xidong Wang and Dingjie Song and Shunian Chen and Chen Zhang and Benyou Wang},
      year={2024},
      eprint={2409.02889},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2409.02889},
}
```
