https://github.com/tencentarc/st-llm

[ECCV 2024🔥] Official implementation of the paper "ST-LLM: Large Language Models Are Effective Temporal Learners"

large-language-models video-language-model video-understanding

Host: GitHub
URL: https://github.com/tencentarc/st-llm
Owner: TencentARC
License: apache-2.0
Created: 2024-03-28T13:06:58.000Z (over 2 years ago)
Default Branch: main
Last Pushed: 2024-09-10T13:25:18.000Z (almost 2 years ago)
Last Synced: 2025-05-07T11:52:13.856Z (about 1 year ago)
Topics: large-language-models, video-language-model, video-understanding
Language: Python
Homepage:
Size: 19 MB
Stars: 145
Watchers: 6
Forks: 5
Open Issues: 10
Metadata Files:
- Readme: README.md
- License: LICENSE

# ST-LLM: Large Language Models Are Effective Temporal Learners





[![hf](https://img.shields.io/badge/🤗-Hugging%20Face-blue.svg)](https://huggingface.co/farewellthree/ST_LLM_weight/tree/main)

[![arXiv](https://img.shields.io/badge/Arxiv-2404.00308-b31b1b.svg?logo=arXiv)](https://arxiv.org/abs/2404.00308)

[![License](https://img.shields.io/badge/Code%20License-Apache2.0-yellow)](https://github.com/farewellthree/ST-LLM/blob/main/LICENSE)



[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/st-llm-large-language-models-are-effective-1/video-question-answering-on-mvbench)](https://paperswithcode.com/sota/video-question-answering-on-mvbench?p=st-llm-large-language-models-are-effective-1)

[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/st-llm-large-language-models-are-effective-1/video-based-generative-performance)](https://paperswithcode.com/sota/video-based-generative-performance?p=st-llm-large-language-models-are-effective-1)

[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/st-llm-large-language-models-are-effective-1/video-based-generative-performance-1)](https://paperswithcode.com/sota/video-based-generative-performance-1?p=st-llm-large-language-models-are-effective-1)

[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/st-llm-large-language-models-are-effective-1/video-based-generative-performance-5)](https://paperswithcode.com/sota/video-based-generative-performance-5?p=st-llm-large-language-models-are-effective-1)

[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/st-llm-large-language-models-are-effective-1/video-based-generative-performance-2)](https://paperswithcode.com/sota/video-based-generative-performance-2?p=st-llm-large-language-models-are-effective-1)

[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/st-llm-large-language-models-are-effective-1/video-based-generative-performance-3)](https://paperswithcode.com/sota/video-based-generative-performance-3?p=st-llm-large-language-models-are-effective-1)

[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/st-llm-large-language-models-are-effective-1/video-based-generative-performance-4)](https://paperswithcode.com/sota/video-based-generative-performance-4?p=st-llm-large-language-models-are-effective-1)

[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/st-llm-large-language-models-are-effective-1/zeroshot-video-question-answer-on-activitynet)](https://paperswithcode.com/sota/zeroshot-video-question-answer-on-activitynet?p=st-llm-large-language-models-are-effective-1)

[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/st-llm-large-language-models-are-effective-1/zeroshot-video-question-answer-on-msrvtt-qa)](https://paperswithcode.com/sota/zeroshot-video-question-answer-on-msrvtt-qa?p=st-llm-large-language-models-are-effective-1)

[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/st-llm-large-language-models-are-effective-1/zeroshot-video-question-answer-on-msvd-qa)](https://paperswithcode.com/sota/zeroshot-video-question-answer-on-msvd-qa?p=st-llm-large-language-models-are-effective-1)

## News :loudspeaker:

* **[2024/3/28]**  All code and weights are now available! Watch this repository for the latest updates.

## Introduction :bulb:

- **ST-LLM** is a temporal-sensitive video large language model. Our model incorporates three key architectural designs:

  - (1) Joint spatial-temporal modeling within large language models for effective video understanding.

  - (2) Dynamic masking strategy and masked video modeling for efficiency and robustness.

  - (3) Global-local input module for long video understanding.

- **ST-LLM** has established new state-of-the-art results on MVBench, VideoChatGPT Bench and VideoQA Bench:



    

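| Method | MVBench | VcgBench Avg | VcgBench Correct | VcgBench Detail | VcgBench Context | VcgBench Temporal | VcgBench Consist | VideoQABench MSVD | VideoQABench MSRVTT | VideoQABench ANet |
|---|---|---|---|---|---|---|---|---|---|---|
| VideoLLaMA | 34.1 | 1.98 | 1.96 | 2.18 | 2.16 | 1.82 | 1.79 | 51.6 | 29.6 | 12.4 |
| LLaMA-Adapter | 31.7 | 2.16 | 2.03 | 2.32 | 2.30 | 1.98 | 2.15 | 54.9 | 43.8 | 34.2 |
| VideoChat | 35.5 | 2.29 | 2.23 | 2.50 | 2.53 | 1.94 | 2.24 | 56.3 | 45.0 | 26.5 |
| VideoChatGPT | 32.7 | 2.38 | 2.40 | 2.52 | 2.62 | 1.98 | 2.37 | 64.9 | 49.3 | 35.7 |
| MovieChat | - | 2.67 | 2.76 | 2.93 | 3.01 | 2.24 | 2.42 | 74.2 | 52.7 | 45.7 |
| Vista-LLaMA | - | 2.57 | 2.44 | 2.64 | 3.18 | 2.26 | 2.31 | 65.3 | 60.5 | 48.3 |
| LLaMA-VID | - | 2.89 | 2.96 | 3.00 | 3.53 | 2.46 | 2.51 | 69.7 | 57.7 | 47.4 |
| Chat-UniVi | - | 2.99 | 2.89 | 2.91 | 3.46 | 2.89 | 2.81 | 65.0 | 54.6 | 45.8 |
| VideoChat2 | 51.1 | 2.98 | 3.02 | 2.88 | 3.51 | 2.66 | 2.81 | 70.0 | 54.1 | 49.1 |
| ST-LLM | 54.9 | 3.15 | 3.23 | 3.05 | 3.74 | 2.93 | 2.81 | 74.6 | 63.2 | 50.9 |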
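As a quick sanity check on how to read the VcgBench columns, the sketch below recomputes the Avg value from the five aspect scores (Correct, Detail, Context, Temporal, Consist). It assumes that Avg is the unweighted mean rounded to two decimals, which this README does not state explicitly; the helper name `vcg_average` is hypothetical and not part of the repository.

```python
def vcg_average(scores):
    """Unweighted mean of the five VcgBench aspect scores, rounded to 2 decimals."""
    return round(sum(scores) / len(scores), 2)


# Aspect scores in the order Correct, Detail, Context, Temporal, Consist,
# copied from the ST-LLM and VideoChat2 rows of the results table.
st_llm = [3.23, 3.05, 3.74, 2.93, 2.81]
videochat2 = [3.02, 2.88, 3.51, 2.66, 2.81]

print(vcg_average(st_llm))      # 3.15
print(vcg_average(videochat2))  # 2.98
```

Both values agree with the Avg reported for those two models.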

    

  


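The dynamic masking idea in item (2) of the introduction can be illustrated with a toy sketch. This is not the repository's implementation: it uses plain Python lists instead of tensors, the function name `sample_keep_indices` is hypothetical, and the mask-ratio bounds (0.3 to 0.7) are illustrative assumptions rather than values taken from the paper. It only shows the general pattern of re-sampling a mask ratio on every call and keeping a random, order-preserving subset of video tokens.

```python
import random


def sample_keep_indices(num_tokens, min_ratio=0.3, max_ratio=0.7, rng=None):
    """Pick which video tokens survive one round of random masking.

    The mask ratio is re-sampled on every call, which is what makes the
    masking "dynamic" here. The bounds are assumptions for illustration.
    """
    rng = rng or random.Random()
    mask_ratio = rng.uniform(min_ratio, max_ratio)
    num_keep = max(1, round(num_tokens * (1.0 - mask_ratio)))
    # Sample without replacement, then sort so temporal order is preserved.
    keep = sorted(rng.sample(range(num_tokens), num_keep))
    return keep, mask_ratio


# Toy "video": 4 frames with 8 patch tokens each, flattened in temporal order.
tokens = [f"frame{t}_patch{p}" for t in range(4) for p in range(8)]
keep, ratio = sample_keep_indices(len(tokens), rng=random.Random(0))
visible = [tokens[i] for i in keep]
print(f"mask ratio {ratio:.2f}: kept {len(visible)} of {len(tokens)} tokens")
```

Because fewer tokens reach the language model at each step, a scheme of this kind reduces the sequence length during training, which is the efficiency benefit the introduction refers to.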

## Demo 🤗

First download the conversation weights from [here](https://huggingface.co/farewellthree/ST_LLM_weight/tree/main/conversation_weight) and follow the [installation](README.md#Installation) instructions. Then run the Gradio demo:

```bash
CUDA_VISIBLE_DEVICES=0 python3 demo_gradio.py --ckpt-path /path/to/STLLM_conversation_weight
```

We also provide a local script that is easy to modify: [demo.py](demo.py)
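If you prefer to start the Gradio demo from Python, for example to fail early on a wrong checkpoint path, a minimal wrapper could look like the sketch below. It only reuses the `--ckpt-path` flag and the `CUDA_VISIBLE_DEVICES` variable from the command above; the helper names `build_demo_command` and `launch_demo` are hypothetical and not part of the repository.

```python
import os
import subprocess


def build_demo_command(ckpt_path, gpu_id=0):
    """Return (env, argv) mirroring the documented shell command."""
    env = dict(os.environ, CUDA_VISIBLE_DEVICES=str(gpu_id))
    argv = ["python3", "demo_gradio.py", "--ckpt-path", ckpt_path]
    return env, argv


def launch_demo(ckpt_path, gpu_id=0):
    """Check that the checkpoint path exists, then start the demo."""
    if not os.path.exists(ckpt_path):
        raise FileNotFoundError(f"checkpoint not found: {ckpt_path}")
    env, argv = build_demo_command(ckpt_path, gpu_id)
    # Run from the repository root so that demo_gradio.py is found.
    return subprocess.run(argv, env=env, check=True)


env, argv = build_demo_command("/path/to/STLLM_conversation_weight")
print(" ".join(argv))
```

Call `launch_demo(...)` with your real checkpoint path to actually start the demo.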













## Examples 👀

- **Video Description: for high-difficulty videos with complex scene changes, ST-LLM can accurately describe all the contents.**



  

   



- **Action Identification: ST-LLM can accurately and comprehensively describe the actions occurring in the video.**



  

   





  

   





  

   



- **Reasoning: for challenging open-ended reasoning questions, ST-LLM can also provide reasonable answers.**

  


  

   



## Installation 🛠️

Clone the repository, then create and activate a Python environment with the following commands:

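```bash
# Clone the repository and enter it
git clone https://github.com/farewellthree/ST-LLM.git
cd ST-LLM

# Create and activate a Python 3.10 conda environment
conda create --name stllm python=3.10
conda activate stllm

# Install the Python dependencies
pip install -r requirement.txt
```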
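After activating the environment, you may want to confirm that the interpreter matches the Python 3.10 requested in the `conda create` command. The following check is a small sketch using only the standard library; the helper name `python_matches` is hypothetical and not part of the repository.

```python
import sys


def python_matches(expected=(3, 10), version=None):
    """Return True if the major.minor version equals the expected pair."""
    version = sys.version_info if version is None else version
    return tuple(version[:2]) == tuple(expected)


if not python_matches():
    print(f"warning: expected Python 3.10, found {sys.version.split()[0]}")
```

Running it inside the `stllm` environment should print nothing; a warning suggests that a different environment is active.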

## Training & Validation :bar_chart:

Instructions for data preparation, training, and evaluation can be found in [trainval.md](trainval.md).

## Acknowledgement 👍

* [Video-ChatGPT](https://github.com/mbzuai-oryx/Video-ChatGPT) and [MVBench](https://github.com/OpenGVLab/Ask-Anything/tree/main/video_chat2): great work contributing video LLM benchmarks.

* [InstructBLIP](https://github.com/salesforce/LAVIS/tree/main/projects/instructblip) and [MiniGPT4](https://github.com/Vision-CAIR/MiniGPT-4/tree/main): the codebase and the base image LLM that we built upon.

## Citation ✏️

If you find the code and paper useful for your research, please consider starring this repo and citing our papers:

```
@article{liu2023one,
  title={One for all: Video conversation is feasible without video instruction tuning},
  author={Liu, Ruyang and Li, Chen and Ge, Yixiao and Shan, Ying and Li, Thomas H and Li, Ge},
  journal={arXiv preprint arXiv:2309.15785},
  year={2023}
}
```

```
@article{liu2024st,
  title={ST-LLM: Large Language Models Are Effective Temporal Learners},
  author={Liu, Ruyang and Li, Chen and Tang, Haoran and Ge, Yixiao and Shan, Ying and Li, Ge},
  journal={arXiv preprint arXiv:2404.00308},
  year={2024}
}
```
