{"id":18322489,"url":"https://github.com/tencentarc/st-llm","last_synced_at":"2025-10-08T03:18:51.054Z","repository":{"id":230554357,"uuid":"778813089","full_name":"TencentARC/ST-LLM","owner":"TencentARC","description":"[ECCV 2024🔥] Official implementation of the paper \"ST-LLM: Large Language Models Are Effective Temporal Learners\"","archived":false,"fork":false,"pushed_at":"2024-09-10T13:25:18.000Z","size":19899,"stargazers_count":145,"open_issues_count":10,"forks_count":5,"subscribers_count":6,"default_branch":"main","last_synced_at":"2025-05-07T11:52:13.856Z","etag":null,"topics":["large-language-models","video-language-model","video-understanding"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/TencentARC.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2024-03-28T13:06:58.000Z","updated_at":"2025-04-26T03:04:01.000Z","dependencies_parsed_at":"2024-11-05T18:44:38.890Z","dependency_job_id":null,"html_url":"https://github.com/TencentARC/ST-LLM","commit_stats":null,"previous_names":["tencentarc/st-llm"],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/TencentARC/ST-LLM","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/TencentARC%2FST-LLM","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/TencentARC%2FST-LLM/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/TencentARC%2FST-LLM/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositori
es/TencentARC%2FST-LLM/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/TencentARC","download_url":"https://codeload.github.com/TencentARC/ST-LLM/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/TencentARC%2FST-LLM/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":278882803,"owners_count":26062356,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","status":"online","status_checked_at":"2025-10-08T02:00:06.501Z","response_time":56,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["large-language-models","video-language-model","video-understanding"],"created_at":"2024-11-05T18:24:51.252Z","updated_at":"2025-10-08T03:18:51.027Z","avatar_url":"https://github.com/TencentARC.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"\u003cp align=\"center\" width=\"100%\"\u003e\n\u003ca target=\"_blank\"\u003e\u003cimg src=\"example/material/stllm_logo.png\" alt=\"ST-LLM\" style=\"width: 50%; min-width: 150px; display: block; margin: auto;\"\u003e\u003c/a\u003e\n\u003c/p\u003e\n\n\u003ch2 align=\"center\"\u003e \u003ca href=\"https://arxiv.org/abs/2404.00308\"\u003eST-LLM: Large Language Models Are Effective Temporal Learners\u003c/a\u003e\u003c/h2\u003e\n\n\u003ch5 
align=center\u003e\n\n[![hf](https://img.shields.io/badge/🤗-Hugging%20Face-blue.svg)](https://huggingface.co/farewellthree/ST_LLM_weight/tree/main)\n[![arXiv](https://img.shields.io/badge/Arxiv-2404.00308-b31b1b.svg?logo=arXiv)](https://arxiv.org/abs/2404.00308)\n[![License](https://img.shields.io/badge/Code%20License-Apache2.0-yellow)](https://github.com/farewellthree/ST-LLM/blob/main/LICENSE)\n\u003c/h5\u003e\n\n[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/st-llm-large-language-models-are-effective-1/video-question-answering-on-mvbench)](https://paperswithcode.com/sota/video-question-answering-on-mvbench?p=st-llm-large-language-models-are-effective-1)\n[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/st-llm-large-language-models-are-effective-1/video-based-generative-performance)](https://paperswithcode.com/sota/video-based-generative-performance?p=st-llm-large-language-models-are-effective-1)\n[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/st-llm-large-language-models-are-effective-1/video-based-generative-performance-1)](https://paperswithcode.com/sota/video-based-generative-performance-1?p=st-llm-large-language-models-are-effective-1)\n[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/st-llm-large-language-models-are-effective-1/video-based-generative-performance-5)](https://paperswithcode.com/sota/video-based-generative-performance-5?p=st-llm-large-language-models-are-effective-1)\n[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/st-llm-large-language-models-are-effective-1/video-based-generative-performance-2)](https://paperswithcode.com/sota/video-based-generative-performance-2?p=st-llm-large-language-models-are-effective-1)\n[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/st-llm-large-language-models-are-effective-1/video-based-generative-performance-3)](https://
paperswithcode.com/sota/video-based-generative-performance-3?p=st-llm-large-language-models-are-effective-1)\n[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/st-llm-large-language-models-are-effective-1/video-based-generative-performance-4)](https://paperswithcode.com/sota/video-based-generative-performance-4?p=st-llm-large-language-models-are-effective-1)\n[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/st-llm-large-language-models-are-effective-1/zeroshot-video-question-answer-on-activitynet)](https://paperswithcode.com/sota/zeroshot-video-question-answer-on-activitynet?p=st-llm-large-language-models-are-effective-1)\n[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/st-llm-large-language-models-are-effective-1/zeroshot-video-question-answer-on-msrvtt-qa)](https://paperswithcode.com/sota/zeroshot-video-question-answer-on-msrvtt-qa?p=st-llm-large-language-models-are-effective-1)\n[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/st-llm-large-language-models-are-effective-1/zeroshot-video-question-answer-on-msvd-qa)](https://paperswithcode.com/sota/zeroshot-video-question-answer-on-msvd-qa?p=st-llm-large-language-models-are-effective-1)\n\n## News :loudspeaker:\n\n* **[2024/3/28]**  All code and weights are now available! Watch this repository for the latest updates.\n\n## Introduction :bulb:\n\n- **ST-LLM** is a temporally sensitive video large language model. 
Our model incorporates three key architectural designs:\n  - (1) Joint spatial-temporal modeling within large language models for effective video understanding.\n  - (2) Dynamic masking strategy and masked video modeling for efficiency and robustness.\n  - (3) Global-local input module for long video understanding.\n- **ST-LLM** has established new state-of-the-art results on MVBench, VideoChatGPT Bench and VideoQA Bench:\n\n\u003cdiv align=\"center\"\u003e\n\u003ctable border=\"1\" width=\"100%\"\u003e\n    \u003ctr align=\"center\"\u003e\n        \u003cth rowspan=\"2\"\u003eMethod\u003c/th\u003e\u003cth rowspan=\"2\"\u003eMVBench\u003c/th\u003e\u003cth colspan=\"6\"\u003eVcgBench\u003c/th\u003e\u003cth colspan=\"3\"\u003eVideoQABench\u003c/th\u003e\n    \u003c/tr\u003e\n  \u003ctr align=\"center\"\u003e\n        \u003cth\u003eAvg\u003c/th\u003e\u003cth\u003eCorrect\u003c/th\u003e\u003cth\u003eDetail\u003c/th\u003e\u003cth\u003eContext\u003c/th\u003e\u003cth\u003eTemporal\u003c/th\u003e\u003cth\u003eConsist\u003c/th\u003e\u003cth\u003eMSVD\u003c/th\u003e\u003cth\u003eMSRVTT\u003c/th\u003e\u003cth\u003eANet\u003c/th\u003e\n    \u003c/tr\u003e\n  \u003ctr align=\"center\"\u003e\n        \u003ctd\u003eVideoLLaMA\u003c/td\u003e\u003ctd\u003e34.1\u003c/td\u003e\u003ctd\u003e1.96\u003c/td\u003e\u003ctd\u003e2.18\u003c/td\u003e\u003ctd\u003e2.16\u003c/td\u003e\u003ctd\u003e1.82\u003c/td\u003e\u003ctd\u003e1.79\u003c/td\u003e\u003ctd\u003e1.98\u003c/td\u003e\u003ctd\u003e51.6\u003c/td\u003e\u003ctd\u003e29.6\u003c/td\u003e\u003ctd\u003e12.4\u003c/td\u003e\n    \u003c/tr\u003e\n  \u003ctr align=\"center\"\u003e\n        
\u003ctd\u003eLLaMA-Adapter\u003c/td\u003e\u003ctd\u003e31.7\u003c/td\u003e\u003ctd\u003e2.03\u003c/td\u003e\u003ctd\u003e2.32\u003c/td\u003e\u003ctd\u003e2.30\u003c/td\u003e\u003ctd\u003e1.98\u003c/td\u003e\u003ctd\u003e2.15\u003c/td\u003e\u003ctd\u003e2.16\u003c/td\u003e\u003ctd\u003e54.9\u003c/td\u003e\u003ctd\u003e43.8\u003c/td\u003e\u003ctd\u003e34.2\u003c/td\u003e\n    \u003c/tr\u003e\n  \u003ctr align=\"center\"\u003e\n        \u003ctd\u003eVideoChat\u003c/td\u003e\u003ctd\u003e35.5\u003c/td\u003e\u003ctd\u003e2.23\u003c/td\u003e\u003ctd\u003e2.50\u003c/td\u003e\u003ctd\u003e2.53\u003c/td\u003e\u003ctd\u003e1.94\u003c/td\u003e\u003ctd\u003e2.24\u003c/td\u003e\u003ctd\u003e2.29\u003c/td\u003e\u003ctd\u003e56.3\u003c/td\u003e\u003ctd\u003e45.0\u003c/td\u003e\u003ctd\u003e26.5\u003c/td\u003e\n    \u003c/tr\u003e\n  \u003ctr align=\"center\"\u003e\n        \u003ctd\u003eVideoChatGPT\u003c/td\u003e\u003ctd\u003e32.7\u003c/td\u003e\u003ctd\u003e2.38\u003c/td\u003e\u003ctd\u003e2.40\u003c/td\u003e\u003ctd\u003e2.52\u003c/td\u003e\u003ctd\u003e2.62\u003c/td\u003e\u003ctd\u003e1.98\u003c/td\u003e\u003ctd\u003e2.37\u003c/td\u003e\u003ctd\u003e64.9\u003c/td\u003e\u003ctd\u003e49.3\u003c/td\u003e\u003ctd\u003e35.7\u003c/td\u003e\n    \u003c/tr\u003e\n  \u003ctr align=\"center\"\u003e\n        \u003ctd\u003eMovieChat\u003c/td\u003e\u003ctd\u003e-\u003c/td\u003e\u003ctd\u003e2.76\u003c/td\u003e\u003ctd\u003e2.93\u003c/td\u003e\u003ctd\u003e3.01\u003c/td\u003e\u003ctd\u003e2.24\u003c/td\u003e\u003ctd\u003e2.42\u003c/td\u003e\u003ctd\u003e2.67\u003c/td\u003e\u003ctd\u003e74.2\u003c/td\u003e\u003ctd\u003e52.7\u003c/td\u003e\u003ctd\u003e45.7\u003c/td\u003e\n    \u003c/tr\u003e\n  \u003ctr align=\"center\"\u003e\n        
\u003ctd\u003eVista-LLaMA\u003c/td\u003e\u003ctd\u003e-\u003c/td\u003e\u003ctd\u003e2.44\u003c/td\u003e\u003ctd\u003e2.64\u003c/td\u003e\u003ctd\u003e3.18\u003c/td\u003e\u003ctd\u003e2.26\u003c/td\u003e\u003ctd\u003e2.31\u003c/td\u003e\u003ctd\u003e2.57\u003c/td\u003e\u003ctd\u003e65.3\u003c/td\u003e\u003ctd\u003e60.5\u003c/td\u003e\u003ctd\u003e48.3\u003c/td\u003e\n    \u003c/tr\u003e\n  \u003ctr align=\"center\"\u003e\n        \u003ctd\u003eLLaMA-VID\u003c/td\u003e\u003ctd\u003e-\u003c/td\u003e\u003ctd\u003e2.89\u003c/td\u003e\u003ctd\u003e2.96\u003c/td\u003e\u003ctd\u003e3.00\u003c/td\u003e\u003ctd\u003e3.53\u003c/td\u003e\u003ctd\u003e2.46\u003c/td\u003e\u003ctd\u003e2.51\u003c/td\u003e\u003ctd\u003e69.7\u003c/td\u003e\u003ctd\u003e57.7\u003c/td\u003e\u003ctd\u003e47.4\u003c/td\u003e\n    \u003c/tr\u003e\n  \u003ctr align=\"center\"\u003e\n        \u003ctd\u003eChat-UniVi\u003c/td\u003e\u003ctd\u003e-\u003c/td\u003e\u003ctd\u003e2.99\u003c/td\u003e\u003ctd\u003e2.89\u003c/td\u003e\u003ctd\u003e2.91\u003c/td\u003e\u003ctd\u003e3.46\u003c/td\u003e\u003ctd\u003e2.89\u003c/td\u003e\u003ctd\u003e2.81\u003c/td\u003e\u003ctd\u003e65.0\u003c/td\u003e\u003ctd\u003e54.6\u003c/td\u003e\u003ctd\u003e45.8\u003c/td\u003e\n    \u003c/tr\u003e\n  \u003ctr align=\"center\"\u003e\n        \u003ctd\u003eVideoChat2\u003c/td\u003e\u003ctd\u003e51.1\u003c/td\u003e\u003ctd\u003e2.98\u003c/td\u003e\u003ctd\u003e3.02\u003c/td\u003e\u003ctd\u003e2.88\u003c/td\u003e\u003ctd\u003e3.51\u003c/td\u003e\u003ctd\u003e2.66\u003c/td\u003e\u003ctd\u003e2.81\u003c/td\u003e\u003ctd\u003e70.0\u003c/td\u003e\u003ctd\u003e54.1\u003c/td\u003e\u003ctd\u003e49.1\u003c/td\u003e\n    \u003c/tr\u003e\n  \u003ctr align=\"center\"\u003e\n        
\u003ctd\u003eST-LLM\u003c/td\u003e\u003ctd\u003e\u003cb\u003e54.9\u003c/b\u003e\u003c/td\u003e\u003ctd\u003e\u003cb\u003e3.15\u003c/b\u003e\u003c/td\u003e\u003ctd\u003e\u003cb\u003e3.23\u003c/b\u003e\u003c/td\u003e\u003ctd\u003e\u003cb\u003e3.05\u003c/b\u003e\u003c/td\u003e\u003ctd\u003e\u003cb\u003e3.74\u003c/b\u003e\u003c/td\u003e\u003ctd\u003e\u003cb\u003e2.93\u003c/b\u003e\u003c/td\u003e\u003ctd\u003e\u003cb\u003e2.81\u003c/b\u003e\u003c/td\u003e\u003ctd\u003e\u003cb\u003e74.6\u003c/b\u003e\u003c/td\u003e\u003ctd\u003e\u003cb\u003e63.2\u003c/b\u003e\u003c/td\u003e\u003ctd\u003e\u003cb\u003e50.9\u003c/b\u003e\u003c/td\u003e\n    \u003c/tr\u003e\n  \n\u003c/table\u003e\n\u003c/div\u003e\n\n## Demo 🤗\nFirst download the conversation weights from [here](https://huggingface.co/farewellthree/ST_LLM_weight/tree/main/conversation_weight) and follow the instructions in [installation](README.md#Installation). Then run the Gradio demo:\n```\nCUDA_VISIBLE_DEVICES=0 python3 demo_gradio.py --ckpt-path /path/to/STLLM_conversation_weight\n```\nWe have also prepared a local script that is easy to modify: [demo.py](demo.py)\n\n\u003cdiv align=center\u003e\n\u003cimg src=\"example/material/Mabaoguo.gif\" width=\"70%\" /\u003e\n\u003c/div\u003e\n\n\u003cdiv align=center\u003e\n\u003cimg src=\"example/material/Driving.gif\" width=\"70%\" /\u003e\n\u003c/div\u003e\n\n## Examples 👀\n- **Video Description: for difficult videos with complex scene changes, ST-LLM can accurately describe the content.**\n\u003cp align=\"center\"\u003e\n  \u003cimg src=\"example/driving.gif\" width=\"25%\" style=\"display:inline-block\" /\u003e\n  \u003cimg src=\"example/driving.jpg\" width=\"65%\" style=\"display:inline-block\" /\u003e \n\u003c/p\u003e\n\n- **Action Identification: ST-LLM can accurately and comprehensively describe the actions occurring in the video.**\n\u003cp align=\"center\"\u003e\n  \u003cimg src=\"example/cooking.gif\" width=\"21%\" style=\"display:inline-block\" 
/\u003e\n  \u003cimg src=\"example/cooking.jpg\" width=\"68%\" style=\"display:inline-block\" /\u003e \n\u003c/p\u003e\n\n\u003cp align=\"center\"\u003e\n  \u003cimg src=\"example/TVshow.gif\" width=\"21%\" style=\"display:inline-block\" /\u003e\n  \u003cimg src=\"example/TVshow.jpg\" width=\"68%\" style=\"display:inline-block\" /\u003e \n\u003c/p\u003e\n\n\u003cp align=\"center\"\u003e\n  \u003cimg src=\"example/monkey.gif\" width=\"21%\" style=\"display:inline-block\" /\u003e\n  \u003cimg src=\"example/monkey.jpg\" width=\"68%\" style=\"display:inline-block\" /\u003e \n\u003c/p\u003e\n\n- **Reasoning: for challenging open-ended reasoning questions, ST-LLM can also provide reasonable answers.**\n  \u003cp align=\"center\"\u003e\n  \u003cimg src=\"example/BaoguoMa.gif\" width=\"26%\" style=\"display:inline-block\" /\u003e\n  \u003cimg src=\"example/baoguoma.jpg\" width=\"66%\" style=\"display:inline-block\" /\u003e \n\u003c/p\u003e\n\n## Installation 🛠️\nClone the repository, then create and activate a Python environment with the following commands:\n\n```bash\ngit clone https://github.com/farewellthree/ST-LLM.git\ncd ST-LLM\nconda create --name stllm python=3.10\nconda activate stllm\npip install -r requirement.txt\n```\n\n## Training \u0026 Validation :bar_chart:\nInstructions for data preparation, training, and evaluation can be found in [trainval.md](trainval.md).\n\n## Acknowledgement 👍\n* [Video-ChatGPT](https://github.com/mbzuai-oryx/Video-ChatGPT) and [MVBench](https://github.com/OpenGVLab/Ask-Anything/tree/main/video_chat2): great work contributing video LLM benchmarks.\n* [InstructBLIP](https://github.com/salesforce/LAVIS/tree/main/projects/instructblip) and [MiniGPT4](https://github.com/Vision-CAIR/MiniGPT-4/tree/main): the codebase and the base image LLMs we built upon.\n\n## Citation ✏️\nIf you find the code and paper useful for your research, please consider starring this repo and citing our papers:\n```\n@article{liu2023one,\n  title={One for all: Video 
conversation is feasible without video instruction tuning},\n  author={Liu, Ruyang and Li, Chen and Ge, Yixiao and Shan, Ying and Li, Thomas H and Li, Ge},\n  journal={arXiv preprint arXiv:2309.15785},\n  year={2023}\n}\n```\n```\n@article{liu2024stllm,\n  title={ST-LLM: Large Language Models Are Effective Temporal Learners},\n  author={Liu, Ruyang and Li, Chen and Tang, Haoran and Ge, Yixiao and Shan, Ying and Li, Ge},\n  journal={arXiv preprint arXiv:2404.00308},\n  year={2024}\n}\n```\n \n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Ftencentarc%2Fst-llm","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Ftencentarc%2Fst-llm","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Ftencentarc%2Fst-llm/lists"}