{"id":26690833,"url":"https://github.com/llyx97/TempCompass","last_synced_at":"2025-03-26T16:01:11.563Z","repository":{"id":225775764,"uuid":"763950951","full_name":"llyx97/TempCompass","owner":"llyx97","description":"[ACL 2024 Findings] \"TempCompass: Do Video LLMs Really Understand Videos?\", Yuanxin Liu, Shicheng Li, Yi Liu, Yuxiang Wang, Shuhuai Ren, Lei Li, Sishuo Chen, Xu Sun, Lu Hou","archived":false,"fork":false,"pushed_at":"2025-02-23T03:24:56.000Z","size":74533,"stargazers_count":102,"open_issues_count":2,"forks_count":2,"subscribers_count":4,"default_branch":"main","last_synced_at":"2025-02-23T04:21:37.732Z","etag":null,"topics":["evaluation","temporal-perception","video-llms"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"other","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/llyx97.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2024-02-27T08:00:13.000Z","updated_at":"2025-02-23T03:25:00.000Z","dependencies_parsed_at":"2024-05-21T12:54:54.805Z","dependency_job_id":"3bd1ed08-5918-4564-902a-9fc545e5a179","html_url":"https://github.com/llyx97/TempCompass","commit_stats":null,"previous_names":["llyx97/tempcompass"],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/llyx97%2FTempCompass","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/llyx97%2FTempCompass/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/llyx97%2FTempCompass/releases","manifests_url":"https://repos.ecosyste.ms/api
/v1/hosts/GitHub/repositories/llyx97%2FTempCompass/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/llyx97","download_url":"https://codeload.github.com/llyx97/TempCompass/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":245689494,"owners_count":20656416,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["evaluation","temporal-perception","video-llms"],"created_at":"2025-03-26T16:00:51.620Z","updated_at":"2025-03-26T16:01:11.557Z","avatar_url":"https://github.com/llyx97.png","language":"Python","funding_links":[],"categories":["Benchmark"],"sub_categories":["Multi-modal"],"readme":"\u003ch2 align=\"center\"\u003e \u003ca href=\"https://arxiv.org/abs/2403.00476\"\u003eTempCompass: A benchmark to evaluate the temporal perception ability of Video LLMs\u003c/a\u003e\u003c/h2\u003e\n\n\u003cdiv align=\"center\"\u003e\n    \u003ca href='https://arxiv.org/abs/2403.00476'\u003e\u003cimg src='https://img.shields.io/badge/ArXiv-2403.00476-red'\u003e\u003c/a\u003e\n    \u003ca href='https://llyx97.github.io/tempcompass/'\u003e\u003cimg src='https://img.shields.io/badge/Project-Page-Green'\u003e\u003c/a\u003e\n    \u003ca href='https://huggingface.co/spaces/lyx97/TempCompass'\u003e\u003cimg src='https://img.shields.io/badge/🤗_Hugging_Face-Leaderboard-blue'\u003e\u003c/a\u003e\n    \u003ca href='https://huggingface.co/datasets/lmms-lab/TempCompass'\u003e\u003cimg src='https://img.shields.io/badge/🤗_Hugging_Face-Datasets-green'\u003e\u003c/a\u003e\n\u003c/div\u003e\n\n\u003cdiv\u003e\n\u003cdiv 
align=\"center\"\u003e\n    \u003ca href='https://llyx97.github.io/' target='_blank'\u003eYuanxin Liu\u003csup\u003e1\u003c/sup\u003e\u003c/a\u003e\u0026emsp;\n    \u003ca href='https://lscpku.github.io/' target='_blank'\u003eShicheng Li\u003csup\u003e1\u003c/sup\u003e\u003c/a\u003e\u0026emsp;\n    \u003ca href='https://liuyi-pku.github.io/' target='_blank'\u003eYi Liu\u003csup\u003e1\u003c/sup\u003e\u003c/a\u003e\u0026emsp;\n    Yuxiang Wang\u003csup\u003e1\u003c/sup\u003e\u0026emsp;\n    \u003ca href='https://renshuhuai-andy.github.io/' target='_blank'\u003eShuhuai Ren\u003csup\u003e1\u003c/sup\u003e\u003c/a\u003e\u0026emsp;\n    \u003c/br\u003e\n    \u003ca href='https://lilei-nlp.github.io/' target='_blank'\u003eLei Li\u003csup\u003e2\u003c/sup\u003e\u003c/a\u003e\u0026emsp;\n    \u003ca href='https://pkucss.github.io/' target='_blank'\u003eSishuo Chen\u003csup\u003e1\u003c/sup\u003e\u003c/a\u003e\u0026emsp;\n    \u003ca href='https://xusun26.github.io/' target='_blank'\u003eXu Sun\u003csup\u003e1\u003c/sup\u003e\u003c/a\u003e\u0026emsp;\n    \u003ca href='https://houlu369.github.io/' target='_blank'\u003eLu Hou\u003csup\u003e3\u003c/sup\u003e\u003c/a\u003e\n\u003c/div\u003e\n\u003cdiv\u003e\n\u003cdiv align=\"center\"\u003e\n    \u003csup\u003e1\u003c/sup\u003ePeking University\u0026emsp;\n    \u003csup\u003e2\u003c/sup\u003eThe University of Hong Kong\u0026emsp;\n    \u003csup\u003e3\u003c/sup\u003eHuawei Noah’s Ark Lab\n\u003c/div\u003e\n\n## 📢 News\n\n**[2024-10-30]** 🎉🎉🎉 TempCompass is integrated into [VLMEvalKit](https://github.com/open-compass/VLMEvalKit).\n\n**[2024-08-30]** Results of [Qwen2-VL](https://huggingface.co/Qwen/Qwen2-VL-7B-Instruct), [GPT-4o](https://openai.com/index/hello-gpt-4o/), [MiniCPM-V-2.6](https://huggingface.co/openbmb/MiniCPM-V-2_6), [InternVL-2-8B](https://huggingface.co/OpenGVLab/InternVL2-8B), [LLaVA-OneVision-Qwen-2-7B](https://huggingface.co/lmms-lab/llava-onevision-qwen2-7b-ov) and 
[InternLM-XComposer-2.5](https://huggingface.co/internlm/internlm-xcomposer2d5-7b) are added to the [leaderboard](https://huggingface.co/spaces/lyx97/TempCompass). GPT-4o establishes the new SoTA!\n\n**[2024-08-08]** Results of [LLaVA-Next-Video](https://github.com/LLaVA-VL/LLaVA-NeXT/blob/inference/docs/LLaVA-NeXT-Video_0716.md), [VILA-1.5](https://github.com/NVlabs/VILA) and [LongVA](https://github.com/EvolvingLMMs-Lab/LongVA) are added to the [leaderboard](https://huggingface.co/spaces/lyx97/TempCompass).\n\n**[2024-07]** 🎉🎉🎉 TempCompass is integrated into [LMMs-Eval](https://lmms-lab.github.io/posts/lmms-eval-0.2/). See [here](#lmms-eval) for usage examples.\n\n**[2024-06-11]** Result of [Reka-core](https://www.reka.ai/news/reka-core-our-frontier-class-multimodal-language-model) is added to the [leaderboard](https://huggingface.co/spaces/lyx97/TempCompass).\n\n**[2024-05-25]** [TempCompass Leaderboard](https://huggingface.co/spaces/lyx97/TempCompass) is available on HuggingFace Space 🤗.\n\n**[2024-05-16]** 🎊🎊🎊 TempCompass is accepted at ACL 2024 Findings!\n\n**[2024-04-14]** Evaluation [result](#eval_result) of [Gemini-1.5-pro](https://deepmind.google/technologies/gemini/pro/), the current SOTA Video LLM, is added.\n\n**[2024-03-23]** The [answer prompt](#answer_prompt) is improved to better guide Video LLMs to follow the desired answer formats. The [evaluation code](#eval) now provides an option to disable the use of ChatGPT.\n\n**[2024-03-12]** 🔥🔥🔥 The evaluation code is released now! 
Feel free to evaluate your own Video LLMs.\n\n## 🏆 LeaderBoard\n![](./assets/leaderboard.png)\n\n## ✨ Highlights\n### Diverse Temporal Aspects and Task Formats\n- TempCompass encompasses a diverse set of temporal aspects (left) and task formats (right) to comprehensively evaluate the temporal perception capability of Video LLMs.\n![](./assets/overview.png)\n### Conflicting Videos\n- We construct conflicting videos to prevent the models from taking advantage of single-frame bias and language priors.\n![](./assets/conflicting_videos.jpg)\n  \n- 🤔 Can your Video LLM correctly answer the following question for both videos?\n  \n    \u003cimg src=\"./assets/1021488277.gif\" alt=\"Raw Video\" style=\"float: left; width: 49%; margin-right: 10px;\"\u003e\n    \u003cimg src=\"./assets/1021488277_reverse.gif\" alt=\"Conflicting Video\" style=\"float: left; width: 49%;\"\u003e\n    \n    \u003e What is happening in the video?    \n    \u003e A. A person drops down the pineapple    \n    \u003e B. A person pushes forward the pineapple    \n    \u003e C. A person rotates the pineapple    \n    \u003e D. A person picks up the pineapple\n\n## 🚀 Quick Start\nTo begin with, clone this repository and install the required packages:\n```shell\ngit clone https://github.com/llyx97/TempCompass.git\ncd TempCompass\npip install -r requirements.txt\n```\n\n### Data Preparation\n**1. Task Instructions**\n\nThe task instructions can be found in `questions/`.\n\n\u003cdetails\u003e\n\u003csummary\u003e\u003cspan id=\"instruct_gen\"\u003e Task Instruction Generation Procedure \u003c/span\u003e\u003c/summary\u003e\n    \n1. Generate **Multi-Choice QA** instructions (`question_gen.py`). \n\n2. Manually validate quality and rectify.\n\n3. Generate task instructions for **Yes/No QA** (`question_gen_yes_no.py`), **Caption Matching** (`question_gen_caption_match.py`) and **Caption Generation** (`question_gen_captioning.py`), based on manually rectified **Multi-Choice QA** instructions.\n   \n4. 
Manually validate quality and rectify.\n\u003c/details\u003e\n\n**2. Videos**\n\nAll the processed videos can be downloaded from [google drive](https://drive.google.com/file/d/1b0ZIeRqhrUpQYxoCN_Ym_e0UW05cckYJ/view?usp=sharing) or [huggingface](https://huggingface.co/datasets/lmms-lab/TempCompass).\n\n\u003cdetails\u003e\n\u003csummary\u003e\u003cspan id=\"instruct_gen\"\u003e As an alternative, you can also download the raw videos and process them yourself \u003c/span\u003e\u003c/summary\u003e\n\nRun the following commands. The videos will be saved to `videos/`.\n```shell\ncd utils\npython download_video.py    # Download raw videos\npython process_videos.py    # Construct conflicting videos\n```\n\n**Note:** If you encounter a `MoviePy error` when running the processing script, please refer to this [issue](https://github.com/llyx97/TempCompass/issues/4).\n\u003c/details\u003e\n\n### Run Inference\nWe use [Video-LLaVA](https://github.com/PKU-YuanGroup/Video-LLaVA) and [Gemini](https://github.com/google-gemini/cookbook/blob/98a74b3cde77e518032928acec2fab8b8f3b41be/preview/file-api/File_API_Video.ipynb) as examples to illustrate how to conduct MLLM inference on our benchmark.\n\n**1. Video-LLaVA**\n\nEnter `run_video_llava` and install the environment as instructed.\n\nThen run the following commands. The prediction results will be saved to `predictions/video-llava/\u003ctask_type\u003e`.\n```shell\n# select \u003ctask_type\u003e from multi-choice, yes_no, caption_matching, captioning\npython inference_dataset.py --task_type \u003ctask_type\u003e\n```\n\n**2. Gemini**\n\nThe inference script for gemini-1.5-pro is `run_gemini.ipynb`. It is recommended to run the script in [Google Colab](https://colab.research.google.com/).\n\n### \u003cspan id=\"eval\"\u003e Run Evaluation \u003c/span\u003e\nAfter obtaining the MLLM predictions, run the following commands to conduct automatic evaluation. 
Remember to set your own `$OPENAI_API_KEY` in `utils/eval_utils.py`.\n\n- **Multi-Choice QA**\n`python eval_multi_choice.py --video_llm video-llava`\n\n- **Yes/No QA**\n`python eval_yes_no.py --video_llm video-llava`\n\n- **Caption Matching**\n`python eval_caption_matching.py --video_llm video-llava`\n\n- **Caption Generation**\n`python eval_captioning.py --video_llm video-llava`\n\n**Tip**👉: Except for *Caption Generation*, you can set `--disable_llm` when running the scripts, which will disable ChatGPT-based evaluation (i.e., entirely rely on rule-based evaluation). **This is useful when you do not want to use the ChatGPT API and your MLLM is good at following the instruction to generate answers in a specific format.**\n\nThe results of each data point will be saved to `auto_eval_results/video-llava/\u003ctask_type\u003e.json` and the overall results on each temporal aspect will be printed out as follows:\n```\n{'action': 76.0, 'direction': 35.2, 'speed': 35.6, 'order': 37.7, 'attribute_change': 41.0, 'avg': 45.6}\n{'fine-grained action': 58.8, 'coarse-grained action': 90.3, 'object motion': 36.2, 'camera motion': 32.6, 'absolute speed': 47.6, 'relative speed': 28.0, 'order': 37.7, 'color \u0026 light change': 43.6, 'size \u0026 shape change': 39.4, 'combined change': 41.7, 'other change': 38.9}\nMatch Success Rate=100.0\n```\n\n## \u003cspan id=\"lmms-eval\"\u003e LMMs-Eval Evaluation \u003c/span\u003e\nHere we provide an example of how to evaluate LLaVA-Next-Video on TempCompass, using lmms-eval.\n\n**1. Clone the repo from [LLaVA-Next](https://github.com/LLaVA-VL/LLaVA-NeXT) and set up the environment**\n```\ngit clone https://github.com/LLaVA-VL/LLaVA-NeXT\ncd LLaVA-NeXT\npip install -e .\n```\n**2. 
Run inference and evaluation in a single command**\n```\naccelerate launch --num_processes 8 --main_process_port 12345 -m lmms_eval \\\n    --model llavavid \\\n    --model_args pretrained=lmms-lab/LLaVA-NeXT-Video-32B-Qwen,conv_template=qwen_1_5,video_decode_backend=decord,max_frames_num=32,mm_spatial_pool_mode=average,mm_newline_position=grid,mm_resampler_location=after \\\n    --tasks tempcompass \\\n    --batch_size 1 \\\n    --log_samples \\\n    --log_samples_suffix llava_vid_32B \\\n    --output_path ./logs/\n```\nYou can also evaluate the performance on each task (e.g., multi-choice) separately:\n```\naccelerate launch --num_processes 8 --main_process_port 12345 -m lmms_eval \\\n    --model llavavid \\\n    --model_args pretrained=lmms-lab/LLaVA-NeXT-Video-32B-Qwen,conv_template=qwen_1_5,video_decode_backend=decord,max_frames_num=32,mm_spatial_pool_mode=average,mm_newline_position=grid,mm_resampler_location=after \\\n    --tasks tempcompass_multi_choice \\\n    --batch_size 1 \\\n    --log_samples \\\n    --log_samples_suffix llava_vid_32B \\\n    --output_path ./logs/\n```\n**3. Submit results to [TempCompass LeaderBoard](https://huggingface.co/spaces/lyx97/TempCompass)**\n\nPlace the lmms-eval outputs (`tempcompass_multi_choice.json`, `tempcompass_yes_no.json`, `tempcompass_caption_matching.json` and `tempcompass_captioning.json`) into the same folder and run this [script](https://huggingface.co/spaces/lyx97/TempCompass/blob/main/merge_eval_result.py):\n```\npython merge_eval_result.py\n```\nThen submit the output file `merged_result.json` to the leaderboard.\n\n**Note:**\nCurrently, the evaluation results calculated by lmms-eval on specific temporal aspects might be incorrect (the average accuracy on each task is correct). 
To obtain the correct results, you can use this script: [acc_lmms_eval.py](https://github.com/llyx97/TempCompass/blob/main/utils/acc_lmms_eval.py) or submit the result to our leaderboard.\n\n## 📈 Data Statistics\n![](./assets/data_statistics.png)\n\n## 📊 \u003cspan id=\"eval_result\"\u003e Evaluation Results \u003c/span\u003e\nThe following figures present results of five representative Video LLMs. Results of more Video LLMs and Image LLMs can be found in our [paper](https://arxiv.org/abs/2403.00476) and the [leaderboard](https://huggingface.co/spaces/lyx97/TempCompass).\n\n\u003cimg src=\"./assets/multi-choice.jpg\" alt=\"Multi-Choice\" style=\"float: left; width: 49%; margin-right: 10px;\"\u003e\n\u003cimg src=\"./assets/yes_no.jpg\" alt=\"Yes/No\" style=\"float: left; width: 49%;\"\u003e\n\u003cimg src=\"./assets/caption_matching.jpg\" alt=\"Caption Matching\" style=\"float: left; width: 49%; margin-right: 10px;\"\u003e\n\u003cimg src=\"./assets/captioning.jpg\" alt=\"Caption Generation\" style=\"float: left; width: 49%;\"\u003e\n\n### \u003cspan id=\"answer_prompt\"\u003e Answer Prompt \u003c/span\u003e\nWe update the answer prompt for *Multi-Choice QA* and *Caption Matching*, from \"Best Option:\" to \"Please directly give the best option:\", which can better encourage MLLMs to directly select an option. 
As such, we can reduce the reliance on ChatGPT API, if an MLLM is good at following the instruction.\n\nThe success rate of rule-based matching is as follows:\n\n**Multi-Choice QA**\n|  | V-LLaVA | SPHINX-v2    | LLaMA-VID | Qwen-VL-Chat | PandaGPT  | Valley  |\n| --- | --- | --- | --- | --- | --- | --- |\n| old prompt | 37.9 | 99.6 | 62.9 | 46.8 | 6.4 | 3.5 |\n| new prompt | 100 | 100 | 97.0 | 98.5 | 3.9 | 0.4 |\n\n**Caption Matching**\n|  | V-LLaVA | SPHINX-v2    | LLaMA-VID | Qwen-VL-Chat | PandaGPT  | Valley  |\n| --- | --- | --- | --- | --- | --- | --- |\n| old prompt | 76.6 | 89.3 | 44.5 | 91.6 | 30.7 | 11.2 |\n| new prompt | 99.5 | 99.5 | 68.3 | 96.0 | 22.5 | 3.7 |\n\n## TODOs\n- [x] Upload scripts to collect and process videos.\n- [x] Upload the code for automatic evaluation.\n- [x] Upload the code for task instruction generation.\n\n## License\nThis dataset is intended for academic research only. It is under [CC BY-NC 4.0 License](https://creativecommons.org/licenses/by-nc/4.0/).\n\n## Citation\n```bibtex\n@article{liu2024tempcompass,\n  title   = {TempCompass: Do Video LLMs Really Understand Videos?},\n  author  = {Yuanxin Liu and Shicheng Li and Yi Liu and Yuxiang Wang and Shuhuai Ren and Lei Li and Sishuo Chen and Xu Sun and Lu Hou},\n  year    = {2024},\n  journal = {arXiv preprint arXiv: 2403.00476}\n}\n```\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fllyx97%2FTempCompass","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fllyx97%2FTempCompass","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fllyx97%2FTempCompass/lists"}