{"id":17637875,"url":"https://github.com/rese1f/moviechat","last_synced_at":"2025-05-15T02:10:12.244Z","repository":{"id":176940945,"uuid":"658611851","full_name":"rese1f/MovieChat","owner":"rese1f","description":"[CVPR 2024] MovieChat: From Dense Token to Sparse Memory for Long Video Understanding","archived":false,"fork":false,"pushed_at":"2025-01-29T11:16:02.000Z","size":82475,"stargazers_count":619,"open_issues_count":41,"forks_count":41,"subscribers_count":12,"default_branch":"main","last_synced_at":"2025-05-15T02:10:05.605Z","etag":null,"topics":["computer-vision","dataset","large-language-models","llama","long-video-understanding","multimodal-large-language-models"],"latest_commit_sha":null,"homepage":"https://rese1f.github.io/MovieChat/","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"bsd-3-clause","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/rese1f.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null}},"created_at":"2023-06-26T06:43:31.000Z","updated_at":"2025-05-14T07:14:41.000Z","dependencies_parsed_at":"2024-01-14T09:18:18.200Z","dependency_job_id":"5bfb3261-db8c-41fa-9b21-d39318df6f43","html_url":"https://github.com/rese1f/MovieChat","commit_stats":{"total_commits":104,"total_committers":8,"mean_commits":13.0,"dds":"0.32692307692307687","last_synced_commit":"22c88b833a1a8236da8e45052a866e0717f9a948"},"previous_names":["rese1f/moviechat"],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/rese1f%2FMovieChat","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/reposito
ries/rese1f%2FMovieChat/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/rese1f%2FMovieChat/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/rese1f%2FMovieChat/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/rese1f","download_url":"https://codeload.github.com/rese1f/MovieChat/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":254259387,"owners_count":22040821,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["computer-vision","dataset","large-language-models","llama","long-video-understanding","multimodal-large-language-models"],"created_at":"2024-10-23T03:06:31.087Z","updated_at":"2025-05-15T02:10:07.233Z","avatar_url":"https://github.com/rese1f.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"\u003cimg src=\"src/assets/logo.png\" height=\"120px\" align=\"left\"\u003e\n\n# MovieChat\n\n[![](http://img.shields.io/badge/cs.CV-arXiv%3A2307.16449-B31B1B.svg)](https://arxiv.org/abs/2307.16449v4)\n[![](http://img.shields.io/badge/cs.CV-arXiv%3A2404.17176-B31B1B.svg)](https://arxiv.org/abs/2404.17176)\n\n\u003e **MovieChat: From Dense Token to Sparse Memory for Long Video Understanding**  \n\u003e Enxin Song*, Wenhao Chai*, Guanhong Wang*, Yucheng Zhang, Haoyang Zhou, Feiyang Wu, Xun Guo, Tian Ye, Yan Lu, Jenq-Neng Hwang, Gaoang Wang✉️   \n\u003e _CVPR 2024._\n\n\n\u003cimg width=\"1155\" alt=\"image\" 
src=\"https://github.com/user-attachments/assets/4c0412d3-0729-4f56-af0c-1ee3eeac8f99\"\u003e\n\nMovieChat can handle videos with \u003e10K frames on a 24GB graphics card. MovieChat has a 10000× advantage over other methods in terms of the average increase in GPU memory cost per frame (21.3KB/f versus ~200MB/f).\n\u003cp align=\"center\" width=\"100%\"\u003e\n\u003ca target=\"_blank\"\u003e\u003cimg src=\"src/assets/wave.gif\" alt=\"MovieChat\" style=\"width: 80%; min-width: 200px; display: block; margin: auto;\"\u003e\u003c/a\u003e\n\u003c/p\u003e\n\n\u003ch5 align=\"center\"\u003e If you like our project, please give us a star ⭐ on GitHub for the latest updates.\u003c/h5\u003e\n\n## 🔢 MovieChat-1K leaderboard\n\nFeel free to PR your new results!\n\n| Model with Link | Comment | Breakpoint Acc | Global Acc |\n|-----------------------------------------------|------------------------------|------------|----------------|\n| [Video-LLaMA](https://arxiv.org/pdf/2306.02858)            | End-to-end                  | 39.1 | 51.7 |\n| [VideoChat](https://arxiv.org/abs/2305.06355)              | End-to-end                  | 46.1 | 57.8 |\n| [TimeChat](https://arxiv.org/pdf/2406.11333)               | CoT, ICL, train on MovieChat| 46.1 | 73.8 |\n| [VideoChatGPT](https://arxiv.org/pdf/2306.05424)           | End-to-end                  | 48.0 | 47.6 |\n| [MovieChat](https://arxiv.org/abs/2307.16449v4) (baseline) | End-to-end                  | 48.3 | 62.3 |\n| [MovieChat+](https://arxiv.org/abs/2404.17176) (baseline)  | End-to-end                  | 49.6 | 71.2 |\n| [Long-LLaVA](https://arxiv.org/abs/2411.13093)             | End-to-end                  | 54.0 | 69.6 |\n| [Long-LLaVA + Video-RAG](https://arxiv.org/abs/2411.13093) | End-to-end                  | 54.5 | 72.9 |\n| [Streaming Long Video](https://arxiv.org/abs/2405.16009)   | Train on MovieChat          | 54.9 | 90.4 |\n| [DrVideo](https://arxiv.org/pdf/2406.12846)                | RAG                         | 56.7 
| 93.1 |\n| [ReWind](https://arxiv.org/pdf/2411.15556)                 | End-to-end                  | 57.2 | 87.6 |\n| [HERMES](https://arxiv.org/pdf/2408.17443)                 | Train on MovieChat          | 57.3 | 78.6 |\n| [Flash-VStream](https://arxiv.org/abs/2406.08085)          | Train on MovieChat          | 59.6 | 96.0 |\n| [MM-Screenplayer](https://arxiv.org/pdf/2406.17309)        | RAG                         | 68.8 | 87.5 |\n| [VILA1.5-8B](https://openreview.net/pdf?id=oS79Tw3G0c)     | End-to-end                  |  -   | 40.0 |\n| [FocusChat](https://arxiv.org/pdf/2412.12833)              | End-to-end                  |  -   | 60.0|\n| [llavaonevision-MovieChat](https://github.com/rese1f/MovieChat) | End-to-end             | -    | 79.0 |\n| [Sullam Jeoung, _et al_](https://arxiv.org/pdf/2410.20252) | Agent                       | -    | 84.8 |\n| [SEAL](https://arxiv.org/pdf/2412.01798)                   | Train on MovieChat          | -    | 86.8 |\n| [HEM-LLM](https://arxiv.org/pdf/2409.06299)                | Unknown training dataset    | -    | 90.6 |\n\n\n## 🔢 Evaluation of MovieChat on Existing Benchmarks\n\nSort in alphabetical order.\n\n| Benchmark | Results |\n|-----------|---------|\n| ActivityNet-QA | Acc. 
/ Score: 45.7 / 3.4 |\n| Charades-STA | R@1(IOU =0.3): 8.8 • R@1(IOU =0.5): 2.9 •  R@1(IOU =0.7): 1.3 |\n| CineClipQA | Overall: 20.86/2.11 • Description: 23.67/2.41 • Intention: 30.19/2.41 • Perception: 21.80/1.97 • Temporality: 16.32/1.97 • Spaciality: 16.40/1.98 |\n| CVRR-ES | Average: 16.41 |\n| EgoSchema | Top 1 Acc: 53.5 |\n| EventBench | Acc: 20.33 |\n| InfiniBench | Global Appearance: 6.59 • Scene transition: 6.41 • Character actions: 4.51 • Temporal order: 36.99 • Local visual: 17.76 • Summarization: 0.14 • Deep context: 0.55 • Spoiler questions: 0.34 • Multiple events: 0.85 • Avg: 14.45/0.47 |\n| InfiniBench-Vision | Acc: 14.2 • Score: 1.2 |\n| LvBench | ER: 21.3 • EU: 23.1 • KIR: 25.9 • TG: 22.3 • Rea: 24.0 • Sum: 17.2 • Overall: 22.5 |\n| LvM-QA | Acc. / Score: 48.3 / 2.57 |\n| MLVU | Holistic TR: 29.5 • AR: 25.0 • VS: 2.33 • Single Detail NQA: 24.2 • ER: 24.7 • PQA: 25.8 • SSC: 3.23 • Multi Detail AO: 28.6 • AC: 22.8 • M-Avg: 25.8 • G-Avg: 2.78 |\n| MovieChat-1K | Global Acc. / Score: 62.3 / 3.23 • Breakpoint Acc. / Score: 48.3 / 2.57 |\n| MovieCORE | Acc: 20.33 • Comp: 2.90 • Depth: 2.29 • Evid: 2.14 • Coh: 2.30 • Avg: 2.23 |\n| MSVD-QA | Acc. / Score: 75.2 / 3.8 |\n| MSRVTT-QA | Acc. / Score: 52.7 / 2.6 |\n| NExT-QA | Acc. / Score: 49.9 / 2.7 |\n| QVHighlight | mAP: 11.7 • HIT @1: 16.1 |\n| RVS-Ego | Acc. / Score: 50.7 / 3.4 |\n| RVS-Movie | Acc. 
/ Score: 36.0 / 2.3 |\n| Seed-Bench | Procedure Understanding: 29.82 • Action Recognition: 40.11 |\n| SFD | Multiple-Choice V: 8.4 • L: 16.4 • VL: 8.0 • Open-Ended V: 14.0 • L: 15.7 • VL: 11.8 |\n| SVBench | Dialogue SA: 20.46 • Dialogue CC: 20.05 • Dialogue LC: 27.76 • Dialogue TU: 21.81 • Dialogue IC: 22.21 • Dialogue OS: 21.89 • Streaming SA: 17.99 • Streaming CC: 16.42 • Streaming LC: 20.37 • Streaming TU: 15.77 • Streaming IC: 19.08 • Streaming OS: 17.43 |\n| TV-Caption | BertScore: 38.11 • CIDER: 8.43 • ROUGE-L: 12.09 • SPICE: 9.21 |\n| VCG Bench | CI: 2.76 • DO: 2.93 • CU: 3.01 • TU: 2.24 • CO: 2.42 • Avg: 2.67 |\n| VDC | Camera: 37.25/1.98 • Short: 32.55/1.59 • Background: 28.99/1.54 • Main: 31.97/1.64 • Object: 28.82/1.46 • Avg: 31.92/1.64 |\n| VideoMME | w/o subs: 38.2 • w/o subs (Long): 33.4 |\n| Video-ChatGPT | Avg: 2.67 • CI: 2.76 • DO: 2.93 • CU: 3.01 • TU: 2.24 • CO: 2.42 |\n| VS-Ego | Acc. / Score: 52.2 / 3.4 |\n| VS-Movie | Acc. / Score: 39.1 / 2.3 |\n| YouCook2 | C: 38.5 • M: 18.8 |\n\n\n## :fire: News\n* **[2024.10.26]** :keyboard: We upload MovieChat, MovieChat_OneVision, and MovieChat-1K to [lmms-eval](https://github.com/EvolvingLMMs-Lab/lmms-eval).\n* **[2024.10.26]** :keyboard: We release a new version of MovieChat, which uses LLaVA-OneVision as the base model instead of the original VideoLLaMA. The new version is available at [MovieChat_Onevision](https://github.com/rese1f/MovieChat/tree/main/MovieChat_Onevision).\n* **[2024.6.13]** :film_projector: We release the ground truth of MovieChat's test set on [Hugging Face](https://huggingface.co/datasets/Enxin/MovieChat-1K-test). \n* **[2024.5.10]** :film_projector: We release the raw videos of MovieChat's training set on [Hugging Face](https://huggingface.co/datasets/Enxin/MovieChat-1K_train). 
\n* **[2024.4.29]** :page_with_curl: We update the MovieChat+ [paper](https://arxiv.org/abs/2404.17176) with implementation details, technical evaluations, and dataset information.\n* **[2024.4.25]** :keyboard: We release a new version, MovieChat+, together with the [MovieChat+ code](https://github.com/rese1f/MovieChat/blob/main/MovieChat/models/moviechat%2B.py) and the corresponding [evaluation code](https://github.com/rese1f/MovieChat/blob/main/eval_code/result_prepare/run_inference_qa_moviechat%2B.py). Our paper is coming soon!\n* **[2024.4.19]** :keyboard: We upload the latest source code of MovieChat to [PyPI](https://pypi.org/). You can now install MovieChat directly with `pip install Moviechat`!\n* **[2024.3.25]** :bar_chart: We host challenge track 1 of [the 4th International Workshop on Long-form Video Understanding: Towards Multimodal AI Assistant and Copilot](https://cvpr.thecvf.com/Conferences/2024/workshop-list) at CVPR 2024. You can participate in the challenge and submit your results via [Codalab](https://codalab.lisn.upsaclay.fr/competitions/18284?secret_key=bd5e312c-4775-43cf-933b-70726d00bcbe). We will display the results on the [leaderboard](https://espere-1119-song.github.io/LOVEU-CVPR-24-Track-1-Leaderboard/). We ask each participant to submit results in JSON format and to report both the average running time and VRAM usage. We will use these metrics to select the most efficient method. For detailed information about the challenge, please refer to this [link](https://sites.google.com/view/loveucvpr24/track1).\n* **[2024.3.11]** :film_projector: We release the test set of MovieChat-1K on [Hugging Face](https://huggingface.co/datasets/Enxin/MovieChat-1K-test). 
Each video contains 3 global questions and 10 breakpoint questions.\n* **[2024.2.27]** :tada: Our paper was accepted by CVPR 2024!\n* **[2024.2.14]** :film_projector: We release the training set of MovieChat-1K on [Hugging Face](https://huggingface.co/datasets/Enxin/MovieChat-1K_train). Due to copyright restrictions, we share the clip features extracted by [eva_vit_g](https://storage.googleapis.com/sfr-vision-language-research/LAVIS/models/BLIP2/eva_vit_g.pth), covering 8192 frames of each video.\n* **[2023.11.27]** :page_with_curl: We update the [paper](https://arxiv.org/pdf/2307.16449v2.pdf) with implementation details, technical evaluations, and dataset information.\n* **[2023.11.23]** :keyboard: We update the latest source code of MovieChat.\n* **[2023.8.1]** :page_with_curl: We release the [paper](https://arxiv.org/abs/2307.16449).\n* **[2023.7.31]** :keyboard: We release eval [code and instructions](https://github.com/rese1f/MovieChat/tree/main/eval_code) for short video QA on **MSVD-QA**, **MSRVTT-QA** and **ActivityNet-QA**.\n* **[2023.7.29]** :joystick: We release the [Gradio demo](https://github.com/rese1f/MovieChat/tree/main/Gradio_demo) of MovieChat.\n* **[2023.7.22]** :keyboard: We release the source code of MovieChat.\n  
\n\n[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/moviechat-from-dense-token-to-sparse-memory/zeroshot-video-question-answer-on-activitynet)](https://paperswithcode.com/sota/zeroshot-video-question-answer-on-activitynet?p=moviechat-from-dense-token-to-sparse-memory)\\\n[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/moviechat-from-dense-token-to-sparse-memory/zeroshot-video-question-answer-on-msrvtt-qa)](https://paperswithcode.com/sota/zeroshot-video-question-answer-on-msrvtt-qa?p=moviechat-from-dense-token-to-sparse-memory)\\\n[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/moviechat-from-dense-token-to-sparse-memory/zeroshot-video-question-answer-on-msvd-qa)](https://paperswithcode.com/sota/zeroshot-video-question-answer-on-msvd-qa?p=moviechat-from-dense-token-to-sparse-memory)\n[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/moviechat-from-dense-token-to-sparse-memory/zero-shot-long-video-global-mode-question)](https://paperswithcode.com/sota/zero-shot-long-video-global-mode-question?p=moviechat-from-dense-token-to-sparse-memory)\n[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/moviechat-from-dense-token-to-sparse-memory/zero-shot-long-video-breakpoint-mode-question)](https://paperswithcode.com/sota/zero-shot-long-video-breakpoint-mode-question?p=moviechat-from-dense-token-to-sparse-memory)\n\n## 📊Performance Comparison on MovieChat-1K\n| **Method**         | **Text Decoder**   | **# Frames** | **Global Mode Acc.** | **Global Mode Sco.** |\n|--------------------|--------------------|--------------|----------------------|----------------------|\n| GIT                | non-LLM based      | 6            | 28.8                 | 1.83                 |\n| mPLUG-2            | non-LLM based      | 8            | 31.7                 | 2.13                 |\n| **Video Chat**     | LLM based          | 32 
          | 57.8                 | 3.00                 |\n| **Video LLaMA**    | LLM based          | 32           | 51.7                 | 2.67                 |\n| **Video-ChatGPT**  | LLM based          | 100          | 47.6                 | 2.55                 |\n| **MovieChat**      | LLM based          | 2048         | 62.3                 | 3.23                 |\n| **MovieChat+**     | LLM based          | 2048         | 71.2                 | 3.51               |\n| **MovieChat-Onevision**  | LLM based    | 2048         | **79.0**             | **4.20**             |\n\n## ✨How to run MovieChat quickly?\n\nWe have packaged MovieChat and uploaded it to PyPI. To run MovieChat quickly, install it first. \n```\npip install MovieChat\n```\nWe advise you to install version `0.6.3` for now. `MovieChat` downloads checkpoints from Hugging Face automatically. If your service doesn't support `git clone from \u003cHuggingFace  url\u003e`, we recommend downloading the checkpoints to your service manually and changing the respective paths in the package, including [q_former_model](https://storage.googleapis.com/sfr-vision-language-research/LAVIS/models/BLIP2/blip2_pretrained_flant5xxl.pth), [ckpt_path](https://huggingface.co/DAMO-NLP-SG/Video-LLaMA-Series/resolve/main/finetune-vicuna7b-v2.pth?download=true), and [llama_model](https://huggingface.co/Enxin/MovieChat-vicuna). \n\nBefore you run the following inference code, please verify that `ffprobe` is installed by running `ffprobe -version`. This command should return the version of ffprobe if it is correctly installed. 
Otherwise, you should install it via `sudo apt-get install ffmpeg` (Ubuntu).\n\n```\nfrom PIL import Image\nimport cv2\n\nfrom MovieChat.processors.video_processor import AlproVideoEvalProcessor\nfrom MovieChat.models.chat_model import Chat\nfrom MovieChat.models.moviechat import MovieChat\n\ndevice = 'cuda:0'\nprint('Initializing Chat')\nmoviechat_model = MovieChat.from_config(device=device).to(device)\nvis_processor_cfg = {'name': 'alpro_video_eval', 'n_frms': 8, 'image_size': 224}\nframe_processor = AlproVideoEvalProcessor.from_config(vis_processor_cfg)\nchat = Chat(moviechat_model, frame_processor, device=device)\nprint('Initialization Finished')\n\nvideo_path = \"Your video path, end with mp4\"\nfragment_video_path = \"The path to store tmp video clips\"\nmiddle_video = False # True-\u003eBreakpoint mode, False-\u003eGlobal mode\nquestion = \"Your Question\"\ncur_min = 0 # Change it when Breakpoint mode\ncur_sec = 0 # Change it when Breakpoint mode\n\ncap = cv2.VideoCapture(video_path)\ncur_fps = cap.get(cv2.CAP_PROP_FPS)\ncap.set(cv2.CAP_PROP_POS_FRAMES, cur_fps)\nret, frame = cap.read()\nrgb_frame = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)\npil_image = Image.fromarray(rgb_frame)\nimage = chat.image_vis_processor(pil_image).unsqueeze(0).unsqueeze(2).half().to(device)\ncur_image = chat.model.encode_image(image)\n\nimg_list = []\nmsg = chat.upload_video_without_audio(\n    video_path=video_path, \n    fragment_video_path=fragment_video_path,\n    cur_min=cur_min, \n    cur_sec=cur_sec, \n    cur_image=cur_image, \n    img_list=img_list, \n    middle_video=middle_video,\n    question=question\n)\nanswer = chat.answer(\n    img_list=img_list,\n    input_text=question,\n    msg = msg,\n    num_beams=1,\n    temperature=1.0,\n    max_new_tokens=300,\n    max_length=2000)[0]\n\nprint(answer)\n```\n\nNote that if you receive a RuntimeError like `\"Error reading \u003cfilename.mp4\u003e\"`, one solution is to initialize `\u003cfilename.mp4\u003e` with any other video 
file.\n\n## 💡 Overview\n\n![](src/assets/overview.png)\n\n## 📣 Demo Video\n\n[![Alt text](https://img.youtube.com/vi/Dx5BQmgK4n8/0.jpg)](https://www.youtube.com/embed/Dx5BQmgK4n8?si=FN9pLyQBN--vJBZA)\n\n## ⚡ Comparison Case\n\n\u003cdiv style=\"color:orange; border-bottom: 1px solid #d9d9d9;\n    display: inline-block;\n    color: #999;\n    padding: 2px;\"\u003e Question and answer about a clip from YouTube, which is a tutorial on how to cook steak. The entire instructional process begins with marinating the steak, followed by pan-searing it, preparing side dishes, and ultimately plating the meal. Green ( Red ) highlights the correct (wrong) answer and yellow indicates that the model is hallucinating.\n\u003c/div\u003e\n\n\u003cp align=\"center\" width=\"100%\"\u003e\n\u003ca target=\"_blank\"\u003e\u003cimg src=\"src/compare_case.png\"  style=\"width: 80%; min-width: 200px; display: block; margin: auto;\"\u003e\u003c/a\u003e\n\u003c/p\u003e\n\n## 😍 Examples\n\n\u003cdiv style=\"color:orange; border-bottom: 1px solid #d9d9d9;\n    display: inline-block;\n    color: #999;\n    padding: 2px;\"\u003e Question and answer about clips from Zootopia, a cartoon, which tells the story of a determined police officer rabbit named Judy\nwho pairs up with a cunning fox to uncover a conspiracy about missing animals and develop an unexpected friendship.\n\u003c/div\u003e\n\n\u003cp align=\"center\" width=\"100%\"\u003e\n\u003ca target=\"_blank\"\u003e\u003cimg src=\"src/example1_00.png\"  style=\"width: 80%; min-width: 200px; display: block; margin: auto;\"\u003e\u003c/a\u003e\n\u003c/p\u003e\n\n\n\u003cdiv style=\"color:orange; border-bottom: 1px solid #d9d9d9;\n    display: inline-block;\n    color: #999;\n    padding: 2px;\"\u003e Question and answer about clips from Goblin, which tells the story of Kim Shin, an immortal ”goblin” who needs to find a human\nbride to end his endless life but instead meets Ji Eun-tak, a girl fated to die who claims to be the ”goblin’s bride,” 
leading to a romantic tale unfolding bet.\n\u003c/div\u003e\n\u003cp align=\"center\" width=\"100%\"\u003e\n\u003ca target=\"_blank\"\u003e\u003cimg src=\"src/example2_00.png\" style=\"width: 80%; min-width: 200px; display: block; margin: auto;\"\u003e\u003c/a\u003e\n\u003c/p\u003e\n\n\u003cdiv style=\"color:orange; border-bottom: 1px solid #d9d9d9;\n    display: inline-block;\n    color: #999;\n    padding: 2px;\"\u003e  Question and answer about clips from Game of Thrones, which tells the epic fantasy tale of power struggles and political intrigue among the Seven Kingdoms, entwined with intricate family relationships, all set against the backdrop of an ancient, mystical threat.\n\u003c/div\u003e\n\u003cp align=\"center\" width=\"100%\"\u003e\n\u003ca target=\"_blank\"\u003e\u003cimg src=\"src/example3_00.png\" style=\"width: 80%; min-width: 200px; display: block; margin: auto;\"\u003e\u003c/a\u003e\n\u003c/p\u003e\n\n\u003cdiv style=\"color:orange; border-bottom: 1px solid #d9d9d9;\n    display: inline-block;\n    color: #999;\n    padding: 2px;\"\u003e Question and answer about clips from YouTube, which contains a compilation of some inspirational movies scenes. This video clip comprises several segments from The Death Crawl, Coach Carter, Rocky Balboa, and We Are Marshall,  which vary in duration.\n\u003c/div\u003e\n\u003cp align=\"center\" width=\"100%\"\u003e\n\u003ca target=\"_blank\"\u003e\u003cimg src=\"src/example4_00.png\" style=\"width: 80%; min-width: 200px; display: block; margin: auto;\"\u003e\u003c/a\u003e\n\u003c/p\u003e\n\n## 🚀 Benchmark: MovieChat-1K \n\nTo better evaluate the performance of MovieChat, we collect a new benchmark for long video understanding tasks, MovieChat-1K, which contains 1K high quality video clips sourced from various movies and TV series with 14K manual annotations.\n\nTo the best of our knowledge, a long video understanding dataset has not yet been established. 
Our work represents an initial step in creating one and making it publicly available. We create MovieChat-1K, containing 1K long\nvideos with 1K corresponding dense captions and 13K visual question-answer pairs. For each video, we manually annotate 1 dense caption for the whole video, 3 question-answer pairs for global mode, and 10 question-answer pairs with timestamps for breakpoint mode. \n\n\u003cp align=\"center\" width=\"100%\"\u003e\n\u003ca target=\"_blank\"\u003e\u003cimg src=\"src/benchmark/dataset1.png\" style=\"width: 100%; min-width: 200px; display: block; margin: auto;\"\u003e\u003c/a\u003e\n\nWe collect videos from 15 popular categories with varying distribution, including documentary film, detective film, animation film, and so on. Each video comprises multiple alternating scenes, contributing to a diverse and dynamic visual narrative within the collection. Over 90% of the videos exhibit a duration ranging from 10K to 12K frames, while 14.6% of videos extend beyond 12K frames. Only 8.6% of videos are shorter than 10K frames.\n\n\n### Question-answering Pairs\n\n#### Word Distribution\nBecause MovieChat-1K is specifically designed for long video comprehension tasks, the majority of questions are open-ended; only a quarter are classified as multiple-choice questions, marked by initiators such as ‘Do,’ ‘Does,’ ‘Is,’ or ‘Are.’ We also compute the word distributions of our provided\nquestion-answer pairs, which include common objects (people, clothes, etc.), time (day, night, etc.), scenes (indoor, outdoor, etc.), and so on.\n\n\u003cp align=\"center\" width=\"100%\"\u003e\n\u003ca target=\"_blank\"\u003e\u003cimg src=\"src/benchmark/wordcloud.png\" style=\"width: 40%; min-width: 200px; display: block; margin: auto;\"\u003e\u003c/a\u003e\n\n#### Sentence length distribution\nMovieChat-1K exhibits diverse lengths of question-answer pairs at the segmented clip level. 
Although the distribution of question-answer pairs varies between global mode and breakpoint mode, most questions are 5-15 words long, while answers are generally shorter than 10 words.\n\n\u003cp align=\"center\" width=\"100%\"\u003e\n\u003ca target=\"_blank\"\u003e\u003cimg src=\"src/benchmark/length.png\" style=\"width: 70%; min-width: 200px; display: block; margin: auto;\"\u003e\u003c/a\u003e\n\n### Dense Captions\n\nTo facilitate a more detailed understanding of long videos, we provide\na dense caption for each video. MovieChat-1K exhibits diverse caption lengths at the segmented clip level. Approximately two-thirds of the clips\nhave captions with 100-149 words, while one-fifth of the\nclip captions have fewer than 100 words. About 11% of\nclips have long captions with more than 150 words.\n\n\u003cp align=\"center\" width=\"100%\"\u003e\n\u003ca target=\"_blank\"\u003e\u003cimg src=\"src/benchmark/caption_dis.png\" style=\"width: 40%; min-width: 200px; display: block; margin: auto;\"\u003e\u003c/a\u003e\n\nWe also compute the word distribution of our generated captions. The resulting word\ndistribution of the captions is presented in Fig. B6, which\nincludes common objects (man, woman, people, girl, etc.),\nattributes (detective, various, small, white, etc.), locations\n(inside, behind, south, next, etc.), scenes (room, house,\nbuilding, office, etc.), actions/events (talk, enter, leave, take,\netc.), and more.\n\n\u003cp align=\"center\" width=\"100%\"\u003e\n\u003ca target=\"_blank\"\u003e\u003cimg src=\"src/benchmark/caption_wordcloud.png\" style=\"width: 45%; min-width: 200px; display: block; margin: auto;\"\u003e\u003c/a\u003e\n\nIn terms of actionness, MovieChat-1K captions contain nearly the same number of verbs as the WebVid10M dataset. 
To evaluate this, we use the NLTK toolkit to\nanalyze the number of verbs in captions, focusing on extracting and tagging all unique verbs. We find a total of\n109,485 verbs in the WebVid10M caption dataset, while the\nMovieChat-1K captions contain 102,988 unique instances\nof verbs. While these counts may not be entirely accurate\ndue to our simple counting method, we believe they provide\na rough indication of the actionness of the two datasets.\n\n\u003c!-- ## Comparison between MovieChat-1K and other benchmarks\n\nMovieChat-1K provides a large-scale benchmark\nfor long video understanding, which contains 1K movies,\n1K dense captions and 13k question-answer pairs. The\ncomparison between different datasets are shown in Tab. 8.\nIt is evident that MovieChat-1K provides the longest\naverage duration for movie clips. MovieQA exclusively offers question-answer pairs related to movies,\nwhile MovieGraphs supplies captions associated with\nmovies. Unlike other datasets, MovieNet encompasses\nthree main types of texts: subtitle, synopsis, and script,\nexcluding question-answer pairs. Additionally, the synopsis category is designed for the entire movie rather than\nvideo clips. Consequently, MovieChat-1K is more suitable\nfor studying long video comprehension compared to other\ndatasets.\n\n\u003cdiv align=\"center\"\u003e\n\u003ctable border=\"1\" width=\"100%\"\u003e\n    \u003ctr align=\"center\"\u003e\n        \u003cth\u003eDataset\u003c/th\u003e\u003cth\u003eAvg. Duration (min)\u003c/th\u003e\u003cth\u003eNumber of Captions\u003c/th\u003e\u003cth\u003eAvg. Caption Length\u003c/th\u003e\u003cth\u003eNumber of Question-Answer Pairs\u003c/th\u003e\u003cth\u003eAvg. Question Length\u003c/th\u003e\u003cth\u003eAvg. 
Answer Length\u003c/th\u003e\n    \u003c/tr\u003e\n    \u003ctr align=\"center\"\u003e\n        \u003ctd\u003e\u003ca href=\"https://arxiv.org/abs/1512.02902\"\u003eMovieQA\u003c/a\u003e\u003c/td\u003e\u003ctd\u003e3.5\u003c/td\u003e\u003ctd\u003e-\u003c/td\u003e\u003ctd\u003e-\u003c/td\u003e\u003ctd\u003e14.9K\u003c/td\u003e\u003ctd\u003e9.3\u003c/td\u003e\u003ctd\u003e5.1\u003c/td\u003e\n    \u003c/tr\u003e\n    \u003c/tr\u003e\n    \u003ctr align=\"center\"\u003e\n        \u003ctd\u003e\u003ca href=\"https://arxiv.org/abs/1712.06761\"\u003eMovieGraphs\u003c/a\u003e\u003c/td\u003e\u003ctd\u003e0.73\u003c/td\u003e\u003ctd\u003e15K\u003c/td\u003e\u003ctd\u003e35\u003c/td\u003e\u003ctd\u003e-\u003c/td\u003e\u003ctd\u003e-\u003c/td\u003e\u003ctd\u003e-\u003c/td\u003e\n    \u003c/tr\u003e\n    \u003ctr align=\"center\"\u003e\n        \u003ctd\u003e\u003ca href=\"https://arxiv.org/abs/2007.10937\"\u003eMovieNet\u003c/a\u003e\u003c/td\u003e\u003ctd\u003e2.1\u003c/td\u003e\u003ctd\u003e2.5K\u003c/td\u003e\u003ctd\u003e-\u003c/td\u003e\u003ctd\u003e-\u003c/td\u003e\u003ctd\u003e-\u003c/td\u003e\u003ctd\u003e-\u003c/td\u003e\n    \u003c/tr\u003e\n    \u003ctr align=\"center\"\u003e\n        \u003ctd\u003eMovieChat-1K\u003c/td\u003e\u003ctd\u003e9.4\u003c/td\u003e\u003ctd\u003e1K\u003c/td\u003e\u003ctd\u003e121\u003c/td\u003e\u003ctd\u003e13K\u003c/td\u003e\u003ctd\u003e7.8\u003c/td\u003e\u003ctd\u003e2.3\u003c/td\u003e\n    \u003c/tr\u003e\n\u003c/table\u003e\n\u003c/div\u003e --\u003e\n\n🔐 \u0026#x00A9; **Due to copyright concerns and the size of the movies, we plan to release the features of the dataset. 
Please wait for a few weeks.**\n\n## 🛠️ Install \n\n### Environment Preparation\n\nFirst, create a conda environment:\n```\nconda env create -f environment.yml\nconda activate moviechat\n```\n\n### Prerequisites\n\nBefore using the repository, make sure you have obtained the following checkpoints:\n\n#### Pre-trained Language Decoder\n\n- Get the original LLaMA weights in the Hugging Face format by following the instructions [here](https://huggingface.co/docs/transformers/main/model_doc/llama).\n- Download Vicuna delta weights :point_right: [[7B](https://huggingface.co/lmsys/vicuna-7b-delta-v0)] (Note: we use **v0 weights** instead of v1.1 weights). \n- Use the following command to add delta weights to the original LLaMA weights to obtain the Vicuna weights:\n\n```\npython apply_delta.py \\\n    --base ckpt/LLaMA/7B_hf \\\n    --target ckpt/Vicuna/7B \\\n    --delta ckpt/Vicuna/vicuna-7b-delta-v0 \\\n```\n\n#### Pre-trained Visual Encoder for MovieChat\n- Download the MiniGPT-4 model (trained linear layer) from this [link](https://drive.google.com/file/d/1a4zLvaiDBr-36pasffmgpvH5P7CKmpze/view).\n\n#### Download Pretrained Weights\n\n- Download pretrained weights to run MovieChat with Vicuna-7B as language decoder locally from this [link](https://huggingface.co/DAMO-NLP-SG/Video-LLaMA-Series/resolve/main/finetune-vicuna7b-v2.pth).\n\n## 🤖 How to Run Demo Locally\n\nFirstly, set the `llama_model`, `llama_proj_model` and `ckpt` in [eval_configs/MovieChat.yaml](./eval_configs/MovieChat.yaml).\nThen run the script:\n```\npython inference.py \\\n    --cfg-path eval_configs/MovieChat.yaml \\\n    --gpu-id 0 \\\n    --num-beams 1 \\\n    --temperature 1.0 \\\n    --text-query \"What is he doing?\" \\\n    --video-path src/examples/Cooking_cake.mp4 \\\n    --fragment-video-path src/video_fragment/output.mp4 \\\n    --cur-min 1 \\\n    --cur-sec 1 \\\n    --middle-video 1 \\\n```\nNote that, if you want to use the global mode (understanding and question-answering for the 
**whole** video), remember to set `--middle-video` to 0.\n\n\u003c!-- ## 👍 Main Results\n### Short video question-answering\nWe use several widely\nused open-ended datasets: MSVD-QA, MSRVTT-QA, and ActivityNet-QA for short video question-answering tasks. The evaluation is assisted by an LLM with the default hyper-parameter settings. The accuracy and relative scores on a scale of 0 to 5 are reported. Compared to previous methods, MovieChat achieves comparable performance even though it is not\nspecifically designed for short video question-answering tasks.\n\n\u003cdiv align=\"center\"\u003e\n\u003ctable border=\"1\" width=\"100%\"\u003e\n    \u003ctr align=\"center\"\u003e\n        \u003cth\u003eMethods\u003c/th\u003e\u003cth\u003eLLM\u003c/th\u003e\u003cth\u003eConversation\u003c/th\u003e\u003cth\u003eDetail Description\u003c/th\u003e\u003cth\u003eComplex Reasoning\u003c/th\u003e\u003cth\u003eAll\u003c/th\u003e\n    \u003c/tr\u003e\n    \u003ctr align=\"center\"\u003e\n        \u003ctd\u003e\u003ca href=\"https://huggingface.co/Chat-UniVi/Chat-UniVi\"\u003eChat-UniVi-7B\u003c/a\u003e\u003c/td\u003e\u003ctd\u003e\u003ca href=\"https://huggingface.co/lmsys/vicuna-7b-v1.5\"\u003eVicuna-7B\u003c/a\u003e\u003c/td\u003e\u003ctd\u003e\u003cb\u003e84.1\u003c/b\u003e\u003c/td\u003e\u003ctd\u003e74.2\u003c/td\u003e\u003ctd\u003e93.7\u003c/td\u003e\u003ctd\u003e84.2\u003c/td\u003e\n    \u003c/tr\u003e\n    \u003c/tr\u003e\n    \u003ctr align=\"center\"\u003e\n        \u003ctd\u003e\u003ca href=\"https://huggingface.co/Chat-UniVi/Chat-UniVi-13B\"\u003eChat-UniVi-13B\u003c/a\u003e\u003c/td\u003e\u003ctd\u003e\u003ca href=\"https://huggingface.co/lmsys/vicuna-13b-v1.5\"\u003eVicuna-13B\u003c/a\u003e\u003c/td\u003e\u003ctd\u003e\u003cb\u003e84.1\u003c/b\u003e\u003c/td\u003e\u003ctd\u003e\u003cb\u003e79.4\u003c/b\u003e\u003c/td\u003e\u003ctd\u003e\u003cb\u003e94.7\u003c/b\u003e\u003c/td\u003e\u003ctd\u003e\u003cb\u003e86.1\u003c/b\u003e\u003c/td\u003e\n    
\u003c/tr\u003e\n\u003c/table\u003e\n\u003c/div\u003e --\u003e\n\n## 🤝 Acknowledgement\nWe are grateful to the following awesome projects, which MovieChat builds on:\n* [Video-LLaMA](https://github.com/DAMO-NLP-SG/Video-LLaMA): An Instruction-tuned Audio-Visual Language Model for Video Understanding\n* [Token Merging](https://github.com/facebookresearch/ToMe): Your ViT but Faster\n* [XMem](https://github.com/hkchengrex/XMem): Long-Term Video Object Segmentation with an Atkinson-Shiffrin Memory Model\n* [MiniGPT-4](https://github.com/Vision-CAIR/MiniGPT-4): Enhancing Vision-language Understanding with Advanced Large Language Models\n* [FastChat](https://github.com/lm-sys/FastChat): An Open Platform for Training, Serving, and Evaluating Large Language Model based Chatbots\n* [BLIP-2](https://github.com/salesforce/LAVIS/tree/main/projects/blip2): Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models \n* [EVA-CLIP](https://github.com/baaivision/EVA/tree/master/EVA-CLIP): Improved Training Techniques for CLIP at Scale\n* [LLaMA](https://github.com/facebookresearch/llama): Open and Efficient Foundation Language Models\n* [VideoChat](https://github.com/OpenGVLab/Ask-Anything): Chat-Centric Video Understanding\n* [LLaVA](https://github.com/haotian-liu/LLaVA): Large Language and Vision Assistant\n\n\n## 🔒 Terms of Use\nMovieChat is a research preview intended for non-commercial use only. You must **NOT** use MovieChat for any illegal, harmful, violent, racist, or sexual purposes. You are strictly prohibited from engaging in any activity that could violate these guidelines. 
\n\n## ✏️ Citation\n\nIf you find MovieChat useful for your research and applications, please cite using this BibTeX:\n\n```bibtex\n@article{song2023moviechat,\n  title={MovieChat: From Dense Token to Sparse Memory for Long Video Understanding},\n  author={Song, Enxin and Chai, Wenhao and Wang, Guanhong and Zhang, Yucheng and Zhou, Haoyang and Wu, Feiyang and Guo, Xun and Ye, Tian and Lu, Yan and Hwang, Jenq-Neng and others},\n  journal={arXiv preprint arXiv:2307.16449},\n  year={2023}\n}\n\n@article{song2024moviechat+,\n  title={MovieChat+: Question-aware Sparse Memory for Long Video Question Answering},\n  author={Song, Enxin and Chai, Wenhao and Ye, Tian and Hwang, Jenq-Neng and Li, Xi and Wang, Gaoang},\n  journal={arXiv preprint arXiv:2404.17176},\n  year={2024}\n}\n```\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Frese1f%2Fmoviechat","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Frese1f%2Fmoviechat","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Frese1f%2Fmoviechat/lists"}