{"id":26856578,"url":"https://github.com/HKUDS/VideoRAG","last_synced_at":"2025-03-31T00:03:07.512Z","repository":{"id":275934301,"uuid":"926334425","full_name":"HKUDS/VideoRAG","owner":"HKUDS","description":"\"VideoRAG: Retrieval-Augmented Generation with Extreme Long-Context Videos\"","archived":false,"fork":false,"pushed_at":"2025-03-26T16:21:08.000Z","size":5721,"stargazers_count":500,"open_issues_count":7,"forks_count":56,"subscribers_count":19,"default_branch":"main","last_synced_at":"2025-03-26T17:30:45.116Z","etag":null,"topics":["large-language-models","llms","long-video-understanding","multi-modal-llms","rag","retrieval-augmented-generation"],"latest_commit_sha":null,"homepage":"https://arxiv.org/abs/2502.01549","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"other","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/HKUDS.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2025-02-03T03:59:11.000Z","updated_at":"2025-03-26T16:21:12.000Z","dependencies_parsed_at":"2025-02-25T14:29:52.454Z","dependency_job_id":"9448cf24-e116-4467-a536-47b4c4e3fd28","html_url":"https://github.com/HKUDS/VideoRAG","commit_stats":null,"previous_names":["hkuds/videorag"],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/HKUDS%2FVideoRAG","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/HKUDS%2FVideoRAG/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/HKUDS%2FVideoRAG/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositor
ies/HKUDS%2FVideoRAG/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/HKUDS","download_url":"https://codeload.github.com/HKUDS/VideoRAG/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":246395595,"owners_count":20770243,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["large-language-models","llms","long-video-understanding","multi-modal-llms","rag","retrieval-augmented-generation"],"created_at":"2025-03-31T00:02:36.796Z","updated_at":"2025-03-31T00:03:07.505Z","avatar_url":"https://github.com/HKUDS.png","language":"Python","funding_links":[],"categories":["视频生成_补帧_摘要","Repos"],"sub_categories":["资源传输下载"],"readme":"# VideoRAG: Retrieval-Augmented Generation with Extreme Long-Context Videos\n\n\u003ca href='https://arxiv.org/abs/2502.01549'\u003e\u003cimg src='https://img.shields.io/badge/arXiv-2502.01549-b31b1b'\u003e\u003c/a\u003e\n\u003ca href='https://github.com/HKUDS/VideoRAG/issues/1'\u003e\u003cimg src='https://img.shields.io/badge/群聊-wechat-green'\u003e\u003c/a\u003e\n\u003ca href='https://discord.gg/ZzU55kz3'\u003e\u003cimg src='https://discordapp.com/api/guilds/1296348098003734629/widget.png?style=shield'\u003e\u003c/a\u003e\n\n\n\u003cimg src='VideoRAG_cover.png' /\u003e\n\n This is the PyTorch implementation for VideoRAG proposed in this paper:\n\n \u003e**VideoRAG: Retrieval-Augmented Generation with Extreme Long-Context Videos**  \n \u003eXubin Ren*, Lingrui Xu*, Long Xia, Shuaiqiang Wang, Dawei Yin, Chao Huang†\n\n\\* denotes equal contribution.\n† denotes corresponding 
author\n\n In this paper, we propose a retrieval-augmented generation framework specifically designed for processing and understanding **extremely long-context videos**.\n\n## VideoRAG Framework\n\n\u003cp align=\"center\"\u003e\n\u003cimg src=\"VideoRAG.png\" alt=\"VideoRAG\" /\u003e\n\u003c/p\u003e\n\nVideoRAG introduces a novel dual-channel architecture that combines graph-driven textual knowledge grounding, which models cross-video semantic relationships, with hierarchical multimodal context encoding, which preserves spatiotemporal visual patterns. Dynamically constructed knowledge graphs maintain semantic coherence across multi-video contexts, enabling unbounded-length video understanding, while adaptive multimodal fusion mechanisms optimize retrieval efficiency.\n\n💻 **Efficient Extreme Long-Context Video Processing**\n- Uses a single NVIDIA RTX 3090 GPU (24GB) to comprehend hundreds of hours of video content 💪\n\n🗃️ **Structured Video Knowledge Indexing**\n- The multi-modal knowledge indexing framework distills hundreds of hours of video into a concise, structured knowledge graph 🗂️\n\n🔍 **Multi-Modal Retrieval for Comprehensive Responses**\n- The multi-modal retrieval paradigm aligns textual semantics and visual content to identify the most relevant video for comprehensive responses 💬\n\n📚 **The Newly Established LongerVideos Benchmark**\n- The newly established LongerVideos benchmark features over 160 videos totaling 134+ hours across lectures, documentaries, and entertainment 🎬\n\n## Installation\n\nTo use VideoRAG, first create a conda environment with the following commands:\n```bash\nconda create --name videorag python=3.11\nconda activate videorag\n\npip install numpy==1.26.4\npip install torch==2.1.2 torchvision==0.16.2 torchaudio==2.1.2\npip install accelerate==0.30.1\npip install bitsandbytes==0.43.1\npip install moviepy==1.0.3\npip install 
git+https://github.com/facebookresearch/pytorchvideo.git@28fe037d212663c6a24f373b94cc5d478c8c1a1d\npip install timm ftfy regex einops fvcore eva-decord==0.6.1 iopath matplotlib types-regex cartopy\npip install ctranslate2==4.4.0 faster_whisper==1.0.3 neo4j hnswlib xxhash nano-vectordb\npip install transformers==4.37.1\npip install tiktoken openai tenacity\n\n# Install ImageBind using the provided code in this repository, where we have removed the requirements.txt to avoid environment conflicts.\ncd ImageBind\npip install .\n```\n\nThen, please download the necessary checkpoints in **the repository's root folder** for MiniCPM-V, Whisper, and ImageBind as follows:\n```bash\n# Make sure you have git-lfs installed (https://git-lfs.com)\ngit lfs install\n\n# minicpm-v\ngit lfs clone https://huggingface.co/openbmb/MiniCPM-V-2_6-int4\n\n# whisper\ngit lfs clone https://huggingface.co/Systran/faster-distil-whisper-large-v3\n\n# imagebind\nmkdir .checkpoints\ncd .checkpoints\nwget https://dl.fbaipublicfiles.com/imagebind/imagebind_huge.pth\ncd ../\n```\n\nYour final directory structure after downloading all checkpoints should look like this:\n```shell\nVideoRAG\n├── .checkpoints\n├── faster-distil-whisper-large-v3\n├── ImageBind\n├── LICENSE\n├── longervideos\n├── MiniCPM-V-2_6-int4\n├── README.md\n├── reproduce\n├── notesbooks\n├── videorag\n├── VideoRAG_cover.png\n└── VideoRAG.png\n```\n\n## Quick Start\n\nVideoRAG is capable of extracting knowledge from multiple videos and answering queries based on those videos. Now, try VideoRAG with your own videos 🤗.\n\n\u003e [!NOTE]\n\u003e Currently, VideoRAG has only been tested in an English environment. To process videos in multiple languages, it is recommended to modify the  ```WhisperModel``` in [asr.py](https://github.com/HKUDS/VideoRAG/blob/main/videorag/_videoutil/asr.py). 
For more details, please refer to [faster-whisper](https://github.com/systran/faster-whisper).\n\n**First**, let VideoRAG extract and index the knowledge from the given videos (a single GPU with 24GB of memory, such as the RTX 3090, is sufficient):\n```python\nimport os\nimport logging\nimport warnings\nimport multiprocessing\n\nwarnings.filterwarnings(\"ignore\")\nlogging.getLogger(\"httpx\").setLevel(logging.WARNING)\n\n# Please enter your OpenAI key\nos.environ[\"OPENAI_API_KEY\"] = \"\"\n\nfrom videorag._llm import openai_4o_mini_config\nfrom videorag import VideoRAG, QueryParam\n\n\nif __name__ == '__main__':\n    multiprocessing.set_start_method('spawn')\n\n    # Please enter your video file path in this list; there is no limit on the length.\n    # Here is an example; you can use your own videos instead.\n    video_paths = [\n        'movies/Iron-Man.mp4',\n        'movies/Spider-Man.mkv',\n    ]\n    videorag = VideoRAG(llm=openai_4o_mini_config, working_dir=f\"./videorag-workdir\")\n    videorag.insert_video(video_path_list=video_paths)\n```\n\n**Then**, ask any questions about the videos! Here is an example:\n```python\nimport os\nimport logging\nimport warnings\nimport multiprocessing\n\nwarnings.filterwarnings(\"ignore\")\nlogging.getLogger(\"httpx\").setLevel(logging.WARNING)\n\n# Please enter your OpenAI key\nos.environ[\"OPENAI_API_KEY\"] = \"\"\n\nfrom videorag._llm import *\nfrom videorag import VideoRAG, QueryParam\n\n\nif __name__ == '__main__':\n    multiprocessing.set_start_method('spawn')\n\n    query = 'What is the relationship between Iron Man and Spider-Man? 
How do they meet, and how does Iron Man help Spider-Man?'\n    param = QueryParam(mode=\"videorag\")\n    # If param.wo_reference is False, VideoRAG adds references to video clips in the response\n    param.wo_reference = True\n\n    videorag = VideoRAG(llm=openai_4o_mini_config, working_dir=f\"./videorag-workdir\")\n    videorag.load_caption_model(debug=False)\n    response = videorag.query(query=query, param=param)\n    print(response)\n```\n\n## Experiments\n\n### LongerVideos\nWe constructed the LongerVideos benchmark to evaluate the model's performance in comprehending multiple long-context videos and answering open-ended queries. All the videos are open-access videos on YouTube, and we record the URLs of the collections of videos as well as the corresponding queries in the [JSON](https://github.com/HKUDS/VideoRAG/longervideos/dataset.json) file.\n\n| Video Type       | #video list | #video | #query | #avg. queries per list | #overall duration      |\n|------------------|------------:|-------:|-------:|-----------------------:|-------------------------|\n| **Lecture**      | 12          | 135    | 376    | 31.3                   | ~ 64.3 hours           |\n| **Documentary**  | 5           | 12     | 114    | 22.8                   | ~ 28.5 hours           |\n| **Entertainment**| 5           | 17     | 112    | 22.4                   | ~ 41.9 hours           |\n| **All**          | 22          | 164    | 602    | 27.4                   | ~ 134.6 hours          |\n\n### Process LongerVideos with VideoRAG\n\nUse the following commands to prepare the videos used in LongerVideos.\n\n```shell\ncd longervideos\npython prepare_data.py # create collection folders\nsh download.sh # obtain videos\n```\n\nThen, you can run the following example command to process and answer queries for LongerVideos with VideoRAG:\n\n```shell\n# First, enter your openai_key on line 19\npython videorag_longervideos.py --collection 4-rag-lecture --cuda 
0\n```\n\n### Evaluation\n\nWe conduct win-rate comparisons as well as quantitative comparisons with RAG-based baselines and long-context video understanding methods separately. **NaiveRAG, GraphRAG and LightRAG** are implemented using the `nano-graphrag` library, which is consistent with our VideoRAG, ensuring a fair comparison.\n\nIn this part, we directly provide the **answers from all the methods** (including VideoRAG) as well as the evaluation code for experiment reproduction. Please use the following commands to download the answers:\n\n```shell\ncd reproduce\nwget https://archive.org/download/videorag/all_answers.zip\nunzip all_answers\n```\n\n#### Win-Rate Comparison\n\nWe conduct the win-rate comparison with RAG-based baselines. To reproduce the results, please follow these steps:\n\n```shell\ncd reproduce/winrate_comparison\n\n# First Step: Upload the batch request to OpenAI (remember to enter your key in the file, same for the following steps).\npython batch_winrate_eval_upload.py\n\n# Second Step: Download the results. Please enter the batch ID and then the output file ID in the file. Generally, you need to run this twice: first to obtain the output file ID, and then to download it.\npython batch_winrate_eval_download.py\n\n# Third Step: Parse the results. Please enter the output file ID in the file.\npython batch_winrate_eval_parse.py\n\n# Fourth Step: Calculate the results. Please enter the parsed result file name in the file.\npython batch_winrate_eval_calculate.py\n\n```\n\n#### Quantitative Comparison\n\nWe conduct a quantitative comparison, which extends the win-rate comparison by assigning a 5-point score to long-context video understanding methods. We use the answers from NaiveRAG as the baseline response for scoring each query. 
To reproduce the results, please follow these steps:\n\n```shell\ncd reproduce/quantitative_comparison\n\n# First Step: Upload the batch request to OpenAI (remember to enter your key in the file, same for the following steps).\npython batch_quant_eval_upload.py\n\n# Second Step: Download the results. Please enter the batch ID and then the output file ID in the file. Generally, you need to run this twice: first to obtain the output file ID, and then to download it.\npython batch_quant_eval_download.py\n\n# Third Step: Parse the results. Please enter the output file ID in the file.\npython batch_quant_eval_parse.py\n\n# Fourth Step: Calculate the results. Please enter the parsed result file name in the file.\npython batch_quant_eval_calculate.py\n```\n\n## Ollama Support\n\nThis project also supports Ollama. To use it, edit the `ollama_config` in [_llm.py](https://github.com/HKUDS/VideoRAG/blob/main/videorag/_llm.py).\nAdjust the parameters of the models being used:\n\n```\nollama_config = LLMConfig(\n    embedding_func_raw = ollama_embedding,\n    embedding_model_name = \"nomic-embed-text\",\n    embedding_dim = 768,\n    embedding_max_token_size = 8192,\n    embedding_batch_num = 1,\n    embedding_func_max_async = 1,\n    query_better_than_threshold = 0.2,\n    best_model_func_raw = ollama_complete,\n    best_model_name = \"gemma2:latest\", # needs to be a solid instruct model\n    best_model_max_token_size = 32768,\n    best_model_max_async = 1,\n    cheap_model_func_raw = ollama_mini_complete,\n    cheap_model_name = \"olmo2\",\n    cheap_model_max_token_size = 32768,\n    cheap_model_max_async = 1\n)\n```\nThen specify this config when creating your VideoRAG instance.\n\n### Jupyter Notebook\nTo test the solution on a single video, just load the notebook in the [notebook folder](VideoRAG/nodebooks) and\nupdate the parameters to fit your situation.\n\n## Citation\nIf you find this work helpful to your research, please consider citing our 
paper:\n```bibtex\n@article{VideoRAG,\n  title={VideoRAG: Retrieval-Augmented Generation with Extreme Long-Context Videos},\n  author={Ren, Xubin and Xu, Lingrui and Xia, Long and Wang, Shuaiqiang and Yin, Dawei and Huang, Chao},\n  journal={arXiv preprint arXiv:2502.01549},\n  year={2025}\n}\n```\n\n**Thank you for your interest in our work!**\n\n### Acknowledgement\nYou may refer to the related work that serves as the foundation for our framework and code repository, \n[nano-graphrag](https://github.com/gusye1234/nano-graphrag) and [LightRAG](https://github.com/HKUDS/LightRAG). Thanks for their wonderful work.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2FHKUDS%2FVideoRAG","html_url":"https://awesome.ecosyste.ms/projects/github.com%2FHKUDS%2FVideoRAG","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2FHKUDS%2FVideoRAG/lists"}