{"id":28653869,"url":"https://github.com/tiger-ai-lab/quickvideo","last_synced_at":"2025-06-13T07:07:57.237Z","repository":{"id":285666815,"uuid":"953707292","full_name":"TIGER-AI-Lab/QuickVideo","owner":"TIGER-AI-Lab","description":"Quick Long Video Understanding ","archived":false,"fork":false,"pushed_at":"2025-06-08T17:31:13.000Z","size":53902,"stargazers_count":46,"open_issues_count":2,"forks_count":3,"subscribers_count":1,"default_branch":"main","last_synced_at":"2025-06-08T18:28:29.400Z","etag":null,"topics":["llm","multimodal","multimodal-learning","video"],"latest_commit_sha":null,"homepage":"https://arxiv.org/abs/2505.16175","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/TIGER-AI-Lab.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null}},"created_at":"2025-03-24T00:06:30.000Z","updated_at":"2025-06-08T17:31:17.000Z","dependencies_parsed_at":"2025-04-02T02:39:28.605Z","dependency_job_id":"ee8f3ad4-a6b8-4972-a612-968eed3c1d39","html_url":"https://github.com/TIGER-AI-Lab/QuickVideo","commit_stats":null,"previous_names":["jdf-prog/lvu","tiger-ai-lab/quickvideo"],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/TIGER-AI-Lab/QuickVideo","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/TIGER-AI-Lab%2FQuickVideo","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/TIGER-AI-Lab%2FQuickVideo/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/TIGER-AI-Lab%2FQuickVideo/releases","manifests_url":"https://repos.ecosyste
.ms/api/v1/hosts/GitHub/repositories/TIGER-AI-Lab%2FQuickVideo/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/TIGER-AI-Lab","download_url":"https://codeload.github.com/TIGER-AI-Lab/QuickVideo/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/TIGER-AI-Lab%2FQuickVideo/sbom","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":259599331,"owners_count":22882357,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["llm","multimodal","multimodal-learning","video"],"created_at":"2025-06-13T07:07:55.395Z","updated_at":"2025-06-13T07:07:57.217Z","avatar_url":"https://github.com/TIGER-AI-Lab.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# QuickVideo\n\u003cp align=\"center\"\u003e\n  \u003cimg src=\"https://github.com/TIGER-AI-Lab/QuickVideo/raw/main/assets/logo.png\" alt=\"QuickVideo Logo\" width=\"340\"/\u003e\n\u003c/p\u003e\n\n\u003ch3 align=\"center\"\u003e\nEfficient video loading and context prefill for hour-long video understanding\n\u003c/h3\u003e\n\n\u003cp align=\"center\"\u003e\n\u003cem\u003e\u003cstrong\u003eBenjamin Schneider\u003c/strong\u003e\u003csup\u003e*\u003c/sup\u003e • \u003cstrong\u003eDongfu Jiang\u003c/strong\u003e\u003csup\u003e*\u003c/sup\u003e • \u003cstrong\u003eChao Du\u003c/strong\u003e • \u003cstrong\u003eTianyu Pang\u003c/strong\u003e • \u003cstrong\u003eWenhu Chen\u003c/strong\u003e\u003c/em\u003e\n\u003c/p\u003e\n\n\u003cp align=\"center\"\u003e\n\u003csub\u003eUniversity of Waterloo • SeaAI 
Lab\u003c/sub\u003e\n\u003c/p\u003e\n\n\u003cp align=\"center\"\u003e\n\u003csub\u003e\u003csup\u003e*\u003c/sup\u003eEqual contribution\u003c/sub\u003e\n\u003c/p\u003e\n\n\u003cp align=\"center\"\u003e\n| \n\u003ca href=\"https://github.com/TIGER-AI-Lab/QuickVideo?tab=readme-ov-file#-quick-start\"\u003e\u003cb\u003eQuick Start\u003c/b\u003e\u003c/a\u003e | \n\u003ca href=\"https://arxiv.org/abs/2505.16175\"\u003e\u003cb\u003ePaper\u003c/b\u003e\u003c/a\u003e | \n\u003ca href=\"https://github.com/TIGER-AI-Lab/QuickCodec\"\u003e\u003cb\u003eQuickCodec\u003c/b\u003e\u003c/a\u003e |\n\u003ca href=\"https://github.com/TIGER-AI-Lab/QuickVideo?tab=readme-ov-file#2-run-quickvideo-recommended\"\u003e\u003cb\u003eQuickPrefill\u003c/b\u003e\u003c/a\u003e \n|\n\u003c/p\u003e\n\n---\n\n## 🎯 Overview\n\nLong video understanding has emerged as a crucial capability for real-world applications such as meeting summarization, video surveillance, educational lecture analysis, and content moderation. However, it remains computationally prohibitive for VideoLLMs due to two critical bottlenecks:\n\n1. **Sequential video decoding** - Converting raw bit streams to RGB frames can take up to a minute for hour-long videos\n2. 
**Costly prefilling** - Processing millions of tokens for LLM inference results in high latency and memory usage\n\n\u003cp align=\"center\"\u003e\n  \u003cimg src=\"./assets/imgs/teaser.png\" alt=\"QuickVideo System Overview\" width=\"100%\"/\u003e\n\u003c/p\u003e\n\n**QuickVideo** is a system-algorithm co-design that achieves a **3.5× speedup** (from 70s to 20s for 1-hour videos) while retaining **97% of the original performance** with **50% less memory**.\n\n## 🚀 Key Innovations\n\n### 🔧 QuickDecoder\n- **Parallelized CPU-based decoder** that splits videos into keyframe-aligned intervals\n- **2-3× faster** than sequential processing through concurrent execution\n\n### ⚡ QuickPrefill\n- **Group-based prefilling** for memory-efficient activation handling\n- **KV-cache pruning** using key norm selection (L2) to retain only essential tokens\n- **50% memory reduction** while preserving 97% of original performance\n\n### 🔄 Overlapping Pipeline\n- **Concurrent CPU decoding and GPU inference** to minimize end-to-end latency\n- Intelligent scheduling reduces total processing time significantly\n\n\u003cp align=\"center\"\u003e\n  \u003cimg src=\"./assets/imgs/interleaving_time.png\" alt=\"Pipeline Optimization\" width=\"100%\"/\u003e\n\u003c/p\u003e\n\n## 📊 Performance Results\n\nWe evaluate QuickCodec on video decoding efficiency (left figure) and QuickPrefill on average QA accuracy across four long video understanding benchmarks: VideoMME, LongVideoBench, LVBench, and MLVU (right figure and hidden table). 
Results show significant speedup and memory saving while preserving 97% of the original performance.\n\n\u003ctable\u003e\n  \u003ctr\u003e\n    \u003ctd width=\"34%\"\u003e\n      \u003cimg src=\"./assets/imgs/video_processing_times.png\" alt=\"Video Processing Times\" width=\"100%\"\u003e\n    \u003c/td\u003e\n    \u003ctd width=\"66%\"\u003e\n      \u003cimg src=\"./assets/imgs/kv_pruning_avg_performance.png\" alt=\"KV Pruning Average Performance\" width=\"100%\"\u003e\n    \u003c/td\u003e\n  \u003c/tr\u003e\n\u003c/table\u003e\n\n\u003cdetails\u003e\n\u003csummary\u003e\u003cb\u003ePerformance Table\u003c/b\u003e\u003c/summary\u003e\n\n\u003ctable style=\"width: 100%; border-collapse: collapse; font-family: Arial, sans-serif;\"\u003e\n  \u003ccolgroup\u003e\n    \u003ccol style=\"width: 10%;\"\u003e\n    \u003ccol style=\"width: 20%;\"\u003e\n    \u003ccol style=\"width: 8%;\"\u003e\n    \u003ccol style=\"width: 10%;\"\u003e\n    \u003ccol style=\"width: 15%;\"\u003e\n    \u003ccol style=\"width: 10%;\"\u003e\n    \u003ccol style=\"width: 10%;\"\u003e\n    \u003ccol style=\"width: 7%;\"\u003e\n    \u003ccol style=\"width: 10%;\"\u003e\n  \u003c/colgroup\u003e\n  \u003cthead\u003e\n    \u003ctr style=\"background-color: #f2f2f2;\"\u003e\n      \u003cth style=\"border: 1px solid #ddd; padding: 8px; text-align: center;\"\u003eGroup Size\u003c/th\u003e\n      \u003cth style=\"border: 1px solid #ddd; padding: 8px; text-align: center;\"\u003eKV Pruning method\u003c/th\u003e\n      \u003cth style=\"border: 1px solid #ddd; padding: 8px; text-align: center;\"\u003eρ\u003c/th\u003e\n      \u003cth style=\"border: 1px solid #ddd; padding: 8px; text-align: center;\"\u003eVideoMME\u003c/th\u003e\n      \u003cth style=\"border: 1px solid #ddd; padding: 8px; text-align: center;\"\u003eLongVideoBench (val)\u003c/th\u003e\n      \u003cth style=\"border: 1px solid #ddd; padding: 8px; text-align: center;\"\u003eLVBench\u003c/th\u003e\n      \u003cth style=\"border: 1px solid 
#ddd; padding: 8px; text-align: center;\"\u003eMLVU (dev)\u003c/th\u003e\n      \u003cth style=\"border: 1px solid #ddd; padding: 8px; text-align: center;\"\u003eAvg\u003c/th\u003e\n      \u003cth style=\"border: 1px solid #ddd; padding: 8px; text-align: center;\"\u003ePerformance\u003c/th\u003e\n    \u003c/tr\u003e\n  \u003c/thead\u003e\n  \u003ctbody\u003e\n    \u003ctr style=\"background-color: #e8f4f8;\"\u003e\n      \u003ctd colspan=\"9\" style=\"border: 1px solid #ddd; padding: 8px; font-weight: bold; text-align: center;\"\u003e64 Frames\u003c/td\u003e\n    \u003c/tr\u003e\n    \u003ctr\u003e\n      \u003ctd style=\"border: 1px solid #ddd; padding: 8px; text-align: center;\"\u003e-\u003c/td\u003e\n      \u003ctd style=\"border: 1px solid #ddd; padding: 8px; text-align: center;\"\u003e-\u003c/td\u003e\n      \u003ctd style=\"border: 1px solid #ddd; padding: 8px; text-align: center;\"\u003e1\u003c/td\u003e\n      \u003ctd style=\"border: 1px solid #ddd; padding: 8px; text-align: center;\"\u003e62.41\u003c/td\u003e\n      \u003ctd style=\"border: 1px solid #ddd; padding: 8px; text-align: center;\"\u003e59.69\u003c/td\u003e\n      \u003ctd style=\"border: 1px solid #ddd; padding: 8px; text-align: center;\"\u003e40.09\u003c/td\u003e\n      \u003ctd style=\"border: 1px solid #ddd; padding: 8px; text-align: center;\"\u003e63.86\u003c/td\u003e\n      \u003ctd style=\"border: 1px solid #ddd; padding: 8px; text-align: center;\"\u003e56.51\u003c/td\u003e\n      \u003ctd style=\"border: 1px solid #ddd; padding: 8px; text-align: center;\"\u003e100.00%\u003c/td\u003e\n    \u003c/tr\u003e\n    \u003ctr\u003e\n      \u003ctd style=\"border: 1px solid #ddd; padding: 8px; text-align: center;\"\u003e16\u003c/td\u003e\n      \u003ctd style=\"border: 1px solid #ddd; padding: 8px; text-align: center;\"\u003eValue Norms\u003c/td\u003e\n      \u003ctd style=\"border: 1px solid #ddd; padding: 8px; text-align: center;\"\u003e0.5\u003c/td\u003e\n      \u003ctd style=\"border: 1px solid 
#ddd; padding: 8px; text-align: center;\"\u003e47.63\u003c/td\u003e\n      \u003ctd style=\"border: 1px solid #ddd; padding: 8px; text-align: center;\"\u003e35.98\u003c/td\u003e\n      \u003ctd style=\"border: 1px solid #ddd; padding: 8px; text-align: center;\"\u003e30.92\u003c/td\u003e\n      \u003ctd style=\"border: 1px solid #ddd; padding: 8px; text-align: center;\"\u003e31.38\u003c/td\u003e\n      \u003ctd style=\"border: 1px solid #ddd; padding: 8px; text-align: center;\"\u003e36.48\u003c/td\u003e\n      \u003ctd style=\"border: 1px solid #ddd; padding: 8px; text-align: center;\"\u003e64.55%\u003c/td\u003e\n    \u003c/tr\u003e\n    \u003ctr\u003e\n      \u003ctd style=\"border: 1px solid #ddd; padding: 8px; text-align: center;\"\u003e16\u003c/td\u003e\n      \u003ctd style=\"border: 1px solid #ddd; padding: 8px; text-align: center;\"\u003eAttention Scores\u003c/td\u003e\n      \u003ctd style=\"border: 1px solid #ddd; padding: 8px; text-align: center;\"\u003e0.5\u003c/td\u003e\n      \u003ctd style=\"border: 1px solid #ddd; padding: 8px; text-align: center;\"\u003e58.63\u003c/td\u003e\n      \u003ctd style=\"border: 1px solid #ddd; padding: 8px; text-align: center;\"\u003e52.95\u003c/td\u003e\n      \u003ctd style=\"border: 1px solid #ddd; padding: 8px; text-align: center;\"\u003e37.83\u003c/td\u003e\n      \u003ctd style=\"border: 1px solid #ddd; padding: 8px; text-align: center;\"\u003e59.87\u003c/td\u003e\n      \u003ctd style=\"border: 1px solid #ddd; padding: 8px; text-align: center;\"\u003e52.32\u003c/td\u003e\n      \u003ctd style=\"border: 1px solid #ddd; padding: 8px; text-align: center;\"\u003e92.58%\u003c/td\u003e\n    \u003c/tr\u003e\n    \u003ctr\u003e\n      \u003ctd style=\"border: 1px solid #ddd; padding: 8px; text-align: center;\"\u003e16\u003c/td\u003e\n      \u003ctd style=\"border: 1px solid #ddd; padding: 8px; text-align: center;\"\u003eKey Norms (↓)\u003c/td\u003e\n      \u003ctd style=\"border: 1px solid #ddd; padding: 8px; text-align: 
center;\"\u003e0.5\u003c/td\u003e\n      \u003ctd style=\"border: 1px solid #ddd; padding: 8px; text-align: center;\"\u003e60.56\u003c/td\u003e\n      \u003ctd style=\"border: 1px solid #ddd; padding: 8px; text-align: center;\"\u003e56.17\u003c/td\u003e\n      \u003ctd style=\"border: 1px solid #ddd; padding: 8px; text-align: center;\"\u003e37.70\u003c/td\u003e\n      \u003ctd style=\"border: 1px solid #ddd; padding: 8px; text-align: center;\"\u003e62.34\u003c/td\u003e\n      \u003ctd style=\"border: 1px solid #ddd; padding: 8px; text-align: center;\"\u003e54.19\u003c/td\u003e\n      \u003ctd style=\"border: 1px solid #ddd; padding: 8px; text-align: center;\"\u003e95.90%\u003c/td\u003e\n    \u003c/tr\u003e\n    \u003ctr style=\"background-color: #e8f4f8;\"\u003e\n      \u003ctd colspan=\"9\" style=\"border: 1px solid #ddd; padding: 8px; font-weight: bold; text-align: center;\"\u003e128 Frames\u003c/td\u003e\n    \u003c/tr\u003e\n    \u003ctr\u003e\n      \u003ctd style=\"border: 1px solid #ddd; padding: 8px; text-align: center;\"\u003e-\u003c/td\u003e\n      \u003ctd style=\"border: 1px solid #ddd; padding: 8px; text-align: center;\"\u003e-\u003c/td\u003e\n      \u003ctd style=\"border: 1px solid #ddd; padding: 8px; text-align: center;\"\u003e1\u003c/td\u003e\n      \u003ctd style=\"border: 1px solid #ddd; padding: 8px; text-align: center;\"\u003e66.41\u003c/td\u003e\n      \u003ctd style=\"border: 1px solid #ddd; padding: 8px; text-align: center;\"\u003e60.96\u003c/td\u003e\n      \u003ctd style=\"border: 1px solid #ddd; padding: 8px; text-align: center;\"\u003e42.87\u003c/td\u003e\n      \u003ctd style=\"border: 1px solid #ddd; padding: 8px; text-align: center;\"\u003e66.86\u003c/td\u003e\n      \u003ctd style=\"border: 1px solid #ddd; padding: 8px; text-align: center;\"\u003e59.27\u003c/td\u003e\n      \u003ctd style=\"border: 1px solid #ddd; padding: 8px; text-align: center;\"\u003e100.00%\u003c/td\u003e\n    \u003c/tr\u003e\n    \u003ctr\u003e\n      \u003ctd 
style=\"border: 1px solid #ddd; padding: 8px; text-align: center;\"\u003e16\u003c/td\u003e\n      \u003ctd style=\"border: 1px solid #ddd; padding: 8px; text-align: center;\"\u003eValue Norms\u003c/td\u003e\n      \u003ctd style=\"border: 1px solid #ddd; padding: 8px; text-align: center;\"\u003e0.5\u003c/td\u003e\n      \u003ctd style=\"border: 1px solid #ddd; padding: 8px; text-align: center;\"\u003e48.56\u003c/td\u003e\n      \u003ctd style=\"border: 1px solid #ddd; padding: 8px; text-align: center;\"\u003e37.32\u003c/td\u003e\n      \u003ctd style=\"border: 1px solid #ddd; padding: 8px; text-align: center;\"\u003e30.73\u003c/td\u003e\n      \u003ctd style=\"border: 1px solid #ddd; padding: 8px; text-align: center;\"\u003e38.51\u003c/td\u003e\n      \u003ctd style=\"border: 1px solid #ddd; padding: 8px; text-align: center;\"\u003e38.78\u003c/td\u003e\n      \u003ctd style=\"border: 1px solid #ddd; padding: 8px; text-align: center;\"\u003e65.42%\u003c/td\u003e\n    \u003c/tr\u003e\n    \u003ctr\u003e\n      \u003ctd style=\"border: 1px solid #ddd; padding: 8px; text-align: center;\"\u003e16\u003c/td\u003e\n      \u003ctd style=\"border: 1px solid #ddd; padding: 8px; text-align: center;\"\u003eAttention Scores\u003c/td\u003e\n      \u003ctd style=\"border: 1px solid #ddd; padding: 8px; text-align: center;\"\u003e0.5\u003c/td\u003e\n      \u003ctd style=\"border: 1px solid #ddd; padding: 8px; text-align: center;\"\u003e60.96\u003c/td\u003e\n      \u003ctd style=\"border: 1px solid #ddd; padding: 8px; text-align: center;\"\u003e55.20\u003c/td\u003e\n      \u003ctd style=\"border: 1px solid #ddd; padding: 8px; text-align: center;\"\u003e39.70\u003c/td\u003e\n      \u003ctd style=\"border: 1px solid #ddd; padding: 8px; text-align: center;\"\u003e64.36\u003c/td\u003e\n      \u003ctd style=\"border: 1px solid #ddd; padding: 8px; text-align: center;\"\u003e55.06\u003c/td\u003e\n      \u003ctd style=\"border: 1px solid #ddd; padding: 8px; text-align: 
center;\"\u003e92.89%\u003c/td\u003e\n    \u003c/tr\u003e\n    \u003ctr\u003e\n      \u003ctd style=\"border: 1px solid #ddd; padding: 8px; text-align: center;\"\u003e16\u003c/td\u003e\n      \u003ctd style=\"border: 1px solid #ddd; padding: 8px; text-align: center;\"\u003eKey Norms (↓)\u003c/td\u003e\n      \u003ctd style=\"border: 1px solid #ddd; padding: 8px; text-align: center;\"\u003e0.5\u003c/td\u003e\n      \u003ctd style=\"border: 1px solid #ddd; padding: 8px; text-align: center;\"\u003e63.41\u003c/td\u003e\n      \u003ctd style=\"border: 1px solid #ddd; padding: 8px; text-align: center;\"\u003e58.19\u003c/td\u003e\n      \u003ctd style=\"border: 1px solid #ddd; padding: 8px; text-align: center;\"\u003e39.57\u003c/td\u003e\n      \u003ctd style=\"border: 1px solid #ddd; padding: 8px; text-align: center;\"\u003e64.99\u003c/td\u003e\n      \u003ctd style=\"border: 1px solid #ddd; padding: 8px; text-align: center;\"\u003e56.54\u003c/td\u003e\n      \u003ctd style=\"border: 1px solid #ddd; padding: 8px; text-align: center;\"\u003e95.39%\u003c/td\u003e\n    \u003c/tr\u003e\n    \u003ctr style=\"background-color: #e8f4f8;\"\u003e\n      \u003ctd colspan=\"9\" style=\"border: 1px solid #ddd; padding: 8px; font-weight: bold; text-align: center;\"\u003e256 Frames\u003c/td\u003e\n    \u003c/tr\u003e\n    \u003ctr\u003e\n      \u003ctd style=\"border: 1px solid #ddd; padding: 8px; text-align: center;\"\u003e-\u003c/td\u003e\n      \u003ctd style=\"border: 1px solid #ddd; padding: 8px; text-align: center;\"\u003e-\u003c/td\u003e\n      \u003ctd style=\"border: 1px solid #ddd; padding: 8px; text-align: center;\"\u003e1\u003c/td\u003e\n      \u003ctd style=\"border: 1px solid #ddd; padding: 8px; text-align: center;\"\u003e65.78\u003c/td\u003e\n      \u003ctd style=\"border: 1px solid #ddd; padding: 8px; text-align: center;\"\u003e61.56\u003c/td\u003e\n      \u003ctd style=\"border: 1px solid #ddd; padding: 8px; text-align: center;\"\u003e43.90\u003c/td\u003e\n      
\u003ctd style=\"border: 1px solid #ddd; padding: 8px; text-align: center;\"\u003e68.65\u003c/td\u003e\n      \u003ctd style=\"border: 1px solid #ddd; padding: 8px; text-align: center;\"\u003e59.97\u003c/td\u003e\n      \u003ctd style=\"border: 1px solid #ddd; padding: 8px; text-align: center;\"\u003e100.00%\u003c/td\u003e\n    \u003c/tr\u003e\n    \u003ctr\u003e\n      \u003ctd style=\"border: 1px solid #ddd; padding: 8px; text-align: center;\"\u003e16\u003c/td\u003e\n      \u003ctd style=\"border: 1px solid #ddd; padding: 8px; text-align: center;\"\u003eValue Norms\u003c/td\u003e\n      \u003ctd style=\"border: 1px solid #ddd; padding: 8px; text-align: center;\"\u003e0.5\u003c/td\u003e\n      \u003ctd style=\"border: 1px solid #ddd; padding: 8px; text-align: center;\"\u003e48.33\u003c/td\u003e\n      \u003ctd style=\"border: 1px solid #ddd; padding: 8px; text-align: center;\"\u003e38.89\u003c/td\u003e\n      \u003ctd style=\"border: 1px solid #ddd; padding: 8px; text-align: center;\"\u003e31.38\u003c/td\u003e\n      \u003ctd style=\"border: 1px solid #ddd; padding: 8px; text-align: center;\"\u003e37.74\u003c/td\u003e\n      \u003ctd style=\"border: 1px solid #ddd; padding: 8px; text-align: center;\"\u003e39.08\u003c/td\u003e\n      \u003ctd style=\"border: 1px solid #ddd; padding: 8px; text-align: center;\"\u003e65.17%\u003c/td\u003e\n    \u003c/tr\u003e\n    \u003ctr\u003e\n      \u003ctd style=\"border: 1px solid #ddd; padding: 8px; text-align: center;\"\u003e16\u003c/td\u003e\n      \u003ctd style=\"border: 1px solid #ddd; padding: 8px; text-align: center;\"\u003eAttention Scores\u003c/td\u003e\n      \u003ctd style=\"border: 1px solid #ddd; padding: 8px; text-align: center;\"\u003e0.5\u003c/td\u003e\n      \u003ctd style=\"border: 1px solid #ddd; padding: 8px; text-align: center;\"\u003e62.52\u003c/td\u003e\n      \u003ctd style=\"border: 1px solid #ddd; padding: 8px; text-align: center;\"\u003e57.22\u003c/td\u003e\n      \u003ctd style=\"border: 1px solid 
#ddd; padding: 8px; text-align: center;\"\u003e41.96\u003c/td\u003e\n      \u003ctd style=\"border: 1px solid #ddd; padding: 8px; text-align: center;\"\u003e67.27\u003c/td\u003e\n      \u003ctd style=\"border: 1px solid #ddd; padding: 8px; text-align: center;\"\u003e57.24\u003c/td\u003e\n      \u003ctd style=\"border: 1px solid #ddd; padding: 8px; text-align: center;\"\u003e95.45%\u003c/td\u003e\n    \u003c/tr\u003e\n    \u003ctr\u003e\n      \u003ctd style=\"border: 1px solid #ddd; padding: 8px; text-align: center;\"\u003e16\u003c/td\u003e\n      \u003ctd style=\"border: 1px solid #ddd; padding: 8px; text-align: center;\"\u003eKey Norms (↓)\u003c/td\u003e\n      \u003ctd style=\"border: 1px solid #ddd; padding: 8px; text-align: center;\"\u003e0.5\u003c/td\u003e\n      \u003ctd style=\"border: 1px solid #ddd; padding: 8px; text-align: center;\"\u003e64.04\u003c/td\u003e\n      \u003ctd style=\"border: 1px solid #ddd; padding: 8px; text-align: center;\"\u003e60.21\u003c/td\u003e\n      \u003ctd style=\"border: 1px solid #ddd; padding: 8px; text-align: center;\"\u003e41.90\u003c/td\u003e\n      \u003ctd style=\"border: 1px solid #ddd; padding: 8px; text-align: center;\"\u003e66.73\u003c/td\u003e\n      \u003ctd style=\"border: 1px solid #ddd; padding: 8px; text-align: center;\"\u003e58.22\u003c/td\u003e\n      \u003ctd style=\"border: 1px solid #ddd; padding: 8px; text-align: center;\"\u003e97.08%\u003c/td\u003e\n    \u003c/tr\u003e\n    \u003ctr style=\"background-color: #e8f4f8;\"\u003e\n      \u003ctd colspan=\"9\" style=\"border: 1px solid #ddd; padding: 8px; font-weight: bold; text-align: center;\"\u003e1024 Frames\u003c/td\u003e\n    \u003c/tr\u003e\n    \u003ctr\u003e\n      \u003ctd style=\"border: 1px solid #ddd; padding: 8px; text-align: center;\"\u003e-\u003c/td\u003e\n      \u003ctd style=\"border: 1px solid #ddd; padding: 8px; text-align: center;\"\u003e-\u003c/td\u003e\n      \u003ctd style=\"border: 1px solid #ddd; padding: 8px; text-align: 
center;\"\u003e1\u003c/td\u003e\n      \u003ctd style=\"border: 1px solid #ddd; padding: 8px; text-align: center;\"\u003e62.00\u003c/td\u003e\n      \u003ctd style=\"border: 1px solid #ddd; padding: 8px; text-align: center;\"\u003e60.43\u003c/td\u003e\n      \u003ctd style=\"border: 1px solid #ddd; padding: 8px; text-align: center;\"\u003e42.29\u003c/td\u003e\n      \u003ctd style=\"border: 1px solid #ddd; padding: 8px; text-align: center;\"\u003e63.48\u003c/td\u003e\n      \u003ctd style=\"border: 1px solid #ddd; padding: 8px; text-align: center;\"\u003e57.05\u003c/td\u003e\n      \u003ctd style=\"border: 1px solid #ddd; padding: 8px; text-align: center;\"\u003e100.00%\u003c/td\u003e\n    \u003c/tr\u003e\n    \u003ctr\u003e\n      \u003ctd style=\"border: 1px solid #ddd; padding: 8px; text-align: center;\"\u003e16\u003c/td\u003e\n      \u003ctd style=\"border: 1px solid #ddd; padding: 8px; text-align: center;\"\u003eValue Norms\u003c/td\u003e\n      \u003ctd style=\"border: 1px solid #ddd; padding: 8px; text-align: center;\"\u003e0.5\u003c/td\u003e\n      \u003ctd style=\"border: 1px solid #ddd; padding: 8px; text-align: center;\"\u003e47.37\u003c/td\u003e\n      \u003ctd style=\"border: 1px solid #ddd; padding: 8px; text-align: center;\"\u003e33.66\u003c/td\u003e\n      \u003ctd style=\"border: 1px solid #ddd; padding: 8px; text-align: center;\"\u003e29.18\u003c/td\u003e\n      \u003ctd style=\"border: 1px solid #ddd; padding: 8px; text-align: center;\"\u003e32.65\u003c/td\u003e\n      \u003ctd style=\"border: 1px solid #ddd; padding: 8px; text-align: center;\"\u003e35.71\u003c/td\u003e\n      \u003ctd style=\"border: 1px solid #ddd; padding: 8px; text-align: center;\"\u003e62.60%\u003c/td\u003e\n    \u003c/tr\u003e\n    \u003ctr\u003e\n      \u003ctd style=\"border: 1px solid #ddd; padding: 8px; text-align: center;\"\u003e16\u003c/td\u003e\n      \u003ctd style=\"border: 1px solid #ddd; padding: 8px; text-align: center;\"\u003eAttention Scores\u003c/td\u003e\n   
   \u003ctd style=\"border: 1px solid #ddd; padding: 8px; text-align: center;\"\u003e0.5\u003c/td\u003e\n      \u003ctd style=\"border: 1px solid #ddd; padding: 8px; text-align: center;\"\u003e62.22\u003c/td\u003e\n      \u003ctd style=\"border: 1px solid #ddd; padding: 8px; text-align: center;\"\u003e58.49\u003c/td\u003e\n      \u003ctd style=\"border: 1px solid #ddd; padding: 8px; text-align: center;\"\u003e42.03\u003c/td\u003e\n      \u003ctd style=\"border: 1px solid #ddd; padding: 8px; text-align: center;\"\u003e64.45\u003c/td\u003e\n      \u003ctd style=\"border: 1px solid #ddd; padding: 8px; text-align: center;\"\u003e56.80\u003c/td\u003e\n      \u003ctd style=\"border: 1px solid #ddd; padding: 8px; text-align: center;\"\u003e99.56%\u003c/td\u003e\n    \u003c/tr\u003e\n    \u003ctr\u003e\n      \u003ctd style=\"border: 1px solid #ddd; padding: 8px; text-align: center;\"\u003e16\u003c/td\u003e\n      \u003ctd style=\"border: 1px solid #ddd; padding: 8px; text-align: center;\"\u003eKey Norms\u003c/td\u003e\n      \u003ctd style=\"border: 1px solid #ddd; padding: 8px; text-align: center;\"\u003e0.5\u003c/td\u003e\n      \u003ctd style=\"border: 1px solid #ddd; padding: 8px; text-align: center;\"\u003e59.99\u003c/td\u003e\n      \u003ctd style=\"border: 1px solid #ddd; padding: 8px; text-align: center;\"\u003e61.59\u003c/td\u003e\n      \u003ctd style=\"border: 1px solid #ddd; padding: 8px; text-align: center;\"\u003e40.80\u003c/td\u003e\n      \u003ctd style=\"border: 1px solid #ddd; padding: 8px; text-align: center;\"\u003e64.76\u003c/td\u003e\n      \u003ctd style=\"border: 1px solid #ddd; padding: 8px; text-align: center;\"\u003e56.78\u003c/td\u003e\n      \u003ctd style=\"border: 1px solid #ddd; padding: 8px; text-align: center;\"\u003e99.53%\u003c/td\u003e\n    \u003c/tr\u003e\n  \u003c/tbody\u003e\n\u003c/table\u003e\n\n\n\u003c/details\u003e\n\n## 🛠️ Installation\n\n```bash\n# Clone and setup environment\nuv sync\nsource .venv/bin/activate\nuv pip 
install -e .\nuv pip install flash-attn --no-build-isolation\n```\n\n**Important**\nPlease use `transformers==4.50.0`, the version this code has been tested with. Newer releases of the transformers library may not work because the source code of the Qwen VL models changed in some later versions (e.g. `transformers==4.52.4`). We will try to make it compatible with the latest version in the future.\n\n## 🎮 Quick Start\n\n### 1. Download Example Video\n```bash\nwget https://github.com/TIGER-AI-Lab/QuickVideo/raw/refs/heads/dev/video/Q8AZ16uBhr8_resized_fps2_mute.mp4\nvideo_path=\"Q8AZ16uBhr8_resized_fps2_mute.mp4\"\n```\n\n### 2. Run QuickVideo (Recommended)\n**With interleaved processing + KV pruning** - ⚡ **Fastest configuration**\n\n```python\nfrom lvu import LVU, LVUConfig\n\n# Configure QuickVideo with all optimizations\nconfig = LVUConfig(\n    model_name_or_path=\"Qwen/Qwen2.5-VL-7B-Instruct\",\n    model_type=\"qwen25_lvu_interleaved\",  # Enable interleaved processing\n    top_k_predict_type=\"key_norms_small\",  # Use key norm pruning\n    video_group_size=16,     # Process 16 frames per group\n    top_k=64,               # Keep 64 most important tokens per group\n    num_frames=1024,        # Process up to 1024 frames\n    use_tqdm=True,\n)\n\nlvu = LVU(config)\nquestion = \"Describe this video.\"\nvideo_path = \"Q8AZ16uBhr8_resized_fps2_mute.mp4\"\n\n# Generate response\noutput = lvu.generate(question, video_path, max_new_tokens=128, do_sample=False)\nprint(output)\n```\n\n**Expected Output:**\n```\n⏱️  Performance Metrics:\n• Frame fetching: 0.33s\n• Processing: 10.44s  \n• Prefill: 22.95s\n• End-to-end: 27.65s (vs 57.86s baseline)\n• Time saved: 10.57s ⚡\n\n🎬 Generated Response:\n['The video is a compilation of classic animated shorts featuring iconic characters from the 1940s and 1950s, showcasing slapstick humor and vibrant animation styles typical of that era. The clips include:\\n\\n1. 
**\"A Bug\\'s Life\"**: A rabbit character is seen in a desert setting, engaging in a comedic chase sequence with a carrot. The rabbit exhibits exaggerated expressions and movements, typical of the cartoon\\'s slapstick style.\\n\\n2. **\"The Wabbit Who Could\"**: Bugs Bunny appears in a whimsical scene where he is performing a magic trick involving a carrot. The animation is colorful and lively']\n\"The video is a compilation of classic animated shorts featuring iconic \ncharacters from the 1940s and 1950s, showcasing slapstick humor and \nvibrant animation styles typical of that era...\"\n```\n\n**Important**: We recommend running the interleaved version on **at least 2 CPU cores**; otherwise the interleaving strategy will do no better than standard sequential processing. If you see no improvement from interleaved processing, check the number of CPU cores available on your machine.\n\n### 3. Baseline Comparison\n**Without interleaved processing** - 🐌 **Slower but still optimized**\n\n```python\nconfig = LVUConfig(\n    model_name_or_path=\"Qwen/Qwen2.5-VL-7B-Instruct\",\n    model_type=\"qwen25_lvu\",  # Standard processing\n    video_group_size=16,\n    top_k=64,\n    num_frames=1024,\n    use_tqdm=True,\n)\n# Same usage as above - notice the 2x slower processing time\n```\n\n## 🔬 Benchmark Evaluation\n\nEvaluate QuickVideo performance on standard video understanding benchmarks:\n\n```bash\n# Setup evaluation environment\ngit submodule update --init --recursive\ncd lmms-eval\nuv pip install -e .\n\n# Configure environment\nexport DEEPCODEC_CORES=8\nexport FORCE_QWENVL_VIDEO_READER='deepcodec'\n```\n\n**Run comprehensive evaluation:**\n\n```bash\n# Example evaluation script\nnum_frame=1024\nbenchmark_name=\"videomme,longvideobench_val_v,lvbench,mlvu_dev\"\n\naccelerate launch --num_processes 8 --main_process_port 12351 -m lmms_eval \\\n    --model qwen2_5_vl \\\n    --model_args 
\"pretrained=Qwen/Qwen2.5-VL-7B-Instruct,max_num_frames=$num_frame,use_flash_attention_2=True,adaptive_local_attention=True,local_attention_group_size=16,top_k=64,predict_type=key_norms_small\" \\\n    --tasks $benchmark_name \\\n    --batch_size 1 \\\n    --log_samples \\\n    --output_path ./logs/quickvideo_evaluation\n```\n\n## 🧪 Advanced Configuration\n\n\u003cdetails\u003e\n\u003csummary\u003e\u003cb\u003eConfiguration Parameters\u003c/b\u003e\u003c/summary\u003e\n\n| Parameter | Description | Default | Options |\n|-----------|-------------|---------|---------|\n| `model_type` | Processing mode | `qwen25_lvu` | `qwen25_lvu`, `qwen25_lvu_interleaved` |\n| `video_group_size` | Frames per processing group | `16` | `8`, `16`, `32`, ... |\n| `top_k` | Tokens to keep per group | `64` | Any positive integer |\n| `top_k_predict_type` | Pruning strategy | `key_norms_small` | `key_norms_small`, `attention_scores`, `value_norms` |\n| `num_frames` | Maximum frames to process | `1024` | `64`, `128`, `256`, `1024`, ... |\n| `top_p` | Percentage-based pruning | `None` | `0.0` to `1.0` |\n\n\u003c/details\u003e\n\n## 🤝 Contributing\n\nWe welcome contributions! To add new models or KV pruning methods:\n\n1. **Fork the repository**\n2. **Create a feature branch**: `git checkout -b feature/new-model`\n3. **Implement your changes** following our coding standards\n4. **Add tests** and documentation\n5. **Submit a pull request**\n\nSee our [contribution guidelines](CONTRIBUTING.md) for detailed instructions. 
(under construction)\n\n## 📜 Citation\n\nIf you find QuickVideo useful in your research, please cite our paper:\n\n```bibtex\n@inproceedings{Schneider2025QuickVideoRL,\n  title={QuickVideo: Real-Time Long Video Understanding with System Algorithm Co-Design},\n  author={Benjamin Schneider and Dongfu Jiang and Chao Du and Tianyu Pang and Wenhu Chen},\n  year={2025},\n  url={https://api.semanticscholar.org/CorpusID:278789043}\n}\n```\n\n## 📄 License\n\nThis project is licensed under the MIT License - see the [LICENSE](LICENSE) file for details.\n\n## Star History\n\n[![Star History Chart](https://api.star-history.com/svg?repos=TIGER-AI-Lab/QuickVideo\u0026type=Date)](https://www.star-history.com/#TIGER-AI-Lab/QuickVideo\u0026Date)\n\n---\n\n\u003cp align=\"center\"\u003e\nMade with ❤️ by the \u003ca href=\"https://github.com/TIGER-AI-Lab\"\u003eTIGER AI Lab\u003c/a\u003e team\n\u003c/p\u003e\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Ftiger-ai-lab%2Fquickvideo","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Ftiger-ai-lab%2Fquickvideo","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Ftiger-ai-lab%2Fquickvideo/lists"}