{"id":28653888,"url":"https://github.com/tiger-ai-lab/vamba","last_synced_at":"2025-06-22T07:04:40.573Z","repository":{"id":282500983,"uuid":"945664622","full_name":"TIGER-AI-Lab/Vamba","owner":"TIGER-AI-Lab","description":"Code for the paper \"Vamba: Understanding Hour-Long Videos with Hybrid Mamba-Transformers\"","archived":false,"fork":false,"pushed_at":"2025-03-18T17:16:47.000Z","size":18058,"stargazers_count":70,"open_issues_count":1,"forks_count":9,"subscribers_count":1,"default_branch":"main","last_synced_at":"2025-06-13T07:08:02.040Z","etag":null,"topics":["llm","video","vlm"],"latest_commit_sha":null,"homepage":"https://tiger-ai-lab.github.io/Vamba/","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/TIGER-AI-Lab.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2025-03-09T23:25:47.000Z","updated_at":"2025-06-11T12:06:24.000Z","dependencies_parsed_at":"2025-03-15T02:46:26.413Z","dependency_job_id":null,"html_url":"https://github.com/TIGER-AI-Lab/Vamba","commit_stats":null,"previous_names":["tiger-ai-lab/vamba"],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/TIGER-AI-Lab/Vamba","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/TIGER-AI-Lab%2FVamba","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/TIGER-AI-Lab%2FVamba/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/TIGER-AI-Lab%2FVamba/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/TIGER-AI-Lab%2FVamba/manifests","own
er_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/TIGER-AI-Lab","download_url":"https://codeload.github.com/TIGER-AI-Lab/Vamba/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/TIGER-AI-Lab%2FVamba/sbom","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":261250272,"owners_count":23130540,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["llm","video","vlm"],"created_at":"2025-06-13T07:08:00.613Z","updated_at":"2025-06-22T07:04:35.554Z","avatar_url":"https://github.com/TIGER-AI-Lab.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# Vamba\n\nThis repo contains code for [Vamba](https://arxiv.org/abs/TODO), a hybrid Mamba-Transformer model that leverages cross-attention layers and Mamba-2 blocks for efficient hour-long video understanding.\n\n[**🌐 Homepage**](https://tiger-ai-lab.github.io/Vamba/) | [**📖 arXiv**](https://arxiv.org/abs/2503.11579) | [**💻 GitHub**](https://github.com/TIGER-AI-Lab/Vamba) | [**🤗 Model**](https://huggingface.co/TIGER-Lab/Vamba-Qwen2-VL-7B)\n\n## Install\nPlease use the following commands to install the required packages:\n```bash\nconda env create -f environment.yaml\nconda activate vamba\npip install flash-attn --no-build-isolation\n```\n## Model Inference\n```python\n# git clone https://github.com/TIGER-AI-Lab/Vamba\n# cd Vamba\n# export PYTHONPATH=.\nfrom tools.vamba_chat import Vamba\nmodel = Vamba(model_path=\"TIGER-Lab/Vamba-Qwen2-VL-7B\", device=\"cuda\")\ntest_input = [\n    {\n        \"type\": \"video\",\n    
    \"content\": \"assets/magic.mp4\",\n        \"metadata\": {\n            \"video_num_frames\": 128,\n            \"video_sample_type\": \"middle\",\n            \"img_longest_edge\": 640,\n            \"img_shortest_edge\": 256,\n        }\n    },\n    {\n        \"type\": \"text\",\n        \"content\": \"\u003cvideo\u003e Describe the magic trick.\"\n    }\n]\nprint(model(test_input))\n\ntest_input = [\n    {\n        \"type\": \"image\",\n        \"content\": \"assets/old_man.png\",\n        \"metadata\": {}\n    },\n    {\n        \"type\": \"text\",\n        \"content\": \"\u003cimage\u003e Describe this image.\"\n    }\n]\nprint(model(test_input))\n```\n\n## Model Training\n1. Modify the data configuration files under `train/data_configs/` to point to the correct paths of the datasets. You should refer to [CC12M](https://huggingface.co/datasets/pixparse/cc12m-wds), [PixelProse](https://huggingface.co/datasets/tomg-group-umd/pixelprose), [LLaVA-OneVision-Data](https://huggingface.co/datasets/lmms-lab/LLaVA-OneVision-Data) and [LLaVA-Video-178K](https://huggingface.co/datasets/lmms-lab/LLaVA-Video-178K) for preparing the training datasets.\n2. Follow the commands below to train Vamba model:\n```bash\n# pretraining\nbash scripts/pretrain_vamba.sh\n\n# instruction-tuning\nbash scripts/sft_vamba.sh\n```\n\n## Evaluation\nUse the scripts under `eval/` to evaluate Vamba models. For example, to evaluate Video-MME, use the command:\n```bash\ncd Vamba\nexport PYTHONPATH=.\npython eval/eval_videomme.py --model_type vamba --model_name_or_path TIGER-Lab/Vamba-Qwen2-VL-7B --num_frames 512 --data_dir \u003cpath_to_videomme_data\u003e\n```\n\n## Vamba Model Architecture\n\u003cp align=\"center\"\u003e\n\u003cimg src=\"https://tiger-ai-lab.github.io/Vamba/static/images/vamba_main.png\" width=\"900\"\u003e\n\u003c/p\u003e\n\nThe main computation overhead in the transformer-based LMMs comes from the quadratic complexity of the self-attention in the video tokens. 
To overcome this issue, we design a hybrid Mamba-Transformer architecture that processes text and video tokens differently. The key idea is to split the expensive self-attention operation over the entire video and text token sequence into two more efficient components. Since video tokens typically dominate the sequence while text tokens are comparatively few, we keep self-attention only for the text tokens and remove it for the video tokens. In its place, we add cross-attention layers that use text tokens as queries and video tokens as keys and values. The video tokens themselves are processed by Mamba blocks.\n\n## Citation\nIf you find our paper useful, please cite it as:\n```\n@misc{ren2025vambaunderstandinghourlongvideos,\n      title={Vamba: Understanding Hour-Long Videos with Hybrid Mamba-Transformers},\n      author={Weiming Ren and Wentao Ma and Huan Yang and Cong Wei and Ge Zhang and Wenhu Chen},\n      year={2025},\n      eprint={2503.11579},\n      archivePrefix={arXiv},\n      primaryClass={cs.CV},\n      url={https://arxiv.org/abs/2503.11579},\n}\n```\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Ftiger-ai-lab%2Fvamba","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Ftiger-ai-lab%2Fvamba","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Ftiger-ai-lab%2Fvamba/lists"}