{"id":19425615,"url":"https://github.com/rese1f/aurora","last_synced_at":"2025-08-19T00:13:39.084Z","repository":{"id":258763389,"uuid":"852582085","full_name":"rese1f/aurora","owner":"rese1f","description":"[ICLR 2025] AuroraCap: Efficient, Performant Video Detailed Captioning and a New Benchmark","archived":false,"fork":false,"pushed_at":"2025-06-04T06:24:04.000Z","size":26545,"stargazers_count":108,"open_issues_count":4,"forks_count":5,"subscribers_count":3,"default_branch":"main","last_synced_at":"2025-06-04T13:16:09.121Z","etag":null,"topics":["benchmark","computer-vision","deploy","llm","multimodal-large-language-models","transformers","video-understanding"],"latest_commit_sha":null,"homepage":"https://wenhaochai.com/aurora-web","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/rese1f.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null}},"created_at":"2024-09-05T03:59:38.000Z","updated_at":"2025-06-04T06:24:06.000Z","dependencies_parsed_at":"2025-04-22T07:25:33.390Z","dependency_job_id":"53895607-248c-4b48-989e-7bc9075b5a05","html_url":"https://github.com/rese1f/aurora","commit_stats":null,"previous_names":["rese1f/aurora"],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/rese1f/aurora","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/rese1f%2Faurora","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/rese1f%2Faurora/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/rese1f%2Faurora/releases","manifests
_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/rese1f%2Faurora/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/rese1f","download_url":"https://codeload.github.com/rese1f/aurora/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/rese1f%2Faurora/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":271078704,"owners_count":24695490,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","status":"online","status_checked_at":"2025-08-18T02:00:08.743Z","response_time":89,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["benchmark","computer-vision","deploy","llm","multimodal-large-language-models","transformers","video-understanding"],"created_at":"2024-11-10T14:04:14.822Z","updated_at":"2025-08-19T00:13:39.066Z","avatar_url":"https://github.com/rese1f.png","language":"Python","readme":"# Aurora Series\nA more efficient multimodal large language model series.\n\n\u003ctable\u003e\u003ctr\u003e\u003ctd\u003e\n    \u003cstrong\u003eAuroraCap\u003c/strong\u003e: Efficient, Performant Video Detailed Captioning and a New Benchmark, ICLR, 
2025.\n\u003c/td\u003e\u003c/tr\u003e\n\u003c/table\u003e\n\n[![](https://img.shields.io/badge/AuroraCap-docs-922133)](docs/auroracap/README.md)\n[![](https://img.shields.io/badge/web-922133)](https://rese1f.github.io/aurora-web/)\n[![](http://img.shields.io/badge/arXiv-922133)](https://arxiv.org/abs/2410.03051)\n[![](https://img.shields.io/badge/%F0%9F%A4%97%20_AuroraCap_model-ffc107?color=ffc107\u0026logoColor=white)](https://huggingface.co/collections/wchai/auroracap-66d117ffe13bedda96702013)\n[![](https://img.shields.io/badge/%F0%9F%A4%97%20_VDC_benchmark-ffc107?color=ffc107\u0026logoColor=white)](https://huggingface.co/datasets/wchai/Video-Detailed-Caption)\n[![](https://img.shields.io/badge/%F0%9F%A4%97%20_Trainset-ffc107?color=ffc107\u0026logoColor=white)](https://huggingface.co/datasets/wchai/AuroraCap-trainset)\n\n\n\u003cimg src=\"assets/auroracap/vdc_baseline.png\" align=\"center\"\u003e\n\n## News\n\n- [2024/10/26] VDC benchmark and AuroraCap baseline are supported in [EvolvingLMMs-Lab/lmms-eval](https://github.com/EvolvingLMMs-Lab/lmms-eval).\n- [2024/10/07] Released the technical report on [arXiv](https://arxiv.org/abs/2410.03051).\n- [2024/10/01] Released the AuroraCap model and VDC benchmark, as well as the training and evaluation code.\n\n## Future Updates\n\n- [x] PR to [EvolvingLMMs-Lab/lmms-eval](https://github.com/EvolvingLMMs-Lab/lmms-eval) with model and benchmark for fast and easy evaluation.\n- [ ] PR to [HuggingFace transformers](https://github.com/huggingface/transformers), but you can also use our [dev branch](https://github.com/rese1f/transformers/tree/aurora) for now.\n- [ ] Support [SGLang](https://github.com/sgl-project/sglang) deployment.\n- [ ] Support training with [Xtuner-lite](https://github.com/hhaAndroid/xtuner) for faster training and easier configuration.\n\n## Quick Start\n\n### Installation\n\nWe recommend installing aurora in a Conda virtual environment (Python \u003e= 3.10).\n```\nconda create -n aurora python=3.10\nconda 
activate aurora\n```\n\nInstall PyTorch following the official [instructions](https://pytorch.org/get-started/locally/). Note that PyTorch 2.5 is currently not supported.\n```\npip install torch torchvision\n```\n\nClone this repository and install from source.\n```\ngit clone https://github.com/rese1f/aurora.git \u0026\u0026 cd aurora\n```\n\nFor training, install additional dependencies.\n```\ncd src/xtuner \u0026\u0026 pip install -e '.[all]'\n```\n\nFor evaluation, install additional dependencies.\n```\ncd src/lmms-eval \u0026\u0026 pip install -e .\n```\n\n\n### Play with AuroraCap\n\n```\npython inference.py \\\n    --model_path wchai/AuroraCap-7B-VID-xtuner \\\n    --prompt \"Describe the video in detail.\" \\\n    --visual_input assets/auroracap/test.mp4 \\\n    --num_frm 8 \\\n    --token_kept_ratio 0.8 \\\n    --temperature 0.0 \\\n    --top_p 1.0 \\\n    --num_beams 1 \\\n    --max_new_tokens 2048\n```\n\nOr launch the Gradio GUI:\n\n```\npython gradio_gui.py\n```\n\nFor the beta version with transformers, see [here](docs/auroracap/README.md#beta-version-with-transformers).\n\n\n## FAQ\n\nQ: Can token merging only be used during inference?\n\nA: No, our experiments show that token merging also accelerates training while maintaining similar performance. Besides AuroraCap, you can also apply token merging to other LLaVA-like models.\n\nQ: How should I set the `token_kept_ratio` parameter?\n\nA: AuroraCap uses a token merging technique to reduce the number of visual tokens before they are fed into the LLM decoder. `token_kept_ratio` ranges from 0 to 1 and controls the number of visual tokens kept. For example, if `token_kept_ratio` is 0.5, then 50% of the visual tokens will be kept. 
We recommend using a `token_kept_ratio` in the range of 0.2 to 0.4 for a better performance-cost trade-off on captioning tasks, above 0.5 for visual question answering tasks, and above 0.8 for OCR-related tasks.\n\nQ: Why do we provide both Huggingface-format and Xtuner-format weights for AuroraCap?\n\nA: While Xtuner supports saving checkpoints in multiple formats, it currently only allows continued training with the Xtuner format. Therefore, we currently provide the model in the Xtuner format for both continued training and inference. In the future, we will provide the model in the Huggingface format for both training and inference, enabling quicker SGLang deployment and integration with transformers.\n\nQ: In the _default_template_yaml of VDC in lmms-eval, why is gpt_eval_model_name set to gpt-4o-mini instead of Llama-3.1-8B?\n\nA: The lmms-eval framework requires the gpt_eval_model_name field to be specified, but it only supports API models in that field. Since we couldn’t modify the main function, we used gpt-4o-mini as a placeholder. However, the actual evaluation is still conducted using Llama-3.1-8B, which is consistent with both the paper and the implementation.\n\n## Citation\n\n```bibtex\n@article{chai2024auroracap,\n  title={AuroraCap: Efficient, Performant Video Detailed Captioning and a New Benchmark},\n  author={Wenhao Chai and Enxin Song and Yilun Du and Chenlin Meng and Vashisht Madhavan and Omer Bar-Tal and Jeng-Neng Hwang and Saining Xie and Christopher D. Manning},\n  journal={arXiv preprint arXiv:2410.03051},\n  year={2024}\n}\n```\n\n## License\n\nThis project is released under the [Apache License 2.0](LICENSE). 
Please also adhere to the Licenses of models and datasets being used.\n","funding_links":[],"categories":[],"sub_categories":[],"project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Frese1f%2Faurora","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Frese1f%2Faurora","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Frese1f%2Faurora/lists"}