{"id":28857746,"url":"https://github.com/visresearch/llava-stf","last_synced_at":"2026-04-29T04:40:49.791Z","repository":{"id":298224916,"uuid":"987537118","full_name":"visresearch/LLaVA-STF","owner":"visresearch","description":"The official implementation of \"Learning Compact Vision Tokens for Efficient Large Multimodal Models\"","archived":false,"fork":false,"pushed_at":"2025-06-11T02:25:39.000Z","size":2747,"stargazers_count":27,"open_issues_count":1,"forks_count":2,"subscribers_count":1,"default_branch":"main","last_synced_at":"2025-06-20T02:02:13.419Z","etag":null,"topics":["efficient-deep-learning","efficient-inference","large-multimodal-models","large-vision-language-models","llama","llava","token-fusion","token-merging","vision-token-merging"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"other","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/visresearch.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null}},"created_at":"2025-05-21T08:11:14.000Z","updated_at":"2025-06-17T17:33:57.000Z","dependencies_parsed_at":"2025-06-10T03:36:49.383Z","dependency_job_id":null,"html_url":"https://github.com/visresearch/LLaVA-STF","commit_stats":null,"previous_names":["visresearch/llava-stf"],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/visresearch/LLaVA-STF","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/visresearch%2FLLaVA-STF","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/visresearch%2FLLaVA-STF/tags","releases_url":"https://repos.ecosyste.ms/api/v1
/hosts/GitHub/repositories/visresearch%2FLLaVA-STF/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/visresearch%2FLLaVA-STF/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/visresearch","download_url":"https://codeload.github.com/visresearch/LLaVA-STF/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/visresearch%2FLLaVA-STF/sbom","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":260862925,"owners_count":23074181,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["efficient-deep-learning","efficient-inference","large-multimodal-models","large-vision-language-models","llama","llava","token-fusion","token-merging","vision-token-merging"],"created_at":"2025-06-20T02:02:13.046Z","updated_at":"2026-04-29T04:40:49.786Z","avatar_url":"https://github.com/visresearch.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"## Learning Compact Vision Tokens for Efficient Large Multimodal Models\n\nThis repository is the official implementation of \"Learning Compact Vision Tokens for Efficient Large Multimodal Models\".\n\n[[Paper](https://arxiv.org/abs/2506.07138)]    [[BibTex](#Citation)]   [[HuggingFace](https://huggingface.co/visresearch/LLaVA-STF/tree/main)]\n\n![framework](images/tang2025compact.png)\n\n**LLaVA-STF** explores the **spatial redundancy among vision tokens** and **shortens the length of vision token sequences** for **inference acceleration**, where spatially adjacent tokens are fused into one. 
\n\nMeanwhile, a weight-frozen vision encoder cannot adapt well to the demands of diverse downstream vision-language tasks. To this end, we further introduce a Multi-Block Token Fusion (MBTF) module to supplement multi-granularity features for the reduced token sequence. Overall, we combine the STF and MBTF modules to balance token reduction and information preservation, thereby improving inference efficiency without sacrificing multimodal reasoning capabilities. \n\nExperimental results demonstrate that our method based on LLaVA-1.5 achieves comparable or even superior performance to the baseline on 8 popular vision-language benchmarks with only 25% of the baseline's vision tokens. \n\nThe main results are shown in the figure below.\n\n\u003cimg src=\"images/main_results.png\" alt=\"result\" width=\"500px\" /\u003e\n\n### Install\n\n1. Clone this repository and navigate to the LLaVA folder\n```bash\ngit clone [link]\ncd LLaVA\n```\n\n2. Install Package\n```bash\nconda create -n llava python=3.10 -y\nconda activate llava\npip install --upgrade pip \npip install -e .\n```\n\n3. Install additional packages for training cases\n```bash\npip install -e \".[train]\"\npip install flash-attn --no-build-isolation\n```\n\n### Training\n\nWe follow the original LLaVA to conduct two-stage training: a pretraining stage for feature alignment, and a full-parameter fine-tuning stage for visual instruction tuning.\nThe training details are as follows.\n\n1. Download the training data for both pretraining and fine-tuning from the original LLaVA repository.\n2. Run the following command to pretrain the model:\n    ```bash\n    bash scripts/v1_5/pretrain.sh\n    ```\n3. Run the following command to fine-tune the model:\n    ```bash\n    bash scripts/v1_5/finetune.sh\n    ```\n\n### Hyperparameters\nWe use a set of hyperparameters similar to that of the original LLaVA. The hyperparameters used in both pretraining and fine-tuning are provided below.\n\n1. 
Pretraining\n\n| Model | Global Batch Size | Learning rate | Epochs | Max length | Weight decay |\n| --- | ---: | ---: | ---: | ---: | ---: |\n| LLaVA-v1.5-7B | 256 | 1e-3 | 1 | 2048 | 0 |\n\n2. Fine-tuning\n\n| Model | Global Batch Size | Learning rate | Epochs | Max length | Weight decay |\n| --- | ---: | ---: | ---: | ---: | ---: |\n| LLaVA-v1.5-7B | 128 | 2e-5 | 1 | 2048 | 0 |\n\n### Model Weights\n| Model | Schedule | Checkpoint | VQAv2 | GQA | VizWiz | SQA | TextVQA | POPE | MME | MM-Bench | MM-Bench-CN |\n|----------|-----------|-----------|:-:|:-:|:-:|:-:|:-:|:-:|:-:|:-:|:-:|\n| LLaVA-v1.5-7B (pretrain) | 1 epoch | [download](https://huggingface.co/visresearch/LLaVA-STF/tree/main/pretrain/LLaVA-vicuna-1.5-7B) | / | / | / | / | / | / | / | / | / |\n| LLaVA-v1.5-7B (finetune) | full_ft-1e | [download](https://huggingface.co/visresearch/LLaVA-STF/tree/main/full-parameter-finetune) | 78.1 | 61.9 | 51.1 | 70.5 | 57.4 | 86.0 | 1482.8 | 66.2 | 58.9 |\n\n### Evaluation\n\nWe evaluate models on the following 9 benchmarks.\n\n#### VQAv2\n\n1. Download [`test2015`](http://images.cocodataset.org/zips/test2015.zip) and put it under `./playground/data/eval/vqav2`.\n2. Multi-GPU inference.\n```Shell\nCUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 bash scripts/v1_5/eval/vqav2.sh\n```\n3. Submit the results to the [evaluation server](https://eval.ai/web/challenges/challenge-page/830/my-submission).\n\n#### GQA\n\n1. Download the [data](https://cs.stanford.edu/people/dorarad/gqa/download.html) and [evaluation scripts](https://cs.stanford.edu/people/dorarad/gqa/evaluate.html) following the official instructions and put them under `./playground/data/eval/gqa/data`.\n2. Multi-GPU inference.\n```Shell\nCUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 bash scripts/v1_5/eval/gqa.sh\n```\n\n#### VizWiz\n\n1. 
Download [`test.json`](https://vizwiz.cs.colorado.edu/VizWiz_final/vqa_data/Annotations.zip) and extract [`test.zip`](https://vizwiz.cs.colorado.edu/VizWiz_final/images/test.zip) to `test`. Put them under `./playground/data/eval/vizwiz`.\n2. Single-GPU inference.\n```Shell\nCUDA_VISIBLE_DEVICES=0 bash scripts/v1_5/eval/vizwiz.sh\n```\n3. Submit the results to the [evaluation server](https://eval.ai/web/challenges/challenge-page/2185/my-submission).\n\n#### ScienceQA\n\n1. Under `./playground/data/eval/scienceqa`, download `images`, `pid_splits.json`, `problems.json` from the `data/scienceqa` folder of the ScienceQA [repo](https://github.com/lupantech/ScienceQA).\n2. Single-GPU inference and evaluation.\n```Shell\nCUDA_VISIBLE_DEVICES=0 bash scripts/v1_5/eval/sqa.sh\n```\n\n#### TextVQA\n\n1. Download [`TextVQA_0.5.1_val.json`](https://dl.fbaipublicfiles.com/textvqa/data/TextVQA_0.5.1_val.json) and [images](https://dl.fbaipublicfiles.com/textvqa/images/train_val_images.zip) and extract to `./playground/data/eval/textvqa`.\n2. Single-GPU inference and evaluation.\n```Shell\nCUDA_VISIBLE_DEVICES=0 bash scripts/v1_5/eval/textvqa.sh\n```\n\n#### POPE\n\n1. Download `coco` from [POPE](https://github.com/AoiDragon/POPE/tree/e3e39262c85a6a83f26cf5094022a782cb0df58d/output/coco) and put it under `./playground/data/eval/pope`.\n2. Single-GPU inference and evaluation.\n```Shell\nCUDA_VISIBLE_DEVICES=0 bash scripts/v1_5/eval/pope.sh\n```\n\n#### MME\n\n1. Download the data following the official instructions [here](https://github.com/BradyFU/Awesome-Multimodal-Large-Language-Models/tree/Evaluation).\n2. Download the images to `MME_Benchmark_release_version`.\n3. Put the official `eval_tool` and `MME_Benchmark_release_version` under `./playground/data/eval/MME`.\n4. Single-GPU inference and evaluation.\n```Shell\nCUDA_VISIBLE_DEVICES=0 bash scripts/v1_5/eval/mme.sh\n```\n\n#### MMBench\n\n1. 
Download [`mmbench_dev_20230712.tsv`](https://download.openmmlab.com/mmclassification/datasets/mmbench/mmbench_dev_20230712.tsv) and put it under `./playground/data/eval/mmbench`.\n2. Single-GPU inference.\n```Shell\nCUDA_VISIBLE_DEVICES=0 bash scripts/v1_5/eval/mmbench.sh\n```\n3. Submit the results to the [evaluation server](https://opencompass.org.cn/leaderboard-multimodal).\n\n#### MMBench-CN\n\n1. Download [`mmbench_dev_cn_20231003.tsv`](https://download.openmmlab.com/mmclassification/datasets/mmbench/mmbench_dev_cn_20231003.tsv) and put it under `./playground/data/eval/mmbench`.\n2. Single-GPU inference.\n```Shell\nCUDA_VISIBLE_DEVICES=0 bash scripts/v1_5/eval/mmbench_cn.sh\n```\n3. Submit the results to the [evaluation server](https://opencompass.org.cn/leaderboard-multimodal).\n\n\n\n### License\n\nThis project is under the CC-BY-NC 4.0 license. See [LICENSE](LICENSE) for details.\n\n### Citation\n\n```bibtex\n@article{tang2025compact,\n  author  = {Tang, Hao and Shen, Chengchao},\n  title   = {Learning Compact Vision Tokens for Efficient Large Multimodal Models},\n  journal = {arXiv preprint arXiv:2506.07138},\n  year    = {2025},\n}\n```\n\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fvisresearch%2Fllava-stf","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fvisresearch%2Fllava-stf","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fvisresearch%2Fllava-stf/lists"}