{"id":14964607,"url":"https://github.com/foundationvision/groma","last_synced_at":"2025-04-04T15:09:43.245Z","repository":{"id":234971901,"uuid":"789669647","full_name":"FoundationVision/Groma","owner":"FoundationVision","description":"[ECCV2024] Grounded Multimodal Large Language Model with Localized Visual Tokenization","archived":false,"fork":false,"pushed_at":"2024-06-07T06:51:14.000Z","size":14147,"stargazers_count":553,"open_issues_count":2,"forks_count":58,"subscribers_count":35,"default_branch":"main","last_synced_at":"2024-10-29T17:12:20.623Z","etag":null,"topics":["foundation-models","grounding","large-language-models","llama","llama2","llm","mllm","multimodal","vision-language-model"],"latest_commit_sha":null,"homepage":"https://groma-mllm.github.io/","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/FoundationVision.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2024-04-21T08:08:59.000Z","updated_at":"2024-10-27T22:55:51.000Z","dependencies_parsed_at":"2024-04-21T23:03:01.988Z","dependency_job_id":"0297bdc3-622c-421c-bccf-66aa73a9e6b6","html_url":"https://github.com/FoundationVision/Groma","commit_stats":null,"previous_names":["foundationvision/groma"],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/FoundationVision%2FGroma","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/FoundationVision%2FGroma/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Foundation
Vision%2FGroma/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/FoundationVision%2FGroma/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/FoundationVision","download_url":"https://codeload.github.com/FoundationVision/Groma/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":247198463,"owners_count":20900080,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["foundation-models","grounding","large-language-models","llama","llama2","llm","mllm","multimodal","vision-language-model"],"created_at":"2024-09-24T13:33:29.526Z","updated_at":"2025-04-04T15:09:43.227Z","avatar_url":"https://github.com/FoundationVision.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"\u003cdiv align=\"center\"\u003e\n\u003ch1\u003e Groma: Grounded Multimodal Assistant \u003c/h1\u003e\n\n\u003e [**Groma: Localized Visual Tokenization for Grounding Multimodal Large Language Models**](https://arxiv.org/abs/2404.13013)               \n\u003e **Chuofan Ma, Yi Jiang, Jiannan Wu, Zehuan Yuan, Xiaojuan Qi**\n\u003e \n\u003e\u003ca href=\"https://arxiv.org/abs/2404.13013\"\u003e\u003cimg src='https://img.shields.io/badge/arXiv-Groma-red' alt='Paper PDF'\u003e\u003c/a\u003e\n\u003e\u003ca href='https://groma-mllm.github.io/'\u003e\u003cimg src='https://img.shields.io/badge/Project_Page-Groma-green' alt='Project Page'\u003e\u003c/a\u003e\n\u003e\u003ca href='https://huggingface.co/FoundationVision/groma-7b-finetune'\u003e\u003cimg 
src='https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-Model-blue'\u003e\u003c/a\u003e\n\u003e\u003ca href='https://huggingface.co/datasets/FoundationVision/groma_instruct'\u003e\u003cimg src='https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-Dataset-yellow'\u003e\u003c/a\u003e\n\n\u003cimg src='docs/teaser.png' align=\"center\" width=\"80%\"\u003e\n\u003cp align=\"left\"\u003eGroma is an MLLM with exceptional region understanding and visual grounding capabilities. It can take user-defined region inputs (boxes) as well as generate long-form responses that are grounded to visual context.\u003c/p\u003e\n\n\u003cimg src='docs/paradigm.png' align=\"center\" width=\"80%\"\u003e\n\u003cp align=\"left\"\u003eGroma presents a novel paradigm of grounded MLLMs. (a) LLM for localization (e.g., Kosmos-2, Shikra); (b) External modules for localization (e.g., Lisa); and (c) \u003cb\u003eVisual tokenizer for localization (Groma)\u003c/b\u003e.\n\n\u003c/div\u003e\n\n\n## Contents\n- [Install](#installation)\n- [Model](#model-weights)\n- [Data](#prepare-data)\n- [Training](#training)\n- [Inference](#inference)\n- [Evaluation](#evaluation)\n\n\n\n## Performance\nState-of-the-art performance on referring expression comprehension (REC) benchmarks among multimodal\nlarge language models.\n\n\u003ctable\u003e\n    \u003cthead\u003e\n    \u003ctr\u003e\n        \u003cth rowspan=\"2\"\u003eMethod\u003c/th\u003e\n        \u003cth colspan=\"3\"\u003eRefCOCO\u003c/th\u003e\n        \u003cth colspan=\"3\"\u003eRefCOCO+\u003c/th\u003e\n        \u003cth colspan=\"2\"\u003eRefCOCOg\u003c/th\u003e\n        \u003cth rowspan=\"2\"\u003eAverage\u003c/th\u003e\n    \u003c/tr\u003e\n    \u003ctr\u003e\n        \u003cth\u003eval\u003c/th\u003e\n        \u003cth\u003etestA\u003c/th\u003e\n        \u003cth\u003etestB\u003c/th\u003e\n        \u003cth\u003eval\u003c/th\u003e\n        \u003cth\u003etestA\u003c/th\u003e\n        \u003cth\u003etestB\u003c/th\u003e\n        
\u003cth\u003eval\u003c/th\u003e\n        \u003cth\u003etest\u003c/th\u003e\n    \u003c/tr\u003e\n    \u003c/thead\u003e\n    \u003ctbody\u003e\n    \u003ctr align=\"center\"\u003e\n        \u003ctd\u003eShikra\u003c/td\u003e\n        \u003ctd\u003e87.01\u003c/td\u003e\n        \u003ctd\u003e90.61\u003c/td\u003e\n        \u003ctd\u003e80.24\u003c/td\u003e\n        \u003ctd\u003e81.60\u003c/td\u003e\n        \u003ctd\u003e87.36\u003c/td\u003e\n        \u003ctd\u003e72.12\u003c/td\u003e\n        \u003ctd\u003e82.27\u003c/td\u003e\n        \u003ctd\u003e82.19\u003c/td\u003e\n        \u003ctd\u003e82.93\u003c/td\u003e\n    \u003c/tr\u003e\n    \u003ctr align=\"center\"\u003e\n        \u003ctd\u003eFerret\u003c/td\u003e\n        \u003ctd\u003e87.49\u003c/td\u003e\n        \u003ctd\u003e91.35\u003c/td\u003e\n        \u003ctd\u003e82.45\u003c/td\u003e\n        \u003ctd\u003e80.78\u003c/td\u003e\n        \u003ctd\u003e87.38\u003c/td\u003e\n        \u003ctd\u003e73.14\u003c/td\u003e\n        \u003ctd\u003e83.93\u003c/td\u003e\n        \u003ctd\u003e84.76\u003c/td\u003e\n        \u003ctd\u003e83.91\u003c/td\u003e\n    \u003c/tr\u003e\n    \u003ctr align=\"center\"\u003e\n        \u003ctd\u003eMiniGPT-v2\u003c/td\u003e\n        \u003ctd\u003e88.69\u003c/td\u003e\n        \u003ctd\u003e91.65\u003c/td\u003e\n        \u003ctd\u003e85.33\u003c/td\u003e\n        \u003ctd\u003e79.97\u003c/td\u003e\n        \u003ctd\u003e85.12\u003c/td\u003e\n        \u003ctd\u003e74.45\u003c/td\u003e\n        \u003ctd\u003e84.44\u003c/td\u003e\n        \u003ctd\u003e84.66\u003c/td\u003e\n        \u003ctd\u003e84.29\u003c/td\u003e\n    \u003c/tr\u003e\n    \u003ctr align=\"center\"\u003e\n        \u003ctd\u003eQwen-VL\u003c/td\u003e\n        \u003ctd\u003e89.36\u003c/td\u003e\n        \u003ctd\u003e92.26\u003c/td\u003e\n        \u003ctd\u003e85.34\u003c/td\u003e\n        \u003ctd\u003e83.12\u003c/td\u003e\n        \u003ctd\u003e88.25\u003c/td\u003e\n        \u003ctd\u003e77.21\u003c/td\u003e\n       
 \u003ctd\u003e85.58\u003c/td\u003e\n        \u003ctd\u003e85.48\u003c/td\u003e\n        \u003ctd\u003e85.83\u003c/td\u003e\n    \u003c/tr\u003e\n    \u003ctr style=\"background-color: #ADD8E6;\"\u003e\n        \u003cth\u003eGroma\u003c/th\u003e\n        \u003cth\u003e89.53\u003c/th\u003e\n        \u003cth\u003e92.09\u003c/th\u003e\n        \u003cth\u003e86.26\u003c/th\u003e\n        \u003cth\u003e83.90\u003c/th\u003e\n        \u003cth\u003e88.91\u003c/th\u003e\n        \u003cth\u003e78.05\u003c/th\u003e\n        \u003cth\u003e86.37\u003c/th\u003e\n        \u003cth\u003e87.01\u003c/th\u003e\n        \u003cth\u003e86.52\u003c/th\u003e\n    \u003c/tr\u003e\n    \u003c/tbody\u003e\n\u003c/table\u003e\n\n## Installation\nClone the repository\n~~~\ngit clone https://github.com/FoundationVision/Groma.git\ncd Groma\n~~~\n\nCreate the conda environment and install dependencies\n~~~\nconda create -n groma python=3.9 -y\nconda activate groma\nconda install pytorch==2.1.0 torchvision==0.16.0 torchaudio==2.1.0 pytorch-cuda=11.8 -c pytorch -c nvidia\npip install --upgrade pip  # enable PEP 660 support\npip install -e .\n\ncd mmcv\nMMCV_WITH_OPS=1 pip install -e .\ncd ..\n~~~\n\nInstall flash-attention for training\n~~~\npip install ninja\npip install flash-attn --no-build-isolation\n~~~\n\n\n## Model Weights\nTo play with Groma, please download the [model weights](https://huggingface.co/FoundationVision/groma-7b-finetune) from Hugging Face. \n\nWe additionally provide pretrained checkpoints from intermediate training stages. 
\nYou can start from any point to customize training.\n\n| Training stage | Required checkpoints |\n|:--------------:|:--------------------:|\n| Detection pretraining | [DINOv2-L](https://huggingface.co/facebook/dinov2-large) |\n| Alignment pretraining | [Vicuna-7b-v1.5](https://huggingface.co/lmsys/vicuna-7b-v1.5), [Groma-det-pretrain](https://huggingface.co/FoundationVision/groma-det-pretrain) |\n| Instruction finetuning | [Groma-7b-pretrain](https://huggingface.co/FoundationVision/groma-7b-pretrain) |\n\n\n\n## Prepare Data\nWe provide instructions to download datasets used at different training stages of Groma, \nincluding [Groma Instruct](https://huggingface.co/datasets/FoundationVision/groma_instruct/),\na 30k visually grounded conversation dataset constructed with GPT-4V.\nYou don't have to download all of them unless you want to train Groma from scratch.\nPlease follow instructions in [DATA.md](docs/DATA.md) to prepare datasets.\n\n\u003ctable\u003e\n  \u003ctr\u003e\n    \u003cth align=\"left\"\u003eTraining stage\u003c/th\u003e\n    \u003cth align=\"left\"\u003eData types\u003c/th\u003e\n    \u003cth align=\"left\"\u003eDatasets\u003c/th\u003e\n  \u003c/tr\u003e\n  \u003ctr\u003e\n    \u003ctd align=\"left\"\u003eDetection pretraining\u003c/td\u003e\n    \u003ctd align=\"left\"\u003eDetection\u003c/td\u003e\n    \u003ctd align=\"left\"\u003eCOCO, Objects365, OpenImages, V3Det, SA1B\u003c/td\u003e\n  \u003c/tr\u003e\n  \u003ctr\u003e\n    \u003ctd rowspan=\"4\" align=\"left\"\u003eAlignment pretraining\u003c/td\u003e\n    \u003ctd align=\"left\"\u003eImage caption\u003c/td\u003e\n    \u003ctd align=\"left\"\u003eShareGPT-4V-PT\u003c/td\u003e\n  \u003c/tr\u003e\n  \u003ctr\u003e\n    \u003ctd align=\"left\"\u003eGrounded caption\u003c/td\u003e\n    \u003ctd align=\"left\"\u003eFlickr30k Entities\u003c/td\u003e\n  \u003c/tr\u003e\n  \u003ctr\u003e\n    \u003ctd align=\"left\"\u003eRegion caption\u003c/td\u003e\n    \u003ctd align=\"left\"\u003eVisual Genome, 
RefCOCOg\u003c/td\u003e\n  \u003c/tr\u003e\n  \u003ctr\u003e\n    \u003ctd align=\"left\"\u003eREC\u003c/td\u003e\n    \u003ctd align=\"left\"\u003eCOCO, RefCOCO/g/+, Grit-20m\u003c/td\u003e\n  \u003c/tr\u003e\n  \u003ctr\u003e\n    \u003ctd rowspan=\"4\" align=\"left\"\u003eInstruction finetuning\u003c/td\u003e\n    \u003ctd align=\"left\"\u003eGrounded caption\u003c/td\u003e\n    \u003ctd align=\"left\"\u003eFlickr30k Entities\u003c/td\u003e\n  \u003c/tr\u003e\n  \u003ctr\u003e\n    \u003ctd align=\"left\"\u003eRegion caption\u003c/td\u003e\n    \u003ctd align=\"left\"\u003eVisual Genome, RefCOCOg\u003c/td\u003e\n  \u003c/tr\u003e\n  \u003ctr\u003e\n    \u003ctd align=\"left\"\u003eREC\u003c/td\u003e\n    \u003ctd align=\"left\"\u003eCOCO, RefCOCO/g/+\u003c/td\u003e\n  \u003c/tr\u003e\n  \u003ctr\u003e\n    \u003ctd align=\"left\"\u003eInstruction following\u003c/td\u003e\n    \u003ctd align=\"left\"\u003eGroma Instruct, LLaVA Instruct, ShareGPT-4V\u003c/td\u003e\n  \u003c/tr\u003e\n\u003c/table\u003e\n\n\n## Training\nFor detection pretraining, please run\n~~~\nbash scripts/det_pretrain.sh {path_to_dinov2_ckpt} {output_dir}\n~~~\n\nFor alignment pretraining, please run\n~~~\nbash scripts/vl_pretrain.sh {path_to_vicuna_ckpt} {path_to_groma_det_pretrain_ckpt} {output_dir}\n~~~\n\nFor instruction finetuning, please run\n~~~\nbash scripts/vl_finetune.sh {path_to_groma_7b_pretrain_ckpt} {output_dir}\n~~~\n\n\n## Inference\nTo test on a single image, you can run\n~~~\npython -m groma.eval.run_groma \\\n    --model-name {path_to_groma_7b_finetune} \\\n    --image-file {path_to_img} \\\n    --query {user_query} \\\n    --quant_type 'none' # support ['none', 'fp16', '8bit', '4bit'] for inference\n~~~\n\n\n## Evaluation\nFor evaluation, please refer to [EVAL.md](docs/EVAL.md) for more details.\n\n\n## Citation\nIf you find this repo useful for your research, feel free to give us a star ⭐ or cite our paper:\n```\n@article{ma2024groma,\n  title={Groma: Localized Visual 
Tokenization for Grounding Multimodal Large Language Models},\n  author={Ma, Chuofan and Jiang, Yi and Wu, Jiannan and Yuan, Zehuan and Qi, Xiaojuan},\n  journal={arXiv preprint arXiv:2404.13013},\n  year={2024}\n}\n```\n\n\n## Acknowledgement\nGroma is built upon the awesome works \n[LLaVA](https://github.com/haotian-liu/LLaVA/) and \n[GPT4RoI](https://github.com/jshilong/GPT4RoI).\n\n\n\n## LICENSE\nThis project is licensed under the Apache License 2.0 - \nsee the [LICENSE](LICENSE) file for details.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Ffoundationvision%2Fgroma","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Ffoundationvision%2Fgroma","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Ffoundationvision%2Fgroma/lists"}