{"id":24251357,"url":"https://github.com/jefferyZhan/Griffon","last_synced_at":"2025-09-23T16:31:12.689Z","repository":{"id":209439072,"uuid":"721913051","full_name":"jefferyZhan/Griffon","owner":"jefferyZhan","description":"Official repo of Griffon series including v1(ECCV 2024), v2, and G","archived":false,"fork":false,"pushed_at":"2024-11-27T04:09:35.000Z","size":6048,"stargazers_count":108,"open_issues_count":11,"forks_count":6,"subscribers_count":3,"default_branch":"master","last_synced_at":"2024-11-27T05:18:57.272Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/jefferyZhan.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2023-11-22T03:18:57.000Z","updated_at":"2024-11-27T04:49:29.000Z","dependencies_parsed_at":"2024-11-05T09:36:32.205Z","dependency_job_id":null,"html_url":"https://github.com/jefferyZhan/Griffon","commit_stats":null,"previous_names":["jefferyzhan/griffon"],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/jefferyZhan%2FGriffon","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/jefferyZhan%2FGriffon/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/jefferyZhan%2FGriffon/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/jefferyZhan%2FGriffon/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/jefferyZhan","download_url":"https:/
/codeload.github.com/jefferyZhan/Griffon/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":233985938,"owners_count":18761562,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2025-01-15T02:50:52.636Z","updated_at":"2025-09-23T16:31:12.684Z","avatar_url":"https://github.com/jefferyZhan.png","language":"Python","funding_links":[],"categories":["其他_机器视觉","📖 Related Papers"],"sub_categories":["资源传输下载","2024.3 ###"],"readme":"![](./docs/logo.jpg)\n\n\u003cdiv align=\"center\"\u003e\n\n# Welcome to Griffon\n\n\u003c/div\u003e\n\nWelcome to the official repository of the Griffon Series — including Griffon v1, v2, G, R, and the Vision-R1 reinforcement learning framework. **Griffon begins with fine-grained perception and localization, achieving state-of-the-art performance in visual grounding and referring expression comprehension (REC) — rivaling expert-level object detection models. Beyond its visual strengths, Griffon also demonstrates impressive general-purpose question answering and the ability to identify relevant regions based on a given question to perform reasoning.** Griffon is continuously evolving to tackle increasingly complex vision-language tasks. We are actively maintaining and open-sourcing our progress. 
Feel free to follow the project and open an issue if you have questions or feedback!\n\n---\n***Understand, Think, and Answer: Advancing Visual Reasoning with Large Multimodal Models***\n\n[`📕Paper`](https://arxiv.org/abs/2505.20753) [`🌀Usage`](./Griffon-R/README.md) \n\u003c!-- [`🤗Model`](https://huggingface.co/collections/JefferyZhan/vision-r1-67e166f8b6a9ec3f6a664262) [`🤗Data`](https://huggingface.co/datasets/JefferyZhan/Vision-R1-Data) --\u003e\n\n***Vision-R1: Evolving Human-Free Alignment in Large Vision-Language Models via Vision-Guided Reinforcement Learning***\n\n[`📕Paper`](https://arxiv.org/abs/2503.18013) [`🌀Usage`](./Vision-R1/README.md) [`🤗Model`](https://huggingface.co/collections/JefferyZhan/vision-r1-67e166f8b6a9ec3f6a664262) [`🤗Data`](https://huggingface.co/datasets/JefferyZhan/Vision-R1-Data)\n\n***Griffon-G: Bridging Vision-Language and Vision-Centric Tasks via Large Multimodal Models***\n\n[`📕Paper`](https://arxiv.org/abs/2410.16163) [`🌀Usage`](./README.md) [`🤗Model`](https://huggingface.co/collections/JefferyZhan/griffon-g-6729d8d65cd58b3f40e87794) [`🤗Data🔥`](https://huggingface.co/datasets/JefferyZhan/Griffon-G-CCMD-8M)\n\n***Griffon v2: Advancing Multimodal Perception with High-Resolution Scaling and Visual-Language Co-Referring (ICCV 2025)***\n\n[`📕Paper`](https://arxiv.org/abs/2403.09333) [`🌀Intro`](./docs/README_v2.md) [`🤗Data🔥`](https://huggingface.co/datasets/JefferyZhan/Griffon-V2-Data)\n\n***Griffon: Spelling out All Object Locations at Any Granularity with Large Language Model (ECCV 2024)***\n\n[`📕Paper`](https://arxiv.org/abs/2311.14552) [`🌀Usage`](./docs/README_v1.md) [`🤗Model`](https://huggingface.co/JefferyZhan/Griffon/tree/main)\n\n\n## Release\n- [x] **`2025.08.12`** 🔥🔥**We have released the data of [Griffon v2](https://huggingface.co/datasets/JefferyZhan/Griffon-V2-Data) and [Griffon-G](https://huggingface.co/datasets/JefferyZhan/Griffon-G-CCMD-8M) on 🤗HuggingFace and have also updated the [training 
codes](./docs/TRAIN_README.md). For any potential bugs or improvements, feel free to submit a pull request.**\n- [x] **`2025.08.11`** 🔥🔥**We are glad to announce that Griffon v2 has been accepted to ICCV 2025.**\n- [x] **`2025.05.27`** We have released the Griffon-R paper on [arXiv](https://arxiv.org/abs/2505.20753).\n- [x] **`2025.03.25`** We have released the Vision-R1 paper, evaluation code, models, and data. Check them out in the [repo](Vision-R1/README.md).\n- [x] **`2025.01.15`** Released the evaluation scripts supporting distributed inference.\n- [x] **`2024.11.26`** We are glad to release the inference code and the model of Griffon-G in [`🤗Griffon-G`](https://huggingface.co/collections/JefferyZhan/griffon-g-6729d8d65cd58b3f40e87794). Training code will be released later.\n- [x] **`2024.07.01`** **Griffon has been accepted to ECCV 2024. The data is released on [`🤗HuggingFace`](https://huggingface.co/datasets/JefferyZhan/Language-prompted-Localization-Dataset)**\n- [x] **`2024.03.11`** We are excited to announce the arrival of Griffon v2. Griffon v2 brings fine-grained perception performance to new heights with high-resolution expert-level detection and counting, and supports visual-language co-referring. Take a look at our demo first. 
Paper is preprinted in [`📕Arxiv`](https://arxiv.org/abs/2403.09333).\n- [x] **`2023.12.06`** Release the Griffon v1 inference code and model in [`🤗HuggingFace`](https://huggingface.co/JefferyZhan/Griffon/tree/main).\n- [x] **`2023.11.29`** Griffon v1 Paper has been released in [`📕Arxiv`](https://arxiv.org/abs/2311.14552).\n\n## What can Griffon do now?\nGriffon-G demonstrates advanced performance across multimodal benchmarks, general VQAs, and text-rich VQAs, achieving new state-of-the-art results in REC and object detection.\n **More quantitative evaluation results can be found in our paper.**\n![](./docs/griffon-g.jpg)\n\n## Get Started\n\n### 1.Clone \u0026 Install\n\n```shell\ngit clone git@github.com:jefferyZhan/Griffon.git\ncd Griffon\npip install -e .\n```\nTips: If you encounter any errors while installing the packages, you can always download the corresponding source files (*.whl), which have been verified by us.\n\n---\n\n### 2.Download the Griffon and CLIP models to the checkpoints folder.\n\n| Model                                | Links                                  |\n|---------                            |---------------------------------------|\n| Griffon-G-9B                        | [`🤗HuggingFace`](https://huggingface.co/JefferyZhan/Griffon-G-gemma2-9B)    |\n| Griffon-G-27B                        | [`🤗HuggingFace`](https://huggingface.co/JefferyZhan/Griffon-G-gemma2-27B/tree/main)    |\n| clip-vit-large-path14               | [`🤗HuggingFace`](https://huggingface.co/openai/clip-vit-large-patch14)    |\n| clip-vit-large-path14-336_to_1022   | [`🤗HuggingFace`](https://huggingface.co/JefferyZhan/clip-vit-large-path14-336_to_1022/tree/main)    |\n---\n\n### 3. 
Training\nPlease refer to the [Training README](./docs/TRAIN_README.md).\n\n---\n### 4.Inference\n\n```shell\n# 4.1 Modify the instruction in run_inference.sh.\n\n# 4.2.1 Without a visual prompt\nbash run_inference.sh [CUDA_ID] [CHECKPOINTS_PATH] [IMAGE_PATH]\n\n# 4.2.2 With a visual prompt for COUNTING: pass the query image and the prompt image separated by a comma, and include the \u003cregion\u003e placeholder in the instruction\nbash run_inference.sh [CUDA_ID] [CHECKPOINTS_PATH] [IMAGE_PATH,PROMPT_PATH]\n```\nNotice: Please pay attention to the singular and plural expressions of objects.\n\n---\n### 5.Evaluation\n\n**5.1 Multimodal Benchmark Evaluation**\n\nPlease refer to the LLaVA evaluation or use VLMEvalKit.\n\n\n**5.2 COCO Detection Evaluation**\n\n\n```shell\n# Single Node\ntorchrun --nproc_per_node 8 --nnodes 1 --node_rank 0 --master_addr 127.0.0.1 --master_port 12457 -m griffon.eval.eval_detection --model-path PATH/TO/MODEL --image-folder PATH/TO/coco2017/val2017 --dataset PATH/TO/instances_val2017.json\n\n# Multiple Node\n## NODE 0\ntorchrun --nproc_per_node 8 --nnodes N --node_rank 0 --master_addr MASTER_ADDR --master_port MASTER_PORT -m griffon.eval.eval_detection --model-path PATH/TO/MODEL --image-folder PATH/TO/coco2017/val2017 --dataset PATH/TO/instances_val2017.json --init tcp://MASTER_ADDR:MASTER_PORT\n## NODE K(1 to N-1)\ntorchrun --nproc_per_node 8 --nnodes N --node_rank K --master_addr MASTER_ADDR --master_port MASTER_PORT -m griffon.eval.eval_detection --model-path PATH/TO/MODEL --image-folder PATH/TO/coco2017/val2017 --dataset PATH/TO/instances_val2017.json --init tcp://MASTER_ADDR:MASTER_PORT\n```\n\n\n**5.3 REC Evaluation**\n\nThe processed RefCOCO annotation set can be downloaded from this [link](https://drive.google.com/file/d/1Yh1l-f-rLSWkAlXUkZiHmK7oUC9NCmGl/view?usp=sharing).\n\n```shell\n# Single Node\ntorchrun --nproc_per_node 8 --nnodes 1 --node_rank 0 --master_addr 127.0.0.1 --master_port 12457 -m griffon.eval.eval_rec --model-path 
PATH/TO/MODEL --image-folder PATH/TO/COCO/train2014 --dataset PATH/TO/REF_COCO_ANN\n\n# Multiple Node\n## NODE 0\ntorchrun --nproc_per_node 8 --nnodes N --node_rank 0 --master_addr MASTER_ADDR --master_port MASTER_PORT -m griffon.eval.eval_rec --model-path PATH/TO/MODEL --image-folder PATH/TO/COCO/train2014 --dataset PATH/TO/REF_COCO_ANN --init tcp://MASTER_ADDR:MASTER_PORT\n## NODE K(1 to N-1)\ntorchrun --nproc_per_node 8 --nnodes N --node_rank K --master_addr MASTER_ADDR --master_port MASTER_PORT -m griffon.eval.eval_rec --model-path PATH/TO/MODEL --image-folder PATH/TO/COCO/train2014 --dataset PATH/TO/REF_COCO_ANN --init tcp://MASTER_ADDR:MASTER_PORT\n```\n\n## Acknowledgement\n\n- [LLaVA](https://github.com/haotian-liu/LLaVA/tree/main) provides the base code and pre-trained models.\n- [Shikra](https://github.com/shikras/shikra) provides insight into how to organize datasets and some base processed annotations.\n- [Llama](https://github.com/facebookresearch/llama) provides the large language model.\n- [Gemma2](https://arxiv.org/abs/2408.00118) provides the large language model.\n- [volgachen](https://github.com/volgachen/Awesome-AI-Environment) provides the basic environment setting config.\n\n## Citation\nIf you find Griffon useful for your research and applications, please cite using this BibTeX:\n```bibtex\n@inproceedings{zhan2025griffonv1,\n  title={Griffon: Spelling out all object locations at any granularity with large language models},\n  author={Zhan, Yufei and Zhu, Yousong and Chen, Zhiyang and Yang, Fan and Tang, Ming and Wang, Jinqiao},\n  booktitle={European Conference on Computer Vision},\n  pages={405--422},\n  year={2025},\n  organization={Springer}\n}\n\n@misc{zhan2024griffonv2,\n      title={Griffon v2: Advancing Multimodal Perception with High-Resolution Scaling and Visual-Language Co-Referring}, \n      author={Yufei Zhan and Yousong Zhu and Hongyin Zhao and Fan Yang and Ming Tang and Jinqiao Wang},\n      year={2024},\n      
eprint={2403.09333},\n      archivePrefix={arXiv},\n      primaryClass={cs.CV}\n}\n\n@article{zhan2024griffon-G,\n  title={Griffon-G: Bridging Vision-Language and Vision-Centric Tasks via Large Multimodal Models},\n  author={Zhan, Yufei and Zhao, Hongyin and Zhu, Yousong and Yang, Fan and Tang, Ming and Wang, Jinqiao},\n  journal={arXiv preprint arXiv:2410.16163},\n  year={2024}\n}\n```\n\n## License\n\n[![Code License](https://img.shields.io/badge/Code%20License-Apache_2.0-green.svg)](https://github.com/tatsu-lab/stanford_alpaca/blob/main/LICENSE)\n[![Data License](https://img.shields.io/badge/Data%20License-CC%20By%20NC%204.0-red.svg)](https://github.com/tatsu-lab/stanford_alpaca/blob/main/DATA_LICENSE)\n\nThe data and checkpoints are licensed for research use only. All of them are also restricted to uses that follow the license agreements of LLaVA, LLaMA, Gemma2, and GPT-4. The dataset is licensed under CC BY NC 4.0 (allowing only non-commercial use), and models trained using the dataset should not be used outside of research purposes.","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2FjefferyZhan%2FGriffon","html_url":"https://awesome.ecosyste.ms/projects/github.com%2FjefferyZhan%2FGriffon","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2FjefferyZhan%2FGriffon/lists"}