{"id":13441255,"url":"https://github.com/Haiyang-W/GiT","last_synced_at":"2025-03-20T11:38:06.507Z","repository":{"id":227669457,"uuid":"768811559","full_name":"Haiyang-W/GiT","owner":"Haiyang-W","description":"[ECCV2024 Oral🔥] Official Implementation of \"GiT: Towards Generalist Vision Transformer through Universal Language Interface\"","archived":false,"fork":false,"pushed_at":"2024-08-12T11:19:59.000Z","size":13101,"stargazers_count":262,"open_issues_count":1,"forks_count":12,"subscribers_count":6,"default_branch":"main","last_synced_at":"2024-08-12T12:44:01.406Z","etag":null,"topics":["foundation-models","perception","transformer","unified","vision-and-language","vision-transformer"],"latest_commit_sha":null,"homepage":"https://arxiv.org/abs/2403.09394","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/Haiyang-W.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":"CITATION.cff","codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2024-03-07T19:24:17.000Z","updated_at":"2024-08-12T11:21:25.000Z","dependencies_parsed_at":"2024-04-22T08:53:25.251Z","dependency_job_id":"a6c76489-1d2f-49ab-af6b-7c7b9b060060","html_url":"https://github.com/Haiyang-W/GiT","commit_stats":null,"previous_names":["haiyang-w/git"],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Haiyang-W%2FGiT","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Haiyang-W%2FGiT/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Haiyang-W%2FGiT/releases","manifests_url":"https
://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Haiyang-W%2FGiT/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/Haiyang-W","download_url":"https://codeload.github.com/Haiyang-W/GiT/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":221759944,"owners_count":16876323,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["foundation-models","perception","transformer","unified","vision-and-language","vision-transformer"],"created_at":"2024-07-31T03:01:31.656Z","updated_at":"2025-03-20T11:38:06.495Z","avatar_url":"https://github.com/Haiyang-W.png","language":"Python","funding_links":[],"categories":["Python"],"sub_categories":[],"readme":"# The first GPT-style general vision model unifies various vision tasks only with a vanilla ViT. 
No negative transfer.\n\u003ch5 align=\"center\"\u003e\n\u003c!-- [![hf_space](https://img.shields.io/badge/🤗-Open%20In%20Spaces-blue.svg)](https://huggingface.co/spaces/LanguageBind/GiT)\n[![Replicate demo and cloud API](https://replicate.com/camenduru/GiT/badge)](https://replicate.com/camenduru/GiT)\n[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/camenduru/GiT-jupyter/blob/main/MoE_LLaVA_jupyter.ipynb)\n[![hf_space](https://img.shields.io/badge/🤗-Paper%20In%20HF-red.svg)](https://huggingface.co/papers/2401.15947) --\u003e\n\n\u003c!-- [![youtube](https://img.shields.io/badge/-YouTube-000000?logo=youtube\u0026logoColor=FF0000)](https://www.youtube.com/watch?v=uYb38g-weEY)\n[![jiqizhixin](https://img.shields.io/badge/-WeChat@机器之心-000000?logo=wechat\u0026logoColor=07C160)](https://mp.weixin.qq.com/s/ICylR6n2LhqQRS0CAHFI1A) --\u003e\n[![arXiv](https://img.shields.io/badge/Arxiv-2403.09394-b31b1b.svg?logo=arXiv)](https://arxiv.org/abs/2403.09394)\n[![License](https://img.shields.io/badge/License-Apache%202.0-yellow)](https://github.com/Haiyang-W/GiT/blob/main/LICENSE) \n[![Hits](https://hits.seeyoufarm.com/api/count/incr/badge.svg?url=https%3A%2F%2Fgithub.com%2FHaiyang-W%2FGiT%2Ftree%2Fmain\u0026count_bg=%2379C83D\u0026title_bg=%23555555\u0026icon=\u0026icon_color=%23E7E7E7\u0026title=hits\u0026edge_flat=false)](https://hits.seeyoufarm.com)\n[![GitHub issues](https://img.shields.io/github/issues/Haiyang-W/GiT?color=critical\u0026label=Issues)](https://github.com/Haiyang-W/GiT/issues?q=is%3Aopen+is%3Aissue)\n[![GitHub closed issues](https://img.shields.io/github/issues-closed/Haiyang-W/GiT?color=success\u0026label=Issues)](https://github.com/Haiyang-W/GiT/issues?q=is%3Aissue+is%3Aclosed)  \n[![Twitter](https://img.shields.io/badge/Twitter-🔥%2036k%20views-b31b1b.svg?style=social\u0026logo=twitter)](https://twitter.com/_akhaliq/status/1768484390873477480) \u003cbr\u003e\n\u003c/h5\u003e\n\nThis repo 
is the official implementation of [**ECCV2024**](https://eccv.ecva.net/) \u003cfont color=Red\u003e**Oral**\u003c/font\u003e paper: [GiT: Towards Generalist Vision Transformer through Universal Language Interface](https://arxiv.org/abs/2403.09394) as well as the follow-ups. We have made every effort to ensure that the codebase is clean, concise, easily readable, state-of-the-art, and relies only on minimal dependencies.\n\n\u003e GiT: Towards Generalist Vision Transformer through Universal Language Interface\n\u003e\n\u003e [Haiyang Wang*](https://scholar.google.com/citations?user=R3Av3IkAAAAJ\u0026hl=en\u0026oi=ao), [Hao Tang*](https://scholar.google.com/citations?user=MyarrsEAAAAJ\u0026hl=en), [Li Jiang](https://scholar.google.com/citations?user=5cIodxsAAAAJ\u0026hl=en) $^\\dagger$, [Shaoshuai Shi](https://scholar.google.com/citations?user=DC9wzBgAAAAJ\u0026hl=en\u0026oi=ao), [Muhammad Ferjad Naeem](https://scholar.google.com/citations?user=PR2DwYYAAAAJ\u0026hl=en), [Hongsheng Li](https://scholar.google.com/citations?user=BN2Ze-QAAAAJ\u0026hl=en\u0026oi=ao), [Bernt Schiele](https://scholar.google.com/citations?user=z76PBfYAAAAJ\u0026hl=en), [Liwei Wang](https://scholar.google.com/citations?user=VZHxoh8AAAAJ\u0026hl=en) $^\\dagger$\n\u003e - Primary contact: Haiyang Wang ( wanghaiyang6@stu.pku.edu.cn ), Hao Tang ( tanghao@stu.pku.edu.cn )\n\n\u003cdiv align=\"center\"\u003e\n  \u003cimg src=\"assets/Figure1.png\" width=\"800\"/\u003e\n\u003c/div\u003e\n\n## 📣 News\n- [24-8-12] 🤗 Our GiT was accepted by [ECCV2024](https://eccv.ecva.net/) with \u003cfont color=Red\u003e**Oral**\u003c/font\u003e presentation.\n- [24-7-01] 🤗 Our GiT was accepted by [ECCV2024](https://eccv.ecva.net/).\n- [24-3-15] 🚀 Training and inference Code is released.\n- [24-3-15] 👀 GiT is released on [arXiv](https://arxiv.org/abs/2403.09394).\n\n## 💫 What we want to do\n### The Model Architectures across various AI domains are converging towards \u003cfont color=Red\u003e*Multi-Layer Plain 
Transformers*\u003c/font\u003e. \n- *Language Modeling* ([GPT](https://arxiv.org/abs/2005.14165))\n- *2D Image Modeling* ([ViT](https://arxiv.org/abs/2010.11929))\n- *3D Point Cloud Modeling* ([DSVT](https://openaccess.thecvf.com/content/CVPR2023/papers/Wang_DSVT_Dynamic_Sparse_Voxel_Transformer_With_Rotated_Sets_CVPR_2023_paper.pdf))\n- *2D Image and 3D Point Cloud Joint Modeling* ([UniTR](https://arxiv.org/pdf/2308.07732))\n- *Graph Modeling* ([Graphormer](https://proceedings.neurips.cc/paper/2021/file/f1c1592588411002af340cbaedd6fc33-Paper.pdf))\n- $\\cdot \\cdot \\cdot$\n### Reducing Human Bias in Model Architecture Designing\nWe aim to unify the model architecture of vision and language through a plain transformer, **reducing human biases** such as modality-specific encoders and task-specific heads.  A key advancement in deep learning is the shift from hand-crafted to autonomously learned features, inspiring us to reduce human-designed aspects in architecture. Moreover, benefiting from the flexibility of plain transformers, our framework can extend to more modalities like [point clouds](https://github.com/Haiyang-W/UniTR) and graphs.\n\n## 🤔 What we achieve\nBuilding a universal computation model across all tasks stands as the cornerstone of artificial intelligence, reducing the need for task-specific designs. In this project, we introduce GiT (**G**eneralist V**i**sion **T**ransformer). 
GiT has the following characteristics: \n - 😮 **Minimalist architecture design similar to LLM**: GiT consists solely of a single transformer, without the inclusion of additional vision encoders and adapters.\n - 🚀 **Covering all types of visual understanding tasks**: GiT addresses a spectrum of visual tasks, including object-level tasks (e.g., object detection), pixel-level tasks (e.g., semantic segmentation), and vision-language tasks (e.g., image captioning).\n - 🤗 **Achieving multi-task ability by unified language interface**: Similar to LLMs, GiT exhibits a task synergy effect in multi-task training. It fosters mutual enhancement across tasks, leading to significant improvements compared to isolated training. No negative transfer is observed.\n - 🔥 **Strong performance on zero-shot and few-shot benchmark**: GiT scales well with model size and data, demonstrating remarkable generalizability across diverse scenarios after training on 27 datasets.\n - 👍 **Simple one-stage training strategy**: GiT uses a very simple one-stage training strategy, fully embracing the training style utilized by the current LLM framework.\n\n## Overview\n- [💫 What we want to do](https://github.com/Haiyang-W/GiT?tab=readme-ov-file#-what-we-want-to-do)\n- [🤔 Introduction](https://github.com/Haiyang-W/GiT?tab=readme-ov-file#-what-we-achieve)\n- [🚀 Main Results](https://github.com/Haiyang-W/GiT?tab=readme-ov-file#-main-results)\n- [🛠️ Quick Start](https://github.com/Haiyang-W/GiT?tab=readme-ov-file#%EF%B8%8F-quick-start)\n- [👀 Todo](https://github.com/Haiyang-W/GiT?tab=readme-ov-file#-todo)\n- [👍 Acknowledgments](https://github.com/Haiyang-W/GiT?tab=readme-ov-file#-acknowledgement)\n- [📘 Citation](https://github.com/Haiyang-W/GiT?tab=readme-ov-file#-citation)\n\n## 🚀 Main Results\n\n### Single-Task Benchmark\n|  Model  |Params| Metric | Performance |ckpt|log|config|\n|---------|---------|---------|--------|--------|---------|---------|\n|  GiT-B\u003csub\u003edetection\u003c/sub\u003e | 
131M|mAP|45.1 | [ckpt](https://huggingface.co/kanashi6/GiT/blob/main/det_base.pth)|[log](https://huggingface.co/kanashi6/GiT/blob/main/det_base.log)| [config](https://github.com/Haiyang-W/GiT/blob/main/configs/GiT/single_detection_base.py)|\n|  GiT-B\u003csub\u003einsseg\u003c/sub\u003e | 131M|mAP|31.4 |[ckpt](https://huggingface.co/kanashi6/GiT/blob/main/insseg_base.pth)|[log](https://huggingface.co/kanashi6/GiT/blob/main/insseg_base.log)| [config](https://github.com/Haiyang-W/GiT/blob/main/configs/GiT/single_instanceseg_base.py) |\n|  GiT-B\u003csub\u003esemseg\u003c/sub\u003e | 131M|mIoU|47.7 |[ckpt](https://huggingface.co/kanashi6/GiT/blob/main/semseg_base.pth)|[log](https://huggingface.co/kanashi6/GiT/blob/main/semseg_base.log)| [config](https://github.com/Haiyang-W/GiT/blob/main/configs/GiT/single_semanticseg_base.py) |\n|  GiT-B\u003csub\u003ecaption\u003c/sub\u003e| 131M|BLEU-4|33.7 | [ckpt](https://huggingface.co/kanashi6/GiT/blob/main/caption_base.pth)|[log](https://huggingface.co/kanashi6/GiT/blob/main/caption_base.log)| [config](https://github.com/Haiyang-W/GiT/blob/main/configs/GiT/single_caption_base.py) |\n|  GiT-B\u003csub\u003egrounding\u003c/sub\u003e| 131M|Acc@0.5|83.3 | [ckpt](https://huggingface.co/kanashi6/GiT/blob/main/grounding_base.pth)|[log](https://huggingface.co/kanashi6/GiT/blob/main/grounding_base.log)| [config](https://github.com/Haiyang-W/GiT/blob/main/configs/GiT/single_visualgrounding_base.py) |\n### Multi-Tasking Benchmark\n|  Model  |Params| Detection | Ins Seg| Sem Seg |Caption |Grounding |ckpt|log|config|\n|---------|---------|---------|--------|--------|---------|---------|---------|---------|---------|\n|  GiT-B\u003csub\u003emulti-task\u003c/sub\u003e | 131M|46.7 | 31.9 | 47.8 |35.3|85.8|[ckpt](https://huggingface.co/kanashi6/GiT/blob/main/multi_base.pth)|[log](https://huggingface.co/kanashi6/GiT/blob/main/multi_base.log)| [config](https://github.com/Haiyang-W/GiT/blob/main/configs/GiT/multi_fivetask_base.py) |\n|  
GiT-L\u003csub\u003emulti-task\u003c/sub\u003e | 387M|51.3 | 35.1 | 50.6|35.7|88.4|[ckpt](https://huggingface.co/kanashi6/GiT/blob/main/multi_large.pth)|[log](https://huggingface.co/kanashi6/GiT/blob/main/multi_large.log)| [config](https://github.com/Haiyang-W/GiT/blob/main/configs/GiT/multi_fivetask_large.py) |\n|  GiT-H\u003csub\u003emulti-task\u003c/sub\u003e| 756M|52.9 | 35.8 | 52.4|36.2|89.2|[ckpt](https://huggingface.co/kanashi6/GiT/blob/main/multi_huge.pth)|[log](https://huggingface.co/kanashi6/GiT/blob/main/multi_huge.log)| [config](https://github.com/Haiyang-W/GiT/blob/main/configs/GiT/multi_fivetask_huge.py) |\n\u003c!-- |  GiT-B\u003csub\u003esingle-task\u003c/sub\u003e | 131M|45.1 | 31.4| 47.7 |33.7|83.3|[ckpt](https://huggingface.co/kanashi6/GiT/blob/main/det_base.pth)|[log](https://huggingface.co/kanashi6/GiT/blob/main/det_base.log)| [config](https://github.com/Haiyang-W/GiT/blob/main/configs/GiT/single_detection_base.py)| --\u003e\n### Task Synergy in Multi-Tasking Training\n|  Model  |Params| Detection | Ins Seg| Sem Seg |Caption |Grounding |\n|---------|---------|---------|--------|--------|---------|---------|\n|  GiT-B\u003csub\u003esingle-task\u003c/sub\u003e | 131M|45.1 | 31.4| 47.7 |33.7|83.3|\n|  *Improvement* | |*+1.6* | *+0.5*| *+0.1* |*+1.6*|*+2.5*|\n|  GiT-B\u003csub\u003emulti-task\u003c/sub\u003e | 131M|46.7 | 31.9 | 47.8 |35.3|85.8|\n### Zero-shot benchmark\n|  Model  | Params|  Cityscapes\u003cbr\u003e(Det)|Cityscapes \u003cbr\u003e(Ins Seg)|Cityscapes \u003cbr\u003e(Sem Seg)|SUN RGB-D|nocaps|ckpt|log|config|\n|---------|---------|---------|--------|--------|---------|---------|---------|---------|---------|\n|  GiT-B\u003csub\u003emulti-task\u003c/sub\u003e |131M| 21.8 | 14.3| 34.4 | 30.9 | 9.2|[ckpt](https://huggingface.co/kanashi6/GiT/blob/main/multi_base.pth)|[log](https://huggingface.co/kanashi6/GiT/blob/main/multi_base.log)| [config](https://github.com/Haiyang-W/GiT/blob/main/configs/GiT/multi_fivetask_base.py) |\n|  
GiT-B\u003csub\u003euniversal\u003c/sub\u003e  |131M|29.1|17.9|56.2|37.5|10.6|[ckpt](https://huggingface.co/kanashi6/GiT/blob/main/universal_base.pth)|[log](https://huggingface.co/kanashi6/GiT/blob/main/universal_base.log)| [config](https://github.com/Haiyang-W/GiT/blob/main/configs/GiT/universal_base.py) |\n|  GiT-L\u003csub\u003euniversal\u003c/sub\u003e |387M|32.3|20.3|58.0|39.9|11.6|[ckpt](https://huggingface.co/kanashi6/GiT/blob/main/universal_large.pth)|[log](https://huggingface.co/kanashi6/GiT/blob/main/universal_large.log)| [config](https://github.com/Haiyang-W/GiT/blob/main/configs/GiT/universal_large.py) |\n|  GiT-H\u003csub\u003euniversal\u003c/sub\u003e | 756M|34.1 | 18.7 | 61.8| 42.5 | 12.6|[ckpt](https://huggingface.co/kanashi6/GiT/blob/main/universal_huge.pth)|[log](https://huggingface.co/kanashi6/GiT/blob/main/universal_huge.log)| [config](https://github.com/Haiyang-W/GiT/blob/main/configs/GiT/universal_huge.py) |\n### Few-shot benchmark\n\n|  Model  | Params|DRIVE|LoveDA|Potsdam|WIDERFace|DeepFashion|config|\n|---------|---------|---------|--------|--------|---------|---------|---------|\n| GiT-B\u003csub\u003emulti-task\u003c/sub\u003e |131M| 34.3 | 24.9| 19.1 | 17.4 |23.0| [config](https://github.com/Haiyang-W/GiT/blob/main/configs/GiT/few-shot/few_shot_drive_base.py)|\n|  GiT-B\u003csub\u003euniversal\u003c/sub\u003e  |131M|51.1|30.8|30.6|31.2|38.3| [config](https://github.com/Haiyang-W/GiT/blob/main/configs/GiT/few-shot/few_shot_drive_base.py) |\n|  GiT-L\u003csub\u003euniversal\u003c/sub\u003e |387M|55.4|34.1|37.2|33.4|49.3| [config](https://github.com/Haiyang-W/GiT/blob/main/configs/GiT/few-shot/few_shot_drive_large.py) |\n|  GiT-H\u003csub\u003euniversal\u003c/sub\u003e | 756M|57.9|35.1|43.4|34.0|52.2| [config](https://github.com/Haiyang-W/GiT/blob/main/configs/GiT/few-shot/few_shot_drive_huge.py) |\n\n## 🛠️ Quick Start\n### Installation\n\n```shell\nconda create -n GiT python=3.8\n\nconda activate GiT\n\n# We only test in 1.9.1, may be 
other versions also work.\npip install torch==1.9.1+cu111 torchvision==0.10.1+cu111 torchaudio==0.9.1 -f https://download.pytorch.org/whl/torch_stable.html\n\npip install -U openmim\nmim install \"mmengine==0.8.3\"\nmim install \"mmcv==2.0.1\"\npip install \"transformers==4.31.0\"\n\ngit clone git@github.com:Haiyang-W/GiT.git\ncd GiT\npip install -v -e .\npip install -r requirements/optional.txt\npip install -r requirements/runtime.txt\n\n# if you face ChildFailedError, please update yapf\npip install yapf==0.40.1\n```\n- Please download the pretrained text embeddings from [huggingface](https://huggingface.co/kanashi6/GiT/tree/main) and organize the downloaded files as follows:\n```\nGiT\n|──bert_embed.pt\n|──bert_embed_large.pt\n|──bert_embed_huge.pt\n```\n- (Optional) Install Java manually for image caption evaluation. Without Java, you can still train image captioning normally, but caption evaluation will fail.\n- (Optional) Install the LVIS API for the LVIS dataset.\n```\n# current path is ./GiT\ncd ..\npip install git+https://github.com/lvis-dataset/lvis-api.git\n```\n\n### Dataset Preparation\n#### Multi-Tasking Dataset\nThe multi-tasking benchmark contains coco2017 for object detection and instance segmentation, ade20k for semantic segmentation, coco caption for image captioning, and the refcoco series for visual grounding. 
\n```\nGiT\n|──data\n|  |──ade\n|  |  |──ADEChallengeData2016\n|  |  |  |──annorations\n|  |  |  |  |──training \u0026 validation\n|  |  |  |──images\n|  |  |  |  |──training \u0026 validation\n|  |  |  |──objectInfo150.txt\n|  |  |  |──sceneCategories.txt\n|  |──coco\n|  |  |──annotations\n|  |  |  |──*.json\n|  |  |──train2017\n|  |  |  |──*.jpg\n|  |  |──val2017\n|  |  |  |──*.jpg\n|  |──coco_2014\n|  |  |──annotations\n|  |  |  |──*.json\n|  |  |  |──coco_karpathy_test.json\n|  |  |  |──coco_karpathy_train.json\n|  |  |  |──coco_karpathy_val_gt.json\n|  |  |  |──coco_karpathy_val.json\n|  |  |──train2014\n|  |  |  |──*.jpg\n|  |  |──val2014\n|  |  |  |──*.jpg\n|  |  |──refcoco\n|  |  |  |──*.p\n```\n\n#### Universal Dataset\nWe use 27 datasets in universal training. For more details about dataset preparation, please refer to [here](https://github.com/Haiyang-W/GiT/blob/main/tools/dataset_preprocess/dataset_prepare.md).\n\u003cdiv align=\"center\"\u003e\n  \u003cimg src=\"assets/universal.png\" width=\"800\"/\u003e\n\u003c/div\u003e\n\u003cbr\u003e\n\n🚨 **We only list part of the commands (GiT-B) below. 
For more detailed commands, please refer to [here](https://github.com/Haiyang-W/GiT/blob/main/docs/en/GiT_commands.md).**\n\n### Training\n#### Single Task \nDetection\n\n```shell\nbash tools/dist_train.sh configs/GiT/single_detection_base.py  ${GPU_NUM} --work-dir ${work_dir}\n```\n\n#### Multi Task \n\nGiT-B\n\n```shell\nbash tools/dist_train.sh configs/GiT/multi_fivetask_base.py  ${GPU_NUM} --work-dir ${work_dir}\n```\n\n#### Universal Training\n\nGiT-B\n\n```shell\nbash tools/dist_train.sh configs/GiT/universal_base.py  ${GPU_NUM} --work-dir ${work_dir}\n```\n\n### Testing\n\n#### Single Task \nDetection\n\n```shell\nbash tools/dist_test.sh configs/GiT/single_detection_base.py ${ckpt_file} ${GPU_NUM} --work-dir ${work_dir}\n```\n\n#### Multi Task \n\nGiT-B\n\n```shell\nbash tools/dist_test.sh configs/GiT/multi_fivetask_base.py ${ckpt_file} ${GPU_NUM} --work-dir ${work_dir}\n```\n#### Zero-shot and few-shot\nPlease download the universal pretrained weights from [huggingface](https://huggingface.co/kanashi6/GiT/tree/main) and organize the files as follows:\n```\nGiT\n|──universal_base.pth\n|──universal_large.pth\n|──universal_huge.pth\n```\n\nZero-shot\n\n```shell\nbash tools/dist_test.sh configs/GiT/zero-shot/zero_shot_cityscapes_det_base.py ${ckpt_file} ${GPU_NUM} --work-dir ${work_dir}\n```\n\nFew-shot\n\n```shell\nbash tools/dist_train.sh configs/GiT/few-shot/few_shot_drive_det_base.py ${GPU_NUM} --work-dir ${work_dir}\n```\n\n#### Customize Dataset\nIf you want to use GiT on your own dataset, please refer to [here](https://github.com/Haiyang-W/GiT/blob/main/docs/en/customize_dataset.md) for more details.\n\n### 🚀 Lightweight Version\nIf your GPU memory is insufficient, you can reduce the resolution as in [this config](https://github.com/Haiyang-W/GiT/blob/main/configs/GiT/single_detection_base_672.py), where we lower the detection resolution to 672. 
It requires ~20 hours of training and reaches ~41.5 mAP.\n\n## 👀 Todo\n\n- [x] Release the [arXiv](https://arxiv.org/abs/2403.09394) version.\n- [x] SOTA performance of generalist model on multi-tasking benchmark.\n- [x] SOTA performance of generalist model on zero- and few-shot benchmark.\n- [x] Clean up and release the inference code.\n- [x] Clean up and release the training code.\n- [ ] Engineering Optimization (faster).\n- [ ] Joint Training including Language (stronger).\n- [ ] Code Refactoring (the code is still a little messy, sorry for that).\n\n## 👍 Acknowledgement\n* [MMDetection](https://github.com/open-mmlab/mmdetection) The codebase we built upon. Thanks for providing such a convenient framework.\n* [BLIP](https://github.com/salesforce/BLIP) We extract text embeddings from BLIP pretrained models and use the web captions filtered by BLIP. Thanks for their efforts in open-sourcing and cleaning the dataset. \n\n## 📘 Citation\nPlease consider citing our work as follows if it is helpful.\n```\n@inproceedings{wang2024git,\n  title={GiT: Towards Generalist Vision Transformer through Universal Language Interface},\n  author={Wang, Haiyang and Tang, Hao and Jiang, Li and Shi, Shaoshuai and Naeem, Muhammad Ferjad and Li, Hongsheng and Schiele, Bernt and Wang, Liwei},\n  booktitle={ECCV},\n  year={2024}\n}\n```\n\n## ✨ Star History\n[![Star History Chart](https://api.star-history.com/svg?repos=Haiyang-W/GiT\u0026type=Date)](https://star-history.com/#Haiyang-W/GiT\u0026Date)\n\n\n\n\u003c!-- ## 🤝 Contributors\n\n\u003ca href=\"https://github.com/Haiyang-W/GiT/graphs/contributors\"\u003e\n  \u003cimg src=\"https://avatars.githubusercontent.com/u/54112784?v=4\" /\u003e\n\u003c/a\u003e --\u003e\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2FHaiyang-W%2FGiT","html_url":"https://awesome.ecosyste.ms/projects/github.com%2FHaiyang-W%2FGiT","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2FHaiyang-W%2FGiT/lists"}