{"id":13442502,"url":"https://github.com/OpenGVLab/InternImage","last_synced_at":"2025-03-20T14:31:16.923Z","repository":{"id":63045698,"uuid":"564231518","full_name":"OpenGVLab/InternImage","owner":"OpenGVLab","description":"[CVPR 2023 Highlight] InternImage: Exploring Large-Scale Vision Foundation Models with Deformable Convolutions","archived":false,"fork":false,"pushed_at":"2025-03-04T15:46:51.000Z","size":27757,"stargazers_count":2611,"open_issues_count":181,"forks_count":244,"subscribers_count":34,"default_branch":"master","last_synced_at":"2025-03-13T13:39:40.288Z","etag":null,"topics":["backbone","deformable-convolution","foundation-model","object-detection","semantic-segmentation"],"latest_commit_sha":null,"homepage":"https://arxiv.org/abs/2211.05778","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/OpenGVLab.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2022-11-10T09:24:57.000Z","updated_at":"2025-03-12T09:34:32.000Z","dependencies_parsed_at":"2024-01-14T09:15:06.707Z","dependency_job_id":"592da252-cf91-483a-810d-d0e117220833","html_url":"https://github.com/OpenGVLab/InternImage","commit_stats":{"total_commits":70,"total_committers":13,"mean_commits":5.384615384615385,"dds":0.6857142857142857,"last_synced_commit":"ac5ed37f92a6807d3ecc793dbf62e2bf0c960ef2"},"previous_names":[],"tags_count":5,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/OpenGVLab%2FInternImage","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/OpenGVLab%2FInternImage/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/OpenGVLab%2FInternImage/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/OpenGVLab%2FInternImage/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/OpenGVLab","download_url":"https://codeload.github.com/OpenGVLab/InternImage/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":244630125,"owners_count":20484318,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["backbone","deformable-convolution","foundation-model","object-detection","semantic-segmentation"],"created_at":"2024-07-31T03:01:46.528Z","updated_at":"2025-03-20T14:31:16.913Z","avatar_url":"https://github.com/OpenGVLab.png","language":"Python","funding_links":[],"categories":["Python","Models","Summary"],"sub_categories":["Vision Models"],"readme":"\u003cp\u003e\n\t\u003ca href=\"./README_CN.md\"\u003e[中文版本]\u003c/a\u003e\n\u003c/p\u003e\n\n# InternImage: Large-Scale Vision Foundation 
Model\n\n[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/internimage-exploring-large-scale-vision/object-detection-on-coco)](https://paperswithcode.com/sota/object-detection-on-coco?p=internimage-exploring-large-scale-vision)\n[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/internimage-exploring-large-scale-vision/object-detection-on-coco-minival)](https://paperswithcode.com/sota/object-detection-on-coco-minival?p=internimage-exploring-large-scale-vision)\n[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/internimage-exploring-large-scale-vision/object-detection-on-lvis-v1-0-minival)](https://paperswithcode.com/sota/object-detection-on-lvis-v1-0-minival?p=internimage-exploring-large-scale-vision)\n[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/internimage-exploring-large-scale-vision/object-detection-on-lvis-v1-0-val)](https://paperswithcode.com/sota/object-detection-on-lvis-v1-0-val?p=internimage-exploring-large-scale-vision)\n[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/internimage-exploring-large-scale-vision/object-detection-on-pascal-voc-2012)](https://paperswithcode.com/sota/object-detection-on-pascal-voc-2012?p=internimage-exploring-large-scale-vision)\n[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/internimage-exploring-large-scale-vision/object-detection-on-openimages-v6)](https://paperswithcode.com/sota/object-detection-on-openimages-v6?p=internimage-exploring-large-scale-vision)\n[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/internimage-exploring-large-scale-vision/object-detection-on-crowdhuman-full-body)](https://paperswithcode.com/sota/object-detection-on-crowdhuman-full-body?p=internimage-exploring-large-scale-vision)\n[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/internimage-exploring-large-scale-vision/2d-object-detection-on-bdd100k-val)](https://paperswithcode.com/sota/2d-object-detection-on-bdd100k-val?p=internimage-exploring-large-scale-vision)\n[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/internimage-exploring-large-scale-vision/semantic-segmentation-on-ade20k)](https://paperswithcode.com/sota/semantic-segmentation-on-ade20k?p=internimage-exploring-large-scale-vision)\n[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/internimage-exploring-large-scale-vision/semantic-segmentation-on-cityscapes)](https://paperswithcode.com/sota/semantic-segmentation-on-cityscapes?p=internimage-exploring-large-scale-vision)\n[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/internimage-exploring-large-scale-vision/semantic-segmentation-on-cityscapes-val)](https://paperswithcode.com/sota/semantic-segmentation-on-cityscapes-val?p=internimage-exploring-large-scale-vision)\n[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/internimage-exploring-large-scale-vision/semantic-segmentation-on-pascal-context)](https://paperswithcode.com/sota/semantic-segmentation-on-pascal-context?p=internimage-exploring-large-scale-vision)\n[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/internimage-exploring-large-scale-vision/image-classification-on-inaturalist-2018)](https://paperswithcode.com/sota/image-classification-on-inaturalist-2018?p=internimage-exploring-large-scale-vision)\n[![
PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/internimage-exploring-large-scale-vision/image-classification-on-places365)](https://paperswithcode.com/sota/image-classification-on-places365?p=internimage-exploring-large-scale-vision)\n[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/internimage-exploring-large-scale-vision/image-classification-on-places205)](https://paperswithcode.com/sota/image-classification-on-places205?p=internimage-exploring-large-scale-vision)\n[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/bevformer-v2-adapting-modern-image-backbones/3d-object-detection-on-nuscenes-camera-only)](https://paperswithcode.com/sota/3d-object-detection-on-nuscenes-camera-only?p=bevformer-v2-adapting-modern-image-backbones)\n[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/internimage-exploring-large-scale-vision/image-classification-on-imagenet)](https://paperswithcode.com/sota/image-classification-on-imagenet?p=internimage-exploring-large-scale-vision)\n\nThe official implementation of\n\n[InternImage: Exploring Large-Scale Vision Foundation Models with Deformable Convolutions](https://arxiv.org/abs/2211.05778).\n\n\\[[Paper](https://arxiv.org/abs/2211.05778)\\]  \\[[Blog in Chinese](https://zhuanlan.zhihu.com/p/610772005)\\]\n\n## Highlights\n\n- :thumbsup: **The strongest open-source visual universal backbone model with up to 3 billion parameters**\n- 🏆 **Achieved `90.1% Top1` accuracy in ImageNet, the most accurate among open-source models**\n- 🏆 **Achieved `65.5 mAP` on the COCO benchmark dataset for object detection, the only model that exceeded `65.0 mAP`**\n\n## News\n\n- `Jan 22, 2024`: 🚀 Support [DCNv4](https://github.com/OpenGVLab/DCNv4) in InternImage!\n- `Feb 28, 2023`: 🚀 InternImage is accepted to CVPR 2023!\n- `Nov 18, 2022`: 🚀 InternImage-XL merged into [BEVFormer v2](https://arxiv.org/abs/2211.10439) achieves state-of-the-art performance of `63.4 NDS` on nuScenes Camera Only.\n- `Nov 10, 2022`: 🚀 InternImage-H achieves a new record `65.4 mAP` on COCO detection test-dev and `62.9 mIoU` on ADE20K, outperforming previous models by a large margin.\n\n## History\n\n- [x] Models for other downstream tasks\n- [x] Support [CVPR 2023 Workshop on End-to-End Autonomous Driving](https://opendrivelab.com/e2ead/cvpr23), see [here](https://github.com/OpenGVLab/InternImage/tree/master/autonomous_driving)\n- [x] Support extracting intermediate features, see [here](classification/extract_feature.py)\n- [x] Low-cost training with [DeepSpeed](https://github.com/microsoft/DeepSpeed), see [here](https://github.com/OpenGVLab/InternImage/tree/master/classification)\n- [x] Compiling-free `.whl` package of DCNv3 operator, see [here](https://github.com/OpenGVLab/InternImage/releases/tag/whl_files)\n- [x] InternImage-H(1B)/G(3B)\n- [x] TensorRT inference for classification/detection/segmentation models\n- [x] Classification code of the InternImage series\n- [x] InternImage-T/S/B/L/XL ImageNet-1K pretrained model\n- [x] InternImage-L/XL ImageNet-22K pretrained model\n- [x] InternImage-T/S/B/L/XL detection and instance segmentation model\n- [x] InternImage-T/S/B/L/XL semantic segmentation model\n\n## Introduction\n\nInternImage is an advanced vision foundation model developed by researchers from Shanghai AI Laboratory, Tsinghua University, and other institutions. Unlike models based on Transformers, InternImage employs DCNv3 as its core operator. 
This approach equips the model with dynamic and effective receptive fields required for downstream tasks like object detection and segmentation, while enabling adaptive spatial aggregation.\n\n\u003cdiv align=center\u003e\n\u003cimg src='./docs/figs/arch.png' width=400\u003e\n\u003c/div\u003e\n\nSome other projects related to InternImage include the pretraining algorithm \"M3I-Pretraining,\" the general-purpose decoder series \"Uni-Perceiver,\" and the autonomous driving perception encoder series \"BEVFormer.\"\n\n\u003cdiv align=left\u003e\n\u003cimg src='./docs/figs/intern_pipeline_en.png' width=900\u003e\n\u003c/div\u003e\n\n## Performance\n\n- InternImage achieved an impressive Top-1 accuracy of 90.1% on the ImageNet benchmark dataset using only publicly available data for image classification. Apart from two undisclosed models trained with additional datasets by Google and Microsoft, InternImage is the only open-source model that achieves a Top-1 accuracy of over 90.0%, and it is also the largest model in scale worldwide.\n- InternImage outperformed all other models worldwide on the COCO object detection benchmark dataset with a remarkable mAP of 65.5, making it the only model that surpasses 65 mAP in the world.\n- InternImage also demonstrated world's best performance on 16 other important visual benchmark datasets, covering a wide range of tasks such as classification, detection, and segmentation, making it the top-performing model across multiple domains.\n\n**Classification**\n\n\u003ctable border=\"1\" width=\"90%\"\u003e\n\t\u003ctr align=\"center\"\u003e\n        \u003cth colspan=\"1\"\u003e Image Classification\u003c/th\u003e\u003cth colspan=\"2\"\u003e Scene Classification \u003c/th\u003e\u003cth colspan=\"1\"\u003eLong-Tail Classification\u003c/th\u003e\n    \u003c/tr\u003e\n    \u003ctr align=\"center\"\u003e\n        \u003cth\u003eImageNet\u003c/th\u003e\u003cth\u003ePlaces365\u003c/th\u003e\u003cth\u003ePlaces 205\u003c/th\u003e\u003cth\u003eiNaturalist 2018\u003c/th\u003e\n    \u003c/tr\u003e\n    \u003ctr align=\"center\"\u003e\n        \u003cth\u003e90.1\u003c/th\u003e\u003cth\u003e61.2\u003c/th\u003e\u003cth\u003e71.7\u003c/th\u003e\u003cth\u003e92.6\u003c/th\u003e\n    \u003c/tr\u003e\n\u003c/table\u003e\n\n**Detection**\n\n\u003ctable border=\"1\" width=\"90%\"\u003e\n\t\u003ctr align=\"center\"\u003e\n        \u003cth colspan=\"4\"\u003e General Object Detection \u003c/th\u003e\u003cth colspan=\"3\"\u003e Long-Tail Object Detection \u003c/th\u003e\u003cth colspan=\"1\"\u003e Autonomous Driving Object Detection \u003c/th\u003e\u003cth colspan=\"1\"\u003e Dense Object Detection \u003c/th\u003e\n    \u003c/tr\u003e\n    \u003ctr align=\"center\"\u003e\n        \u003cth\u003eCOCO\u003c/th\u003e\u003cth\u003eVOC 2007\u003c/th\u003e\u003cth\u003eVOC 2012\u003c/th\u003e\u003cth\u003eOpenImage\u003c/th\u003e\u003cth\u003eLVIS minival\u003c/th\u003e\u003cth\u003eLVIS val\u003c/th\u003e\u003cth\u003eBDD100K\u003c/th\u003e\u003cth\u003enuScenes\u003c/th\u003e\u003cth\u003eCrowdHuman\u003c/th\u003e\n    \u003c/tr\u003e\n    \u003ctr align=\"center\"\u003e\n        \u003cth\u003e65.5\u003c/th\u003e\u003cth\u003e94.0\u003c/th\u003e\u003cth\u003e97.2\u003c/th\u003e\u003cth\u003e74.1\u003c/th\u003e\u003cth\u003e65.8\u003c/th\u003e\u003cth\u003e63.2\u003c/th\u003e\u003cth\u003e38.8\u003c/th\u003e\u003cth\u003e64.8\u003c/th\u003e\u003cth\u003e97.2\u003c/th\u003e\n    \u003c/tr\u003e\n\u003c/table\u003e\n\n**Segmentation**\n\n\u003ctable border=\"1\" width=\"90%\"\u003e\n\t\u003ctr 
align=\"center\"\u003e\n        \u003cth colspan=\"3\"\u003eSemantic Segmentation\u003c/th\u003e\u003cth colspan=\"1\"\u003eStreet Segmentation\u003c/th\u003e\u003cth colspan=\"1\"\u003eRGBD Segmentation\u003c/th\u003e\n    \u003c/tr\u003e\n    \u003ctr align=\"center\"\u003e\n        \u003cth\u003eADE20K\u003c/th\u003e\u003cth\u003eCOCO Stuff-10K\u003c/th\u003e\u003cth\u003ePascal Context\u003c/th\u003e\u003cth\u003eCityScapes\u003c/th\u003e\u003cth\u003eNYU Depth V2\u003c/th\u003e\n    \u003c/tr\u003e\n    \u003ctr align=\"center\"\u003e\n        \u003cth\u003e62.9\u003c/th\u003e\u003cth\u003e59.6\u003c/th\u003e\u003cth\u003e70.3\u003c/th\u003e\u003cth\u003e87.0\u003c/th\u003e\u003cth\u003e68.1\u003c/th\u003e\n    \u003c/tr\u003e\n\u003c/table\u003e\n\n## Released Models\n\n\u003cdetails open\u003e\n\u003csummary\u003e Open-Source Visual Pretrained Models \u003c/summary\u003e\n\u003cbr\u003e\n\u003cdiv\u003e\n\n|      name      |       pretrain       | resolution | #param |                                                                                  download                                                                                   |\n| :------------: | :------------------: | :--------: | :----: | :-------------------------------------------------------------------------------------------------------------------------------------------------------------------------: |\n| InternImage-L  |        IN-22K        |  384x384   |  223M  |     [pth](https://huggingface.co/OpenGVLab/InternImage/resolve/main/internimage_l_22k_192to384.pth)    \\| [hf](https://huggingface.co/OpenGVLab/internimage_l_22k_384)      |\n| InternImage-XL |        IN-22K        |  384x384   |  335M  |     [pth](https://huggingface.co/OpenGVLab/InternImage/resolve/main/internimage_xl_22k_192to384.pth)   \\| [hf](https://huggingface.co/OpenGVLab/internimage_xl_22k_384)     |\n| InternImage-H  | Joint 427M -\u003e IN-22K |  384x384   | 1.08B  | [pth](https://huggingface.co/OpenGVLab/InternImage/resolve/main/internimage_h_jointto22k_384.pth)   \\| [hf](https://huggingface.co/OpenGVLab/internimage_h_jointto22k_384)  |\n| InternImage-G  | Joint 427M -\u003e IN-22K |  384x384   |   3B   | [pth](https://huggingface.co/OpenGVLab/InternImage/resolve/main/internimage_g_pretrainto22k_384.pth) \\| [hf](https://huggingface.co/OpenGVLab/internimage_g_jointto22k_384) |\n\n\u003c/div\u003e\n\n\u003c/details\u003e\n\n\u003cdetails open\u003e\n\u003csummary\u003e ImageNet-1K Image Classification \u003c/summary\u003e\n\u003cbr\u003e\n\u003cdiv\u003e\n\n|      name      |       pretrain       | resolution | acc@1 | #param | FLOPs |                                                                                                                        download                                                                                                                        |\n| :------------: | :------------------: | :--------: | :---: | :----: | :---: | :----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------: |\n| InternImage-T  |        IN-1K         |  224x224   | 83.5  |  30M   |  5G   |          [pth](https://huggingface.co/OpenGVLab/InternImage/resolve/main/internimage_t_1k_224.pth) \\| [hf](https://huggingface.co/OpenGVLab/internimage_t_1k_224) \\| [cfg](classification/configs/without_lr_decay/internimage_t_1k_224.yaml)          |\n| 
InternImage-S  |        IN-1K         |  224x224   | 84.2  |  50M   |  8G   |          [pth](https://huggingface.co/OpenGVLab/InternImage/resolve/main/internimage_s_1k_224.pth) \\| [hf](https://huggingface.co/OpenGVLab/internimage_s_1k_224) \\| [cfg](classification/configs/without_lr_decay/internimage_s_1k_224.yaml)          |\n| InternImage-B  |        IN-1K         |  224x224   | 84.9  |  97M   |  16G  |          [pth](https://huggingface.co/OpenGVLab/InternImage/resolve/main/internimage_b_1k_224.pth) \\| [hf](https://huggingface.co/OpenGVLab/internimage_b_1k_224) \\| [cfg](classification/configs/without_lr_decay/internimage_b_1k_224.yaml)          |\n| InternImage-L  |        IN-22K        |  384x384   | 87.7  |  223M  | 108G  |  [pth](https://huggingface.co/OpenGVLab/InternImage/resolve/main/internimage_l_22kto1k_384.pth) \\| [hf](https://huggingface.co/OpenGVLab/internimage_l_22kto1k_384) \\| [cfg](classification/configs/without_lr_decay/internimage_l_22kto1k_384.yaml)   |\n| InternImage-XL |        IN-22K        |  384x384   | 88.0  |  335M  | 163G  | [pth](https://huggingface.co/OpenGVLab/InternImage/resolve/main/internimage_xl_22kto1k_384.pth) \\| [hf](https://huggingface.co/OpenGVLab/internimage_xl_22kto1k_384) \\| [cfg](classification/configs/without_lr_decay/internimage_xl_22kto1k_384.yaml) |\n| InternImage-H  | Joint 427M -\u003e IN-22K |  640x640   | 89.6  | 1.08B  | 1478G |  [pth](https://huggingface.co/OpenGVLab/InternImage/resolve/main/internimage_h_22kto1k_640.pth) \\| [hf](https://huggingface.co/OpenGVLab/internimage_h_22kto1k_640) \\| [cfg](classification/configs/without_lr_decay/internimage_h_22kto1k_640.yaml)   |\n| InternImage-G  | Joint 427M -\u003e IN-22K |  512x512   | 90.1  |   3B   | 2700G |  [pth](https://huggingface.co/OpenGVLab/InternImage/resolve/main/internimage_g_22kto1k_512.pth) \\| [hf](https://huggingface.co/OpenGVLab/internimage_g_22kto1k_512) \\| [cfg](classification/configs/without_lr_decay/internimage_g_22kto1k_512.yaml)   |\n\n\u003c/div\u003e\n\n\u003c/details\u003e\n\n\u003cdetails open\u003e\n\u003csummary\u003e COCO Object Detection and Instance Segmentation \u003c/summary\u003e\n\u003cbr\u003e\n\u003cdiv\u003e\n\n|    backbone    |   method   | schd | box mAP | mask mAP | #param | FLOPs |                                                                                     download                                                                                      |\n| :------------: | :--------: | :--: | :-----: | :------: | :----: | :---: | :-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------: |\n| InternImage-T  | Mask R-CNN |  1x  |  47.2   |   42.5   |  49M   | 270G  | [ckpt](https://huggingface.co/OpenGVLab/InternImage/resolve/main/mask_rcnn_internimage_t_fpn_1x_coco.pth) \\| [cfg](detection/configs/coco/mask_rcnn_internimage_t_fpn_1x_coco.py) |\n| InternImage-T  | Mask R-CNN |  3x  |  49.1   |   43.7   |  49M   | 270G  | [ckpt](https://huggingface.co/OpenGVLab/InternImage/resolve/main/mask_rcnn_internimage_t_fpn_3x_coco.pth) \\| [cfg](detection/configs/coco/mask_rcnn_internimage_t_fpn_3x_coco.py) |\n| InternImage-S  | Mask R-CNN |  1x  |  47.8   |   43.3   |  69M   | 340G  | [ckpt](https://huggingface.co/OpenGVLab/InternImage/resolve/main/mask_rcnn_internimage_s_fpn_1x_coco.pth) \\| [cfg](detection/configs/coco/mask_rcnn_internimage_s_fpn_1x_coco.py) |\n| InternImage-S  | Mask R-CNN |  3x  |  49.7   |   44.5   |  69M 
  | 340G  | [ckpt](https://huggingface.co/OpenGVLab/InternImage/resolve/main/mask_rcnn_internimage_s_fpn_3x_coco.pth) \\| [cfg](detection/configs/coco/mask_rcnn_internimage_s_fpn_3x_coco.py) |\n| InternImage-B  | Mask R-CNN |  1x  |  48.8   |   44.0   |  115M  | 501G  | [ckpt](https://huggingface.co/OpenGVLab/InternImage/resolve/main/mask_rcnn_internimage_b_fpn_1x_coco.pth) \\| [cfg](detection/configs/coco/mask_rcnn_internimage_b_fpn_1x_coco.py) |\n| InternImage-B  | Mask R-CNN |  3x  |  50.3   |   44.8   |  115M  | 501G  | [ckpt](https://huggingface.co/OpenGVLab/InternImage/resolve/main/mask_rcnn_internimage_b_fpn_3x_coco.pth) \\| [cfg](detection/configs/coco/mask_rcnn_internimage_b_fpn_3x_coco.py) |\n| InternImage-L  |  Cascade   |  1x  |  54.9   |   47.7   |  277M  | 1399G |   [ckpt](https://huggingface.co/OpenGVLab/InternImage/resolve/main/cascade_internimage_l_fpn_1x_coco.pth) \\| [cfg](detection/configs/coco/cascade_internimage_l_fpn_1x_coco.py)   |\n| InternImage-L  |  Cascade   |  3x  |  56.1   |   48.5   |  277M  | 1399G |   [ckpt](https://huggingface.co/OpenGVLab/InternImage/resolve/main/cascade_internimage_l_fpn_3x_coco.pth) \\| [cfg](detection/configs/coco/cascade_internimage_l_fpn_3x_coco.py)   |\n| InternImage-XL |  Cascade   |  1x  |  55.3   |   48.1   |  387M  | 1782G |  [ckpt](https://huggingface.co/OpenGVLab/InternImage/resolve/main/cascade_internimage_xl_fpn_1x_coco.pth) \\| [cfg](detection/configs/coco/cascade_internimage_xl_fpn_1x_coco.py)  |\n| InternImage-XL |  Cascade   |  3x  |  56.2   |   48.8   |  387M  | 1782G |  [ckpt](https://huggingface.co/OpenGVLab/InternImage/resolve/main/cascade_internimage_xl_fpn_3x_coco.pth) \\| [cfg](detection/configs/coco/cascade_internimage_xl_fpn_3x_coco.py)  |\n\n|     backbone     |   method   | box mAP (val/test) | #param |                                                                                                                         download                                                                                                                          |\n| :--------------: | :--------: | :----------------: | :----: | :-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------: |\n| CB-InternImage-H | DINO (TTA) |    65.0 / 65.4     | 2.18B  | [ckpt](https://huggingface.co/OpenGVLab/InternImage/resolve/main/dino_4scale_cbinternimage_h_objects365_coco.pth) \\| [cfg](https://github.com/OpenGVLab/InternImage/blob/master/detection/configs/coco/dino_4scale_cbinternimage_h_objects365_coco_ss.py) |\n| CB-InternImage-G | DINO (TTA) |    65.3 / 65.5     |   6B   |                                                                                                                           TODO                                                                                                                            |\n\n\u003c/div\u003e\n\n\u003c/details\u003e\n\n\u003cdetails open\u003e\n\u003csummary\u003e ADE20K Semantic Segmentation \u003c/summary\u003e\n\u003cbr\u003e\n\u003cdiv\u003e\n\n|    backbone    |   method    | resolution | mIoU (ss/ms) | #param | FLOPs |                                                                                                        download                                                                                                         |\n| :------------: | :---------: | :--------: | :----------: | 
:----: | :---: | :---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------: |\n| InternImage-T  |   UperNet   |  512x512   | 47.9 / 48.1  |  59M   | 944G  |               [ckpt](https://huggingface.co/OpenGVLab/InternImage/resolve/main/upernet_internimage_t_512_160k_ade20k.pth) \\| [cfg](segmentation/configs/ade20k/upernet_internimage_t_512_160k_ade20k.py)                |\n| InternImage-S  |   UperNet   |  512x512   | 50.1 / 50.9  |  80M   | 1017G |               [ckpt](https://huggingface.co/OpenGVLab/InternImage/resolve/main/upernet_internimage_s_512_160k_ade20k.pth) \\| [cfg](segmentation/configs/ade20k/upernet_internimage_s_512_160k_ade20k.py)                |\n| InternImage-B  |   UperNet   |  512x512   | 50.8 / 51.3  |  128M  | 1185G |               [ckpt](https://huggingface.co/OpenGVLab/InternImage/resolve/main/upernet_internimage_b_512_160k_ade20k.pth) \\| [cfg](segmentation/configs/ade20k/upernet_internimage_b_512_160k_ade20k.py)                |\n| InternImage-L  |   UperNet   |  640x640   | 53.9 / 54.1  |  256M  | 2526G |               [ckpt](https://huggingface.co/OpenGVLab/InternImage/resolve/main/upernet_internimage_l_640_160k_ade20k.pth) \\| [cfg](segmentation/configs/ade20k/upernet_internimage_l_640_160k_ade20k.py)                |\n| InternImage-XL |   UperNet   |  640x640   | 55.0 / 55.3  |  368M  | 3142G |              [ckpt](https://huggingface.co/OpenGVLab/InternImage/resolve/main/upernet_internimage_xl_640_160k_ade20k.pth) \\| [cfg](segmentation/configs/ade20k/upernet_internimage_xl_640_160k_ade20k.py)               |\n| InternImage-H  |   UperNet   |  896x896   | 59.9 / 60.3  | 1.12B  | 3566G |               [ckpt](https://huggingface.co/OpenGVLab/InternImage/resolve/main/upernet_internimage_h_896_160k_ade20k.pth) \\| [cfg](segmentation/configs/ade20k/upernet_internimage_h_896_160k_ade20k.py)                |\n| InternImage-H  | Mask2Former |  896x896   | 62.5 / 62.9  | 1.31B  | 4635G | [ckpt](https://huggingface.co/OpenGVLab/InternImage/resolve/main/mask2former_internimage_h_896_80k_cocostuff2ade20k.pth) \\| [cfg](segmentation/configs/ade20k/mask2former_internimage_h_896_80k_cocostuff2ade20k_ss.py) |\n\n\u003c/div\u003e\n\n\u003c/details\u003e\n\n\u003cdetails\u003e\n\u003csummary\u003e Main Results of FPS \u003c/summary\u003e\n\u003cbr\u003e\n\u003cdiv\u003e\n\n[Export classification model from pytorch to tensorrt](classification/README.md#export)\n\n[Export detection model from pytorch to tensorrt](detection/README.md#export)\n\n[Export segmentation model from pytorch to tensorrt](segmentation/README.md#export)\n\n|      name      | resolution | #param | FLOPs | batch 1 FPS (TensorRT) |\n| :------------: | :--------: | :----: | :---: | :--------------------: |\n| InternImage-T  |  224x224   |  30M   |  5G   |          156           |\n| InternImage-S  |  224x224   |  50M   |  8G   |          129           |\n| InternImage-B  |  224x224   |  97M   |  16G  |          116           |\n| InternImage-L  |  384x384   |  223M  | 108G  |           56           |\n| InternImage-XL |  384x384   |  335M  | 163G  |           47           |\n\nBefore using `mmdeploy` to convert our PyTorch models to TensorRT, please make sure you have the DCNv3 custom operator built correctly. 
You can build it with the following command:\n\n```shell\nexport MMDEPLOY_DIR=/the/root/path/of/MMDeploy\n\n# prepare our custom ops; they can be found at InternImage/tensorrt/modulated_deform_conv_v3\ncp -r modulated_deform_conv_v3 ${MMDEPLOY_DIR}/csrc/mmdeploy/backend_ops/tensorrt\n\n# build custom ops\ncd ${MMDEPLOY_DIR}\nmkdir -p build \u0026\u0026 cd build\ncmake -DCMAKE_CXX_COMPILER=g++-7 -DMMDEPLOY_TARGET_BACKENDS=trt -DTENSORRT_DIR=${TENSORRT_DIR} -DCUDNN_DIR=${CUDNN_DIR} ..\nmake -j$(nproc) \u0026\u0026 make install\n\n# install mmdeploy after building the custom ops\ncd ${MMDEPLOY_DIR}\npip install -e .\n```\n\nFor more details on building custom ops, please refer to [this document](https://github.com/open-mmlab/mmdeploy/blob/master/docs/en/01-how-to-build/linux-x86_64.md).\n\n\u003c/div\u003e\n\n\u003c/details\u003e\n\n## Related Projects\n\n### Foundation Models\n\n- [Uni-Perceiver](https://github.com/fundamentalvision/Uni-Perceiver): A unified pre-training architecture for generic perception, covering zero-shot and few-shot tasks\n- [Uni-Perceiver v2](https://arxiv.org/abs/2211.09808): A generalist model for large-scale vision and vision-language tasks\n- [M3I-Pretraining](https://github.com/OpenGVLab/M3I-Pretraining): One-stage pre-training paradigm via maximizing multi-modal mutual information\n- [InternVL](https://github.com/OpenGVLab/InternVL): A leading multimodal large language model excelling in tasks such as OCR, multimodal reasoning, and dialogue\n\n### Autonomous Driving\n\n- [BEVFormer](https://github.com/fundamentalvision/BEVFormer): A cutting-edge baseline for camera-based 3D detection\n- [BEVFormer v2](https://arxiv.org/abs/2211.10439): Adapting modern image backbones to Bird's-Eye-View recognition via perspective supervision\n\n## Application in Challenges\n\n- [2022 Waymo 3D Camera-Only Detection Challenge](https://waymo.com/open/challenges/2022/3d-camera-only-detection/): BEVFormer++, built on InternImage, ranks 1st\n- [nuScenes 3D detection](https://www.nuscenes.org/object-detection?externalData=all\u0026mapData=all\u0026modalities=Camera): BEVFormer v2 achieves SOTA performance of 64.8 NDS on nuScenes Camera Only\n- [CVPR 2023 Workshop End-to-End Autonomous Driving](https://opendrivelab.com/e2ead/cvpr23): InternImage supports the baselines of the [3D Occupancy Prediction Challenge](https://opendrivelab.com/AD23Challenge.html#Track3) and the [OpenLane Topology Challenge](https://opendrivelab.com/AD23Challenge.html#Track1)\n\n## Citation\n\nIf this work is helpful for your research, please consider citing the following BibTeX entry.\n\n```bibtex\n@inproceedings{wang2023internimage,\n  title={Internimage: Exploring large-scale vision foundation models with deformable convolutions},\n  author={Wang, Wenhai and Dai, Jifeng and Chen, Zhe and Huang, Zhenhang and Li, Zhiqi and Zhu, Xizhou and Hu, Xiaowei and Lu, Tong and Lu, Lewei and Li, Hongsheng and others},\n  booktitle={Proceedings of the IEEE/CVF conference on computer vision and pattern recognition},\n  pages={14408--14419},\n  year={2023}\n}\n```\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2FOpenGVLab%2FInternImage","html_url":"https://awesome.ecosyste.ms/projects/github.com%2FOpenGVLab%2FInternImage","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2FOpenGVLab%2FInternImage/lists"}
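Following on from the custom-op build above: once the DCNv3 TensorRT plugin is compiled and `mmdeploy` is installed, conversion itself goes through `mmdeploy`'s generic `tools/deploy.py` entry point. The per-task export guides linked in the README ([classification](classification/README.md#export), [detection](detection/README.md#export), [segmentation](segmentation/README.md#export)) give the authors' exact commands; the sketch below is only an illustrative example, and the deploy config name, checkpoint path, and input image are assumptions that vary with the `mmdeploy` version.

```shell
# Illustrative sketch (not the repository's official command): convert an
# InternImage-T Mask R-CNN checkpoint to a TensorRT engine with mmdeploy.
# The deploy config name and local paths are assumptions; use the
# version-matched names from detection/README.md#export.
cd ${MMDEPLOY_DIR}
python tools/deploy.py \
    configs/mmdet/instance-seg/instance-seg_tensorrt_dynamic-320x320-1344x1344.py \
    ${INTERNIMAGE_DIR}/detection/configs/coco/mask_rcnn_internimage_t_fpn_1x_coco.py \
    ${CHECKPOINT_DIR}/mask_rcnn_internimage_t_fpn_1x_coco.pth \
    ${INPUT_IMG} \
    --work-dir work_dirs/trt_internimage_t \
    --device cuda:0 \
    --dump-info
```

With `--dump-info`, the TensorRT engine and the SDK metadata files are written to the chosen `--work-dir`.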
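On the modeling side, the Introduction describes DCNv3 as giving each position a dynamic receptive field with adaptive spatial aggregation. The real operator ships as a CUDA extension (and as pre-built `.whl` files, see the History section); purely as a conceptual toy, and not the repository's DCNv3 implementation, the following plain-PyTorch sketch makes that phrase concrete: every output location predicts its own sampling offsets and softmax-normalized weights, then aggregates features gathered at those shifted positions.

```python
# Conceptual toy of offset-based "adaptive spatial aggregation" (NOT the
# repository's DCNv3 CUDA operator; names and design here are illustrative).
import torch
import torch.nn as nn
import torch.nn.functional as F


class ToyDeformableAggregation(nn.Module):
    def __init__(self, channels: int, num_points: int = 9):
        super().__init__()
        self.num_points = num_points
        # Each location predicts (dx, dy) offsets and a weight per sampling point.
        self.offset_head = nn.Conv2d(channels, 2 * num_points, 3, padding=1)
        self.weight_head = nn.Conv2d(channels, num_points, 3, padding=1)
        self.proj = nn.Conv2d(channels, channels, 1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        _, _, h, w = x.shape
        offsets = self.offset_head(x)                 # (B, 2K, H, W), normalized coords
        weights = self.weight_head(x).softmax(dim=1)  # (B, K, H, W), sums to 1 over K
        # Base sampling grid in grid_sample's normalized [-1, 1] (x, y) convention.
        ys, xs = torch.meshgrid(
            torch.linspace(-1, 1, h, device=x.device),
            torch.linspace(-1, 1, w, device=x.device),
            indexing="ij",
        )
        base = torch.stack((xs, ys), dim=-1)          # (H, W, 2)
        out = torch.zeros_like(x)
        for k in range(self.num_points):
            off = offsets[:, 2 * k : 2 * k + 2].permute(0, 2, 3, 1)  # (B, H, W, 2)
            grid = base.unsqueeze(0) + off            # per-location dynamic sampling grid
            sampled = F.grid_sample(x, grid, align_corners=True)
            out = out + weights[:, k : k + 1] * sampled
        return self.proj(out)


if __name__ == "__main__":
    feat = torch.randn(2, 64, 32, 32)
    print(ToyDeformableAggregation(64)(feat).shape)   # torch.Size([2, 64, 32, 32])
```

The paper's DCNv3 additionally shares projection weights across sampling points, uses a multi-group design, and normalizes modulation scalars with softmax, all fused in CUDA for speed; the toy above is only meant to illustrate the idea behind the operator (or its DCNv4 successor noted in the News).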