{"id":13678907,"url":"https://github.com/ashkamath/mdetr","last_synced_at":"2025-04-29T15:33:47.892Z","repository":{"id":37441773,"uuid":"360631605","full_name":"ashkamath/mdetr","owner":"ashkamath","description":null,"archived":false,"fork":false,"pushed_at":"2022-10-03T19:35:46.000Z","size":9781,"stargazers_count":977,"open_issues_count":31,"forks_count":128,"subscribers_count":19,"default_branch":"main","last_synced_at":"2024-11-11T21:38:04.213Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":null,"language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/ashkamath.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2021-04-22T17:39:28.000Z","updated_at":"2024-11-04T13:22:17.000Z","dependencies_parsed_at":"2022-07-12T13:01:29.010Z","dependency_job_id":null,"html_url":"https://github.com/ashkamath/mdetr","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ashkamath%2Fmdetr","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ashkamath%2Fmdetr/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ashkamath%2Fmdetr/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ashkamath%2Fmdetr/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/ashkamath","download_url":"https://codeload.github.com/ashkamath/mdetr/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":251529621,"owners_count":21603989,"icon_url":"https://github.
com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2024-08-02T13:00:59.722Z","updated_at":"2025-04-29T15:33:47.885Z","avatar_url":"https://github.com/ashkamath.png","language":"Python","funding_links":[],"categories":["Python","其他_机器视觉"],"sub_categories":["网络服务_其他"],"readme":"**MDETR**: Modulated Detection for End-to-End Multi-Modal Understanding\n========\n\n[Website](https://ashkamath.github.io/mdetr_page/) • [Colab](https://colab.research.google.com/drive/11xz5IhwqAqHj9-XAIP17yVIuJsLqeYYJ?usp=sharing) • [Paper](https://arxiv.org/abs/2104.12763)\n\n\nThis repository contains code and links to pre-trained models for MDETR (Modulated DETR) for pre-training on data having aligned text and images with box annotations, as well as fine-tuning on tasks requiring fine grained understanding of image and text. \n\nWe show big gains on the phrase grounding task (Flickr30k), Referring Expression Comprehension (RefCOCO, RefCOCO+ and RefCOCOg) as well as Referring Expression Segmentation (PhraseCut, CLEVR Ref+). We also achieve competitive performance on visual question answering (GQA, CLEVR).\n\n\n![MDETR](.github/mdetr.png)\n\n**TL;DR**. We depart from the fixed frozen object detector approach of several popular vision + language pre-trained models and achieve true end-to-end multi-modal understanding by training our detector in the loop. In addition, we *only* detect objects that are relevant to the given text query, where the class labels for the objects are just the relevant words in the text query. 
This allows us to expand our vocabulary to anything found in free form text, making it possible to detect and reason over novel combinations of object classes and attributes.\n                \n\nFor details, please see the paper: [MDETR - Modulated Detection for End-to-End Multi-Modal Understanding](https://arxiv.org/abs/2104.12763) by Aishwarya Kamath, Mannat Singh, Yann LeCun, Ishan Misra, Gabriel Synnaeve and Nicolas Carion.\n\nAishwarya Kamath and Nicolas Carion made equal contributions to this codebase. \n\n# Usage\nThe requirements file has all the dependencies that are needed by MDETR. \n\nWe provide instructions on how to install dependencies via conda.\nFirst, clone the repository locally:\n```\ngit clone https://github.com/ashkamath/mdetr.git\n```\n\nMake a new conda env and activate it: \n```\nconda create -n mdetr_env python=3.8\nconda activate mdetr_env\n```\n\nInstall the packages in the requirements.txt:\n```\npip install -r requirements.txt\n```\n\nMultinode training\n\nDistributed training is available via Slurm and [submitit](https://github.com/facebookincubator/submitit):\n```\npip install submitit\n```\n\n\n# Pre-training\n\nThe links to data, steps for data preparation and script for running finetuning can be found in [Pretraining Instructions](.github/pretrain.md)\nWe also provide the pre-trained model weights for MDETR trained on our combined aligned dataset of 1.3 million images paired with text. \n\nThe models are summarized in the following table. Note that the performance reported is \"raw\", without any fine-tuning. For each dataset, we report the class-agnostic box AP@50, which measures how well the model finds the boxes mentioned in the text. 
All performances are reported on the respective validation sets of each dataset.\n\u003ctable\u003e\n\u003cthead\u003e\n  \u003ctr\u003e\n    \u003cth rowspan=\"2\"\u003e\u003c/th\u003e\n    \u003cth rowspan=\"2\"\u003eBackbone\u003c/th\u003e\n    \u003cth\u003eGQA\u003c/th\u003e\n    \u003cth colspan=\"2\"\u003eFlickr\u003c/th\u003e\n    \u003cth colspan=\"4\"\u003eRefcoco\u003c/th\u003e\n    \u003cth rowspan=\"2\"\u003e Url\u003cbr\u003e\u003c/th\u003e\n    \u003cth rowspan=\"2\"\u003eSize\u003cbr\u003e\u003c/th\u003e\n  \u003c/tr\u003e\n  \u003ctr\u003e\n    \u003ctd\u003eAP\u003c/td\u003e\n    \u003ctd\u003eAP\u003c/td\u003e\n    \u003ctd\u003eR@1\u003c/td\u003e\n    \u003ctd\u003eAP\u003c/td\u003e\n    \u003ctd\u003eRefcoco R@1\u003c/td\u003e\n    \u003ctd\u003eRefcoco+ R@1\u003c/td\u003e\n    \u003ctd\u003eRefcocog R@1\u003c/td\u003e\n  \u003c/tr\u003e\n\u003c/thead\u003e\n\u003ctbody\u003e\n  \u003ctr\u003e\n    \u003ctd\u003e1\u003c/td\u003e\n    \u003ctd\u003eR101\u003c/td\u003e\n    \u003ctd\u003e58.9\u003c/td\u003e\n    \u003ctd\u003e75.6\u003c/td\u003e\n    \u003ctd\u003e82.5\u003c/td\u003e\n    \u003ctd\u003e60.3\u003c/td\u003e\n    \u003ctd\u003e72.1\u003c/td\u003e\n    \u003ctd\u003e58.0\u003c/td\u003e\n    \u003ctd\u003e55.7\u003c/td\u003e\n    \u003ctd\u003e\u003ca href=\"https://zenodo.org/record/4721981/files/pretrained_resnet101_checkpoint.pth?download=1\"\u003e model\u003c/a\u003e\u003c/td\u003e\n    \u003ctd\u003e3GB\u003c/td\u003e\n  \u003c/tr\u003e\n  \u003ctr\u003e\n    \u003ctd\u003e2\u003c/td\u003e\n    \u003ctd\u003eENB3\u003c/td\u003e\n    \u003ctd\u003e59.5\u003c/td\u003e\n    \u003ctd\u003e76.6\u003c/td\u003e\n    \u003ctd\u003e82.9\u003c/td\u003e\n    \u003ctd\u003e57.6\u003c/td\u003e\n    \u003ctd\u003e70.2\u003c/td\u003e\n    \u003ctd\u003e56.7\u003c/td\u003e\n    \u003ctd\u003e53.8\u003c/td\u003e\n    \u003ctd\u003e\u003ca 
href=\"https://zenodo.org/record/4721981/files/pretrained_EB3_checkpoint.pth?download=1\"\u003emodel\u003c/a\u003e\u003c/td\u003e\n    \u003ctd\u003e2.4GB\u003c/td\u003e\n  \u003c/tr\u003e\n  \u003ctr\u003e\n    \u003ctd\u003e3\u003c/td\u003e\n    \u003ctd\u003eENB5\u003c/td\u003e\n    \u003ctd\u003e59.9\u003c/td\u003e\n    \u003ctd\u003e76.4\u003c/td\u003e\n    \u003ctd\u003e83.7\u003c/td\u003e\n    \u003ctd\u003e61.8\u003c/td\u003e\n    \u003ctd\u003e73.4\u003c/td\u003e\n    \u003ctd\u003e58.8\u003c/td\u003e\n    \u003ctd\u003e57.1\u003c/td\u003e\n    \u003ctd\u003e\u003ca href=\"https://zenodo.org/record/4721981/files/pretrained_EB5_checkpoint.pth?download=1\"\u003emodel\u003c/a\u003e\u003c/td\u003e\n    \u003ctd\u003e2.7GB\u003c/td\u003e\n  \u003c/tr\u003e\n\u003c/tbody\u003e\n\u003c/table\u003e\n\n\n# Downstream tasks\n\n## Phrase grounding on Flickr30k\nInstructions for data preparation and script to run evaluation can be found at [Flickr30k Instructions](.github/flickr.md)\n\n### AnyBox protocol\n| Backbone | Pre-training Image Data | Val R@1 | Val R@5 | Val R@10 | Test R@1 | Test  R@5 | Test  R@10 | url | size |\n|----------|---------|---------|-----------|----------|-----------|-----------|-----|------|---|\n| Resnet-101| COCO+VG+Flickr | 82.5   |  92.9   |   94.9  |   83.4  |   93.5  |   95.3    | [model](https://zenodo.org/record/4721981/files/pretrained_resnet101_checkpoint.pth?download=1)    | 3GB      | \n| EfficientNet-B3| COCO+VG+Flickr | 82.9   | 93.2    | 95.2    |  84.0  | 93.8    |  95.6    | [model](https://zenodo.org/record/4721981/files/pretrained_EB3_checkpoint.pth?download=1)    |  2.4GB     |\n| EfficientNet-B5| COCO+VG+Flickr |83.6   | 93.4    | 95.1   |  84.3   | 93.9    |  95.8     | [model](https://zenodo.org/record/4721981/files/pretrained_EB5_checkpoint.pth?download=1)    |  2.7GB     |\n\n ### MergedBox protocol\n | Backbone | Pre-training Image Data | Val R@1 | Val R@5 | Val R@10 | Test R@1 | Test  R@5 | Test  R@10 | url | size 
|\n|----------|---------|---------|-----------|----------|-----------|-----------|-----|------|---|\n| Resnet-101| COCO+VG+Flickr | 82.3   |  91.8   |   93.7  |   83.8  |   92.7  |   94.4    | [model](https://zenodo.org/record/4721981/files/flickr_merged_resnet101_checkpoint.pth?download=1)    |  3GB     | \n\n\n## Referring expression comprehension on RefCOCO, RefCOCO+, RefCOCOg\nInstructions for data preparation and script to run finetuning and evaluation can be found at [Referring Expression Instructions](.github/refexp.md)\n\n\n### RefCOCO \n\n| Backbone | Pre-training Image Data | Val | TestA  | TestB | url | size |\n|----------|---------|---------|-----------|----------|-----------|-----------|\n| Resnet-101| COCO+VG+Flickr | 86.75   |  89.58   |   81.41  | [model](https://zenodo.org/record/4721981/files/refcoco_resnet101_checkpoint.pth?download=1)   |  3GB   |   \n| EfficientNet-B3| COCO+VG+Flickr |  87.51  | 90.40  | 82.67 | [model](https://zenodo.org/record/4721981/files/refcoco_EB3_checkpoint.pth?download=1)  |  2.4GB   | \n\n### RefCOCO+\n\n| Backbone | Pre-training Image Data | Val | TestA  | TestB | url | size |\n|----------|---------|---------|-----------|----------|-----------|-----------|\n| Resnet-101| COCO+VG+Flickr | 79.52   |  84.09  |   70.62  | [model](https://zenodo.org/record/4721981/files/refcoco%2B_resnet101_checkpoint.pth?download=1)   |  3GB  |   \n| EfficientNet-B3| COCO+VG+Flickr |  81.13  | 85.52  | 72.96 | [model](https://zenodo.org/record/4721981/files/refcoco%2B_EB3_checkpoint.pth?download=1)   | 2.4GB   | \n\n### RefCOCOg\n\n| Backbone | Pre-training Image Data | Val | Test  |  url | size |\n|----------|---------|---------|-----------|----------|-----------|\n| Resnet-101| COCO+VG+Flickr | 81.64 | 80.89    | [model](https://zenodo.org/record/4721981/files/refcocog_resnet101_checkpoint.pth?download=1)   |   3GB  |   \n| EfficientNet-B3| COCO+VG+Flickr |  83.35  | 83.31  | 
[model](https://zenodo.org/record/4721981/files/refcocog_EB3_checkpoint.pth?download=1)  | 2.4GB   | \n\n\n## Referring expression segmentation on PhraseCut\nInstructions for data preparation and script to run finetuning and evaluation can be found at [PhraseCut Instructions](.github/phrasecut.md)\n\n| Backbone | M-IoU | Precision @0.5 | Precision @0.7 | Precision @0.9  |  url | size |\n|----------|---------|---------|-----------|----------|-----------|-----------|\n| Resnet-101| 53.1 | 56.1 | 38.9    | 11.9   | [model](https://zenodo.org/record/4721981/files/phrasecut_resnet101_checkpoint.pth?download=1)   |  1.5GB    |   \n| EfficientNet-B3| 53.7| 57.5|  39.9  | 11.9 | [model](https://zenodo.org/record/4721981/files/phrasecut_EB3_checkpoint.pth?download=1)   | 1.2GB  | \n\n\n## Visual question answering on GQA\nInstructions for data preparation and scripts to run finetuning and evaluation can be found at [GQA Instructions](.github/gqa.md)\n\n\n| Backbone | Test-dev | Test-std  |  url | size |\n|----------|---------|---------|-----------|----------|\n| Resnet-101| 62.48 | 61.99 | [model](https://zenodo.org/record/4721981/files/gqa_resnet101_checkpoint.pth?download=1)    | 3GB  | \n| EfficientNet-B5| 62.95 | 62.45 | [model](https://zenodo.org/record/4721981/files/gqa_EB5_checkpoint.pth?download=1)   | 2.7GB | \n\n## Long-tailed few-shot object detection\nInstructions for data preparation and scripts to run finetuning and evaluation can be found at [LVIS Instructions](.github/lvis.md)\n\n\n| Data | AP | AP 50 |  AP r | APc | AP f | url | size\n|----------|---------|---------|-----------|----------|---------|---------|---------|\n| 1%| 16.7 | 25.8 | 11.2  | 14.6  | 19.5  | [model](https://zenodo.org/record/4721981/files/lvis1_checkpoint.pth?download=1) | 3GB\n| 10%| 24.2 | 38.0 | 20.9   | 24.9 | 24.3 | [model](https://zenodo.org/record/4721981/files/lvis10_checkpoint.pth?download=1) | 3GB\n| 100%| 22.5 | 35.2 | 7.4 |22.7 | 25.0 | 
[model](https://zenodo.org/record/4721981/files/lvis100_checkpoint.pth?download=1) | 3GB\n\n## Synthetic datasets\nInstructions to reproduce our results on CLEVR-based datasets are available at [CLEVR instructions](.github/clevr.md)\n\n\u003ctable\u003e\n\u003cthead\u003e\n  \u003ctr\u003e\n    \u003cth\u003eOverall Accuracy\u003c/th\u003e\n    \u003cth\u003eCount\u003c/th\u003e\n    \u003cth\u003eExist\u003cbr\u003e\u003c/th\u003e\n    \u003cth\u003eCompare Number\u003c/th\u003e\n    \u003cth\u003eQuery Attribute\u003c/th\u003e\n    \u003cth\u003eCompare Attribute\u003c/th\u003e\n    \u003cth\u003eUrl\u003c/th\u003e\n    \u003cth\u003eSize\u003c/th\u003e\n  \u003c/tr\u003e\n\u003c/thead\u003e\n\u003ctbody\u003e\n  \u003ctr\u003e\n    \u003ctd\u003e99.7\u003c/td\u003e\n    \u003ctd\u003e99.3\u003c/td\u003e\n    \u003ctd\u003e99.9\u003c/td\u003e\n    \u003ctd\u003e99.4\u003c/td\u003e\n    \u003ctd\u003e99.9\u003c/td\u003e\n    \u003ctd\u003e99.9\u003c/td\u003e\n    \u003ctd\u003e\u003ca href=\"https://zenodo.org/record/4721981/files/clevr_checkpoint.pth?download=1\"\u003e model\u003c/a\u003e\u003c/td\u003e\n    \u003ctd\u003e446MB\u003c/td\u003e\n  \u003c/tr\u003e\n\u003c/tbody\u003e\n\u003c/table\u003e\n\n# License\nMDETR is released under the Apache 2.0 license. Please see the [LICENSE](LICENSE) file for more information.\n\n# Citation \nIf you find this repository useful please give it a star and cite as follows! 
:) :\n```\n    @article{kamath2021mdetr,\n      title={MDETR--Modulated Detection for End-to-End Multi-Modal Understanding},\n      author={Kamath, Aishwarya and Singh, Mannat and LeCun, Yann and Misra, Ishan and Synnaeve, Gabriel and Carion, Nicolas},\n      journal={arXiv preprint arXiv:2104.12763},\n      year={2021}\n    }\n```\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fashkamath%2Fmdetr","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fashkamath%2Fmdetr","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fashkamath%2Fmdetr/lists"}