Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://github.com/mmaaz60/mvits_for_class_agnostic_od
[ECCV'22] Official repository of paper titled "Class-agnostic Object Detection with Multi-modal Transformer".
class-agnostic-detection multimodal-learning object-detection open-world-detection psuedo-labels pytorch
- Host: GitHub
- URL: https://github.com/mmaaz60/mvits_for_class_agnostic_od
- Owner: mmaaz60
- License: mit
- Created: 2021-11-16T09:15:36.000Z (almost 3 years ago)
- Default Branch: main
- Last Pushed: 2023-05-09T04:19:22.000Z (over 1 year ago)
- Last Synced: 2024-10-10T18:11:04.836Z (about 1 month ago)
- Topics: class-agnostic-detection, multimodal-learning, object-detection, open-world-detection, psuedo-labels, pytorch
- Language: Python
- Homepage:
- Size: 34.1 MB
- Stars: 298
- Watchers: 8
- Forks: 24
- Open Issues: 7
Metadata Files:
- Readme: README.md
- License: LICENSE
Awesome Lists containing this project
README
# Class-agnostic Object Detection with Multi-modal Transformer (ECCV 2022)
[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/multi-modal-transformers-excel-at-class/class-agnostic-object-detection-on-pascal-voc)](https://paperswithcode.com/sota/class-agnostic-object-detection-on-pascal-voc?p=multi-modal-transformers-excel-at-class)
[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/multi-modal-transformers-excel-at-class/class-agnostic-object-detection-on-coco)](https://paperswithcode.com/sota/class-agnostic-object-detection-on-coco?p=multi-modal-transformers-excel-at-class)
[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/multi-modal-transformers-excel-at-class/class-agnostic-object-detection-on-kitti)](https://paperswithcode.com/sota/class-agnostic-object-detection-on-kitti?p=multi-modal-transformers-excel-at-class)
[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/multi-modal-transformers-excel-at-class/class-agnostic-object-detection-on-kitchen)](https://paperswithcode.com/sota/class-agnostic-object-detection-on-kitchen?p=multi-modal-transformers-excel-at-class)
[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/multi-modal-transformers-excel-at-class/class-agnostic-object-detection-on-comic2k)](https://paperswithcode.com/sota/class-agnostic-object-detection-on-comic2k?p=multi-modal-transformers-excel-at-class)
[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/multi-modal-transformers-excel-at-class/open-world-object-detection-on-pascal-voc)](https://paperswithcode.com/sota/open-world-object-detection-on-pascal-voc?p=multi-modal-transformers-excel-at-class)
[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/multi-modal-transformers-excel-at-class/open-world-object-detection-on-coco-2017)](https://paperswithcode.com/sota/open-world-object-detection-on-coco-2017?p=multi-modal-transformers-excel-at-class)
[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/multi-modal-transformers-excel-at-class/open-world-object-detection-on-coco-2017-1)](https://paperswithcode.com/sota/open-world-object-detection-on-coco-2017-1?p=multi-modal-transformers-excel-at-class)
[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/multi-modal-transformers-excel-at-class/open-world-object-detection-on-coco-2017-2)](https://paperswithcode.com/sota/open-world-object-detection-on-coco-2017-2?p=multi-modal-transformers-excel-at-class)
[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/multi-modal-transformers-excel-at-class/object-detection-on-pascal-voc-10)](https://paperswithcode.com/sota/object-detection-on-pascal-voc-10?p=multi-modal-transformers-excel-at-class)
[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/multi-modal-transformers-excel-at-class/object-detection-on-pascal-voc-2007)](https://paperswithcode.com/sota/object-detection-on-pascal-voc-2007?p=multi-modal-transformers-excel-at-class)

### **Class-agnostic Object Detection with Multi-modal Transformer**
[Muhammad Maaz](https://scholar.google.com/citations?user=vTy9Te8AAAAJ&hl=en&authuser=1&oi=sra), [Hanoona Rasheed](https://scholar.google.com/citations?user=yhDdEuEAAAAJ&hl=en&authuser=1&oi=sra), [Salman Khan](https://salman-h-khan.github.io/), [Fahad Shahbaz Khan](https://scholar.google.es/citations?user=zvaeYnUAAAAJ&hl=en), [Rao Muhammad Anwer](https://scholar.google.com/citations?hl=en&authuser=1&user=_KlvMVoAAAAJ) and [Ming-Hsuan Yang](https://scholar.google.com/citations?user=p9-ohHsAAAAJ&hl=en)
[![paper](https://img.shields.io/badge/arXiv-Paper-.svg)](https://arxiv.org/abs/2111.11430)
[![video](https://img.shields.io/badge/Video-Presentation-F9D371)](https://youtu.be/pkooyDZAxdA)
[![slides](https://img.shields.io/badge/Presentation-Slides-B762C1)](https://drive.google.com/file/d/1v8PcbVVOHwzo5LShjB_NQJE7m1rbA9bL)
[![slides](https://img.shields.io/badge/Paper-Poster-87CEEB)](paper_resources/eccv'22_poster.pdf)

# :rocket: News
* **(July 06, 2022)**
* Paper accepted at ECCV 2022
* **(Feb 01, 2022)**
* Training code for the `MAVL` and `MAVL minus Language` models is released `->` [training/README.md](training/README.md)
* Instructions for using the class-agnostic object detection behavior of MAVL in different applications are released `->` [applications/README.md](applications/README.md)
* All the pretrained models (`MAVL`, `Def-DETR`, `MDETR`, `DETReg`, `Faster-RCNN`, `RetinaNet`, `ORE`, and others), along with the instructions to reproduce the results are released `->` [this link](https://mbzuaiac-my.sharepoint.com/:f:/g/personal/muhammad_maaz_mbzuai_ac_ae/Et8rDrc4jkdIuHx4OH52fFUBreLD2-AIUAvO7ZjxtwjU3g?e=lMbeGq)
* **(Nov 25, 2021)** Evaluation code along with pre-trained models & pre-computed predictions is released. [evaluation/README.md](evaluation/class_agnostic_od/README.md)
![main figure](paper_resources/new_main_figure.jpg)
> ***Abstract:** What constitutes an object? This has been a long-standing question in computer vision. Towards this goal, numerous learning-free and learning-based approaches have been developed to score objectness. However, they generally do not scale well across new domains and novel objects. In this paper, we advocate that existing methods lack a top-down supervision signal governed by human-understandable semantics. For the first time in literature, we demonstrate that Multi-modal Vision Transformers (MViT) trained with aligned image-text pairs can effectively bridge this gap. Our extensive experiments across various domains and novel objects show the state-of-the-art performance of MViTs to localize generic objects in images. Based on the observation that existing MViTs do not include multi-scale feature processing and usually require longer training schedules, we develop an efficient MViT architecture using multi-scale deformable attention and late vision-language fusion. We show the significance of MViT proposals in a diverse range of applications including open-world object detection, salient and camouflage object detection, supervised and self-supervised detection tasks. Further, MViTs can adaptively generate proposals given a specific language query and thus offer enhanced interactability.*
## Architecture overview of MViTs used in this work
Architecture overview of MViTs used in this work – [GPV-1](https://arxiv.org/abs/2104.00743),
[MDETR](https://openaccess.thecvf.com/content/ICCV2021/papers/Kamath_MDETR_-_Modulated_Detection_for_End-to-End_Multi-Modal_Understanding_ICCV_2021_paper.pdf)
and Multiscale Attention ViT with Late fusion (MAVL) (ours).
![Architecture overview](paper_resources/new_block_diag.png)
## Installation
The code is tested with PyTorch 1.8.0 and CUDA 11.1. After cloning the repository, follow the steps below for installation:

1. Install PyTorch and torchvision
```shell
pip install torch==1.8.0+cu111 torchvision==0.9.0+cu111 -f https://download.pytorch.org/whl/torch_stable.html
```
2. Install other dependencies
```shell
pip install -r requirements.txt
```
3. Compile Deformable Attention modules
```shell
cd models/ops
sh make.sh
```
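After these steps, a quick sanity check of the environment can look like the snippet below. This is only an illustrative check, not part of the official instructions; the commented import at the end assumes the compiled extension keeps the name used by upstream Deformable DETR.

```python
# Minimal sanity check for the environment described above (PyTorch 1.8.0 + CUDA 11.1).
# Illustrative only; not part of the official installation steps.
import torch
import torchvision

print("torch:", torch.__version__)              # expected: 1.8.0+cu111
print("torchvision:", torchvision.__version__)  # expected: 0.9.0+cu111
print("CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("GPU:", torch.cuda.get_device_name(0))

# If the Deformable Attention ops compiled successfully, the import below should work
# (extension name used by upstream Deformable DETR; assumed to be the same here).
# import MultiScaleDeformableAttention
```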
## Results
Results of Class-agnostic Object Detection of MViTs, including our proposed Multiscale Attention ViT with Late fusion (MAVL) model, along with applications and exploratory analysis.

Class-agnostic Object Detection performance of MViTs in comparison with bottom-up approaches and uni-modal detectors on five natural-image OD datasets. MViTs show consistently good results on all datasets.
![Results](paper_resources/table_1.png)
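Class-agnostic evaluation ignores category labels: a ground-truth box counts as detected if some proposal covers it with sufficient overlap. The snippet below is a simplified sketch of such a recall computation; the IoU threshold and matching rule are assumptions for illustration, not the exact protocol used in the paper.

```python
# Illustrative sketch of class-agnostic recall: category labels are ignored and a
# ground-truth box counts as detected if any proposal overlaps it above an IoU
# threshold (0.5 here is an assumption, not the paper's exact protocol).
import torch
from torchvision.ops import box_iou

def class_agnostic_recall(proposals, gt_boxes, iou_thr=0.5):
    """proposals: (N, 4) and gt_boxes: (M, 4) float tensors in xyxy format."""
    if len(gt_boxes) == 0:
        return 1.0
    if len(proposals) == 0:
        return 0.0
    ious = box_iou(gt_boxes, proposals)            # (M, N) pairwise IoU
    covered = ious.max(dim=1).values >= iou_thr    # best proposal per GT box
    return covered.float().mean().item()

# Example: two GT boxes, only the first is covered by a proposal -> recall 0.5
gt = torch.tensor([[10., 10., 50., 50.], [60., 60., 90., 90.]])
props = torch.tensor([[12., 11., 49., 52.]])
print(class_agnostic_recall(props, gt))
```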
Generalization to New Domains: Class-agnostic OD performance of MViTs in comparison with a uni-modal detector (RetinaNet) on five out-of-domain OD datasets. MViTs show consistently good results on all datasets.
![Results](paper_resources/table_2.png)
Generalization to Rare/Novel Classes: MAVL class-agnostic OD performance on rarely and frequently occurring categories in the pretraining captions.
The numbers on top of the bars indicate occurrences of the corresponding category in the training dataset.
The MViT achieves good recall values even for the classes with no or very few occurrences.
![Results](paper_resources/table_3.png)
Enhanced Interactability: Effect of using different intuitive text queries on the MAVL class-agnostic OD performance.
Combining detections from multiple queries captures varying aspects of objectness.
![Results](paper_resources/table_4.png)
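As a rough sketch of how detections from several text queries can be merged into one proposal set: the query strings and the `detect` helper below are hypothetical placeholders; the actual inference interface is documented in [applications/README.md](applications/README.md).

```python
# Illustrative sketch of combining class-agnostic detections obtained from several
# text queries into one proposal set. Query strings and detect() are hypothetical
# placeholders, not the repository's actual API.
import torch
from torchvision.ops import nms

queries = ["all objects", "all entities", "all visible things"]  # example queries

def detect(image, query):
    """Hypothetical helper returning (boxes [N, 4] xyxy, scores [N]) for one query."""
    raise NotImplementedError

def combine_queries(image, queries, iou_thr=0.7):
    all_boxes, all_scores = [], []
    for q in queries:
        boxes, scores = detect(image, q)
        all_boxes.append(boxes)
        all_scores.append(scores)
    boxes = torch.cat(all_boxes)
    scores = torch.cat(all_scores)
    keep = nms(boxes, scores, iou_thr)  # drop near-duplicate boxes across queries
    return boxes[keep], scores[keep]
```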
Language Skeleton/Structure: Experimental analysis to explore the contribution of language by removing all textual inputs, but maintaining the structure introduced by captions.
All experiments are performed on Def-DETR.
In setting 1, annotations corresponding to the same images are combined.
Setting 2 has an additional NMS applied to remove duplicate boxes.
In setting 3, four to eight boxes are randomly grouped in each iteration.
The same model is trained longer in setting 4.
In setting 5, the dataloader structure corresponding to captions is kept intact.
Results from setting 5 demonstrate the importance of structure introduced by language.
![Results](paper_resources/table_5.png)
Open-world Object Detection: Effect of using class-agnostic OD proposals from MAVL for pseudo labelling of unknowns in Open World Detector (ORE).
![Results](paper_resources/table_6.png)
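A simplified sketch of this idea is shown below: class-agnostic proposals that do not overlap any known-class ground-truth box are kept as pseudo "unknown" labels. The IoU threshold and top-k cap are illustrative assumptions, not the exact recipe used with ORE.

```python
# Illustrative sketch of pseudo-labelling "unknowns" for an open-world detector:
# class-agnostic proposals far from any known-class ground-truth box are treated
# as unknown objects. Thresholds and top-k are assumptions for illustration.
import torch
from torchvision.ops import box_iou

def pseudo_label_unknowns(proposals, scores, known_gt_boxes, iou_thr=0.5, top_k=5):
    """proposals: (N, 4) xyxy, scores: (N,), known_gt_boxes: (M, 4) xyxy."""
    if len(known_gt_boxes) > 0 and len(proposals) > 0:
        overlap = box_iou(proposals, known_gt_boxes).max(dim=1).values
        candidates = overlap < iou_thr        # proposals away from known objects
    else:
        candidates = torch.ones(len(proposals), dtype=torch.bool)
    idx = torch.nonzero(candidates).squeeze(1)
    # keep the highest-scoring remaining proposals as pseudo "unknown" labels
    order = scores[idx].argsort(descending=True)[:top_k]
    return proposals[idx[order]]
```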
Pretraining for Class-aware Object Detection: Effect of using MAVL proposals instead of Selective Search proposals for pre-training DETReg.
![Results](paper_resources/table_7.png)
## Evaluation
Please refer to [evaluation/class_agnostic_od/README.md](evaluation/class_agnostic_od/README.md).
## Training
Please refer to [training/README.md](training/README.md).

## Applications
Please refer to [applications/README.md](applications/README.md).
## Citation
If you use our work, please consider citing:
```bibtex
@inproceedings{Maaz2022Multimodal,
title={Class-agnostic Object Detection with Multi-modal Transformer},
author={Maaz, Muhammad and Rasheed, Hanoona and Khan, Salman and Khan, Fahad Shahbaz and Anwer, Rao Muhammad and Yang, Ming-Hsuan},
booktitle={17th European Conference on Computer Vision (ECCV)},
year={2022},
organization={Springer}
}
```

## Contact
Should you have any questions, please create an issue on this repository or contact us at [email protected], [email protected].

## Related Works
- Bridging the Gap between Object and Image-level Representations for Open-Vocabulary Detection, NeurIPS 2022. [Paper](https://arxiv.org/abs/2207.03482) | [Code](https://github.com/hanoonaR/object-centric-ovd)