https://github.com/mmaaz60/mvits_for_class_agnostic_od

[ECCV'22] Official repository of paper titled "Class-agnostic Object Detection with Multi-modal Transformer".
Host: GitHub
URL: https://github.com/mmaaz60/mvits_for_class_agnostic_od
Owner: mmaaz60
License: mit
Created: 2021-11-16T09:15:36.000Z (about 3 years ago)
Default Branch: main
Last Pushed: 2023-05-09T04:19:22.000Z (almost 2 years ago)
Last Synced: 2025-02-07T12:10:42.846Z (7 days ago)
Topics: class-agnostic-detection, multimodal-learning, object-detection, open-world-detection, psuedo-labels, pytorch
Language: Python
Homepage:
Size: 34.1 MB
Stars: 306
Watchers: 7
Forks: 25
Open Issues: 8
Metadata Files:
- Readme: README.md
- License: LICENSE

# Class-agnostic Object Detection with Multi-modal Transformer (ECCV 2022)

[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/multi-modal-transformers-excel-at-class/class-agnostic-object-detection-on-pascal-voc)](https://paperswithcode.com/sota/class-agnostic-object-detection-on-pascal-voc?p=multi-modal-transformers-excel-at-class)

[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/multi-modal-transformers-excel-at-class/class-agnostic-object-detection-on-coco)](https://paperswithcode.com/sota/class-agnostic-object-detection-on-coco?p=multi-modal-transformers-excel-at-class)

[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/multi-modal-transformers-excel-at-class/class-agnostic-object-detection-on-kitti)](https://paperswithcode.com/sota/class-agnostic-object-detection-on-kitti?p=multi-modal-transformers-excel-at-class)

[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/multi-modal-transformers-excel-at-class/class-agnostic-object-detection-on-kitchen)](https://paperswithcode.com/sota/class-agnostic-object-detection-on-kitchen?p=multi-modal-transformers-excel-at-class)

[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/multi-modal-transformers-excel-at-class/class-agnostic-object-detection-on-comic2k)](https://paperswithcode.com/sota/class-agnostic-object-detection-on-comic2k?p=multi-modal-transformers-excel-at-class)

[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/multi-modal-transformers-excel-at-class/open-world-object-detection-on-pascal-voc)](https://paperswithcode.com/sota/open-world-object-detection-on-pascal-voc?p=multi-modal-transformers-excel-at-class)

[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/multi-modal-transformers-excel-at-class/open-world-object-detection-on-coco-2017)](https://paperswithcode.com/sota/open-world-object-detection-on-coco-2017?p=multi-modal-transformers-excel-at-class)

[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/multi-modal-transformers-excel-at-class/open-world-object-detection-on-coco-2017-1)](https://paperswithcode.com/sota/open-world-object-detection-on-coco-2017-1?p=multi-modal-transformers-excel-at-class)

[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/multi-modal-transformers-excel-at-class/open-world-object-detection-on-coco-2017-2)](https://paperswithcode.com/sota/open-world-object-detection-on-coco-2017-2?p=multi-modal-transformers-excel-at-class)

[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/multi-modal-transformers-excel-at-class/object-detection-on-pascal-voc-10)](https://paperswithcode.com/sota/object-detection-on-pascal-voc-10?p=multi-modal-transformers-excel-at-class)

[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/multi-modal-transformers-excel-at-class/object-detection-on-pascal-voc-2007)](https://paperswithcode.com/sota/object-detection-on-pascal-voc-2007?p=multi-modal-transformers-excel-at-class)

### **Class-agnostic Object Detection with Multi-modal Transformer**

[Muhammad Maaz](https://scholar.google.com/citations?user=vTy9Te8AAAAJ&hl=en&authuser=1&oi=sra), [Hanoona Rasheed](https://scholar.google.com/citations?user=yhDdEuEAAAAJ&hl=en&authuser=1&oi=sra), [Salman Khan](https://salman-h-khan.github.io/), [Fahad Shahbaz Khan](https://scholar.google.es/citations?user=zvaeYnUAAAAJ&hl=en), [Rao Muhammad Anwer](https://scholar.google.com/citations?hl=en&authuser=1&user=_KlvMVoAAAAJ) and [Ming-Hsuan Yang](https://scholar.google.com/citations?user=p9-ohHsAAAAJ&hl=en)

[![paper](https://img.shields.io/badge/arXiv-Paper-.svg)](https://arxiv.org/abs/2111.11430)

[![video](https://img.shields.io/badge/Video-Presentation-F9D371)](https://youtu.be/pkooyDZAxdA)

[![slides](https://img.shields.io/badge/Presentation-Slides-B762C1)](https://drive.google.com/file/d/1v8PcbVVOHwzo5LShjB_NQJE7m1rbA9bL)

[![slides](https://img.shields.io/badge/Paper-Poster-87CEEB)](paper_resources/eccv'22_poster.pdf)

# :rocket: News

* **(July 06, 2022)** 

  * Paper accepted at ECCV 2022

* **(Feb 01, 2022)** 

  * Training code for the `MAVL` and `MAVL minus Language` models is released `->` [training/README.md](training/README.md)

  * Instructions for using the class-agnostic object detection behavior of MAVL in different applications are released `->` [applications/README.md](applications/README.md)

  * All the pretrained models (`MAVL`, `Def-DETR`, `MDETR`, `DETReg`, `Faster-RCNN`, `RetinaNet`, `ORE`, and others), along with the instructions to reproduce the results, are released `->` [this link](https://mbzuaiac-my.sharepoint.com/:f:/g/personal/muhammad_maaz_mbzuai_ac_ae/Et8rDrc4jkdIuHx4OH52fFUBreLD2-AIUAvO7ZjxtwjU3g?e=lMbeGq)

* **(Nov 25, 2021)** Evaluation code, along with pre-trained models and pre-computed predictions, is released `->` [evaluation/class_agnostic_od/README.md](evaluation/class_agnostic_od/README.md)



![main figure](paper_resources/new_main_figure.jpg)

> **Abstract:** *What constitutes an object? This has been a long-standing question in computer vision. Towards this goal, numerous learning-free and learning-based approaches have been developed to score objectness. However, they generally do not scale well across new domains and novel objects. In this paper, we advocate that existing methods lack a top-down supervision signal governed by human-understandable semantics. For the first time in literature, we demonstrate that Multi-modal Vision Transformers (MViT) trained with aligned image-text pairs can effectively bridge this gap. Our extensive experiments across various domains and novel objects show the state-of-the-art performance of MViTs to localize generic objects in images. Based on the observation that existing MViTs do not include multi-scale feature processing and usually require longer training schedules, we develop an efficient MViT architecture using multi-scale deformable attention and late vision-language fusion. We show the significance of MViT proposals in a diverse range of applications including open-world object detection, salient and camouflage object detection, supervised and self-supervised detection tasks. Further, MViTs can adaptively generate proposals given a specific language query and thus offer enhanced interactability.*



## Architecture overview of MViTs used in this work

The MViTs used in this work are [GPV-1](https://arxiv.org/abs/2104.00743), [MDETR](https://openaccess.thecvf.com/content/ICCV2021/papers/Kamath_MDETR_-_Modulated_Detection_for_End-to-End_Multi-Modal_Understanding_ICCV_2021_paper.pdf), and Multiscale Attention ViT with Late fusion (MAVL) (ours).

![Architecture overview](paper_resources/new_block_diag.png)



## Installation

The code is tested with PyTorch 1.8.0 and CUDA 11.1. After cloning the repository, follow the steps below to install it:

1. Install PyTorch and torchvision

```shell
pip install torch==1.8.0+cu111 torchvision==0.9.0+cu111 -f https://download.pytorch.org/whl/torch_stable.html
```

2. Install other dependencies

```shell
pip install -r requirements.txt
```

3. Compile Deformable Attention modules

```shell
cd models/ops
sh make.sh
```
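After the steps above, it can help to confirm that the installed package versions match the ones this README reports testing with. The following is a minimal sketch using only the Python standard library; the helper names `base_version` and `check_versions` are illustrative and not part of this repository.

```python
from importlib import metadata

# Versions this README reports testing with (CUDA build tag stripped).
EXPECTED = {"torch": "1.8.0", "torchvision": "0.9.0"}


def base_version(version):
    """Drop a local build tag such as '+cu111' from a version string."""
    return version.split("+", 1)[0]


def check_versions(expected=EXPECTED):
    """Map each package name to (installed version or None, matches expected?)."""
    report = {}
    for name, wanted in expected.items():
        try:
            found = metadata.version(name)
        except metadata.PackageNotFoundError:
            found = None
        matches = found is not None and base_version(found) == wanted
        report[name] = (found, matches)
    return report


if __name__ == "__main__":
    for name, (found, ok) in check_versions().items():
        print(f"{name}: installed={found} expected={EXPECTED[name]} match={ok}")
```

A mismatch does not necessarily mean the code will fail; it only indicates that the environment differs from the tested one.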



## Results

Results of class-agnostic object detection with MViTs, including our proposed Multiscale Attention ViT with Late fusion (MAVL) model, together with applications and exploratory analysis.

Class-agnostic Object Detection performance of MViTs in comparison with bottom-up approaches and uni-modal detectors on five natural image OD datasets. MViTs show consistently good results on all datasets.

![Results](paper_resources/table_1.png)



Generalization to New Domains: Class-agnostic OD performance of MViTs in comparison with a uni-modal detector (RetinaNet) on five out-of-domain OD datasets. MViTs show consistently good results on all datasets.

![Results](paper_resources/table_2.png)



Generalization to Rare/Novel Classes: MAVL class-agnostic OD performance on rarely and frequently occurring categories in the pretraining captions. The numbers on top of the bars indicate occurrences of the corresponding category in the training dataset. The MViT achieves good recall values even for the classes with no or very few occurrences.

![Results](paper_resources/table_3.png)
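To make the recall figures above concrete, here is a minimal sketch of how class-agnostic recall could be computed for one image: a ground-truth box counts as recalled if any of the top-scoring predictions overlaps it sufficiently, regardless of category. The function names, the IoU threshold of 0.5, and the cutoff of 50 predictions are assumptions for illustration; the evaluation README defines the protocol actually used.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def class_agnostic_recall(gt_boxes, pred_boxes, pred_scores, top_k=50, iou_thr=0.5):
    """Fraction of ground-truth boxes covered by any of the top_k predictions.

    Class labels are ignored entirely. Matching is not one-to-one here: a
    single prediction may cover several ground-truth boxes (a simplification).
    """
    if not gt_boxes:
        return 0.0
    # Rank predictions by confidence and keep only the top_k boxes.
    order = sorted(range(len(pred_boxes)), key=lambda i: pred_scores[i], reverse=True)
    kept = [pred_boxes[i] for i in order[:top_k]]
    hits = sum(1 for g in gt_boxes if any(iou(g, p) >= iou_thr for p in kept))
    return hits / len(gt_boxes)
```

Per-category recall, as plotted in the figure, would apply the same test separately to the ground-truth boxes of each category while still ignoring the predicted labels.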



Enhanced Interactability: Effect of using different intuitive text queries on the MAVL class-agnostic OD performance. Combining detections from multiple queries captures varying aspects of objectness.

![Results](paper_resources/table_4.png)
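One plausible way to combine detections from several text queries is to pool all boxes and apply class-agnostic non-maximum suppression. The sketch below assumes greedy score-ordered NMS with an IoU threshold of 0.5; the function `merge_query_detections` and the query strings in the usage example are placeholders, not this repository's API or its exact merging procedure.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def merge_query_detections(per_query, iou_thr=0.5):
    """Pool detections from all queries, then apply greedy class-agnostic NMS.

    per_query maps a text query to a list of (box, score) pairs.
    Returns (box, score, query) triples sorted by descending score.
    """
    pooled = [(box, score, query)
              for query, dets in per_query.items()
              for box, score in dets]
    pooled.sort(key=lambda d: d[1], reverse=True)
    kept = []
    for det in pooled:
        # Keep a box only if it does not heavily overlap a higher-scoring one.
        if all(iou(det[0], k[0]) <= iou_thr for k in kept):
            kept.append(det)
    return kept


# Hypothetical detections for two placeholder queries.
detections = {
    "query A": [((0, 0, 10, 10), 0.9), ((50, 50, 60, 60), 0.6)],
    "query B": [((1, 1, 10, 10), 0.8)],
}
merged = merge_query_detections(detections)
```

In this example the box from "query B" overlaps the higher-scoring box from "query A" and is suppressed, so `merged` holds two detections.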



Language Skeleton/Structure: Experimental analysis exploring the contribution of language by removing all textual inputs while maintaining the structure introduced by captions. All experiments are performed on Def-DETR. In setting 1, annotations corresponding to the same image are combined. Setting 2 additionally applies NMS to remove duplicate boxes. In setting 3, four to eight boxes are randomly grouped in each iteration. The same model is trained longer in setting 4. In setting 5, the dataloader structure corresponding to captions is kept intact. Results from setting 5 demonstrate the importance of the structure introduced by language.

![Results](paper_resources/table_5.png)
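As an illustration of the random grouping described for setting 3, the sketch below shuffles the boxes of an image and splits them into groups of four to eight. How the repository actually forms these groups is not described here, so the function `random_box_groups` and its behavior for leftover boxes (the last group may be smaller than four) are assumptions.

```python
import random


def random_box_groups(boxes, min_size=4, max_size=8, rng=None):
    """Shuffle boxes and split them into groups of min_size..max_size items.

    The final group may hold fewer than min_size boxes when too few remain.
    Calling this once per training iteration yields a fresh grouping each time.
    """
    if rng is None:
        rng = random.Random()
    shuffled = list(boxes)
    rng.shuffle(shuffled)
    groups, start = [], 0
    while start < len(shuffled):
        size = rng.randint(min_size, max_size)  # inclusive on both ends
        groups.append(shuffled[start:start + size])
        start += size
    return groups
```

Passing a seeded `random.Random` instance makes the grouping reproducible, which is convenient when debugging a dataloader.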



 Open-world Object Detection: Effect of using class-agnostic OD proposals from MAVL for pseudo labelling of unknowns in Open World Detector (ORE).

![Results](paper_resources/table_6.png)
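The pseudo-labelling idea can be sketched as follows: among the class-agnostic proposals, keep the highest-scoring ones that do not overlap any annotated known-class box and treat them as "unknown" objects. This is a simplified illustration under assumed values (`top_k=5`, `max_iou=0.5`) and a hypothetical function name; it is not the ORE integration shipped with this repository.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def pseudo_label_unknowns(proposals, known_boxes, top_k=5, max_iou=0.5):
    """Select up to top_k proposals that do not overlap any known-class box.

    proposals is a list of (box, score) pairs; known_boxes holds the
    ground-truth boxes of the currently known classes for the same image.
    """
    ranked = sorted(proposals, key=lambda p: p[1], reverse=True)
    unknown = []
    for box, score in ranked:
        if len(unknown) >= top_k:
            break
        # A proposal overlapping a known object is not a candidate unknown.
        if all(iou(box, k) < max_iou for k in known_boxes):
            unknown.append((box, score))
    return unknown
```

The selected boxes would then be added to the training targets with an "unknown" label.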



 Pretraining for Class-aware Object Detection: Effect of using MAVL proposals for pre-training of DETReg instead of Selective Search proposals.

![Results](paper_resources/table_7.png)



## Evaluation

Please refer to [evaluation/class_agnostic_od/README.md](evaluation/class_agnostic_od/README.md).



## Training

Please refer to [training/README.md](training/README.md).

## Applications

Please refer to [applications/README.md](applications/README.md).



## Citation

If you use our work, please consider citing:

```bibtex
@inproceedings{Maaz2022Multimodal,
  title={Class-agnostic Object Detection with Multi-modal Transformer},
  author={Maaz, Muhammad and Rasheed, Hanoona and Khan, Salman and Khan, Fahad Shahbaz and Anwer, Rao Muhammad and Yang, Ming-Hsuan},
  booktitle={17th European Conference on Computer Vision (ECCV)},
  year={2022},
  organization={Springer}
}
```

## Contact

Should you have any questions, please create an issue on this repository or contact the authors at [email protected], [email protected]

## Related Works

- Bridging the Gap between Object and Image-level Representations for Open-Vocabulary Detection, NeurIPS 2022. [Paper](https://arxiv.org/abs/2207.03482) | [Code](https://github.com/hanoonaR/object-centric-ovd)