https://github.com/reshalfahsi/instance-segmentation-vit-maskrcnn

Instance Segmentation Using ViT-based Mask R-CNN
instance-segmentation mask-rcnn penn-fudan-database penn-fudan-dataset vision-transformer

# Instance Segmentation Using ViT-based Mask R-CNN








Instance segmentation assigns each pixel to the individual object instance it belongs to, so that separate objects in a scene receive separate masks. One approach, which combines object detection and semantic segmentation, is Mask R-CNN. A Vision Transformer (ViT) can also serve as the backbone of Mask R-CNN. In this project, a pre-trained ViT-based Mask R-CNN model is fine-tuned and evaluated on the Penn-Fudan Database for Pedestrian Detection and Segmentation. The dataset is split into train, validation, and test sets with a ratio of 80:10:10.
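To illustrate the 80:10:10 split, here is a minimal sketch using only the Python standard library. The function name `split_indices`, the fixed seed, and the sample count of 170 (the commonly cited size of the Penn-Fudan dataset) are assumptions for illustration; the notebook may partition the data differently.

```python
import random


def split_indices(num_samples, percents=(80, 10, 10), seed=0):
    """Shuffle indices 0..num_samples-1 and cut them into train/val/test lists.

    `percents` are integer percentages; integer arithmetic avoids
    floating-point rounding surprises when computing the split sizes.
    """
    if sum(percents) != 100:
        raise ValueError("percents must sum to 100")

    indices = list(range(num_samples))
    random.Random(seed).shuffle(indices)  # reproducible shuffle

    n_train = num_samples * percents[0] // 100
    n_val = num_samples * percents[1] // 100

    train = indices[:n_train]
    val = indices[n_train:n_train + n_val]
    test = indices[n_train + n_val:]  # the remainder absorbs any rounding
    return train, val, test


# Assumed dataset size; replace with len(dataset) in practice.
train_idx, val_idx, test_idx = split_indices(170)
print(len(train_idx), len(val_idx), len(test_idx))
```

The resulting index lists are disjoint and together cover every sample, so they could be used to build per-split subsets of a dataset object.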

## Experiment

The entire experiment is available as a [Jupyter Notebook](https://github.com/reshalfahsi/instance-segmentation-vit-maskrcnn/blob/master/Instance_Segmentation_Using_ViT_based_Mask_RCNN.ipynb).

## Result

## Quantitative Result

The following table reports the quantitative performance of the ViT-based Mask R-CNN on the test set.

Test Metric | Score
------------------------------ | -------------
mAP<sup>box</sup>@0.5:0.95 | 96.85%
mAP<sup>mask</sup>@0.5:0.95 | 79.58%
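
The `@0.5:0.95` suffix follows the COCO convention: average precision is computed at ten IoU thresholds from 0.50 to 0.95 in steps of 0.05 and then averaged. The sketch below illustrates only the IoU side of that definition, for boxes and for binary masks. It is not the evaluation code used in the notebook, and the helper names `box_iou`, `mask_iou`, and `matched_thresholds` are hypothetical.

```python
def box_iou(a, b):
    """IoU of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    inter_w = max(0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def mask_iou(a, b):
    """IoU of two binary masks given as equally sized nested lists of 0/1."""
    inter = union = 0
    for row_a, row_b in zip(a, b):
        for pa, pb in zip(row_a, row_b):
            inter += 1 if (pa and pb) else 0
            union += 1 if (pa or pb) else 0
    return inter / union if union > 0 else 0.0


# The ten COCO-style IoU thresholds: 0.50, 0.55, ..., 0.95.
IOU_THRESHOLDS = [(50 + 5 * i) / 100 for i in range(10)]


def matched_thresholds(iou):
    """Thresholds at which a prediction with this IoU would count as a match."""
    return [t for t in IOU_THRESHOLDS if iou >= t]


# Toy ground-truth and predicted boxes (illustrative values only).
gt_box, pred_box = (0, 0, 10, 10), (0, 0, 9, 9)
box_score = box_iou(gt_box, pred_box)
print(box_score, matched_thresholds(box_score))

# Toy 2x3 ground-truth and predicted masks (illustrative values only).
gt_mask = [[1, 1, 0], [1, 1, 0]]
pred_mask = [[1, 1, 1], [0, 1, 0]]
print(mask_iou(gt_mask, pred_mask))
```

Because a prediction must clear every threshold up to 0.95 to count everywhere, the averaged metric rewards tight localization, which is one reason a mask score can sit well below a box score for the same detections.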

## Loss Curve

Figure: Loss curves of ViT-based Mask R-CNN on the Penn-Fudan Database for Pedestrian Detection and Segmentation train and validation sets.

## Qualitative Result

Below, the qualitative results are presented.

Figure: A few samples of qualitative results from the ViT-based Mask R-CNN model.

## Credit

- [An Image Is Worth 16x16 Words: Transformers for Image Recognition at Scale](https://arxiv.org/pdf/2010.11929.pdf)
- [Mask R-CNN](https://arxiv.org/pdf/1703.06870.pdf)
- [Benchmarking Detection Transfer Learning with Vision Transformers](https://arxiv.org/pdf/2111.11429.pdf)
- [TorchVision's Mask R-CNN](https://github.com/pytorch/vision/blob/main/torchvision/models/detection/mask_rcnn.py)
- [TorchVision Object Detection Finetuning Tutorial](https://pytorch.org/tutorials/intermediate/torchvision_tutorial.html)
- [Penn-Fudan Database for Pedestrian Detection and Segmentation](https://www.cis.upenn.edu/~jshi/ped_html/)
- [PyTorch Lightning](https://lightning.ai/docs/pytorch/latest/)