https://github.com/om-ai-lab/omdet
Real-time and accurate open-vocabulary end-to-end object detection
- Host: GitHub
- URL: https://github.com/om-ai-lab/omdet
- Owner: om-ai-lab
- License: apache-2.0
- Created: 2024-03-11T16:34:47.000Z (about 1 year ago)
- Default Branch: main
- Last Pushed: 2024-12-18T08:04:07.000Z (5 months ago)
- Last Synced: 2025-04-06T09:01:31.975Z (28 days ago)
- Topics: coco, computer-vision, lvis, object-detection, open-vocabulary, real-time, vision-and-language, zero-shot, zero-shot-object-detection
- Language: Python
- Homepage:
- Size: 9.75 MB
- Stars: 1,310
- Watchers: 70
- Forks: 111
- Open Issues: 5
Metadata Files:
- Readme: README.md
- License: LICENSE
# OmDet-Turbo
Fast and accurate open-vocabulary end-to-end object detection

***
## 🗓️ Updates
* 09/26/2024: OmDet-Turbo has been integrated into Transformers version 4.45.0. The code is available [here](https://github.com/huggingface/transformers/tree/main/src/transformers/models/omdet_turbo), and the Hugging Face model is available [here](https://huggingface.co/omlab/omdet-turbo-swin-tiny-hf).
* 07/05/2024: Our new open-source project, [OmAgent: a multimodal agent framework for solving complex tasks](https://github.com/om-ai-lab/OmAgent), is available! OmDet has been seamlessly integrated as an OVD tool within it. Feel free to explore our innovative multimodal agent framework.
* 06/24/2024: Guidance for [converting OmDet-Turbo to ONNX](https://github.com/om-ai-lab/OmDet#:~:text=How%20To%20Export%20ONNX%20Model) added.
* 03/25/2024: Inference code and a pretrained OmDet-Turbo-Tiny model released.
* 03/12/2024: GitHub open-source project created.

***
## 🔗 Related Works
If you are interested in our research, we welcome you to explore our other wonderful projects.

🔆 [How to Evaluate the Generalization of Detection? A Benchmark for Comprehensive Open-Vocabulary Detection](https://arxiv.org/abs/2308.13177) (AAAI 2024) 🏠 [GitHub Repository](https://github.com/om-ai-lab/OVDEval/tree/main)

🔆 [OmDet: Large-scale vision-language multi-dataset pre-training with multimodal detection network](https://ietresearch.onlinelibrary.wiley.com/doi/full/10.1049/cvi2.12268) (IET Computer Vision)
***
## 📖 Introduction
This repository is the official PyTorch implementation of **OmDet-Turbo**, a fast transformer-based open-vocabulary object detection model.

**⭐️ Highlights**
1. **OmDet-Turbo** is a transformer-based real-time open-vocabulary
detector that combines strong OVD capabilities with fast inference speed.
This model addresses the challenges of efficient detection in open-vocabulary
scenarios while maintaining high detection performance.
2. We introduce the **Efficient Fusion Head**, a swift multimodal fusion module
designed to alleviate the computational burden on the encoder and reduce
the time spent in the ROI-based detection head.
3. The OmDet-Turbo-Base model achieves state-of-the-art zero-shot performance on the ODinW and OVDEval datasets, with AP scores of **30.1** and **26.86**, respectively.
4. The inference speed of OmDet-Turbo-Base on the COCO val2017 dataset reaches **100.2** FPS on an A100 GPU.

For more details, check out our paper **[Real-time Transformer-based Open-Vocabulary Detection with Efficient Fusion Head](https://arxiv.org/abs/2403.06892)**.
***
## ⚡️ Inference Speed
Comparison of inference speeds for each component of the tiny-size model.

***
## 🛠️ How To Install
Follow the [Installation Instructions](install.md) to set up the environment for OmDet-Turbo.

***
## 🚀 How To Run
### Local Inference
1. Download our pretrained model and the [CLIP](https://huggingface.co/omlab/OmDet-Turbo_tiny_SWIN_T/resolve/main/ViT-B-16.pt?download=true) checkpoints.
2. Create a folder named **resources** and put the downloaded models into it.
3. Run **run_demo.py**; images with predicted results will be saved in the **./outputs** folder. Alternatively, see the Hugging Face sketch below.
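If you prefer the Transformers integration mentioned in the updates above, here is a minimal zero-shot inference sketch. It assumes Transformers >= 4.45.0; the argument and result-key names of the post-processing call have changed across Transformers releases, so check the documentation for your installed version.

```python
# Minimal zero-shot inference sketch via the Hugging Face Transformers integration.
# Assumes transformers>=4.45.0; argument/key names may differ in later releases.
import requests
import torch
from PIL import Image
from transformers import AutoProcessor, OmDetTurboForObjectDetection

processor = AutoProcessor.from_pretrained("omlab/omdet-turbo-swin-tiny-hf")
model = OmDetTurboForObjectDetection.from_pretrained("omlab/omdet-turbo-swin-tiny-hf")

# Any test image works; a COCO val2017 image is used here as an example.
url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)
classes = ["cat", "remote"]  # open-vocabulary labels, free-form text

inputs = processor(image, text=classes, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# Rescale boxes to the original image size and filter by confidence.
results = processor.post_process_grounded_object_detection(
    outputs,
    classes=classes,
    target_sizes=[image.size[::-1]],
    score_threshold=0.3,
    nms_threshold=0.3,
)[0]
for score, cls, box in zip(results["scores"], results["classes"], results["boxes"]):
    print(f"{cls}: {score.item():.2f} at {[round(v, 1) for v in box.tolist()]}")
```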
### Run as an API Server
1. Download our pretrained model and the [CLIP](https://huggingface.co/omlab/OmDet-Turbo_tiny_SWIN_T/resolve/main/ViT-B-16.pt?download=true) checkpoints.
2. Create a folder named **resources** and put the downloaded models into it.
3. Run **run_wsgi.py**; the API server will start at **http://host_ip:8000/inf_predict**. Check **http://host_ip:8000/docs** to try it out (a hypothetical client sketch appears below).

Language caching is already enabled when inferring with **run_demo.py**. For more details, please open and check the **run_demo.py** script.
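The actual request schema is defined in **run_wsgi.py** and documented at the **/docs** endpoint. The sketch below is a hypothetical client: the endpoint path is from this README, but the payload field names are assumptions, so adjust them to match what **/docs** shows.

```python
# Hypothetical API client sketch: the /inf_predict endpoint is from the README,
# but the payload fields ("data", "task", "labels") are assumptions; confirm
# them against http://host_ip:8000/docs before use.
import base64
import requests

with open("sample.jpg", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode("utf-8")

payload = {
    "data": [image_b64],                 # assumed field name
    "task": "Detect person, cat, dog.",  # assumed field name
    "labels": ["person", "cat", "dog"],  # assumed field name
}
resp = requests.post("http://host_ip:8000/inf_predict", json=payload, timeout=60)
resp.raise_for_status()
print(resp.json())
```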
***
## ⚙️ How To Export ONNX Model
1. Replace **OmDetV2Turbo** in **OmDet-Turbo_tiny_SWIN_T.yaml** with **OmDetV2TurboInfer**
2. Run **export.py**, and **omdet.onnx** will be exported.

In the above example, post-processing is not included in the ONNX model, and all input sizes are fixed. You can add post-processing and change the input sizes according to your needs. A quick way to inspect the exported graph is sketched below.
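To confirm the export and see the fixed input sizes mentioned above, you can load **omdet.onnx** with ONNX Runtime and print its input/output signatures. This sketch assumes only that the file was produced by **export.py** as described.

```python
# Inspect the exported graph: list input/output names, shapes, and dtypes.
# Useful for confirming the fixed input sizes before wiring up inference.
import onnxruntime as ort

sess = ort.InferenceSession("omdet.onnx", providers=["CPUExecutionProvider"])

print("Inputs:")
for inp in sess.get_inputs():
    print(f"  {inp.name}: shape={inp.shape}, dtype={inp.type}")

print("Outputs:")
for out in sess.get_outputs():
    print(f"  {out.name}: shape={out.shape}, dtype={out.type}")
```

***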
## 📦 Model Zoo
Performance on COCO and LVIS is evaluated under the zero-shot setting.

Model | Backbone | Pre-Train Data | COCO | LVIS | FPS (PyTorch/TRT) | Weight
-- | -- | -- | -- | -- | -- | --
OmDet-Turbo-Tiny | Swin-T | O365, GoldG | 42.5 | 30.3 | 21.5/140.0 | [weight](https://huggingface.co/omlab/OmDet-Turbo_tiny_SWIN_T/tree/main)

***
## 📝 Main Results

***
## Citation
Please consider citing our papers if you use our projects:

```
@article{zhao2024real,
  title={Real-time Transformer-based Open-Vocabulary Detection with Efficient Fusion Head},
  author={Zhao, Tiancheng and Liu, Peng and He, Xuan and Zhang, Lu and Lee, Kyusong},
  journal={arXiv preprint arXiv:2403.06892},
  year={2024}
}
```

```
@article{zhao2024omdet,
  title={OmDet: Large-scale vision-language multi-dataset pre-training with multimodal detection network},
  author={Zhao, Tiancheng and Liu, Peng and Lee, Kyusong},
  journal={IET Computer Vision},
  year={2024},
  publisher={Wiley Online Library}
}
```