

[CVPR 2024] Real-Time Open-Vocabulary Object Detection





Tianheng Cheng2,3,*,
Lin Song1,📧,*,
Yixiao Ge1,🌟,2,
Wenyu Liu3,
Xinggang Wang3,📧,
Ying Shan1,2

\* Equal contribution 🌟 Project lead 📧 Corresponding author

1 Tencent AI Lab, 2 ARC Lab, Tencent PCG
3 Huazhong University of Science and Technology


## Notice

We recommend that everyone **use English to communicate on issues**, as this helps developers from around the world discuss, share experiences, and answer questions together.

## 🔥 Updates
`[2024-4-28]:` Long time no see! This update contains bug fixes and improvements: (1) ONNX demo; (2) image demo (supports tensor input); (3) new pre-trained models; (4) image prompts; (5) a simple version for fine-tuning / deployment; (6) an installation guide (including a `requirements.txt`).
`[2024-3-28]:` We provide: (1) more high-resolution pre-trained models (e.g., S, M, X) ([#142](; (2) pre-trained models with CLIP-Large text encoders. Most importantly, we preliminarily fix the **fine-tuning without `mask-refine`** and explore a new fine-tuning setting ([#160](,[#76]( In addition, fine-tuning YOLO-World with `mask-refine` also obtains significant improvements, check more details in [configs/finetune_coco](./configs/finetune_coco/).
`[2024-3-16]:` We fix the bugs about the demo ([#110](,[#94](,[#129](, [#125]( with visualizations of segmentation masks, and release [**YOLO-World with Embeddings**](./docs/, which supports prompt tuning, text prompts and image prompts.
`[2024-3-3]:` We add the **high-resolution YOLO-World**, which supports `1280x1280` resolution with higher accuracy and better performance for small objects!
`[2024-2-29]:` We release the newest version of [ **YOLO-World-v2**](./docs/ with higher accuracy and faster speed! We hope the community can join us to improve YOLO-World!
`[2024-2-28]:` Excited to announce that YOLO-World has been accepted by **CVPR 2024**! We're continuing to make YOLO-World faster and stronger, as well as making it better to use for all.
`[2024-2-22]:` We sincerely thank [RoboFlow]( and [@Skalskip92]( for the [**Video Guide**]( about YOLO-World, nice work!
`[2024-2-18]:` We thank [@Skalskip92]( for developing the wonderful segmentation demo via connecting YOLO-World and EfficientSAM. You can try it now at the [🤗 HuggingFace Spaces](
`[2024-2-17]:` The largest model **X** of YOLO-World is released, which achieves better zero-shot performance!
`[2024-2-17]:` We release the code & models for **YOLO-World-Seg** now! YOLO-World now supports open-vocabulary / zero-shot object segmentation!
`[2024-2-15]:` The pre-trained YOLO-World-L with CC3M-Lite is released!
`[2024-2-14]:` We provide the [`image_demo`]( for inference on images or directories.
`[2024-2-10]:` We provide the [fine-tuning](./docs/ and [data](./docs/ details for fine-tuning YOLO-World on the COCO dataset or the custom datasets!
`[2024-2-3]:` We support the `Gradio` demo now in the repo and you can build the YOLO-World demo on your own device!
`[2024-2-1]:` We've released the code and weights of YOLO-World now!
`[2024-2-1]:` We deploy the YOLO-World demo on [HuggingFace 🤗](, you can try it now!
`[2024-1-31]:` We are excited to launch **YOLO-World**, a cutting-edge real-time open-vocabulary object detector.


YOLO-World is under active development, so please stay tuned ☕️!
If you have suggestions 📃 or ideas 💡, **we would love for you to bring them up in the [Roadmap](** ❤️!

## [FAQ (Frequently Asked Questions)](

We have set up an FAQ about YOLO-World in the GitHub Discussions. It collects common questions, and we encourage everyone to post the problems they run into and the solutions they find there, so that others can quickly find answers.

## Highlights & Introduction

This repo contains the PyTorch implementation, pre-trained weights, and pre-training/fine-tuning code for YOLO-World.

* YOLO-World is pre-trained on large-scale datasets, including detection, grounding, and image-text datasets.

* YOLO-World is the next-generation YOLO detector, with a strong open-vocabulary detection capability and grounding ability.

* YOLO-World presents a *prompt-then-detect* paradigm for efficient user-vocabulary inference, which re-parameterizes vocabulary embeddings as parameters of the model and achieves superior inference speed. You can try to export your own detection model without extra training or fine-tuning in our [online demo](!
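The *prompt-then-detect* idea can be illustrated with a toy sketch. It is not the YOLO-World implementation: `toy_text_encoder` and `PromptThenDetectHead` are hypothetical names, and the hash-based encoder only stands in for a real text encoder. The point it shows is that the vocabulary is encoded once and cached as weights, so classifying a region afterwards needs only dot products.

```python
import hashlib
import math


def toy_text_encoder(prompt, dim=8):
    """Hypothetical stand-in for a real text encoder.

    Maps a string to a deterministic unit-length vector so the
    example runs without any model weights.
    """
    digest = hashlib.sha256(prompt.encode("utf-8")).digest()
    vec = [b / 255.0 - 0.5 for b in digest[:dim]]
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec]


class PromptThenDetectHead:
    """Toy classification head whose weights are cached vocabulary embeddings."""

    def __init__(self, prompts):
        self.prompts = list(prompts)
        # "Prompt" step: encode the user vocabulary once and keep the result.
        self.weights = [toy_text_encoder(p) for p in self.prompts]

    def classify(self, region_feature):
        # "Detect" step: similarity against cached weights only;
        # the text encoder is never called here.
        scores = [
            sum(w * f for w, f in zip(weight, region_feature))
            for weight in self.weights
        ]
        best = max(range(len(scores)), key=scores.__getitem__)
        return self.prompts[best], scores


head = PromptThenDetectHead(["person", "dog", "traffic light"])
# Assumption for the demo: a region feature that coincides with the "dog" embedding.
label, scores = head.classify(toy_text_encoder("dog"))
print(label, [round(s, 3) for s in scores])
```

Changing the vocabulary means building a new head with a different prompt list; the per-region work stays the same.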

## Model Zoo

We've pre-trained YOLO-World-S/M/L from scratch and evaluated them on `LVIS val-1.0` and `LVIS minival`. We provide the pre-trained model weights and training logs for applications, research, or reproducing the results.

### Zero-shot Inference on LVIS dataset

| Model | Pre-train Data | Size | APmini | APr (mini) | APc (mini) | APf (mini) | APval | APr (val) | APc (val) | APf (val) | Weights |
| :------------------------------------------------------------------------------------------------------------------- | :------------------- | :----------------- | :--------------: | :------------: | :------------: | :------------: | :-------------: | :------------: | :------------: | :------------: | :------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------: |
| [YOLO-Worldv2-S](./configs/pretrain/ | O365+GoldG | 640 | 22.7 | 16.3 | 20.8 | 25.5 | 17.3 | 11.3 | 14.9 | 22.7 |[HF Checkpoints 🤗](|
| [YOLO-Worldv2-S](./configs/pretrain/ | O365+GoldG | 1280🔸 | 24.1 | 18.7 | 22.0 | 26.9 | 18.8 | 14.1 | 16.3 | 23.8 |[HF Checkpoints 🤗](|
| [YOLO-Worldv2-M](./configs/pretrain/ | O365+GoldG | 640 | 30.0 | 25.0 | 27.2 | 33.4 | 23.5 | 17.1 | 20.0 | 30.1 | [HF Checkpoints 🤗](|
| [YOLO-Worldv2-M](./configs/pretrain/ | O365+GoldG | 1280🔸 | 31.6 | 24.5 | 29.0 | 35.1 | 25.3 | 19.3 | 22.0 | 31.7 | [HF Checkpoints 🤗](|
| [YOLO-Worldv2-L](./configs/pretrain/ | O365+GoldG | 640 | 33.0 | 22.6 | 32.0 | 35.8 | 26.0 | 18.6 | 23.0 | 32.6 | [HF Checkpoints 🤗](|
| [YOLO-Worldv2-L](./configs/pretrain/ | O365+GoldG | 1280🔸 | 34.6 | 29.2 | 32.8 | 37.2 | 27.6 | 21.9 | 24.2 | 34.0 | [HF Checkpoints 🤗](|
| [YOLO-Worldv2-L (CLIP-Large)](./configs/pretrain/ 🔥 | O365+GoldG | 640 | 34.0 | 22.0 | 32.6 | 37.4 | 27.1 | 19.9 | 23.9 | 33.9 | [HF Checkpoints 🤗](|
| [YOLO-Worldv2-L (CLIP-Large)](./configs/pretrain/ 🔥 | O365+GoldG | 800🔸 | 35.5 | 28.3 | 33.2 | 38.8 | 28.6 | 22.0 | 25.1 | 35.4 | [HF Checkpoints 🤗](|
| [YOLO-Worldv2-L](./configs/pretrain/ | O365+GoldG+CC3M-Lite | 640 | 32.9 | 25.3 | 31.1 | 35.8 | 26.1 | 20.6 | 22.6 | 32.3 | [HF Checkpoints 🤗](|
| [YOLO-Worldv2-X](./configs/pretrain/ | O365+GoldG+CC3M-Lite | 640 | 35.4 | 28.7 | 32.9 | 38.7 | 28.4 | 20.6 | 25.6 | 35.0 | [HF Checkpoints 🤗]( |
| 🔥 [YOLO-Worldv2-X]() | O365+GoldG+CC3M-Lite | 1280🔸 | 37.4 | 30.5 | 35.2 | 40.7 | 29.8 | 21.1 | 26.8 | 37.0 | [HF Checkpoints 🤗]( |
| [YOLO-Worldv2-XL](./configs/pretrain/ | O365+GoldG+CC3M-Lite | 640 | 36.0 | 25.8 | 34.1 | 39.5 | 29.1 | 21.1 | 26.3 | 35.8 | [HF Checkpoints 🤗]( |

1. APmini: evaluated on LVIS `minival`.
2. APval: evaluated on LVIS `val 1.0`.
3. [HuggingFace Mirror]( provides a mirror of HuggingFace, which is an option for users who are unable to reach HuggingFace directly.
4. 🔸: models fine-tuned on the pre-training data.

**Pre-training Logs:**

We provide the pre-training logs of `YOLO-World-v2`. Because of unexpected errors on the local machines, training may have been interrupted several times, so some logs are split into parts.

| Model | YOLO-World-v2-S | YOLO-World-v2-M | YOLO-World-v2-L | YOLO-World-v2-X |
| :--- | :-------------: | :--------------: | :-------------: | :-------------: |
|Pre-training Log | [Part-1](, [Part-2]( | [Part-1](, [Part-2]( | [Part-1](, [Part-2]( | [Final part](|

## Getting started

### 1. Installation

YOLO-World is developed based on `torch==1.11.0`, `mmyolo==0.6.0`, and `mmdetection==3.0.0`. See more details about `requirements` and `mmcv` in [docs/installation](./docs/

#### Clone Project

    git clone --recursive

#### Install

    pip install torch wheel -q
    pip install -e .

### 2. Preparing Data

We provide the details about the pre-training data in [docs/data](./docs/

## Training & Evaluation

We adopt the default [training](./tools/ or [evaluation](./tools/ scripts of [mmyolo](
We provide the configs for pre-training and fine-tuning in `configs/pretrain` and `configs/finetune_coco`.
Training YOLO-World is easy:

    chmod +x tools/
    # sample command for pre-training, use AMP for mixed-precision training
    ./tools/ configs/pretrain/ 8 --amp

**NOTE:** YOLO-World is pre-trained on 4 nodes with 8 GPUs per node (32 GPUs in total). For pre-training, the `node_rank` and `nnodes` for multi-node training should be specified.
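As a quick sanity check of the multi-node numbers above, the following sketch shows how `nnodes` and `node_rank` relate to the total process count and to each process's global rank. It assumes the common convention of one process per GPU with ranks assigned node by node; the helper names are hypothetical and not part of the training scripts.

```python
def world_size(nnodes, gpus_per_node):
    """Total number of processes, assuming one process per GPU."""
    return nnodes * gpus_per_node


def global_rank(node_rank, local_rank, gpus_per_node):
    """Global rank of a process, assuming ranks are assigned node by node."""
    return node_rank * gpus_per_node + local_rank


# The pre-training setup described above: 4 nodes with 8 GPUs each.
total = world_size(nnodes=4, gpus_per_node=8)
last = global_rank(node_rank=3, local_rank=7, gpus_per_node=8)
print(total, last)  # 32 31
```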

Evaluating YOLO-World is also easy:

    chmod +x tools/
    ./tools/ path/to/config path/to/weights 8

**NOTE:** We mainly evaluate the performance on LVIS-minival for pre-training.

## Fine-tuning YOLO-World

We provide the details about fine-tuning YOLO-World in [docs/fine-tuning](./docs/

## Deployment

We provide the details about deployment for downstream applications in [docs/deployment](./docs/
You can directly download the ONNX model through the online [demo]( in Huggingface Spaces 🤗.

## Demo

See [`demo`](./demo) for more details

- [x] ``: Gradio demo, ONNX export
- [x] ``: inference with images or a directory of images
- [x] ``: a simple demo of YOLO-World that takes an `array` as input instead of a path.
- [x] ``: YOLO-World inference on videos.
- [x] `inference.ipynb`: Jupyter notebook for YOLO-World.
- [x] [Google Colab Notebook]( We sincerely thank [Onuralp]( for sharing the [Colab Demo](, you can have a try 😊!

## Acknowledgement

We sincerely thank [mmyolo](, [mmdetection](, [GLIP](, and [transformers]( for providing their wonderful code to the community!

## Citations
If you find YOLO-World useful in your research or applications, please consider giving us a star 🌟 and citing it.

title={YOLO-World: Real-Time Open-Vocabulary Object Detection},
author={Cheng, Tianheng and Song, Lin and Ge, Yixiao and Liu, Wenyu and Wang, Xinggang and Shan, Ying},
booktitle={Proc. IEEE Conf. Computer Vision and Pattern Recognition (CVPR)},

## Licence
YOLO-World is under the GPL-v3 Licence and is supported for commercial usage.