https://github.com/bytedance/OmniScient-Model

This repo contains the code for our paper Towards Open-Ended Visual Recognition with Large Language Model

Host: GitHub
URL: https://github.com/bytedance/OmniScient-Model
Owner: bytedance
License: apache-2.0
Created: 2023-11-13T19:04:06.000Z (over 1 year ago)
Default Branch: main
Last Pushed: 2024-07-15T16:13:37.000Z (11 months ago)
Last Synced: 2024-11-15T13:15:25.072Z (7 months ago)
Topics: research
Language: Jupyter Notebook
Homepage:
Size: 25.2 MB
Stars: 90
Watchers: 10
Forks: 5
Open Issues: 5
Metadata Files:
- Readme: README.md
- License: LICENSE


# OmniScient-Model (ECCV 2024)

This repo contains the code for our paper [**Towards Open-Ended Visual Recognition with Large Language Model**](https://arxiv.org/abs/2311.08400).

We propose the OmniScient Model (OSM) for open-ended visual recognition, which identifies diverse real-world entities without the constraints of a user-defined vocabulary. Unlike closed-vocabulary and open-vocabulary recognition frameworks, OSM does not need a predefined vocabulary.

### Features
* A simple strategy to adapt a multi-modal LLM to high-resolution images at 1120x1120, leading to more precise recognition.

* A brand-new task, open-ended visual recognition, which predicts beyond the limits of a given vocabulary.

* A strong model that can recognize novel concepts in the real world, e.g., it can recognize semantic parts even when trained only on object-level data.

## Installation

```bash
pip install torch==2.0.1 torchvision==0.15.2
pip install -r requirements.txt
```

## Getting Started

We provide examples of applying OSM on top of an off-the-shelf segmenter (e.g., SAM): [demo_with_sam.py](./demo_with_sam.py) runs OSM in a segment-and-recognize-anything mode, and [interactive_demo.ipynb](./interactive_demo.ipynb) runs it in an interactive mode.

## Data Preparation

Please refer to [Preparing Datasets for OSM](dataset_preparation/README.md).

## Training

After finishing data preparation, use the following commands to train OSM. Training takes about 2 days on 8 A100 GPUs; adjust gradient accumulation, FSDP, and gradient checkpointing to fit your computational resources.

To train OSM-final w/o part segmentation or detection data:
```bash
torchrun --nnodes=1 --nproc_per_node=8 --master_addr=127.0.0.1 --master_port=9999 --node_rank=0 \
train/train.py \
--dataset_resampled \
--batch_size_coco 8 \
--batch_size_lvis 16 \
--batch_size_a847 4 \
--batch_size_pc459 2 \
--batch_size_ade20k 4 \
--batch_size_cityscapes 2 \
--train_num_samples_coco 100000 \
--train_num_samples_lvis 200000 \
--train_num_samples_a847 50000 \
--train_num_samples_pc459 25000 \
--train_num_samples_ade20k 50000 \
--train_num_samples_cityscapes 25000 \
--workers 4 \
--run_name osm_final \
--num_epochs 10 \
--warmup_steps 100 \
--weight_decay 0.05 \
--lr_scheduler cosine \
--coco_shards "$SAVE_PATH/coco_pan_wds_exclude_lvisval/{000000000..000000106}.tar" \
--lvis_shards "$SAVE_PATH/lvis_wds/{000000000..000000099}.tar" \
--a847_shards "$SAVE_PATH/a847_wds/{000000000..000000025}.tar" \
--pc459_shards "$SAVE_PATH/pc459_wds/{000000000..000000004}.tar" \
--ade20k_shards "$SAVE_PATH/ade20k_pan_wds/{000000000..000000020}.tar" \
--cityscapes_shards "$SAVE_PATH/cityscapes_pan_wds/{000000000..000000002}.tar" \
--learning_rate 4e-5 \
--precision amp_bfloat16 \
--gradient_accumulation_steps 4
```
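The batch sizes and sample counts in the command above are tied together, which matters when adapting the run to fewer GPUs. The sketch below only does the arithmetic; it is not code from this repo. It assumes that each `--batch_size_*` value is a per-GPU batch size and that every training iteration draws one batch from each dataset, neither of which is confirmed here. The names `DATASETS`, `iterations_per_epoch`, and `samples_per_update` are hypothetical.

```python
# Values copied from the first training command above.
NUM_GPUS = 8          # --nproc_per_node
GRAD_ACCUM_STEPS = 4  # --gradient_accumulation_steps

# dataset -> (--batch_size_<name>, --train_num_samples_<name>)
DATASETS = {
    "coco": (8, 100_000),
    "lvis": (16, 200_000),
    "a847": (4, 50_000),
    "pc459": (2, 25_000),
    "ade20k": (4, 50_000),
    "cityscapes": (2, 25_000),
}


def iterations_per_epoch(batch_size, num_samples, num_gpus=NUM_GPUS):
    """Iterations needed to consume num_samples.

    ASSUMPTION: batch_size is the per-GPU batch size.
    """
    return num_samples / (batch_size * num_gpus)


def samples_per_update(datasets, num_gpus=NUM_GPUS, accum=GRAD_ACCUM_STEPS):
    """Samples contributing to one optimizer update.

    ASSUMPTION: each iteration draws one batch from every dataset.
    """
    per_gpu = sum(batch_size for batch_size, _ in datasets.values())
    return per_gpu * num_gpus * accum


iters = {name: iterations_per_epoch(b, n) for name, (b, n) in DATASETS.items()}
for name, count in iters.items():
    print(f"{name:<12} {count:.1f} iterations per epoch")
print("samples per optimizer update:", samples_per_update(DATASETS))
```

Under these assumptions every dataset works out to the same 1562.5 iterations per epoch, and one optimizer update covers 1152 samples. If you change a batch size or the GPU count, you may want to scale the matching `--train_num_samples_*` value or `--gradient_accumulation_steps` so the datasets stay balanced.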

To train OSM-final w/ part segmentation and detection data:
```bash
torchrun --nnodes=1 --nproc_per_node=8 --master_addr=127.0.0.1 --master_port=9999 --node_rank=0 \
train/train.py \
--dataset_resampled \
--mask2box_prob 0.2 \
--batch_size_coco 8 \
--batch_size_lvis 16 \
--batch_size_a847 4 \
--batch_size_pc459 2 \
--batch_size_ade20k 4 \
--batch_size_cityscapes 2 \
--batch_size_v3det 16 \
--batch_size_partimagenet 4 \
--batch_size_pascal_part 2 \
--train_num_samples_coco 100000 \
--train_num_samples_lvis 200000 \
--train_num_samples_a847 50000 \
--train_num_samples_pc459 25000 \
--train_num_samples_ade20k 50000 \
--train_num_samples_cityscapes 25000 \
--train_num_samples_v3det 200000 \
--train_num_samples_partimagenet 50000 \
--train_num_samples_pascal_part 25000 \
--workers 4 \
--run_name osm_final_partseg_det \
--num_epochs 10 \
--warmup_steps 100 \
--weight_decay 0.05 \
--lr_scheduler cosine \
--coco_shards "$SAVE_PATH/coco_pan_wds_exclude_lvisval/{000000000..000000106}.tar" \
--lvis_shards "$SAVE_PATH/lvis_wds/{000000000..000000099}.tar" \
--a847_shards "$SAVE_PATH/a847_wds/{000000000..000000025}.tar" \
--pc459_shards "$SAVE_PATH/pc459_wds/{000000000..000000004}.tar" \
--ade20k_shards "$SAVE_PATH/ade20k_pan_wds/{000000000..000000020}.tar" \
--cityscapes_shards "$SAVE_PATH/cityscapes_pan_wds/{000000000..000000002}.tar" \
--v3det_shards "$SAVE_PATH/v3det_wds/{000000000..000000183}.tar" \
--partimagenet_shards "$SAVE_PATH/part_imagenet_wds/{000000000..000000020}.tar" \
--pascal_part_shards "$SAVE_PATH/pascal_part_wds/{000000000..000000008}.tar" \
--learning_rate 4e-5 \
--precision amp_bfloat16 \
--gradient_accumulation_steps 4
```
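The `--*_shards` arguments are passed in double quotes, so bash expands `$SAVE_PATH` but leaves the `{start..end}` range untouched for the training code to expand. To check that a pattern matches the shards you actually produced, a small helper like the one below can list the expected file names. This is a minimal sketch, not the repo's loader: `expand_shards` is a hypothetical name, it handles a single numeric range only, and `/data` stands in for your own `$SAVE_PATH`.

```python
import re
from typing import List


def expand_shards(pattern: str) -> List[str]:
    """Expand one zero-padded {start..end} range into explicit shard paths."""
    match = re.search(r"\{(\d+)\.\.(\d+)\}", pattern)
    if match is None:
        return [pattern]  # no range in the pattern: a single shard
    start_text, end_text = match.groups()
    width = len(start_text)  # keep the zero padding of the pattern
    prefix, suffix = pattern[:match.start()], pattern[match.end():]
    return [
        f"{prefix}{index:0{width}d}{suffix}"
        for index in range(int(start_text), int(end_text) + 1)
    ]


# "/data" is a placeholder for $SAVE_PATH; the ranges are copied from the command above.
patterns = {
    "coco": "/data/coco_pan_wds_exclude_lvisval/{000000000..000000106}.tar",
    "v3det": "/data/v3det_wds/{000000000..000000183}.tar",
    "pascal_part": "/data/pascal_part_wds/{000000000..000000008}.tar",
}
for name, pattern in patterns.items():
    shards = expand_shards(pattern)
    print(f"{name}: {len(shards)} shards, first={shards[0]}")
```

Comparing these names against the files on disk before launching a multi-day run is a cheap way to catch a range that is one shard too long or too short.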

## Testing

Update the data path in [test/generate_pred.py](test/generate_pred.py), then run the following script for testing:
```bash
GPU_COUNT=8 # Set your GPU count here
CKPT_PATH="./osm_final.pt" # Set your checkpoint path here
RESULT_SAVE_PATH="osm_final" # Set your result save path here

for (( i=0; i
```

| Checkpoint | Training Datasets |
| --- | --- |
| OSM | COCO Panoptic, ADE Panoptic, Cityscapes Panoptic, LVIS Instance, A-847 Semantic, PC-459 Semantic |
| OSM w/ part and box | COCO Panoptic, ADE Panoptic, Cityscapes Panoptic, LVIS Instance, A-847 Semantic, PC-459 Semantic, Part-ImageNet Semantic, Pascal-Part Semantic, V3Det Detection |

## Visual Results

## Citing OSM

If you use OSM in your research, please use the following BibTeX entry.

```BibTeX
@inproceedings{yu2023towards,
title={Towards Open-Ended Visual Recognition with Large Language Model},
author={Qihang Yu and Xiaohui Shen and Liang-Chieh Chen},
booktitle={ECCV},
year={2024}
}
```

## Acknowledgement

[Segment Anything](https://github.com/facebookresearch/segment-anything)

[OpenFlamingo](https://github.com/mlfoundations/open_flamingo)
