https://github.com/Sense-GVT/DeCLIP

Supervision Exists Everywhere: A Data Efficient Contrastive Language-Image Pre-training Paradigm
Host: GitHub
URL: https://github.com/Sense-GVT/DeCLIP
Owner: Sense-GVT
Created: 2021-10-09T07:35:46.000Z (about 3 years ago)
Default Branch: main
Last Pushed: 2022-09-19T03:50:23.000Z (about 2 years ago)
Last Synced: 2024-08-01T13:30:19.163Z (3 months ago)
Topics: big-model, clip, image-text, multi-model, self-supervised, vision-language-pretraining, zero-shot
Language: Python
Homepage:
Size: 970 KB
Stars: 622
Watchers: 19
Forks: 31
Open Issues: 23
Metadata Files:
- Readme: README.md


# [Supervision Exists Everywhere: A Data Efficient Contrastive Language-Image Pre-training Paradigm.](https://arxiv.org/abs/2110.05208)

DeCLIP is an open-source project that welcomes contributions and feedback. We hope that the toolbox and benchmark can serve the growing research community by providing a flexible and standardized toolkit for reimplementing existing methods and developing new Contrastive Language-Image Pretraining methods. This repo contains:
+ Pre-trained models and training code to reproduce various Contrastive Language-Image Pretraining methods (e.g. CLIP, DeCLIP, SLIP, FILIP).
+ Various benchmark datasets for the large-scale Contrastive Language-Image Pretraining task.
+ Zero-shot transfer and linear classification evaluation scripts for downstream datasets.

We aim to democratize large-scale CLIP and build a fair and reproducible CLIP community. Our papers are available at:

**DeCLIP**: [Supervision Exists Everywhere: A Data Efficient Contrastive Language-Image Pre-training Paradigm](https://arxiv.org/abs/2110.05208).

**CLIP-Benchmark**: [Democratizing Contrastive Language-Image Pre-training: A CLIP Benchmark of Data, Model, and Supervision](https://arxiv.org/abs/2203.05796).

## Call for Papers & Participation

:loudspeaker: **Call for Papers & Participation**: ECCV Workshop and Challenge on [Computer Vision in the Wild (CVinW)](https://computer-vision-in-the-wild.github.io/eccv-2022/)

+ CVinW [Workshop]
+ ICinW [IC Challenge]
+ ODinW [OD Challenge]

## Introduction

Recently, large-scale Contrastive Language-Image Pre-training (CLIP) (Radford et al., 2021) has attracted unprecedented attention for its impressive zero-shot recognition ability and excellent transferability to downstream tasks. However, CLIP is quite data-hungry and requires 400M image-text pairs for pre-training, which restricts its adoption. This work proposes a novel training paradigm, Data efficient CLIP (DeCLIP), to alleviate this limitation. We demonstrate that by carefully utilizing the widespread supervision among the image-text pairs, our DeCLIP can learn generic visual features more efficiently. Instead of using only the single image-text contrastive supervision, we fully exploit the potential of the data through (1) self-supervision within each modality; (2) multi-view supervision across modalities; (3) nearest-neighbor supervision from other similar pairs. Benefiting from this intrinsic supervision, our DeCLIP-ResNet50 achieves 60.4% zero-shot top-1 accuracy on ImageNet, which is 0.8% above CLIP-ResNet50 while using 7.1× less data. Our DeCLIP-ResNet50 outperforms its counterpart on 8 out of 11 visual datasets when transferred to downstream tasks. Moreover, scaling up the model and compute also works well in our framework.
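As a rough illustration of the image-text contrastive supervision and the multi-view supervision mentioned above, here is a minimal, dependency-free sketch. It is not the repository's implementation: the function names, the temperature of 0.07, and the choice to average the loss over all view pairs are assumptions made for the example, and the self-supervision and nearest-neighbor terms are left out. Real training code would compute this on GPU tensors rather than Python lists.

```python
import math


def normalize(v):
    """Scale a vector to unit L2 norm."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def cross_entropy_diagonal(logits):
    """Mean cross-entropy where the correct class for row i is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        m = max(row)  # subtract the max for numerical stability
        log_sum_exp = m + math.log(sum(math.exp(x - m) for x in row))
        total += log_sum_exp - row[i]
    return total / len(logits)


def info_nce(image_emb, text_emb, temperature=0.07):
    """Symmetric contrastive loss over a batch of paired embeddings.

    image_emb[i] and text_emb[i] are assumed to belong to the same pair.
    The temperature value is an assumption for this sketch.
    """
    img = [normalize(v) for v in image_emb]
    txt = [normalize(v) for v in text_emb]
    # Cosine similarity of every image with every text, scaled by temperature.
    logits = [[sum(a * b for a, b in zip(i, t)) / temperature for t in txt]
              for i in img]
    logits_t = [list(col) for col in zip(*logits)]  # text-to-image direction
    return 0.5 * (cross_entropy_diagonal(logits) + cross_entropy_diagonal(logits_t))


def multi_view_loss(image_views, text_views, temperature=0.07):
    """Average the contrastive loss over every (image view, text view) pair.

    Averaging (rather than summing or weighting) is an assumption here.
    """
    losses = [info_nce(iv, tv, temperature)
              for iv in image_views for tv in text_views]
    return sum(losses) / len(losses)


# Toy batch of two pairs with 2-d embeddings.
aligned_img = [[1.0, 0.0], [0.0, 1.0]]
aligned_txt = [[1.0, 0.0], [0.0, 1.0]]
swapped_txt = [[0.0, 1.0], [1.0, 0.0]]

loss_aligned = info_nce(aligned_img, aligned_txt)
loss_swapped = info_nce(aligned_img, swapped_txt)
loss_views = multi_view_loss([aligned_img, aligned_img], [aligned_txt, aligned_txt])
```

With two views per modality, `multi_view_loss` evaluates four image-text combinations instead of one, which is the sense in which each pair contributes more supervision. On the toy batch, correctly matched pairs give a loss close to zero, while mismatched pairs give a much larger one.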

(Figure: DeCLIP framework)

# Updates

***2022-09-19*** :loudspeaker: **Call for Papers & Participation**: ECCV Workshop and Challenge on [Computer Vision in the Wild (CVinW)](https://computer-vision-in-the-wild.github.io/eccv-2022/)

***2022-06-25*** We released the checkpoints of each model for the benchmark.

***2022-03-10*** We updated the results of CLIP-Benchmark and released our YFCC15M dataset.

***2022-02-22*** We released our training code, benchmark, and model zoo! ***We will release the checkpoints of each model soon, after aligning the results***. We hope this project can serve the growing Contrastive Language-Image Pretraining research community by providing a flexible and standardized toolkit.

***2021-11-06*** First commit. Our code, dataset, and models will be released soon.

## Installation

Please refer to [get_started.md](docs/get_started.md#installation) for installation and [dataset_prepare.md](docs/dataset_prepare.md#prepare-datasets) for dataset preparation.

## Get Started

Install PyTorch. The code has been tested with CUDA 11.2, cuDNN 8.1.0, and PyTorch 1.8.1.

First, prepare pre-training datasets and downstream classification datasets through [get_started.md](docs/get_started.md#installation).

Models trained on different data are organized into separate [experiment directories](experiments/); check that directory for details.

#### 1. Pre-training

You can run `run.sh` directly to train the corresponding model. We train most of our models on four 8-GPU nodes. Check the config in the experiment directory of the corresponding model for details.

#### 2. Zero-shot Evaluation

Add the `--evaluate` argument to the run script to perform zero-shot evaluation.
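For readers unfamiliar with zero-shot evaluation, the sketch below shows the general CLIP-style procedure in plain Python: each image embedding is assigned the class whose text embedding is most similar, and top-1 accuracy is the share of correct assignments. It does not reproduce what `--evaluate` does internally. The function names and toy embeddings are hypothetical, and in practice the class text embeddings would come from encoding a prompt for each label with the text encoder.

```python
import math


def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def zero_shot_predict(image_emb, class_text_embs):
    """Return the index of the class whose text embedding is most similar."""
    scores = [cosine(image_emb, t) for t in class_text_embs]
    return max(range(len(scores)), key=scores.__getitem__)


def top1_accuracy(image_embs, labels, class_text_embs):
    """Percentage of images whose predicted class matches the label."""
    correct = sum(zero_shot_predict(e, class_text_embs) == y
                  for e, y in zip(image_embs, labels))
    return 100.0 * correct / len(labels)


# Hypothetical 2-d embeddings: one text embedding per class, three images.
class_text_embs = [[1.0, 0.0], [0.0, 1.0]]
image_embs = [[0.9, 0.1], [0.2, 0.8], [0.6, 0.4]]
labels = [0, 1, 1]

predictions = [zero_shot_predict(e, class_text_embs) for e in image_embs]
accuracy = top1_accuracy(image_embs, labels, class_text_embs)
```

No classifier is trained on the downstream dataset in this procedure. The class names alone define the decision rule, which is why the metric is reported as "0-shot" in the tables.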

## DeCLIP Model-Zoo

### Our pre-trained visual backbone models (w/o text encoder)

| Method | Dataset | Model | Epochs | 0-shot | Config | Paper | Weights |
|---|---|---|---|---|---|---|---|
| DeCLIP | Declip-88M | ResNet50 | 32 | 62.5 | config | paper | GoogleDriver |
| DeCLIP | Declip-88M | ViT-B32 | 32 | 66.2 | config | paper | GoogleDriver |

### Our pre-trained DeCLIP models (w/ text encoder)

| Method | Dataset | Model | Epochs | 0-shot | Config | Paper | Weights |
|---|---|---|---|---|---|---|---|
| DeCLIP | Declip-88M | ResNet50 | 32 | 62.5 | config | paper | GoogleDriver |
| DeCLIP | Declip-88M | ViT-B32 | 32 | 66.2 | config | paper | GoogleDriver |

# CLIP-Benchmark

Democratizing Contrastive Language-Image Pre-training: A CLIP Benchmark of Data, Model, and Supervision. Our paper is available on [arXiv](https://arxiv.org/abs/2203.05796).

Witnessing the great success of CLIP, researchers continue to push its frontier. For instance, SLIP, DeCLIP and FILIP achieve considerable improvements by embracing different kinds of supervision within the image-text pairs. However, it remains challenging to make a fair comparison between these methods, because they do not use consistent training recipes and even use different data. We propose CLIP-Benchmark, a first attempt to evaluate, analyze, and benchmark CLIP and its variants. Moreover, we combine DeCLIP with FILIP, which gives the strongest variant, DeFILIP.

(Figure: DeCLIP framework)

### Supported Models:

The following models are pre-trained on YFCC15M and evaluated on ImageNet-1K (ILSVRC2012).

| Method | Dataset | Model | Epochs | 0-shot | Config | Paper | Weights |
|---|---|---|---|---|---|---|---|
| CLIP | YFCC-15M | ViT-B32 | 32 | 32.8 | config | paper | GoogleDriver |
| DeCLIP | YFCC-15M | ViT-B32 | 32 | 43.2 | config | paper | GoogleDriver |
| SLIP | YFCC-15M | ViT-B32 | 32 | 34.3 | config | paper | GoogleDriver |
| FILIP | YFCC-15M | ViT-B32 | 32 | 39.5 | config | paper | GoogleDriver |
| DeFILIP | YFCC-15M | ViT-B32 | 32 | 45.0 | config | paper | GoogleDriver |

| Method | Dataset | Model | Epochs | 0-shot | Config | Paper | Weights |
|---|---|---|---|---|---|---|---|
| CLIP | YFCC-15M | ResNet50 | 32 | 37.2 | config | paper | GoogleDriver |
| DeCLIP | YFCC-15M | ResNet50 | 32 | 44.4 | config | paper | GoogleDriver |
| SLIP | YFCC-15M | ResNet50 | 32 | 28.5 | config | paper | -- |
| FILIP | YFCC-15M | ResNet50 | 32 | 21.3 | config | paper | -- |
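To make the comparison across methods easier to read, the short snippet below computes each method's zero-shot gain over the CLIP baseline on the same backbone. The accuracy values are copied from the results listed above. The dictionary layout and function name are only an illustration and are not part of the repository.

```python
# Zero-shot top-1 accuracy (%) on ImageNet-1K, copied from the results above.
RESULTS = {
    "ViT-B32": {"CLIP": 32.8, "DeCLIP": 43.2, "SLIP": 34.3,
                "FILIP": 39.5, "DeFILIP": 45.0},
    "ResNet50": {"CLIP": 37.2, "DeCLIP": 44.4, "SLIP": 28.5, "FILIP": 21.3},
}


def gains_over_clip(results):
    """Return, per backbone, each method's accuracy minus the CLIP baseline."""
    gains = {}
    for backbone, scores in results.items():
        baseline = scores["CLIP"]
        gains[backbone] = {
            method: round(score - baseline, 1)  # round away float noise
            for method, score in scores.items()
            if method != "CLIP"
        }
    return gains


GAINS = gains_over_clip(RESULTS)

for backbone, per_method in GAINS.items():
    for method, delta in per_method.items():
        print(f"{backbone:9s} {method:8s} {delta:+.1f}")
```

With the ViT-B32 backbone every variant improves on CLIP, whereas with ResNet50 the SLIP and FILIP entries fall below the CLIP baseline.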

### Supported datasets:

| Dataset | Samples | Download | Paper |
|---|---|---|---|
| YFCC-15M | 15,388,848 | google driver | url |

## Changelog

***2022-02-22*** Release our training code

***2021-11-06*** First Commit

## Citation

```
@inproceedings{li2022supervision,
title={Supervision Exists Everywhere: A Data Efficient Contrastive Language-Image Pre-training Paradigm},
author={Yangguang Li and Feng Liang and Lichen Zhao and Yufeng Cui and Wanli Ouyang and Jing Shao and Fengwei Yu and Junjie Yan},
booktitle={International Conference on Learning Representations},
year={2022},
url={https://openreview.net/forum?id=zq1iJkNk3uN}
}

@misc{cui2022democratizing,
title={Democratizing Contrastive Language-Image Pre-training: A CLIP Benchmark of Data, Model, and Supervision},
author={Yufeng Cui and Lichen Zhao and Feng Liang and Yangguang Li and Jing Shao},
year={2022},
eprint={2203.05796},
archivePrefix={arXiv},
primaryClass={cs.CV}
}
```

## License

For academic use, this project is licensed under the 2-clause BSD License. For commercial use, please contact the authors.

## Acknowledgement

Our framework is based on [prototype](https://github.com/ModelTC/prototype).