https://github.com/apple/ml-veclip
The official repo for the paper "VeCLIP: Improving CLIP Training via Visual-enriched Captions"
- Host: GitHub
- URL: https://github.com/apple/ml-veclip
- Owner: apple
- License: other
- Created: 2024-03-06T19:00:33.000Z (about 2 years ago)
- Default Branch: main
- Last Pushed: 2025-01-22T17:21:12.000Z (over 1 year ago)
- Last Synced: 2025-04-13T00:49:23.304Z (about 1 year ago)
- Language: Jupyter Notebook
- Size: 1.42 MB
- Stars: 242
- Watchers: 15
- Forks: 14
- Open Issues: 1
Metadata Files:
- Readme: README.md
- Contributing: CONTRIBUTING.md
- License: LICENSE
- Code of conduct: CODE_OF_CONDUCT.md
# [ECCV-2024] VeCLIP: Improving CLIP Training via Visual-enriched Captions
*A novel CLIP training scheme that achieves SoTA performance on zero-shot ImageNet classification and COCO image-text retrieval using limited visual-enriched captions.* [[Paper](https://arxiv.org/abs/2310.07699)]
[Zhengfeng Lai*](https://zjujefflai.github.io/), [Haotian Zhang*](https://haotian-zhang.github.io/), [Bowen Zhang](https://zbwglory.github.io/), Wentao Wu, Haoping Bai, Aleksei Timofeev, Xianzhi Du, [Zhe Gan](https://zhegan27.github.io/), Jiulong Shan, [Chen-Nee Chuah](https://www.ece.ucdavis.edu/~chuah/rubinet/people/chuah/bio.html), Yinfei Yang, Meng Cao [*: equal contribution]
Diagram of VeCap.
## Release
- [10/03/2024] 🔥🔥🔥 We release [VeCap-V2](https://arxiv.org/abs/2410.02740): Revisit Large-Scale Image-Caption Data in Pre-training Multimodal Foundation Models.
- [08/23/2024] 🔥🔥🔥 We release our VeCap-300M [dataset](#vecap-300m-download).
- [07/01/2024] 🔥 Our paper was accepted to ECCV 2024.
- [03/06/2024] 🔥 We released the VeCLIP & VeCap-DFN [checkpoints](#checkpoints).
## Contents
- [Install](#install)
- [Getting Started](#getting-started)
- [Checkpoints](#checkpoints)
## Install
1. Clone this repository
```Shell
git clone https://github.com/apple/ml-veclip
cd ml-veclip
```
2. Create an environment and install related packages
```Shell
conda create -n veclip python=3.9 -y
conda activate veclip
pip install -r requirements.txt
```
## Getting Started
See the [example notebook](https://github.com/apple/ml-veclip/blob/main/load_veclip.ipynb) for how to load the different checkpoints with Hugging Face Transformers.
## VeCap-300M Download
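We split the 300M dataset into 10 JSON files; for each image, we store the web link and our caption. The command below reads the download URLs listed in `vecap300m.txt` (`-i`), runs in the background (`-b`), and resumes partially downloaded files (`-c`).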
```Shell
wget -i vecap300m.txt -b -c
```
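Once a shard is downloaded, the (web link, caption) pairs can be read with a few lines of Python. The following is only a minimal sketch: the key names `url` and `caption` and the file layout (a single JSON array or one JSON object per line) are assumptions, not something this README specifies, so inspect a shard and adjust them. It also loads the whole shard into memory, which may be impractical for files this large.

```python
import json
from pathlib import Path
from typing import Iterator, Tuple


def iter_pairs(path, url_key="url", caption_key="caption") -> Iterator[Tuple[str, str]]:
    """Yield (image URL, caption) pairs from one downloaded shard.

    The default key names are hypothetical placeholders; pass the real
    ones after looking at a shard.
    """
    text = Path(path).read_text(encoding="utf-8")
    try:
        # Case 1 (assumed): the shard is a single JSON document.
        records = json.loads(text)
    except json.JSONDecodeError:
        # Case 2 (assumed): JSON Lines, one object per non-empty line.
        records = [json.loads(line) for line in text.splitlines() if line.strip()]
    if isinstance(records, dict):
        # A lone object is treated as a one-record shard.
        records = [records]
    for record in records:
        yield record[url_key], record[caption_key]
```

Calling `iter_pairs("shard.json", url_key=..., caption_key=...)` with the actual key names yields the pairs lazily, so they can be fed straight into an image downloader.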
## Checkpoints
We release the checkpoints for **VeCLIP**, which are trained from scratch on the visual-enriched caption datasets VeCap 3M/12M/100M/200M/300M, as reported in the paper. The models are evaluated zero-shot on COCO/Flickr30k image-text retrieval and ImageNet/ImageNetv2 classification. Use `wget` or `curl` to download the checkpoints below.
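| Data | Model | Resolution | COCO I2T (R@1) | COCO T2I (R@1) | Flickr30k I2T (R@1) | Flickr30k T2I (R@1) | ImageNet | ImageNetv2 |
|---|---|---|---|---|---|---|---|---|
| VeCap 3M | CLIP-B/16 | 224x224 | 5.46 | 3.28 | 12.20 | 6.36 | 5.46 | 7.09 |
| VeCap 3M | VeCLIP-B/16 | 224x224 | 22.30 | 13.01 | 40.60 | 27.58 | 15.98 | 13.51 |
| VeCap 12M | CLIP-B/16 | 224x224 | 24.52 | 14.28 | 44.70 | 29.06 | 31.60 | 27.03 |
| VeCap 12M | VeCLIP-B/16 | 224x224 | 47.78 | 31.62 | 73.90 | 55.68 | 38.11 | 32.53 |
| VeCap 100M | CLIP-B/16 | 224x224 | 47.24 | 30.61 | 74.40 | 57.16 | 58.64 | 50.96 |
| VeCap 100M | VeCLIP-B/16 | 224x224 | 64.82 | 46.12 | 89.30 | 73.10 | 60.77 | 54.17 |
| VeCap 200M | CLIP-B/16 | 224x224 | 52.20 | 34.97 | 80.90 | 63.26 | 63.72 | 56.84 |
| VeCap 200M | VeCLIP-B/16 | 224x224 | 67.20 | 48.40 | 91.10 | 76.32 | 64.64 | 57.67 |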
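For readers unfamiliar with the retrieval columns, R@1 is the percentage of queries whose top-ranked candidate is the correct match. The sketch below illustrates the arithmetic on a toy similarity matrix; it assumes exactly one ground-truth candidate per query, stored at the same index, which simplifies the actual COCO/Flickr30k protocol (several captions per image). The function name and the numbers are illustrative and not taken from the repository.

```python
from typing import Sequence


def recall_at_1(similarity: Sequence[Sequence[float]]) -> float:
    """Percentage of queries whose top-scoring candidate is the matching one.

    similarity[i][j] scores query i against candidate j; the correct
    candidate for query i is assumed to sit at index i.
    """
    hits = 0
    for i, row in enumerate(similarity):
        best = max(range(len(row)), key=row.__getitem__)  # index of the top score
        if best == i:
            hits += 1
    return 100.0 * hits / len(similarity)


# Toy image-to-text similarities for 3 images and their 3 captions.
i2t = [
    [0.9, 0.1, 0.3],
    [0.2, 0.4, 0.8],  # image 1 ranks caption 2 first: a miss
    [0.1, 0.2, 0.7],
]
# Text-to-image retrieval scores the same pairs with the roles swapped,
# i.e. the transposed matrix.
t2i = [list(col) for col in zip(*i2t)]

print(f"I2T R@1 = {recall_at_1(i2t):.2f}")
print(f"T2I R@1 = {recall_at_1(t2i):.2f}")
```

In a real evaluation the similarity matrix would come from the dot products of normalized image and text embeddings produced by the model.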
We further found that VeCap is complementary to other well-established filtering methods, e.g., [Data Filtering Network (DFN)](https://arxiv.org/abs/2309.17425). We also provide those checkpoints (referred to as **VeCap-DFN**) and report their performance below.
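| Backbone | Resolution | Data | COCO I2T (R@1) | COCO T2I (R@1) | Flickr30k I2T (R@1) | Flickr30k T2I (R@1) | ImageNet | ImageNetV2 |
|---|---|---|---|---|---|---|---|---|
| VeCap-DFN-B/16 | 224x224 | DFN | 62.96 | 43.20 | 87.10 | 70.44 | 76.15 | 68.19 |
| VeCap-DFN-B/16 | 224x224 | VeCap 300M | 64.74 | 44.58 | 90.10 | 73.14 | 46.43 | 41.15 |
| VeCap-DFN-B/16 | 224x224 | DFN + VeCap 300M | 66.28 | 45.12 | 88.80 | 73.56 | 76.19 | 69.58 |
| VeCap-DFN-L/14 | 224x224 | DFN + VeCap 300M | 71.06 | 51.13 | 93.10 | 80.96 | 81.95 | 75.48 |
| VeCap-DFN-H/14 | 336x336 | DFN + VeCap 300M | 72.78 | 52.33 | 93.60 | 82.64 | 83.07 | 76.37 |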
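The ImageNet columns report zero-shot top-1 accuracy: each image is assigned the class whose text embedding is most similar to the image embedding. The sketch below shows only that arithmetic on made-up 2-D vectors; the helper names and data are hypothetical, and in practice the embeddings would come from the model's image and text encoders applied to images and class-name prompts.

```python
import math
from typing import Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def zero_shot_top1(image_embs, class_text_embs, labels) -> float:
    """Top-1 accuracy (%) when each image takes its most similar class."""
    correct = 0
    for emb, label in zip(image_embs, labels):
        scores = [cosine(emb, text_emb) for text_emb in class_text_embs]
        pred = max(range(len(scores)), key=scores.__getitem__)
        correct += int(pred == label)
    return 100.0 * correct / len(labels)


# Made-up embeddings: two classes, three images.
class_text_embs = [[1.0, 0.0], [0.0, 1.0]]
image_embs = [[0.9, 0.1], [0.2, 0.8], [0.6, 0.4]]
labels = [0, 1, 1]  # the third image is closer to class 0, so it is misclassified

print(f"top-1 = {zero_shot_top1(image_embs, class_text_embs, labels):.2f}")
```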
## Citation
If you find VeCLIP useful, please cite using this BibTeX:
```bibtex
@misc{lai2024veclip,
title={VeCLIP: Improving CLIP Training via Visual-enriched Captions},
author={Zhengfeng Lai and Haotian Zhang and Bowen Zhang and Wentao Wu and Haoping Bai and Aleksei Timofeev and Xianzhi Du and Zhe Gan and Jiulong Shan and Chen-Nee Chuah and Yinfei Yang and Meng Cao},
year={2024},
eprint={2310.07699},
archivePrefix={arXiv},
primaryClass={cs.CV}
}
@misc{lai2024revisitlargescaleimagecaptiondata,
title={Revisit Large-Scale Image-Caption Data in Pre-training Multimodal Foundation Models},
author={Zhengfeng Lai and Vasileios Saveris and Chen Chen and Hong-You Chen and Haotian Zhang and Bowen Zhang and Juan Lao Tebar and Wenze Hu and Zhe Gan and Peter Grasch and Meng Cao and Yinfei Yang},
year={2024},
eprint={2410.02740},
archivePrefix={arXiv},
primaryClass={cs.CV},
url={https://arxiv.org/abs/2410.02740},
}
@article{fang2023data,
title={Data filtering networks},
author={Fang, Alex and Jose, Albin Madappally and Jain, Amit and Schmidt, Ludwig and Toshev, Alexander and Shankar, Vaishaal},
journal={arXiv preprint arXiv:2309.17425},
year={2023}
}
```
## Acknowledgement
- [axlearn](https://github.com/apple/axlearn): the codebase we use to train the models.
- [huggingface transformers](https://huggingface.co/docs/transformers/en/index): Transformers provides APIs to load our trained models.