Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://github.com/DAMO-NLP-SG/Inf-CLIP
📣📣 The official CLIP training codebase of Inf-CL: "Breaking the Memory Barrier: Near Infinite Batch Size Scaling for Contrastive Loss". A super memory-efficient CLIP training scheme.
https://github.com/DAMO-NLP-SG/Inf-CLIP
clip contrastive-learning flash-attention infinite-batch-size memory-efficient ring-attention
- Host: GitHub
- URL: https://github.com/DAMO-NLP-SG/Inf-CLIP
- Owner: DAMO-NLP-SG
- License: apache-2.0
- Created: 2024-10-16T12:11:45.000Z (28 days ago)
- Default Branch: main
- Last Pushed: 2024-10-23T06:46:41.000Z (21 days ago)
- Last Synced: 2024-10-24T14:29:22.337Z (20 days ago)
- Topics: clip, contrastive-learning, flash-attention, infinite-batch-size, memory-efficient, ring-attention
- Language: Python
- Homepage:
- Size: 3.76 MB
- Stars: 47
- Watchers: 5
- Forks: 2
- Open Issues: 1
Metadata Files:
- Readme: README.md
- License: LICENSE
Awesome Lists containing this project
README
Breaking the Memory Barrier: Near Infinite Batch Size Scaling for Contrastive Loss

If our project helps you, please give us a star ⭐ on GitHub to support us. 🙏🙏
[![arXiv](https://img.shields.io/badge/Arxiv-2410.17243-AD1C18.svg?logo=arXiv)](https://arxiv.org/abs/2410.17243)
[![hf_paper](https://img.shields.io/badge/🤗-Paper%20In%20HF-red.svg)](https://huggingface.co/papers/2410.17243)
[![PyPI](https://img.shields.io/badge/PyPI-Inf--CL-9C276A.svg)](https://pypi.org/project/inf-cl)
[![License](https://img.shields.io/badge/License-Apache%202.0-yellow)](https://github.com/DAMO-NLP-SG/Inf-CLIP/blob/main/LICENSE)
[![Hits](https://hits.seeyoufarm.com/api/count/incr/badge.svg?url=https%3A%2F%2Fgithub.com%2FDAMO-NLP-SG%2FInf-CLIP&count_bg=%2379C83D&title_bg=%23555555&icon=&icon_color=%23E7E7E7&title=hits&edge_flat=false)](https://hits.seeyoufarm.com)
[![GitHub issues](https://img.shields.io/github/issues/DAMO-NLP-SG/Inf-CLIP?color=critical&label=Issues)](https://github.com/DAMO-NLP-SG/Inf-CLIP/issues?q=is%3Aopen+is%3Aissue)
[![GitHub closed issues](https://img.shields.io/github/issues-closed/DAMO-NLP-SG/Inf-CLIP?color=success&label=Issues)](https://github.com/DAMO-NLP-SG/Inf-CLIP/issues?q=is%3Aissue+is%3Aclosed)
[![zhihu](https://img.shields.io/badge/-知乎-000000?logo=zhihu&logoColor=0084FF)](https://zhuanlan.zhihu.com/p/1681887214)
[![Twitter](https://img.shields.io/badge/-Twitter-black?logo=twitter&logoColor=1D9BF0)](https://x.com/lixin4ever/status/1849669129613226457)

💡 Some other multimodal foundation model projects from our team may interest you ✨.
> [**VCD: Mitigating Object Hallucinations in Large Vision-Language Models through Visual Contrastive Decoding**](https://arxiv.org/abs/2311.16922)
> Sicong Leng, Hang Zhang, Guanzheng Chen, Xin Li, Shijian Lu, Chunyan Miao, Lidong Bing
[![github](https://img.shields.io/badge/-Github-black?logo=github)](https://github.com/DAMO-NLP-SG/VCD) [![github](https://img.shields.io/github/stars/DAMO-NLP-SG/VCD.svg?style=social)](https://github.com/DAMO-NLP-SG/VCD) [![arXiv](https://img.shields.io/badge/Arxiv-2311.16922-b31b1b.svg?logo=arXiv)](https://arxiv.org/abs/2311.16922)

> [**VideoLLaMA 2: Advancing Spatial-Temporal Modeling and Audio Understanding in Video-LLMs**](https://github.com/DAMO-NLP-SG/VideoLLaMA2)
> Zesen Cheng, Sicong Leng, Hang Zhang, Yifei Xin, Xin Li, Guanzheng Chen, Yongxin Zhu, Wenqi Zhang, Ziyang Luo, Deli Zhao, Lidong Bing
[![github](https://img.shields.io/badge/-Github-black?logo=github)](https://github.com/DAMO-NLP-SG/VideoLLaMA2) [![github](https://img.shields.io/github/stars/DAMO-NLP-SG/VideoLLaMA2.svg?style=social)](https://github.com/DAMO-NLP-SG/VideoLLaMA2) [![arXiv](https://img.shields.io/badge/Arxiv-2406.07476-b31b1b.svg?logo=arXiv)](https://arxiv.org/abs/2406.07476)

> [**The Curse of Multi-Modalities: Evaluating Hallucinations of Large Multimodal Models across Language, Visual, and Audio**](https://arxiv.org/abs/2410.12787)
> Sicong Leng, Yun Xing, Zesen Cheng, Yang Zhou, Hang Zhang, Xin Li, Deli Zhao, Shijian Lu, Chunyan Miao, Lidong Bing
[![github](https://img.shields.io/badge/-Github-black?logo=github)](https://github.com/DAMO-NLP-SG/CMM) [![github](https://img.shields.io/github/stars/DAMO-NLP-SG/CMM.svg?style=social)](https://github.com/DAMO-NLP-SG/CMM) [![arXiv](https://img.shields.io/badge/Arxiv-2410.12787-b31b1b.svg?logo=arXiv)](https://arxiv.org/abs/2410.12787)

## 📰 News
* **[2024.10.18]** Release training and evaluation code of Inf-CLIP.

## 🛠️ Requirements and Installation
Basic Dependencies:
* Python >= 3.8
* PyTorch >= 2.0.0
* CUDA Version >= 11.8

[Remote] Install Inf-CL:
```bash
# remote installing
pip install inf_cl -i https://pypi.org/simple
```

[Local] Install Inf-CL:
```bash
pip install -e .
```

Install required packages:
```bash
git clone https://github.com/DAMO-NLP-SG/Inf-CLIP
cd Inf-CLIP
pip install -r requirements.txt
```
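Either way, a quick check of the basic dependencies (plain PyTorch, independent of this repo) can confirm that the environment matches the requirements above:

```python
# Minimal environment check against the requirements listed above.
import sys
import torch

print("Python:", sys.version.split()[0])           # should be >= 3.8
print("PyTorch:", torch.__version__)                # should be >= 2.0.0
print("CUDA available:", torch.cuda.is_available())
print("CUDA version:", torch.version.cuda)          # should be >= 11.8
print("GPU count:", torch.cuda.device_count())
```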
## ⭐ Features

`inf_cl` is the Triton implementation of the Inf-CL loss:
* [x] [Ring-CL (inf_cl/ring.py#L238)](https://github.com/DAMO-NLP-SG/Inf-CLIP/blob/main/inf_clip/models/ops/ring.py#L238)
* [x] [Inf-CL (inf_cl/ring.py#L251)](https://github.com/DAMO-NLP-SG/Inf-CLIP/blob/main/inf_clip/models/ops/ring.py#L251)`inf_clip` is the CLIP training codebase with Inf-CL loss and other training features:
- [x] [Gradient Accumulation (inf_clip/train/train.py#L180)](https://github.com/DAMO-NLP-SG/Inf-CLIP/blob/main/inf_clip_train/train.py#L180)
- [x] [Gradient Cache (inf_clip/train/train.py#L292)](https://github.com/DAMO-NLP-SG/Inf-CLIP/blob/main/inf_clip_train/train.py#L292)
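For context, the memory-heavy formulation these features are designed to improve on is the standard full-matrix CLIP/InfoNCE loss, which materializes the whole b × b similarity matrix on one device. A minimal single-GPU sketch (illustrative only, not code from this repo):

```python
# Vanilla CLIP contrastive loss for comparison: it builds the full b x b logit
# matrix, which is exactly the memory bottleneck that Inf-CL's tiled/ring
# computation avoids.
import torch
import torch.nn.functional as F

def vanilla_clip_loss(q, k, scale):
    # q, k: L2-normalized image/text features of shape (b, d); scale = exp(logit scale)
    logits = scale * q @ k.t()                            # (b, b) similarity matrix
    labels = torch.arange(q.shape[0], device=q.device)    # matched pairs lie on the diagonal
    return 0.5 * (F.cross_entropy(logits, labels) + F.cross_entropy(logits.t(), labels))
```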
## Usage

A simple example of how to adopt our Inf-CL loss for contrastive learning. Launch it with:
```
torchrun --nproc_per_node 2 tests/example.py
```

```python
import torch
import torch.nn.functional as F
import torch.distributed as dist
import numpy as np

from inf_cl import cal_inf_loss


def create_cl_tensors(rank, world_size):
    # Parameters
    dtype = torch.float32
    num_heads = 3         # Number of attention heads
    seq_length_q = 32768  # Sequence length
    seq_length_k = 32768
    d_model = 256         # Dimension of each head (must be 16, 32, 64, or 128)

    # Randomly initialize inputs
    q = torch.rand((seq_length_q // world_size, num_heads * d_model), dtype=dtype, device=f"cuda:{rank}")
    k = torch.rand((seq_length_k // world_size, num_heads * d_model), dtype=dtype, device=f"cuda:{rank}")
    l = torch.ones([], dtype=dtype, device=f"cuda:{rank}") * np.log(1 / 0.07)

    q = F.normalize(q, p=2, dim=-1).requires_grad_()  # Query
    k = F.normalize(k, p=2, dim=-1).requires_grad_()  # Key
    l = l.requires_grad_()                            # Logit scale

    return q, k, l


if __name__ == "__main__":
    # Assume that the distributed environment has been initialized
    dist.init_process_group("nccl")

    rank = dist.get_rank()
    world_size = dist.get_world_size()

    torch.cuda.set_device(rank)

    # Take image-text contrastive learning as an example: q is the global image
    # features, k is the text features, and l is the logit scale.
    q, k, l = create_cl_tensors(rank, world_size)

    # Labels are the diagonal elements by default.
    # labels = torch.arange(q.shape[0])
    loss = cal_inf_loss(q, k, scale=l.exp())

    print(loss)
```
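To plug the loss into an actual training loop, the only inputs needed (as in the example above) are per-rank, L2-normalized features and the logit scale. A hypothetical per-rank training step might look like the following; `image_encoder`, `text_encoder`, `logit_scale`, and `optimizer` are placeholders for illustration, not objects provided by this repo:

```python
# Hypothetical per-rank training step using Inf-CL.
# `image_encoder`, `text_encoder`, `logit_scale`, and `optimizer` are placeholders.
import torch.nn.functional as F

from inf_cl import cal_inf_loss


def train_step(image_encoder, text_encoder, logit_scale, optimizer, images, texts):
    # Each rank encodes only its local shard of the global batch.
    img_feats = F.normalize(image_encoder(images), p=2, dim=-1)
    txt_feats = F.normalize(text_encoder(texts), p=2, dim=-1)

    # Cross-rank negatives are handled inside the loss, as in the example above.
    loss = cal_inf_loss(img_feats, txt_feats, scale=logit_scale.exp())

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```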
## Main Results
### Memory Cost
\* denotes adopting "data offload" strategy.
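As a rough back-of-the-envelope illustration (our arithmetic, not a measurement from the paper) of why the logit matrix dominates memory at these batch sizes:

```python
# Size of the full b x b float32 logit matrix alone (ignoring activations and gradients).
for b in (16_384, 32_768, 262_144):   # roughly the 16k / 32k / 256k batch sizes used below
    gib = b * b * 4 / 1024**3
    print(f"batch {b}: {gib:.0f} GiB")
# batch 16384: 1 GiB
# batch 32768: 4 GiB
# batch 262144: 256 GiB
```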
### Max Supported Batch Size
### Speed
### Batch Size Scaling
Training at a larger data scale requires a larger batch size.
## Training & Evaluation
### Quick Start
To facilitate further development on top of our codebase, we provide a quick-start guide on using Inf-CLIP to train a custom CLIP model and evaluate it on mainstream CLIP benchmarks.
1. Training Data Structure (a shard sanity-check sketch follows these steps):
```bash
Inf-CLIP
├── datasets
│   ├── cc3m/          # https://github.com/rom1504/img2dataset/blob/main/dataset_examples/cc3m.md
│   │   ├── 0000.tar
│   │   ├── 0001.tar
│   │   ├── ...
│   │   └── 0301.tar
│   ├── cc12m/         # https://github.com/rom1504/img2dataset/blob/main/dataset_examples/cc12m.md
│   │   ├── 0000.tar
│   │   ├── 0001.tar
│   │   ├── ...
│   │   └── 1044.tar
│   └── laion400m/     # https://github.com/rom1504/img2dataset/blob/main/dataset_examples/laion400m.md
│       ├── 00000.tar
│       ├── 00001.tar
│       ├── ...
│       └── 41407.tar
```
2. Command:
```bash
bash scripts/cc3m/lit_vit-b-32_bs16k.sh
bash scripts/cc12m/lit_vit-b-32_bs32k.sh
bash scripts/laion400m/lit_vit-b-32_bs256k.sh
```
3. Evaluation Data Structure:
```bash
Inf-CLIP
├── datasets
│   ├── imagenet-1k/       # download val_images.tar.gz of imagenet
│   │   └── val/
│   │       ├── n01440764
│   │       ├── n01443537
│   │       ├── ...
│   │       └── n15075141
│   └── clip-benchmark/    # bash datasets/benchmarks_download.sh
│       ├── wds_mscoco_captions
│       ├── wds_flickr8k
│       ├── wds_flickr30k
│       ├── wds_imagenet1k
│       ├── wds_imagenetv2
│       ├── wds_imagenet_sketch
│       ├── wds_imagenet-a
│       ├── wds_imagenet-r
│       ├── wds_imagenet-o
│       └── wds_objectnet
```
4. Command:
```bash
# imagenet evaluation
bash scripts/imagenet_eval.sh
# overall evaluation
bash scripts/benchmarks_eval.sh
```
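Before launching the training commands above, it can help to sanity-check the downloaded webdataset shards (the sketch referenced in step 1). The snippet below assumes the usual img2dataset layout of one `.jpg`/`.txt`/`.json` entry per sample inside each tar, which is an assumption about the download output rather than something this codebase enforces:

```python
# Quick sanity check of one webdataset shard produced by img2dataset.
# Assumes samples are stored as <key>.jpg / <key>.txt / <key>.json inside the tar.
import tarfile
from collections import Counter

shard = "datasets/cc3m/0000.tar"   # point this at any downloaded shard

with tarfile.open(shard) as tar:
    suffixes = Counter(name.rsplit(".", 1)[-1] for name in tar.getnames())

print(suffixes)
# The .jpg count should match the .txt caption count; a large mismatch usually
# means the download was interrupted and the shard should be re-fetched.
```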
## Citation

If you find Inf-CLIP useful for your research and applications, please cite using this BibTeX:
```bibtex
@article{damovl2024infcl,
  title={Breaking the Memory Barrier: Near Infinite Batch Size Scaling for Contrastive Loss},
  author={Zesen Cheng and Hang Zhang and Kehan Li and Sicong Leng and Zhiqiang Hu and Fei Wu and Deli Zhao and Xin Li and Lidong Bing},
  journal={arXiv preprint arXiv:2410.17243},
  year={2024},
  url={https://arxiv.org/abs/2410.17243}
}
```

## Acknowledgement
The codebase of Inf-CLIP is adapted from [**OpenCLIP**](https://github.com/mlfoundations/open_clip). We are also grateful to the following projects, from which Inf-CL arose:
* [**OpenAI CLIP**](https://openai.com/index/clip/), [**img2dataset**](https://github.com/rom1504/img2dataset), [**CLIP-Benchmark**](https://github.com/LAION-AI/CLIP_benchmark).
* [**FlashAttention**](https://github.com/Dao-AILab/flash-attention), [**RingAttention**](https://github.com/haoliuhl/ringattention), [**RingFlashAttention**](https://github.com/zhuzilin/ring-flash-attention).

## License
This project is released under the Apache 2.0 license as found in the LICENSE file.
This service is a research preview intended for **non-commercial use ONLY**, subject to the model license of CLIP, the Terms of Use of the data generated by OpenAI, and the terms of use of LAION. Please get in touch with us if you find any potential violations.