# liteCLIP
## CLIP
CLIP (Contrastive Language-Image Pre-Training) is a deep learning model designed to understand the relationship between images and text. Specifically, CLIP is trained on a large corpus of image-text pairs in a self-supervised manner to learn how to associate descriptive text with the visual content of images.
#### It was introduced by [OpenAI](https://openai.com/research/clip)
#### Paper: Learning Transferable Visual Models From Natural Language Supervision [(arxiv)](https://arxiv.org/abs/2103.00020)

---
## liteCLIP
The released models are generally large, since they use a ViT and a transformer language model as the image and text encoders respectively.
I wanted to train a lighter version to understand how the model works and how the contrastive loss function associates images with their texts, so I trained liteCLIP.
I tried to implement the loss function as per the pseudo-code provided in the paper.
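For reference, the symmetric cross-entropy over cosine similarities described in that pseudo-code looks roughly like this in PyTorch (an illustrative sketch, not the repo's exact code):

```python
import torch
import torch.nn.functional as F

def clip_contrastive_loss(image_embeds, text_embeds, logit_scale):
    """Symmetric contrastive loss, following the CLIP paper's pseudo-code.

    image_embeds, text_embeds: L2-normalized tensors of shape [batch, embed_dim],
    where the i-th image and the i-th text form a matching pair.
    logit_scale: exp() of the learned temperature parameter.
    """
    # scaled pairwise cosine similarities [batch, batch]
    logits = logit_scale * image_embeds @ text_embeds.T
    # each image should match the caption at the same index in the batch
    targets = torch.arange(logits.size(0), device=logits.device)
    loss_img = F.cross_entropy(logits, targets)    # image -> text direction
    loss_txt = F.cross_entropy(logits.T, targets)  # text -> image direction
    return (loss_img + loss_txt) / 2
```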
### Trained using PyTorch and PyTorch Lightning
### Trained on [Flickr8K](https://www.kaggle.com/datasets/adityajn105/flickr8k), which has ~8,000 images with ~5 captions per image (a data-loading sketch follows below)
### You can go through the training procedure in this notebook: [training.ipynb](./training.ipynb)
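Pairing each Flickr8K image with its captions could look roughly like this sketch (illustrative only: the `captions.txt` CSV layout, the `Images/` folder name, and the class name are assumptions about the Kaggle download; see [training.ipynb](./training.ipynb) for the actual pipeline):

```python
from pathlib import Path

import pandas as pd
from PIL import Image
from torch.utils.data import Dataset

class Flickr8kPairs(Dataset):
    """One (image, caption) training pair per row of captions.txt
    (assumed to be a CSV with 'image' and 'caption' columns)."""

    def __init__(self, root, image_transform, tokenizer, max_length=128):
        self.root = Path(root)
        self.df = pd.read_csv(self.root / 'captions.txt')
        self.image_transform = image_transform
        self.tokenizer = tokenizer
        self.max_length = max_length

    def __len__(self):
        return len(self.df)

    def __getitem__(self, idx):
        row = self.df.iloc[idx]
        image = Image.open(self.root / 'Images' / row['image']).convert('RGB')
        tokens = self.tokenizer(
            row['caption'],
            padding='max_length',
            truncation=True,
            max_length=self.max_length,
            return_tensors='pt',
        )
        return {
            'image': self.image_transform(image),
            'input_ids': tokens['input_ids'].squeeze(0),
            'attention_mask': tokens['attention_mask'].squeeze(0),
        }
```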
```
liteCLIP architecture:
----------------------
image encoder: convnext_tiny
text encoder: bert-mini (google/bert_uncased_L-4_H-256_A-4)
max token length: 128
embeddings dropout: 0.1
embeddings dimension: 256
batch size: 64
learning rate: 2e-4
epochs: 5
optimizer: Adam
```
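The components listed above could be wired together roughly as follows, using `timm` for the image encoder and Hugging Face `transformers` for the text encoder (an illustrative sketch with assumed names such as `LiteCLIPSketch`, not the repo's exact module code):

```python
import timm
import torch
import torch.nn as nn
import torch.nn.functional as F
from transformers import AutoModel

class LiteCLIPSketch(nn.Module):
    def __init__(self, embed_dim=256, dropout=0.1):
        super().__init__()
        # image encoder: convnext_tiny with its classifier head removed (pooled features)
        self.image_encoder = timm.create_model('convnext_tiny', pretrained=True, num_classes=0)
        # text encoder: bert-mini
        self.text_encoder = AutoModel.from_pretrained('google/bert_uncased_L-4_H-256_A-4')
        # project both modalities into a shared 256-d embedding space
        self.image_proj = nn.Sequential(
            nn.Dropout(dropout),
            nn.Linear(self.image_encoder.num_features, embed_dim),
        )
        self.text_proj = nn.Sequential(
            nn.Dropout(dropout),
            nn.Linear(self.text_encoder.config.hidden_size, embed_dim),
        )
        # learnable temperature, initialized to log(1/0.07) as in the CLIP paper
        self.log_temperature = nn.Parameter(torch.tensor(2.6592))

    def forward(self, images, input_ids, attention_mask):
        img_feat = self.image_encoder(images)                     # [B, 768]
        txt_feat = self.text_encoder(
            input_ids=input_ids, attention_mask=attention_mask
        ).last_hidden_state[:, 0]                                 # [CLS] token, [B, 256]
        img_emb = F.normalize(self.image_proj(img_feat), dim=-1)
        txt_emb = F.normalize(self.text_proj(txt_feat), dim=-1)
        return img_emb, txt_emb, self.log_temperature.exp()
```

The returned embeddings and temperature plug directly into the contrastive loss sketched earlier.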
## Zero-Shot Inference:

### Usage:
Download the model from `Releases` and save it in the `./model` directory as `liteclip2.pt`.
```python
from liteclip import ZeroShotPipeline

pipeline = ZeroShotPipeline()
predictions = pipeline.predict('examples/cat.jpg',
                               ['a photo of a dog',
                                'a photo of a cat',
                                'the photo of a human baby'])

for label, prob in predictions:
    print(f"{label}: {prob*100:.2f}%")
```
### You can see the results in [inference.ipynb](./inference.ipynb)
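Under the hood, zero-shot classification simply ranks the candidate captions by their temperature-scaled cosine similarity to the image embedding and softmaxes the scores into probabilities; a minimal sketch of that step (not the pipeline's actual code):

```python
import torch

@torch.no_grad()
def rank_captions(image_embed, text_embeds, labels, logit_scale):
    """image_embed: L2-normalized [1, d]; text_embeds: L2-normalized [num_labels, d]."""
    probs = (logit_scale * image_embed @ text_embeds.T).softmax(dim=-1).squeeze(0)
    # return (label, probability) pairs, most likely caption first
    return sorted(zip(labels, probs.tolist()), key=lambda pair: -pair[1])
```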
### Extra Resources
- [@moein-shariatnia/OpenAI-CLIP](https://github.com/moein-shariatnia/OpenAI-CLIP)
- [@openai/CLIP](https://github.com/openai/CLIP/tree/main/clip)
## Citations
```bibtex
@misc{https://doi.org/10.48550/arxiv.2103.00020,
  doi       = {10.48550/ARXIV.2103.00020},
  url       = {https://arxiv.org/abs/2103.00020},
  author    = {Radford, Alec and Kim, Jong Wook and Hallacy, Chris and Ramesh, Aditya and Goh, Gabriel and Agarwal, Sandhini and Sastry, Girish and Askell, Amanda and Mishkin, Pamela and Clark, Jack and Krueger, Gretchen and Sutskever, Ilya},
  keywords  = {Computer Vision and Pattern Recognition (cs.CV), Machine Learning (cs.LG), FOS: Computer and information sciences},
  title     = {Learning Transferable Visual Models From Natural Language Supervision},
  publisher = {arXiv},
  year      = {2021},
  copyright = {arXiv.org perpetual, non-exclusive license}
}
```
```bibtex
@misc{rw2019timm,
  author       = {Ross Wightman},
  title        = {PyTorch Image Models},
  year         = {2019},
  publisher    = {GitHub},
  journal      = {GitHub repository},
  doi          = {10.5281/zenodo.4414861},
  howpublished = {\url{https://github.com/rwightman/pytorch-image-models}}
}
```
```bibtex
@article{turc2019,
  title   = {Well-Read Students Learn Better: On the Importance of Pre-training Compact Models},
  author  = {Turc, Iulia and Chang, Ming-Wei and Lee, Kenton and Toutanova, Kristina},
  journal = {arXiv preprint arXiv:1908.08962v2},
  year    = {2019}
}
```
```bibtex
@software{Shariatnia_Simple_CLIP_2021,
  author  = {Shariatnia, M. Moein},
  doi     = {10.5281/zenodo.6845731},
  month   = {4},
  title   = {{Simple CLIP}},
  version = {1.0.0},
  year    = {2021}
}
```
had fun and learnt a lot <3