# Vision-Language Models Toolbox
A PyTorch-powered library for accelerating multimodal AI research with Vision-Language Models.
A flexible, all-in-one PyTorch library that streamlines research and development with state-of-the-art vision-language models. Whether you’re experimenting with soft-prompt tuning (e.g., CoOp, CoCoOp) or large-scale models such as CLIP, this toolbox provides a robust foundation built on PyTorch and Hugging Face Transformers.
---
## Table of Contents
- [Key Features](#key-features)
- [Supported Models](#supported-models)
- [Quick Start](#quick-start)
- [Usage](#usage)
- [Running Experiments](#running-experiments)
- [Adding New Models](#adding-new-models)
- [Adding a New Dataset](#adding-a-new-dataset)
- [Jupyter Notebooks](#jupyter-notebooks)
- [Installation](#installation)
- [Acknowledgments](#acknowledgments)
- [Contributing](#contributing)
- [License](#license)
---
## Key Features
| **Feature** | **Description** |
|-----------------------------|----------------------------------------------------------------------------------------------------------------------------------------------------------|
| **Multimodal Datasets** | Supports **ImageNet1k, CIFAR-100, Stanford Cars, iNaturalist 2021, MSCOCO Captions**, and more. |
| **Model Flexibility** | Works with **CLIP (ViT & ResNet), DINO-V2, MiniLM, MPNet**, and allows adding custom models. |
| **Custom Objectives/Tasks** | Add new tasks or loss functions with minimal code changes across the combined vision-language flows. |
| **Prompt Tuning** | Supports **soft prompts (CoOp, CoCoOp) and predefined hard prompts** for dataset adaptation (see the sketch below this table). |
| **Scalability & Precision** | Supports **multi-GPU, mixed precision (FP16, BF16, FP32, FP64), sharding, and DeepSpeed**. |
| **Sampling Strategies** | Includes **oversampling, undersampling, and hybrid methods** like **SMOTE, ADASYN, and Tomek Links**. |
| **Data Augmentation** | Provides **image and text augmentations** for model training. |
| **Evaluation Metrics** | Tracks **accuracy, precision, recall, F1-score, AUC-ROC, and more**. |
| **Logging & Visualization** | Supports **TensorBoard & Loguru** for monitoring and debugging. |
| **Flexible API** | **Pre-built modules & functionalities** for datasets, models, tasks, setups, and more. |
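The prompt-tuning feature above refers to CoOp-style soft prompts: a small set of learnable context vectors prepended to each class name's token embeddings, trained while the backbone stays frozen. As rough orientation, here is a minimal PyTorch sketch of the general technique (illustrative only, not the toolbox's internal implementation; the `SoftPrompt` name is hypothetical):

```python
import torch
import torch.nn as nn

class SoftPrompt(nn.Module):
    """CoOp-style learnable context shared across all classes."""

    def __init__(self, n_ctx: int, ctx_dim: int):
        super().__init__()
        # Learnable context vectors, initialized as in CoOp (std=0.02).
        self.ctx = nn.Parameter(torch.empty(n_ctx, ctx_dim))
        nn.init.normal_(self.ctx, std=0.02)

    def forward(self, class_token_embeds: torch.Tensor) -> torch.Tensor:
        # class_token_embeds: (num_classes, n_tokens, ctx_dim)
        n_cls = class_token_embeds.size(0)
        ctx = self.ctx.unsqueeze(0).expand(n_cls, -1, -1)
        # Prepend the shared context; only self.ctx receives gradients.
        return torch.cat([ctx, class_token_embeds], dim=1)
```

CoCoOp extends this idea by additionally conditioning the context vectors on image features.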
---
## Supported Models
| Backbone | Supported Provider(s) | Modality |
|--------------------|------------------------------------------------------------------------------------------------------------------------|------------|
| **CLIP-ViT-B/32** | [OpenAI](https://openai.com/research/clip)<br>[Hugging Face](https://huggingface.co/openai/clip-vit-base-patch32) | Multimodal |
| **CLIP-ViT-B/16** | [OpenAI](https://openai.com/research/clip)<br>[Hugging Face](https://huggingface.co/openai/clip-vit-base-patch16) | Multimodal |
| **CLIP-ViT-L/14** | [OpenAI](https://openai.com/research/clip)<br>[Hugging Face](https://huggingface.co/openai/clip-vit-large-patch14) | Multimodal |
| **CLIP-ViT-L/14-336** | [OpenAI](https://openai.com/research/clip)<br>[Hugging Face](https://huggingface.co/openai/clip-vit-large-patch14-336) | Multimodal |
| **CLIP-RN50** | [OpenAI](https://openai.com/research/clip) | Multimodal |
| **CLIP-RN101** | [OpenAI](https://openai.com/research/clip) | Multimodal |
| **CLIP-RN50x4** | [OpenAI](https://openai.com/research/clip) | Multimodal |
| **CLIP-RN50x16** | [OpenAI](https://openai.com/research/clip) | Multimodal |
| **CLIP-RN50x64** | [OpenAI](https://openai.com/research/clip) | Multimodal |
| **DINO-V2-GIANT** | [Hugging Face](https://huggingface.co/facebook/dinov2-giant) | Image |
| **ALL-MiniLM-L6-v2** | [Hugging Face](https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2) | Text |
| **ALL-MPNET-BASE-V2** | [Hugging Face](https://huggingface.co/sentence-transformers/all-mpnet-base-v2) | Text |
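Independently of this toolbox, the Hugging Face checkpoints listed above can also be loaded directly with the `transformers` library, e.g.:

```python
from transformers import CLIPModel, CLIPProcessor

# Any Hugging Face CLIP checkpoint from the table works here.
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")
```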
---
## Quick Start
**Fine-tuning a CLIP model on ImageNet** is as simple as:
```bash
python vlm_toolbox/scripts/train.py \
    --dataset_name imagenet1k \
    --backbone_name vit_b_32 \
    --trainer_name clip \
    --model_type few_shot \
    --setup_type full \
    --num_epochs 100 \
    --train_batch_size 64 \
    --eval_batch_size 256 \
    --precision_dtype fp16 \
    --source huggingface \
    --main_metric_name accuracy \
    --random_state 42 \
    --device_type cuda \
    --collate_all_m2_samples False \
    --save_predictions True
```
This command uses a ViT-B/32 CLIP model from Hugging Face, automatically logs progress, and stores prediction outputs for later review.
---
## Usage
### Running Experiments
You can also import this toolbox as a library for more advanced or **custom** experimentation. Here’s a minimal code example illustrating how to set up a multimodal pipeline:
```python
from config.enums import (
    CLIPBackbones,
    ImageDatasets,
    Metrics,
    Trainers,
)
from config.setup import Setup
from pipeline.pipeline import Pipeline
from util.memory import flush

# 1. Define your setup
setup = Setup(
    dataset_name=ImageDatasets.IMAGENET_1K,
    backbone_name=CLIPBackbones.CLIP_VIT_B_32,
    trainer_name=Trainers.CLIP,
    model_type='few_shot',
    setup_type='full',
    num_epochs=100,
    train_batch_size=64,
    eval_batch_size=256,
    precision_dtype='fp16',
    main_metric_name=Metrics.ACCURACY,
    random_state=42,
    device_type='cuda',
)

# 2. Initialize the pipeline
pipeline = Pipeline(setup, device_type='cuda')

# 3. Run the training
pipeline.run(
    collate_all_m2_samples=False,
    save_predictions=True,
    persist=True,
)

# 4. Clean up
pipeline.tear_down()
flush()
```
> **Note**: The toolbox treats multiple data inputs as modalities: `m1` and `m2`. This modular design makes it easy to extend support for text, image, video, or other data streams.
---
### Adding New Models
One key strength of this repository is its **extensibility**. Integrating your own model is straightforward:
1. **Add Your Model to an Enum**
Extend `ImageBackbones` or `CLIPBackbones` in
[`enums.py`](vlm_toolbox/config/enums.py):
```python
class ImageBackbones(BaseEnum):
    DINO_V2_GIANT = 'dino_v2_giant'
    NEW_IMAGE_MODEL = 'new_image_model'
```
2. **Specify the Model URL**
Update [`backbones.py`](vlm_toolbox/config/backbones.py):
```python
class BackboneURLConfig(BaseConfig):
    config = {
        Backbones.IMAGE: {
            ImageBackbones.NEW_IMAGE_MODEL: {
                Sources.HUGGINGFACE: 'new/image-model-url',
            },
        },
        ...
    }
```
3. **Train & Evaluate**
Reference your new model from the command line or from your Python code, as in the example below. Your model is now part of the Vision-Language Models Toolbox!
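For instance, reusing the Quick Start flags (a sketch: `new_image_model` is the enum value added in step 1; the remaining values are illustrative):

```bash
python vlm_toolbox/scripts/train.py \
    --dataset_name imagenet1k \
    --backbone_name new_image_model \
    --source huggingface \
    --device_type cuda
```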
---
### Adding a New Dataset
Similar to adding new models, you can integrate additional datasets seamlessly:
1. **Extend the `ImageDatasets` Enum**
In [`enums.py`](vlm_toolbox/config/enums.py), add:
```python
class ImageDatasets(BaseEnum):
    IMAGENET_1K = 'imagenet1k'
    FOOD101 = 'food101'
    ...
    MY_NEW_DATASET = 'my_new_dataset'
```
2. **Add Configuration**
In [`image_datasets.py`](vlm_toolbox/config/image_datasets.py), define:
```python
ImageDatasetConfig.config = {
    ...
    ImageDatasets.MY_NEW_DATASET: {
        'splits': ['train', 'validation'],
        DataStatus.RAW: {
            'path': 'HuggingFaceM4/MYNEW',
            'type': StorageType.HUGGING_FACE,
        },
        DataStatus.EMBEDDING: {
            'path': '/path/to/embeddings/my_new_dataset',
            'type': StorageType.DISK,
        },
        'id_col': 'my_label_column_name',
    },
}
```
3. **Validate Paths**
If using a local folder, ensure `StorageType.IMAGE_FOLDER` or `StorageType.DISK` is set, and that the path exists.
4. **Reference the Dataset**
Use `my_new_dataset` from the command line or in your code, as in the example below, and you're all set. The dataset is now recognized and processed just like any other!
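For example, re-running the Quick Start command against the new dataset (`my_new_dataset` is the enum value added in step 1; the remaining flags mirror the Quick Start and are illustrative):

```bash
python vlm_toolbox/scripts/train.py \
    --dataset_name my_new_dataset \
    --backbone_name vit_b_32 \
    --trainer_name clip \
    --source huggingface \
    --device_type cuda
```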
---
## Jupyter Notebooks
For deeper experimentation and visualization, explore our **Jupyter notebooks** in the [`notebooks`](notebooks) directory:
- **[Zero-Shot Image Classification with CLIP](notebooks/evaluate/zero_shot.ipynb)**
Demonstrates example usage and evaluation for zero-shot scenarios.
- **[Embedding Distribution Visualization](notebooks/analytics/embedding_distribution.ipynb)**
Compare embeddings via t-SNE, PCA, and more; a standalone t-SNE sketch follows this list.
- **[Multi-Granular Performance on ImageNet](notebooks/analytics/multi_granular_performance.ipynb)**
Assess model accuracy at different class hierarchical levels.
- **[Misclassification Error Analysis](notebooks/analytics/sample_analysis.ipynb)**
Gain insights into where and why the model misclassifies.
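As a taste of the embedding-distribution notebook, a minimal t-SNE projection with scikit-learn might look like this (the `embeddings` array is a random stand-in for real model outputs):

```python
import numpy as np
from sklearn.manifold import TSNE

# Stand-in for image or text embeddings (500 samples, 512-dim).
embeddings = np.random.randn(500, 512).astype(np.float32)

# Project to 2-D; downstream, scatter-plot and color by class label.
coords = TSNE(n_components=2, perplexity=30, random_state=42).fit_transform(embeddings)
print(coords.shape)  # (500, 2)
```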
---
## Installation
**1. (Optional) Create a Conda Environment**
```bash
conda create -n vlm python=3.9
conda activate vlm
```
**2. Install from Source**
```bash
git clone https://github.com/deepmancer/vlm-toolbox.git
cd vlm-toolbox
pip install -e .
```
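A quick smoke test (assuming the package imports under the name `vlm_toolbox`):
```bash
python -c "import vlm_toolbox"
```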
For more detailed instructions (e.g., installing separate packages individually), see [SETUP.md](SETUP.md).
---
## Acknowledgments
This project benefits from the work of several open-source repositories. We acknowledge and appreciate their contributions to the research community:
- **[OpenAI CLIP](https://github.com/openai/CLIP)**
- **[CoOp](https://github.com/KaiyangZhou/CoOp)**
- **[ProText](https://github.com/muzairkhattak/ProText)**
- **[CuPL](https://github.com/sarahpratt/CuPL)**
---
## Contributing
Contributions, suggestions, and new ideas are **highly appreciated**!
- **Submit Issues & PRs**: If you find bugs or have feature requests, open an [issue](https://github.com/deepmancer/vlm-toolbox/issues) or a pull request.
- **Spread the Word**: Star the repo and share your results to help grow the community.
For direct inquiries, feel free to reach out via email:
**alirezaheidari dot cs at gmail dot com**
---
## License
This project is under the [BSD 3-Clause License](LICENSE).
Use it freely, modify it, and share your improvements under the same terms.