<p align="center">
  <img src="https://raw.githubusercontent.com/deepmancer/vlm-toolbox/main/assets/vision-language-models-toolbox-logo.png"
       alt="VLM Toolbox Logo" width="80%">
</p>

<p align="center">
  <img src="https://img.shields.io/badge/PyTorch-%23EE4C2C.svg?style=for-the-badge&logo=PyTorch&logoColor=white" alt="PyTorch Badge">
  <img src="https://img.shields.io/badge/Python-3670A0?style=for-the-badge&logo=Python&logoColor=ffdd54" alt="Python Badge">
  <img src="https://img.shields.io/badge/Jupyter-F37626.svg?&style=for-the-badge&logo=Jupyter&logoColor=white" alt="Jupyter Notebook Badge">
  <img src="https://img.shields.io/badge/License-BSD_3--Clause-blue.svg?style=for-the-badge" alt="BSD 3-Clause License">
</p>

<p align="center">
  <em>A PyTorch-powered library for accelerating multimodal AI research with Vision-Language Models</em>
</p>

# Vision-Language Models Toolbox

A flexible, all-in-one PyTorch library that streamlines research and development with state-of-the-art vision-language models. Whether you’re experimenting with soft-prompt tuning (e.g., CoOp, CoCoOp) or large-scale models such as CLIP, this toolbox provides a robust foundation built on PyTorch and Hugging Face Transformers.

---

## Table of Contents

- [Key Features](#key-features)
- [Supported Models](#supported-models)
- [Quick Start](#quick-start)
- [Usage](#usage)
  - [Running Experiments](#running-experiments)
  - [Adding New Models](#adding-new-models)
  - [Adding a New Dataset](#adding-a-new-dataset)
- [Jupyter Notebooks](#jupyter-notebooks)
- [Installation](#installation)
- [Acknowledgments](#acknowledgments)
- [Contributing](#contributing)
- [License](#license)

---
## Key Features

| **Feature**                 | **Description**                                                                                        |
|-----------------------------|--------------------------------------------------------------------------------------------------------|
| **Multimodal Datasets**     | Supports **ImageNet1k, CIFAR-100, Stanford Cars, iNaturalist 2021, MSCOCO Captions**, and more.         |
| **Model Flexibility**       | Works with **CLIP (ViT & ResNet), DINO-V2, MiniLM, MPNet**, and also allows adding custom models.       |
| **Custom Objectives/Tasks** | Quickly add new tasks or losses with minimal code changes for all combined vision-language flows.       |
| **Prompt Tuning**           | Supports **soft prompts (CoOp, CoCoOp) and predefined hard prompts** for dataset adaptation.            |
| **Scalability & Precision** | Supports **multi-GPU, mixed precision (FP16, BF16, FP32, FP64), sharding, and DeepSpeed**.              |
| **Sampling Strategies**     | Includes **oversampling, undersampling, and hybrid methods** like **SMOTE, ADASYN, and Tomek Links**.   |
| **Data Augmentation**       | Provides **image and text augmentations** for model training.                                           |
| **Evaluation Metrics**      | Tracks **accuracy, precision, recall, F1-score, AUC-ROC, and more**.                                    |
| **Logging & Visualization** | Supports **TensorBoard & Loguru** for monitoring and debugging.                                         |
| **Flexible API**            | **Pre-built modules & functionalities** for datasets, models, tasks, setups, and more.                  |
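Most of these options surface as arguments on the `Setup` object introduced under [Running Experiments](#running-experiments) below. The sketch here is illustrative only: the argument names follow that example, while the `bf16` value and the batch sizes are assumptions based on the table above.

```python
# Illustrative sketch, not a prescribed configuration: argument names follow the
# Setup call shown under "Running Experiments"; the 'bf16' value and batch sizes
# are assumptions based on the precision/scalability options listed above.
from config.enums import CLIPBackbones, ImageDatasets, Metrics, Trainers
from config.setup import Setup

setup = Setup(
    dataset_name=ImageDatasets.IMAGENET_1K,
    backbone_name=CLIPBackbones.CLIP_VIT_B_32,
    trainer_name=Trainers.CLIP,
    model_type='few_shot',
    setup_type='full',
    precision_dtype='bf16',             # mixed precision (assumed value)
    train_batch_size=128,               # scale batching to your hardware
    eval_batch_size=512,
    main_metric_name=Metrics.ACCURACY,  # metric tracked during training
    device_type='cuda',
)
```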
---

## Supported Models

| Backbone              | Supported Provider(s)                                                                                                   | Modality   |
|------------------------|--------------------------------------------------------------------------------------------------------------------------|------------|
| **CLIP-ViT-B/32**      | [OpenAI](https://openai.com/research/clip)<br>[Hugging Face](https://huggingface.co/openai/clip-vit-base-patch32)         | Multimodal |
| **CLIP-ViT-B/16**      | [OpenAI](https://openai.com/research/clip)<br>[Hugging Face](https://huggingface.co/openai/clip-vit-base-patch16)         | Multimodal |
| **CLIP-ViT-L/14**      | [OpenAI](https://openai.com/research/clip)<br>[Hugging Face](https://huggingface.co/openai/clip-vit-large-patch14)        | Multimodal |
| **CLIP-ViT-L/14-336**  | [OpenAI](https://openai.com/research/clip)<br>[Hugging Face](https://huggingface.co/openai/clip-vit-large-patch14-336)    | Multimodal |
| **CLIP-RN50**          | [OpenAI](https://openai.com/research/clip)                                                                                 | Multimodal |
| **CLIP-RN101**         | [OpenAI](https://openai.com/research/clip)                                                                                 | Multimodal |
| **CLIP-RN50x4**        | [OpenAI](https://openai.com/research/clip)                                                                                 | Multimodal |
| **CLIP-RN50x16**       | [OpenAI](https://openai.com/research/clip)                                                                                 | Multimodal |
| **CLIP-RN50x64**       | [OpenAI](https://openai.com/research/clip)                                                                                 | Multimodal |
| **DINO-V2-GIANT**      | [Hugging Face](https://huggingface.co/facebook/dinov2-giant)                                                               | Image      |
| **ALL-MiniLM-L6-v2**   | [Hugging Face](https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2)                                              | Text       |
| **ALL-MPNET-BASE-V2**  | [Hugging Face](https://huggingface.co/sentence-transformers/all-mpnet-base-v2)                                             | Text       |

---

## Quick Start

**Fine-tuning a CLIP model on ImageNet** is as simple as:

```bash
python vlm_toolbox/scripts/train.py \
    --dataset_name imagenet1k \
    --backbone_name vit_b_32 \
    --trainer_name clip \
    --model_type few_shot \
    --setup_type full \
    --num_epochs 100 \
    --train_batch_size 64 \
    --eval_batch_size 256 \
    --precision_dtype fp16 \
    --source huggingface \
    --main_metric_name accuracy \
    --random_state 42 \
    --device_type cuda \
    --collate_all_m2_samples False \
    --save_predictions True
```

This command fine-tunes a ViT-B/32 CLIP model from Hugging Face, automatically logs progress, and stores prediction outputs for later review.

---

## Usage

### Running Experiments

You can also import this toolbox as a library for more advanced or **custom** experimentation. Here’s a minimal example showing how to set up a multimodal pipeline:

```python
from config.enums import (
    CLIPBackbones,
    ImageDatasets,
    Trainers,
    Sources,
    Metrics,
    Stages,
)
from pipeline.pipeline import Pipeline
from config.setup import Setup
from util.memory import flush

# 1. Define your setup
setup = Setup(
    dataset_name=ImageDatasets.IMAGENET_1K,
    backbone_name=CLIPBackbones.CLIP_VIT_B_32,
    trainer_name=Trainers.CLIP,
    model_type='few_shot',
    setup_type='full',
    num_epochs=100,
    train_batch_size=64,
    eval_batch_size=256,
    precision_dtype='fp16',
    main_metric_name=Metrics.ACCURACY,
    random_state=42,
    device_type='cuda',
)

# 2. Initialize the pipeline
pipeline = Pipeline(setup, device_type='cuda')

# 3. Run the training
pipeline.run(
    collate_all_m2_samples=False,
    save_predictions=True,
    persist=True,
)

# 4. Clean up
pipeline.tear_down()
flush()
```

> **Note**: The toolbox treats multiple data inputs as modalities `m1` and `m2`. This modular design makes it easy to extend support for text, image, video, or other data streams.

---

### Adding New Models

One key strength of this repository is its **extensibility**. Integrating your own model is straightforward:

1. **Add Your Model to an Enum**  
   Extend `ImageBackbones` or `CLIPBackbones` in [`enums.py`](vlm_toolbox/config/enums.py):
   ```python
   class ImageBackbones(BaseEnum):
       DINO_V2_GIANT = 'dino_v2_giant'
       NEW_IMAGE_MODEL = 'new_image_model'
   ```

2. **Specify the Model URL**  
   Update [`backbones.py`](vlm_toolbox/config/backbones.py):
   ```python
   class BackboneURLConfig(BaseConfig):
       config = {
           Backbones.IMAGE: {
               ImageBackbones.NEW_IMAGE_MODEL: {
                   Sources.HUGGINGFACE: 'new/image-model-url',
               },
           },
           ...
       }
   ```

3. **Train & Evaluate**  
   Reference your new model from the command line or from your Python code, as in the sketch below. Your model is now part of the VLM Toolbox!
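For instance, once the `NEW_IMAGE_MODEL` entry from steps 1 and 2 exists, it can be selected like any built-in backbone. The snippet below is only a sketch that reuses the `Setup`/`Pipeline` pattern from [Running Experiments](#running-experiments); which `trainer_name` and other settings suit an image-only backbone depends on your task, so treat every value here as an assumption.

```python
# Hypothetical sketch: ImageBackbones.NEW_IMAGE_MODEL is the enum member added in
# step 1. The Setup/Pipeline calls mirror the "Running Experiments" example above;
# the trainer choice and remaining argument values are illustrative assumptions.
from config.enums import ImageBackbones, ImageDatasets, Metrics, Trainers
from config.setup import Setup
from pipeline.pipeline import Pipeline

setup = Setup(
    dataset_name=ImageDatasets.IMAGENET_1K,
    backbone_name=ImageBackbones.NEW_IMAGE_MODEL,  # your newly registered backbone
    trainer_name=Trainers.CLIP,
    model_type='few_shot',
    setup_type='full',
    num_epochs=10,
    main_metric_name=Metrics.ACCURACY,
    device_type='cuda',
)

pipeline = Pipeline(setup, device_type='cuda')
pipeline.run(save_predictions=True)
pipeline.tear_down()
```

Equivalently, `--backbone_name new_image_model` can be passed to `vlm_toolbox/scripts/train.py`, just as `vit_b_32` is in the Quick Start command.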
---

### Adding a New Dataset

Similar to adding new models, you can integrate additional datasets seamlessly:

1. **Extend the `ImageDatasets` Enum**  
   In [`enums.py`](vlm_toolbox/config/enums.py), add:
   ```python
   class ImageDatasets(BaseEnum):
       IMAGENET_1K = 'imagenet1k'
       FOOD101 = 'food101'
       ...
       MY_NEW_DATASET = 'my_new_dataset'
   ```

2. **Add Configuration**  
   In [`image_datasets.py`](vlm_toolbox/config/image_datasets.py), define:
   ```python
   ImageDatasetConfig.config = {
       ...
       ImageDatasets.MY_NEW_DATASET: {
           'splits': ['train', 'validation'],
           DataStatus.RAW: {
               'path': 'HuggingFaceM4/MYNEW',
               'type': StorageType.HUGGING_FACE,
           },
           DataStatus.EMBEDDING: {
               'path': '/path/to/embeddings/my_new_dataset',
               'type': StorageType.DISK,
           },
           'id_col': 'my_label_column_name',
       },
   }
   ```

3. **Validate Paths**  
   If using a local folder, ensure `StorageType.IMAGE_FOLDER` or `StorageType.DISK` is set and that the path exists.

4. **Reference the Dataset**  
   Use `my_new_dataset` in your script or code, as in the sketch below, and you're all set. The dataset is now recognized and processed just like any other!
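For example, once the `MY_NEW_DATASET` entry from steps 1 and 2 is in place, you can point the usual `Setup` at it (or pass `--dataset_name my_new_dataset` to the training script). The values below, other than the new enum member, simply mirror the [Running Experiments](#running-experiments) example and are illustrative:

```python
# Hypothetical sketch: ImageDatasets.MY_NEW_DATASET is the enum member added in
# step 1 and configured in step 2. The remaining arguments mirror the "Running
# Experiments" example; running the pipeline then works exactly as shown there.
from config.enums import CLIPBackbones, ImageDatasets, Metrics, Trainers
from config.setup import Setup

setup = Setup(
    dataset_name=ImageDatasets.MY_NEW_DATASET,   # your newly registered dataset
    backbone_name=CLIPBackbones.CLIP_VIT_B_32,
    trainer_name=Trainers.CLIP,
    model_type='few_shot',
    setup_type='full',
    main_metric_name=Metrics.ACCURACY,
    device_type='cuda',
)
```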
---

## Jupyter Notebooks

For deeper experimentation and visualization, explore our **Jupyter notebooks** in the [`notebooks`](notebooks) directory:

- **[Zero-Shot Image Classification with CLIP](notebooks/evaluate/zero_shot.ipynb)**  
  Demonstrates example usage and evaluation for zero-shot scenarios.

  <p align="center">
    <img src="https://raw.githubusercontent.com/deepmancer/vlm-toolbox/main/assets/figures/top5-preds-prob.png"
         alt="Top 5 Predictions Probability" width="50%">
  </p>
  <p align="center">
    <img src="https://raw.githubusercontent.com/deepmancer/vlm-toolbox/main/assets/figures/zero-shot-od.png"
         alt="Zero-Shot Object Detection Model Output" width="50%">
  </p>

- **[Embedding Distribution Visualization](notebooks/analytics/embedding_distribution.ipynb)**  
  Compare embeddings via t-SNE, PCA, and more.

  <p align="center">
    <img src="https://raw.githubusercontent.com/deepmancer/vlm-toolbox/main/assets/figures/tsne-all-classes.jpg"
         alt="VLM Image & Text Embeddings Visualization" width="50%">
  </p>
  <p align="center">
    <img src="https://raw.githubusercontent.com/deepmancer/vlm-toolbox/main/assets/figures/tsne-top-preds.png"
         alt="Top-k Predictions Image Embedding Visualization" width="50%">
  </p>

- **[Multi-Granular Performance on ImageNet](notebooks/analytics/multi_granular_performance.ipynb)**  
  Assess model accuracy at different levels of the class hierarchy.

  <p align="center">
    <img src="https://raw.githubusercontent.com/deepmancer/vlm-toolbox/main/assets/figures/tree-hierarchy-eval.png"
         alt="Top-k Predictions Visualization on Label Hierarchy" width="50%">
  </p>

- **[Misclassification Error Analysis](notebooks/analytics/sample_analysis.ipynb)**  
  Gain insights into where and why the model misclassifies.

  <p align="center">
    <img src="https://raw.githubusercontent.com/deepmancer/vlm-toolbox/main/assets/figures/gt-heatmap.png"
         alt="Ground Truth Heatmap" width="50%">
  </p>
  <p align="center">
    <img src="https://raw.githubusercontent.com/deepmancer/vlm-toolbox/main/assets/figures/top1-heatmap.png"
         alt="Top-1 Prediction Heatmap" width="50%">
  </p>
  <p align="center">
    <img src="https://raw.githubusercontent.com/deepmancer/vlm-toolbox/main/assets/figures/top5-heatmap.png"
         alt="Top-5 Predictions Heatmap" width="50%">
  </p>

---

## Installation

**1. (Optional) Create a Conda Environment**

```bash
conda create -n vlm python=3.9
conda activate vlm
```

**2. Install From Source**

```bash
git clone https://github.com/deepmancer/vlm-toolbox.git
cd vlm-toolbox
pip install -e .
```

For more detailed instructions (e.g., installing individual packages separately), see [SETUP.md](SETUP.md).
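To verify the install, a quick smoke test is to import a few of the modules used throughout this README. This assumes the editable install puts the repository's top-level packages (`config`, `pipeline`) on your Python path, as the [Running Experiments](#running-experiments) example does; adjust the imports if the package layout differs in your checkout.

```python
# Post-install smoke test (assumption: the editable install exposes the same
# top-level packages imported in the "Running Experiments" example).
from config.enums import CLIPBackbones, ImageDatasets, Metrics
from config.setup import Setup
from pipeline.pipeline import Pipeline

print('vlm-toolbox imports resolved:', CLIPBackbones.CLIP_VIT_B_32, ImageDatasets.IMAGENET_1K)
```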
---

## Acknowledgments

This project benefits from the work of several open-source repositories. We acknowledge and appreciate their contributions to the research community:

- **[OpenAI CLIP](https://github.com/openai/CLIP)**
- **[CoOp](https://github.com/KaiyangZhou/CoOp)**
- **[ProText](https://github.com/muzairkhattak/ProText)**
- **[CuPL](https://github.com/sarahpratt/CuPL)**

---

## Contributing

Contributions, suggestions, and new ideas are **highly appreciated**!

- **Submit Issues & PRs**: If you find bugs or have feature requests, open an [issue](https://github.com/deepmancer/vlm-toolbox/issues) or a pull request.  
- **Spread the Word**: Star the repo and share your results to help grow the community.

For direct inquiries, feel free to reach out via email:

**alirezaheidari dot cs at gmail dot com**

---

## License

This project is released under the [BSD 3-Clause License](LICENSE).  
Use it freely, modify it, and share your improvements under the same terms.