https://github.com/EbadShabbir/Argus_Resource
- Host: GitHub
- URL: https://github.com/EbadShabbir/Argus_Resource
- Owner: EbadShabbir
- Created: 2025-07-01T18:44:27.000Z (3 months ago)
- Default Branch: main
- Last Pushed: 2025-07-03T07:39:32.000Z (3 months ago)
- Last Synced: 2025-07-11T23:04:14.623Z (3 months ago)
- Language: Jupyter Notebook
- Size: 208 KB
- Stars: 0
- Watchers: 0
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
README
# Argus: Can Argus Judge Them All? Comparing VLMs Across Domains
Argus is a benchmarking framework for evaluating **Vision-Language Models (VLMs)**, notably **CLIP**, **BLIP**, and **LXMERT**, across a diverse set of multimodal tasks and datasets. Drawing inspiration from the all-seeing giant of Greek mythology, Argus systematically examines the strengths, weaknesses, and generalization capabilities of leading VLMs, with a special focus on **cross-domain consistency, efficiency, and real-world deployment** in resource-constrained environments.
---
## Supported Datasets
Argus benchmarks VLMs on five representative datasets, each targeting different aspects of multimodal understanding:
- **COCO**
  Large-scale dataset for object detection, segmentation, and captioning.
- **Flickr30k**
  Image captioning dataset with diverse, real-world scenes.
- **CLEVR**
  Synthetic dataset designed for compositional language and visual reasoning.
- **VCR**
  Visual Commonsense Reasoning dataset for complex image-based QA and rationale selection.
- **Visual Genome**
  Dataset with dense object, attribute, and relationship annotations for fine-grained scene understanding.

Prepare each dataset in the `data/` directory as specified in the provided notebooks; an illustrative layout is sketched below.
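A hypothetical layout (illustrative only; the exact structure expected by each notebook may differ):

```
data/
├── coco/            # images + caption annotations
├── flickr30k/       # images + captions
├── clevr/           # synthetic scenes + questions
├── vcr/             # images + QA/rationale annotations
└── visual_genome/   # images + region/relationship annotations
```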
---
## Vision-Language Models Benchmarked
Argus evaluates the following state-of-the-art VLMs:
| Model | Core Strengths | Key Weaknesses |
|---------|----------------------------------------------------------|-------------------------------|
| **CLIP** | Strong generalization, robust cross-domain consistency | Weaker in generation and structured reasoning |
| **BLIP** | Excels at curated data, strong captioning and retrieval | Variable on less curated domains |
| **LXMERT** | Superior in structured visual reasoning tasks | Less adaptable, higher resource use |

- **CLIP**: Contrastive Language-Image Pretraining; excels at zero-shot generalization via contrastive learning on web-scale image-text pairs.
- **BLIP**: Bootstrapping Language-Image Pre-training; uses a Multimodal mixture of Encoder-Decoder (MED) architecture for unified understanding and generation, leveraging synthetic caption bootstrapping and multi-task objectives.
- **LXMERT**: Transformer-based model with cross-modal attention, optimized for structured reasoning and VQA.
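For orientation, here is a minimal sketch of zero-shot image-text matching with CLIP using the Hugging Face `transformers` library; the checkpoint name, dummy image, and captions are illustrative assumptions, not the repo's exact pipeline.

```python
# Sketch: score one image against several captions with CLIP (Hugging Face).
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.new("RGB", (224, 224), "white")  # placeholder; use a real photo
captions = ["a dog playing fetch", "a plate of food", "a city street at night"]

inputs = processor(text=captions, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    outputs = model(**inputs)

# logits_per_image holds image-to-text similarity scores; softmax gives a ranking.
probs = outputs.logits_per_image.softmax(dim=-1)[0]
for caption, p in zip(captions, probs):
    print(f"{p.item():.3f}  {caption}")
```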
---
## Usage
### 1. Installation
Install dependencies:
```bash
pip install -r requirements.txt
```
**Core dependencies:**
- Python 3.8+
- PyTorch
- Transformers
- OpenAI CLIP
- Additional packages as listed in `requirements.txt`
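A quick sanity check that the core dependencies listed above resolved (this assumes a standard PyTorch/Transformers install):

```python
# Verify the core dependencies import and report the local environment.
import torch
import transformers

print("PyTorch:", torch.__version__)
print("Transformers:", transformers.__version__)
print("CUDA available:", torch.cuda.is_available())
```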
---
### 2. Embedding Extraction
Extract image-text embeddings for your dataset:
```bash
python blip_embedding_extractor.py --input data/coco/ --output embeddings/blip_coco.pkl
python clip_embedding_extractor.py --input data/coco/ --output embeddings/clip_coco.pkl
```
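The extractor scripts themselves live in the repo; as a rough sketch of what embedding extraction involves (checkpoint name, file paths, and the caption-pairing logic here are assumptions, not the actual script):

```python
# Sketch: per-image CLIP embedding extraction saved to a pickle file.
import glob
import os
import pickle

import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

device = "cuda" if torch.cuda.is_available() else "cpu"
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32").to(device).eval()
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

records = []
for path in sorted(glob.glob("data/coco/images/*.jpg")):  # placeholder glob
    image = Image.open(path).convert("RGB")
    caption = "a photo"  # in practice, pair each image with its dataset caption
    inputs = processor(text=[caption], images=image,
                       return_tensors="pt", padding=True).to(device)
    with torch.no_grad():
        out = model(**inputs)
    records.append({
        "path": path,
        "image_embedding": out.image_embeds[0].cpu().numpy(),
        "text_embedding": out.text_embeds[0].cpu().numpy(),
    })

os.makedirs("embeddings", exist_ok=True)
with open("embeddings/clip_coco.pkl", "wb") as f:
    pickle.dump(records, f)
```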
---
### 3. Training and Evaluation
Use the provided notebooks and scripts for model training and evaluation:
- `blip-model.py` – BLIP experiments
- `clip-model.py` – CLIP experiments
- `vcr-blip-model.py` – BLIP on the VCR dataset

Custom workload prediction wrappers:
- `Clever_blip_model.py` – BLIP-based workload prediction
- `Clever_clip_model.py` – CLIP-based workload prediction
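Downstream of these scripts, image-text retrieval from saved embeddings reduces to a cosine-similarity ranking. A hedged sketch, assuming the pickle layout from the extraction sketch above and that caption *i* is the ground truth for image *i*:

```python
# Sketch: Recall@K for image-to-text retrieval from pickled embeddings.
import pickle

import numpy as np

with open("embeddings/clip_coco.pkl", "rb") as f:
    records = pickle.load(f)

img = np.stack([r["image_embedding"] for r in records])
txt = np.stack([r["text_embedding"] for r in records])

# L2-normalise so the dot product equals cosine similarity.
img /= np.linalg.norm(img, axis=1, keepdims=True)
txt /= np.linalg.norm(txt, axis=1, keepdims=True)

sims = img @ txt.T                 # (num images) x (num captions)
ranks = np.argsort(-sims, axis=1)  # best-matching captions first

for k in (1, 5, 10):
    hits = np.any(ranks[:, :k] == np.arange(len(records))[:, None], axis=1)
    print(f"Recall@{k}: {hits.mean():.3f}")
```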
---
## File Structure
| File/Folder | Description |
|---------------------------|--------------------------------------------------------------|
| `Clever_clip&blip/` | CLEVR dataset experiments with CLIP & BLIP |
| `Coco_vlip&clip/` | COCO dataset experiments with BLIP & CLIP |
| `Flickr30k_clip&blip/` | Flickr30k dataset with CLIP & BLIP |
| `VCR_clip&blip/` | VCR dataset experiments with CLIP & BLIP |
| `vgenome_blip&clip/` | Visual Genome dataset with BLIP & CLIP |
| `README.md` | Project documentation |
---
## Experiments & Results
### Key Evaluation Dimensions
- **Prediction Accuracy** (retrieval, captioning, reasoning)
- **Generation Quality** (BLEU, METEOR, CIDEr, SPICE)
- **Computational Efficiency** (latency, memory, FLOPs, throughput; a measurement sketch follows this list)
- **Generalization** using the novel **Cross-Dataset Consistency (CDC)** metric.
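To make the efficiency dimension concrete, a hedged latency-measurement sketch; the checkpoint and dummy inputs are assumptions rather than the benchmark's actual harness.

```python
# Sketch: mean per-inference latency for a VLM forward pass.
import time

import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

device = "cuda" if torch.cuda.is_available() else "cpu"
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32").to(device).eval()
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.new("RGB", (224, 224))  # dummy input; use real batches in practice
inputs = processor(text=["a photo"], images=image, return_tensors="pt").to(device)

def run_once():
    with torch.no_grad():
        model(**inputs)
    if device == "cuda":
        torch.cuda.synchronize()  # wait for the GPU so timings are meaningful

for _ in range(5):  # warm-up iterations
    run_once()

n_runs = 50
start = time.perf_counter()
for _ in range(n_runs):
    run_once()
latency_ms = (time.perf_counter() - start) / n_runs * 1000
print(f"Mean latency: {latency_ms:.1f} ms per forward pass")
```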
### Main Findings

| Task / Metric                 | BLIP           | CLIP           | LXMERT         |
|-------------------------------|----------------|----------------|----------------|
| **Image-Text Retrieval** | Best on COCO, Flickr30k | Strong generalization | Best on Visual Genome |
| **Caption Generation** | Highest BLEU, METEOR, CIDEr | Weakest | Competitive BLEU, lower diversity |
| **Visual Reasoning (CLEVR)** | 89.7% | 84.5% | **96.3%** |
| **Commonsense QA (VCR)**      | 74.8% (Q→A)    | 62.3% (Q→A)    | 70.2% (Q→A), **71.5% (QA→R)** |
| **Efficiency (COCO, latency)**| 120ms | **30ms** | 150ms |
| **CDC (Generalization)**      | 0.76           | **0.92**       | 0.64           |

- **BLIP**: Outperforms in captioning and retrieval, especially on curated datasets.
- **CLIP**: Most robust across domains, with top CDC score (0.92), making it ideal for general-purpose and real-world deployment.
- **LXMERT**: Excels in structured reasoning and VQA, but less efficient and less generalizable.
---
## Cross-Dataset Consistency (CDC) Metric
**Cross-Dataset Consistency (CDC)** measures how reliably a model maintains its performance across multiple datasets.
$$\text{CDC}(M_j) = 1 - \frac{1}{|D|} \sum_{i=1}^{|D|} \left| \frac{a_{i,j} - \bar{a}_j}{\bar{a}_j} \right|$$
**Where:**
- $M_j$: the model being evaluated
- $D$: the set of datasets
- $a_{i,j}$: the accuracy (or relevant metric) of model $M_j$ on dataset $i$
- $\bar{a}_j$: the mean performance of $M_j$ across all datasets

**Interpretation:**
- **CDC = 1:** The model's performance is perfectly consistent across all datasets.
- **Lower CDC values:** Indicate greater performance fluctuation between datasets.

**Industrial relevance:**
CDC highlights and rewards models that deliver *balanced* performance, which is essential for building robust, fair, and reproducible AI systems.
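A direct NumPy reading of the formula above (a sketch; `scores` holds one per-dataset accuracy $a_{i,j}$ for a single model $M_j$, and the example numbers are illustrative only):

```python
# Sketch: Cross-Dataset Consistency (CDC) exactly as defined above.
import numpy as np

def cdc(scores):
    a = np.asarray(scores, dtype=float)
    mean = a.mean()  # \bar{a}_j, the model's mean performance across datasets
    return 1.0 - np.mean(np.abs((a - mean) / mean))

# Illustrative values only: a model with fairly even per-dataset accuracy.
print(round(cdc([0.82, 0.79, 0.85, 0.74, 0.80]), 3))
```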
## Benchmarking Protocol
- Unified preprocessing (image resizing, text tokenization, normalization)
- Batch inference on consistent hardware (NVIDIA A100 GPU, batch size 32)
- Multiple random splits for statistical rigor (see the sketch below)
- No test-time augmentation or ensembling
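For the multiple-random-splits point, a minimal sketch of how reproducible splits might be drawn (split sizes, repeat count, and seeds are assumptions):

```python
# Sketch: fixed-seed random splits for repeated evaluation runs.
import numpy as np

def make_splits(n_samples, n_repeats=3, test_fraction=0.2, base_seed=0):
    splits = []
    for repeat in range(n_repeats):
        rng = np.random.default_rng(base_seed + repeat)  # one seed per repeat
        order = rng.permutation(n_samples)
        cut = int(n_samples * (1 - test_fraction))
        splits.append((order[:cut], order[cut:]))  # (train/val idx, test idx)
    return splits

for train_idx, test_idx in make_splits(1000):
    print(len(train_idx), "train/val", len(test_idx), "test")
```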
---
## Summary Table: VLMs Across Domains
| Model | Retrieval | Captioning | Reasoning | Efficiency | Generalization (CDC) |
|---------|-----------|------------|-----------|------------|----------------------|
| BLIP    | ★★★★★     | ★★★★★      | ★★★☆☆     | ★★☆☆☆      | ★★★☆☆                |
| CLIP    | ★★★★★     | ★★☆☆☆      | ★★☆☆☆     | ★★★★★      | ★★★★★                |
| LXMERT  | ★★★☆☆     | ★★★☆☆      | ★★★★★     | ★☆☆☆☆      | ★★☆☆☆                |
---
## Limitations
- Benchmarks may not capture real-world diversity, noise, or domain-specific constraints.
- No task-specific fine-tuning; results reflect fixed checkpoints.
- Evaluation on edge devices is limited to theoretical analysis due to hardware constraints.
---
## Contributing
Contributions are welcome!
Open issues or submit pull requests for improvements, bug fixes, or new features.
---
## License
MIT License
---
**Contact:**
For questions or collaboration, please open an issue or contact the maintainer.