https://github.com/di37/multiclass-image-classification-using-multimodal-llms
A comprehensive comparison of multimodal models (llama3.2-vision, minicpm-v, llava-llama3, llava, llava:13b, and closed-source models) on an animal classification task. This project evaluates how well each model classifies 10 different animal species, ranging from common to rare animals.
artificial-intelligence computer-vision gemini google-generative-ai large-language-models machine-learning natural-language-processing ollama openai python
- Host: GitHub
- URL: https://github.com/di37/multiclass-image-classification-using-multimodal-llms
- Owner: di37
- Created: 2024-12-08T19:47:43.000Z (over 1 year ago)
- Default Branch: main
- Last Pushed: 2024-12-10T21:17:10.000Z (over 1 year ago)
- Last Synced: 2025-03-24T16:41:51.152Z (over 1 year ago)
- Topics: artificial-intelligence, computer-vision, gemini, google-generative-ai, large-language-models, machine-learning, natural-language-processing, ollama, openai, python
- Language: Jupyter Notebook
- Homepage:
- Size: 1.82 MB
- Stars: 8
- Watchers: 1
- Forks: 0
- Open Issues: 1
Metadata Files:
- Readme: README.md
# Evaluating Multimodal LLMs on Image Classification: A Comparative Analysis of Open-Source and Proprietary Models
This project evaluates and compares the performance of various multimodal Large Language Models (LLMs), both open-source and closed-source, on an animal image classification task. The repository demonstrates data sampling, model inference, output normalization, and evaluation of accuracy, precision, recall, and F1 score. It also explores trade-offs in inference time, data handling, and output formatting, showing how different models fare in visual classification.
---
## Key Features
- **Multimodal Image Classification:**
Leverages LLMs that can process visual input to classify a curated set of animal images.
- **Variety of Models:**
Tests both open-source models (e.g., LLaMA variants, minicpm-v) and closed-source models (e.g., Gemini, GPT-4o) to highlight differences in performance, format consistency, and inference speed.
- **Normalization of Outputs:**
Implements post-processing steps to correct misspellings, verbose labels, or truncated predictions, ensuring fair and accurate metric comparisons.
- **Metrics & Visualizations:**
Provides accuracy, precision, recall, and F1 scores, alongside confusion matrices and inference time statistics to offer a complete performance profile.
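
The output normalization listed above can be illustrated with the standard library alone. This is a minimal sketch, not the repository's implementation: the label set `LABELS` and the function `normalize_label` are hypothetical, and fuzzy matching with `difflib` is only one possible way to map verbose, misspelled, or truncated predictions back to a canonical class name.

```python
import difflib
import re

# Hypothetical label set for illustration; not the project's actual classes.
LABELS = ["cat", "dog", "elephant", "okapi", "pangolin"]


def normalize_label(raw, labels=LABELS, cutoff=0.6):
    """Map a raw model output to a canonical label, or None if nothing matches."""
    text = raw.strip().lower()
    # Verbose output: look for a canonical label (optionally plural) as a whole word.
    for label in labels:
        if re.search(rf"\b{re.escape(label)}s?\b", text):
            return label
    # Misspelled or truncated output: fuzzy-match each word against the labels.
    for word in re.findall(r"[a-z]+", text):
        match = difflib.get_close_matches(word, labels, n=1, cutoff=cutoff)
        if match:
            return match[0]
    return None
```

Predictions that still come back as `None` are best reviewed by hand rather than silently counted as errors, since the `cutoff` value is a judgment call.
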
---
## Repository Structure
- **`custom_logger/`**
Contains custom logging utilities that provide consistent, structured logs throughout the codebase.
- **`data/`**
Holds input data files and model-generated results.
- **`results_*.csv`:** Classification outputs for each model family (ollama, gemini, openai).
- **`sampled_animals.csv`:** Lists the subset of animals and images selected for evaluation.
- **`image_classification/`**
Contains core logic for performing image classification using various models, handling prompts, and evaluating outputs.
- **`notebooks/`**
Jupyter notebooks outlining each stage of the workflow:
1. **Data Gathering & Sampling:** Selecting a subset of animal images.
2. **Image Classification:** Running images through all models.
3. **Data Normalization:** Cleaning and standardizing outputs.
4. **Model Evaluation:** Computing metrics, plotting confusion matrices, and analyzing results.
- **`utilities/`**
Provides helper scripts, constants, and command-line utilities that simplify repetitive tasks and support the main codebase.
- **`classify.py`**
A script to run classification across various models, generating CSV results.
- **`.env` & Configuration Files:**
May store environment variables or keys required to access closed-source models.
- **`README.md`** (this file):
A high-level overview of the entire project, guiding users through setup and usage.
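
As an illustration of how a file like `sampled_animals.csv` could be produced, the sketch below draws a fixed number of images per class from a directory tree. It makes several assumptions that are not confirmed by the repository: images are stored as `<root>/<class name>/<image file>`, the CSV columns are named `animal` and `image_path`, and the function names, `per_class`, and `seed` are hypothetical.

```python
import csv
import random
from pathlib import Path

IMAGE_SUFFIXES = {".jpg", ".jpeg", ".png"}


def sample_images(root, per_class=5, seed=42):
    """Pick up to `per_class` images from each class sub-directory of `root`."""
    rng = random.Random(seed)  # a fixed seed keeps the sample reproducible
    rows = []
    for class_dir in sorted(p for p in Path(root).iterdir() if p.is_dir()):
        images = sorted(
            p for p in class_dir.iterdir() if p.suffix.lower() in IMAGE_SUFFIXES
        )
        for image in rng.sample(images, min(per_class, len(images))):
            rows.append({"animal": class_dir.name, "image_path": str(image)})
    return rows


def write_sample(rows, out_path):
    """Write the sampled rows to a CSV file with a header line."""
    with open(out_path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=["animal", "image_path"])
        writer.writeheader()
        writer.writerows(rows)
```
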
---
## Getting Started
**Prerequisites:**
- Python 3.10+
- A virtual environment (recommended)
- Required packages listed in `requirements.txt` (if provided).
**Installation Steps:**
1. Clone the repository:
```bash
git clone https://github.com/di37/multiclass-image-classification-using-multimodal-llms.git
```
2. Change into the project directory:
```bash
cd multiclass-image-classification-using-multimodal-llms
```
3. Set up a virtual environment and install dependencies:
```bash
conda create -n image_classification python=3.10
conda activate image_classification
pip install -r requirements.txt
```
4. Add any necessary API keys or model credentials to your `.env` file.
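
Before running the closed-source models, it can help to confirm that the credentials are actually visible to Python. The check below is a small sketch under two assumptions: the values from `.env` have already been loaded into the process environment (for example with the `python-dotenv` package), and the variable names in `REQUIRED_KEYS` are hypothetical placeholders for whatever names the project's code reads.

```python
import os

# Hypothetical variable names; replace with the names the project actually uses.
REQUIRED_KEYS = ("OPENAI_API_KEY", "GOOGLE_API_KEY")


def missing_keys(required=REQUIRED_KEYS, env=None):
    """Return the names of required variables that are unset or empty."""
    env = os.environ if env is None else env
    return [name for name in required if not env.get(name, "").strip()]
```

Calling `missing_keys()` with no arguments inspects the current environment; an empty list means every listed key is set.
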
---
## Usage
1. **Data Preparation:**
Use `01_Data_Gathering_And_Sampling.ipynb` in `notebooks/` to generate `sampled_animals.csv`.
2. **Classification:**
Run the classification script to process all images:
```bash
python classify.py
```
This will invoke all models (open-source and closed-source) and store results in the `data/` directory.
3. **Normalization:**
Use `03_Data_Normalization_of_Outputs.ipynb` to clean and standardize outputs from models that require it (e.g., Ollama models).
4. **Evaluation:**
Finally, run `04_Models_Evaluation.ipynb` to compute metrics, generate confusion matrices, compare inference times, and produce a comprehensive report of each model’s performance.
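
The classification step above records both a predicted label and an inference time per image. The sketch below shows one way such a loop could be structured; it is not the code in `classify.py`. The function `classify_images`, its parameters, and the result field names are hypothetical, and the model call is passed in as `classify_fn` so that the same loop could wrap a local Ollama model or a hosted API.

```python
import time


def classify_images(image_paths, classify_fn, model_name):
    """Run `classify_fn` on each image and record its label and latency."""
    results = []
    for path in image_paths:
        start = time.perf_counter()
        prediction = classify_fn(path)  # any callable: image path -> label string
        elapsed = time.perf_counter() - start
        results.append(
            {
                "model": model_name,
                "image_path": path,
                "prediction": prediction,
                "inference_time_s": round(elapsed, 4),
            }
        )
    return results
```
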
---
## Results and Interpretation
- **Performance Metrics:**
The evaluation notebooks summarize accuracy, precision, recall, and F1. Additional charts (e.g., confusion matrices, bar plots) are generated to visualize each model’s strengths and weaknesses.
- **Impact of Normalization:**
By comparing pre- and post-normalization results for open-source models, users can see how much minor formatting issues affected the initial metrics and what these models are actually capable of.
- **Trade-Offs:**
Closed-source models may yield perfect results, but at higher latency and possibly with less flexibility. Open-source models run locally and faster, but may need some refinement and prompt tuning.
[Please read the article for in-depth analysis.](https://medium.com/@d.isham.ai93/evaluating-multimodal-llms-on-image-classification-a-comparative-analysis-of-open-source-and-077c5fc8a9d3)
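
For readers who want to see how the reported metrics relate to each other, here is a minimal pure-Python sketch of accuracy with macro-averaged precision, recall, and F1. The macro averaging is an assumption (the README does not state which averaging the notebooks use), and `macro_metrics` is a hypothetical helper, not a function from the repository.

```python
from collections import Counter


def macro_metrics(y_true, y_pred):
    """Return accuracy and macro-averaged precision, recall, and F1."""
    labels = sorted(set(y_true))
    tp, fp, fn = Counter(), Counter(), Counter()
    for truth, pred in zip(y_true, y_pred):
        if truth == pred:
            tp[truth] += 1
        else:
            fp[pred] += 1   # predicted class gains a false positive
            fn[truth] += 1  # true class gains a false negative

    def safe_div(num, den):
        return num / den if den else 0.0

    precisions, recalls, f1s = [], [], []
    for label in labels:
        p = safe_div(tp[label], tp[label] + fp[label])
        r = safe_div(tp[label], tp[label] + fn[label])
        precisions.append(p)
        recalls.append(r)
        f1s.append(safe_div(2 * p * r, p + r))

    n = len(labels)
    return {
        "accuracy": safe_div(sum(tp.values()), len(y_true)),
        "precision": safe_div(sum(precisions), n),
        "recall": safe_div(sum(recalls), n),
        "f1": safe_div(sum(f1s), n),
    }
```
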
---
## Contributing
Contributions are welcome. Please open an issue or submit a pull request if you have improvements, bug fixes, or new features to propose.
---
## Acknowledgments
- **Data:** Sourced from [Kaggle’s animal images dataset](https://www.kaggle.com/datasets/iamsouravbanerjee/animal-image-dataset-90-different-animals).
- **Models:**
- Open-Source: `LLaVA` variants, `Llama` models, and `minicpm-v`
- Closed-Source: Gemini, GPT-4o, etc.
- **Community:** Thanks to the open-source community and model developers for providing the tools and resources enabling this project.
---
*This README provides a roadmap for anyone looking to understand, reproduce, or build upon the project.*