https://github.com/khushirajurkar/vision-transformer-image-classification
A Vision Transformer (ViT) implementation for image classification on the CIFAR-10 dataset, leveraging HuggingFace's Trainer API for computational efficiency.
cifar-10 computer-vision data-augmentation deep-learning huggingface image-classification machine-learning model-evaluation neural-networks patch-encoding positional-encoding self-attention trainer-api transfer-learning transformer vision-transformer
- Host: GitHub
- URL: https://github.com/khushirajurkar/vision-transformer-image-classification
- Owner: KhushiRajurkar
- License: mit
- Created: 2025-01-10T07:43:47.000Z (over 1 year ago)
- Default Branch: main
- Last Pushed: 2025-07-13T07:59:26.000Z (11 months ago)
- Last Synced: 2025-07-13T09:29:58.510Z (11 months ago)
- Topics: cifar-10, computer-vision, data-augmentation, deep-learning, huggingface, image-classification, machine-learning, model-evaluation, neural-networks, patch-encoding, positional-encoding, self-attention, trainer-api, transfer-learning, transformer, vision-transformer
- Language: Jupyter Notebook
- Homepage:
- Size: 191 KB
- Stars: 0
- Watchers: 1
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
README
# Vision Transformer for Image Classification
A Vision Transformer (ViT) implementation for image classification on the CIFAR-10 dataset, leveraging HuggingFace's Trainer API for computational efficiency.
## Overview
This repository contains an implementation of the Vision Transformer (ViT), an architecture that applies self-attention to image classification. Unlike traditional CNNs, which build up context through stacked local convolutions, ViT splits each image into fixed-size patches and processes them as a sequence, so every patch can attend to every other patch and the model captures global context.
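The patch-splitting step can be illustrated with a minimal, dependency-free sketch. It is not taken from the notebook: the function name `split_into_patches`, the single-channel toy image, and the patch size of 4 are assumptions chosen for illustration (pre-trained ViT checkpoints typically use 16x16 patches on larger inputs).

```python
def split_into_patches(image, patch_size):
    """Split an H x W image (nested lists of pixel values) into flattened patches.

    Patches are returned in row-major order, which is the order in which a
    ViT-style model would see them as a sequence.
    """
    height = len(image)
    width = len(image[0])
    if height % patch_size or width % patch_size:
        raise ValueError("image sides must be divisible by patch_size")

    patches = []
    for top in range(0, height, patch_size):
        for left in range(0, width, patch_size):
            # Flatten one patch_size x patch_size window into a single vector.
            patch = [
                image[r][c]
                for r in range(top, top + patch_size)
                for c in range(left, left + patch_size)
            ]
            patches.append(patch)
    return patches


# Toy 32x32 single-channel image whose pixel value encodes its position.
image = [[row * 32 + col for col in range(32)] for row in range(32)]
patches = split_into_patches(image, patch_size=4)
print(len(patches), len(patches[0]))  # 64 16
```

With these assumed sizes, a CIFAR-10-sized image becomes a sequence of 64 patch vectors. In a full model, each vector would then be linearly projected and combined with a positional embedding before entering the transformer encoder.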
## Objective
- To explore the capabilities of Vision Transformer on the CIFAR-10 dataset.
- To compare its performance with traditional CNN models.
- To implement and evaluate using HuggingFace's Trainer API for improved computational efficiency.
## Methodology
1. **Dataset**: CIFAR-10 (60,000 32x32 color images across 10 classes; 50,000 for training and 10,000 for testing).
2. **Preprocessing**: Data augmentation and patch embedding for input preparation.
3. **Model Architecture**: Implementation of Vision Transformer with patch encoding and positional encoding.
4. **Training**: Leveraged HuggingFace's Trainer API to streamline training and overcome computational limitations.
5. **Evaluation**: Measured validation accuracy; fine-tuning pre-trained weights (transfer learning) gave high accuracy within a few epochs.
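The training and evaluation steps above can be sketched roughly as follows. This is a hedged outline, not the notebook's actual code: the checkpoint name `google/vit-base-patch16-224-in21k`, the batch size, the output directory, and the helper name `build_trainer` are assumptions. The datasets passed in are assumed to yield `pixel_values` and `labels` already resized for the checkpoint. The `transformers` import is kept inside the function so the metric helper can be used on its own.

```python
def compute_metrics(eval_pred):
    """Accuracy metric in the shape Trainer expects: (logits, labels) -> dict."""
    logits, labels = eval_pred
    correct = 0
    total = 0
    for row, label in zip(logits, labels):
        # Index of the largest logit is the predicted class.
        predicted = max(range(len(row)), key=lambda i: row[i])
        correct += int(predicted == int(label))
        total += 1
    return {"accuracy": correct / total if total else 0.0}


def build_trainer(train_dataset, eval_dataset, output_dir="vit-cifar10"):
    """Wire a pre-trained ViT into Trainer (all hyperparameters are assumptions)."""
    # Imported lazily so this sketch loads even without transformers installed.
    from transformers import Trainer, TrainingArguments, ViTForImageClassification

    # Assumed checkpoint; the README does not say which weights were used.
    model = ViTForImageClassification.from_pretrained(
        "google/vit-base-patch16-224-in21k",
        num_labels=10,  # CIFAR-10 has 10 classes
    )
    args = TrainingArguments(
        output_dir=output_dir,
        num_train_epochs=5,              # the README reports results by epoch 5
        per_device_train_batch_size=32,  # assumed value
    )
    return Trainer(
        model=model,
        args=args,
        train_dataset=train_dataset,
        eval_dataset=eval_dataset,
        compute_metrics=compute_metrics,
    )
```

Calling `build_trainer(train_ds, val_ds).train()` followed by `.evaluate()` would then run fine-tuning and report the accuracy computed by `compute_metrics`, assuming the datasets are prepared as described.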
## Results
- **Accuracy**: Reached 98.77% validation accuracy by epoch 5.
- **Efficiency**: Demonstrated the use of pre-trained weights and transfer learning for computationally constrained setups.
## Challenges
Computational resources were limited. Using HuggingFace's Trainer API together with pre-trained weights reduced the training burden while maintaining accuracy.
## Usage
1. Clone the repository:
```bash
git clone https://github.com/KhushiRajurkar/Vision-Transformer-Image-Classification.git
```