https://github.com/khushirajurkar/vision-transformer-image-classification

A Vision Transformer (ViT) implementation for image classification using CIFAR-10 dataset, leveraging HuggingFace's Trainer API for computational efficiency

cifar-10 computer-vision data-augmentation deep-learning huggingface image-classification machine-learning model-evaluation neural-networks patch-encoding positional-encoding self-attention trainer-api transfer-learning transformer vision-transformer


Host: GitHub
URL: https://github.com/khushirajurkar/vision-transformer-image-classification
Owner: KhushiRajurkar
License: mit
Created: 2025-01-10T07:43:47.000Z (over 1 year ago)
Default Branch: main
Last Pushed: 2025-07-13T07:59:26.000Z (about 1 year ago)
Last Synced: 2025-10-21T12:44:55.323Z (9 months ago)
Topics: cifar-10, computer-vision, data-augmentation, deep-learning, huggingface, image-classification, machine-learning, model-evaluation, neural-networks, patch-encoding, positional-encoding, self-attention, trainer-api, transfer-learning, transformer, vision-transformer
Language: Jupyter Notebook
Homepage:
Size: 191 KB
Stars: 0
Watchers: 1
Forks: 0
Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE


# Vision-Transformer-Image-Classification
A Vision Transformer (ViT) implementation for image classification on the CIFAR-10 dataset, using HuggingFace's Trainer API for computational efficiency.
# Vision Transformer for Image Classification

## Overview
This repository contains an implementation of the Vision Transformer (ViT), an architecture that applies self-attention to image classification. Unlike traditional CNNs, ViT splits each image into fixed-size patches and processes them as a sequence, so attention can relate any patch to any other and capture global context.
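To make the patch idea concrete, here is a minimal pure-Python sketch of splitting an image into a sequence of flattened patches. The helper name `split_into_patches` and the patch size of 4 are illustrative assumptions, not values taken from this repository. A real ViT would then project each patch linearly and add positional information.

```python
def split_into_patches(image, patch_size):
    """Split an H x W image (nested lists) into flattened, non-overlapping
    patches, returned in row-major order (left to right, top to bottom)."""
    height, width = len(image), len(image[0])
    if height % patch_size or width % patch_size:
        raise ValueError("image dimensions must be divisible by patch_size")
    patches = []
    for top in range(0, height, patch_size):
        for left in range(0, width, patch_size):
            # Flatten one patch_size x patch_size block into a single list.
            patch = [
                image[r][c]
                for r in range(top, top + patch_size)
                for c in range(left, left + patch_size)
            ]
            patches.append(patch)
    return patches


# A 32x32 single-channel image whose pixel value encodes its position.
image = [[row * 32 + col for col in range(32)] for row in range(32)]

# patch_size=4 is a hypothetical choice for illustration only.
patches = split_into_patches(image, patch_size=4)
print(len(patches), len(patches[0]))  # 64 16
```

With a 32x32 input and 4x4 patches, the model would see a sequence of 64 tokens, each holding 16 values per channel.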

## Objective
- To explore the capabilities of the Vision Transformer on the CIFAR-10 dataset.
- To compare its performance with traditional CNN models.
- To implement and evaluate using HuggingFace's Trainer API for improved computational efficiency.

## Methodology
1. **Dataset**: CIFAR-10 (60,000 32x32 color images across 10 classes).
2. **Preprocessing**: Data augmentation and patch embedding for input preparation.
3. **Model Architecture**: Implementation of Vision Transformer with patch encoding and positional encoding.
4. **Training**: Leveraged HuggingFace's Trainer API to streamline training and overcome computational limitations.
5. **Evaluation**: Achieved high accuracy through transfer learning and efficient training.
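The steps above could be wired together roughly as follows. This is a hedged sketch, not the notebook's actual code: the checkpoint name `google/vit-base-patch16-224-in21k`, the Hub dataset id `cifar10` with its `img` and `label` columns, the batch size, the output directory, and the use of the test split for evaluation are all assumptions. Data augmentation is omitted for brevity. Heavy imports are kept inside `build_trainer` so the metric helper can be used on its own.

```python
def compute_metrics(eval_pred):
    """Top-1 accuracy from the (logits, labels) pair that Trainer passes in."""
    logits, labels = eval_pred
    predictions = [max(range(len(row)), key=lambda i: row[i]) for row in logits]
    correct = sum(int(p == y) for p, y in zip(predictions, labels))
    return {"accuracy": correct / len(labels)}


def build_trainer(checkpoint="google/vit-base-patch16-224-in21k", epochs=5):
    """Assemble a Trainer that fine-tunes a pre-trained ViT on CIFAR-10."""
    import torch
    from datasets import load_dataset
    from transformers import (Trainer, TrainingArguments,
                              ViTForImageClassification, ViTImageProcessor)

    dataset = load_dataset("cifar10")  # assumed columns: "img", "label"
    processor = ViTImageProcessor.from_pretrained(checkpoint)

    def transform(batch):
        # Resize and normalize images to what the checkpoint expects.
        inputs = processor([img.convert("RGB") for img in batch["img"]],
                           return_tensors="pt")
        inputs["labels"] = batch["label"]
        return inputs

    def collate(examples):
        return {
            "pixel_values": torch.stack([e["pixel_values"] for e in examples]),
            "labels": torch.tensor([e["labels"] for e in examples]),
        }

    prepared = dataset.with_transform(transform)
    # A fresh 10-way classification head replaces the pre-trained one.
    model = ViTForImageClassification.from_pretrained(checkpoint, num_labels=10)
    args = TrainingArguments(
        output_dir="vit-cifar10",       # hypothetical output path
        num_train_epochs=epochs,
        per_device_train_batch_size=32,  # assumed; tune to available memory
        remove_unused_columns=False,     # keep "img" for the transform
    )
    return Trainer(model=model, args=args,
                   train_dataset=prepared["train"],
                   eval_dataset=prepared["test"],
                   data_collator=collate,
                   compute_metrics=compute_metrics)
```

Calling `build_trainer().train()` followed by `.evaluate()` would run fine-tuning and report accuracy. Both calls download the model and dataset and benefit from a GPU.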

## Results
- **Accuracy**: Reached 98.77% validation accuracy by epoch 5.
- **Efficiency**: Demonstrated the use of pre-trained weights and transfer learning for computationally constrained setups.

## Challenges
Training was limited by the available compute. Fine-tuning pre-trained weights with HuggingFace's Trainer API reduced the training burden while maintaining accuracy.

## Usage
1. Clone the repository:
```bash
git clone https://github.com/KhushiRajurkar/Vision-Transformer-Image-Classification.git
```
