https://github.com/adityashaarma/asl-interpreter

Real-time ASL interpreter using CNN, OpenCV, and MediaPipe for accessible gesture recognition.
- Host: GitHub
- URL: https://github.com/adityashaarma/asl-interpreter
- Owner: AdityaShaarma
- Created: 2025-05-15T02:19:19.000Z (5 months ago)
- Default Branch: main
- Last Pushed: 2025-05-24T06:13:29.000Z (5 months ago)
- Last Synced: 2025-06-17T08:07:34.654Z (4 months ago)
- Topics: accessibility, asl, cnn, computer-vision, deep-learning, keras, mediapipe, opencv, python, sign-language, tensorflow
- Language: Jupyter Notebook
- Homepage:
- Size: 10.3 MB
- Stars: 1
- Watchers: 1
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
README
# AI Sign Language Interpreter
A computer vision project that interprets American Sign Language (ASL) hand gestures from images with over 99% accuracy. Built to support accessibility and inclusive communication, this model uses deep learning techniques to classify ASL alphabet signs from grayscale images. The project leverages a Convolutional Neural Network (CNN) trained on the [Sign Language MNIST dataset](https://www.kaggle.com/datasets/datamunge/sign-language-mnist) to recognize 24 static ASL letters (excluding J and Z, which require motion).
---
## Dataset Overview
The [Sign Language MNIST dataset](https://www.kaggle.com/datasets/datamunge/sign-language-mnist) is a drop-in replacement for the classic MNIST dataset, designed specifically for hand gesture recognition tasks. It contains:
- **28×28 grayscale images** representing ASL letters
- **27,455 training samples** and **7,172 testing samples**
- **24 class labels** (A–Y, excluding J and Z, as these gestures require motion)

Each row contains a label (0–25, excluding 9 and 25) and 784 pixel values flattened from a 28×28 image.
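As a rough sketch, the CSVs can be loaded and reshaped with pandas; the file names below are assumptions based on the Kaggle download and the project layout, not guaranteed paths:

```python
# Hypothetical file paths; each CSV has a "label" column followed by 784 pixel columns.
import pandas as pd

train_df = pd.read_csv("data/raw/sign_mnist_train.csv")
test_df = pd.read_csv("data/raw/sign_mnist_test.csv")

print(train_df.shape)  # expected (27455, 785): label + 784 pixels

# Recover a single 28x28 image from its flattened row
first_row = train_df.iloc[0]
image = first_row.drop("label").to_numpy().reshape(28, 28)
```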
---
## Exploratory Data Analysis (EDA)
The EDA was essential for understanding the structure, distribution, and quality of the data. Visualizations included:
### Class Distribution
The label distribution is relatively balanced across all 24 classes, so no resampling or class rebalancing was needed.
### Example Sign Gestures
Representative samples of each letter confirm visually distinct patterns, making this dataset suitable for image-based classification.
### Pixel Intensity Heatmap
This plot of average pixel values across all samples shows common gesture patterns and confirms the center-weighted alignment of most hand signs.
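A brief sketch of how these plots could be reproduced with Seaborn and Matplotlib, assuming the `train_df` DataFrame from the loading sketch above (not the exact code in `eda.ipynb`):

```python
import matplotlib.pyplot as plt
import seaborn as sns

# Class distribution across the 24 letters
sns.countplot(x=train_df["label"])
plt.title("Label frequency")
plt.show()

# Average pixel intensity over all training samples
mean_image = train_df.drop(columns="label").to_numpy().mean(axis=0).reshape(28, 28)
plt.imshow(mean_image, cmap="hot")
plt.colorbar()
plt.title("Mean pixel intensity")
plt.show()
```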
---
## Model Selection
As a preliminary step, multiple traditional machine learning models were tested on the raw, flattened pixel data:
- **Random Forest**: ~81% accuracy
- **Histogram-Based Gradient Boosting**: ~78% accuracy
- **K-Nearest Neighbors**: ~81% accuracy (performance plateaued after `k=1`)

Although traditional models demonstrated decent accuracy, their lack of spatial awareness limited performance. These served as helpful baselines to highlight the superiority of CNNs for image classification.
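A minimal scikit-learn sketch of such baselines on the flattened pixels; the hyperparameters here are illustrative and not necessarily those used in `ml_model_testing.ipynb`:

```python
# Baseline models on raw 784-dimensional pixel vectors (uses train_df/test_df from above).
from sklearn.ensemble import HistGradientBoostingClassifier, RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.neighbors import KNeighborsClassifier

X_train = train_df.drop(columns="label").to_numpy()
y_train = train_df["label"].to_numpy()
X_test = test_df.drop(columns="label").to_numpy()
y_test = test_df["label"].to_numpy()

for name, clf in [
    ("Random Forest", RandomForestClassifier(n_estimators=200, n_jobs=-1)),
    ("Hist. Gradient Boosting", HistGradientBoostingClassifier()),
    ("k-NN (k=1)", KNeighborsClassifier(n_neighbors=1)),
]:
    clf.fit(X_train, y_train)
    print(name, accuracy_score(y_test, clf.predict(X_test)))
```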
---
## Data Preprocessing for CNN
- Normalized all pixel values to the range [0, 1]
- Reshaped data to (28, 28, 1) for compatibility with CNNs
- Applied one-hot encoding to labels using `to_categorical` from TensorFlow

The Kaggle dataset was already split into training and test sets, but the data was preprocessed and saved into the `/data/processed/` folder for modeling.
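A short sketch of these preprocessing steps, reusing the `X_train`/`X_test` arrays from the baseline sketch; variable names are illustrative:

```python
from tensorflow.keras.utils import to_categorical

# Scale to [0, 1] and reshape to (N, 28, 28, 1) for the CNN
x_train = X_train.reshape(-1, 28, 28, 1).astype("float32") / 255.0
x_test = X_test.reshape(-1, 28, 28, 1).astype("float32") / 255.0

# Labels run 0-24 (9/J unused), so one-hot vectors have 25 columns
y_train_oh = to_categorical(y_train, num_classes=25)
y_test_oh = to_categorical(y_test, num_classes=25)
```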
---
## CNN Model Architecture
A deep CNN was constructed using TensorFlow’s Keras API, with the following layers:
- **Conv2D (128 filters, 5x5)** with ReLU + **MaxPooling (3x3)**
- **Conv2D (64 filters, 2x2)** with ReLU + **MaxPooling (2x2)**
- **Conv2D (32 filters, 2x2)** with ReLU + **MaxPooling (2x2)**
- **Flatten layer**
- **Dense (512 units)** with ReLU + **Dropout (0.25)**
- **Output Dense (25 units)** with softmax activation

A diagram of the architecture is available in `images/cnn_model_architecture.png`.
This architecture was optimized for efficient feature extraction and classification while maintaining a relatively lightweight model footprint (≈1.2 MB).
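A minimal Keras sketch of this layer stack; padding and other unstated hyperparameters are assumptions, so exact parameter counts may differ from the trained model:

```python
# Illustrative reconstruction of the CNN described above (TensorFlow/Keras).
from tensorflow.keras import layers, models

model = models.Sequential([
    layers.Input(shape=(28, 28, 1)),
    layers.Conv2D(128, (5, 5), activation="relu", padding="same"),  # padding is an assumption
    layers.MaxPooling2D((3, 3)),
    layers.Conv2D(64, (2, 2), activation="relu", padding="same"),
    layers.MaxPooling2D((2, 2)),
    layers.Conv2D(32, (2, 2), activation="relu", padding="same"),
    layers.MaxPooling2D((2, 2)),
    layers.Flatten(),
    layers.Dense(512, activation="relu"),
    layers.Dropout(0.25),
    layers.Dense(25, activation="softmax"),  # labels 0-24; 9 (J) never occurs
])
model.summary()
```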
---
## Model Training and Performance
The model was compiled using the **Adam optimizer** and **categorical crossentropy** as the loss function. It was trained over 35 epochs using an augmented data generator:
```python
ImageDataGenerator(
    width_shift_range=0.2,
    height_shift_range=0.2,
    zoom_range=0.2,
    horizontal_flip=True
)
```

### Training & Validation Metrics
The model achieved a final **test accuracy of 99.55%** with a **loss of just 0.022**, indicating high generalization and minimal overfitting.
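A compile-and-fit sketch under these settings, reusing the tensors and model from the sketches above; the batch size is illustrative:

```python
from tensorflow.keras.preprocessing.image import ImageDataGenerator

# Same augmentation configuration as shown above
datagen = ImageDataGenerator(
    width_shift_range=0.2,
    height_shift_range=0.2,
    zoom_range=0.2,
    horizontal_flip=True,
)

model.compile(optimizer="adam",
              loss="categorical_crossentropy",
              metrics=["accuracy"])

history = model.fit(
    datagen.flow(x_train, y_train_oh, batch_size=128),  # batch size is an assumption
    validation_data=(x_test, y_test_oh),
    epochs=35,
)
```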
---
## Confusion Matrix
The matrix confirms strong performance across all 24 classes, with very few misclassifications. The occasional confusion (e.g., between visually similar signs) was negligible.
---
## Classification Report
The model achieved perfect or near-perfect scores across nearly all classes:
- **Precision**: 0.99–1.00
- **Recall**: 0.98–1.00
- **F1-Score**: 0.97–1.00
- **Overall Accuracy**: **99.55%**

These metrics confirm the model’s suitability for real-time applications and strong robustness across the dataset.
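Both the confusion matrix and this report can be produced with scikit-learn; a brief sketch, assuming the `model`, `x_test`, and integer `y_test` labels from the earlier sketches:

```python
import numpy as np
from sklearn.metrics import classification_report, confusion_matrix

# Predicted class index per test image
y_pred = np.argmax(model.predict(x_test, verbose=0), axis=1)

print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred))
```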
---
## Model Insights
- **Early Convergence**: Major improvements occurred within the first 15–20 epochs.
- **Dropout Effectiveness**: Helped mitigate overfitting without sacrificing accuracy.
- **Augmentation Impact**: Increased generalization on the test set despite synthetic data variations.
- **Class Separation**: Even challenging signs (e.g., T vs. R or M vs. N) were classified almost perfectly.

---
## Business and Social Impact
The AI Sign Language Interpreter addresses a critical communication gap by using computer vision to facilitate interaction between sign language users and non-signers. While built as a deep learning project, it has wide-ranging applications across accessibility, education, enterprise communication, and edge computing.
### Accessibility and Inclusion
Sign language is the primary mode of communication for millions of deaf and hard-of-hearing individuals. Yet, many public-facing and digital systems lack built-in support. This project provides a potential foundation for scalable accessibility solutions:
- **Real-Time ASL Translation**: The model can be integrated into video conferencing platforms to convert ASL gestures into real-time subtitles or speech, making meetings more inclusive.
- **Public Kiosks and Terminals**: Retail, banking, and healthcare kiosks can use this system to support ASL-based interaction, eliminating the need for touchscreens or verbal communication.
- **Wearable Device Integration**: Combined with AR devices or smart glasses, the model can power live gesture-to-text overlays to aid private and autonomous communication.
- **Assistive Technology**: Incorporation into text-to-speech tools enables individuals to speak through gesture recognition, particularly useful for non-verbal users.

### Education and Learning
The system also offers impactful uses in ASL education and language learning environments:
- **Interactive Learning Tools**: Applications built with this model can provide real-time feedback to students learning ASL, enhancing engagement and comprehension.
- **Gamified Platforms for Children**: ASL alphabet recognition can be used to create child-friendly games, making language acquisition more engaging.
- **Support for Special Education**: Educators can deploy this tool to support students with speech impairments or developmental conditions who benefit from visual communication.

### Customer Service and Industry Applications
Communication barriers in customer-facing environments and high-noise workplaces can be reduced using gesture recognition systems:
- **Retail and Healthcare Kiosks**: ASL recognition can empower customers to interact with service points using signs, improving experience for the hearing-impaired.
- **Emergency and Medical Intake**: Hospitals can use ASL interfaces in triage or emergency rooms where human interpreters may not be immediately available.
- **Industrial Communication**: In environments with high noise levels, such as airports, construction sites, or factories, gesture-based input reduces the risk of miscommunication.

### Edge Deployment and Efficiency
The model was built to run efficiently on local hardware without reliance on cloud infrastructure:
- **Edge-Ready Architecture**: With a compact design, the model can run on devices like Raspberry Pi or Jetson Nano, opening doors for deployment in embedded systems.
- **Fast Inference Time**: Predictions occur in under 5 milliseconds per frame, enabling real-time response rates necessary for practical, interactive applications.
- **Offline Functionality**: Since the model runs locally, it supports use cases in remote or low-connectivity regions, making it suitable for NGOs, education, and healthcare in underserved areas.

This project demonstrates how deep learning and computer vision can directly contribute to accessible technology, educational advancement, and inclusive design practices. Its potential spans from academic research to deployable, high-impact real-world tools.
---
## Project Structure
```
AI Sign Language Interpreter/
│
├── data/
│   ├── raw/                      # Original Kaggle datasets (split into train and test)
│   └── processed/                # Scaled and cleaned datasets
│
├── images/                       # Visualizations
│   ├── accuracy_loss_plot.png
│   ├── ASL_images.png
│   ├── cnn_model_architecture.png
│   ├── confusion_matrix.png
│   ├── data_frequency.png
│   └── pixel_intensity.png
│
├── models/
│   └── model.keras               # Trained CNN model
│
├── notebooks/
│   ├── eda.ipynb                 # Data analysis
│   ├── CNN_model.ipynb           # CNN development & evaluation
│   └── ml_model_testing.ipynb    # Traditional model benchmarking
│
├── webcam_app.py                 # Live webcam classification script
└── README.md
```

---
## Deployment
This project includes a live webcam interface for ASL recognition using Streamlit. It can be run locally as a web app:
```
streamlit run webcam_app.py
```

Once running, a browser-based UI allows real-time gesture translation using a webcam feed.
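For illustration only, a stripped-down version of such a loop; this is not the project's `webcam_app.py` (which, per the project description, also involves MediaPipe hand tracking) and it uses a simple center crop plus the 28×28 grayscale input format described above:

```python
# Minimal Streamlit + OpenCV sketch: capture frames, crop, downscale, classify.
import cv2
import numpy as np
import streamlit as st
from tensorflow.keras.models import load_model

LETTERS = "ABCDEFGHIJKLMNOPQRSTUVWXY"  # indices 0-24; J (9) and Z never predicted

model = load_model("models/model.keras")
frame_slot = st.empty()
label_slot = st.empty()

cap = cv2.VideoCapture(0)
while cap.isOpened():
    ok, frame = cap.read()
    if not ok:
        break
    # Crop a central region and convert to the 28x28 grayscale training format
    h, w, _ = frame.shape
    roi = frame[h // 4: 3 * h // 4, w // 4: 3 * w // 4]
    gray = cv2.cvtColor(roi, cv2.COLOR_BGR2GRAY)
    small = cv2.resize(gray, (28, 28)).astype("float32") / 255.0
    probs = model.predict(small.reshape(1, 28, 28, 1), verbose=0)[0]
    label_slot.write(f"Prediction: {LETTERS[int(np.argmax(probs))]}")
    frame_slot.image(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
cap.release()
```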
---
## Future Work
- Extend to full ASL alphabet including dynamic gestures (J and Z) via LSTM or 3D CNNs
- Integrate real-time webcam recognition with GUI overlay
- Explore model compression (pruning, quantization) for mobile deployment
- Add multilingual subtitle translation features
- Expand dataset with more diverse hand shapes, sizes, and skin tones

---
## License
This project is open-sourced under the [CC0: Public Domain License](https://creativecommons.org/publicdomain/zero/1.0/), allowing unrestricted use for educational and commercial purposes.
---
## Acknowledgments
- Dataset: [Sign Language MNIST on Kaggle](https://www.kaggle.com/datasets/datamunge/sign-language-mnist)
- Libraries: TensorFlow, Keras, scikit-learn, pandas, NumPy, Seaborn, Matplotlib, OpenCV, MediaPipe, Streamlit, and the Python standard library modules `statistics` and `collections`