https://github.com/roshaan0/ai-audio-transcriber
End-to-end AI audio transcription, speaker diarization, and summarization pipeline with Urdu/English support using fine-tuned Whisper (LoRA).
- Host: GitHub
- URL: https://github.com/roshaan0/ai-audio-transcriber
- Owner: roshaan0
- Created: 2025-12-20T13:19:02.000Z (6 months ago)
- Default Branch: main
- Last Pushed: 2025-12-20T14:05:37.000Z (6 months ago)
- Last Synced: 2025-12-22T17:18:36.296Z (6 months ago)
- Topics: ai, audio-processing, bart, machine-learning, nlp, pytorch, speech-to-text, streamlit, whisper
- Language: Python
- Homepage:
- Size: 14.6 KB
- Stars: 0
- Watchers: 0
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
# AI Audio Transcriber & Analyzer (Urdu / English)
An end-to-end AI application for transcribing, diarizing, and summarizing mixed-language (Urdu/English) audio and video files.
The system is optimized to run locally on consumer hardware and includes a Whisper model fine-tuned with LoRA to improve the accuracy of its Urdu-script output.
---
## Overview
Automatic speech recognition systems often struggle with low-resource languages such as Urdu, frequently misidentifying Urdu speech as Hindi or transcribing it in the wrong script (Devanagari instead of the Perso-Arabic script Urdu is written in).
This project addresses that limitation by fine-tuning OpenAI Whisper using parameter-efficient techniques and integrating it into a full-stack AI pipeline.
The application supports speaker identification, mixed-language transcription, and automatic summarization through a unified web interface.
---
## Features
- Audio and video upload support (mp4, wav, mkv)
- Speaker diarization (identifying who spoke when)
- Speech-to-text transcription using OpenAI Whisper
- Mixed-language handling for Urdu and English
- Urdu output in its native Perso-Arabic script (conventionally rendered in Nastaliq)
- Automatic text summarization using BART
- Interactive web interface built with Streamlit
- Fully local execution on consumer GPUs (tested on RTX 4050)
---
## System Architecture
Audio / Video Input
->
Speaker Diarization (Pyannote)
->
Speech Transcription (Whisper + LoRA Fine-Tuning)
->
Text Summarization (BART)
->
Streamlit Web Interface
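
The flow above can be sketched as a small orchestration function. This is a minimal illustration, not the project's actual code: the `diarize`, `transcribe`, and `summarize` callables are hypothetical stand-ins for the Pyannote, Whisper + LoRA, and BART stages, and transcribing one speaker turn at a time is an assumption about how the stages connect.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

# (start_seconds, end_seconds, speaker_label), as a diarizer might return them.
Turn = Tuple[float, float, str]


@dataclass
class Utterance:
    start: float
    end: float
    speaker: str
    text: str


def run_pipeline(
    audio_path: str,
    diarize: Callable[[str], List[Turn]],
    transcribe: Callable[[str, float, float], str],
    summarize: Callable[[str], str],
) -> Tuple[List[Utterance], str]:
    """Run diarization, per-turn transcription, then summarization.

    The three callables are hypothetical stand-ins for the model stages;
    they are injected so the flow can be exercised without loading a model.
    """
    utterances = []
    for start, end, speaker in sorted(diarize(audio_path)):
        text = transcribe(audio_path, start, end).strip()
        if text:  # skip turns where nothing was recognized
            utterances.append(Utterance(start, end, speaker, text))

    transcript = "\n".join(f"{u.speaker}: {u.text}" for u in utterances)
    summary = summarize(transcript) if transcript else ""
    return utterances, summary


def format_transcript(utterances: List[Utterance]) -> str:
    """Render utterances as '[mm:ss] SPEAKER: text' lines for display."""
    lines = []
    for u in utterances:
        minutes, seconds = divmod(int(u.start), 60)
        lines.append(f"[{minutes:02d}:{seconds:02d}] {u.speaker}: {u.text}")
    return "\n".join(lines)
```

Passing the stages in as arguments keeps each model swappable and lets the control flow be tested with simple stubs before any weights are downloaded.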
---
## Technology Stack
### Programming Language
- Python 3.10+
### Frameworks and Libraries
- PyTorch
- Hugging Face Transformers
- PEFT (LoRA)
- Pyannote.audio
- Streamlit
- FFmpeg
- Pydub
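
As an illustration of where FFmpeg fits, uploaded video or audio is typically converted to 16 kHz mono WAV before it reaches Whisper, whose feature extractor expects 16 kHz input. The sketch below is an assumption about how that step could look; the function names and the `tmp_audio` scratch directory are hypothetical and not taken from the repository.

```python
import subprocess
from pathlib import Path
from typing import List


def build_ffmpeg_command(src: str, dst: str, sample_rate: int = 16000) -> List[str]:
    """Build an FFmpeg command that extracts mono PCM audio from `src`."""
    return [
        "ffmpeg",
        "-y",                     # overwrite the output file if it exists
        "-i", src,                # input audio or video file
        "-vn",                    # drop any video stream
        "-ac", "1",               # downmix to a single (mono) channel
        "-ar", str(sample_rate),  # resample; Whisper models expect 16 kHz
        "-c:a", "pcm_s16le",      # 16-bit PCM, the usual WAV payload
        dst,
    ]


def extract_audio(src: str, out_dir: str = "tmp_audio") -> Path:
    """Convert an uploaded mp4/mkv/wav file to 16 kHz mono WAV.

    `out_dir` is a hypothetical scratch directory, not a path from the
    project. Requires the `ffmpeg` binary to be available on PATH.
    """
    Path(out_dir).mkdir(parents=True, exist_ok=True)
    dst = Path(out_dir) / (Path(src).stem + "_16k.wav")
    subprocess.run(build_ffmpeg_command(src, str(dst)), check=True)
    return dst
```

Building the argument list separately from running it avoids shell quoting problems with file names that contain spaces and makes the command easy to inspect.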
---
## Models Used
### Transcription Model
- Model: openai/whisper-small
- Architecture: Transformer-based encoder-decoder
- Customization: Fine-tuned using Low-Rank Adaptation (LoRA)
- Objective: Improve Urdu transcription accuracy and reduce Hindi misclassification
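
To make the LoRA customization concrete, here is a hedged sketch of how adapters might be attached to `openai/whisper-small` with PEFT. The rank, alpha, dropout, and target modules are assumed values chosen for illustration; the README does not state the hyperparameters actually used. The small helper shows why the approach is parameter-efficient.

```python
def lora_param_count(d_in: int, d_out: int, rank: int) -> int:
    """Trainable parameters LoRA adds to one d_out x d_in weight matrix.

    LoRA freezes W and learns a low-rank update B @ A, with A of shape
    (rank, d_in) and B of shape (d_out, rank).
    """
    return rank * (d_in + d_out)


def build_lora_whisper(rank: int = 8, alpha: int = 16, dropout: float = 0.05):
    """Wrap openai/whisper-small with LoRA adapters (assumed hyperparameters)."""
    # Imported lazily so the helper above works without these packages.
    from peft import LoraConfig, get_peft_model
    from transformers import WhisperForConditionalGeneration

    base = WhisperForConditionalGeneration.from_pretrained("openai/whisper-small")
    config = LoraConfig(
        r=rank,
        lora_alpha=alpha,
        lora_dropout=dropout,
        bias="none",
        # Adapt the attention query/value projections: an assumed, common choice.
        target_modules=["q_proj", "v_proj"],
    )
    model = get_peft_model(base, config)
    model.print_trainable_parameters()  # reports trainable vs. total counts
    return model


# whisper-small uses a hidden size of 768, so one attention projection
# holds 768 * 768 weights, while a rank-8 adapter on it trains far fewer.
full = 768 * 768
adapter = lora_param_count(768, 768, rank=8)
print(f"full: {full}, LoRA: {adapter}, ratio: {adapter / full:.2%}")
```

Because only the adapter matrices receive gradients, memory use during fine-tuning stays low enough for a consumer GPU, which matches the project's stated goal.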
### Speaker Diarization Model
- Model: pyannote/speaker-diarization-3.1
- Purpose: Detect speaker boundaries and assign speaker labels
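
A minimal sketch of calling this model follows, based on the usage shown on its model card. The `use_auth_token` keyword applies to pyannote.audio 3.x and may be named differently in other releases. The `merge_adjacent_turns` helper and its `max_gap` threshold are hypothetical additions for illustration, not part of the project.

```python
from typing import List, Tuple

Turn = Tuple[float, float, str]  # (start_seconds, end_seconds, speaker_label)


def diarize(audio_path: str, hf_token: str) -> List[Turn]:
    """Run pyannote/speaker-diarization-3.1 and return speaker turns."""
    # Imported lazily: pyannote.audio pulls in PyTorch and downloads weights.
    from pyannote.audio import Pipeline

    pipeline = Pipeline.from_pretrained(
        "pyannote/speaker-diarization-3.1",
        use_auth_token=hf_token,
    )
    diarization = pipeline(audio_path)
    return [
        (turn.start, turn.end, speaker)
        for turn, _, speaker in diarization.itertracks(yield_label=True)
    ]


def merge_adjacent_turns(turns: List[Turn], max_gap: float = 0.5) -> List[Turn]:
    """Merge consecutive turns by the same speaker separated by a short pause.

    `max_gap` (seconds) is an assumed threshold, not a project setting.
    """
    merged: List[Turn] = []
    for start, end, speaker in sorted(turns):
        if merged and merged[-1][2] == speaker and start - merged[-1][1] <= max_gap:
            prev_start, prev_end, _ = merged[-1]
            merged[-1] = (prev_start, max(prev_end, end), speaker)
        else:
            merged.append((start, end, speaker))
    return merged
```

Merging short pauses gives the transcription stage longer segments to work with, which generally helps a model like Whisper that relies on context.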
### Summarization Model
- Model: facebook/bart-large-cnn
- Purpose: Generate concise summaries from long transcriptions
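
BART accepts at most 1024 input tokens, so long transcriptions have to be split before summarization. The sketch below shows one possible approach: chunk by word count, summarize each chunk, and join the results. The 400-word budget and the `max_length`/`min_length` values are assumptions, and the function names are hypothetical rather than taken from the repository.

```python
from typing import List


def chunk_words(text: str, max_words: int = 400) -> List[str]:
    """Split text into consecutive chunks of at most `max_words` words."""
    if max_words <= 0:
        raise ValueError("max_words must be positive")
    words = text.split()
    return [
        " ".join(words[i:i + max_words])
        for i in range(0, len(words), max_words)
    ]


def summarize_long(text: str, max_words: int = 400) -> str:
    """Summarize each chunk with facebook/bart-large-cnn and join the results."""
    from transformers import pipeline  # lazy import: loads PyTorch

    summarizer = pipeline("summarization", model="facebook/bart-large-cnn")
    parts = []
    for chunk in chunk_words(text, max_words):
        result = summarizer(
            chunk,
            max_length=130,
            min_length=30,
            do_sample=False,
            truncation=True,  # guard against chunks that still exceed the limit
        )
        parts.append(result[0]["summary_text"])
    return " ".join(parts)
```

Word count is only a rough proxy for token count, which is why `truncation=True` is kept as a safety net.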
---
## Installation and Setup
### 1. Clone the Repository
    git clone https://github.com/roshaan0/ai-audio-transcriber.git
    cd ai-audio-transcriber
### 2. Install Dependencies
    pip install -r requirements.txt
### 3. Hugging Face Token Configuration
This project uses gated Hugging Face models (Pyannote).
A Hugging Face read-access token is required.
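Steps:
1. Create a Hugging Face account.
2. Accept the user conditions on the pyannote/speaker-diarization-3.1 model page and on pyannote/segmentation-3.0, which it depends on.
3. Generate a read-access token.
4. Create a file named `hf_token.txt` in the project root.
5. Paste your token inside the file.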
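
For illustration, the application could read the token along these lines. This is a sketch under the assumption that the file holds only the token; the `load_hf_token` name is hypothetical and not taken from the repository.

```python
from pathlib import Path


def load_hf_token(path: str = "hf_token.txt") -> str:
    """Read the Hugging Face token from a text file in the project root."""
    token_file = Path(path)
    if not token_file.is_file():
        raise FileNotFoundError(
            f"{path} not found; create it and paste your read-access token inside."
        )
    # Strip the trailing newline and spaces that editors commonly add.
    token = token_file.read_text(encoding="utf-8").strip()
    if not token:
        raise ValueError(f"{path} is empty")
    return token
```

Since the file contains a credential, it is worth adding `hf_token.txt` to `.gitignore` so it is never committed.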
### 4. Run the Application
    streamlit run app.py
---
## Accuracy and Performance Notes
This project is a proof of concept that prioritizes efficiency over maximum accuracy.
Approximately 65 percent transcription accuracy was achieved using limited training data and consumer-grade hardware.
The same pipeline can be retrained with more data and compute, which should improve accuracy.
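
The README does not say how accuracy was measured. One common way to evaluate a transcription model is word error rate (WER), with word accuracy taken as 1 minus WER. The sketch below implements that metric in plain Python; treating it as the metric behind the figure above is an assumption.

```python
from typing import List


def word_edit_distance(ref: List[str], hyp: List[str]) -> int:
    """Levenshtein distance between two word sequences."""
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion
                curr[j - 1] + 1,     # insertion
                prev[j - 1] + cost,  # substitution or match
            ))
        prev = curr
    return prev[-1]


def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = word-level edit distance divided by reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    return word_edit_distance(ref, hyp) / len(ref)


# One substitution and one deletion against a six-word reference.
wer = word_error_rate("the cat sat on the mat", "the cat sit on mat")
print(f"WER: {wer:.2f}, word accuracy: {1 - wer:.2f}")
```

Splitting on whitespace works for both Urdu and English text, though a real evaluation would usually normalize punctuation and diacritics first.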
---
## Project Scope
This project demonstrates full-stack AI engineering, including:
- Model fine-tuning workflows
- Backend pipeline integration
- Audio preprocessing
- Web-based frontend development
The project was developed for academic and research purposes.
---
## Author
Roshaan Ali
AI and Machine Learning Enthusiast