
# AI Audio Transcriber & Analyzer (Urdu / English)

An end-to-end AI application for transcribing, diarizing, and summarizing mixed-language (Urdu/English) audio and video files.
The system is designed to run locally on consumer hardware and includes a Whisper model fine-tuned with LoRA to improve Urdu script accuracy.

---

## Overview

Automatic speech recognition systems often struggle with low-resource languages such as Urdu, frequently misidentifying Urdu speech as Hindi or transcribing it in the wrong script (for example, Devanagari instead of the Perso-Arabic script).
This project addresses that limitation by fine-tuning OpenAI Whisper using parameter-efficient techniques and integrating it into a full-stack AI pipeline.

The application supports speaker identification, mixed-language transcription, and automatic summarization through a unified web interface.

---

## Features

- Audio and video upload support (mp4, wav, mkv)
- Speaker diarization (identifying who spoke when)
- Speech-to-text transcription using OpenAI Whisper
- Mixed-language handling for Urdu and English
- Correct Urdu Nastaliq script output
- Automatic text summarization using BART
- Interactive web interface built with Streamlit
- Fully local execution on consumer GPUs (tested on RTX 4050)
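
The upload step in the list above could be wired up roughly as follows. This is a minimal sketch, not the project's actual `app.py`: the helper names `is_supported_upload` and `render_upload_widget` and the widget label are hypothetical, while `st.file_uploader` and its `type` argument are standard Streamlit.

```python
from pathlib import Path

# Extensions taken from the feature list above.
SUPPORTED_EXTENSIONS = {".mp4", ".wav", ".mkv"}


def is_supported_upload(filename: str) -> bool:
    """Return True if the file extension is one the pipeline accepts."""
    return Path(filename).suffix.lower() in SUPPORTED_EXTENSIONS


def render_upload_widget():
    """Show a Streamlit uploader restricted to the supported formats."""
    # Imported lazily so the pure helper above works without Streamlit installed.
    import streamlit as st

    uploaded = st.file_uploader(
        "Upload audio or video",  # hypothetical label
        type=sorted(ext.lstrip(".") for ext in SUPPORTED_EXTENSIONS),
    )
    # The widget already filters by extension; this second check is defensive.
    if uploaded is not None and not is_supported_upload(uploaded.name):
        st.error(f"Unsupported file type: {uploaded.name}")
        return None
    return uploaded
```

Keeping the extension check in a plain function makes it easy to reuse outside the web interface, for example in a batch script.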

---

## System Architecture

    Audio / Video Input
    ->
    Speaker Diarization (Pyannote)
    ->
    Speech Transcription (Whisper + LoRA Fine-Tuning)
    ->
    Text Summarization (BART)
    ->
    Streamlit Web Interface
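
The README does not say how diarization output and transcription output are joined. One common approach, sketched here as an assumption, is to label each transcribed segment with the speaker whose turn overlaps it the most. The `Turn` and `Segment` types and the `assign_speakers` function are hypothetical names for illustration.

```python
from dataclasses import dataclass


@dataclass
class Turn:
    """A diarization result: one speaker talking from start to end (seconds)."""
    start: float
    end: float
    speaker: str


@dataclass
class Segment:
    """A transcribed chunk of text with its time span (seconds)."""
    start: float
    end: float
    text: str


def overlap(a_start: float, a_end: float, b_start: float, b_end: float) -> float:
    """Length in seconds of the intersection of two intervals (0 if disjoint)."""
    return max(0.0, min(a_end, b_end) - max(a_start, b_start))


def assign_speakers(segments, turns, unknown="UNKNOWN"):
    """Pair each segment's text with the speaker of the most-overlapping turn."""
    labeled = []
    for seg in segments:
        best_speaker, best_overlap = unknown, 0.0
        for turn in turns:
            shared = overlap(seg.start, seg.end, turn.start, turn.end)
            if shared > best_overlap:
                best_speaker, best_overlap = turn.speaker, shared
        labeled.append((best_speaker, seg.text))
    return labeled
```

An alternative design, also consistent with the stage order above, is to cut the audio at turn boundaries first and transcribe each turn separately, which avoids the alignment step at the cost of more model calls.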

---

## Technology Stack

### Programming Language
- Python 3.10+

### Frameworks and Libraries
- PyTorch
- Hugging Face Transformers
- PEFT (LoRA)
- Pyannote.audio
- Streamlit
- FFmpeg
- Pydub
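
As an illustration of the FFmpeg preprocessing step, the sketch below extracts a mono 16 kHz WAV track from an uploaded file, the sample rate Whisper expects. The function names and file paths are hypothetical; the FFmpeg flags (`-y`, `-i`, `-vn`, `-ac`, `-ar`) are standard. It assumes the `ffmpeg` binary is on the `PATH`.

```python
import subprocess


def build_ffmpeg_command(src: str, dst: str, sample_rate: int = 16000) -> list[str]:
    """Build an ffmpeg command that converts src to mono PCM audio at sample_rate."""
    return [
        "ffmpeg",
        "-y",                     # overwrite the output file if it exists
        "-i", src,                # input audio or video file
        "-vn",                    # drop any video stream
        "-ac", "1",               # downmix to a single (mono) channel
        "-ar", str(sample_rate),  # resample to the target rate
        dst,
    ]


def extract_audio(src: str, dst: str) -> None:
    """Run ffmpeg and raise CalledProcessError if the conversion fails."""
    subprocess.run(build_ffmpeg_command(src, dst), check=True)
```

Passing the command as a list rather than a shell string avoids quoting problems with file names that contain spaces.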

---

## Models Used

### Transcription Model
- Model: openai/whisper-small
- Architecture: Transformer-based encoder-decoder
- Customization: Fine-tuned using Low-Rank Adaptation (LoRA)
- Objective: Improve Urdu transcription accuracy and reduce Hindi misclassification
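
The README does not list the LoRA hyperparameters, so the values below (rank 8, `lora_alpha` of 16, dropout 0.05, adapters on `q_proj` and `v_proj`) are assumptions chosen for illustration rather than the author's settings. The sketch shows how a LoRA adapter is typically attached to Whisper with PEFT, and why it is cheap: a rank-`r` adapter on a `d_out x d_in` weight trains only `r * (d_in + d_out)` parameters.

```python
def lora_trainable_params(d_in: int, d_out: int, rank: int) -> int:
    """Parameters added by one LoRA adapter: A is (rank x d_in), B is (d_out x rank)."""
    return rank * (d_in + d_out)


def build_lora_whisper(base_id: str = "openai/whisper-small", rank: int = 8):
    """Wrap a Whisper checkpoint with LoRA adapters (hyperparameters are assumptions)."""
    # Imported lazily so the helper above works without the ML stack installed.
    from peft import LoraConfig, get_peft_model
    from transformers import WhisperForConditionalGeneration

    base = WhisperForConditionalGeneration.from_pretrained(base_id)
    config = LoraConfig(
        r=rank,
        lora_alpha=2 * rank,                  # assumed scaling factor
        target_modules=["q_proj", "v_proj"],  # attention projections in Whisper
        lora_dropout=0.05,                    # assumed
        bias="none",
    )
    model = get_peft_model(base, config)
    model.print_trainable_parameters()  # reports trainable vs. total parameters
    return model
```

For a 768 x 768 projection, `lora_trainable_params(768, 768, 8)` gives 12,288 trainable parameters, compared with 589,824 in the full weight matrix.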

### Speaker Diarization Model
- Model: pyannote/speaker-diarization-3.1
- Purpose: Detect speaker boundaries and assign speaker labels
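
A hedged sketch of how the diarization model can be called. The `Pipeline.from_pretrained` call and `itertracks(yield_label=True)` loop follow the model card for `pyannote/speaker-diarization-3.1`; the `use_auth_token` argument name matches pyannote.audio 3.1 and may differ in other releases. The `merge_adjacent_turns` helper and its `max_gap` threshold are hypothetical additions, not something the README describes.

```python
def merge_adjacent_turns(turns, max_gap: float = 0.5):
    """Merge consecutive turns by the same speaker separated by at most max_gap seconds.

    turns is an iterable of (start, end, speaker) tuples.
    """
    merged = []
    for start, end, speaker in sorted(turns):
        if merged and merged[-1][2] == speaker and start - merged[-1][1] <= max_gap:
            prev_start, prev_end, _ = merged[-1]
            merged[-1] = (prev_start, max(prev_end, end), speaker)
        else:
            merged.append((start, end, speaker))
    return merged


def diarize(audio_path: str, hf_token: str):
    """Run speaker diarization and return merged (start, end, speaker) turns."""
    # Imported lazily so the helper above works without pyannote installed.
    from pyannote.audio import Pipeline

    pipeline = Pipeline.from_pretrained(
        "pyannote/speaker-diarization-3.1",
        use_auth_token=hf_token,  # argument name as of pyannote.audio 3.1
    )
    annotation = pipeline(audio_path)
    turns = [
        (turn.start, turn.end, speaker)
        for turn, _, speaker in annotation.itertracks(yield_label=True)
    ]
    return merge_adjacent_turns(turns)
```

Merging short pauses within one speaker's speech produces fewer, longer chunks, which tends to give the transcription model more context per call.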

### Summarization Model
- Model: facebook/bart-large-cnn
- Purpose: Generate concise summaries from long transcriptions
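
BART accepts at most 1024 input tokens, so long transcriptions have to be split before summarization. The sketch below shows one simple way to do that; it is an assumption about the approach, not the project's code. The `chunk_words` helper and the 400-word chunk size are hypothetical, and the `pipeline("summarization", ...)` call follows the usage on the `facebook/bart-large-cnn` model card, which may vary between Transformers versions.

```python
def chunk_words(text: str, max_words: int = 400) -> list[str]:
    """Split text into consecutive chunks of at most max_words whitespace-separated words."""
    words = text.split()
    return [
        " ".join(words[i:i + max_words])
        for i in range(0, len(words), max_words)
    ]


def summarize(text: str, max_words: int = 400) -> str:
    """Summarize each chunk with BART and join the partial summaries."""
    # Imported lazily so the helper above works without Transformers installed.
    from transformers import pipeline

    summarizer = pipeline("summarization", model="facebook/bart-large-cnn")
    parts = []
    for chunk in chunk_words(text, max_words):
        result = summarizer(chunk, max_length=130, min_length=30, do_sample=False)
        parts.append(result[0]["summary_text"])
    return " ".join(parts)
```

Splitting on word count is a rough proxy for the token limit; a stricter version would count tokens with the model's tokenizer.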

---

## Installation and Setup

### 1. Clone the Repository

    git clone https://github.com/roshaan0/ai-audio-transcriber.git
    cd ai-audio-transcriber

### 2. Install Dependencies

    pip install -r requirements.txt

### 3. Hugging Face Token Configuration

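This project uses gated Hugging Face models (Pyannote).
A Hugging Face read-access token is required, and the account that owns the token must have accepted the user conditions on the gated model pages (pyannote/speaker-diarization-3.1 and the pyannote/segmentation-3.0 model it depends on).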

Steps:

1. Create a Hugging Face account.
2. Generate a read-access token.
3. Create a file named hf_token.txt in the project root.
4. Paste your token inside the file.
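
A minimal sketch of how the application could read the token file created in the steps above. The function name `load_hf_token` is hypothetical; the README only specifies the file name `hf_token.txt` and its location in the project root.

```python
from pathlib import Path


def load_hf_token(path: str = "hf_token.txt") -> str:
    """Read the Hugging Face token from a text file, stripping surrounding whitespace."""
    token_file = Path(path)
    if not token_file.is_file():
        raise FileNotFoundError(
            f"{path} not found. Create it in the project root and paste your token inside."
        )
    token = token_file.read_text(encoding="utf-8").strip()
    if not token:
        raise ValueError(f"{path} is empty.")
    return token
```

Because the file holds a credential, it is worth adding `hf_token.txt` to `.gitignore` so it is never committed.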

### 4. Run the Application

    streamlit run app.py

---

## Accuracy and Performance Notes

This project is a proof of concept that prioritizes efficiency over maximum accuracy.
With limited training data and consumer-grade hardware, the fine-tuned model reached approximately 65 percent transcription accuracy.

The same pipeline can be retrained with more data and compute, which is expected to improve accuracy.
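
The README does not state which metric the accuracy figure refers to. A common choice for speech recognition is word error rate (WER), with accuracy sometimes reported as `1 - WER`; the sketch below assumes that convention purely for illustration. The function name `word_error_rate` is hypothetical.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance between hypothesis and reference, divided by reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the first i-1 reference words
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)
```

For example, `word_error_rate("a b c d", "a x c")` is 0.5: one substitution and one deletion over four reference words.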

---

## Project Scope

This project demonstrates full-stack AI engineering, including:

- Model fine-tuning workflows
- Backend pipeline integration
- Audio preprocessing
- Web-based frontend development

The project was developed for academic and research purposes.

---

## Author

Roshaan Ali
AI and Machine Learning Enthusiast