https://github.com/theseraphim/scribe-forge-ai

🎵 Complete offline audio transcription system with speaker diarization using OpenAI Whisper and PyAnnote. Features automatic audio cleaning, precise timestamps, multiple output formats (JSON/TXT/Markdown), and support for 20+ audio formats. No external APIs required - works entirely offline.

Host: GitHub
URL: https://github.com/theseraphim/scribe-forge-ai
Owner: TheSeraphim
License: other
Created: 2025-05-27T07:06:44.000Z (4 months ago)
Default Branch: main
Last Pushed: 2025-06-16T13:08:56.000Z (4 months ago)
Last Synced: 2025-06-16T14:31:41.810Z (4 months ago)
Topics: audio-analysis, audio-cleaning, audio-processing, audio-transcription, diarization, ffmpeg, huggingface, machine-learning, multi-speaker, nlp, offline-transcription, openai-whisper, pyannote, python, speaker-diarization, speech-recognition, speech-to-text, timestamps, transcription-tool, whisper
Language: Python
Homepage:
Size: 2.28 MB
Stars: 1
Watchers: 0
Forks: 0
Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE.md

![](./scribe-forge.ai.png)

## ☕ Support This Project

Support my work: [coff.ee/theseraphim](https://coff.ee/theseraphim)

# Audio Transcription Tool

Complete system for audio transcription with speaker diarization, using AI models that work entirely offline. Optimized for Windows with intelligent installation and automatic compatibility handling.

## Features

- **Accurate transcription** using OpenAI Whisper models
- **Intelligent speaker diarization** with automatic method selection:
  - Python 3.12+: Resemblyzer (no compilation issues, no tokens needed)
  - Python 3.11-: pyannote.audio (traditional method)
- **Precise timestamps** for every segment and word
- **Automatic audio cleaning** to improve quality
- **Support for multiple formats** (M4A, WAV, MP3, FLAC, etc.)
- **Multiple output types** (JSON, TXT, Markdown)
- **Offline operation** - no external APIs required
- **Optimized GPU acceleration** with CUDA 12.4 support
- **Local model caching** for offline usage
- **Detailed logging** with precise timestamps
- **Zero-config installation** for most scenarios

## Installation (Windows)

### 🚀 One-Click Installation (Recommended)

**Run PowerShell as Administrator** and execute:

```powershell
# Full installation with all features
.\install.ps1

# With model downloads for offline usage
.\install.ps1 -DownloadModels -DownloadDiarizationModels -HuggingFaceToken "hf_your_token"

# CPU-only installation (no CUDA)
.\install.ps1 -NoGPU

# Skip speaker diarization if not needed
.\install.ps1 -SkipDiarization
```

**The installation script automatically:**
- ✅ Detects your Python version and chooses the best compatibility approach
- ✅ Installs optimized PyTorch with CUDA 12.4 for maximum GPU performance
- ✅ Handles all dependency conflicts and compilation issues
- ✅ Sets up speaker diarization without HuggingFace token hassles (Python 3.12+)
- ✅ Downloads models locally for offline usage (optional)
- ✅ Creates a ready-to-use virtual environment

### Installation Options

```powershell
# Installation script options
.\install.ps1 [OPTIONS]

Options:
  -SkipDependencies           # Skip system tools (Python, FFmpeg, etc.)
  -NoGPU                      # Force CPU-only PyTorch
  -DownloadModels             # Download Whisper models
  -DownloadDiarizationModels  # Download speaker diarization models locally
  -SkipDiarization            # Skip speaker diarization setup
  -ForceNonAdmin              # Run without Administrator (limited functionality)
  -HuggingFaceToken           # Specify HF token for pyannote models
  -Help                       # Show detailed help
```

### System Requirements

**Minimum:**
- Windows 10/11
- PowerShell 5.1+
- Administrator privileges (for installation)
- 4GB RAM
- 2GB disk space

**Recommended:**
- Windows 11
- 16GB RAM
- NVIDIA GPU with 6GB+ VRAM (automatically detected)
- 10GB free disk space

### Manual Installation (Advanced Users)

If you prefer manual control:

```powershell
# 1. Create virtual environment
python -m venv venv
venv\Scripts\Activate.ps1

# 2. Install dependencies
pip install -r requirements.txt

# 3. For GPU acceleration (optional)
pip install torch torchaudio --index-url https://download.pytorch.org/whl/cu124

# 4. For speaker diarization (Python 3.12+)
pip install resemblyzer scikit-learn

# 5. For speaker diarization (Python 3.11-)
pip install pyannote.audio
```

## Models Used

### Whisper (Transcription)
Whisper models are downloaded automatically on first use or with `-DownloadModels`:

| Model    | Parameters | VRAM  | Relative speed | Quality   |
|----------|------------|-------|----------------|-----------|
| tiny     | 39 M       | ~1GB  | ~32x           | Basic     |
| base     | 74 M       | ~1GB  | ~16x           | Good      |
| small    | 244 M      | ~2GB  | ~6x            | Very good |
| medium   | 769 M      | ~5GB  | ~2x            | Excellent |
| large-v3 | 1550 M     | ~10GB | ~1x            | Best      |
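
As a rough illustration of how the VRAM column above can guide the `--model-size` choice, here is a minimal sketch. The `pick_model` helper and the `MODEL_VRAM_GB` mapping are hypothetical and not part of this project; the numbers are simply the approximate values from the table.

```python
# Approximate VRAM needs in GB, copied from the table above.
# Ordered from smallest to largest model.
MODEL_VRAM_GB = {
    "tiny": 1,
    "base": 1,
    "small": 2,
    "medium": 5,
    "large-v3": 10,
}


def pick_model(available_vram_gb: float) -> str:
    """Return the largest model whose approximate VRAM need fits (hypothetical helper)."""
    best = "tiny"  # fall back to the smallest model if nothing fits
    for name, need_gb in MODEL_VRAM_GB.items():  # dicts keep insertion order
        if need_gb <= available_vram_gb:
            best = name
    return best


print(pick_model(6))   # medium
print(pick_model(12))  # large-v3
```

The result can then be passed to `--model-size`. Treat it as a starting point only, since actual memory use also depends on audio length and other running processes.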

### Speaker Diarization

**Resemblyzer** (Python 3.12+, default):
- **Advantages**: No compilation, no tokens needed, works immediately
- **Size**: ~50MB
- **Requirements**: None beyond installation script

**PyAnnote** (Python 3.11-, optional):
- **Model**: `pyannote/speaker-diarization-3.1`
- **Size**: ~300MB
- **Requirements**: HuggingFace token and model acceptance

## Usage

### Basic Examples

```powershell
# Activate virtual environment first
venv\Scripts\Activate.ps1

# Simple transcription
python main.py input.m4a -o output --format txt

# With speaker diarization
python main.py input.m4a -o output --format md --diarize

# High quality with GPU acceleration
python main.py input.m4a -o output --format json --model-size large-v3 --clean-audio --diarize

# Force CPU usage
python main.py input.m4a -o output --format txt --device cpu
```

### Full Parameters

```bash
python main.py INPUT_FILE -o OUTPUT_PATH [OPTIONS]

Required arguments:
  INPUT_FILE                Input audio file
  -o, --output OUTPUT_PATH  Output path (without extension)

Options:
  --format {json,txt,md}    Output format (default: txt)
  --model-size {tiny,base,small,medium,large,large-v2,large-v3}
                            Whisper model size (default: base)
  --diarize                 Enable speaker diarization
  --language LANG           Audio language (auto-detect if not specified)
  --device {auto,cpu,cuda}  Processing device (default: auto)
  --download-models         Download models before processing
  --log-level {DEBUG,INFO,WARNING,ERROR}
                            Logging level (default: INFO)
  --clean-audio             Apply audio cleaning
```
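
For readers who want to script against this interface, the options above can be mirrored with `argparse`. This is only a sketch that reproduces the documented flags and defaults; it is not the project's actual `main.py`, and `build_parser` is a hypothetical name.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Mirror the documented command-line options (illustrative only)."""
    parser = argparse.ArgumentParser(prog="main.py")
    parser.add_argument("input_file", metavar="INPUT_FILE", help="Input audio file")
    parser.add_argument("-o", "--output", required=True, metavar="OUTPUT_PATH",
                        help="Output path (without extension)")
    parser.add_argument("--format", choices=["json", "txt", "md"], default="txt")
    parser.add_argument("--model-size", default="base",
                        choices=["tiny", "base", "small", "medium",
                                 "large", "large-v2", "large-v3"])
    parser.add_argument("--diarize", action="store_true")
    parser.add_argument("--language", default=None, metavar="LANG")
    parser.add_argument("--device", choices=["auto", "cpu", "cuda"], default="auto")
    parser.add_argument("--download-models", action="store_true")
    parser.add_argument("--log-level", default="INFO",
                        choices=["DEBUG", "INFO", "WARNING", "ERROR"])
    parser.add_argument("--clean-audio", action="store_true")
    return parser


args = build_parser().parse_args(["input.m4a", "-o", "output", "--format", "md", "--diarize"])
print(args.format, args.model_size, args.diarize)  # md base True
```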

### Practical Examples

```bash
# Italian meeting with multiple speakers
python main.py meeting.m4a -o meeting_transcript --format md --diarize --language it --clean-audio

# English interview, JSON output for analysis
python main.py interview.wav -o interview_data --format json --model-size medium --diarize

# Long podcast with best quality
python main.py podcast.mp3 -o podcast_transcript --format txt --model-size large-v3 --log-level DEBUG
```
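
To transcribe a whole folder, the same command can be generated once per file. The sketch below only builds the command lines; the helper names, the `recordings` and `transcripts` folder names, and the `*.m4a` pattern are assumptions made for illustration.

```python
import sys
from pathlib import Path
from typing import List


def build_command(audio: Path, out_dir: Path,
                  fmt: str = "md", diarize: bool = True) -> List[str]:
    """Build one main.py invocation; the output name reuses the audio file's stem."""
    cmd = [sys.executable, "main.py", str(audio),
           "-o", str(out_dir / audio.stem), "--format", fmt]
    if diarize:
        cmd.append("--diarize")
    return cmd


def commands_for_folder(folder: Path, out_dir: Path,
                        pattern: str = "*.m4a") -> List[List[str]]:
    """Return one command per matching file, in sorted order."""
    if not folder.is_dir():
        return []
    return [build_command(path, out_dir) for path in sorted(folder.glob(pattern))]


for cmd in commands_for_folder(Path("recordings"), Path("transcripts")):
    print(" ".join(cmd))
```

To actually run the commands, replace the `print` call with `subprocess.run(cmd, check=True)` from the activated virtual environment.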

## Output Structure

### TXT Format
```
Audio Transcription
Generated: 2025-06-17 14:30:00
Language: Italian
Speakers detected: 2

==================================================

[00:00:05] SPEAKER_00: Buongiorno e benvenuti alla nostra riunione.
[00:00:12] SPEAKER_01: Grazie, sono felice di essere qui.
[00:00:18] SPEAKER_00: Iniziamo con il primo punto all'ordine del giorno.
```
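
The line layout above can be reproduced from a segment's start time, speaker, and text. A minimal sketch follows, assuming segments are dictionaries shaped like those in the JSON output; `format_timestamp` and `format_line` are hypothetical helpers, not the project's own functions.

```python
def format_timestamp(seconds: float) -> str:
    """Convert seconds to HH:MM:SS, dropping the fractional part."""
    total = int(seconds)
    hours, remainder = divmod(total, 3600)
    minutes, secs = divmod(remainder, 60)
    return f"{hours:02d}:{minutes:02d}:{secs:02d}"


def format_line(segment: dict) -> str:
    """Render one segment as '[HH:MM:SS] SPEAKER: text' (speaker is optional)."""
    stamp = format_timestamp(segment["start"])
    speaker = segment.get("speaker")
    prefix = f"{speaker}: " if speaker else ""
    return f"[{stamp}] {prefix}{segment['text'].strip()}"


segment = {"start": 5.2, "speaker": "SPEAKER_00",
           "text": "Buongiorno e benvenuti alla nostra riunione."}
print(format_line(segment))
# [00:00:05] SPEAKER_00: Buongiorno e benvenuti alla nostra riunione.
```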

### JSON Format
```json
{
  "metadata": {
    "created_at": "2025-06-17T14:30:00",
    "language": "Italian",
    "has_speakers": true,
    "total_segments": 45,
    "diarization_method": "Resemblyzer"
  },
  "transcription": {
    "text": "Full transcription text...",
    "language": "Italian",
    "segments": [
      {
        "id": 0,
        "start": 5.2,
        "end": 8.7,
        "text": "Buongiorno e benvenuti",
        "speaker": "SPEAKER_00",
        "words": [
          {
            "word": "Buongiorno",
            "start": 5.2,
            "end": 5.8,
            "probability": 0.95
          }
        ]
      }
    ]
  }
}
```
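
Because the JSON output is structured, it is easy to post-process. As an example, the sketch below sums the speaking time per speaker. It relies only on the fields shown above; `speaking_time` is a hypothetical helper and the inline sample stands in for a real output file.

```python
import json
from collections import defaultdict
from typing import Dict


def speaking_time(document: dict) -> Dict[str, float]:
    """Sum segment durations (end - start, in seconds) per speaker label."""
    totals: Dict[str, float] = defaultdict(float)
    for segment in document["transcription"]["segments"]:
        speaker = segment.get("speaker", "UNKNOWN")  # segments may lack a speaker
        totals[speaker] += segment["end"] - segment["start"]
    return dict(totals)


sample = json.loads("""
{
  "transcription": {
    "segments": [
      {"id": 0, "start": 5.2, "end": 8.7, "speaker": "SPEAKER_00", "text": "Buongiorno"},
      {"id": 1, "start": 12.0, "end": 15.0, "speaker": "SPEAKER_01", "text": "Grazie"},
      {"id": 2, "start": 18.0, "end": 22.5, "speaker": "SPEAKER_00", "text": "Iniziamo"}
    ]
  }
}
""")

for speaker, seconds in speaking_time(sample).items():
    print(f"{speaker}: {seconds:.1f}s")
```

With a real file, load the document with `json.load` instead of the inline sample. The file name is an assumption here: since `-o` takes a path without extension, `-o output --format json` presumably writes `output.json`.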

### Markdown Format
```markdown
# Audio Transcription

**Generated:** 2025-06-17 14:30:00
**Language:** Italian
**Speakers:** 2
**Diarization:** Resemblyzer

---

## Transcription with Timestamps

### SPEAKER_00

**00:00:05**: Buongiorno e benvenuti alla nostra riunione.

**00:00:18**: Iniziamo con il primo punto all'ordine del giorno.

### SPEAKER_01

**00:00:12**: Grazie, sono felice di essere qui.
```
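
The Markdown layout groups lines by speaker rather than listing them chronologically. A minimal sketch of that grouping is shown below, assuming each segment carries `start`, `speaker`, and `text`; `to_markdown` and `hhmmss` are hypothetical helpers, and the real writer may differ.

```python
import time
from typing import Dict, List


def hhmmss(seconds: float) -> str:
    """Format seconds as HH:MM:SS (valid for durations under 24 hours)."""
    return time.strftime("%H:%M:%S", time.gmtime(seconds))


def to_markdown(segments: List[dict]) -> str:
    """Group segments under one '### SPEAKER' heading each, in first-seen order."""
    grouped: Dict[str, List[dict]] = {}
    for segment in segments:
        grouped.setdefault(segment["speaker"], []).append(segment)

    lines: List[str] = []
    for speaker, items in grouped.items():
        lines += [f"### {speaker}", ""]
        for segment in items:
            lines += [f"**{hhmmss(segment['start'])}**: {segment['text']}", ""]
    return "\n".join(lines).rstrip() + "\n"


segments = [
    {"start": 5.2, "speaker": "SPEAKER_00", "text": "Buongiorno e benvenuti alla nostra riunione."},
    {"start": 12.0, "speaker": "SPEAKER_01", "text": "Grazie, sono felice di essere qui."},
    {"start": 18.0, "speaker": "SPEAKER_00", "text": "Iniziamo con il primo punto all'ordine del giorno."},
]
print(to_markdown(segments))
```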

## Smart Installation Features

### Automatic Python Version Handling
The installation script detects your Python version and automatically chooses the best approach:

**Python 3.13+**: Uses Resemblyzer for speaker diarization (recommended)
- ✅ No compilation issues
- ✅ No HuggingFace tokens needed
- ✅ Works immediately after installation

**Python 3.11-3.12**: Offers both Resemblyzer and pyannote.audio options
- ✅ Fallback mechanisms for maximum compatibility
- ✅ Handles compilation issues automatically
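
The version-based selection described above can be pictured roughly as follows. This is a sketch under assumptions: the `choose_backend` and `is_installed` helpers are hypothetical, and the preference for pyannote.audio on older Python versions is a guess rather than the installer's documented behavior.

```python
import importlib.util
import sys
from typing import Callable, Optional, Tuple


def is_installed(module: str) -> bool:
    """True if the module can be located without importing it fully."""
    try:
        return importlib.util.find_spec(module) is not None
    except ModuleNotFoundError:  # raised when a parent package is missing
        return False


def choose_backend(version: Tuple[int, int] = sys.version_info[:2],
                   installed: Callable[[str], bool] = is_installed) -> Optional[str]:
    """Pick a diarization module name, or None if neither is available."""
    if version >= (3, 13):
        preferred = ["resemblyzer", "pyannote.audio"]
    else:
        # Assumption: older Pythons try pyannote.audio first, then fall back.
        preferred = ["pyannote.audio", "resemblyzer"]
    for module in preferred:
        if installed(module):
            return module
    return None


print(choose_backend())  # depends on what is installed in the current environment
```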

### GPU Optimization
- **Automatic CUDA detection** and driver compatibility check
- **Optimized PyTorch** installation with CUDA 12.4 support
- **Eliminates Triton warnings** and performance issues
- **CPU fallback** for systems without compatible GPUs
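
The CPU fallback can be illustrated with PyTorch's own `torch.cuda.is_available()` check. The `resolve_device` helper below is a hypothetical sketch of how an `auto` setting might be resolved, not the project's actual logic.

```python
def resolve_device(requested: str = "auto") -> str:
    """Resolve 'auto', 'cpu', or 'cuda' to a concrete device name."""
    if requested not in {"auto", "cpu", "cuda"}:
        raise ValueError(f"Unknown device: {requested!r}")
    if requested == "cpu":
        return "cpu"

    try:
        import torch  # imported lazily so the CPU path works without PyTorch
        has_cuda = torch.cuda.is_available()
    except ImportError:
        has_cuda = False

    if requested == "cuda" and not has_cuda:
        raise RuntimeError("CUDA was requested but is not available")
    return "cuda" if has_cuda else "cpu"


print(resolve_device("auto"))  # "cuda" on a working GPU setup, otherwise "cpu"
```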

### Local Model Caching
```powershell
# Download models for offline usage
.\install.ps1 -DownloadDiarizationModels -HuggingFaceToken "hf_your_token"
```
- Downloads pyannote models to local cache
- Enables completely offline operation
- Bypasses HuggingFace authentication during transcription
- Automatic cache setup and management

## Troubleshooting

### Installation Issues

**1. "Administrator privileges required"**
```
ERROR This script requires Administrator privileges!
```
**Solution**: Run PowerShell as Administrator or use:
```powershell
.\install.ps1 -ForceNonAdmin -SkipDependencies
```

**2. Python version compatibility**
```
WARNING Python 3.13 detected - using Resemblyzer for speaker diarization
```
**Solution**: This is normal! The script automatically uses the best method for your Python version.

**3. CUDA installation issues**
```
WARNING Failed to install CUDA toolkit
```
**Solution**: Use CPU-only mode:
```powershell
.\install.ps1 -NoGPU
```

### Runtime Issues

**1. Speaker diarization not working**
```
INFO No speaker diarization available
```
**Check**: Verify installation completed successfully:
```bash
python -c "import resemblyzer; print('Resemblyzer available')"
```

**2. GPU not detected**
```
INFO Using device: cpu
```
**Solution**: Check CUDA installation:
```bash
python -c "import torch; print(torch.cuda.is_available())"
```

**3. Model download issues**
```
WARNING Model download failed
```
**Solution**: Models download automatically on first use, or use:
```powershell
.\install.ps1 -DownloadModels
```

### Performance Optimization

**For long files (>1 hour):**
- Use `--model-size small` or `base`
- Avoid `--clean-audio` if not needed
- Consider using `--device cpu` for very long files

**For best quality:**
- Use `--model-size large-v3`
- Enable `--clean-audio`
- Ensure GPU acceleration: `--device cuda`

**For maximum speed:**
- Use `--model-size tiny`
- Disable diarization or use Resemblyzer method
- Skip audio cleaning

## Logging

The system uses detailed logging with timestamps:

```
[20250617-143000] INFO - Starting audio transcription process
[20250617-143001] INFO - Using device: cuda
[20250617-143001] INFO - Resemblyzer available - speaker diarization enabled
[20250617-143002] INFO - Processing audio file: meeting.m4a
[20250617-143005] INFO - Loaded audio: 1547.2s, 16000Hz
[20250617-143006] INFO - Loading Whisper model: large-v3
[20250617-143008] INFO - Starting transcription...
[20250617-143045] INFO - Detected language: Italian
[20250617-143046] INFO - Performing speaker diarization...
[20250617-143118] INFO - Diarization completed: 2 speakers detected
[20250617-143119] INFO - Saving output in md format...
[20250617-143120] INFO - ✅ Transcription completed with speaker diarization
```
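
A log layout like the one above can be produced with Python's standard `logging` module. The sketch below is an assumption about how such a format could be configured; the logger name `transcriber` and the `configure_logging` helper are hypothetical.

```python
import logging


def configure_logging(level: str = "INFO") -> logging.Logger:
    """Create a logger that prints '[YYYYMMDD-HHMMSS] LEVEL - message'."""
    handler = logging.StreamHandler()
    handler.setFormatter(logging.Formatter(
        fmt="[%(asctime)s] %(levelname)s - %(message)s",
        datefmt="%Y%m%d-%H%M%S",  # compact timestamp without milliseconds
    ))
    logger = logging.getLogger("transcriber")  # hypothetical logger name
    logger.handlers.clear()   # avoid duplicate lines on repeated calls
    logger.addHandler(handler)
    logger.setLevel(getattr(logging, level))
    logger.propagate = False  # keep the root logger from printing it again
    return logger


log = configure_logging("INFO")
log.info("Starting audio transcription process")
```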

## Supported Audio Formats

- **Input**: M4A, WAV, MP3, FLAC, OGG, WMA, AAC
- **Internal processing**: WAV 16kHz mono
- **Maximum duration**: Limited only by available memory
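
The internal WAV 16kHz mono format corresponds to a standard FFmpeg conversion. The sketch below builds such a command using FFmpeg's `-ac` (channels) and `-ar` (sample rate) options; the `ffmpeg_to_wav_command` helper is hypothetical, and the project may invoke FFmpeg differently.

```python
from pathlib import Path
from typing import List

# Input extensions taken from the list above.
SUPPORTED_EXTENSIONS = {".m4a", ".wav", ".mp3", ".flac", ".ogg", ".wma", ".aac"}


def ffmpeg_to_wav_command(src: str, dst: str) -> List[str]:
    """Build an FFmpeg command converting src to 16 kHz mono audio at dst."""
    if Path(src).suffix.lower() not in SUPPORTED_EXTENSIONS:
        raise ValueError(f"Unsupported input format: {src}")
    return [
        "ffmpeg",
        "-y",            # overwrite the output file if it exists
        "-i", src,       # input file
        "-ac", "1",      # one audio channel (mono)
        "-ar", "16000",  # 16 kHz sample rate
        dst,
    ]


print(" ".join(ffmpeg_to_wav_command("meeting.m4a", "meeting_16k.wav")))
# ffmpeg -y -i meeting.m4a -ac 1 -ar 16000 meeting_16k.wav
```

Running the command requires FFmpeg on the `PATH`, for example via `subprocess.run(cmd, check=True)`.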

## Advanced Configuration

### Environment Variables
```powershell
# For offline pyannote usage (if downloaded locally)
$env:HF_HUB_OFFLINE = "1"

# Force CPU usage
$env:CUDA_VISIBLE_DEVICES = ""

# Custom cache directory
$env:TRANSFORMERS_CACHE = "D:\models\cache"
```

### Custom Diarization Settings
The system chooses the diarization method automatically. To confirm that the diarizer module imports correctly in your environment, run:

```bash
python -c "
import sys
sys.path.append('src')
from diarizer import Diarizer
print('Diarizer using optimal method for your Python version')
"
```

## License

This project uses:
- **Whisper**: MIT License (OpenAI)
- **Resemblyzer**: Apache 2.0 License
- **PyAnnote**: MIT License
- **Other dependencies**: Various open source licenses

See the LICENSE files of the individual dependencies for full details.

---

**🎯 Ready to transcribe? Run `.\install.ps1` as Administrator and you'll be set up in minutes!**
