{"id":28455414,"url":"https://github.com/davidamacey/opentranscribe","last_synced_at":"2026-01-17T01:50:07.515Z","repository":{"id":294141062,"uuid":"986040253","full_name":"davidamacey/OpenTranscribe","owner":"davidamacey","description":"Self-hosted AI-powered transcription platform with speaker diarization, search, and collaboration features. Built with Svelte, FastAPI, and Docker for easy deployment.","archived":false,"fork":false,"pushed_at":"2025-06-18T03:58:56.000Z","size":1516,"stargazers_count":0,"open_issues_count":14,"forks_count":1,"subscribers_count":1,"default_branch":"master","last_synced_at":"2025-06-18T04:35:56.263Z","etag":null,"topics":["ai","audio-processing","docker","fastapi","machine-learning","nlp","open-source","self-hosted","speaker-diarization","speech-to-text","svelte","transcription","video-transcription","whisper"],"latest_commit_sha":null,"homepage":"https://github.com/davidamacey/OpenTranscribe#readme","language":"Svelte","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/davidamacey.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":"docs/CONTRIBUTING.md","funding":null,"license":"LICENSE","code_of_conduct":"docs/CODE_OF_CONDUCT.md","threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":"docs/SECURITY.md","support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null}},"created_at":"2025-05-19T02:43:28.000Z","updated_at":"2025-06-18T03:59:00.000Z","dependencies_parsed_at":"2025-06-18T04:38:31.952Z","dependency_job_id":null,"html_url":"https://github.com/davidamacey/OpenTranscribe","commit_stats":null,"previous_names":["davidamacey/opentranscribe"],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/davidamacey/OpenTranscribe","repository_url":"https://repos.ecosyste
.ms/api/v1/hosts/GitHub/repositories/davidamacey%2FOpenTranscribe","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/davidamacey%2FOpenTranscribe/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/davidamacey%2FOpenTranscribe/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/davidamacey%2FOpenTranscribe/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/davidamacey","download_url":"https://codeload.github.com/davidamacey/OpenTranscribe/tar.gz/refs/heads/master","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/davidamacey%2FOpenTranscribe/sbom","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":261945358,"owners_count":23234236,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["ai","audio-processing","docker","fastapi","machine-learning","nlp","open-source","self-hosted","speaker-diarization","speech-to-text","svelte","transcription","video-transcription","whisper"],"created_at":"2025-06-06T22:00:50.566Z","updated_at":"2026-01-17T01:50:07.462Z","avatar_url":"https://github.com/davidamacey.png","language":"Svelte","funding_links":[],"categories":[],"sub_categories":[],"readme":"\u003cdiv align=\"center\"\u003e\n  \u003cimg src=\"assets/logo-banner.png\" alt=\"OpenTranscribe Logo\" width=\"400\"\u003e\n\n  **AI-Powered Transcription and Media Analysis Platform**\n\u003c/div\u003e\n\nOpenTranscribe is a powerful, containerized web application for transcribing and analyzing audio/video files using 
state-of-the-art AI models. Built with modern technologies and designed for scalability, it provides an end-to-end solution for speech-to-text conversion, speaker identification, and content analysis.\n\n\u003e **Note**: This application is 99.9% written by AI using frontier models from commercial providers, demonstrating the power of AI-assisted development.\n\n## 📸 Quick Look\n\n\u003cp align=\"center\"\u003e\n  \u003cimg src=\"docs-site/static/img/opentranscribe-workflow.gif\" alt=\"OpenTranscribe Workflow\" width=\"800\"\u003e\n\u003c/p\u003e\n\n\u003cp align=\"center\"\u003e\u003cem\u003eComplete workflow: Login → Upload → Process → Transcribe → Speaker Identification → AI Tags \u0026 Collections\u003c/em\u003e\u003c/p\u003e\n\n\u003e 📚 **For detailed screenshots and visual guides**, see the [Complete Documentation](https://docs.opentranscribe.app)\n\n## ✨ Key Features\n\n### 🎧 **Advanced Transcription**\n- **High-Accuracy Speech Recognition**: Powered by WhisperX with faster-whisper backend\n- **Word-Level Timestamps**: Precise timing for every word using WAV2VEC2 alignment\n- **100+ Language Support**: Transcribe in 100+ languages with optional English translation\n- **Configurable Source Language**: Auto-detect or specify source language for improved accuracy\n- **Translation Toggle**: Choose to keep original language or translate non-English audio to English\n- **Language-Aware Alignment**: Indicators show which languages support word-level timestamps (~42 languages)\n- **Batch Processing**: 70x realtime speed with large-v2 model on GPU\n- **Pagination for Large Transcripts**: Efficient display of long transcripts without browser hanging\n- **Audio Waveform Visualization**: Interactive waveform player with precise timing and click-to-seek\n- **Browser Recording**: Built-in microphone recording with real-time audio level monitoring\n- **Recording Controls**: Pause/resume recording with duration tracking and quality settings\n\n### 👥 **Smart Speaker 
Management**\n- **Automatic Speaker Diarization**: Identify different speakers using PyAnnote.audio\n- **Cross-Video Speaker Recognition**: AI-powered voice fingerprinting to identify speakers across different media files\n- **Speaker Profile System**: Create and manage global speaker profiles that persist across all transcriptions\n- **Intelligent Speaker Suggestions**: Consolidated speaker identification with confidence scoring and automatic profile matching\n- **LLM-Enhanced Speaker Recognition**: Content-based speaker identification using conversational context analysis\n- **Profile Embedding Service**: Advanced voice similarity matching using vector embeddings for cross-video speaker linking\n- **Smart Speaker Status Tracking**: Comprehensive speaker verification status with computed fields for UI optimization\n- **Auto-Profile Creation**: Automatic speaker profile creation and assignment when speakers are labeled\n- **Retroactive Speaker Matching**: Cross-video speaker matching with automatic label propagation for high-confidence matches\n- **Custom Speaker Labels**: Edit and manage speaker names and information with intelligent suggestions\n- **Speaker Analytics**: View speaking time distribution, cross-media appearances, and interaction patterns\n- **Speaker Merge UI**: Combine duplicate speakers into single profiles with segment reassignment\n- **Per-File Speaker Settings**: Configure min/max speaker count per upload or reprocess operation\n- **User-Level Speaker Preferences**: Save default speaker detection settings (always prompt, use defaults, use custom values)\n\n### 🎬 **Rich Media Support**\n- **Universal Format Support**: Audio (MP3, WAV, FLAC, M4A) and Video (MP4, MOV, AVI, MKV)\n- **Universal Media URL Support**: Process videos from 1800+ platforms via yt-dlp (YouTube, Dailymotion, Twitter/X, TikTok, and more)\n- **Smart Platform Handling**: User-friendly error messages with platform-specific guidance for authentication-required videos\n- 
**YouTube Playlist Processing**: Extract and queue all videos from playlists for batch transcription\n- **Large File Support**: Upload files up to 4GB for GoPro and high-quality video content\n- **Interactive Media Player**: Click transcript to navigate playback\n- **Custom File Titles**: Edit display names for media files with real-time search index updates\n- **Advanced Upload Manager**: Floating, draggable upload manager with real-time progress tracking\n- **Concurrent Upload Processing**: Multiple file uploads with queue management and retry logic\n- **Intelligent Upload System**: Duplicate detection, hash verification, and automatic recovery\n- **Metadata Extraction**: Comprehensive file information using ExifTool\n- **Subtitle Export**: Generate SRT/VTT files for accessibility\n- **File Reprocessing**: Re-run AI analysis while preserving user comments and annotations\n- **Auto-Recovery System**: Intelligent detection and recovery of stuck or failed file processing\n\n### 🔍 **Powerful Search \u0026 Discovery**\n- **Hybrid Search**: Combine keyword and semantic search capabilities\n- **Full-Text Indexing**: Lightning-fast content search with OpenSearch 3.3.1 (Apache Lucene 10)\n- **9.5x Faster Vector Search**: Significantly improved semantic search performance\n- **25% Faster Queries**: Enhanced full-text search with lower latency\n- **Advanced Filtering**: Filter by speaker, date, tags, duration, and more with searchable dropdowns\n- **Smart Tagging**: Organize content with custom tags and categories\n- **Collections System**: Group related media files into organized collections for better project management\n- **Speaker Usage Counts**: See which speakers appear most frequently across your media library\n\n### 📊 **Analytics \u0026 Insights**\n- **Advanced Content Analysis**: Comprehensive speaker analytics including talk time, interruption detection, and turn-taking patterns\n- **Speaker Performance Metrics**: Speaking pace (WPM), question frequency, and 
conversation flow analysis\n- **Meeting Efficiency Analytics**: Silence ratio analysis and participation balance tracking\n- **Real-Time Analytics Computation**: Server-side analytics computation with automatic refresh capabilities\n- **Cross-Video Speaker Analytics**: Track speaker patterns and participation across multiple recordings\n- **AI-Powered Summarization**: Generate summaries with flexible JSON schemas from custom prompts\n- **BLUF Format Support**: Default Bottom Line Up Front structured summaries with action items\n- **Custom Summary Formats**: Create unlimited AI prompts with ANY JSON structure\n- **Flexible Schema Storage**: JSONB storage supporting multiple prompt types simultaneously\n- **Multi-Provider LLM Support**: Use local vLLM, OpenAI, Ollama, Claude, or OpenRouter for AI features\n- **Intelligent Section Processing**: Automatically handles transcripts of any length using section-by-section analysis\n- **Custom AI Prompts**: Create and manage custom summarization prompts for different content types\n- **LLM Configuration Management**: User-specific LLM settings with encrypted API key storage\n- **Provider Testing**: Test LLM connections and validate configurations before use\n- **Real-Time Topic Extraction**: AI-powered topic extraction with granular progress notifications\n- **LLM Output Language**: Generate AI summaries in 12 different languages (English, Spanish, French, German, etc.)\n- **Model Discovery**: Automatic discovery of available models for vLLM, Ollama, and Anthropic providers\n- **Auto-Cleanup Garbage Segments**: Automatic detection and cleanup of erroneous transcription segments\n\n### 💬 **Collaboration Features**\n- **Time-Stamped Comments**: Add annotations at specific moments\n- **User Management**: Role-based access control (admin/user) with personalized settings\n- **Recording Settings Management**: User-specific audio recording preferences with quality controls\n- **Export Options**: Download transcripts in multiple 
formats\n- **Real-Time Updates**: Live progress tracking with detailed WebSocket notifications\n- **Enhanced Progress Tracking**: 13 granular processing stages with descriptive messages\n- **Smart Notification System**: Persistent notifications with unread count badges and progress updates\n- **WebSocket Integration**: Real-time updates for transcription, summarization, and upload progress\n- **Collection Management**: Create, organize, and share collections of related media files\n- **Smart Error Recovery**: User-friendly error messages with specific guidance and auto-recovery options\n- **Full-Screen Transcript View**: Dedicated modal for reading and searching long transcripts\n- **Auto-Refresh Systems**: Background updates for file status without manual refreshing\n\n### 🎙️ **Recording \u0026 Audio Features**\n- **Browser-Based Recording**: Direct microphone recording with no plugins required\n- **Real-Time Audio Level Monitoring**: Visual audio level feedback during recording\n- **Multi-Device Support**: Choose from available microphone devices\n- **Recording Quality Control**: Configurable bitrate and format settings\n- **Pause/Resume Recording**: Full recording session control with duration tracking\n- **Background Upload Processing**: Seamless integration with upload queue system\n- **Recording Session Management**: Persistent recording state with navigation warnings\n\n### 🤖 **AI-Powered Features**\n- **Comprehensive LLM Integration**: Support for 6+ providers (OpenAI, Claude, vLLM, Ollama, etc.)\n- **Custom Prompt Management**: Create and manage AI prompts for different content types\n- **Encrypted Configuration Storage**: Secure API key storage with user-specific settings\n- **Provider Connection Testing**: Validate LLM configurations before use\n- **Intelligent Content Processing**: Context-aware summarization with section-by-section analysis\n- **BLUF Format Summaries**: Bottom Line Up Front structured summaries with action items\n- **Multi-Model 
Support**: Works with models from 3B to 200B+ parameters\n- **Local \u0026 Cloud Processing**: Support for both local (privacy-first) and cloud AI providers\n\n### ⚡ **Performance \u0026 Scaling**\n- **Multi-GPU Worker Scaling**: Optional parallel processing on dedicated GPUs for high-throughput systems\n- **Specialized Worker Queues**: GPU (transcription), Download (YouTube), CPU (waveform), NLP (AI features)\n- **Parallel Waveform Processing**: CPU-based waveform generation runs simultaneously with GPU transcription\n- **Non-Blocking Architecture**: LLM tasks don't delay next transcription (45-75s faster per 3-hour file)\n- **Configurable Concurrency**: GPU(1-4), CPU(8), Download(3), NLP(4) workers for optimal resource utilization\n- **Enhanced Speaker Detection**: Support for 20+ speakers (can scale to 50+ for large conferences)\n- **Accurate GPU Monitoring**: nvidia-smi integration for real-time system-wide memory tracking\n\n### 📱 **Enhanced User Experience**\n- **Progressive Web App**: Installable app experience with offline capabilities\n- **Responsive Design**: Optimized for desktop, tablet, and mobile devices\n- **UI Internationalization**: Interface available in 8 languages (English, Spanish, French, German, Portuguese, Chinese, Japanese, Russian)\n- **Interactive Waveform Player**: Click-to-seek audio visualization with precise timing\n- **Floating Upload Manager**: Draggable upload interface with real-time progress\n- **Smart Modal System**: Consistent modal design with improved accessibility\n- **Enhanced Data Formatting**: Server-side formatting service for consistent display of dates, durations, and file sizes\n- **Error Categorization**: Intelligent error classification with user-friendly suggestions and retry guidance\n- **Smart Status Management**: Comprehensive file and task status tracking with formatted display text\n- **Auto-Refresh Systems**: Background data updates without manual page refreshing\n- **Theme Support**: Seamless dark/light mode 
switching\n- **Keyboard Shortcuts**: Efficient navigation and control via hotkeys\n- **System Statistics**: CPU, memory, disk, and GPU usage visible to all users\n- **Admin Password Reset**: Secure password reset functionality with validation\n\n## 🛠️ Technology Stack\n\n### **Frontend**\n- **Svelte** - Reactive UI framework with excellent performance\n- **TypeScript** - Type-safe development with modern JavaScript and comprehensive ESLint integration\n- **Progressive Web App** - Offline capabilities and native-like experience\n- **Internationalization (i18n)** - Multi-language UI support with 8 languages\n- **Responsive Design** - Seamless experience across all devices\n- **Advanced UI Components** - Draggable upload manager, modal consistency, and real-time status updates\n- **Code Quality Tooling** - ESLint, TypeScript strict mode, and automated formatting\n\n### **Backend**\n- **FastAPI** - High-performance async Python web framework\n- **SQLAlchemy 2.0** - Modern ORM with type safety\n- **Celery + Redis** - Multi-queue distributed task processing for AI workloads\n  - **GPU Queue** (concurrency=1-4): GPU-intensive transcription and diarization\n  - **Download Queue** (concurrency=3): Parallel YouTube video/playlist downloads\n  - **CPU Queue** (concurrency=8): Waveform generation and audio processing\n  - **NLP Queue** (concurrency=4): LLM API calls and AI features\n  - **Utility Queue** (concurrency=2): Health checks and maintenance tasks\n- **WebSocket** - Real-time communication for live updates\n\n### **AI/ML Stack**\n- **WhisperX** - Advanced speech recognition with 100+ language support\n- **PyAnnote.audio** - Speaker diarization and voice analysis\n- **Faster-Whisper** - Optimized inference engine\n- **Multi-Provider LLM Integration** - Support for vLLM, OpenAI, Ollama, Anthropic Claude, and OpenRouter\n- **Local LLM Support** - Privacy-focused processing with vLLM and Ollama\n- **Intelligent Context Processing** - Section-by-section analysis handles 
unlimited transcript lengths\n- **Universal Model Compatibility** - Works with any model size from 3B to 200B+ parameters\n- **Multilingual AI Output** - Generate summaries in 12 different languages\n- **Model Auto-Discovery** - Automatic detection of available models from vLLM, Ollama, and Anthropic\n\n### **Infrastructure**\n- **PostgreSQL** - Reliable relational database with JSONB support for flexible schemas\n- **MinIO** - S3-compatible object storage\n- **OpenSearch 3.3.1** - Full-text and vector search engine with Apache Lucene 10\n  - 9.5x faster vector search performance\n  - 25% faster queries with lower latency\n  - 75% lower p90 latency for aggregations\n- **Docker** - Containerized deployment with multi-stage builds\n- **NGINX** - Production web server\n- **Complete Offline Support** - Full airgapped/offline deployment capability\n\n## 🚀 Quick Start\n\n### Prerequisites\n\n```bash\n# Required\n- Docker and Docker Compose\n- 8GB+ RAM (16GB+ recommended)\n\n# Recommended for optimal performance\n- NVIDIA GPU with CUDA support\n```\n\n### Quick Installation (Using Docker Hub Images)\n\nRun this one-liner to download and set up OpenTranscribe using our pre-built Docker Hub images:\n\n```bash\ncurl -fsSL https://raw.githubusercontent.com/davidamacey/OpenTranscribe/master/setup-opentranscribe.sh | bash\n```\n\nThen follow the on-screen instructions. The setup script will:\n- Detect your hardware (NVIDIA GPU, Apple Silicon, or CPU)\n- Download the production Docker Compose file\n- Configure environment variables with optimal settings for your hardware\n- **Prompt for your HuggingFace token** (required for speaker diarization)\n- **Automatically download and cache AI models (~2.5GB)** if token is provided\n- Set up the management script (`opentranscribe.sh`)\n\n**⚠️ IMPORTANT - HuggingFace Setup:**\nThe script will prompt you for your HuggingFace token during setup. **BEFORE running the installer:**\n\n1. 
**Get a FREE token:** Visit [https://huggingface.co/settings/tokens](https://huggingface.co/settings/tokens)\n2. **Accept BOTH gated model agreements** (required for speaker diarization):\n   - [pyannote/segmentation-3.0](https://huggingface.co/pyannote/segmentation-3.0) - Click \"Agree and access repository\"\n   - [pyannote/speaker-diarization-3.1](https://huggingface.co/pyannote/speaker-diarization-3.1) - Click \"Agree and access repository\"\n3. **Enter your token** when prompted by the installer\n\nIf you provide a valid token with both model agreements accepted, AI models will be downloaded and cached before Docker starts, ensuring the app is ready to use immediately. If you skip this step, models will download on first use (10-30 minute delay).\n\nOnce setup is complete, start OpenTranscribe with:\n\n```bash\ncd opentranscribe\n./opentranscribe.sh start\n```\n\nThe Docker images are available on Docker Hub as separate repositories:\n- `davidamacey/opentranscribe-backend`: Backend service (also used for celery-worker and flower)\n- `davidamacey/opentranscribe-frontend`: Frontend service\n\nAccess the web interface at http://localhost:5173\n\n### Manual Installation (From Source)\n\n1. **Clone the Repository**\n   ```bash\n   git clone https://github.com/davidamacey/OpenTranscribe.git\n   cd OpenTranscribe\n\n   # Make utility script executable\n   chmod +x opentr.sh\n   ```\n\n2. **Environment Configuration**\n   ```bash\n   # Copy environment template\n   cp .env.example .env\n\n   # Edit .env file with your settings (optional for development)\n   # Key variables:\n   # - HUGGINGFACE_TOKEN (required for speaker diarization)\n   # - GPU settings for optimal performance\n   ```\n\n3. **Start OpenTranscribe**\n   ```bash\n   # Start in development mode (with hot reload)\n   ./opentr.sh start dev\n\n   # Or start in production mode\n   ./opentr.sh start prod\n   ```\n\n4. 
**Access the Application**\n   - 🌐 **Web Interface**: http://localhost:5173\n   - 📚 **API Documentation**: http://localhost:5174/docs\n   - 🌺 **Task Monitor**: http://localhost:5175/flower\n   - 🔍 **Search Engine**: http://localhost:9200\n   - 📁 **File Storage**: http://localhost:9091\n\n## 📋 OpenTranscribe Utility Commands\n\nThe `opentr.sh` script provides comprehensive management for all application operations:\n\n### **Basic Operations**\n```bash\n# Start the application\n./opentr.sh start [dev|prod]     # Start in development or production mode\n./opentr.sh start dev --gpu-scale # Start with multi-GPU scaling (optional)\n./opentr.sh stop                 # Stop all services\n./opentr.sh status               # Show container status\n./opentr.sh logs [service]       # View logs (all or specific service)\n```\n\n### **Multi-GPU Scaling (Optional)**\nFor systems with multiple GPUs, enable parallel GPU workers for significantly increased transcription throughput:\n\n```bash\n# Configure in .env\nGPU_SCALE_ENABLED=true      # Enable multi-GPU scaling\nGPU_SCALE_DEVICE_ID=2       # Which GPU to use (default: 2)\nGPU_SCALE_WORKERS=4         # Number of parallel workers (default: 4)\n\n# Start with GPU scaling\n./opentr.sh start dev --gpu-scale\n./opentr.sh reset dev --gpu-scale\n\n# Example hardware setup:\n# GPU 0: LLM model (vLLM, Ollama)\n# GPU 1: Default single worker (disabled when scaling)\n# GPU 2: 4 parallel workers (processes 4 videos simultaneously)\n```\n\n**Performance:** Process 4 transcriptions simultaneously on a high-end GPU, significantly reducing total processing time for batch uploads.\n\n### **Development Workflow**\n```bash\n# Service management\n./opentr.sh restart-backend      # Restart API and workers without database reset\n./opentr.sh restart-frontend     # Restart frontend only\n./opentr.sh restart-all          # Restart all services without data loss\n\n# Container rebuilding (after code changes)\n./opentr.sh rebuild-backend      # Rebuild 
backend with new code\n./opentr.sh rebuild-frontend     # Rebuild frontend with new code\n./opentr.sh build                # Rebuild all containers\n```\n\n### **Database Management**\n```bash\n# Data operations (⚠️ DESTRUCTIVE)\n./opentr.sh reset [dev|prod]     # Complete reset - deletes ALL data!\n./opentr.sh init-db              # Initialize database without container reset\n\n# Backup and restore\n./opentr.sh backup               # Create timestamped database backup\n./opentr.sh restore [file]       # Restore from backup file\n```\n\n### **System Administration**\n```bash\n# Maintenance\n./opentr.sh clean                # Remove unused containers and images\n./opentr.sh health               # Check service health status\n./opentr.sh shell [service]      # Open shell in container\n\n# Available services: backend, frontend, postgres, redis, minio, opensearch, celery-worker\n```\n\n### **Monitoring and Debugging**\n```bash\n# View specific service logs\n./opentr.sh logs backend         # API server logs\n./opentr.sh logs celery-worker   # AI processing logs\n./opentr.sh logs frontend        # Frontend development logs\n./opentr.sh logs postgres        # Database logs\n\n# Follow logs in real-time\n./opentr.sh logs backend -f\n```\n\n## 🎯 Usage Guide\n\n### **Getting Started**\n\n1. **User Registration**\n   - Navigate to http://localhost:5173\n   - Create an account or use default admin credentials\n   - Set up your profile and preferences\n\n2. **Upload or Record Content**\n   - **File Upload**: Click \"Upload Files\" or drag-and-drop media files (up to 4GB)\n   - **Direct Recording**: Use the microphone button in the navbar for browser-based recording\n   - **URL Processing**: Paste video URLs from 1800+ platforms (YouTube, Dailymotion, Twitter/X, TikTok, etc.)\n   - **Playlist Support**: Import entire YouTube playlists with one URL\n   - Supported formats: MP3, WAV, MP4, MOV, and more\n   - Files are automatically queued for concurrent processing\n\n3. 
**Monitor Processing**\n   - Watch detailed real-time progress with 13 processing stages\n   - Use the floating upload manager for multi-file progress tracking\n   - View task status in Flower monitor or notifications panel\n   - Receive live WebSocket notifications for all status changes\n\n4. **Explore Your Content**\n   - **Interactive Transcript**: Click on transcript text to navigate media playback\n   - **Waveform Player**: Click on audio waveform for precise seeking\n   - **Custom Titles**: Edit file display names for better organization and searchability\n   - **Speaker Management**: Edit speaker names and add custom labels\n   - **AI Summaries**: Generate BLUF format summaries with custom prompts\n   - **Comments**: Add time-stamped comments and annotations\n   - **Collections**: Organize files into themed collections\n   - **Full-Screen View**: Use transcript modal for detailed reading and searching\n\n5. **Configure AI Features** (Optional)\n   - Set up LLM providers in User Settings for AI summarization\n   - Create custom prompts for different content types\n   - Test provider connections before processing\n\n### **Advanced Features**\n\n#### **Recording Workflow**\n```\n🎙️ Device Selection → 📊 Level Monitoring → ⏸️ Session Control → ⬆️ Background Upload\n```\n- Choose from available microphone devices\n- Monitor real-time audio levels during recording\n- Pause/resume recording sessions with duration tracking\n- Seamless integration with background upload processing\n\n#### **AI-Powered Processing**\n```\n🤖 LLM Configuration → 📝 Custom Prompts → 🔍 Content Analysis → 📊 BLUF Summaries\n```\n- Configure multiple LLM providers (OpenAI, Claude, vLLM, Ollama, etc.)\n- Create custom prompts for different content types (meetings, interviews, podcasts)\n- Test provider connections and validate configurations\n- Generate structured summaries with action items and key decisions\n\n#### **Speaker Management**\n```\n👥 Automatic Detection → 🤖 AI Recognition → 🏷️ 
Profile Management → 🔍 Cross-Media Tracking\n```\n- Speakers are automatically detected and assigned labels using advanced AI diarization\n- AI suggests speaker identities based on voice fingerprinting across your media library\n- Create global speaker profiles that persist across all your transcriptions\n- Accept or reject AI suggestions with confidence scores to improve accuracy over time\n- Track speaker appearances across multiple media files with detailed analytics\n\n#### **Advanced Upload Management**\n```\n⬆️ Concurrent Uploads → 📊 Progress Tracking → 🔄 Retry Logic → 📋 Queue Management\n```\n- Floating, draggable upload manager with real-time progress\n- Multiple file uploads with intelligent queue processing\n- Automatic retry logic for failed uploads with exponential backoff\n- Duplicate detection with hash verification\n\n#### **Search and Discovery**\n```\n🔍 Keyword Search → 🧠 Semantic Search → 🏷️ Smart Filtering → 🎯 Waveform Navigation\n```\n- Search transcript content with advanced filters\n- Use semantic search to find related concepts\n- Click-to-seek navigation via interactive waveform visualization\n- Organize content with custom tags and categories\n\n#### **Collections Management**\n```\n📁 Create Collections → 📂 Organize Files → 🏷️ Bulk Operations → 🎯 Inline Editing\n```\n- Group related media files into named collections\n- Inline collection editing with tag-style interface\n- Filter library view by specific collections\n- Bulk add/remove files from collections with drag-and-drop support\n\n#### **Real-Time Notifications**\n```\n🔔 Progress Updates → 📊 Status Tracking → 🔄 WebSocket Integration → ✅ Completion Alerts\n```\n- Persistent notification panel with unread count badges\n- Real-time updates for transcription, summarization, and upload progress\n- WebSocket integration for instant status updates\n- Smart notification grouping and auto-refresh systems\n\n#### **Export and Integration**\n```\n📄 Multiple Formats → 📺 Subtitle Files → 🔗 API 
Access → 🎬 Media Downloads\n```\n- Export transcripts as TXT, JSON, or CSV\n- Generate SRT/VTT subtitle files with embedded timing\n- Access data programmatically via comprehensive REST API\n- Download media files with embedded subtitles\n\n## 📁 Project Structure\n\n```\nOpenTranscribe/\n├── 📁 backend/                 # Python FastAPI backend\n│   ├── 📁 app/                # Application modules\n│   │   ├── 📁 api/            # REST API endpoints\n│   │   ├── 📁 models/         # Database models\n│   │   ├── 📁 services/       # Business logic\n│   │   ├── 📁 tasks/          # Background AI processing\n│   │   ├── 📁 utils/          # Common utilities\n│   │   └── 📁 db/             # Database configuration\n│   ├── 📁 scripts/            # Admin and maintenance scripts\n│   ├── 📁 tests/              # Comprehensive test suite\n│   └── 📄 README.md           # Backend documentation\n├── 📁 frontend/               # Svelte frontend application\n│   ├── 📁 src/                # Source code\n│   │   ├── 📁 components/     # Reusable UI components\n│   │   ├── 📁 routes/         # Page components\n│   │   ├── 📁 stores/         # State management\n│   │   └── 📁 styles/         # CSS and themes\n│   └── 📄 README.md           # Frontend documentation\n├── 📁 database/               # Database initialization\n├── 📁 models_ai/              # AI model storage (runtime)\n├── 📁 scripts/                # Utility scripts\n├── 📄 docker-compose.yml      # Container orchestration\n├── 📄 opentr.sh               # Main utility script\n└── 📄 README.md               # This file\n```\n\n## 🔧 Configuration\n\n### **Environment Variables**\n\n#### **Core Application**\n```bash\n# Database\nDATABASE_URL=postgresql://postgres:password@postgres:5432/opentranscribe\n\n# Security\nSECRET_KEY=your-super-secret-key-here\nJWT_SECRET_KEY=your-jwt-secret-key\n\n# Object Storage\nMINIO_ROOT_USER=minioadmin\nMINIO_ROOT_PASSWORD=minioadmin\nMINIO_BUCKET_NAME=transcribe-app\n```\n\n#### **AI 
Processing**\n```bash\n# Required for speaker diarization - see setup instructions below\nHUGGINGFACE_TOKEN=your_huggingface_token_here\n\n# Model configuration\nWHISPER_MODEL=large-v2              # large-v2, medium, small, base\nCOMPUTE_TYPE=float16                # float16, int8\nBATCH_SIZE=16                       # Reduce if GPU memory limited\n\n# Speaker detection\nMIN_SPEAKERS=1                      # Minimum speakers to detect\nMAX_SPEAKERS=20                     # Maximum speakers to detect (can be increased to 50+ for large conferences)\n\n# Model caching (recommended)\nMODEL_CACHE_DIR=./models            # Directory to store downloaded AI models\n```\n\n#### **LLM Configuration (AI Features)**\nOpenTranscribe offers flexible AI deployment options. Choose the approach that best fits your infrastructure:\n\n**🔧 Quick Setup Options:**\n\n1. **Cloud-Only (Recommended for Most Users)**\n   ```bash\n   # Configure for OpenAI in .env\n   LLM_PROVIDER=openai\n   OPENAI_API_KEY=your_openai_key\n   OPENAI_MODEL_NAME=gpt-4o-mini\n\n   # Start without local LLM\n   ./opentr.sh start dev\n   ```\n\n2. **Local vLLM (Self-Hosted)**\n   ```bash\n   # Deploy vLLM server separately, then configure in .env\n   LLM_PROVIDER=vllm\n   VLLM_BASE_URL=http://your-vllm-server:8000/v1\n   VLLM_MODEL_NAME=gpt-oss-20b\n\n   # Start OpenTranscribe\n   ./opentr.sh start dev\n   ```\n\n3. 
**Local Ollama (Self-Hosted)**\n   ```bash\n   # Deploy Ollama server separately, then configure in .env\n   LLM_PROVIDER=ollama\n   OLLAMA_BASE_URL=http://your-ollama-server:11434\n   OLLAMA_MODEL_NAME=llama3.2:3b-instruct-q4_K_M\n\n   # Start OpenTranscribe\n   ./opentr.sh start dev\n   ```\n\n**📋 Complete Provider Configuration:**\n```bash\n# Cloud Providers (configure in .env)\nLLM_PROVIDER=openai                  # openai, anthropic, custom (openrouter)\nOPENAI_API_KEY=your_openai_key       # OpenAI GPT models\nANTHROPIC_API_KEY=your_claude_key    # Anthropic Claude models\nOPENROUTER_API_KEY=your_or_key       # OpenRouter (multi-provider)\n\n# Local Providers (requires additional Docker services)\nLLM_PROVIDER=vllm                    # Local vLLM server\nLLM_PROVIDER=ollama                  # Local Ollama server\n```\n\n**🎯 Deployment Scenarios:**\n- **💰 Cost-Effective**: OpenRouter with Claude Haiku (~$0.25/1M tokens)\n- **🔒 Privacy-First**: Local vLLM or Ollama (no data leaves your server)\n- **⚡ Performance**: OpenAI GPT-4o-mini (fastest cloud option)\n- **📱 Small Models**: Even 3B Ollama models can handle hours of content via intelligent sectioning\n- **🚫 No LLM**: Leave `LLM_PROVIDER` empty for transcription-only mode\n\nSee [LLM_DEPLOYMENT_OPTIONS.md](LLM_DEPLOYMENT_OPTIONS.md) for detailed setup instructions.\n\n#### **🗂️ Model Caching**\n\nOpenTranscribe automatically downloads and caches AI models for optimal performance. 
Models are saved locally and reused across container restarts.\n\n**Default Setup:**\n- All models are cached to the `./models/` directory in your project folder\n- Models persist between Docker container restarts\n- No re-downloading required after initial setup\n\n**Directory Structure:**\n```\n./models/\n├── huggingface/          # PyAnnote + WhisperX models\n│   ├── hub/             # WhisperX transcription models (~1.5GB)\n│   └── transformers/    # PyAnnote transformer models\n└── torch/               # PyTorch cache\n    ├── hub/checkpoints/ # Wav2Vec2 alignment model (~360MB)\n    └── pyannote/        # PyAnnote diarization models (~500MB)\n```\n\n**Custom Cache Location:**\n```bash\n# Set custom directory in your .env file\nMODEL_CACHE_DIR=/path/to/your/models\n\n# Examples:\nMODEL_CACHE_DIR=~/ai-models          # Home directory\nMODEL_CACHE_DIR=/mnt/storage/models  # Network storage\nMODEL_CACHE_DIR=./cache              # Project subdirectory\n```\n\n**Storage Requirements:**\n- **WhisperX Models**: ~1.5GB (depends on model size)\n- **PyAnnote Models**: ~500MB (diarization + embedding)\n- **Alignment Model**: ~360MB (Wav2Vec2)\n- **Total**: ~2.5GB for complete setup\n\n### **🔑 HuggingFace Token Setup**\n\nOpenTranscribe requires a HuggingFace token for speaker diarization and voice fingerprinting features. Follow these steps:\n\n#### **1. Generate HuggingFace Token**\n1. Visit [HuggingFace Settings \u003e Access Tokens](https://huggingface.co/settings/tokens)\n2. Click \"New token\" and select \"Read\" access\n3. Copy the generated token\n\n#### **2. Accept Model User Agreements** ⚠️ **CRITICAL - MUST ACCEPT BOTH!**\n\n**You MUST accept the user agreements for BOTH PyAnnote models or speaker diarization will fail:**\n\n1. **Segmentation Model** (Required):\n   - Visit: [pyannote/segmentation-3.0](https://huggingface.co/pyannote/segmentation-3.0)\n   - Click: **\"Agree and access repository\"**\n\n2. 
**Speaker Diarization Model** (Required):\n   - Visit: [pyannote/speaker-diarization-3.1](https://huggingface.co/pyannote/speaker-diarization-3.1)\n   - Click: **\"Agree and access repository\"**\n\n\u003e **⚠️ Common Issue:** If you only accept one model agreement, downloads will fail with `'NoneType' object has no attribute 'eval'` error. You MUST accept BOTH agreements.\n\n#### **3. Configure Token**\nAdd your token to the environment configuration:\n\n**For Production Installation:**\n```bash\n# The setup script will prompt you for your token\ncurl -fsSL https://raw.githubusercontent.com/davidamacey/OpenTranscribe/master/setup-opentranscribe.sh | bash\n```\n\n**For Manual Installation:**\n```bash\n# Add to .env file\necho \"HUGGINGFACE_TOKEN=hf_xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx\" \u003e\u003e .env\n```\n\n**Note:** Without a valid HuggingFace token, speaker diarization will be disabled and speakers will not be automatically detected or identified across different media files.\n\n#### **Performance Tuning**\n```bash\n# GPU settings\nUSE_GPU=true                        # Enable GPU acceleration\nCUDA_VISIBLE_DEVICES=0              # GPU device selection\n\n# Resource limits\nMAX_UPLOAD_SIZE=4GB                 # Maximum file size (supports GoPro videos)\nCELERY_WORKER_CONCURRENCY=2         # Concurrent tasks\n```\n\n### **Production Deployment**\n\nFor production use, ensure you:\n\n1. **Security Configuration**\n   ```bash\n   # Generate strong secrets\n   openssl rand -hex 32  # For SECRET_KEY\n   openssl rand -hex 32  # For JWT_SECRET_KEY\n\n   # Set strong database passwords\n   # Configure proper firewall rules\n   # Set up SSL/TLS certificates\n   ```\n\n2. **Performance Optimization**\n   ```bash\n   # Use production environment\n   NODE_ENV=production\n\n   # Configure resource limits\n   # Set up monitoring and logging\n   # Configure backup strategies\n   ```\n\n3. 
**HTTPS/SSL Setup** (Required for microphone recording from other devices)\n\n   OpenTranscribe includes built-in NGINX reverse proxy support with SSL/TLS:\n\n   ```bash\n   # Quick setup for homelab/local network\n   ./scripts/generate-ssl-cert.sh opentranscribe.local --auto-ip\n\n   # Add to .env\n   NGINX_SERVER_NAME=opentranscribe.local\n\n   # Start with HTTPS enabled\n   ./opentr.sh start dev\n   ```\n\n   For detailed instructions including Let's Encrypt setup, see [docs/NGINX_SETUP.md](docs/NGINX_SETUP.md).\n\n   \u003e **Note**: Modern browsers require HTTPS for microphone access. Without NGINX/SSL setup,\n   \u003e microphone recording will only work when accessing via `localhost`.\n\n## 🧪 Development\n\n### **Development Environment**\n```bash\n# Start development with hot reload\n./opentr.sh start dev\n\n# Backend development\ncd backend/\npip install -r requirements.txt\npytest tests/                    # Run tests\nblack app/                       # Format code\nflake8 app/                      # Lint code\n\n# Frontend development\ncd frontend/\nnpm install\nnpm run dev                      # Development server\nnpm run test                     # Run tests\nnpm run lint                     # Lint code\n```\n\n### **Testing**\n```bash\n# Backend tests\n./opentr.sh shell backend\npytest tests/                    # All tests\npytest tests/api/                # API tests only\npytest --cov=app tests/          # With coverage\n\n# Frontend tests\ncd frontend/\nnpm run test                     # Unit tests\nnpm run test:e2e                 # End-to-end tests\nnpm run test:components          # Component tests\n```\n\n### **Contributing**\nWe welcome contributions! 
Please see [CONTRIBUTING.md](docs/CONTRIBUTING.md) for detailed guidelines.\n\n## 🔍 Troubleshooting\n\n### **Common Issues**\n\n#### **GPU Not Detected**\n```bash\n# Check GPU availability\nnvidia-smi\n\n# Verify Docker GPU support (the NVIDIA Container Toolkit injects nvidia-smi into the container)\ndocker run --rm --gpus all ubuntu nvidia-smi\n\n# Set CPU-only mode if needed\necho \"USE_GPU=false\" \u003e\u003e .env\n```\n\n#### **Permission Errors (Model Cache / yt-dlp)**\n\n**Symptoms:**\n- Error: `Permission denied: '/home/appuser/.cache/huggingface/hub'`\n- Error: `Permission denied: '/home/appuser/.cache/yt-dlp'`\n- YouTube downloads fail with permission errors\n- Models fail to download or save\n\n**Cause:** Docker creates model cache directories with root ownership, but the containers run as a non-root user (UID 1000) for security.\n\n**Solution:**\n```bash\n# Option 1: Run the automated permission fix script (recommended)\ncd opentranscribe  # Or your installation directory\n./scripts/fix-model-permissions.sh\n\n# Option 2: Manual fix using Docker (absolute path works on older Docker versions too)\ndocker run --rm -v \"$(pwd)/models:/models\" busybox chown -R 1000:1000 /models\n\n# Option 3: Manual fix using sudo (if available)\nsudo chown -R 1000:1000 ./models\nsudo chmod -R 755 ./models\n```\n\n**Prevention for New Installations:**\n- The latest setup script automatically creates directories with correct permissions\n- Re-run the one-line installer for new deployments:\n  ```bash\n  curl -fsSL https://raw.githubusercontent.com/davidamacey/OpenTranscribe/master/setup-opentranscribe.sh | bash\n  ```\n\n**Why This Happens:**\n- Different Linux users have different UIDs (e.g., 1001, 1002)\n- Running setup as root creates root-owned directories\n- Docker version differences affect directory creation behavior\n- The containers run as UID 1000 for security (non-root user)\n\n**Verification:**\n```bash\n# Check directory ownership (should show UID 1000 or your user)\nls -la models/\n\n# Test write permissions\ntouch models/huggingface/test.txt \u0026\u0026 rm 
models/huggingface/test.txt\n```\n\n#### **Memory Issues**\n```bash\n# Reduce model size\necho \"WHISPER_MODEL=medium\" \u003e\u003e .env\necho \"BATCH_SIZE=8\" \u003e\u003e .env\necho \"COMPUTE_TYPE=int8\" \u003e\u003e .env\n\n# Monitor memory usage\ndocker stats\n```\n\n#### **Slow Transcription**\n- Use GPU acceleration (`USE_GPU=true`)\n- Reduce model size (`WHISPER_MODEL=medium`)\n- Increase batch size if you have GPU memory\n- Split large files into smaller segments\n\n#### **Database Connection Issues**\n```bash\n# Reset database\n./opentr.sh reset dev\n\n# Check database logs\n./opentr.sh logs postgres\n\n# Verify database is running\n./opentr.sh shell postgres psql -U postgres -l\n```\n\n#### **Container Issues**\n```bash\n# Check service status\n./opentr.sh status\n\n# Clean up resources\n./opentr.sh clean\n\n# Full reset (⚠️ deletes all data)\n./opentr.sh reset dev\n```\n\n### **Getting Help**\n\n- 📚 **Documentation**: Check README files in each component directory\n- 🐛 **Issues**: Report bugs on GitHub Issues\n- 💬 **Discussions**: Ask questions in GitHub Discussions\n- 📊 **Monitoring**: Use Flower dashboard for task debugging\n\n## 📈 Performance \u0026 Scalability\n\n### **Hardware Recommendations**\n\n#### **Minimum Requirements**\n- 8GB RAM\n- 4 CPU cores\n- 50GB disk space\n- Any modern GPU (optional but recommended)\n\n#### **Recommended Configuration**\n- 16GB+ RAM\n- 8+ CPU cores\n- 100GB+ SSD storage\n- NVIDIA GPU with 8GB+ VRAM (RTX 3070 or better)\n- High-speed internet for model downloads\n\n#### **Production Scale**\n- 32GB+ RAM\n- 16+ CPU cores\n- Multiple GPUs for parallel processing\n- Fast NVMe storage\n- Load balancer for multiple instances\n\n### **Performance Tuning**\n\n```bash\n# GPU optimization\nCOMPUTE_TYPE=float16              # Use half precision\nBATCH_SIZE=32                     # Increase for more GPU memory\nWHISPER_MODEL=large-v2            # Best accuracy\n\n# CPU optimization (if no GPU)\nCOMPUTE_TYPE=int8                 
# Use quantization\nBATCH_SIZE=1                      # Reduce memory usage\nWHISPER_MODEL=base                # Faster processing\n```\n\n## 🔐 Security Considerations\n\n### **Data Privacy**\n- Transcription and diarization run locally; transcript text leaves your server only if you configure a cloud LLM provider\n- Optional: Disable external model downloads for air-gapped environments\n- User data is encrypted at rest and in transit\n- Configurable data retention policies\n\n### **Access Control**\n- Role-based permissions (admin/user)\n- File ownership validation\n- API rate limiting\n- Secure session management\n\n### **Network Security**\n- All services run in an isolated Docker network\n- Configurable firewall rules\n- Optional SSL/TLS termination\n- Secure default configurations\n\n## 📄 License\n\nThis project is licensed under the GNU Affero General Public License v3.0 (AGPL-3.0) - see the [LICENSE](LICENSE) file for details.\n\nThe AGPL-3.0 license ensures that:\n- The source code remains open and accessible to everyone\n- Any modifications to the software must be made available to users\n- Network use (SaaS) requires source code availability\n- Proprietary forks are prevented, which protects the open source community\n\n## 🙏 Acknowledgments\n\n- **OpenAI Whisper** - Foundation speech recognition model\n- **WhisperX** - Enhanced alignment and diarization\n- **PyAnnote.audio** - Speaker diarization capabilities\n- **FastAPI** - Modern Python web framework\n- **Svelte** - Reactive frontend framework\n- **Docker** - Containerization platform\n\n## 🔗 Useful Links\n\n- 📚 **Documentation**:\n  - [Database Schema \u0026 Architecture](docs/database-schema.md) - ERD diagrams and system architecture\n  - [Backend Documentation](docs/BACKEND_DOCUMENTATION.md)\n  - [Prompt Engineering Guide](docs/PROMPT_ENGINEERING_README.md) - Best practices for LLM prompts\n  - [Scripts Documentation](scripts/README.md) - Docker build and deployment guide\n- 🛠️ **API Reference**: http://localhost:5174/docs (when running)\n- 🌺 **Task Monitor**: 
http://localhost:5175/flower (when running)\n- 🤝 **Contributing**: [Contribution guidelines](docs/CONTRIBUTING.md)\n- 🐛 **Issues**: [GitHub Issues](https://github.com/davidamacey/OpenTranscribe/issues)\n- 💬 **Discussions**: [GitHub Discussions](https://github.com/davidamacey/OpenTranscribe/discussions)\n\n---\n\n**Built with ❤️ using AI assistance and modern open-source technologies.**\n\n*OpenTranscribe demonstrates the power of AI-assisted development while maintaining full local control over your data and processing.*\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fdavidamacey%2Fopentranscribe","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fdavidamacey%2Fopentranscribe","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fdavidamacey%2Fopentranscribe/lists"}