{"id":30654793,"url":"https://github.com/FluidInference/FluidAudio","last_synced_at":"2025-08-31T09:04:53.507Z","repository":{"id":301622609,"uuid":"1006258471","full_name":"FluidInference/FluidAudio","owner":"FluidInference","description":"Native Swift and CoreML SDK for local speaker diarization, VAD, and speech-to-text for real-time workloads. Works on iOS and macOS.","archived":false,"fork":false,"pushed_at":"2025-08-30T00:20:11.000Z","size":14317,"stargazers_count":560,"open_issues_count":2,"forks_count":65,"subscribers_count":35,"default_branch":"main","last_synced_at":"2025-08-30T02:13:34.410Z","etag":null,"topics":["ane","asr","audio","automatic-speech-recognition","avfoundation","coreml","ios","macos","nvidia","parakeet","real-time","speaker-diarization","speaker-embedding","speaker-identification","speaker-recognition","speech-to-text","swift","vad","voice-activity-detection"],"latest_commit_sha":null,"homepage":"https://deepwiki.com/FluidInference/FluidAudio","language":"Swift","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/FluidInference.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null,"notice":null,"maintainers":null,"copyright":null,"agents":"AGENTS.md","dco":null,"cla":null}},"created_at":"2025-06-21T21:09:30.000Z","updated_at":"2025-08-30T00:20:12.000Z","dependencies_parsed_at":"2025-07-16T10:20:42.675Z","dependency_job_id":"34f61a1c-b13b-49d9-9018-9f8a070bf8d1","html_url":"https://github.com/FluidInference/FluidAudio","commit_stats":null,"previous_names":["fluidinference/fluidaudioswift","fluidinference/fluidaudio"],"tags_count":6,"template":false,"template_full_name":null,"purl":"pkg:github/FluidInference/FluidAudio","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/FluidInference%2FFluidAudio","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/FluidInference%2FFluidAudio/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/FluidInference%2FFluidAudio/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/FluidInference%2FFluidAudio/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/FluidInference","download_url":"https://codeload.github.com/FluidInference/FluidAudio/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/FluidInference%2FFluidAudio/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":272959477,"owners_count":25022057,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","status":"online","status_checked_at":"2025-08-31T02:00:09.071Z","response_time":79,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["ane","asr","audio","automatic-speech-recognition","avfoundation","coreml","ios","macos","nvidia","parakeet","real-time","speaker-diarization","speaker-embedding","speaker-identification","speaker-recognition","speech-to-text","swift","vad","voice-activity-detection"],"created_at":"2025-08-31T09:02:29.158Z","updated_at":"2025-08-31T09:04:53.484Z","avatar_url":"https://github.com/FluidInference.png","language":"Swift","readme":"![banner.png](banner.png)\n\n# FluidAudio - Swift SDK for Speaker Diarization and ASR with CoreML\n\n[![Swift](https://img.shields.io/badge/Swift-5.9+-orange.svg)](https://swift.org)\n[![Platform](https://img.shields.io/badge/Platform-macOS%20%7C%20iOS-blue.svg)](https://developer.apple.com)\n[![Discord](https://img.shields.io/badge/Discord-Join%20Chat-7289da.svg)](https://discord.gg/WNsvaCtmDe)\n[![Models](https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-Model-blue)](https://huggingface.co/collections/FluidInference/coreml-models-6873d9e310e638c66d22fba9)\n\nFluid Audio is a Swift framework for fully local, low-latency audio processing on Apple devices. It provides state-of-the-art speaker diarization, ASR, and voice activity detection through open-source models (MIT/Apache 2.0 licensed) that we've converted to Core ML.\n\nOur models are optimized for background processing on CPU, avoiding GPU/MPS/Shaders to ensure reliable performance. While we've tested CPU/GPU-based alternatives, they proved too slow or resource-intensive for our near real-time requirements.\n\nFor custom use cases, feedback, more model support, and other platform requests, join our Discord. We’re also working on porting video, language, and TTS models to run on device, and will share updates there.\n\n## Features\n\n- **Automatic Speech Recognition (ASR)**: Parakeet TDT v3 (0.6b) with Token Duration Transducer; supports 25 European languages\n- **Speaker Diarization**: Speaker separation with speaker clustering via Pyannote models\n- **Speaker Embedding Extraction**: Generate speaker embeddings for voice comparison and clustering, you can use this for speaker identification\n- **Voice Activity Detection (VAD)**: Voice activity detection with Silero models\n- **CoreML Models**: Native Apple CoreML backend with custom-converted models optimized for Apple Silicon\n- **Open-Source Models**: All models are [publicly available on HuggingFace](https://huggingface.co/FluidInference) - converted and optimized by our team. Permissive licenses.\n- **Real-time Processing**: Designed for near real-time workloads but also works for offline processing\n- **Cross-platform**: Support for macOS 14.0+ and iOS 17.0+ and Apple Sillicon device\n- **Apple Neural Engine**: Models run efficiently on Apple's ANE for maximum performance with minimal power consumption\n\n## Installation\n\nAdd FluidAudio to your project using Swift Package Manager:\n\n```swift\ndependencies: [\n    .package(url: \"https://github.com/FluidInference/FluidAudio.git\", from: \"0.3.0\"),\n],\n```\n\n**Important**: When adding FluidAudio as a package dependency, **only add the library to your target** (not the executable). Select \"FluidAudio\" library in the package products dialog and add it to your app target.\n\n## Documentation\n\n- **DeepWiki**: [https://deepwiki.com/FluidInference/FluidAudio](https://deepwiki.com/FluidInference/FluidAudio) - Primary documentation\n- **Local Docs**: [Documentation/](Documentation/) - Additional guides and API references\n\n## MCP\n\nThe repo is indexed by [DeepWiki](https://docs.devin.ai/work-with-devin/deepwiki-mcp) - the MCP server gives your coding tool access to the docs already.\n\nFor most clients:\n\n```json\n{\n  \"mcpServers\": {\n    \"deepwiki\": {\n      \"url\": \"https://mcp.deepwiki.com/mcp\"\n    }\n  }\n}\n```\n\nFor claude code:\n\n```bash\nclaude mcp add -s user -t http deepwiki https://mcp.deepwiki.com/mcp\n```\n\n## Speaker Diarization\n\n**AMI Benchmark Results** (Single Distant Microphone) using a subset of the files:\n\n- **DER: 17.7%** - Competitive with Powerset BCE 2023 (18.5%)\n- **JER: 28.0%** - Outperforms EEND 2019 (25.3%) and x-vector clustering (28.7%)\n- **RTF: 0.02x** - Real-time processing with 50x speedup\n\n```text\n  RTF = Processing Time / Audio Duration\n\n  With RTF = 0.02x:\n  - 1 minute of audio takes 0.02 × 60 = 1.2 seconds to process\n  - 10 minutes of audio takes 0.02 × 600 = 12 seconds to process\n\n  For real-time speech-to-text:\n  - Latency: ~1.2 seconds per minute of audio\n  - Throughput: Can process 50x faster than real-time\n  - Pipeline impact: Minimal - diarization won't be the bottleneck\n```\n\n## Voice Activity Detection (VAD) (beta)\n\nThe APIs here are too complicated for production usage; please use with caution and tune them as needed. To be transparent, VAD is the lowest priority in terms of maintenance for us at this point. If you need support here, please file an issue or contribute back!\n\nOur goal is to offer a similar API to what Apple will introudce in OS26: https://developer.apple.com/documentation/speech/speechdetector\n\n## Automatic Speech Recognition (ASR)\n\n- **Model**: [`FluidInference/parakeet-tdt-0.6b-v3-coreml`](https://huggingface.co/FluidInference/parakeet-tdt-0.6b-v3-coreml)\n- **Languages**: All European languages (25)\n- **Processing Mode**: Batch transcription for complete audio files\n- **Real-time Factor**: ~110x on M4 Pro (processes 1 minute of audio in ~0.5 seconds)\n- **Streaming Support**: Coming soon - batch processing is recommended for production use\n- **Backend**: Same Parakeet TDT v3 model powers our backend ASR\n\n### CLI Transcription\n\n```bash\n# Transcribe an audio file using batch processing\nswift run fluidaudio transcribe audio.wav\n\n# Show help and usage options\nswift run fluidaudio transcribe --help\n```\n\n### Benchmark Performance\n\n```bash\nswift run fluidaudio asr-benchmark --subset test-clean --max-files 25\n```\n\n## Showcase\n\nFluidAudio powers local AI apps like:\n\n- **[Slipbox](https://slipbox.ai/)**: Privacy-first meeting assistant for real-time conversation intelligence. Uses FluidAudio Parakeet for iOS transcription and speaker diarization across all platforms.\n- **[Whisper Mate](https://whisper.marksdo.com)**: Transcribes movies and audio to text locally. Records and transcribes in real time from speakers or system apps. Uses FluidAudio for speaker diarization.\n- **[Voice Ink](https://tryvoiceink.com/)**: Uses local AI models to instantly transcribe speech with near-perfect accuracy and complete privacy. Utilizes FluidAudio for Parakeet ASR.\n- **[Spokenly](https://spokenly.app/)**: Mac dictation app that provides fast, accurate voice-to-text conversion anywhere on your system with Parakeet ASR powered by FluidAudio. Supports real-time dictation, file transcription, and speaker diarization.\n\nMake a PR if you want to add your app!\n\n## Contributing\n\n### Code Style\n\nThis project uses `swift-format` to maintain consistent code style. All pull requests are automatically checked for formatting compliance.\n\n**Local Development:**\n```bash\n# Format all code (requires Swift 6+ for contributors only)\n# Users of the library don't need Swift 6\nswift format --in-place --recursive --configuration .swift-format Sources/ Tests/ Examples/\n\n# Check formatting without modifying\nswift format lint --recursive --configuration .swift-format Sources/ Tests/ Examples/\n\n# For Swift \u003c6, install swift-format separately:\n# git clone https://github.com/apple/swift-format\n# cd swift-format \u0026\u0026 swift build -c release\n# cp .build/release/swift-format /usr/local/bin/\n```\n\n**Automatic Checks:**\n- PRs will fail if code is not properly formatted\n- GitHub Actions runs formatting checks on all Swift file changes\n- See `.swift-format` for style configuration\n\n## Batch ASR Usage\n\n### CLI Command (Recommended)\n\n```bash\n# Simple transcription\nswift run fluidaudio transcribe audio.wav\n\n# This will output:\n# - Audio format information (sample rate, channels, duration)\n# - Final transcription text\n# - Performance metrics (processing time, RTFx, confidence)\n```\n\n### Programmatic API\n\n```swift\nimport AVFoundation\nimport FluidAudio\n\n// Batch transcription from an audio source\nTask {\n    // 1) Initialize ASR manager and load models\n    let models = try await AsrModels.downloadAndLoad()\n    let asrManager = AsrManager(config: .default)\n    try await asrManager.initialize(models: models)\n\n    // 2) Load and convert audio to 16kHz mono Float32 samples\n    let samples = try await AudioProcessor.loadAudioFile(path: \"path/to/audio.wav\")\n\n    // 3) Transcribe the audio\n    let result = try await asrManager.transcribe(samples, source: .system)\n    print(\"Transcription: \\(result.text)\")\n    print(\"Confidence: \\(result.confidence)\")\n}\n```\n\n### Speaker Diarization\n\n```swift\nimport FluidAudio\n\n// Initialize and process audio\nTask {\n    let models = try await DiarizerModels.downloadIfNeeded()\n    let diarizer = DiarizerManager()  // Uses optimal defaults (0.7 threshold = 17.7% DER)\n    diarizer.initialize(models: models)\n\n    let audioSamples: audioSamples[1000..\u003c5000]  // your 16kHz audio data, No memory copy!\n    let result = try diarizer.performCompleteDiarization(audioSamples)\n\n    for segment in result.segments {\n        print(\"Speaker \\(segment.speakerId): \\(segment.startTimeSeconds)s - \\(segment.endTimeSeconds)s\")\n    }\n}\n```\n\n**Speaker Enrollment (NEW)**: The `Speaker` class now includes a `name` field for enrollment workflows. When users introduce themselves (\"My name is Alice\"), you can update the speaker's name from the default \"Speaker_1\" to their actual name, enabling personalized speaker identification throughout the session.\n\n\n## CLI Usage\n\nFluidAudio includes a powerful command-line interface for benchmarking and audio processing:\n\n**Note**: The CLI is available on macOS only. For iOS applications, use the FluidAudio library programmatically as shown in the usage examples above.\n**Note**: FluidAudio automatically downloads required models during audio processing. If you encounter network restrictions when accessing Hugging Face, you can configure an HTTPS proxy by setting the environment variable. For example: `export https_proxy=http://127.0.0.1:7890`\n\n### Diarization Benchmark\n\n```bash\n# Run AMI benchmark with automatic dataset download\nswift run fluidaudio diarization-benchmark --auto-download\n\n# Test with specific parameters\nswift run fluidaudio diarization-benchmark --threshold 0.7 --output results.json\n\n# Test a single file for quick parameter tuning\nswift run fluidaudio diarization-benchmark --single-file ES2004a --threshold 0.8\n```\n\n### ASR Commands\n\n```bash\n# Transcribe an audio file (batch processing)\nswift run fluidaudio transcribe audio.wav\n\n# Run LibriSpeech ASR benchmark\nswift run fluidaudio asr-benchmark --subset test-clean --num-files 50\n\n# Benchmark with specific configuration  \nswift run fluidaudio asr-benchmark --subset test-other --output asr_results.json\n\n# Test with automatic download\nswift run fluidaudio asr-benchmark --auto-download --subset test-clean\n```\n\n### Process Individual Files\n\n```bash\n# Process a single audio file for diarization\nswift run fluidaudio process meeting.wav\n\n# Save results to JSON\nswift run fluidaudio process meeting.wav --output results.json --threshold 0.6\n```\n\n### Download Datasets\n\n```bash\n# Download AMI dataset for diarization benchmarking\nswift run fluidaudio download --dataset ami-sdm\n\n# Download LibriSpeech for ASR benchmarking\nswift run fluidaudio download --dataset librispeech-test-clean\nswift run fluidaudio download --dataset librispeech-test-other\n```\n\n## API Reference\n\n**Diarization:**\n\n- **`DiarizerManager`**: Main diarization class\n- **`performCompleteDiarization(_:sampleRate:)`**: Process audio and return speaker segments\n  - Accepts any `RandomAccessCollection\u003cFloat\u003e` (Array, ArraySlice, ContiguousArray, etc.)\n- **`compareSpeakers(audio1:audio2:)`**: Compare similarity between two audio samples\n- **`validateAudio(_:)`**: Validate audio quality and characteristics\n\n**Voice Activity Detection:**\n\n- **`VadManager`**: Voice activity detection with CoreML models\n- **`VadConfig`**: Configuration for VAD processing with adaptive thresholding\n- **`processChunk(_:)`**: Process a single audio chunk and detect voice activity\n- **`processAudioFile(_:)`**: Process complete audio file in chunks\n- **`VadAudioProcessor`**: Advanced audio processing with SNR filtering\n\n**Automatic Speech Recognition:**\n\n- **`AsrManager`**: Main ASR class with TDT decoding for batch processing\n- **`AsrModels`**: Model loading and management with automatic downloads\n- **`ASRConfig`**: Configuration for ASR processing\n- **`transcribe(_:source:)`**: Process complete audio and return transcription results\n- **`AudioProcessor.loadAudioFile(path:)`**: Load and convert audio files to required format\n- **`AudioSource`**: Enum for microphone vs system audio separation\n\n## License\n\nApache 2.0 - see [LICENSE](LICENSE) for details.\n\n## Acknowledgments\n\nThis project builds upon the excellent work of the [sherpa-onnx](https://github.com/k2-fsa/sherpa-onnx) project for speaker diarization algorithms and techniques. We extend our gratitude to the sherpa-onnx contributors for their foundational work in on-device speech processing.\n\nPyannote: https://github.com/pyannote/pyannote-audio\n\nWewpeaker: https://github.com/wenet-e2e/wespeaker\n\nParakeet-mlx: https://github.com/senstella/parakeet-mlx\n\nsilero-vad: https://github.com/snakers4/silero-vad\n","funding_links":[],"categories":["Software","Libs","Audio","Open Source Projects","Misc"],"sub_categories":["Framework","Audio","2. Audio","Speech Recognition"],"project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2FFluidInference%2FFluidAudio","html_url":"https://awesome.ecosyste.ms/projects/github.com%2FFluidInference%2FFluidAudio","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2FFluidInference%2FFluidAudio/lists"}