{"id":51326829,"url":"https://github.com/darylalim/granite-speech-studio","last_synced_at":"2026-07-01T19:02:29.572Z","repository":{"id":336951452,"uuid":"1124955754","full_name":"darylalim/granite-speech-studio","owner":"darylalim","description":"Streamlit application for transcription and translation using IBM Granite Speech on Apple Silicon with MLX.","archived":false,"fork":false,"pushed_at":"2026-06-27T03:00:47.000Z","size":2028,"stargazers_count":1,"open_issues_count":0,"forks_count":0,"subscribers_count":0,"default_branch":"main","last_synced_at":"2026-06-27T05:03:29.840Z","etag":null,"topics":["apple-silicon","automatic-speech-recognition","granite","ibm-granite","mlx","mlx-audio","silero-vad","speech-to-text","streamlit","toxicity-detection","transcription","translation","voice-activity-detection"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/darylalim.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null,"notice":null,"maintainers":null,"copyright":null,"agents":null,"dco":null,"cla":null}},"created_at":"2025-12-29T23:00:43.000Z","updated_at":"2026-06-27T03:00:51.000Z","dependencies_parsed_at":"2026-03-17T20:05:31.513Z","dependency_job_id":null,"html_url":"https://github.com/darylalim/granite-speech-studio","commit_stats":null,"previous_names":["darylalim/granite-speech-pipeline","darylalim/granite-speech-studio"],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/darylalim/granite-speech-studio","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/darylalim%2Fgranite-speech-studio","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/darylalim%2Fgranite-speech-studio/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/darylalim%2Fgranite-speech-studio/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/darylalim%2Fgranite-speech-studio/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/darylalim","download_url":"https://codeload.github.com/darylalim/granite-speech-studio/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/darylalim%2Fgranite-speech-studio/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":35019037,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-05-26T15:22:16.424Z","status":"online","status_checked_at":"2026-07-01T02:00:05.325Z","response_time":130,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["apple-silicon","automatic-speech-recognition","granite","ibm-granite","mlx","mlx-audio","silero-vad","speech-to-text","streamlit","toxicity-detection","transcription","translation","voice-activity-detection"],"created_at":"2026-07-01T19:02:28.834Z","updated_at":"2026-07-01T19:02:29.564Z","avatar_url":"https://github.com/darylalim.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# Granite Speech Studio\n\n[![CI](https://github.com/darylalim/granite-speech-studio/actions/workflows/ci.yml/badge.svg)](https://github.com/darylalim/granite-speech-studio/actions/workflows/ci.yml)\n[![License: Apache 2.0](https://img.shields.io/badge/License-Apache_2.0-blue.svg)](LICENSE)\n[![Python 3.12+](https://img.shields.io/badge/python-3.12%2B-blue.svg)](https://www.python.org/downloads/)\n\nStreamlit application for transcription and translation using IBM Granite Speech on Apple Silicon with MLX.\n\n\u003cp align=\"center\"\u003e\n  \u003cimg src=\"docs/screenshot-light.png\" alt=\"Granite Speech Studio transcribing an English clip and translating it to French (light theme)\" width=\"49%\"\u003e\n  \u003cimg src=\"docs/screenshot-dark.png\" alt=\"Granite Speech Studio transcribing an English clip and translating it to French (dark theme)\" width=\"49%\"\u003e\n\u003c/p\u003e\n\u003cp align=\"center\"\u003e\u003cem\u003eTranscription + French translation of a sample clip, shown in the light and dark themes.\u003c/em\u003e\u003c/p\u003e\n\n## Features\n\n- **Pipeline processing** — run multiple transcription and translation tasks on the same audio (Transcribe + one translation runs as a single inference per segment via chain-of-thought prompting)\n- **Transcription** — English, French, German, Spanish, Portuguese, Japanese\n- **Translation** — English ↔ French, German, Spanish, Portuguese, Italian, Japanese, Mandarin Chinese (Italian and Mandarin: English source only)\n- **Keywords** — bias recognition toward up to 15 user-provided terms (proper nouns, acronyms, jargon)\n- **VAD segmentation** — automatic speech detection with timestamped per-segment output (togglable; disable to process whole audio in one pass; auto-required for audio over 5 minutes)\n- **Toxicity check** — togglable (on by default); surfaces the worst per-segment toxicity score on English output (transcription or translation to English) via Granite Guardian HAP 125m\n- **Source language** — pick once; valid tasks update accordingly\n- **Audio input** — upload audio (WAV, FLAC, M4A, MP3, OGG, AAC) or video (MP4, MOV, WebM, MKV — audio track is extracted) or record from microphone\n- **Side-by-side results** — compare outputs in a column grid (up to 3 columns)\n- **Themed UI** — cohesive IBM Carbon-inspired theme with automatic light and dark modes\n- **Deferred loading** — models load on first pipeline run for instant page startup\n- **Export** — download per-task transcriptions and translations as text\n\n## How it works\n\nThree models run as a pipeline, loaded on first run and cached thereafter:\n\n| Model | Role | Runs on |\n|-------|------|---------|\n| [Granite 4.0 1B Speech (8-bit, MLX)](https://huggingface.co/mlx-community/granite-4.0-1b-speech-8bit) | Transcription and translation | Apple GPU (MLX) |\n| [Silero VAD](https://github.com/snakers4/silero-vad) | Splits audio into speech segments | CPU |\n| [Granite Guardian HAP 125m](https://huggingface.co/ibm-granite/granite-guardian-hap-125m) | English toxicity detection | CPU |\n\nAudio is loaded and resampled to 16 kHz mono, optionally segmented with VAD, then transcribed and translated segment-by-segment on the GPU. English output (English-source transcription or translation into English) is scored for toxicity. Transcribe plus a single translation runs as one chain-of-thought inference per segment rather than two passes.\n\n## Requirements\n\n- Apple Silicon Mac (M1/M2/M3/M4)\n- Python 3.12+\n- [uv](https://docs.astral.sh/uv/) — Python package manager (`curl -LsSf https://astral.sh/uv/install.sh | sh`)\n- [FFmpeg](https://ffmpeg.org/) — `brew install ffmpeg` (required: `torchcodec` loads FFmpeg's shared libraries at import time, so the app won't start without it)\n\n## Setup\n\n```bash\nbrew install ffmpeg   # required at runtime by torchcodec\nuv sync\nuv run streamlit run streamlit_app.py\n```\n\n\u003e First run downloads the Granite Speech model (~2.9 GB) plus the VAD and guardian models, then caches them; inference runs on the Apple Silicon GPU.\n\n## Usage\n\n\u003e New here? Try it with the bundled sample clip: `tests/data/audio/sample_10s.wav`.\n\n1. Upload an audio or video file, or record from your microphone\n2. Pick the source language of your audio\n3. Pick tasks (transcribe, translate to a language)\n4. Optionally toggle **VAD segmentation** (on by default)\n5. Optionally add **Keywords** (proper nouns, acronyms, jargon)\n6. Optionally toggle **Toxicity check** (on by default)\n7. Click **Transcribe** to process all selected tasks\n8. View side-by-side results and download as text\n\n## Notes\n\n- **Apple Silicon only** — inference uses MLX; there's no CUDA or CPU-only fallback.\n- **Translation pivots through English** — English ↔ X only; no direct X → Y (e.g. French → German).\n- **Toxicity detection is English-only** (Granite Guardian HAP).\n- **Upload limit 500 MB**; with VAD off, clips are capped at 5 minutes (the model's context window).\n\n## Development\n\n```bash\nuv run ruff check .     # lint\nuv run ruff format .    # format\nuv run ty check         # type-check\nuv run pytest           # run tests\n```\n\n## Resources\n\n- [Granite 4.0 1B Speech (8-bit, MLX)](https://huggingface.co/mlx-community/granite-4.0-1b-speech-8bit) — model card\n- [Granite Speech collection](https://huggingface.co/collections/ibm-granite/granite-speech)\n- [Technical report](https://arxiv.org/abs/2505.08699)\n\n## Acknowledgements\n\n- [IBM Granite](https://huggingface.co/ibm-granite) — Speech and Guardian models\n- [Silero VAD](https://github.com/snakers4/silero-vad) — voice activity detection\n- [Apple MLX](https://github.com/ml-explore/mlx) and [mlx-audio](https://github.com/Blaizzy/mlx-audio) — on-device inference\n- [Streamlit](https://streamlit.io/) — web UI\n\n## License\n\nLicensed under the [Apache License 2.0](LICENSE). See [NOTICE](NOTICE) for third-party attributions.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fdarylalim%2Fgranite-speech-studio","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fdarylalim%2Fgranite-speech-studio","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fdarylalim%2Fgranite-speech-studio/lists"}