https://github.com/darylalim/granite-speech-studio

Streamlit application for transcription and translation using IBM Granite Speech on Apple Silicon with MLX.
https://github.com/darylalim/granite-speech-studio

apple-silicon automatic-speech-recognition granite ibm-granite mlx mlx-audio silero-vad speech-to-text streamlit toxicity-detection transcription translation voice-activity-detection

Last synced: about 18 hours ago
JSON representation

Streamlit application for transcription and translation using IBM Granite Speech on Apple Silicon with MLX.

Host: GitHub
URL: https://github.com/darylalim/granite-speech-studio
Owner: darylalim
Created: 2025-12-29T23:00:43.000Z (6 months ago)
Default Branch: main
Last Pushed: 2026-06-27T03:00:47.000Z (5 days ago)
Last Synced: 2026-06-27T05:03:29.840Z (5 days ago)
Topics: apple-silicon, automatic-speech-recognition, granite, ibm-granite, mlx, mlx-audio, silero-vad, speech-to-text, streamlit, toxicity-detection, transcription, translation, voice-activity-detection
Language: Python
Homepage:
Size: 1.93 MB
Stars: 1
Watchers: 0
Forks: 0
Open Issues: 0
Metadata Files:
- Readme: README.md

Awesome Lists containing this project

README

          # Granite Speech Studio

[![CI](https://github.com/darylalim/granite-speech-studio/actions/workflows/ci.yml/badge.svg)](https://github.com/darylalim/granite-speech-studio/actions/workflows/ci.yml)

[![License: Apache 2.0](https://img.shields.io/badge/License-Apache_2.0-blue.svg)](LICENSE)

[![Python 3.12+](https://img.shields.io/badge/python-3.12%2B-blue.svg)](https://www.python.org/downloads/)

Streamlit application for transcription and translation using IBM Granite Speech on Apple Silicon with MLX.



  

  



Transcription + French translation of a sample clip, shown in the light and dark themes.


## Features

- **Pipeline processing** — run multiple transcription and translation tasks on the same audio (Transcribe + one translation runs as a single inference per segment via chain-of-thought prompting)

- **Transcription** — English, French, German, Spanish, Portuguese, Japanese

- **Translation** — English ↔ French, German, Spanish, Portuguese, Italian, Japanese, Mandarin Chinese (Italian and Mandarin: English source only)

- **Keywords** — bias recognition toward up to 15 user-provided terms (proper nouns, acronyms, jargon)

- **VAD segmentation** — automatic speech detection with timestamped per-segment output (togglable; disable to process whole audio in one pass; auto-required for audio over 5 minutes)

- **Toxicity check** — togglable (on by default); surfaces the worst per-segment toxicity score on English output (transcription or translation to English) via Granite Guardian HAP 125m

- **Source language** — pick once; valid tasks update accordingly

- **Audio input** — upload audio (WAV, FLAC, M4A, MP3, OGG, AAC) or video (MP4, MOV, WebM, MKV — audio track is extracted) or record from microphone

- **Side-by-side results** — compare outputs in a column grid (up to 3 columns)

- **Themed UI** — cohesive IBM Carbon-inspired theme with automatic light and dark modes

- **Deferred loading** — models load on first pipeline run for instant page startup

- **Export** — download per-task transcriptions and translations as text

## How it works

Three models run as a pipeline, loaded on first run and cached thereafter:

| Model | Role | Runs on |

|-------|------|---------|

| [Granite 4.0 1B Speech (8-bit, MLX)](https://huggingface.co/mlx-community/granite-4.0-1b-speech-8bit) | Transcription and translation | Apple GPU (MLX) |

| [Silero VAD](https://github.com/snakers4/silero-vad) | Splits audio into speech segments | CPU |

| [Granite Guardian HAP 125m](https://huggingface.co/ibm-granite/granite-guardian-hap-125m) | English toxicity detection | CPU |

Audio is loaded and resampled to 16 kHz mono, optionally segmented with VAD, then transcribed and translated segment-by-segment on the GPU. English output (English-source transcription or translation into English) is scored for toxicity. Transcribe plus a single translation runs as one chain-of-thought inference per segment rather than two passes.

## Requirements

- Apple Silicon Mac (M1/M2/M3/M4)

- Python 3.12+

- [uv](https://docs.astral.sh/uv/) — Python package manager (`curl -LsSf https://astral.sh/uv/install.sh | sh`)

- [FFmpeg](https://ffmpeg.org/) — `brew install ffmpeg` (required: `torchcodec` loads FFmpeg's shared libraries at import time, so the app won't start without it)

## Setup

```bash

brew install ffmpeg   # required at runtime by torchcodec

uv sync

uv run streamlit run streamlit_app.py

```

> First run downloads the Granite Speech model (~2.9 GB) plus the VAD and guardian models, then caches them; inference runs on the Apple Silicon GPU.

## Usage

> New here? Try it with the bundled sample clip: `tests/data/audio/sample_10s.wav`.

1. Upload an audio or video file, or record from your microphone

2. Pick the source language of your audio

3. Pick tasks (transcribe, translate to a language)

4. Optionally toggle **VAD segmentation** (on by default)

5. Optionally add **Keywords** (proper nouns, acronyms, jargon)

6. Optionally toggle **Toxicity check** (on by default)

7. Click **Transcribe** to process all selected tasks

8. View side-by-side results and download as text

## Notes

- **Apple Silicon only** — inference uses MLX; there's no CUDA or CPU-only fallback.

- **Translation pivots through English** — English ↔ X only; no direct X → Y (e.g. French → German).

- **Toxicity detection is English-only** (Granite Guardian HAP).

- **Upload limit 500 MB**; with VAD off, clips are capped at 5 minutes (the model's context window).

## Development

```bash

uv run ruff check .     # lint

uv run ruff format .    # format

uv run ty check         # type-check

uv run pytest           # run tests

```

## Resources

- [Granite 4.0 1B Speech (8-bit, MLX)](https://huggingface.co/mlx-community/granite-4.0-1b-speech-8bit) — model card

- [Granite Speech collection](https://huggingface.co/collections/ibm-granite/granite-speech)

- [Technical report](https://arxiv.org/abs/2505.08699)

## Acknowledgements

- [IBM Granite](https://huggingface.co/ibm-granite) — Speech and Guardian models

- [Silero VAD](https://github.com/snakers4/silero-vad) — voice activity detection

- [Apple MLX](https://github.com/ml-explore/mlx) and [mlx-audio](https://github.com/Blaizzy/mlx-audio) — on-device inference

- [Streamlit](https://streamlit.io/) — web UI

## License

Licensed under the [Apache License 2.0](LICENSE). See [NOTICE](NOTICE) for third-party attributions.

ecosyste.ms

Data

Tools

Indexes

Applications

Experiments

Awesome

https://github.com/darylalim/granite-speech-studio

Awesome Lists containing this project

README