https://github.com/devalade/whisper-yoruba
Local-first Yoruba voice-query pipeline (Whisper ASR + diacritic restoration + NLLB + RAG + MMS-TTS) on Apple Silicon, with LoRA fine-tuning for whisper-large-v3
- Host: GitHub
- URL: https://github.com/devalade/whisper-yoruba
- Owner: devalade
- Created: 2026-06-04T21:04:33.000Z (17 days ago)
- Default Branch: main
- Last Pushed: 2026-06-14T16:28:39.000Z (7 days ago)
- Last Synced: 2026-06-14T18:16:43.470Z (7 days ago)
- Language: Jupyter Notebook
- Size: 280 KB
- Stars: 0
- Watchers: 0
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
README
# Yoruba Voice Query Pipeline
End-to-end voice assistant for Yoruba speakers. Speak a question in Yoruba, get
a spoken Yoruba answer grounded in an English Wikipedia corpus. Runs fully
locally on Apple Silicon — no cloud calls after the initial model downloads.
```
WAV (yo, 16 kHz mono)
│
▼
┌──────────────┐ raw Yoruba text
│ M1 ASR │ ─ Whisper Large v3 (mlx-whisper)
└──────────────┘
│
▼
┌──────────────┐ diacritized Yoruba
│ M2 ADR │ ─ Davlan/mT5_base_yoruba_adr
└──────────────┘
│
▼
┌──────────────┐ English query
│ M3 YO→EN │ ─ NLLB-200 distilled-600M
└──────────────┘
│
▼
┌──────────────┐ English answer
│ M4 RAG │ ─ MiniLM-L6 + FAISS + Mistral-7B Q4 (llama.cpp)
└──────────────┘
│
▼
┌──────────────┐ Yoruba answer audio
│ M5 TTS │ ─ NLLB EN→YO → M2 diacritize → MMS-TTS-yor
└──────────────┘
│
▼
WAV (yo)
```
M2 runs twice on purpose: once after ASR to clean the input for translation,
and once inside M5 to clean the NLLB EN→YO output before TTS.
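The data flow in the diagram, including the reuse of M2 inside M5, can be pictured with a minimal sketch. Everything here is hypothetical: `Stage`, `make_m5`, `run_stages`, and the `tag` stand-ins are illustration only, not the repository's `YorubaPipeline` class, and each stage is reduced to a string function so the ordering is visible.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical simplification: every stage maps one string to another.
Stage = Callable[[str], str]


def make_m5(en_to_yo: Stage, diacritize: Stage, synthesize: Stage) -> Stage:
    """Compose M5 from its three internal steps, reusing the M2 function."""
    def m5(english_answer: str) -> str:
        return synthesize(diacritize(en_to_yo(english_answer)))
    return m5


def run_stages(stages: List[Tuple[str, Stage]], first_input: str) -> Dict[str, str]:
    """Run stages in order, keeping each intermediate result by stage name."""
    results: Dict[str, str] = {}
    value = first_input
    for name, stage in stages:
        value = stage(value)
        results[name] = value
    return results


def tag(label: str) -> Stage:
    """Stand-in stage: wraps its input in a label instead of running a model."""
    return lambda text: f"{label}({text})"


m2 = tag("adr")
stages = [
    ("M1", tag("asr")),
    ("M2", m2),                                      # first M2 pass, after ASR
    ("M3", tag("yo2en")),
    ("M4", tag("rag")),
    ("M5", make_m5(tag("en2yo"), m2, tag("tts"))),   # second M2 pass, before TTS
]
results = run_stages(stages, "in.wav")
print(results["M5"])
```

Keeping every intermediate value by stage name is what makes it possible to inspect where an error entered the chain.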
## Requirements
- Apple Silicon Mac (M1/M2/M3/M4). Tested on M4 Pro / 24 GB.
- macOS with Miniforge (ARM64), Python 3.11.
- ~15 GB free disk for model weights.
- First run downloads several gigabytes from Hugging Face.
## Setup
```bash
# 1. Create env
conda create -n yoruba python=3.11 -y
conda activate yoruba
# 2. Install deps
make install # or: pip install -r requirements.txt
# 3. Drop the Mistral GGUF into models/
# Expected file: models/mistral-7b-instruct-v0.2.Q4_K_M.gguf
# Source: TheBloke/Mistral-7B-Instruct-v0.2-GGUF on Hugging Face
mkdir -p models
# (download mistral-7b-instruct-v0.2.Q4_K_M.gguf into models/)
# 4. Build the Wikipedia FAISS index used by M4
make index # or: python -m scripts.build_index
# writes data/wikipedia/faiss.index and data/wikipedia/passages.jsonl
```
The seed corpus (in `scripts/build_index.py`) covers Yoruba people, language,
religion, Lagos, Nigeria, Ile-Ife, Oyo Empire, Wole Soyinka, Fela Kuti, Olumo
Rock, Olusegun Obasanjo, Chinua Achebe. Edit `ARTICLES` to add more topics, then
re-run the script.
## Run the pipeline
```bash
make run # uses the FLEURS sample
make run INPUT=path/to.wav OUTPUT=out.wav # custom files
# equivalent without make:
python pipeline.py <input.wav> [output.wav]
```
The input WAV (`<input.wav>`) must be 16 kHz mono. `[output.wav]` defaults to
`data/outputs/response.wav`.
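To confirm a file meets the 16 kHz mono requirement before running the pipeline, a small check with Python's standard `wave` module is enough. This is a sketch, not part of the repository: `check_wav` is a hypothetical helper, and `wave` only reads uncompressed PCM WAV files, so other encodings raise `wave.Error`.

```python
import sys
import wave
from typing import List


def check_wav(path: str, rate: int = 16000, channels: int = 1) -> List[str]:
    """Return a list of problems; an empty list means the file matches."""
    problems: List[str] = []
    with wave.open(path, "rb") as wf:
        if wf.getframerate() != rate:
            problems.append(f"sample rate is {wf.getframerate()} Hz, expected {rate}")
        if wf.getnchannels() != channels:
            problems.append(f"{wf.getnchannels()} channels, expected {channels}")
    return problems


if __name__ == "__main__" and len(sys.argv) > 1:
    issues = check_wav(sys.argv[1])
    print("OK" if not issues else "; ".join(issues))
```

If the check reports a mismatch, resample with `ffmpeg -i in.wav -ar 16000 -ac 1 out.wav`.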
Grab a sample Yoruba clip from FLEURS first if you don't have one:
```bash
make sample # or: python tests/fetch_yoruba_sample.py
make run
```
### Talk to it through your microphone
```bash
make talk
# or: python pipeline.py --mic
```
Conversation mode. Models load once, then the system waits for you on each
turn:
1. Press Enter to start speaking.
2. Press Enter again to stop.
3. The pipeline runs and the Yoruba answer auto-plays.
4. The prompt comes back for the next turn.
Type `q` + Enter at the prompt to quit (Ctrl+C also works). macOS will ask for
microphone permission the first time. Each turn's audio is saved as
`data/outputs/response_NNN.wav` and logged to `logs/run-mic_capture.jsonl`.
### Benchmark ASR backends (WER)
```bash
make wer-mlx N=50 # eval mlx-whisper Large v3 on 50 FLEURS yo_ng samples
make wer-hf N=50 # eval the HF Yoruba fine-tune on the same set
make wer N=100 # run both back-to-back
```
Reference transcripts come from `google/fleurs` (`yo_ng` config); the split
defaults to `validation`. Both reference and hypothesis are normalized
(lowercase, diacritics stripped, punctuation removed) before scoring. This is
fair to M1, which emits non-diacritized text and leaves diacritics to M2.
Per-sample results land in `logs/wer_.jsonl` with the aggregate WER on the
first line.
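The normalization and scoring described above can be sketched in plain Python. This is an assumption-laden stand-in, not the repository's scoring code: `normalize` and `wer` are hypothetical names, punctuation removal covers ASCII punctuation only, and WER is computed as word-level edit distance divided by the reference length.

```python
import string
import unicodedata
from typing import List


def normalize(text: str) -> List[str]:
    """Lowercase, strip diacritics and ASCII punctuation, split into words."""
    decomposed = unicodedata.normalize("NFD", text.lower())
    # Combining marks (category Mn) carry the tone marks and under-dots.
    no_marks = "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")
    no_punct = "".join(ch for ch in no_marks if ch not in string.punctuation)
    return no_punct.split()


def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level Levenshtein distance / reference length."""
    ref, hyp = normalize(reference), normalize(hypothesis)
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        cur = [i]
        for j, h in enumerate(hyp, 1):
            cur.append(min(prev[j] + 1,              # deletion
                           cur[j - 1] + 1,           # insertion
                           prev[j - 1] + (r != h)))  # substitution or match
        prev = cur
    return prev[-1] / max(len(ref), 1)


print(wer("Báwo ni?", "bawo ni"))
```

Because diacritics are stripped on both sides, a hypothesis that differs from the reference only in tone marks or under-dots scores as correct.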
### Swap the ASR backend (M1)
By default M1 uses `mlx-whisper` (Whisper Large v3, Apple-Silicon-optimized).
An alternate HuggingFace backend loads a Yoruba-fine-tuned Whisper Large v2
(`RafatK/Whisper_Largev2-Yoruba-Decodis_Comb_FT`):
```bash
make talk-hf # RAG + HF Whisper
make chat-hf # free-form + HF Whisper
# or: python pipeline.py --mic --asr hf
# also on a file: python pipeline.py input.wav --asr hf
```
The HF backend auto-picks the right device/attention for the host: CUDA + fp16
+ flash-attn-2 when available, otherwise MPS or CPU + sdpa + fp32 (fp16 on MPS
is avoided — Whisper can produce NaNs there).
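The selection rule above can be written as a small pure function. This is a sketch only: `AsrRuntime` and `pick_runtime` are hypothetical names, and the behaviour for CUDA without flash-attn (fp16 with sdpa) is an assumption, since the text does not say what happens in that case.

```python
from typing import NamedTuple


class AsrRuntime(NamedTuple):
    device: str
    dtype: str
    attn_implementation: str


def pick_runtime(cuda: bool, mps: bool, flash_attn: bool) -> AsrRuntime:
    """Choose device, dtype and attention kernel from host capabilities."""
    if cuda and flash_attn:
        return AsrRuntime("cuda", "float16", "flash_attention_2")
    if cuda:
        # Assumption: without flash-attn, stay on CUDA fp16 but use sdpa.
        return AsrRuntime("cuda", "float16", "sdpa")
    # fp32 on MPS and CPU: fp16 on MPS can produce NaNs with Whisper.
    return AsrRuntime("mps" if mps else "cpu", "float32", "sdpa")


print(pick_runtime(cuda=False, mps=True, flash_attn=False))
```

In a real script the three flags would typically come from `torch.cuda.is_available()`, `torch.backends.mps.is_available()`, and a check such as `importlib.util.find_spec("flash_attn") is not None`.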
### Chat mode (no retrieval)
To bypass M4's Wikipedia RAG and let Mistral answer from its own knowledge:
```bash
make chat
# or: python pipeline.py --mic --chat
# also works on a file: python pipeline.py input.wav --chat
```
The pipeline is unchanged (M1→M2→M3→M4→M5), except that M4 becomes `M4Chat`,
which prompts the local Mistral directly with no retrieved context. This is
useful for open-ended questions outside the indexed corpus. The trade-off is
that answers can be less factual and are not grounded in citable passages.
Each stage's intermediate result is appended to a per-run JSONL file under
`logs/` (for example `logs/run-fleurs_yo_sample.jsonl`) for error-propagation
analysis, that is, to see which stage degraded the output.
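A per-stage JSONL log of this kind can be sketched with the standard library. The helper names (`log_stage`, `read_stages`) and the record fields (`stage`, `output`, `ts`) are hypothetical; the actual schema written by `utils/logging.py` may differ.

```python
import json
import tempfile
import time
from pathlib import Path
from typing import Any, Dict, List


def log_stage(log_path: Path, stage: str, output: Any, **extra: Any) -> None:
    """Append one JSON object per line describing a stage's result."""
    record: Dict[str, Any] = {"stage": stage, "output": output, "ts": time.time(), **extra}
    log_path.parent.mkdir(parents=True, exist_ok=True)
    with log_path.open("a", encoding="utf-8") as f:
        # ensure_ascii=False keeps Yoruba diacritics readable in the log.
        f.write(json.dumps(record, ensure_ascii=False) + "\n")


def read_stages(log_path: Path) -> List[Dict[str, Any]]:
    """Load every logged record, in the order the stages ran."""
    with log_path.open(encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


# Example run against a temporary directory rather than the real logs/ folder.
with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "logs" / "run-example.jsonl"
    log_stage(path, "M1", "bawo ni")
    log_stage(path, "M4", "An English answer.", max_sim=0.612)
    logged = read_stages(path)
    print([r["stage"] for r in logged])
```

Reading the records back in order and comparing each stage's output with its input is one way to spot where quality drops.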
Expected console output:
```
=== M1 raw YO ===
=== M2 diacrit. ===
=== M3 EN query ===
=== M4 EN answer (max_sim=0.612) ===
=== M5 YO answer ===
WAV: data/outputs/response.wav (4.81s)
log: logs/run-fleurs_yo_sample.jsonl
```
## Per-module testing
Each module can be exercised in isolation:
```bash
make test # all per-module tests (M1..M5)
make test-m1 # individual module (test-m1 .. test-m5)
make test-chain # M1→M3 and M1→M4 chained tests
# equivalent without make:
python -m tests.test_m1 # ASR only
python -m tests.test_m2 # diacritic restoration
python -m tests.test_m3 # YO→EN translation
python -m tests.test_m4 # RAG (requires the FAISS index)
python -m tests.test_m5 # EN answer → YO audio
```
## Project layout
```
config.py model IDs, paths, thresholds
pipeline.py YorubaPipeline class + CLI entrypoint
modules/
base.py shared Module interface
m1_asr.py Whisper Large v3 (mlx)
m2_diacritic.py mT5 Yoruba ADR
m3_translate.py NLLB YO→EN
m4_rag.py MiniLM + FAISS + Mistral-7B
m5_tts.py NLLB EN→YO + M2 + MMS-TTS
scripts/build_index.py Wikipedia fetch / chunk / embed / index
tests/ per-module + chained sanity tests
utils/logging.py JSONL stage logger
data/ audio/, wikipedia/, outputs/
logs/ per-run JSONL stage logs
models/ local GGUF weights
```
## Key knobs (`config.py`)
| Setting | Default | Notes |
| -------------------- | ------- | ------------------------------------------------ |
| `M4_SIM_THRESHOLD` | 0.5 | Below this max cosine sim, M4 refuses to answer rather than hallucinate. |
| `M4_CHUNK_TOKENS` | 200 | Passage window size (embedder's own tokenizer). |
| `M4_CHUNK_OVERLAP` | 50 | Sliding-window overlap. |
| `M4_TOP_K` | 4 | Passages fed to Mistral as context. |
| `M1_LANGUAGE` | `yo` | Forces Whisper into Yoruba mode. |
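To make the interaction of `M4_CHUNK_TOKENS` and `M4_CHUNK_OVERLAP` concrete, here is a minimal sliding-window chunker over an already-tokenized sequence. It is a sketch under assumptions: `chunk_tokens` is a hypothetical function, and `scripts/build_index.py` may handle boundaries differently.

```python
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def chunk_tokens(tokens: Sequence[T], size: int = 200, overlap: int = 50) -> List[Sequence[T]]:
    """Split tokens into windows of `size` that share `overlap` tokens."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must satisfy 0 <= overlap < size")
    step = size - overlap  # each window starts this many tokens after the last
    chunks: List[Sequence[T]] = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + size])
        if start + size >= len(tokens):
            break  # the current window already reaches the end
    return chunks


chunks = chunk_tokens(list(range(500)))
print(len(chunks), [c[0] for c in chunks])
```

With the defaults, each window starts 150 tokens after the previous one, so the last 50 tokens of one passage reappear at the start of the next.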
## Troubleshooting
- **`transformers` tokenizer error on mT5 or NLLB** — ensure `transformers<5`
and that `sentencepiece` + `protobuf` are installed. Pinned in
`requirements.txt`.
- **`faiss.index` missing** — run `python -m scripts.build_index`.
- **Mistral load fails** — confirm the GGUF path matches `config.M4_LLM_PATH`.
- **Whisper transcript is English** — input WAV may not actually be Yoruba, or
not 16 kHz mono. Resample with `ffmpeg -i in.wav -ar 16000 -ac 1 out.wav`.
- **M4 returns "no answer"** — `max_sim` is below `M4_SIM_THRESHOLD`. Either
expand the Wikipedia seed corpus in `scripts/build_index.py` or lower the
threshold.
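The refusal behaviour in the last item can be illustrated with a pure-Python stand-in for the retrieval step. This is not the repository's M4 code, which uses FAISS and MiniLM embeddings: `cosine` and `retrieve_or_refuse` are hypothetical, and the two-dimensional vectors are toy data.

```python
import math
from typing import List, Optional, Sequence, Tuple


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return sum(x * y for x, y in zip(a, b)) / norm if norm else 0.0


def retrieve_or_refuse(
    query_vec: Sequence[float],
    passages: List[Tuple[str, Sequence[float]]],
    threshold: float = 0.5,
    top_k: int = 4,
) -> Tuple[Optional[List[str]], float]:
    """Return (top_k passages, max_sim), or (None, max_sim) below threshold."""
    scored = sorted(((cosine(query_vec, vec), text) for text, vec in passages), reverse=True)
    max_sim = scored[0][0] if scored else 0.0
    if max_sim < threshold:
        return None, max_sim  # refuse rather than answer without support
    return [text for _, text in scored[:top_k]], max_sim


corpus = [("about Lagos", (1.0, 0.0)), ("about Ile-Ife", (0.0, 1.0))]
print(retrieve_or_refuse((1.0, 0.0), corpus))
print(retrieve_or_refuse((1.0, 0.0), corpus[1:]))
```

Lowering the threshold lets weaker matches through, while expanding the corpus raises the chance that some passage clears it.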