{"id":51001510,"url":"https://github.com/devalade/whisper-yoruba","last_synced_at":"2026-06-20T14:33:24.452Z","repository":{"id":362570186,"uuid":"1259774751","full_name":"devalade/whisper-yoruba","owner":"devalade","description":"Local-first Yoruba voice-query pipeline (Whisper ASR + diacritic restoration + NLLB + RAG + MMS-TTS) on Apple Silicon, with LoRA fine-tuning for whisper-large-v3","archived":false,"fork":false,"pushed_at":"2026-06-14T16:28:39.000Z","size":287,"stargazers_count":0,"open_issues_count":0,"forks_count":0,"subscribers_count":0,"default_branch":"main","last_synced_at":"2026-06-14T18:16:43.470Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":null,"language":"Jupyter Notebook","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/devalade.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null,"notice":null,"maintainers":null,"copyright":null,"agents":null,"dco":null,"cla":null}},"created_at":"2026-06-04T21:04:33.000Z","updated_at":"2026-06-14T16:28:43.000Z","dependencies_parsed_at":null,"dependency_job_id":null,"html_url":"https://github.com/devalade/whisper-yoruba","commit_stats":null,"previous_names":["devalade/whisper-yoruba"],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/devalade/whisper-yoruba","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/devalade%2Fwhisper-yoruba","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/devalade%2Fwhisper-yoruba/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/devalade%2Fwhi
sper-yoruba/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/devalade%2Fwhisper-yoruba/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/devalade","download_url":"https://codeload.github.com/devalade/whisper-yoruba/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/devalade%2Fwhisper-yoruba/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":34573803,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-05-26T15:22:16.424Z","status":"online","status_checked_at":"2026-06-20T02:00:06.407Z","response_time":98,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2026-06-20T14:33:24.275Z","updated_at":"2026-06-20T14:33:24.444Z","avatar_url":"https://github.com/devalade.png","language":"Jupyter Notebook","funding_links":[],"categories":[],"sub_categories":[],"readme":"# Yoruba Voice Query Pipeline\n\n[![GitHub](https://img.shields.io/badge/GitHub-devalade%2Fwhisper--yoruba-181717?logo=github)](https://github.com/devalade/whisper-yoruba)\n\nEnd-to-end voice assistant for Yoruba speakers. Speak a question in Yoruba, get\na spoken Yoruba answer grounded in an English Wikipedia corpus. 
Runs fully\nlocally on Apple Silicon — no cloud calls after the initial model downloads.\n\n```\nWAV (yo, 16 kHz mono)\n        │\n        ▼\n ┌──────────────┐   raw Yoruba text\n │ M1  ASR      │ ─ Whisper Large v3 (mlx-whisper)\n └──────────────┘\n        │\n        ▼\n ┌──────────────┐   diacritized Yoruba\n │ M2  ADR      │ ─ Davlan/mT5_base_yoruba_adr\n └──────────────┘\n        │\n        ▼\n ┌──────────────┐   English query\n │ M3  YO→EN    │ ─ NLLB-200 distilled-600M\n └──────────────┘\n        │\n        ▼\n ┌──────────────┐   English answer\n │ M4  RAG      │ ─ MiniLM-L6 + FAISS + Mistral-7B Q4 (llama.cpp)\n └──────────────┘\n        │\n        ▼\n ┌──────────────┐   Yoruba answer audio\n │ M5  TTS      │ ─ NLLB EN→YO  →  M2 diacritize  →  MMS-TTS-yor\n └──────────────┘\n        │\n        ▼\n   WAV (yo)\n```\n\nM2 runs twice on purpose: once after ASR to clean the input for translation,\nand once inside M5 to clean the NLLB EN→YO output before TTS.\n\n## Requirements\n\n- Apple Silicon Mac (M1/M2/M3/M4). Tested on M4 Pro / 24 GB.\n- macOS with Miniforge (ARM64), Python 3.11.\n- ~15 GB free disk for model weights.\n- First run downloads several gigabytes from Hugging Face.\n\n## Setup\n\n```bash\n# 1. Create env\nconda create -n yoruba python=3.11 -y\nconda activate yoruba\n\n# 2. Install deps\nmake install               # or: pip install -r requirements.txt\n\n# 3. Drop the Mistral GGUF into models/\n#    Expected file: models/mistral-7b-instruct-v0.2.Q4_K_M.gguf\n#    Source: TheBloke/Mistral-7B-Instruct-v0.2-GGUF on Hugging Face\nmkdir -p models\n# (download mistral-7b-instruct-v0.2.Q4_K_M.gguf into models/)\n\n# 4. 
Build the Wikipedia FAISS index used by M4\nmake index                 # or: python -m scripts.build_index\n# writes data/wikipedia/faiss.index and data/wikipedia/passages.jsonl\n```\n\nThe seed corpus (in `scripts/build_index.py`) covers Yoruba people, language,\nreligion, Lagos, Nigeria, Ile-Ife, Oyo Empire, Wole Soyinka, Fela Kuti, Olumo\nRock, Olusegun Obasanjo, Chinua Achebe. Edit `ARTICLES` to add more topics, then\nre-run the script.\n\n## Run the pipeline\n\n```bash\nmake run                                    # uses the FLEURS sample\nmake run INPUT=path/to.wav OUTPUT=out.wav   # custom files\n\n# equivalent without make:\npython pipeline.py \u003cinput.wav\u003e [output.wav]\n```\n\n`\u003cinput.wav\u003e` must be 16 kHz mono. `[output.wav]` defaults to\n`data/outputs/response.wav`.\n\nGrab a sample Yoruba clip from FLEURS first if you don't have one:\n\n```bash\nmake sample        # or: python tests/fetch_yoruba_sample.py\nmake run\n```\n\n### Talk to it through your microphone\n\n```bash\nmake talk\n# or: python pipeline.py --mic\n```\n\nConversation mode. Models load once, then the system waits for you on each\nturn:\n\n1. Press Enter to start speaking.\n2. Press Enter again to stop.\n3. The pipeline runs and the Yoruba answer auto-plays.\n4. The prompt comes back for the next turn.\n\nType `q` + Enter at the prompt to quit (Ctrl+C also works). macOS will ask for\nmicrophone permission the first time. Each turn's audio is saved as\n`data/outputs/response_NNN.wav` and logged to `logs/run-mic_capture.jsonl`.\n\n### Benchmark ASR backends (WER)\n\n```bash\nmake wer-mlx N=50          # eval mlx-whisper Large v3 on 50 FLEURS yo_ng samples\nmake wer-hf  N=50          # eval the HF Yoruba fine-tune on the same set\nmake wer N=100             # run both back-to-back\n```\n\nReference comes from `google/fleurs` (`yo_ng` config), split defaults to\n`validation`. 
Both reference and hypothesis are normalized (lowercase,\ndiacritics stripped, punctuation removed) before scoring — fair to M1 since it\nemits non-diacritized text and M2 handles diacritics later. Per-sample results\nland in `logs/wer_\u003cbackend\u003e.jsonl` with the aggregate WER on the first line.\n\n### Swap the ASR backend (M1)\n\nBy default M1 uses `mlx-whisper` (Whisper Large v3, Apple-Silicon-optimized).\nAn alternate Hugging Face backend loads a Yoruba-fine-tuned Whisper Large v2\n(`RafatK/Whisper_Largev2-Yoruba-Decodis_Comb_FT`):\n\n```bash\nmake talk-hf                      # RAG + HF Whisper\nmake chat-hf                      # free-form + HF Whisper\n# or: python pipeline.py --mic --asr hf\n# also on a file: python pipeline.py input.wav --asr hf\n```\n\nThe HF backend auto-picks the right device/attention for the host: CUDA + fp16 +\nflash-attn-2 when available, otherwise MPS or CPU + sdpa + fp32 (fp16 on MPS\nis avoided — Whisper can produce NaNs there).\n\n### Chat mode (no retrieval)\n\nTo bypass M4's Wikipedia RAG and let Mistral answer from its own knowledge:\n\n```bash\nmake chat\n# or: python pipeline.py --mic --chat\n# also works on a file: python pipeline.py input.wav --chat\n```\n\nSame pipeline (M1→M2→M3→M4→M5), but M4 becomes `M4Chat`, which prompts the local\nMistral directly with no retrieved context. Useful for open-ended questions\noutside the indexed corpus. Trade-off: answers can be less factual and aren't\ngrounded in citable passages.\n\nEach stage's intermediate result is appended to `logs/\u003crun_id\u003e.jsonl` for\nerror-propagation analysis (which stage degraded the output).\n\nExpected console output:\n\n```\n=== M1 raw YO    ===   \u003cwhisper transcript\u003e\n=== M2 diacrit.  
===   \u003crestored tone marks\u003e\n=== M3 EN query  ===   \u003cEnglish question\u003e\n=== M4 EN answer (max_sim=0.612) ===   \u003cretrieved/generated English answer\u003e\n=== M5 YO answer ===   \u003cYoruba answer text\u003e\nWAV: data/outputs/response.wav  (4.81s)\nlog: logs/run-fleurs_yo_sample.jsonl\n```\n\n## Per-module testing\n\nEach module can be exercised in isolation:\n\n```bash\nmake test           # all per-module tests (M1..M5)\nmake test-m1        # individual module (test-m1 .. test-m5)\nmake test-chain     # M1→M3 and M1→M4 chained tests\n\n# equivalent without make:\npython -m tests.test_m1     # ASR only\npython -m tests.test_m2     # diacritic restoration\npython -m tests.test_m3     # YO→EN translation\npython -m tests.test_m4     # RAG (requires the FAISS index)\npython -m tests.test_m5     # EN answer → YO audio\n```\n\n## Project layout\n\n```\nconfig.py              model IDs, paths, thresholds\npipeline.py            YorubaPipeline class + CLI entrypoint\nmodules/\n  base.py              shared Module interface\n  m1_asr.py            Whisper Large v3 (mlx)\n  m2_diacritic.py      mT5 Yoruba ADR\n  m3_translate.py      NLLB YO→EN\n  m4_rag.py            MiniLM + FAISS + Mistral-7B\n  m5_tts.py            NLLB EN→YO + M2 + MMS-TTS\nscripts/build_index.py Wikipedia fetch / chunk / embed / index\ntests/                 per-module + chained sanity tests\nutils/logging.py       JSONL stage logger\ndata/                  audio/, wikipedia/, outputs/\nlogs/                  per-run JSONL stage logs\nmodels/                local GGUF weights\n```\n\n## Key knobs (`config.py`)\n\n| Setting              | Default | Notes                                            |\n| -------------------- | ------- | ------------------------------------------------ |\n| `M4_SIM_THRESHOLD`   | 0.5     | Below this max cosine sim, M4 refuses to answer rather than hallucinate. |\n| `M4_CHUNK_TOKENS`    | 200     | Passage window size (embedder's own tokenizer).  
|\n| `M4_CHUNK_OVERLAP`   | 50      | Sliding-window overlap.                          |\n| `M4_TOP_K`           | 4       | Passages fed to Mistral as context.              |\n| `M1_LANGUAGE`        | `yo`    | Forces Whisper into Yoruba mode.                 |\n\n## Troubleshooting\n\n- **`transformers` tokenizer error on mT5 or NLLB** — ensure `transformers\u003c5`\n  and that `sentencepiece` + `protobuf` are installed. Pinned in\n  `requirements.txt`.\n- **`faiss.index` missing** — run `python -m scripts.build_index`.\n- **Mistral load fails** — confirm the GGUF path matches `config.M4_LLM_PATH`.\n- **Whisper transcript is English** — input WAV may not actually be Yoruba, or\n  not 16 kHz mono. Resample with `ffmpeg -i in.wav -ar 16000 -ac 1 out.wav`.\n- **M4 returns \"no answer\"** — `max_sim` is below `M4_SIM_THRESHOLD`. Either\n  expand the Wikipedia seed corpus in `scripts/build_index.py` or lower the\n  threshold.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fdevalade%2Fwhisper-yoruba","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fdevalade%2Fwhisper-yoruba","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fdevalade%2Fwhisper-yoruba/lists"}