https://github.com/zxkane/audio-transcriber-funasr
Agent skill for multi-speaker meeting & podcast transcription with FunASR speaker diarization and LLM cleanup. Supports 99 languages (zh/en/ja/ko/yue + Whisper). GPU & CPU. Packaged as a Claude Code plugin.
agent-skill asr chinese-asr claude-code claude-code-skill funasr interview-transcription meeting-transcription multilingual paraformer podcast-transcription skills-sh speaker-diarization speech-to-text whisper
- Host: GitHub
- URL: https://github.com/zxkane/audio-transcriber-funasr
- Owner: zxkane
- License: mit
- Created: 2026-04-13T04:14:26.000Z (2 months ago)
- Default Branch: main
- Last Pushed: 2026-04-25T09:06:32.000Z (about 2 months ago)
- Last Synced: 2026-04-25T09:24:25.984Z (about 2 months ago)
- Topics: agent-skill, asr, chinese-asr, claude-code, claude-code-skill, funasr, interview-transcription, meeting-transcription, multilingual, paraformer, podcast-transcription, skills-sh, speaker-diarization, speech-to-text, whisper
- Language: Python
- Size: 116 KB
- Stars: 0
- Watchers: 0
- Forks: 1
- Open Issues: 1
Metadata Files:
- Readme: README.md
- License: LICENSE
# FunASR Audio Transcriber
Claude Code plugin for multi-speaker meeting and podcast transcription with automatic speaker diarization and LLM cleanup, powered by [FunASR](https://github.com/modelscope/FunASR).
## Features
- **Meetings & podcasts** — Handles large meetings (10+ speakers) and podcasts/interviews (2–3 speakers) with CAM++ speaker diarization, `--num-speakers` hint, real name mapping, and speaker context for LLM identification
- **Hotword biasing** — SeACo-Paraformer accepts participant names and domain terms to improve recognition accuracy (+50% on tested Chinese terms)
- **Multi-language** — Chinese (SeACo-Paraformer, CER 1.95%), English (Paraformer-en), auto-detect (SenseVoiceSmall: zh/en/ja/ko/yue), or 99 languages (Whisper-large-v3-turbo)
- **Long recordings** — Handles 4+ hour recordings without splitting (includes spectral clustering performance patch)
- **LLM cleanup** — Bedrock Claude removes fillers, fixes ASR errors, polishes grammar
- **GPU & CPU** — Auto-detects CUDA; fully functional on CPU with [low-memory guidance](plugins/funasr-transcriber/skills/funasr-transcribe/references/pipeline-details.md#running-on-cpu-only--low-memory-machines)
- **Resume support** — Checkpoint at every phase for interrupted runs
- **Local MiMo-V2.5-ASR support (new in 1.7.0)**: opt-in `--lang mimo` runs Xiaomi's 8B ASR model locally on a CUDA GPU for dialect-heavy or code-switching audio, with diarization preserved via FSMN VAD + CAM++. Requires Python 3.12, ≥20 GB VRAM, and `INSTALL_MIMO=1 bash setup_env.sh`.
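
The "Resume support" feature above checkpoints each phase so an interrupted run can pick up where it stopped. The general pattern can be sketched as below. This is a minimal illustration, not the plugin's actual code: `run_phase` and `fake_asr` are hypothetical names, and the JSON checkpoint layout is an assumption.

```python
import json
import tempfile
from pathlib import Path
from typing import Callable


def run_phase(checkpoint: Path, compute: Callable[[], dict]) -> dict:
    """Return the saved result if the checkpoint exists, else compute and save it."""
    if checkpoint.exists():
        return json.loads(checkpoint.read_text(encoding="utf-8"))
    result = compute()
    # Write to a temporary file first, then rename, so an interrupted run
    # never leaves a half-written checkpoint behind.
    tmp = checkpoint.with_suffix(checkpoint.suffix + ".tmp")
    tmp.write_text(json.dumps(result, ensure_ascii=False), encoding="utf-8")
    tmp.replace(checkpoint)
    return result


calls = []


def fake_asr() -> dict:
    """Hypothetical stand-in for the expensive ASR phase."""
    calls.append("asr")
    return {"segments": [{"speaker": 0, "text": "hello"}]}


with tempfile.TemporaryDirectory() as workdir:
    ckpt = Path(workdir) / "raw_transcript.json"
    first = run_phase(ckpt, fake_asr)   # computes and writes the checkpoint
    second = run_phase(ckpt, fake_asr)  # loads from disk, no recompute
```

In this sketch the second call never invokes `fake_asr`, which is the property that makes resuming a multi-hour run cheap.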
## Installation
### As Agent Skill (via [skills.sh](https://skills.sh))
```bash
npx skills add zxkane/audio-transcriber-funasr
```
### As Claude Code Plugin
Add as a marketplace, then install:
```bash
# In Claude Code
/plugin marketplace add zxkane/audio-transcriber-funasr
/plugin install funasr-transcriber@zxkane-audio-transcriber-funasr
```
### Manual Usage
```bash
# 1. Set up environment (auto-detects GPU/CPU)
bash plugins/funasr-transcriber/skills/funasr-transcribe/scripts/setup_env.sh
source .venv/bin/activate

# 2. Convert audio to 16kHz mono FLAC (lossless, ~50% smaller than WAV)
ffmpeg -i recording.m4a -ar 16000 -ac 1 -sample_fmt s16 meeting.flac

# 3. Chinese meeting, 9 speakers
python3 plugins/funasr-transcriber/skills/funasr-transcribe/scripts/transcribe_funasr.py \
  meeting.flac --lang zh --num-speakers 9

# 4. English meeting, with real names
python3 plugins/funasr-transcriber/skills/funasr-transcribe/scripts/transcribe_funasr.py \
  meeting.flac --lang en --speakers "Alice,Bob,Carol,Dave"

# 5. English podcast, 2 speakers
python3 plugins/funasr-transcriber/skills/funasr-transcribe/scripts/transcribe_funasr.py \
  episode.flac --lang en --num-speakers 2 --speakers "Host,Guest" \
  --title "Podcast Transcript"

# 6. Auto-detect language (zh/en/ja/ko/yue)
python3 plugins/funasr-transcriber/skills/funasr-transcribe/scripts/transcribe_funasr.py \
  meeting.flac --lang auto --num-speakers 6
```
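
When driving these steps from a script rather than a shell, the commands can be assembled as argument lists. The sketch below uses only the flags shown in the manual-usage commands; the helper names `ffmpeg_to_flac_cmd` and `transcribe_cmd` are hypothetical and not part of the plugin.

```python
import shlex
from typing import Optional

# Script path as it appears in the manual-usage commands.
SCRIPT = "plugins/funasr-transcriber/skills/funasr-transcribe/scripts/transcribe_funasr.py"


def ffmpeg_to_flac_cmd(src: str, dst: str) -> list[str]:
    """Build the 16 kHz, mono, 16-bit conversion command from step 2."""
    return ["ffmpeg", "-i", src, "-ar", "16000", "-ac", "1", "-sample_fmt", "s16", dst]


def transcribe_cmd(
    audio: str,
    lang: str,
    num_speakers: Optional[int] = None,
    speakers: Optional[list[str]] = None,
    title: Optional[str] = None,
) -> list[str]:
    """Build a transcribe_funasr.py command using only flags shown in the README."""
    cmd = ["python3", SCRIPT, audio, "--lang", lang]
    if num_speakers is not None:
        cmd += ["--num-speakers", str(num_speakers)]
    if speakers:
        cmd += ["--speakers", ",".join(speakers)]
    if title:
        cmd += ["--title", title]
    return cmd


if __name__ == "__main__":
    # Print the commands instead of running them, so the sketch has no side effects.
    print(shlex.join(ffmpeg_to_flac_cmd("recording.m4a", "meeting.flac")))
    print(shlex.join(transcribe_cmd("episode.flac", "en", 2, ["Host", "Guest"], "Podcast Transcript")))
```

Each list can be passed to `subprocess.run(cmd, check=True)`. Argument lists avoid shell-quoting problems with values such as `"Podcast Transcript"`.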
## Plugin Structure
```
.
├── .claude-plugin/
│   └── marketplace.json              # skills.sh marketplace registration
├── .claude/
│   └── skills/
│       └── funasr-transcribe -> ../../plugins/funasr-transcriber/skills/funasr-transcribe
├── plugins/
│   └── funasr-transcriber/
│       └── skills/
│           └── funasr-transcribe/
│               ├── SKILL.md                  # Skill entry point
│               ├── references/
│               │   └── pipeline-details.md
│               └── scripts/
│                   ├── transcribe_funasr.py  # Main pipeline
│                   ├── patch_clustering.py   # Long-audio perf fix
│                   └── setup_env.sh          # One-click env setup
├── CLAUDE.md
├── README.md
└── .gitignore
```
## Pipeline
```
Audio (.m4a/.mp3) ─► ffmpeg ─► 16kHz WAV
                                   │
Phase 1: FunASR ASR                │
  ├─ FSMN-VAD (voice detection)    │
  ├─ ASR model (lang-dependent)    ├─► raw_transcript.json
  ├─ Hotword biasing (zh only)     │
  ├─ Punctuation restoration       │
  └─ CAM++ (speaker clustering)    │
                                   │
Phase 2: Post-process              │
  ├─ Merge consecutive utterances  ├─► merged segments
  └─ Map speaker IDs to names      │
                                   │
Phase 3: LLM cleanup (optional)    │
  └─ Bedrock Claude                └─► transcript.md
```
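
Phase 2 of the diagram (merge consecutive utterances, then map speaker IDs to names) can be sketched as follows. This is an illustration under stated assumptions, not the plugin's code: the `Segment` fields, the ID-order name mapping, and the `Speaker N` fallback are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    speaker: int   # diarization cluster ID (assumed 0-based)
    start: float   # seconds
    end: float     # seconds
    text: str


def merge_consecutive(segments: list[Segment]) -> list[Segment]:
    """Fold back-to-back utterances from the same speaker into one segment."""
    merged: list[Segment] = []
    for seg in segments:
        if merged and merged[-1].speaker == seg.speaker:
            prev = merged[-1]
            # Joining with a space suits English; Chinese text would be joined without one.
            merged[-1] = Segment(prev.speaker, prev.start, seg.end, f"{prev.text} {seg.text}")
        else:
            merged.append(seg)
    return merged


def map_names(segments: list[Segment], names: list[str]) -> list[tuple[str, str]]:
    """Replace speaker IDs with names; IDs without a name fall back to 'Speaker N'."""
    labelled = []
    for seg in segments:
        if seg.speaker < len(names):
            label = names[seg.speaker]
        else:
            label = f"Speaker {seg.speaker + 1}"
        labelled.append((label, seg.text))
    return labelled


raw = [
    Segment(0, 0.0, 1.2, "Welcome back."),
    Segment(0, 1.3, 2.5, "Today we talk about ASR."),
    Segment(1, 2.6, 3.4, "Thanks for having me."),
    Segment(2, 3.5, 4.0, "Hi."),
]
merged = merge_consecutive(raw)
labelled = map_names(merged, ["Host", "Guest"])
```

Merging first keeps the later LLM cleanup input shorter, since each speaker turn becomes a single block instead of many short fragments.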
## Performance
Benchmarked on a 4h14m, 9-speaker Chinese meeting recording:
| Phase | GPU (L40S) | CPU |
|-------|-----------|-----|
| Model load | 14s | ~30s |
| Transcription | 169s | ~30-60 min |
| Clustering (patched) | ~10s | ~2-5 min |
| LLM cleanup (17 chunks) | ~35 min | ~35 min |
**Without the clustering patch**, speaker clustering on long audio takes 10+ hours.
The patch replaces O(N^3) `scipy.linalg.eigh` with O(N^2·k) `scipy.sparse.linalg.eigsh`.
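
The difference between the two solvers can be seen on a toy graph Laplacian. The sketch below is not the contents of `patch_clustering.py`: the toy data, the helper names, and the `which="SA"` argument are assumptions chosen for illustration. It shows only that `eigsh` returns the same smallest eigenvalues as a full `eigh` decomposition while computing just `k` of them.

```python
import numpy as np
from scipy.linalg import eigh
from scipy.sparse.linalg import eigsh


def toy_laplacian(n_per_cluster: int = 30, seed: int = 0) -> np.ndarray:
    """Unnormalized Laplacian L = D - W for three loose 1-D clusters."""
    rng = np.random.default_rng(seed)
    centers = np.array([0.0, 3.0, 6.0])
    points = np.concatenate([c + 0.3 * rng.standard_normal(n_per_cluster) for c in centers])
    dist2 = (points[:, None] - points[None, :]) ** 2
    affinity = np.exp(-dist2 / 2.0)   # Gaussian affinity between points
    np.fill_diagonal(affinity, 0.0)   # no self-loops
    return np.diag(affinity.sum(axis=1)) - affinity


def smallest_full(laplacian: np.ndarray, k: int):
    """Full decomposition: computes all N eigenpairs, then keeps the k smallest."""
    vals, vecs = eigh(laplacian)      # eigenvalues come back in ascending order
    return vals[:k], vecs[:, :k]


def smallest_partial(laplacian: np.ndarray, k: int):
    """Partial decomposition: asks the solver for only the k smallest eigenpairs."""
    vals, vecs = eigsh(laplacian, k=k, which="SA")  # SA = smallest algebraic
    order = np.argsort(vals)          # sort explicitly rather than rely on return order
    return vals[order], vecs[:, order]


L = toy_laplacian()
vals_full, _ = smallest_full(L, k=3)
vals_partial, _ = smallest_partial(L, k=3)
print(np.max(np.abs(vals_full - vals_partial)))
```

On a 90 × 90 matrix the speed difference is negligible. The saving matters for the long recordings described above, where N is large and only a few eigenvectors are needed.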
## License
MIT