{"id":51027220,"url":"https://github.com/deepgram/llm-smart-formatting-benchmark","last_synced_at":"2026-06-21T20:30:53.263Z","repository":{"id":358097979,"uuid":"1236831003","full_name":"deepgram/llm-smart-formatting-benchmark","owner":"deepgram","description":null,"archived":false,"fork":false,"pushed_at":"2026-05-15T17:49:57.000Z","size":24110,"stargazers_count":0,"open_issues_count":1,"forks_count":0,"subscribers_count":0,"default_branch":"main","last_synced_at":"2026-05-15T19:59:14.105Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":null,"language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/deepgram.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null,"notice":null,"maintainers":null,"copyright":null,"agents":null,"dco":null,"cla":null}},"created_at":"2026-05-12T16:02:01.000Z","updated_at":"2026-05-15T17:50:01.000Z","dependencies_parsed_at":null,"dependency_job_id":null,"html_url":"https://github.com/deepgram/llm-smart-formatting-benchmark","commit_stats":null,"previous_names":["deepgram/llm-smart-formatting-benchmark"],"tags_count":null,"template":false,"template_full_name":null,"purl":"pkg:github/deepgram/llm-smart-formatting-benchmark","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/deepgram%2Fllm-smart-formatting-benchmark","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/deepgram%2Fllm-smart-formatting-benchmark/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/deepgram%2Fllm-smart-formatting-benchmark/releases","manifests_url
":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/deepgram%2Fllm-smart-formatting-benchmark/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/deepgram","download_url":"https://codeload.github.com/deepgram/llm-smart-formatting-benchmark/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/deepgram%2Fllm-smart-formatting-benchmark/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":34625624,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-05-26T15:22:16.424Z","status":"online","status_checked_at":"2026-06-21T02:00:05.568Z","response_time":54,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2026-06-21T20:30:51.287Z","updated_at":"2026-06-21T20:30:53.254Z","avatar_url":"https://github.com/deepgram.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# smart-formatting-llm-benchmark\n\nBenchmarks for **smart formatting** — turning raw spoken-form transcripts\n(`\"my number is two four eight...\"`) into clean written form\n(`\"My number is (248) 123-4567.\"`).\n\nThree evals, each with its own CLI:\n\n| Eval | Where | Measures |\n| --- | --- | --- |\n| LLM post-processing | `runner/` + `evaluator/` | Frontier LLMs as a formatting step after STT (text in, text out). 
|\n| Fine-tuned small models | `finetune/` | LoRA fine-tunes on Together.ai (Gemma, Llama, Qwen, Mercury). |\n| Competitor STT | `competitor-formatting/` | Deepgram vs ElevenLabs / OpenAI / Azure / Google / Soniox, on audio. |\n\nAlso `iterate/` for fast prompt tuning against one or two cheap models.\nTask contract: [`GUIDELINE.md`](GUIDELINE.md). Prompt guidance:\n[`PROMPT_GUIDE.md`](PROMPT_GUIDE.md). Latest fine-tune writeup:\n[`finetune/together_ai_v3.md`](finetune/together_ai_v3.md).\n\n## Install\n\nPython 3.11+. `uv sync` (or `pip install -e .`). Set only the API keys\nyou need per eval below.\n\n---\n\n## 1. LLM post-processing (`runner` + `evaluator`)\n\n```bash\nexport OPENROUTER_API_KEY=...   # runner\nexport ANTHROPIC_API_KEY=...    # evaluator's LLM judge\n```\n\n```bash\nuv run runner list-models                                   # registered models + slugs\nuv run runner show-prompt                                   # base prompt + hash\nuv run runner run --dry-run --limit 3 --models claude-opus-4-7   # no API calls\nuv run runner run --models all --concurrency 16 --parallel-models 2\nuv run runner resume \u003crun_id\u003e                               # re-run, skip done rows\nuv run evaluator score --responses results/\u003crun_id\u003e         # writes scored.csv, summary.csv, report.md\n```\n\nOutput lands in `results/\u003crun_id\u003e/`: `responses.csv`,\n`run_manifest.json`, plus `scored.csv` / `summary.csv` /\n`canonical.csv` / `report.md` after scoring. 
Resume is keyed on\n`(model_id, sample_id)`; delete failed rows from `responses.csv` to\nretry.\n\nFour scorers per row: exact match, entity-class regex, Claude Opus 4.7\njudge for accuracy (`pass | style_violation | numeric_drift |\nwrong_value | other`, plus `catastrophic` + `promptability`), and the\nsame judge for hallucination (`none | minor_addition | dropped_content\n| fabricated`).\n\n### Baseline + chained\n\nThe existing Deepgram pipeline (Impeller `/v2/read` → entity-tag\nadapter → Stem `/dev/format-entities`) serves as the baseline. Start Impeller\nand Stem locally with Cargo (or via `docker-compose.baseline.yml`),\nthen:\n\n```bash\nuv run runner baseline --limit 50 --run-id impeller-stem-smoke\nuv run runner chained --models qwen3-32b-groq --prompt prompts/system_prompt.md   # baseline → LLM cleanup\nuv run runner determinism --models qwen3-32b-groq --prompt iterate/results/iter-009-XXXX/prompt.txt --trials 100\n```\n\nBoth `baseline` and `chained` write a normal `responses.csv`, so\n`evaluator score` works the same on them. The baseline ignores\n`formatting_prompt` (no prompt channel in the existing pipeline).\n\n### Editing models / prompts\n\n- `runner/models.py` is the source of truth. `model_id` is the unique key written everywhere; make a new entry rather than reusing one.\n- `runner/prompts.py::BASE_PROMPT` is the always-on system prompt; its 12-char SHA-256 is recorded per row, so old/new runs stay distinguishable.\n- Reasoning is **off by default**: latency matters more than peak quality for this task.\n- A few OpenRouter slugs are speculative; verify against `openrouter.ai/api/v1/models` before a real spend.\n\n---\n\n## 2. 
Fine-tuning (`finetune`)\n\n```bash\nexport TOGETHER_API_KEY=...\nexport ANTHROPIC_API_KEY=...\n```\n\nOne-shot (`split → upload → train → wait → infer → score`):\n\n```bash\nuv run finetune all --base-model meta-llama/Llama-3.2-3B-Instruct\n```\n\nOr step by step: `split`, `upload`, `train`, `status`, `deploy`,\n`infer`, `score`, `stop` (or `deploy-eval-stop` for the last three).\nPer-run artifacts (job ids, endpoint info) land under\n`finetune/runs/\u003crun_name\u003e/`. Existing runs: Gemma 3 (270m, 1b),\nLlama 3.2 3B, Qwen3 8B, Qwen3.5 9B, Mercury 2. Data augmentation\npasses live in `finetune/augment*.py`.\n\n---\n\n## 3. Competitor STT (`competitor-formatting`)\n\n160 audio clips (10 per entity class), synthesized with Deepgram\nAura-2 TTS, transcribed by each provider, judged by Claude Opus 4.7.\nSee [`competitor-formatting/README.md`](competitor-formatting/README.md).\n\n```bash\nexport DEEPGRAM_API_KEY=...                      # TTS + Deepgram STT\n# Optional, one per provider you want to transcribe with:\nexport ELEVENLABS_API_KEY=...\nexport OPENAI_API_KEY=...\nexport AZURE_SPEECH_KEY=...\nexport GOOGLE_API_KEY=...\nexport SONIOX_API_KEY=...\nexport ANTHROPIC_API_KEY=...                     # judge\n\nuv run python competitor-formatting/synthesize.py\nuv run python competitor-formatting/transcribe.py --providers all\nfor p in deepgram elevenlabs openai azure google soniox; do\n  uv run python competitor-formatting/score.py --provider \"$p\"\ndone\n```\n\nAll three steps are resumable.\n\n---\n\n## Prompt iteration (`iterate`)\n\nTight loop while editing `prompts/system_prompt.md`. 
Runs a fixed\nstratified subset against cheap models, appends to a\nper-prompt-hash leaderboard.\n\n```bash\nuv run iterate run --prompt prompts/system_prompt.md\nuv run iterate show --top 10\nuv run iterate failures iterate/results/iter-009-XXXX --model qwen3-32b-groq\nuv run iterate matrix --prompts prompts/variants/A.md,prompts/variants/B.md \\\n                     --models qwen3-32b-groq,gpt-oss-120b-groq\n```\n\n`iterate/` deliberately bypasses `runner/prompts.py` and assembles its\nown `\u003ctranscript\u003e...\u003c/transcript\u003e`-spotlit messages — keep that in\nmind if you change message construction in either place.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fdeepgram%2Fllm-smart-formatting-benchmark","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fdeepgram%2Fllm-smart-formatting-benchmark","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fdeepgram%2Fllm-smart-formatting-benchmark/lists"}