{"id":50554132,"url":"https://github.com/arvarik/trebek","last_synced_at":"2026-06-04T05:30:37.909Z","repository":{"id":353707115,"uuid":"1055977991","full_name":"arvarik/trebek","owner":"arvarik","description":"A highly resilient, fault-tolerant data extraction pipeline daemon for transcribing and extracting structured game events from Jeopardy! episodes.","archived":false,"fork":false,"pushed_at":"2026-05-07T21:07:23.000Z","size":604,"stargazers_count":1,"open_issues_count":0,"forks_count":0,"subscribers_count":0,"default_branch":"main","last_synced_at":"2026-05-07T23:16:08.102Z","etag":null,"topics":["data-extraction","google-gemini","jeopardy","llm","machine-learning","pipeline","python","sqlite","whisperx"],"latest_commit_sha":null,"homepage":"https://pypi.org/project/trebek/","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"agpl-3.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/arvarik.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null,"notice":null,"maintainers":null,"copyright":null,"agents":null,"dco":null,"cla":null}},"created_at":"2025-09-13T06:31:38.000Z","updated_at":"2026-05-07T21:07:27.000Z","dependencies_parsed_at":null,"dependency_job_id":null,"html_url":"https://github.com/arvarik/trebek","commit_stats":null,"previous_names":["arvarik/trebek"],"tags_count":3,"template":false,"template_full_name":null,"purl":"pkg:github/arvarik/trebek","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/arvarik%2Ftrebek","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/arvarik%2Ftrebek/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/arvarik%2Ftrebek/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/arvarik%2Ftrebek/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/arvarik","download_url":"https://codeload.github.com/arvarik/trebek/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/arvarik%2Ftrebek/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":33891721,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-05-26T15:22:16.424Z","status":"online","status_checked_at":"2026-06-04T02:00:06.755Z","response_time":64,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["data-extraction","google-gemini","jeopardy","llm","machine-learning","pipeline","python","sqlite","whisperx"],"created_at":"2026-06-04T05:30:36.854Z","updated_at":"2026-06-04T05:30:37.904Z","avatar_url":"https://github.com/arvarik.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"\u003cdiv align=\"center\"\u003e\n\n```\n ╺┳╸┏━┓┏━╸┏┓ ┏━╸┃┏╸\n  ┃ ┣┳┛┣╸ ┣┻┓┣╸ ┣┻┓\n  ╹ ╹┗╸┗━╸┗━┛┗━╸╹ ╹\n```\n\n**The definitive multimodal AI pipeline for extracting structured game data from J! episodes.**\n\n\u003ca href=\"https://github.com/arvarik/trebek/actions/workflows/ci.yml\"\u003e\n  \u003cimg alt=\"CI\" src=\"https://github.com/arvarik/trebek/actions/workflows/ci.yml/badge.svg\" /\u003e\n\u003c/a\u003e\n\u003ca href=\"https://pypi.org/project/trebek/\"\u003e\n  \u003cimg alt=\"PyPI\" src=\"https://img.shields.io/pypi/v/trebek\" /\u003e\n\u003c/a\u003e\n\u003ca href=\"https://pypi.org/project/trebek/\"\u003e\n  \u003cimg alt=\"PyPI - Downloads\" src=\"https://img.shields.io/pypi/dm/trebek\" /\u003e\n\u003c/a\u003e\n\u003cimg alt=\"Python 3.11+\" src=\"https://img.shields.io/badge/python-3.11%2B-blue?logo=python\u0026logoColor=white\" /\u003e\n\u003cimg alt=\"Ruff\" src=\"https://img.shields.io/endpoint?url=https://raw.githubusercontent.com/astral-sh/ruff/main/assets/badge/v2.json\" /\u003e\n\u003cimg alt=\"Mypy\" src=\"https://img.shields.io/badge/types-Mypy-blue.svg\" /\u003e\n\u003cimg alt=\"License\" src=\"https://img.shields.io/badge/license-AGPL_3.0-green\" /\u003e\n\n\u003c/div\u003e\n\n---\n\n## What is Trebek?\n\nTrebek is an advanced, fault-tolerant pipeline that processes **raw J! video recordings** — not scraped web pages — and produces a **surgically clean, event-sourced relational dataset** of every game event that occurred on screen. It bridges local GPU compute (WhisperX, Pyannote), cloud LLMs (Google Gemini 3.1), and a deterministic Python state machine into a single, continuously running daemon.\n\nThe resulting dataset doesn't just capture questions and answers. It captures the full *cognitive fingerprint* of each game:\n\n- ⚡ **Millisecond-precision buzzer latencies** — cross-referenced from visual podium illumination and acoustic buzz detection\n- 🗣️ **Speech disfluency tracking** — ums, uhs, and stutters from WhisperX logprobs, not LLM hallucinations\n- 🎲 **Game-theory optimal wager analysis** — calculated wagers compared against actual contestant choices\n- 🔍 **Post-extraction verification** — Stage 5.5 cross-validates every clue against transcript context to correct ASR errors\n- 🧠 **Semantic lateral distance** — cosine distance on embeddings distinguishing wordplay from direct recall *(schema ready, see `docs/embeddings_feature.md`)*\n- 🏗️ **Board control \u0026 Forrest Bounce detection** — strategic selection pattern analysis\n- 📊 **Coryat scores** — calculated deterministically per contestant per episode\n\n---\n\n## Trebek vs. J-Archive\n\nExisting J! datasets are **static text scrapes** — frozen lists of clues and responses with no temporal, behavioral, or strategic context. Trebek extracts from the *raw video*, producing an entirely different class of dataset.\n\n| Dimension | J-Archive / Scraped Data | Trebek |\n|-----------|--------------------------|--------|\n| **Source** | Web scraping | Raw video processing |\n| **Buzzer timing** | ❌ Not available | ✅ True ms-precision latency |\n| **Speech patterns** | ❌ Not available | ✅ Disfluency counts, acoustic confidence |\n| **Wager analysis** | Partial (raw numbers only) | ✅ Game-theory optimal + irrationality delta |\n| **Board control** | ❌ Not available | ✅ Full selection order + Forrest Bounce index |\n| **Score adjustments** | Sometimes noted | ✅ Chronologically anchored to exact clue index |\n| **Visual clues** | Text description | ✅ Multimodal extraction from video frames |\n| **Semantic analysis** | ❌ Not available | ✅ Embedding schema ready (see `docs/embeddings_feature.md`) |\n| **Data format** | Flat HTML / CSV | ✅ Normalized relational DB (9 tables) |\n| **Freshness** | Depends on scraper maintenance | ✅ Process your own recordings on demand |\n| **Coryat scores** | Manual fan calculation | ✅ Deterministic, per-contestant |\n\n---\n\n## Who Is This For?\n\n\u003ctable\u003e\n\u003ctr\u003e\n\u003ctd width=\"33%\" align=\"center\"\u003e\n\n### 🎯 Trivia Enthusiasts\n\nExplore your favorite episodes with deep analytics. Query buzzer speeds, track contestant strategies, and discover board control patterns across seasons.\n\n\u003c/td\u003e\n\u003ctd width=\"33%\" align=\"center\"\u003e\n\n### 📊 Data Scientists\n\nA richly normalized relational dataset designed for analytical queries. 9 tables, foreign keys, embeddings, and temporal data — ready for your notebooks.\n\n\u003c/td\u003e\n\u003ctd width=\"33%\" align=\"center\"\u003e\n\n### 🤖 ML Engineers\n\nTrain predictive models on human decision-making under televised pressure. Buzzer latency, wager irrationality, disfluency signals — features you can't get anywhere else.\n\n\u003c/td\u003e\n\u003c/tr\u003e\n\u003c/table\u003e\n\n---\n\n## ✨ Feature Highlights\n\n### 🔄 True Crash Immunity\nDatabase-backed queueing via SQLite `pipeline_state`. Kill the daemon at any point — `SIGINT`, `SIGTERM`, crash, power failure — and it resumes exactly where it left off. Zero data loss. Zero re-processing.\n\n### 🧠 Multi-Pass LLM Architecture\n- **Pass 1** (Flash-Lite): Speaker anchoring from host interview audio\n- **Pass 2** (Pro): Manifest-Verify-Fill structured extraction with category gap detection\n- **Stage 5.5** (Flash-Lite): Post-extraction verification — cross-validates every clue against transcript context, corrects ASR errors, normalizes response formatting, and tracks `is_verified` / `original_response` metadata\n- **Pass 3** (Pro): Multimodal visual clue reconstruction + podium illumination detection\n\n### ⚙️ Deterministic State Machine\nPure Python `TrebekStateMachine` replays game events chronologically. LLMs extract facts; the state machine does all arithmetic. Running scores, True Daily Double resolution, Coryat scores, and game-theory optimal wagers — all calculated deterministically.\n\n### 🎯 Deterministic Inference\nInstead of relying on LLM hallucinations for grid positions, Trebek parses exact dollar values (\"for $800\") and deterministically maps them to the correct board row based on the round format. Response formats are strictly normalized into J! question form.\n\n### 🔥 Warm Worker GPU Architecture\nPyTorch/WhisperX model weights stay resident in VRAM. No cold starts. Automatic OOM recovery with pool restarts. Explicit memory management for multi-day inference runs.\n\n### 🎯 Physics Engine\nCross-references visual podium illumination (Gemini Vision) with WhisperX acoustic boundaries to compute true contestant reaction speeds. Also calculates acoustic confidence scores, brain freeze durations, and semantic lateral distance.\n\n### 🗄️ Actor-Pattern Database\nAll SQLite writes serialized through a single `DatabaseWriter` actor (`asyncio.Queue` + `Future`). No `database is locked` exceptions. Atomic transactions for high-throughput batched commits.\n\n---\n\n## 🚀 Quick Start\n\nThe fastest way to get Trebek running is using the official Docker image via Hybrid Mode. The lightweight CLI runs on your host, while the heavy GPU workloads (PyTorch, WhisperX) are safely delegated to the `ghcr.io` container.\n\n```bash\n# 1. Install lightweight CLI\npip install trebek\n\n# 2. Configure (requires a free Gemini API key)\necho \"GEMINI_API_KEY=your_key_here\" \u003e .env\n\n# 3. Run with Docker GPU delegation\ntrebek run --input-dir /path/to/your/videos --docker\n```\n\n\u003e **📖 Full installation guide**: See **[SETUP.md](SETUP.md)** for `docker-compose` deployments, native installations (no Docker), HuggingFace token configuration, and detailed CLI usage.\n\n\u003e **🏗️ Architecture deep-dive**: See **[DESIGN.md](DESIGN.md)** for the complete system architecture, data model, pipeline stages, and safety invariants.\n\n---\n\n## 📊 Stats Dashboard\n\nRun `trebek stats` for a live analytics dashboard showing pipeline health, cost tracking, stage timing, and recent episode status:\n\n```\n┌─ Pipeline Health ─────────────────────────────────────────┐\n│  ✅ COMPLETED  42    ⏳ PENDING  3    ❌ FAILED  1       │\n│  ████████████████████████████████████░░░░  91.3%          │\n├─ Cost \u0026 Performance ──────────────────────────────────────┤\n│  Tokens: 12.4M in / 2.1M out    Cost: $4.82 USD          │\n│  Peak VRAM: 14.2 GB    Avg GPU: 87%                      │\n├─ Stage Timing (avg) ──────────────────────────────────────┤\n│  transcribe: 4m 12s    extract: 2m 38s    verify: 0.4s   │\n└───────────────────────────────────────────────────────────┘\n```\n\n---\n\n## 🧪 Development\n\n```bash\nmake all          # Full quality gate (test + lint + typecheck)\nmake test         # pytest with coverage\nmake lint         # ruff check\nmake typecheck    # mypy strict mode\n```\n\n| Tool | Purpose |\n|------|---------|\n| **pytest** | Test runner (`pytest-asyncio` for async) |\n| **ruff** | Linter + formatter (line-length 120) |\n| **mypy** | Static type checker (strict mode) |\n| **pre-commit** | Git hook enforcement |\n\n---\n\n## 📁 Project Structure\n\n```\ntrebek/\n├── trebek/\n│   ├── cli.py              # CLI parser + Docker orchestration\n│   ├── config.py           # Pydantic Settings + model constants + pricing\n│   ├── schemas.py          # Pydantic v2 data contracts (Episode, Clue, etc.)\n│   ├── schema.sql          # SQLite DDL (9 tables + schema_version)\n│   ├── state_machine.py    # Deterministic game state replay\n│   ├── status.py           # Pipeline status enum (StrEnum)\n│   ├── database/           # Actor-pattern writer + relational commit ops\n│   ├── gpu/                # Warm Worker pool + VRAM management\n│   ├── llm/                # Multi-pass Gemini extraction (anchoring, extraction, verify, multimodal)\n│   ├── pipeline/           # Async orchestrator + stage workers (ingestion, gpu, llm, state_machine)\n│   ├── analysis/           # Post-extraction analytics (buzzer physics, embeddings math)\n│   └── ui/                 # Rich console dashboard + rendering\n├── tests/                  # Comprehensive test suite (512+ tests)\n├── scripts/                # Local testing + validation utilities\n├── docs/                   # Design docs, embedding feature plan, archived architecture\n├── Dockerfile              # GPU-enabled container (CUDA + WhisperX + Pyannote)\n├── docker-compose.yml      # One-command deployment\n├── Makefile                # Developer shortcuts (test, lint, typecheck)\n└── pyproject.toml          # Build system + tool config\n```\n\n---\n\n## 📄 License\n\n[AGPL-3.0](LICENSE)\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Farvarik%2Ftrebek","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Farvarik%2Ftrebek","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Farvarik%2Ftrebek/lists"}