{"id":50216104,"url":"https://github.com/jsleekr/youtube-subtitle-extractor","last_synced_at":"2026-05-26T09:03:43.807Z","repository":{"id":351238659,"uuid":"1209389379","full_name":"JSLEEKR/youtube-subtitle-extractor","owner":"JSLEEKR","description":"A pipeline that converts YouTube videos and channels into Korean knowledge bundles (video, transcript, translation, research document, pro/con debate). Claude Code skills orchestrate yt-dlp + faster-whisper.","archived":false,"fork":false,"pushed_at":"2026-04-14T06:05:39.000Z","size":30,"stargazers_count":0,"open_issues_count":0,"forks_count":0,"subscribers_count":0,"default_branch":"main","last_synced_at":"2026-04-14T08:12:19.502Z","etag":null,"topics":["agent","claude-code","faster-whisper","knowledge-base","korean","llm","pipeline","transcription","translation","whisper","youtube","yt-dlp"],"latest_commit_sha":null,"homepage":null,"language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/JSLEEKR.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null,"notice":null,"maintainers":null,"copyright":null,"agents":null,"dco":null,"cla":null}},"created_at":"2026-04-13T11:32:22.000Z","updated_at":"2026-04-14T06:05:44.000Z","dependencies_parsed_at":null,"dependency_job_id":null,"html_url":"https://github.com/JSLEEKR/youtube-subtitle-extractor","commit_stats":null,"previous_names":["jsleekr/youtube-subtitle-extractor"],"tags_count":null,"template":false,"template_full_name":null,"purl":"pkg:github/JSLEEKR/youtube-subtitle-extractor","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/JSLEEKR%2Fyoutube-subtitle-extractor","tags_url":"https://repos.ecosyste.ms/api/v1/ho
sts/GitHub/repositories/JSLEEKR%2Fyoutube-subtitle-extractor/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/JSLEEKR%2Fyoutube-subtitle-extractor/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/JSLEEKR%2Fyoutube-subtitle-extractor/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/JSLEEKR","download_url":"https://codeload.github.com/JSLEEKR/youtube-subtitle-extractor/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/JSLEEKR%2Fyoutube-subtitle-extractor/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":33512335,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-05-26T03:12:49.672Z","status":"ssl_error","status_checked_at":"2026-05-26T03:12:47.976Z","response_time":63,"last_error":"SSL_read: unexpected eof while reading","robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":false,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["agent","claude-code","faster-whisper","knowledge-base","korean","llm","pipeline","transcription","translation","whisper","youtube","yt-dlp"],"created_at":"2026-05-26T09:03:38.245Z","updated_at":"2026-05-26T09:03:43.799Z","avatar_url":"https://github.com/JSLEEKR.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# youtube-subtitle-extractor\n\n\u003e A pipeline that turns YouTube videos and channels into **Korean knowledge bundles**.\n\u003e From the raw 
video to an English transcript, a natural Korean translation, a research-backed article, and a 3-round pro/con debate — all generated in one pass.\n\nBuilt from **Claude Code skills plus four small Python scripts**. Deterministic work (video/subtitle download, Whisper transcription) runs in the scripts; language work (translation, research, debate) is performed directly by Claude Code skills inside the session.\n\n---\n\n## ✨ What you get\n\nFor each video, the following six files land in `output/\u003cchannel_handle\u003e/\u003cupload_date\u003e_\u003cvideo_id\u003e/`:\n\n| File | Contents |\n|---|---|\n| `video.mp4` | Original YouTube video (best quality via yt-dlp) |\n| `transcript_en.txt` | Official English subtitles, or a Whisper transcript as fallback |\n| `transcript_ko.md` | Natural Korean translation (idiomatic, not literal; proper nouns kept alongside) |\n| `document.md` | Blog-article-style research document (sources verified via WebSearch) |\n| `debate.md` | 3-round pro/con debate plus synthesis (each round explicitly rebuts the previous one) |\n| `meta.json` | Metadata (title, upload date, duration, subtitle source, etc.) 
|\n\nIn channel mode, the bundle above is generated for every video in the time window, and the channel `README.md` is auto-updated as a dashboard.\n\n---\n\n## 🏗️ Architecture\n\n```\n┌──────────────────────────────────────────┐\n│  Claude Code Skills (.claude/skills/)    │\n│  ┌────────────────┐  ┌────────────────┐  │\n│  │ extract-video  │  │extract-channel │  │\n│  └───────┬────────┘  └───────┬────────┘  │\n└──────────┼───────────────────┼───────────┘\n           │                   │\n           ▼                   ▼\n┌──────────────────────────────────────────┐\n│  Python scripts (scripts/)               │\n│  ┌──────────────┐  ┌──────────────────┐  │\n│  │ list_videos  │  │ fetch_video      │  │\n│  │ fetch_subs   │  │ transcribe       │  │\n│  └──────────────┘  └──────────────────┘  │\n└──────────────────────────────────────────┘\n```\n\n- **Script layer**: deterministic and idempotent. Each script prints a single JSON line to stdout and reports failure via a non-zero exit code. If the output file already exists, it's skipped.\n- **Skill layer**: language tasks (translation, research, debate) are performed inside the Claude Code session. Sources are verified and cited via `WebSearch`. 
The skills themselves are orchestration rather than implementation, so logic changes are easy.\n\n---\n\n## 🚀 Quick start\n\n### Setup\n\n```bash\n# Install dependencies\npython -m venv .venv\n# Windows:\n.venv\\Scripts\\activate\n# macOS/Linux:\nsource .venv/bin/activate\npip install -r requirements-dev.txt\n```\n\nSystem requirements:\n\n- **Python 3.10+**\n- **ffmpeg** — must be on PATH (required by yt-dlp for audio/video extraction)\n- **yt-dlp** — installed via `pip install`\n- **faster-whisper** — uses large-v3 on GPU (CUDA) when available, falls back to CPU int8 otherwise\n\n### Usage (inside a Claude Code session)\n\nProcess a single video:\n```\n/extract-video https://www.youtube.com/watch?v=\u003cid\u003e\n```\n\nChannel mode (all videos from the last 30 days):\n```\n/extract-channel https://www.youtube.com/@\u003chandle\u003e --days 30\n```\n\nOptions:\n- `--days N` — only videos from the last N days (default 30)\n- `--limit N` — cap at N videos (for testing)\n- `--skip-debate` — skip debate generation\n\nResults accumulate under `output/\u003cchannel_handle\u003e/`.\n\n---\n\n## 🔁 Pipeline stages (per video)\n\nThe nine stages orchestrated by `.claude/skills/extract-video/SKILL.md`:\n\n1. **Metadata resolution** — parse video_id, title, upload_date, channel_handle via `yt-dlp`\n2. **Write `meta.json`** — skipped if it already exists (idempotency)\n3. **Video download** — `fetch_video.py` → `video.mp4` (pipeline continues on failure)\n4. **Secure English transcript** — try official subtitles via `fetch_subs.py`; on `exit 2`, fall back to `transcribe.py` (Whisper)\n5. **Korean translation** — `transcript_ko.md`, idiomatic, proper nouns / technical terms kept alongside\n6. **Research document** — `document.md`, with sources verified and cited via WebSearch\n7. **Pro/con debate** — `debate.md`, 3 rounds plus synthesis; each round explicitly rebuts the previous one\n8. **Channel README refresh** — regenerate the dashboard\n9. 
**Final report** — print generated/skipped files and output path\n\nFailure handling:\n- Step 3 (video download) failure is **non-fatal** — log the error and continue\n- Step 4 (transcript) failure is fatal — stop and report\n- Steps 5–7 failures are recorded in `meta.json`; partial results are preserved\n- Every stage is **idempotent** — rerunning picks up only the missing pieces\n\n---\n\n## 🧪 Tests\n\n```bash\npytest -v\n```\n\nUnit tests cover only the pure functions in `scripts/_common.py` (date-window filter, VTT parser, directory-name formatter). Script bodies are validated by integration smoke tests.\n\n---\n\n## 📁 Project layout\n\n```\nyoutube-subtitle-extractor/\n├── scripts/\n│   ├── _common.py         # Pure helpers (date filter, VTT parser, dir names)\n│   ├── list_videos.py     # Channel → filtered list of videos as JSON\n│   ├── fetch_video.py     # Download the source video → video.mp4\n│   ├── fetch_subs.py      # Official English subtitles → transcript_en.txt\n│   └── transcribe.py      # Audio download + Whisper → transcript_en.txt\n├── tests/\n│   ├── test_common.py     # Unit tests for the pure helpers\n│   └── fixtures/sample.vtt\n├── .claude/skills/\n│   ├── extract-video/SKILL.md\n│   └── extract-channel/SKILL.md\n├── docs/superpowers/      # Plan / spec documents\n├── requirements.txt       # Runtime: yt-dlp, faster-whisper\n├── requirements-dev.txt   # Dev: pytest\n└── README.md\n```\n\n`output/` is gitignored and is not committed to the repository.\n\n---\n\n## 🧠 Design principles\n\n1. **Each script does exactly one thing.** On success, emit a single JSON line on stdout; on failure, stderr plus a non-zero exit code.\n2. **Side effects live only inside `main()`.** Branching logic is moved into pure functions in `_common.py` so it stays testable.\n3. **Skills orchestrate; scripts are deterministic.** Scripts never call Claude.\n4. **Every stage is idempotent.** If the output file exists, skip it. 
Reruns resume where the previous run left off.\n5. **Failure never corrupts partial progress.** A video-download failure doesn't block the transcript; a failure on one video doesn't block the next.\n\n---\n\n## 📝 License\n\nMIT\n\n---\n\n## 🙏 Credits\n\n- [yt-dlp](https://github.com/yt-dlp/yt-dlp) — YouTube downloading\n- [faster-whisper](https://github.com/SYSTRAN/faster-whisper) — Whisper inference on CTranslate2\n- [Claude Code](https://claude.com/claude-code) — orchestration runtime\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fjsleekr%2Fyoutube-subtitle-extractor","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fjsleekr%2Fyoutube-subtitle-extractor","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fjsleekr%2Fyoutube-subtitle-extractor/lists"}