{"id":49003970,"url":"https://github.com/michellepace/youtube-to-xml","last_synced_at":"2026-04-18T19:34:12.944Z","repository":{"id":308710066,"uuid":"1033549399","full_name":"michellepace/youtube-to-xml","owner":"michellepace","description":"Convert YouTube transcripts to XML with chapter elements for improved AI comprehension.","archived":false,"fork":false,"pushed_at":"2026-03-29T01:44:20.000Z","size":1960,"stargazers_count":4,"open_issues_count":0,"forks_count":0,"subscribers_count":0,"default_branch":"main","last_synced_at":"2026-03-29T01:46:56.461Z","etag":null,"topics":["ai-workflow","python","transcripts","uv","youtube"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":false,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/michellepace.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null,"notice":null,"maintainers":null,"copyright":null,"agents":null,"dco":null,"cla":null}},"created_at":"2025-08-07T01:58:14.000Z","updated_at":"2026-03-29T01:44:23.000Z","dependencies_parsed_at":"2025-08-07T13:15:10.600Z","dependency_job_id":"79ed3107-9751-4bd9-a758-5d4e5ceb6e40","html_url":"https://github.com/michellepace/youtube-to-xml","commit_stats":null,"previous_names":["michellepace/youtube-to-xml"],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/michellepace/youtube-to-xml","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/michellepace%2Fyoutube-to-xml","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/michellepace%2Fyoutube-to-xml/tags","releases_url":"https://repos.
ecosyste.ms/api/v1/hosts/GitHub/repositories/michellepace%2Fyoutube-to-xml/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/michellepace%2Fyoutube-to-xml/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/michellepace","download_url":"https://codeload.github.com/michellepace/youtube-to-xml/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/michellepace%2Fyoutube-to-xml/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":31982743,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-04-18T17:30:12.329Z","status":"ssl_error","status_checked_at":"2026-04-18T17:29:59.069Z","response_time":103,"last_error":"SSL_read: unexpected eof while reading","robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":false,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["ai-workflow","python","transcripts","uv","youtube"],"created_at":"2026-04-18T19:34:11.392Z","updated_at":"2026-04-18T19:34:12.928Z","avatar_url":"https://github.com/michellepace.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# 🎥 YouTube-to-XML\n\nConvert YouTube transcripts to structured XML format with automatic chapter detection.\n\n**Problem**: Raw YouTube transcripts are unstructured text that LLMs struggle to parse, degrading AI chat responses about video content.\n\n**Solution**: Converts transcripts to XML with chapter elements for improved AI 
comprehension.\n\n![Description](docs/images/cover-skinny.jpg)\n\n## 📦 Install\n\n(1) First, install UV Python Package and Project Manager [from here](https://docs.astral.sh/uv/getting-started/installation/).\n\n(2) Then, install `youtube-to-xml` accessible from anywhere in your terminal:\n\n```bash\nuv tool install git+https://github.com/michellepace/youtube-to-xml.git\n```\n\n## 🚀 Usage\n\nThe `youtube-to-xml` command intelligently auto-detects whether you're providing a YouTube URL or a transcript file.\n\n### Option 1: URL Method (Easiest)\n\nConvert directly from YouTube URL:\n\n```bash\nyoutube-to-xml https://youtu.be/Q4gsvJvRjCU\n\n🎬 Processing: https://www.youtube.com/watch?v=Q4gsvJvRjCU\n✅ Created: how-claude-code-hooks-save-me-hours-daily.xml\n```\n\nOutput XML (condensed - 4 chapters, 88 lines total):\n\n```xml\n\u003c?xml version='1.0' encoding='utf-8'?\u003e\n\u003ctranscript video_title=\"How Claude Code Hooks Save Me HOURS Daily\"\n            video_published=\"2025-07-12\"\n            video_duration=\"2m 43s\"\n            video_url=\"https://www.youtube.com/watch?v=Q4gsvJvRjCU\"\u003e\n  \u003cchapters\u003e\n    \u003cchapter title=\"Intro\" start_time=\"0:00\"\u003e\n      0:00 Hooks are hands down one of the best\n      0:02 features in Claude Code and for some\n      \u003c!-- ... more transcript content ... --\u003e\n    \u003c/chapter\u003e\n    \u003cchapter title=\"Hooks\" start_time=\"0:19\"\u003e\n      0:20 To create your first hook, use the hooks\n      \u003c!-- ... more transcript content ... --\u003e\n    \u003c/chapter\u003e\n    \u003c!-- ... 2 more chapters ... 
--\u003e\n  \u003c/chapters\u003e\n\u003c/transcript\u003e\n```\n\n\u003e 📁 **[View Output XML →](example_transcripts/how-claude-code-hooks-save-me-hours-daily.xml)**\n\n### Option 2: File Method\n\nManually copy YouTube transcript into a text file, then:\n\n```bash\nyoutube-to-xml my_transcript.txt\n# ✅ Created: my_transcript.xml\n```\n\nCopy-Paste Exact YT Format for `my_transcript.txt`:\n\n```text\nIntroduction to Cows\n0:02\nWelcome to this talk about erm.. er\n2:30\nLet's start with the fundamentals\nWashing the cow\n15:45\nFirst, we'll start with the patches\n```\n\nOutput XML:\n\n```xml\n\u003c?xml version='1.0' encoding='utf-8'?\u003e\n\u003ctranscript video_title=\"\" video_published=\"\" video_duration=\"\" video_url=\"\"\u003e\n  \u003cchapters\u003e\n    \u003cchapter title=\"Introduction to Cows\" start_time=\"0:02\"\u003e\n      0:02 Welcome to this talk about erm.. er\n      2:30 Let's start with the fundamentals\n    \u003c/chapter\u003e\n    \u003cchapter title=\"Washing the cow\" start_time=\"15:45\"\u003e\n      15:45 First, we'll start with the patches\n    \u003c/chapter\u003e\n  \u003c/chapters\u003e\n\u003c/transcript\u003e\n```\n\n## 🤖 Demo: Claude Code Analysis\n\nSee **[demo-analysing-transcripts-with-claude-code.md](docs/demo-analysing-transcripts-with-claude-code.md)** for a real conversation where Claude Code analyses a 2-hour video transcript.\n\nInteresting findings:\n\n- **63,231 tokens** — too large for Claude Code to read at once, but it adapted by using grep and reading specific line ranges\n- **XML chapters** — made it trivial to target specific sections (e.g., \"analyse chapter 10 and 11\")\n- **Follow-up questions** — improved answer completeness. In an App this could be handled by prompt engineering.\n\nClaude Code can only read 25,000 tokens at a time. But the Anthropic API has a 200,000 token window. 
So, we still don't have to use RAG (later).\n\n## 📊 Technical Details\n\n- **Architecture**: Pure functions with clear module separation\n- **Key Modules**: See [CLAUDE.md](.claude/CLAUDE.md)\n- **Dependencies**: Python 3.14+, `yt-dlp` for YouTube downloads, see [pyproject.toml](pyproject.toml)\n- **Python Package Management**: [UV](https://docs.astral.sh/uv/concepts/projects/)\n- **Test-Driven Development**: 125 tests (19 slow, 106 unit)\n- **Terminology**: Uses TRANSCRIPT terminology throughout codebase, see [docs/terminology.md](docs/terminology.md)\n\n\u003cdiv align=\"center\"\u003e\n  \u003ca href=\"docs/terminology.md\"\u003e\n    \u003cimg src=\"docs/images/terminology-youtube.jpg\" alt=\"YouTube video interface showing the Transcript panel with timestamp and text displayed on single lines (e.g., '0:02 features in Claude Code and for some'). Orange annotations highlight chapter titles and transcript lines structure.\" width=\"750\"\u003e\n  \u003c/a\u003e\n  \u003cp\u003e\u003cem\u003eYouTube transcript terminology throughout codebase: (click to read)\u003c/em\u003e\u003c/p\u003e\n\u003c/div\u003e\n\n## 🛠️ Development\n\n🤖 *Repo 100% generated by Claude Code — every single line.*\n\nSetup:\n\n```bash\ngit clone https://github.com/michellepace/youtube-to-xml.git\ncd youtube-to-xml\nuv sync\nuv run pre-commit install\nuv run pre-commit install --hook-type pre-push\n```\n\nCode Quality:\n\n```bash\nuv run ruff check --fix           # Lint and auto-fix (see pyproject.toml)\nuv run ruff format                # Format code (see pyproject.toml)\n```\n\nTesting:\n\n```bash\nuv run pytest                     # All tests\nuv run pytest -m \"slow\"           # Only slow tests (internet required)\nuv run pytest -m \"not slow\"       # All tests except slow tests\nuv run pre-commit run --all-files # (see .pre-commit-config.yaml)\n```\n\n\u003cdiv align=\"center\"\u003e\n  \u003ca href=\"docs/images/repo_evolution_commit.webp\"\u003e\n    \u003cimg 
src=\"docs/images/repo_evolution_commit.webp\" alt=\"Stacked area chart showing repository growth from 500 to 3700 lines across 250 commits, with test code (blue) comprising 60% of codebase, source code (red) 30%, and comments (green) 10%\" width=\"750\"\u003e\n  \u003c/a\u003e\n  \u003cp\u003e\u003cem\u003eCounted by my \u003ca href=\"https://github.com/michellepace/plot-py-repo\"\u003eplot-py-repo\u003c/a\u003e tool\u003c/em\u003e\u003c/p\u003e\n\u003c/div\u003e\n\n## 🏗️ Architecture\n\n```text\n                    youtube-to-xml CLI\n                           │\n                    ┌──────┴───────┐\n                    │  cli.py      │\n                    │ (auto-detect)│\n                    └──────┬───────┘\n                           │\n              ┌────────────┴────────────┐\n              │                         │\n        [URL Input]              [File Input]\n              │                         │\n      ┌───────▼────────┐       ┌────────▼────────┐\n      │ url_parser.py  │       │ file_parser.py  │\n      │                │       │                 │\n      │ • yt-dlp API   │       │ • Pattern match │\n      │ • JSON3 parse  │       │ • Chapter rules │\n      │ • Metadata     │       │ • Empty metadata│\n      └───────┬────────┘       └────────┬────────┘\n              │                         │\n              └────────────┬────────────┘\n                           │\n                    ┌──────▼──────┐\n                    │  models.py  │\n                    │             │\n                    │TranscriptDoc│\n                    │  Chapters   │\n                    │  Metadata   │\n                    └──────┬──────┘\n                           │\n                    ┌──────▼──────────┐\n                    │ xml_builder.py  │\n                    │                 │\n                    │ • Format times  │\n                    │ • Build XML tree│\n                    └──────┬──────────┘\n                           │\n                      
┌────▼────┐\n                      │ XML File│\n                      └─────────┘\n```\n\n---\n\n## Appendix 1: Decision - Inline Transcript Timestamps\n\nEach transcript line now places the timestamp and text on the **same line**, rather than on separate lines:\n\n**Before (separate lines)**\n\n```text\n0:02\nWelcome to this talk about cows\n2:30\nLet's start with the fundamentals\n15:45\nFirst, we'll start with the patches\n```\n\n**After (same line)**\n\n```text\n0:02 Welcome to this talk about cows\n2:30 Let's start with the fundamentals\n15:45 First, we'll start with the patches\n```\n\n**Why?** The primary consumer of these transcripts is an LLM agent (e.g. Claude Code) that navigates large files by searching for keywords and reading line ranges. With inline timestamps, every search hit is a self-contained record — the agent immediately knows *what* was said and *when*, in a single operation. No follow-up read to find the timestamp on the line above.\n\n\u003cdiv align=\"center\"\u003e\n  \u003ca href=\"docs/images/inline-timestamps.svg\"\u003e\n    \u003cimg src=\"docs/images/inline-timestamps.svg\" alt=\"Diagram comparing search workflows: Before requires two steps to find the timestamp, After returns timestamp and text together in one step\" width=\"680\"\u003e\n  \u003c/a\u003e\n  \u003cp\u003e\u003cem\u003eSearching a transcript with thousands of lines: separate lines require a second lookup for the timestamp, same-line format returns a complete record\u003c/em\u003e\u003c/p\u003e\n\u003c/div\u003e\n\n---\n\n## 📕 Appendix 2: *Personal Notes*\n\nEvals To Do (transcript.txt vs transcript.xml):\n\n- [ ] Build Shiny for Python app to use Hamel's [simple error analysis approach](https://hamel.dev/blog/posts/field-guide/index.html)\n- [ ] But I don't like Hamel's binary approach, what about a Six Sigma ordinal data approach like in [docs/idea-evals.md](docs/idea-evals.md)?\n- [ ] Automate the evals with pytest as far as possible, LLM as a Judge for others\n- [ 
] If XML is the winner, try tweaking the XML structure to improve it, for example [this](docs/knowledge/working-notes.md#better-format): whitespace, more tags, or maybe JSON?\n- [ ] But now I've got a problem with cost because the XML is in the context window. So can RAG perform equally well, and as fast?\n- [ ] Can I use a cheaper model that performs equally well, like Haiku over Sonnet (for some things)?\n- [ ] At some point I'm going to have to head over to [BrainTrust.dev](https://www.braintrust.dev/) - use an agnostic SDK?\n\nLearnings To Carry Over:\n\n- [Use CodeRabbit for PR review](https://www.anthropic.com/customers/coderabbit) to improve code\n- [Use Claude Code Docs](https://github.com/ericbuess/claude-code-docs) so Claude Code knows what it can do\n- [Use Claude Code Project Index](https://github.com/ericbuess/claude-code-project-index) so Claude Code sees the entire project easily\n- [Manage MCPs nicely](docs/knowledge/manage-mcps-nicely.md): constrain what you use, put API keys in one place\n- [Git branch workflow](docs/knowledge/git-branch-flow.md): try to put everything on a purposeful branch\n- Always use strict linting and typing, and enforce them in a [pre-commit hook](.pre-commit-config.yaml)\n- Always do test-driven development; [manual LLM testing](docs/refactor-todo/exceptions/test_url.md) is useful too\n- Manage LLM Context: set [terminology](docs/terminology.md), use clear naming, keep docstrings/comments accurate, and `/clear` Claude Code at 60% of the context window\n\nOpen Questions:\n\n- Q1. Is there something I could have done better with UV?\n- Q2. Is the system architecture well-designed and elegant?\n- Q3. Is the exception design suitable for a future API service?\n- Q4. Are [tests/](tests/) clear and sane, or over-engineered?\n- Q5. 
Was it safe to exclude \"XML security\" Ruff [S314](pyproject.toml)?\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fmichellepace%2Fyoutube-to-xml","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fmichellepace%2Fyoutube-to-xml","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fmichellepace%2Fyoutube-to-xml/lists"}