https://github.com/michellepace/youtube-to-xml

Convert YouTube transcripts to XML with chapter elements for improved AI comprehension.

ai-workflow python transcripts uv youtube


# 🎥 YouTube-to-XML

Convert YouTube transcripts to structured XML format with automatic chapter detection.

**Problem**: Raw YouTube transcripts are unstructured text that LLMs struggle to parse, degrading AI chat responses about video content.

**Solution**: Converts transcripts to XML with chapter elements for improved AI comprehension.

![Description](docs/images/cover-skinny.jpg)

## 📦 Install

(1) First, install UV (the Python package and project manager) [from here](https://docs.astral.sh/uv/getting-started/installation/).

(2) Then install `youtube-to-xml` as a UV tool so the command is available from anywhere in your terminal:

```bash
uv tool install git+https://github.com/michellepace/youtube-to-xml.git
```

## 🚀 Usage

The `youtube-to-xml` command auto-detects whether you're providing a YouTube URL or a transcript file.
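The auto-detection could work roughly like the sketch below. This is only an illustration: `detect_input_kind` is a hypothetical helper, and the rule it applies (an http or https scheme with a host means URL, anything else means file) is an assumption, not necessarily what `cli.py` does.

```python
from urllib.parse import urlparse


def detect_input_kind(argument: str) -> str:
    """Return "url" for http(s) URLs and "file" for everything else.

    Hypothetical helper: the real cli.py may apply different rules.
    """
    parsed = urlparse(argument.strip())
    if parsed.scheme in {"http", "https"} and parsed.netloc:
        return "url"
    return "file"


print(detect_input_kind("https://youtu.be/Q4gsvJvRjCU"))  # url
print(detect_input_kind("my_transcript.txt"))  # file
```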

### Option 1: URL Method (Easiest)

Convert directly from a YouTube URL:

```bash
youtube-to-xml https://youtu.be/Q4gsvJvRjCU

# 🎬 Processing: https://www.youtube.com/watch?v=Q4gsvJvRjCU
# ✅ Created: how-claude-code-hooks-save-me-hours-daily.xml
```

Output XML (condensed - 4 chapters, 88 lines total):

```xml



0:00 Hooks are hands down one of the best
0:02 features in Claude Code and for some



0:20 To create your first hook, use the hooks



```

> πŸ“ **[View Output XML β†’](example_transcripts/how-claude-code-hooks-save-me-hours-daily.xml)**

### Option 2: File Method

Manually copy the YouTube transcript into a text file, then run:

```bash
youtube-to-xml my_transcript.txt
# ✅ Created: my_transcript.xml
```

Paste the transcript into `my_transcript.txt` exactly as YouTube displays it:

```text
Introduction to Cows
0:02
Welcome to this talk about erm.. er
2:30
Let's start with the fundamentals
Washing the cow
15:45
First, we'll start with the patches
```

Output XML:

```xml



0:02 Welcome to this talk about erm.. er
2:30 Let's start with the fundamentals


15:45 First, we'll start with the patches

```
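To make the file method concrete, here is a minimal sketch of how the pasted format could be grouped into chapters and serialised. It is not the project's `file_parser.py` or `xml_builder.py`: the function names, the `transcript` and `chapter` element names, and the `title` attribute are all assumptions for illustration, and the parser assumes every timestamp is followed by exactly one line of text.

```python
import re
import xml.etree.ElementTree as ET

TIMESTAMP = re.compile(r"^\d{1,2}:\d{2}(?::\d{2})?$")

SAMPLE = """Introduction to Cows
0:02
Welcome to this talk about erm.. er
2:30
Let's start with the fundamentals
Washing the cow
15:45
First, we'll start with the patches
"""


def parse_chapters(raw: str) -> list[tuple[str, list[str]]]:
    """Group a pasted transcript into (chapter title, transcript lines)."""
    lines = [line.strip() for line in raw.splitlines() if line.strip()]
    chapters: list[tuple[str, list[str]]] = []
    for i, line in enumerate(lines):
        if TIMESTAMP.match(line):
            continue  # handled together with the text line that follows it
        prev_is_time = i > 0 and bool(TIMESTAMP.match(lines[i - 1]))
        next_is_time = i + 1 < len(lines) and bool(TIMESTAMP.match(lines[i + 1]))
        if prev_is_time:
            if not chapters:
                chapters.append(("", []))  # transcript without a leading title
            # Join the timestamp and its text onto a single line.
            chapters[-1][1].append(f"{lines[i - 1]} {line}")
        elif next_is_time:
            # Not preceded by a timestamp but followed by one: a chapter title.
            chapters.append((line, []))
    return chapters


def build_xml(chapters: list[tuple[str, list[str]]]) -> str:
    """Serialise chapters; element and attribute names are assumed."""
    root = ET.Element("transcript")
    for title, body in chapters:
        chapter = ET.SubElement(root, "chapter", title=title)
        chapter.text = "\n" + "\n".join(body) + "\n"
    return ET.tostring(root, encoding="unicode")


print(build_xml(parse_chapters(SAMPLE)))
```

A title is recognised here purely by position (the line before it is not a timestamp, the line after it is), which is enough for the sample above but would misread transcripts with multi-line text.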

## 🤖 Demo: Claude Code Analysis

See **[demo-analysing-transcripts-with-claude-code.md](docs/demo-analysing-transcripts-with-claude-code.md)** for a real conversation where Claude Code analyses a 2-hour video transcript.

Interesting findings:

- **63,231 tokens**: too large for Claude Code to read at once, but it adapted by using grep and reading specific line ranges
- **XML chapters**: made it trivial to target specific sections (e.g., "analyse chapter 10 and 11")
- **Follow-up questions**: improved answer completeness. In an app, this could be handled by prompt engineering.

Claude Code can read only 25,000 tokens at a time, but the Anthropic API has a 200,000-token context window, so RAG is still not needed (that can come later).
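The arithmetic behind that conclusion can be written down as a small check. The two limits are the figures quoted above; the function name and the returned labels are made up for this sketch.

```python
CLAUDE_CODE_READ_LIMIT = 25_000  # tokens per read, figure quoted above
API_CONTEXT_WINDOW = 200_000  # tokens, figure quoted above


def reading_strategy(token_count: int) -> str:
    """Classify a transcript by how it can be read (hypothetical labels)."""
    if token_count <= CLAUDE_CODE_READ_LIMIT:
        return "single-read"  # Claude Code can read the whole file at once
    if token_count <= API_CONTEXT_WINDOW:
        return "targeted-reads"  # fits the API window; Claude Code uses grep and line ranges
    return "needs-retrieval"  # larger than the API window, so chunking or RAG


print(reading_strategy(63_231))  # targeted-reads
```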

## 📊 Technical Details

- **Architecture**: Pure functions with clear module separation
- **Key Modules**: See [CLAUDE.md](.claude/CLAUDE.md)
- **Dependencies**: Python 3.14+, `yt-dlp` for YouTube downloads, see [pyproject.toml](pyproject.toml)
- **Python Package Management**: [UV](https://docs.astral.sh/uv/concepts/projects/)
- **Test-Driven Development**: 125 tests (19 slow, 106 unit)
- **Terminology**: Uses TRANSCRIPT terminology throughout codebase, see [docs/terminology.md](docs/terminology.md)



Figure (YouTube transcript terminology used throughout the codebase): YouTube video interface showing the Transcript panel with timestamp and text displayed on single lines (e.g., '0:02 features in Claude Code and for some'). Orange annotations highlight the chapter titles and the structure of the transcript lines.


## πŸ› οΈ Development

πŸ€– *Repo 100% generated by Claude Code β€” every single line.*

Setup:

```bash
git clone https://github.com/michellepace/youtube-to-xml.git
cd youtube-to-xml
uv sync
uv run pre-commit install
uv run pre-commit install --hook-type pre-push
```

Code Quality:

```bash
uv run ruff check --fix # Lint and auto-fix (see pyproject.toml)
uv run ruff format # Format code (see pyproject.toml)
```

Testing:

```bash
uv run pytest # All tests
uv run pytest -m "slow" # Only slow tests (internet required)
uv run pytest -m "not slow" # All tests except slow tests
uv run pre-commit run --all-files # (see .pre-commit-config.yaml)
```



Figure: stacked area chart showing repository growth from 500 to 3700 lines across 250 commits, with test code (blue) comprising 60% of the codebase, source code (red) 30%, and comments (green) 10%. Counted by my plot-py-repo tool.


## πŸ—οΈ Architecture

```text
youtube-to-xml CLI
β”‚
β”Œβ”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”
β”‚ cli.py β”‚
β”‚ (auto-detect)β”‚
β””β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”˜
β”‚
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚ β”‚
[URL Input] [File Input]
β”‚ β”‚
β”Œβ”€β”€β”€β”€β”€β”€β”€β–Όβ”€β”€β”€β”€β”€β”€β”€β”€β” β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β–Όβ”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚ url_parser.py β”‚ β”‚ file_parser.py β”‚
β”‚ β”‚ β”‚ β”‚
β”‚ β€’ yt-dlp API β”‚ β”‚ β€’ Pattern match β”‚
β”‚ β€’ JSON3 parse β”‚ β”‚ β€’ Chapter rules β”‚
β”‚ β€’ Metadata β”‚ β”‚ β€’ Empty metadataβ”‚
β””β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”˜ β””β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”˜
β”‚ β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
β”‚
β”Œβ”€β”€β”€β”€β”€β”€β–Όβ”€β”€β”€β”€β”€β”€β”
β”‚ models.py β”‚
β”‚ β”‚
β”‚TranscriptDocβ”‚
β”‚ Chapters β”‚
β”‚ Metadata β”‚
β””β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”˜
β”‚
β”Œβ”€β”€β”€β”€β”€β”€β–Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚ xml_builder.py β”‚
β”‚ β”‚
β”‚ β€’ Format times β”‚
β”‚ β€’ Build XML treeβ”‚
β””β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
β”‚
β”Œβ”€β”€β”€β”€β–Όβ”€β”€β”€β”€β”
β”‚ XML Fileβ”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
```
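As a rough illustration of the shared model layer, the sketch below shows how both parsers could hand the same immutable structure to the XML builder. Only the names `TranscriptDoc`, `Chapter` and `Metadata` come from the diagram; every field name, and the `format_timestamp` helper standing in for "Format times", is an assumption rather than the project's actual code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Metadata:
    # Defaults are empty so the file parser can return "empty metadata".
    video_title: str = ""
    video_url: str = ""


@dataclass(frozen=True)
class Chapter:
    title: str
    start_seconds: float
    lines: tuple[str, ...] = ()


@dataclass(frozen=True)
class TranscriptDoc:
    metadata: Metadata
    chapters: tuple[Chapter, ...]


def format_timestamp(seconds: float) -> str:
    """Render seconds as M:SS, or H:MM:SS once past an hour (assumed format)."""
    hours, remainder = divmod(int(seconds), 3600)
    minutes, secs = divmod(remainder, 60)
    if hours:
        return f"{hours}:{minutes:02d}:{secs:02d}"
    return f"{minutes}:{secs:02d}"


doc = TranscriptDoc(
    metadata=Metadata(),
    chapters=(Chapter(title="Washing the cow", start_seconds=945),),
)
print(format_timestamp(doc.chapters[0].start_seconds))  # 15:45
```

Frozen dataclasses fit the "pure functions" style described under Technical Details, since a parser returns a value that later stages cannot mutate.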

---

## Appendix 1: Decision - Inline Transcript Timestamps

Each transcript line now places the timestamp and text on the **same line**, rather than on separate lines:

**Before (separate lines)**

```text
0:02
Welcome to this talk about cows
2:30
Let's start with the fundamentals
15:45
First, we'll start with the patches
```

**After (same line)**

```text
0:02 Welcome to this talk about cows
2:30 Let's start with the fundamentals
15:45 First, we'll start with the patches
```

**Why?** The primary consumer of these transcripts is an LLM agent (e.g. Claude Code) that navigates large files by searching for keywords and reading line ranges. With inline timestamps, every search hit is a self-contained record: the agent immediately knows *what* was said and *when*, in a single operation. No follow-up read is needed to find the timestamp on the line above.



Figure: diagram comparing search workflows. Before requires two steps to find the timestamp; After returns timestamp and text together in one step. When searching a transcript with thousands of lines, the separate-lines format requires a second lookup for the timestamp, while the same-line format returns a complete record.
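The difference is easy to reproduce with a grep-style keyword search over both layouts. The `search` helper below is a stand-in for whatever search tool the agent uses, not part of this project.

```python
import re


def search(lines: list[str], keyword: str) -> list[tuple[int, str]]:
    """Return (line number, line) for every line containing the keyword."""
    pattern = re.compile(re.escape(keyword), re.IGNORECASE)
    return [(n, line) for n, line in enumerate(lines, start=1) if pattern.search(line)]


separate = [
    "0:02",
    "Welcome to this talk about cows",
    "2:30",
    "Let's start with the fundamentals",
]
inline = [
    "0:02 Welcome to this talk about cows",
    "2:30 Let's start with the fundamentals",
]

# Separate lines: the hit carries no timestamp, so line 3 must be read as well.
print(search(separate, "fundamentals"))
# Same line: the hit already contains both the timestamp and the text.
print(search(inline, "fundamentals"))
```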


---

## 📕 Appendix 2: *Personal Notes*

Evals To Do (transcript.txt vs transcript.xml):

- [ ] Build Shiny for Python app to use Hamel's [simple error analysis approach](https://hamel.dev/blog/posts/field-guide/index.html)
- [ ] But I don't like Hamel's binary approach, what about a Six Sigma ordinal data approach like in [docs/idea-evals.md](docs/idea-evals.md)?
- [ ] Automate the evals with pytest as far as possible, LLM as a Judge for others
- [ ] If XML is the winner, try tweaking the XML structure to improve it, for example [this](docs/knowledge/working-notes.md#better-format). Like whitespace, more tags, or maybe JSON?
- [ ] But now I've got a problem with cost because the XML is in the context window. So can RAG perform equally well and as fast?
- [ ] Can I use a cheaper model that performs equally well, like Haiku over Sonnet (for some things)?
- [ ] At some point I'm going to have to head over to [BrainTrust.dev](https://www.braintrust.dev/) - use an agnostic SDK?

Learnings To Carry Over:

- [Use CodeRabbit for PR review](https://www.anthropic.com/customers/coderabbit) to improve code
- [Use Claude Code Docs](https://github.com/ericbuess/claude-code-docs) so Claude Code knows what it can do
- [Use Claude Code Project Index](https://github.com/ericbuess/claude-code-project-index) so Claude Code sees entire project easily
- [Manage MCPs nicely](docs/knowledge/manage-mcps-nicely.md): constrain what you use, put API keys in one place
- [Git branch workflow](docs/knowledge/git-branch-flow.md): try to put everything on a purposeful branch
- Always use strict linting and typing and enforce in [pre-commit hook](.pre-commit-config.yaml)
- Always do test-driven development and [manual LLM testing](docs/refactor-todo/exceptions/test_url.md) is useful too
- Manage LLM Context: set [terminology](docs/terminology.md), use clear naming, keep docstrings/comments accurate, at 60% context window `/clear` Claude Code

Open Questions:

- Q1. Is there something I could have done better with UV?
- Q2. Is the system architecture well-designed and elegant?
- Q3. Is the exception design suitable for a future API service?
- Q4. Are [tests/](tests/) clear and sane, or over-engineered?
- Q5. Was it safe to exclude "XML security" Ruff [S314](pyproject.toml)?