https://github.com/lucidprogrammer/youtube-vision-transcriber

AI-powered pipeline that converts YouTube videos into polished articles using vision-based transcription - captures code, terminal output, and on-screen text that subtitles miss
ai fast-agent gemini knowledge- knowledge-extraction llm mcp model-context-protocol openai python transcription video-to-text vision-ai youtube

Last synced: 4 months ago

Host: GitHub
URL: https://github.com/lucidprogrammer/youtube-vision-transcriber
Owner: lucidprogrammer
License: mit
Created: 2026-01-18T02:26:56.000Z (6 months ago)
Default Branch: main
Last Pushed: 2026-01-19T16:57:03.000Z (6 months ago)
Last Synced: 2026-01-19T18:13:30.393Z (6 months ago)
Topics: ai, fast-agent, gemini, knowledge-, knowledge-extraction, llm, mcp, model-context-protocol, openai, python, transcription, video-to-text, vision-ai, youtube
Language: Python
Size: 248 KB
Stars: 0
Watchers: 0
Forks: 0
Open Issues: 1
Metadata Files:
- Readme: README.md
- Contributing: CONTRIBUTING.md
- License: LICENSE

README

# YouTube Vision Transcriber

**AI-powered pipeline that converts YouTube videos into polished articles using vision-based transcription.**

Unlike standard transcribers that only read audio captions, this tool uses **Google Gemini 2.0 Flash** (via `fast-agent`) to "watch" the video, capturing code snippets, terminal output, and on-screen text that subtitles often miss.

## Features

- **Vision-Based Transcription**: Captures visual context, code blocks, and diagrams.
- **MCP-Native**: Built with the [Model Context Protocol](https://modelcontextprotocol.io), compatible with Claude Desktop and VS Code.
- **Robust Pipeline**:
  - **Video Preparer**: Downloads and splits videos into manageable chunks.
  - **Part Transcriber**: Transcribes each chunk with timestamped visual details.
  - **Doc Writer**: Synthesizes a structured technical article from the transcript.
- **Fast & Efficient**: Uses `uv` for dependency management and `fast-agent` for orchestration.

## Prerequisites

- **Python**: >= 3.13.5
- **uv**: [Install uv](https://docs.astral.sh/uv/getting-started/installation/)
- **API Keys**:
  - **Google Gemini**: For vision transcription.
  - **OpenAI** (Optional): For orchestration (default model is `gpt-4o-mini`, but can be changed).

## Installation

1. **Clone the repository**:
```bash
git clone https://github.com/lucidprogrammer/youtube-vision-transcriber.git
cd youtube-vision-transcriber
```

2. **Install dependencies**:
```bash
uv sync
```

3. **Configure API Keys**:
Copy the example secrets file and edit it:
```bash
cp fastagent.secrets.yaml.example fastagent.secrets.yaml
```
Add your API keys to `fastagent.secrets.yaml`:
```yaml
google:
  api_key: "YOUR_GEMINI_API_KEY"
openai:
  api_key: "YOUR_OPENAI_API_KEY"
```

4. **Install MCP Sub-Server Dependencies** (For Local Development):
The project uses external MCP servers that need to be installed globally:
```bash
# Filesystem MCP server (Node.js)
npm install -g @modelcontextprotocol/server-filesystem

# Fetch MCP server (Python)
pip install mcp-server-fetch
# OR using uv:
uv pip install mcp-server-fetch
```

> **Note**: If you're only using Docker, you can skip this step as these dependencies are pre-installed in the container.

## CLI Usage

You can run the agent interactively in your terminal:

```bash
uv run agent.py
```

Paste a YouTube URL when prompted. The agent will:
1. Download the video to `./youtube_data`.
2. Split and transcribe it.
3. Generate an `article.md` in the video's folder.
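
The split step can be pictured as planning fixed-length chunks and mapping each chunk-relative time back to a position in the full video. The following is a minimal sketch, not the project's actual code: the `Chunk` type, the function names, and the 600-second default are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Chunk:
    """One slice of the source video (hypothetical type, for illustration)."""
    index: int
    start_s: float
    end_s: float


def plan_chunks(duration_s: float, chunk_s: float = 600.0) -> List[Chunk]:
    """Cover [0, duration_s) with consecutive chunks of at most chunk_s seconds."""
    if duration_s < 0 or chunk_s <= 0:
        raise ValueError("duration_s must be >= 0 and chunk_s must be > 0")
    chunks: List[Chunk] = []
    start = 0.0
    while start < duration_s:
        end = min(start + chunk_s, duration_s)  # the last chunk may be shorter
        chunks.append(Chunk(index=len(chunks), start_s=start, end_s=end))
        start = end
    return chunks


def to_video_timestamp(chunk: Chunk, offset_s: float) -> str:
    """Convert an offset inside a chunk to an HH:MM:SS position in the full video."""
    total = int(chunk.start_s + offset_s)
    hours, remainder = divmod(total, 3600)
    minutes, seconds = divmod(remainder, 60)
    return f"{hours:02d}:{minutes:02d}:{seconds:02d}"
```

For a 25-minute video and the assumed 10-minute chunk length, `plan_chunks(1500)` yields three chunks, the last one 5 minutes long. Keeping the chunk start time alongside each part is what lets timestamps in the final article refer to the original video rather than to an individual part.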

### 🚀 Quick Start (Docker)

1. **Pre-create a Persistent Volume** (Recommended for data persistence):
```bash
# Create a Docker volume linked to your home folder for easy results access
mkdir -p ~/youtube_data
docker volume rm youtube_vision_vol || true
docker volume create --driver local \
  --opt type=none \
  --opt device="$HOME/youtube_data" \
  --opt o=bind \
  youtube_vision_vol
```

2. **Configure MCP Client**:
Add this to your `windsurf.json`, `cursor.json`, or `claude_desktop_config.json`:

```json
{
  "mcpServers": {
    "youtube-vision": {
      "command": "docker",
      "args": [
        "run",
        "-i",
        "--rm",
        "-e", "GOOGLE__API_KEY=your_key_here",
        "-e", "OPENAI__API_KEY=your_key_here",
        "-v", "youtube_vision_vol:/app/youtube_data",
        "lucidprogrammer/youtube-vision-transcriber"
      ]
    }
  }
}
```
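
If you prefer to generate this entry rather than paste keys by hand, a small script can build the same structure. This is a sketch under assumptions: the function name `build_mcp_config` and the host-side variables `GOOGLE_API_KEY` and `OPENAI_API_KEY` are hypothetical, and dropping the OpenAI variable when no key is set relies on that key being optional, which has not been verified against the container.

```python
import json
import os
from typing import Dict, List


def build_mcp_config(
    google_key: str,
    openai_key: str = "",
    volume: str = "youtube_vision_vol",
    image: str = "lucidprogrammer/youtube-vision-transcriber",
) -> Dict:
    """Build the mcpServers entry shown above as a Python dict."""
    args: List[str] = ["run", "-i", "--rm", "-e", f"GOOGLE__API_KEY={google_key}"]
    if openai_key:
        # Assumption: the OpenAI key is optional, so it is only passed when set.
        args += ["-e", f"OPENAI__API_KEY={openai_key}"]
    args += ["-v", f"{volume}:/app/youtube_data", image]
    return {"mcpServers": {"youtube-vision": {"command": "docker", "args": args}}}


if __name__ == "__main__":
    # Hypothetical host-side variable names; adjust to your own environment.
    config = build_mcp_config(
        os.environ.get("GOOGLE_API_KEY", "your_key_here"),
        os.environ.get("OPENAI_API_KEY", ""),
    )
    print(json.dumps(config, indent=2))
```

The printed JSON can be merged into your client's configuration file. Keep in mind that the keys end up in that file in plain text either way.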

### 🛠️ Troubleshooting (Docker Mounts)

If you see a **"mounts denied"** error (e.g., `The path /tmp/youtube_data is not shared`):

1. **Host Sharing**: In Docker Desktop settings, go to **Resources > File Sharing** and ensure the path is added.
2. **Path Selection**: Use a path inside your home directory (e.g., `/Users/yourname/youtube_data`) which is usually shared by default.
3. **Absolute Paths**: Ensure you use the full absolute path on the host machine.
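
The last two checks can be run before starting the container. The helper below is a rough sketch that assumes POSIX-style host paths (macOS or Linux); `check_mount_path` is a hypothetical name, and the "inside the home directory" rule is only a heuristic for what Docker Desktop shares by default, not a query of its actual settings.

```python
from pathlib import Path, PurePosixPath
from typing import List, Optional


def check_mount_path(path: str, home: Optional[str] = None) -> List[str]:
    """Return a list of likely problems with a host path used as a bind mount."""
    home_dir = PurePosixPath(home if home is not None else Path.home().as_posix())
    candidate = PurePosixPath(path)
    problems: List[str] = []
    if not candidate.is_absolute():
        problems.append("use a full absolute path on the host")
    elif not candidate.is_relative_to(home_dir):
        # Heuristic only: paths outside the home directory may not be shared.
        problems.append(
            f"{candidate} is outside {home_dir}; "
            "it may need to be added under Resources > File Sharing"
        )
    return problems
```

An empty list means neither heuristic fired. A path such as `/tmp/youtube_data` is reported as outside the home directory, which matches the error example above.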

## Architecture

The system consists of three main MCP components orchestrated by a master agent:

1. **YouTubeVisionTranscriber (`youtube_mcp.py`)**: Handles downloading (`yt-dlp`) and splitting videos (`ffmpeg`).
2. **VideoTranscriberServer (`video_transcriber_mcp.py`)**: Wraps internal LLM calls to process video chunks.
3. **Filesystem Server**: Manages reading/writing data to `./youtube_data`.
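
To make the first component's job concrete, the sketch below builds the kind of `yt-dlp` and `ffmpeg` command lines that a download-and-split step could run. It is an illustration under assumptions, not the contents of `youtube_mcp.py`: the function names, the `source` and `part_%03d.mp4` file names, and the choice of ffmpeg's segment muxer are all hypothetical. The commands are only constructed here, not executed.

```python
from pathlib import Path
from typing import List


def download_command(url: str, video_dir: Path) -> List[str]:
    """yt-dlp call that saves the video into video_dir (-o sets the output template)."""
    return ["yt-dlp", "-o", str(video_dir / "source.%(ext)s"), url]


def split_command(source: Path, chunk_s: int) -> List[str]:
    """ffmpeg call that cuts source into chunk_s-second parts without re-encoding."""
    if chunk_s <= 0:
        raise ValueError("chunk_s must be positive")
    return [
        "ffmpeg",
        "-i", str(source),
        "-c", "copy",              # copy streams instead of re-encoding
        "-map", "0",               # keep all streams from the input
        "-f", "segment",           # ffmpeg's segment muxer
        "-segment_time", str(chunk_s),
        "-reset_timestamps", "1",  # each part starts at t=0
        str(source.parent / "part_%03d.mp4"),
    ]
```

Each list can be passed to `subprocess.run(cmd, check=True)` once `yt-dlp` and `ffmpeg` are on the `PATH`. With stream copy, cuts land on keyframes, so parts may be slightly longer or shorter than the requested length.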

## License

MIT License. See [LICENSE](LICENSE) for details.

## Contributing

See [CONTRIBUTING.md](CONTRIBUTING.md) for guidelines.
