https://github.com/guimatheus92/mcp-video-analyzer
MCP server for video analysis — extracts transcripts, key frames, OCR text, and metadata from video URLs. Supports Loom and direct video files.
ai claude frames loom mcp mcp-server model-context-protocol ocr transcript video-analysis
Last synced: 3 months ago
- Host: GitHub
- URL: https://github.com/guimatheus92/mcp-video-analyzer
- Owner: guimatheus92
- License: mit
- Created: 2026-03-09T17:51:43.000Z (3 months ago)
- Default Branch: main
- Last Pushed: 2026-03-10T18:54:41.000Z (3 months ago)
- Last Synced: 2026-03-11T00:05:41.895Z (3 months ago)
- Topics: ai, claude, frames, loom, mcp, mcp-server, model-context-protocol, ocr, transcript, video-analysis
- Language: TypeScript
- Size: 867 KB
- Stars: 0
- Watchers: 0
- Forks: 1
- Open Issues: 0
Metadata Files:
- Readme: README.md
- Contributing: CONTRIBUTING.md
- License: LICENSE
# mcp-video-analyzer
Featured in [awesome-mcp-servers](https://github.com/punkpeye/awesome-mcp-servers#-multimedia-process).
MCP server for video analysis — extracts transcripts, key frames, OCR text, and metadata from video URLs. Supports Loom and direct video files (.mp4, .webm).
No existing video MCP combines **transcripts + visual frames + metadata** in one tool. This one does.
## Installation
### Prerequisites
- **Node.js 18+** — required to run the server via `npx`
- **yt-dlp** (optional) — downloads the video so frames can be extracted with ffmpeg. Install with `pip install yt-dlp`
- **Chrome/Chromium** (optional) — fallback for frame extraction if yt-dlp is unavailable
> Without yt-dlp or Chrome, the server still works — you'll get transcripts, metadata, and comments, just no frames.
### Claude Code (CLI)
```bash
claude mcp add video-analyzer -- npx mcp-video-analyzer@latest
```
Then restart Claude Code or start a new conversation.
### VS Code / Cursor
Add to your MCP settings file:
- **VS Code**: `File → Preferences → Settings → search "MCP"` or edit `~/.vscode/mcp.json` / `%APPDATA%\Code\User\mcp.json` (Windows)
- **Cursor**: `Settings → MCP Servers → Add`
```json
{
  "servers": {
    "mcp-video-analyzer": {
      "type": "stdio",
      "command": "npx",
      "args": ["mcp-video-analyzer@latest"]
    }
  }
}
```
Then reload the window (`Ctrl+Shift+P` → "Developer: Reload Window").
### Claude Desktop
Add to your Claude Desktop config file:
- **macOS**: `~/Library/Application Support/Claude/claude_desktop_config.json`
- **Windows**: `%APPDATA%\Claude\claude_desktop_config.json`
```json
{
  "mcpServers": {
    "video-analyzer": {
      "command": "npx",
      "args": ["mcp-video-analyzer@latest"]
    }
  }
}
```
Then restart Claude Desktop.
### Verify it works
Once installed, ask your AI assistant:
```
Analyze this video: https://www.loom.com/share/bdebdfe44b294225ac718bad241a94fe
```
If the server is connected, it will automatically call the `analyze_video` tool.
## Tools
### `analyze_video` — Full video analysis
Extracts everything from a video URL in one call:
```
> Analyze this video: https://www.loom.com/share/abc123...
```
Returns:
- **Transcript** with timestamps and speakers
- **Key frames** extracted via scene-change detection (automatically deduplicated)
- **OCR text** extracted from frames (code, error messages, UI text visible on screen)
- **Annotated timeline** merging transcript + frames + OCR into a unified "what happened when" view
- **Metadata** (title, duration, platform)
- **Comments** from viewers
- **Chapters** and **AI summary** (when available)
The AI will **automatically** call this tool when it sees a video URL — no need to ask.
Options:
- `detail` — analysis depth: `"brief"` (metadata + truncated transcript, no frames), `"standard"` (default), `"detailed"` (dense sampling, more frames)
- `fields` — array of specific fields to return, e.g. `["metadata", "transcript"]`. Available: `metadata`, `transcript`, `frames`, `comments`, `chapters`, `ocrResults`, `timeline`, `aiSummary`
- `maxFrames` (1-60, default depends on detail level) — cap on extracted frames
- `threshold` (0.0-1.0, default 0.1) — scene-change sensitivity
- `forceRefresh` — bypass cache and re-analyze
- `skipFrames` — skip frame extraction for transcript-only analysis
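
The options above can be sketched as a TypeScript type with a small normalizer. This is an illustration only: the option and field names come from the list above, but `AnalyzeVideoOptions`, `normalizeOptions`, and `filterFields` are hypothetical names, and treating `forceRefresh` and `skipFrames` as booleans that default to `false` is an assumption.

```typescript
// Hypothetical shapes for illustration; only the option names and ranges come from the README.
type DetailLevel = "brief" | "standard" | "detailed";

type Field =
  | "metadata" | "transcript" | "frames" | "comments"
  | "chapters" | "ocrResults" | "timeline" | "aiSummary";

interface AnalyzeVideoOptions {
  detail?: DetailLevel;
  fields?: Field[];
  maxFrames?: number; // documented range: 1-60
  threshold?: number; // documented range: 0.0-1.0, default 0.1
  forceRefresh?: boolean; // assumed boolean
  skipFrames?: boolean; // assumed boolean
}

const clamp = (value: number, min: number, max: number): number =>
  Math.min(max, Math.max(min, value));

// Fill in the documented defaults and clamp numeric options to their documented ranges.
function normalizeOptions(options: AnalyzeVideoOptions = {}) {
  return {
    detail: options.detail ?? ("standard" as DetailLevel),
    fields: options.fields,
    // The default depends on the detail level, so it stays undefined when not provided.
    maxFrames:
      options.maxFrames === undefined
        ? undefined
        : clamp(Math.round(options.maxFrames), 1, 60),
    threshold: clamp(options.threshold ?? 0.1, 0, 1),
    forceRefresh: options.forceRefresh ?? false,
    skipFrames: options.skipFrames ?? false,
  };
}

// Keep only the requested top-level fields; return everything when no filter is given.
function filterFields(
  result: Record<string, unknown>,
  fields?: readonly string[],
): Record<string, unknown> {
  if (!fields || fields.length === 0) return result;
  const wanted = new Set(fields);
  return Object.fromEntries(
    Object.entries(result).filter(([key]) => wanted.has(key)),
  );
}
```

For example, `filterFields(result, ["metadata", "transcript"])` would drop frames, OCR results, and the timeline from a response, which keeps token usage down when only the text is needed.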
### `get_transcript` — Transcript only
```
> Get the transcript from this video
```
Quick transcript extraction. Falls back to Whisper transcription when no native transcript is available.
### `get_metadata` — Metadata only
```
> What's this video about?
```
Returns metadata, comments, chapters, and AI summary without downloading the video.
### `get_frames` — Frames only
```
> Extract frames from this video with dense sampling
```
Two modes:
- **Scene-change detection** (default) — captures visual transitions
- **Dense sampling** (`dense: true`) — 1 frame/sec for full coverage
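
As a rough sketch of how the two modes could map onto ffmpeg, the helper below builds an argument list for each. The `select='gt(scene,T)'` and `fps=1` filters are standard ffmpeg filters, but whether this server invokes ffmpeg with exactly these arguments is an assumption, and `buildFfmpegArgs` is a hypothetical name.

```typescript
type FrameMode = "scene" | "dense";

// Build an ffmpeg argument list for one of the two extraction modes.
// Assumption: the real server may use different or additional flags.
function buildFfmpegArgs(
  input: string,
  outputPattern: string,
  mode: FrameMode,
  threshold = 0.1,
): string[] {
  if (mode === "dense") {
    // One frame per second of video.
    return ["-i", input, "-vf", "fps=1", outputPattern];
  }
  const sensitivity = Math.min(1, Math.max(0, threshold));
  // Keep only frames whose scene-change score exceeds the threshold.
  // "-vsync vfr" stops ffmpeg from duplicating frames to fill gaps
  // (recent ffmpeg releases spell this "-fps_mode vfr").
  return [
    "-i", input,
    "-vf", `select='gt(scene,${sensitivity})'`,
    "-vsync", "vfr",
    outputPattern,
  ];
}
```

The returned array is meant to be passed to a process spawner without a shell, so the single quotes are read by ffmpeg's own filter parser instead of by a shell.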
### `analyze_moment` — Deep-dive on a time range
```
> Analyze what happens between 1:30 and 2:00 in this video
```
Combines burst frame extraction + filtered transcript + OCR + annotated timeline for a focused segment. Use when you need to understand exactly what happens at a specific moment.
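
The "filtered transcript" part can be pictured as a time-range filter over transcript entries. The sketch below is a minimal illustration: the `TranscriptEntry` shape and both function names are hypothetical, and keeping any entry that overlaps the range (not only those fully inside it) is an assumption.

```typescript
// Hypothetical entry shape; times are in seconds.
interface TranscriptEntry {
  start: number;
  end: number;
  text: string;
  speaker?: string;
}

// Convert "m:ss" or "h:mm:ss" into seconds, e.g. "1:30" becomes 90.
function parseTimestamp(value: string): number {
  const parts = value.trim().split(":").map(Number);
  if (parts.length < 1 || parts.length > 3 || parts.some((part) => Number.isNaN(part))) {
    throw new Error(`Invalid timestamp: ${value}`);
  }
  return parts.reduce((total, part) => total * 60 + part, 0);
}

// Keep every entry that overlaps the [from, to] window.
function filterTranscript(
  entries: TranscriptEntry[],
  from: number,
  to: number,
): TranscriptEntry[] {
  return entries.filter((entry) => entry.start < to && entry.end > from);
}
```

With these helpers, the prompt above ("between 1:30 and 2:00") would translate to `filterTranscript(entries, parseTimestamp("1:30"), parseTimestamp("2:00"))`.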
### `get_frame_at` — Single frame at a timestamp
```
> Show me the frame at 1:23 in this video
```
The AI reads the transcript, spots a critical moment, and requests the exact frame to see what's on screen.

### `get_frame_burst` — N frames in a time range
```
> Show me 10 frames between 0:15 and 0:17 of this video
```
For motion, vibration, animations, or fast scrolling — burst mode captures N frames in a narrow window so the AI can see frame-by-frame changes.
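
One simple way to pick burst timestamps is to space N points evenly across the window. The sketch below assumes both endpoints are included, which may differ from how the server actually samples; `burstTimestamps` is a hypothetical helper.

```typescript
// Return `count` evenly spaced timestamps (in seconds) from startSec to endSec inclusive.
// Assumption: the real tool may sample the window differently.
function burstTimestamps(startSec: number, endSec: number, count: number): number[] {
  if (!(endSec > startSec)) {
    throw new RangeError("endSec must be greater than startSec");
  }
  if (!Number.isInteger(count) || count < 1) {
    throw new RangeError("count must be a positive integer");
  }
  if (count === 1) return [startSec];
  const step = (endSec - startSec) / (count - 1);
  // Round to milliseconds so floating-point noise does not leak into the timestamps.
  return Array.from({ length: count }, (_, index) =>
    Number((startSec + index * step).toFixed(3)),
  );
}
```

For the prompt above, `burstTimestamps(15, 17, 10)` yields ten timestamps roughly 0.22 s apart, starting at 15 and ending at 17.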
## Detail Levels
| Level | Frames | Transcript | OCR | Timeline | Use case |
|-------|--------|-----------|-----|----------|----------|
| `brief` | None | First 10 entries | No | No | Quick check — what's this video about? |
| `standard` | Up to 20 (scene-change) | Full | Yes | Yes | Default — full analysis |
| `detailed` | Up to 60 (1fps dense) | Full | Yes | Yes | Deep analysis — one frame per second, capped at 60 frames |
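
The table can be read as a small configuration map. The sketch below mirrors the table values; the `DetailConfig` shape and the helper names are hypothetical, and letting an explicit `maxFrames` override the per-level default (within the documented 1-60 range) is an assumption.

```typescript
type DetailLevel = "brief" | "standard" | "detailed";

interface DetailConfig {
  frameMode: "none" | "scene" | "dense";
  maxFrames: number;
  transcriptLimit: number | null; // null means the full transcript
  ocr: boolean;
  timeline: boolean;
}

// Values taken from the Detail Levels table.
const DETAIL_LEVELS: Record<DetailLevel, DetailConfig> = {
  brief: { frameMode: "none", maxFrames: 0, transcriptLimit: 10, ocr: false, timeline: false },
  standard: { frameMode: "scene", maxFrames: 20, transcriptLimit: null, ocr: true, timeline: true },
  detailed: { frameMode: "dense", maxFrames: 60, transcriptLimit: null, ocr: true, timeline: true },
};

// Resolve how many frames to extract for a level, honoring an explicit request.
function resolveFrameBudget(level: DetailLevel, requested?: number): number {
  const config = DETAIL_LEVELS[level];
  if (config.frameMode === "none") return 0;
  if (requested === undefined) return config.maxFrames;
  return Math.min(60, Math.max(1, Math.round(requested)));
}

// Truncate the transcript for levels that only return the first entries.
function limitTranscript<T>(entries: T[], level: DetailLevel): T[] {
  const limit = DETAIL_LEVELS[level].transcriptLimit;
  return limit === null ? entries : entries.slice(0, limit);
}
```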
## Caching
Results are cached in memory for 10 minutes. Calls with the same URL and options within that window return the cached result immediately. Use `forceRefresh: true` to bypass the cache.
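
A minimal sketch of an in-memory TTL cache with LRU eviction, keyed by URL plus options, could look like the following. The class name, the `maxEntries` limit of 50, and the key format are assumptions for illustration, not the server's actual implementation.

```typescript
// In-memory cache with a time-to-live and least-recently-used eviction.
// A Map iterates in insertion order, so the first key is the least recently used.
class TtlCache<V> {
  private readonly entries = new Map<string, { value: V; expiresAt: number }>();
  private readonly ttlMs: number;
  private readonly maxEntries: number;
  private readonly now: () => number;

  constructor(ttlMs = 10 * 60 * 1000, maxEntries = 50, now: () => number = () => Date.now()) {
    this.ttlMs = ttlMs;
    this.maxEntries = maxEntries;
    this.now = now;
  }

  get(key: string): V | undefined {
    const entry = this.entries.get(key);
    if (!entry) return undefined;
    if (entry.expiresAt <= this.now()) {
      this.entries.delete(key); // expired
      return undefined;
    }
    // Re-insert to mark the entry as most recently used.
    this.entries.delete(key);
    this.entries.set(key, entry);
    return entry.value;
  }

  set(key: string, value: V): void {
    this.entries.delete(key);
    if (this.entries.size >= this.maxEntries) {
      const oldest = this.entries.keys().next().value;
      if (oldest !== undefined) this.entries.delete(oldest);
    }
    this.entries.set(key, { value, expiresAt: this.now() + this.ttlMs });
  }
}

// Build a stable key so the same URL and options always map to the same entry,
// regardless of the order in which option properties were written.
function cacheKey(url: string, options: Record<string, unknown> = {}): string {
  const sorted = Object.keys(options).sort().map((key) => [key, options[key]]);
  return `${url}|${JSON.stringify(sorted)}`;
}
```

Injecting the clock through the constructor keeps expiry testable without waiting ten minutes.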
## Supported Platforms
| Platform | Transcript | Metadata | Comments | Frames | Auth |
|----------|:----------:|:--------:|:--------:|:------:|:----:|
| **Loom** | Yes | Yes | Yes | Yes | None |
| **Direct URL** (.mp4, .webm) | No | Duration only | No | Yes | None |
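
Platform detection for the two rows above could be sketched as follows. The matching rules are assumptions: this version treats any `loom.com` URL whose path starts with `/share/` as Loom and any path ending in `.mp4` or `.webm` as a direct file; the real `url-detector.ts` may accept more URL shapes.

```typescript
type Platform = "loom" | "direct";

// Classify a URL as Loom, a direct video file, or unsupported (null).
function detectPlatform(rawUrl: string): Platform | null {
  let url: URL;
  try {
    url = new URL(rawUrl);
  } catch {
    return null; // not a valid absolute URL
  }
  if (url.protocol !== "http:" && url.protocol !== "https:") return null;

  const host = url.hostname.toLowerCase();
  const isLoomHost = host === "loom.com" || host.endsWith(".loom.com");
  if (isLoomHost && url.pathname.startsWith("/share/")) return "loom";

  // Test the path only, so query strings such as "?token=..." do not interfere.
  if (/\.(mp4|webm)$/i.test(url.pathname)) return "direct";
  return null;
}
```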
### Frame Extraction Strategies
Frame extraction uses a two-strategy fallback chain — no single dependency is required:
| Strategy | How it works | Speed | Requirements |
|----------|-------------|-------|-------------|
| **yt-dlp + ffmpeg** (primary) | Downloads video, extracts frames via scene detection | Fast, precise | [yt-dlp](https://github.com/yt-dlp/yt-dlp) (`pip install yt-dlp`) |
| **Browser** (fallback) | Opens video in headless Chrome, seeks to timestamps, takes screenshots | Slower, no download needed | Chrome or Chromium installed |
The fallback is automatic — if yt-dlp is not available, the server tries browser-based extraction via `puppeteer-core`. If neither is available, analysis still returns transcript + metadata + comments, just no frames.
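
The fallback behavior described above can be sketched as a loop over strategies in priority order. The `FrameStrategy` interface and `extractWithFallback` function are hypothetical; the point is only that a missing dependency or a failed extraction moves on to the next strategy, and that exhausting the chain returns an empty frame list instead of failing the whole analysis.

```typescript
interface Frame {
  timestamp: number; // seconds
  path: string;
}

// Hypothetical strategy contract: an availability probe plus the extraction itself.
interface FrameStrategy {
  name: string;
  isAvailable(): Promise<boolean>;
  extract(url: string): Promise<Frame[]>;
}

interface ExtractionResult {
  frames: Frame[];
  strategy: string | null; // which strategy produced the frames, if any
  warnings: string[];
}

// Try each strategy in order; fall through on unavailability or errors.
async function extractWithFallback(
  url: string,
  strategies: FrameStrategy[],
): Promise<ExtractionResult> {
  const warnings: string[] = [];
  for (const strategy of strategies) {
    try {
      if (!(await strategy.isAvailable())) {
        warnings.push(`${strategy.name}: not available`);
        continue;
      }
      const frames = await strategy.extract(url);
      return { frames, strategy: strategy.name, warnings };
    } catch (error) {
      const message = error instanceof Error ? error.message : String(error);
      warnings.push(`${strategy.name}: ${message}`);
    }
  }
  // No strategy worked: the caller can still return transcript and metadata.
  return { frames: [], strategy: null, warnings };
}
```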
### Post-Processing Pipeline
After frame extraction, the pipeline automatically applies:
| Step | What it does | Why |
|------|-------------|-----|
| **Frame deduplication** | Removes near-identical consecutive frames using perceptual hashing (dHash + Hamming distance) | Screencasts often have long static moments — dedup removes redundant frames, saving tokens |
| **OCR** | Extracts text visible on screen from each frame (via tesseract.js) | Captures code, error messages, terminal output, UI text that the transcript doesn't cover |
| **Annotated timeline** | Merges transcript timestamps + frame timestamps + OCR text into a single chronological view | Gives the AI a unified "what was said, what changed visually, and what text appeared" at each moment |
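
The deduplication row can be illustrated with a small difference-hash (dHash) sketch. It works on grayscale pixel arrays to stay dependency-free; the nearest-neighbor downsampling, the `GrayImage` shape, and the Hamming-distance cutoff of 5 are assumptions chosen for illustration and may not match `frame-dedup.ts`.

```typescript
// Row-major grayscale image, one luminance value (0-255) per pixel.
interface GrayImage {
  width: number;
  height: number;
  pixels: ArrayLike<number>;
}

// dHash: sample a 9x8 grid, then record for each row whether each sample
// is brighter than its right-hand neighbor. The result is a 64-character bit string.
function dHash(image: GrayImage): string {
  const cols = 9;
  const rows = 8;
  const sample = (x: number, y: number): number => {
    const px = Math.min(image.width - 1, Math.floor(((x + 0.5) * image.width) / cols));
    const py = Math.min(image.height - 1, Math.floor(((y + 0.5) * image.height) / rows));
    return image.pixels[py * image.width + px];
  };
  let bits = "";
  for (let y = 0; y < rows; y++) {
    for (let x = 0; x < cols - 1; x++) {
      bits += sample(x, y) > sample(x + 1, y) ? "1" : "0";
    }
  }
  return bits;
}

// Number of positions at which two equal-length bit strings differ.
function hammingDistance(a: string, b: string): number {
  if (a.length !== b.length) throw new Error("Hashes must have equal length");
  let distance = 0;
  for (let i = 0; i < a.length; i++) {
    if (a[i] !== b[i]) distance++;
  }
  return distance;
}

// Drop frames whose hash is within maxDistance of the last kept frame.
function dedupeConsecutive<T extends { image: GrayImage }>(frames: T[], maxDistance = 5): T[] {
  const kept: T[] = [];
  let lastHash: string | null = null;
  for (const frame of frames) {
    const hash = dHash(frame.image);
    if (lastHash === null || hammingDistance(hash, lastHash) > maxDistance) {
      kept.push(frame);
      lastHash = hash;
    }
  }
  return kept;
}
```

Comparing against the last kept frame, instead of the immediately preceding one, prevents a slow drift of tiny changes from being discarded entirely.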
The OCR step requires `tesseract.js` (included as a dependency). If it fails to load, analysis continues without OCR — no frames or transcript are lost.
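
The annotated timeline step can be pictured as a merge of three timestamped lists into one sorted list of events. The event shapes and `buildTimeline` below are hypothetical; the actual output format in `timeline.json` may differ.

```typescript
interface TranscriptEntry { time: number; text: string; speaker?: string }
interface FrameInfo { time: number; path: string }
interface OcrResult { time: number; text: string }

type EventKind = "transcript" | "frame" | "ocr";

interface TimelineEvent {
  time: number; // seconds from the start of the video
  kind: EventKind;
  text: string;
}

// Tie-break order for events that share a timestamp.
const KIND_ORDER: Record<EventKind, number> = { transcript: 0, frame: 1, ocr: 2 };

// Merge transcript, frames, and OCR text into one chronological list.
function buildTimeline(
  transcript: TranscriptEntry[],
  frames: FrameInfo[],
  ocr: OcrResult[],
): TimelineEvent[] {
  const events: TimelineEvent[] = [
    ...transcript.map((entry) => ({
      time: entry.time,
      kind: "transcript" as const,
      text: entry.speaker ? `${entry.speaker}: ${entry.text}` : entry.text,
    })),
    ...frames.map((frame) => ({ time: frame.time, kind: "frame" as const, text: frame.path })),
    ...ocr
      .filter((result) => result.text.trim() !== "") // skip frames with no readable text
      .map((result) => ({ time: result.time, kind: "ocr" as const, text: result.text.trim() })),
  ];
  return events.sort(
    (a, b) => a.time - b.time || KIND_ORDER[a.kind] - KIND_ORDER[b.kind],
  );
}
```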
## Complementary Tools
### Chrome DevTools MCP
For **live web debugging** alongside video analysis, pair this server with the [Chrome DevTools MCP](https://github.com/anthropics/anthropic-quickstarts/tree/main/mcp-devtools):
```bash
claude mcp add chrome-devtools npx @anthropic-ai/mcp-devtools@latest
```
**When to use each:**
| Scenario | Tool |
|----------|------|
| Bug report recorded as a Loom video | `mcp-video-analyzer` — extract transcript, frames, and error text from the recording |
| Live debugging a web page | Chrome DevTools MCP — inspect DOM, console, network, take screenshots |
| Video shows UI issue, need to reproduce it | Use both: analyze the video first, then open the page in Chrome DevTools to reproduce |
The two MCPs complement each other: video analyzer understands **recorded** content, DevTools interacts with **live** pages.
## Example Output
The [`examples/loom-demo/`](examples/loom-demo/) folder contains **real outputs** from analyzing a public Loom video ([Boost In-App Demo Video](https://www.loom.com/share/bdebdfe44b294225ac718bad241a94fe), 2:55).
| File | What it shows |
|------|--------------|
| [`metadata.json`](examples/loom-demo/metadata.json) | Title, duration, platform |
| [`transcript.json`](examples/loom-demo/transcript.json) | 42 timestamped entries with speaker IDs |
| [`timeline.json`](examples/loom-demo/timeline.json) | Unified chronological view (transcript + frames merged) |
| [`moment-transcript-0m30s-0m45s.json`](examples/loom-demo/moment-transcript-0m30s-0m45s.json) | Filtered transcript for `analyze_moment` (0:30–0:45) |
| [`full-analysis.json`](examples/loom-demo/full-analysis.json) | Complete `analyze_video` output |
**Frame images** (19 total in [`examples/loom-demo/frames/`](examples/loom-demo/frames/)):
- `scene_*.jpg` — scene-change detection (key visual transitions)
- `dense_*.jpg` — 1fps dense sampling (every 10th frame saved as sample)
- `burst_*.jpg` — burst extraction for moment analysis (0:30–0:45)
> **Regenerate after changes:** `npx tsx examples/generate.ts` — requires yt-dlp + network access.
## Development
```bash
# Install dependencies
npm install
# Run all checks (format, lint, typecheck, knip, tests)
npm run check
# Build
npm run build
# Run E2E tests (requires network)
npm run test:e2e
# Open MCP Inspector for manual testing
npm run inspect
```
## Architecture
```
src/
├── index.ts # Entry point (shebang + stdio)
├── server.ts # FastMCP server + tool registration
├── tools/ # MCP tool definitions (7 tools)
│ ├── analyze-video.ts # Full analysis with detail levels + caching
│ ├── analyze-moment.ts # Deep-dive on a time range
│ ├── get-transcript.ts # Transcript-only with Whisper fallback
│ ├── get-metadata.ts # Metadata + comments + chapters
│ ├── get-frames.ts # Frames-only (scene-change or dense)
│ ├── get-frame-at.ts # Single frame at timestamp
│ └── get-frame-burst.ts # N frames in a time range
├── adapters/ # Platform-specific logic
│ ├── adapter.interface.ts # IVideoAdapter interface + registry
│ ├── loom.adapter.ts # Loom: authless GraphQL
│ └── direct.adapter.ts # Direct URL: any mp4/webm link
├── processors/ # Shared processing
│ ├── frame-extractor.ts # ffmpeg scene detection + dense + burst extraction
│ ├── browser-frame-extractor.ts # Headless Chrome fallback for frames
│ ├── audio-transcriber.ts # Whisper fallback (HF transformers → CLI → OpenAI)
│ ├── image-optimizer.ts # sharp resize/compress
│ ├── frame-dedup.ts # Perceptual dedup (dHash + Hamming distance)
│ ├── frame-ocr.ts # OCR text extraction (tesseract.js)
│ └── annotated-timeline.ts # Unified timeline (transcript + frames + OCR)
├── config/
│ └── detail-levels.ts # brief / standard / detailed config
├── utils/
│ ├── cache.ts # In-memory TTL cache with LRU eviction
│ ├── field-filter.ts # Selective field filtering for responses
│ ├── url-detector.ts # Platform detection from URL
│ ├── vtt-parser.ts # WebVTT → transcript entries
│ └── temp-files.ts # Temp directory management
└── types.ts # Shared TypeScript interfaces
```
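
To make the `vtt-parser.ts` entry in the tree more concrete, here is a minimal sketch of turning WebVTT cues into transcript entries. It handles only the basics (cue timings, optional cue identifiers, cue settings, inline tags) and is not the project's actual parser; `parseVtt`, `parseVttTime`, and the entry shape are hypothetical.

```typescript
interface TranscriptEntry {
  start: number; // seconds
  end: number; // seconds
  text: string;
}

// Convert "hh:mm:ss.mmm" or "mm:ss.mmm" into seconds.
function parseVttTime(value: string): number {
  const parts = value.trim().split(":").map(Number);
  if (parts.length < 2 || parts.length > 3 || parts.some((part) => Number.isNaN(part))) {
    throw new Error(`Invalid WebVTT timestamp: ${value}`);
  }
  return parts.reduce((total, part) => total * 60 + part, 0);
}

// Parse WebVTT text into transcript entries, skipping the header and NOTE/STYLE blocks.
function parseVtt(vtt: string): TranscriptEntry[] {
  const entries: TranscriptEntry[] = [];
  const blocks = vtt.replace(/\r\n?/g, "\n").split(/\n{2,}/);
  for (const block of blocks) {
    const lines = block.split("\n").filter((line) => line.trim() !== "");
    const timingIndex = lines.findIndex((line) => line.includes("-->"));
    if (timingIndex === -1) continue; // not a cue
    const [startRaw, rest] = lines[timingIndex].split("-->");
    const endRaw = rest.trim().split(/\s+/)[0]; // drop cue settings such as "align:start"
    const text = lines
      .slice(timingIndex + 1)
      .join(" ")
      .replace(/<[^>]+>/g, "") // strip inline tags like <b> or <v Speaker>
      .trim();
    if (text !== "") {
      entries.push({ start: parseVttTime(startRaw), end: parseVttTime(endRaw), text });
    }
  }
  return entries;
}
```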
## License
MIT