{"id":48517162,"url":"https://github.com/devhims/youtube-caption-extractor","last_synced_at":"2026-05-16T20:04:00.033Z","repository":{"id":161396500,"uuid":"636094222","full_name":"devhims/youtube-caption-extractor","owner":"devhims","description":"A lightweight package to scrape and parse captions (subtitles) from YouTube videos, supporting both user-submitted and auto-generated captions with language options.","archived":false,"fork":false,"pushed_at":"2025-08-28T08:40:51.000Z","size":524,"stargazers_count":113,"open_issues_count":1,"forks_count":15,"subscribers_count":2,"default_branch":"main","last_synced_at":"2025-11-04T15:14:27.378Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":"https://youtube-caption-extractor.vercel.app","language":"TypeScript","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/devhims.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null}},"created_at":"2023-05-04T05:48:00.000Z","updated_at":"2025-11-03T20:29:50.000Z","dependencies_parsed_at":"2026-03-26T22:01:19.610Z","dependency_job_id":null,"html_url":"https://github.com/devhims/youtube-caption-extractor","commit_stats":null,"previous_names":[],"tags_count":14,"template":false,"template_full_name":null,"purl":"pkg:github/devhims/youtube-caption-extractor","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/devhims%2Fyoutube-caption-extractor","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/devhims%2Fyoutube-caption-extractor/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/de
vhims%2Fyoutube-caption-extractor/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/devhims%2Fyoutube-caption-extractor/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/devhims","download_url":"https://codeload.github.com/devhims/youtube-caption-extractor/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/devhims%2Fyoutube-caption-extractor/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":31526666,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-04-07T16:28:08.000Z","status":"ssl_error","status_checked_at":"2026-04-07T16:28:06.951Z","response_time":105,"last_error":"SSL_connect returned=1 errno=0 peeraddr=140.82.121.5:443 state=error: unexpected eof while reading","robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":false,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2026-04-07T19:32:21.163Z","updated_at":"2026-05-16T20:04:00.025Z","avatar_url":"https://github.com/devhims.png","language":"TypeScript","funding_links":[],"categories":[],"sub_categories":[],"readme":"\u003cdiv align=\"center\"\u003e\n  \u003ch1\u003eyoutube-caption-extractor\u003c/h1\u003e\n  \u003cp\u003e\u003cstrong\u003eTurn public YouTube videos into clean, timestamped transcripts.\u003c/strong\u003e\u003c/p\u003e\n  \u003cp\u003e\n    Extract YouTube captions, subtitles, auto-generated transcripts, and video metadata\n    with a tiny TypeScript library: ~10.8 
kB packed, ~31.1 kB unpacked.\n  \u003c/p\u003e\n  \u003cp\u003e\n    \u003ca href=\"https://www.npmjs.com/package/youtube-caption-extractor\"\u003e\u003cimg alt=\"npm version\" src=\"https://img.shields.io/npm/v/youtube-caption-extractor?color=cb3837\"\u003e\u003c/a\u003e\n    \u003ca href=\"https://www.npmjs.com/package/youtube-caption-extractor\"\u003e\u003cimg alt=\"npm downloads\" src=\"https://img.shields.io/npm/dm/youtube-caption-extractor\"\u003e\u003c/a\u003e\n    \u003ca href=\"./LICENSE\"\u003e\u003cimg alt=\"license\" src=\"https://img.shields.io/npm/l/youtube-caption-extractor\"\u003e\u003c/a\u003e\n    \u003ca href=\"https://www.typescriptlang.org/\"\u003e\u003cimg alt=\"TypeScript ready\" src=\"https://img.shields.io/badge/TypeScript-ready-3178c6\"\u003e\u003c/a\u003e\n    \u003ca href=\"https://nodejs.org/\"\u003e\u003cimg alt=\"Node.js 18+\" src=\"https://img.shields.io/badge/Node.js-18%2B-43853d\"\u003e\u003c/a\u003e\n  \u003c/p\u003e\n  \u003cp\u003e\n    \u003ca href=\"#try-it-quickly\"\u003eQuickstart\u003c/a\u003e\n    · \u003ca href=\"https://youtube-caption-extractor.vercel.app/\"\u003eLive demo\u003c/a\u003e\n    · \u003ca href=\"./sample\"\u003eSample app\u003c/a\u003e\n    · \u003ca href=\"#api\"\u003eAPI\u003c/a\u003e\n    · \u003ca href=\"#deployment-notes\"\u003eDeployment notes\u003c/a\u003e\n  \u003c/p\u003e\n\u003c/div\u003e\n\n---\n\n## Why use it?\n\n- **One-call transcript extraction** — `getSubtitles()` returns timestamped segments ready for search, summarization, indexing, RAG, slide-ready research notes, or export.\n- **Metadata included** — `getVideoDetails()` returns title, description, and captions in one response.\n- **Manual + auto captions** — prefers exact language matches, then gracefully falls back to available tracks.\n- **Tiny install** — ~10.8 kB packed on npm, with only two runtime dependencies.\n- **Runtime-friendly** — uses global `fetch`, with an optional custom transport for retries, caching, regional routing, 
or proxies.\n- **Production-aware sample** — includes a Next.js demo plus a token-protected Cloudflare Container API.\n\n## Built for\n\n- YouTube caption extraction, subtitle extraction, and timestamped transcript data.\n- AI summaries, search indexes, RAG pipelines, and agent workflows that need clean video text.\n- Slide-ready notes, presentation research, and content workflows built from public YouTube videos.\n- Lightweight video metadata enrichment without pulling in a large SDK.\n\n```ts\nimport { getSubtitles } from 'youtube-caption-extractor';\n\nconst subtitles = await getSubtitles({ videoID: '7GeFt8suV8E', lang: 'en' });\n// → [\n//     { start: '1.12', dur: '4.56', text: 'This scraper can scrape almost anything' },\n//     { start: '3.36', dur: '5.84', text: 'on the internet and you will be' },\n//     { start: '5.68', dur: '6.64', text: 'surprised how easy it is to use it.' },\n//     ...\n//   ]\n```\n\n## Try it quickly\n\n```sh\nnpm install youtube-caption-extractor\n```\n\n```ts\nimport { getVideoDetails } from 'youtube-caption-extractor';\n\nconst video = await getVideoDetails({\n  videoID: '7GeFt8suV8E',\n  lang: 'en',\n});\n\nconsole.log(video.title);\nconsole.log(video.subtitles.map((s) =\u003e s.text).join('\\n'));\n```\n\nWant to click around first? Try the hosted demo:\n[youtube-caption-extractor.vercel.app](https://youtube-caption-extractor.vercel.app/).\n\nWant a full app example? See [`sample/`](./sample), which includes:\n\n- A polished Next.js UI\n- Local API testing with your machine's network egress\n- A Dockerized Hono API deployed through Cloudflare Containers\n- A server-side token-protected proxy so the container API is not publicly open\n\n## Installation\n\n```sh\nnpm install youtube-caption-extractor\n```\n\nRequires **Node.js ≥ 18** when running on Node.js because the library uses the\nglobal `fetch` API. It also works in Bun, Deno, Cloudflare Workers, and other\nmodern JavaScript runtimes that provide `fetch`. 
See\n[Deployment notes](#deployment-notes) for tips on keeping calls reliable from\nyour runtime of choice.\n\n## API\n\nThe library exports two functions and three types.\n\n### `getSubtitles({ videoID, lang?, fetch? })`\n\nReturns the caption track as an array of timed segments.\n\n| Param     | Type           | Default        | Notes                                                                                                                                                            |\n| --------- | -------------- | -------------- | ---------------------------------------------------------------------------------------------------------------------------------------------------------------- |\n| `videoID` | `string`       | (required)     | The 11-character YouTube video ID, e.g. `7GeFt8suV8E`. Not the full URL.                                                                                         |\n| `lang`    | `string`       | `'en'`         | ISO language code (`'en'`, `'es'`, `'fr'`, `'ja'`, …). Manual captions are preferred over auto-generated, and an exact match is preferred over a partial match.  |\n| `fetch`   | `typeof fetch` | global `fetch` | Custom fetch implementation. Useful for adding caching, custom retries, or routing through a proxy. See [Customizing the transport](#customizing-the-transport). |\n\nResolves to `Subtitle[]`. Returns an empty array if the video plays but has no caption track in the requested language. **Throws** if the video is unavailable on any extraction path (see [Error handling](#error-handling)).\n\n### `getVideoDetails({ videoID, lang?, fetch? })`\n\nSame arguments as `getSubtitles`. 
Returns title, description, and the same subtitle array:\n\n```ts\nconst details = await getVideoDetails({ videoID: '7GeFt8suV8E', lang: 'en' });\n// → {\n//     title: 'Master Web Scraping with Firecrawl!',\n//     description: 'Get started with Firecrawl here: https://firecrawl.link/…',\n//     subtitles: [{ start: '1.12', dur: '4.56', text: 'This scraper can scrape almost anything' }, …],\n//   }\n```\n\nIf subtitles fail to extract but the video metadata is available, `subtitles` will be an empty array and the call still resolves (rather than throwing). This way you can always show title/description even when captions aren't available.\n\n### Types\n\n```ts\ninterface Subtitle {\n  start: string; // Segment start time, seconds\n  dur: string; // Segment duration, seconds\n  text: string; // Decoded text content\n}\n\ninterface VideoDetails {\n  title: string;\n  description: string;\n  subtitles: Subtitle[];\n}\n\ninterface Options {\n  videoID: string;\n  lang?: string;\n  fetch?: typeof fetch;\n}\n```\n\nAll three are exported by name and can be imported directly:\n\n```ts\nimport type {\n  Subtitle,\n  VideoDetails,\n  Options,\n} from 'youtube-caption-extractor';\n```\n\n## Languages\n\nThe `lang` argument is a hint, not a strict filter. Track selection precedence:\n\n1. **Manual captions in the requested language** (`vssId === '.\u003clang\u003e'`)\n2. **Auto-generated captions in the requested language** (`vssId === 'a.\u003clang\u003e'`)\n3. **Any track whose `languageCode` matches** the requested code\n4. **Any track whose `vssId` contains the requested code** (partial match)\n5. **The first available track** as a final fallback\n\nIf you pass `lang: 'en'` and the video only has Spanish manual captions, you'll get those — the library prefers _some_ output over none. If you pass a code that doesn't exist on the video, you'll typically get the video's primary language track. 
To check whether you got what you asked for, inspect the first segment's text or compare against `VideoDetails.title` / `description`.\n\n## Error handling\n\nThe library throws a regular `Error` when no extraction path succeeds — for instance, when the video is private, deleted, or YouTube didn't return a usable response.\n\nThe error message has a stable, parseable structure listing each client that was attempted along with the status YouTube returned for it:\n\n```\nVideo not playable on any client. Attempts:\nios: LOGIN_REQUIRED - Sign in to confirm you're not a bot\nandroid_vr: LOGIN_REQUIRED - Sign in to confirm you're not a bot\nmweb: LOGIN_REQUIRED - Sign in to confirm you're not a bot\n```\n\nA common pattern for classifying errors and surfacing them gracefully:\n\n```ts\ntry {\n  const subtitles = await getSubtitles({ videoID, lang });\n  return subtitles;\n} catch (err) {\n  const msg = err instanceof Error ? err.message : String(err);\n  if (msg.includes('LOGIN_REQUIRED') || msg.includes('not a bot')) {\n    // Transient — usually worth retrying. See \"Deployment notes\" for details.\n    throw new Error('transient_extraction_failure');\n  }\n  if (msg.includes('Video unavailable') || msg.includes('private')) {\n    throw new Error('video_not_accessible');\n  }\n  throw err;\n}\n```\n\n## Deployment notes\n\nThe library calls YouTube directly, so reliability depends partly on the network\negress of the process making the request.\n\nLocal development and self-hosted servers tend to work out of the box. Shared\nserverless, container, and edge IP ranges can sometimes be rate-limited or gated\nby YouTube's bot checks. That is not a library API issue; it is an egress\nreputation issue. For production, use the patterns below.\n\n### Recommended app architecture\n\nKeep YouTube extraction server-side. 
Do not call YouTube directly from browser\ncode.\n\n```txt\nBrowser → your app API route → youtube-caption-extractor → YouTube\n```\n\nIf you use a separate API service, protect it with a server-side token:\n\n```txt\nBrowser → your app API route → token-protected caption API → YouTube\n```\n\nThe included [`sample/`](./sample) demonstrates this pattern with:\n\n- Next.js API routes as the public browser-facing API\n- A Cloudflare Worker that rejects requests without `Authorization: Bearer \u003ctoken\u003e`\n- A Cloudflare Container running a Hono/Node API\n- `CAPTION_API_TOKEN` kept server-side only, never in `NEXT_PUBLIC_*`\n\n### Building resilient calls\n\nA small retry wrapper handles transient failures gracefully:\n\n```ts\nimport { getSubtitles, type Subtitle } from 'youtube-caption-extractor';\n\nasync function getSubtitlesWithRetry(\n  videoID: string,\n  lang = 'en',\n  maxAttempts = 3,\n): Promise\u003cSubtitle[]\u003e {\n  let lastError: unknown;\n  for (let attempt = 1; attempt \u003c= maxAttempts; attempt++) {\n    try {\n      return await getSubtitles({ videoID, lang });\n    } catch (err) {\n      lastError = err;\n      const msg = err instanceof Error ? 
err.message : String(err);\n      // Don't retry on permanent errors (private/deleted video, etc.)\n      if (msg.includes('Video unavailable') || msg.includes('private')) {\n        throw err;\n      }\n      // Small backoff between attempts\n      await new Promise((r) =\u003e setTimeout(r, 200 * attempt));\n    }\n  }\n  throw lastError;\n}\n```\n\n### Customizing the transport\n\nThe optional `fetch` argument lets you supply any custom transport — useful for adding caching, custom headers, regional routing, or proxying through another service:\n\n```ts\nimport { getSubtitles } from 'youtube-caption-extractor';\nimport { ProxyAgent, fetch as undiciFetch } from 'undici';\n\nconst dispatcher = new ProxyAgent(process.env.OUTBOUND_PROXY_URL!);\n\nconst proxied: typeof fetch = (input, init) =\u003e\n  undiciFetch(input, { ...init, dispatcher }) as unknown as Promise\u003cResponse\u003e;\n\nconst subtitles = await getSubtitles({\n  videoID: '7GeFt8suV8E',\n  lang: 'en',\n  fetch: proxied,\n});\n```\n\nCommon uses for a custom `fetch`:\n\n- **Caching layers** — wrap the global fetch with an LRU or in-memory cache\n- **Authenticated proxies** — add `Authorization` headers via a wrapper\n- **Regional routing** — direct outbound traffic through a specific region or provider\n\n### Local vs hosted behavior\n\nIf extraction works locally but fails in a hosted environment with a message like\n`LOGIN_REQUIRED` or \"Sign in to confirm you're not a bot\", the hosted provider's\negress IP is likely being challenged by YouTube. Your options are:\n\n1. Run the extraction API somewhere with reliable egress for your workload.\n2. Use the `fetch` option to route outbound YouTube requests through a trusted proxy.\n3. Cache successful results aggressively so fewer requests reach YouTube.\n4. 
Treat these failures as transient and retry with backoff where appropriate.\n\n## Usage examples\n\n### Next.js (App Router)\n\n```ts\n// app/api/captions/route.ts\nimport { NextResponse, type NextRequest } from 'next/server';\nimport { getVideoDetails } from 'youtube-caption-extractor';\n\nexport async function GET(request: NextRequest) {\n  const videoID = request.nextUrl.searchParams.get('videoID');\n  const lang = request.nextUrl.searchParams.get('lang') ?? 'en';\n\n  if (!videoID) {\n    return NextResponse.json({ error: 'Missing videoID' }, { status: 400 });\n  }\n\n  try {\n    const details = await getVideoDetails({ videoID, lang });\n    return NextResponse.json(details);\n  } catch (err) {\n    const msg = err instanceof Error ? err.message : String(err);\n    return NextResponse.json({ error: msg }, { status: 500 });\n  }\n}\n```\n\nCall from a client component with `fetch('/api/captions?videoID=...')`. This avoids the browser CORS issue and keeps the YouTube call server-side.\n\n### Express\n\n```ts\nimport express from 'express';\nimport { getSubtitles } from 'youtube-caption-extractor';\n\nconst app = express();\n\napp.get('/captions/:videoID', async (req, res) =\u003e {\n  try {\n    const subtitles = await getSubtitles({\n      videoID: req.params.videoID,\n      lang: (req.query.lang as string) ?? 
'en',\n    });\n    res.json({ subtitles });\n  } catch (err) {\n    res.status(500).json({ error: (err as Error).message });\n  }\n});\n```\n\n### Cloudflare Workers\n\n```ts\nimport { getSubtitles } from 'youtube-caption-extractor';\n\nexport default {\n  async fetch(request: Request): Promise\u003cResponse\u003e {\n    const url = new URL(request.url);\n    const videoID = url.searchParams.get('videoID');\n    if (!videoID) return new Response('Missing videoID', { status: 400 });\n\n    try {\n      const subtitles = await getSubtitles({ videoID, lang: 'en' });\n      return Response.json({ subtitles });\n    } catch (err) {\n      return Response.json({ error: (err as Error).message }, { status: 500 });\n    }\n  },\n};\n```\n\nAdd `\"compatibility_flags\": [\"nodejs_compat\"]` to your `wrangler.jsonc` so the library's `he` and `striptags` dependencies resolve. For production workloads, wrap the call in the retry helper from [Building resilient calls](#building-resilient-calls).\n\n## Debug logging\n\nThe library is silent by default. To see what's happening internally — which client returned what, where it fell back, what URL was hit — set the `DEBUG` env var:\n\n```sh\nDEBUG=youtube-caption-extractor node your-script.js\n\n# Cloudflare Workers\nDEBUG=youtube-caption-extractor wrangler dev\n\n# Or DEBUG=* for everything\n```\n\nThe logger uses only `console.log` and `process.env` (read defensively), so it works in any runtime that provides those — no `debug` package dependency.\n\n## TypeScript\n\nThe package ships type definitions; no `@types/*` install needed. 
All three types (`Subtitle`, `VideoDetails`, `Options`) are exported:\n\n```ts\nimport {\n  getSubtitles,\n  getVideoDetails,\n  type Subtitle,\n  type VideoDetails,\n  type Options,\n} from 'youtube-caption-extractor';\n\nasync function transcript(opts: Options): Promise\u003cSubtitle[]\u003e {\n  return await getSubtitles(opts);\n}\n```\n\n## Changelog\n\n### v1.10.2\n\n- Added a production-ready sample API path using a Dockerized Hono server on Cloudflare Containers.\n- Added server-side token protection for the sample Cloudflare Worker API.\n- Updated the sample app so browser requests go through Next.js API routes instead of exposing API secrets client-side.\n- Refreshed README quickstart and deployment guidance to make local testing, hosted demos, and production egress tradeoffs clearer.\n\n### v1.10.1\n\n- **Streamlined the internal client fallback chain.** Removed an outdated client that was no longer contributing successful extractions, and reordered the remaining clients with the most reliable one first.\n- **Faster successful calls** — one fewer round-trip in the common case (~150 ms saved per request).\n- No API changes; fully backward-compatible.\n\n### v1.10.0\n\n- Caption extraction is reliable again — fixes a regression where `getSubtitles` would silently return `[]` for many videos.\n- Multi-path extraction with automatic fallback across clients; gracefully degrades when one path is unavailable.\n- json3-based subtitle parser replaces the legacy XML regex, fixing multi-line and special-character edge cases.\n- New optional `fetch` option for routing through a residential proxy.\n- `Options` interface now exported.\n- `engines.node` bumped to `\u003e=18.0.0`.\n- Slimmer install — published tarball is ~85% smaller.\n\n### v1.9.0\n\n- `Subtitle` interface exported.\n- Universal debug logger that works in Node.js, Cloudflare Workers, and edge runtimes.\n- Library is silent by default in production.\n\n### v1.4.2\n\n- TypeScript definitions shipped with 
the package.\n- Node.js and edge runtime support.\n- New `getVideoDetails` API for title + description + subtitles in one call.\n\n## License\n\n[MIT](./LICENSE)\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fdevhims%2Fyoutube-caption-extractor","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fdevhims%2Fyoutube-caption-extractor","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fdevhims%2Fyoutube-caption-extractor/lists"}