https://github.com/devdogukan/youtube-transcript-extarctor

Last synced: 28 days ago
Host: GitHub
URL: https://github.com/devdogukan/youtube-transcript-extarctor
Owner: devdogukan
License: mit
Created: 2025-12-12T19:02:51.000Z (6 months ago)
Default Branch: main
Last Pushed: 2025-12-12T22:15:19.000Z (6 months ago)
Last Synced: 2025-12-14T10:06:57.929Z (6 months ago)
Language: TypeScript
Size: 25.4 KB
Stars: 1
Watchers: 0
Forks: 0
Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
README

# YouTube Transcript Extractor

A simple, plug-and-play Node.js module that extracts clean transcripts from YouTube videos. No UI, no website - just a reusable backend module.

## Features

- ✅ Extracts transcripts from YouTube videos (manual and auto-generated captions)

- ✅ Returns transcript segments with text, duration, offset, and language

- ✅ Multiple output formats: JSON, Text, Markdown, SRT

- ✅ Automatic HTML entity decoding (converts `&#39;` to `'`, etc.)

- ✅ CLI support for command-line usage

- ✅ Written in TypeScript with full type definitions

- ✅ Works without YouTube login

- ✅ Minimal dependencies (uses Node.js built-in `fetch`; TypeScript is needed only for development)

- ✅ Error handling for videos without captions

## Installation

This module requires Node.js 18.0.0 or higher (for built-in fetch support).

### Building the Project

First, install dependencies and build the TypeScript code:

```bash

npm install

npm run build

```

This will compile TypeScript files from `src/` to `dist/`.

### As a Module

After building, import from the compiled output:

```javascript

import { getTranscriptFromUrl } from './youtube-transcript-extractor/dist/index.js';

```

Or if using TypeScript in your project, you can import directly from source:

```typescript

import { getTranscriptFromUrl } from './youtube-transcript-extractor/src/index.js';

```

### As a CLI Tool (Global Installation)

```bash

npm install -g .

```

After installation, you can use the `youtube-transcript` command globally.

## Quick Start

### CLI Usage

```bash
# Quote the URL so shells such as zsh do not treat "?" as a glob character

# JSON format (default, outputs to console)
node dist/cli.js "https://www.youtube.com/watch?v=dQw4w9WgXcQ"

# Text format
node dist/cli.js "https://www.youtube.com/watch?v=dQw4w9WgXcQ" --format text

# Markdown format with timestamps, save to file
node dist/cli.js "https://www.youtube.com/watch?v=dQw4w9WgXcQ" --format markdown --output transcript.md --timestamps

# SRT format, save to file
node dist/cli.js "https://www.youtube.com/watch?v=dQw4w9WgXcQ" --format srt --output transcript.srt

# Show help
node dist/cli.js --help
```

If installed globally:

```bash
youtube-transcript "https://www.youtube.com/watch?v=dQw4w9WgXcQ" --format text
```

### Module Usage

**JavaScript:**

```javascript

import { getTranscriptFromUrl } from './dist/index.js';

// JSON format (default)

const result = await getTranscriptFromUrl('https://www.youtube.com/watch?v=dQw4w9WgXcQ');

if (typeof result === 'object' && !Array.isArray(result) && 'error' in result) {

  console.log('Error:', result.error);

  console.log('Video ID:', result.videoId);

} else if (Array.isArray(result)) {

  // result is an array of transcript segments

  console.log(`Found ${result.length} segments`);

  result.forEach((segment, index) => {

    console.log(`Segment ${index + 1}:`, segment.text);

    console.log(`  Duration: ${segment.duration}s, Offset: ${segment.offset}s`);

  });

}

// Text format

const text = await getTranscriptFromUrl('https://www.youtube.com/watch?v=dQw4w9WgXcQ', {

  format: 'text',

  formatOptions: { sentencesPerParagraph: 3 }

});

// Markdown format with timestamps, save to file

await getTranscriptFromUrl('https://www.youtube.com/watch?v=dQw4w9WgXcQ', {

  format: 'markdown',

  formatOptions: { includeTimestamps: true },

  outputFile: './transcript.md'

});

```

**TypeScript:**

```typescript

import { getTranscriptFromUrl, OutputFormat } from './src/index.js';

import type { TranscriptSegment } from './src/types.js';

// JSON format (default)

const result = await getTranscriptFromUrl('https://www.youtube.com/watch?v=dQw4w9WgXcQ');

if (typeof result === 'object' && !Array.isArray(result) && 'error' in result) {

  console.log('Error:', result.error);

  console.log('Video ID:', result.videoId);

} else if (Array.isArray(result)) {

  // TypeScript knows result is TranscriptSegment[]

  const segments: TranscriptSegment[] = result;

  console.log(`Found ${segments.length} segments`);

  segments.forEach((segment, index) => {

    console.log(`Segment ${index + 1}:`, segment.text);

    console.log(`  Duration: ${segment.duration}s, Offset: ${segment.offset}s`);

  });

}

// Text format with enum

const text = await getTranscriptFromUrl('https://www.youtube.com/watch?v=dQw4w9WgXcQ', {

  format: OutputFormat.TEXT,

  formatOptions: { sentencesPerParagraph: 3 }

});

// Markdown format with timestamps, save to file

await getTranscriptFromUrl('https://www.youtube.com/watch?v=dQw4w9WgXcQ', {

  format: OutputFormat.MARKDOWN,

  formatOptions: { includeTimestamps: true },

  outputFile: './transcript.md'

});

```

## API Reference

### `getTranscriptFromUrl(url, options)`

Extracts the transcript of a YouTube video, given its URL or video ID.

**TypeScript Signature:**

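```typescript
function getTranscriptFromUrl(
  url: string,
  options?: TranscriptOptions
): Promise<
  | TranscriptSegment[]
  | string
  | { success: true; file: string; format: string }
  | { error: string; videoId: string }
>
```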

**Parameters:**

- `url` (string): YouTube video URL (e.g., `https://www.youtube.com/watch?v=VIDEO_ID`) or video ID

- `options` (TranscriptOptions, optional): Configuration options

  - `format` (OutputFormat | string): Output format - `OutputFormat.JSON`, `OutputFormat.TEXT`, `OutputFormat.MARKDOWN`, or `OutputFormat.SRT` (default: `OutputFormat.JSON`)

  - `outputFile` (string, optional): File path to save output

  - `formatOptions` (FormatOptions, optional): Format-specific options

    - `sentencesPerParagraph` (number): For `'text'` format - number of sentences per paragraph (default: 3)

    - `includeTimestamps` (boolean): For `'markdown'` format - include timestamps (default: false)
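To make the effect of `sentencesPerParagraph` concrete, here is a minimal sketch of how sentences might be grouped into paragraphs. The helper name `toParagraphs` and the sentence-splitting rule are assumptions for illustration; the module's own text formatter may split sentences differently.

```typescript
// Hypothetical helper, not part of the module's API.
// Splits text into sentences ending in ., ! or ? and groups them into paragraphs.
function toParagraphs(text: string, sentencesPerParagraph: number = 3): string {
  // A sentence is a run of non-terminators followed by terminators,
  // or a trailing run with no terminator at all.
  const matches = text.match(/[^.!?]+[.!?]+|[^.!?]+$/g) || [];
  const sentences = matches
    .map((sentence) => sentence.trim())
    .filter((sentence) => sentence.length > 0);

  // Guard against zero or negative group sizes.
  const size = Math.max(1, Math.floor(sentencesPerParagraph));

  const paragraphs: string[] = [];
  for (let i = 0; i < sentences.length; i += size) {
    paragraphs.push(sentences.slice(i, i + size).join(' '));
  }
  // Paragraphs are separated by a blank line.
  return paragraphs.join('\n\n');
}
```

With `sentencesPerParagraph` set to 2, the input `One. Two. Three. Four.` would become two paragraphs of two sentences each.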

**Returns:**

- A `Promise` that resolves to one of:

  - `TranscriptSegment[]` - Array of transcript segments (JSON format)

  - `string` - Formatted string (text, markdown, or srt formats)

  - `{ success: true; file: string; format: string }` - Success object (when `outputFile` is specified)

  - `{ error: string; videoId: string }` - Error object on failure

**Type Definitions:**

```typescript

// Available enums

enum OutputFormat {

  JSON = 'json',

  TEXT = 'text',

  MARKDOWN = 'markdown',

  SRT = 'srt'

}

// Transcript segment structure

interface TranscriptSegment {

  text: string;

  duration: number;

  offset: number;

  lang: string;

}

// Options interface

interface TranscriptOptions {

  format?: OutputFormat | string;

  outputFile?: string;

  formatOptions?: FormatOptions;

}

interface FormatOptions {

  sentencesPerParagraph?: number;

  includeTimestamps?: boolean;

}

```

**Success Response (Array of segments):**

```javascript

[

  {

    text: '♪ Never gonna give you up ♪',

    duration: 1.96,

    offset: 161.4,

    lang: 'en'

  },

  {

    text: '♪ Never gonna let you down ♪',

    duration: 2.2,

    offset: 163.44,

    lang: 'en'

  },

  // ... more segments

]

```

**Error Response (Object):**

```javascript

{

  error: "No transcript available",

  videoId: "dQw4w9WgXcQ"

}

```

## Integration Examples

### Next.js API Route

```typescript

// pages/api/transcript.ts (Pages Router; an App Router route.ts handler uses a different signature)

import { getTranscriptFromUrl } from '../../../transcript-engine/dist/index.js';

import type { NextApiRequest, NextApiResponse } from 'next';

export default async function handler(req: NextApiRequest, res: NextApiResponse) {

  const { url } = req.query;

  if (!url || typeof url !== 'string') {

    return res.status(400).json({ error: 'URL parameter is required' });

  }

  try {

    const result = await getTranscriptFromUrl(url);

    

    if (typeof result === 'object' && !Array.isArray(result) && 'error' in result) {

      return res.status(404).json(result);

    }

    

    return res.status(200).json(result);

  } catch (error) {

    const errorMessage = error instanceof Error ? error.message : 'Unknown error';

    return res.status(500).json({ error: errorMessage });

  }

}

```

### Express Server

```typescript

// server.ts

import express from 'express';

import { getTranscriptFromUrl } from './transcript-engine/dist/index.js';

const app = express();

app.get('/api/transcript', async (req, res) => {

  const { url } = req.query;

  if (!url || typeof url !== 'string') {

    return res.status(400).json({ error: 'URL parameter is required' });

  }

  try {

    const result = await getTranscriptFromUrl(url);

    

    if (typeof result === 'object' && !Array.isArray(result) && 'error' in result) {

      return res.status(404).json(result);

    }

    

    return res.status(200).json(result);

  } catch (error) {

    const errorMessage = error instanceof Error ? error.message : 'Unknown error';

    return res.status(500).json({ error: errorMessage });

  }

});

app.listen(3000, () => {

  console.log('Server running on port 3000');

});

```

### Standalone Node Module

```typescript

// my-script.ts

import { getTranscriptFromUrl } from './transcript-engine/dist/index.js';

import type { TranscriptSegment } from './transcript-engine/src/types.js';

async function main(): Promise<void> {

  const videoUrl = 'https://www.youtube.com/watch?v=dQw4w9WgXcQ';

  const result = await getTranscriptFromUrl(videoUrl);

  if (typeof result === 'object' && !Array.isArray(result) && 'error' in result) {

    console.error('Failed to get transcript:', result.error);

    return;

  }

  if (Array.isArray(result)) {

    // TypeScript knows result is TranscriptSegment[]

    const segments: TranscriptSegment[] = result;

    console.log(`Found ${segments.length} transcript segments`);

    

    // Combine all text

    const fullText = segments.map(segment => segment.text).join(' ');

    console.log('\n--- Full Transcript ---');

    console.log(fullText);

    

    // Or work with individual segments

    segments.forEach((segment, index) => {

      console.log(`\n[${segment.offset}s] ${segment.text}`);

    });

  }

}

main().catch((error) => console.error(error));

```

## Module Structure

```

youtube-transcript-extractor/

├── src/                      # TypeScript source files

│   ├── index.ts              # Main entry point (module export)

│   ├── cli.ts                # CLI interface

│   ├── constants.ts          # Constants and regex patterns

│   ├── types.ts              # TypeScript type definitions and enums

│   ├── formatters/

│   │   └── index.ts          # Format converters (text, markdown, srt, json)

│   ├── services/

│   │   ├── fetchTranscript.ts   # Fetches transcript XML from YouTube

│   │   └── parseTranscript.ts   # Parses XML into segment objects

│   └── utils/

│       ├── videoUtils.ts     # Video ID extraction and HTTP utilities

│       ├── fileWriter.ts     # File writing utilities

│       └── htmlEntities.ts   # HTML entity decoding utilities

├── dist/                     # Compiled JavaScript (generated after build)

├── tsconfig.json             # TypeScript configuration

├── package.json

└── README.md

```

## Output Formats

- **JSON** (default): Returns an array of segment objects with `text`, `duration`, `offset`, and `lang` properties

- **Text**: Plain text format with paragraphs (configurable sentences per paragraph)

- **Markdown**: Markdown formatted text with optional timestamps

- **SRT**: Standard SRT subtitle format for video players
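As an illustration of the SRT format, the sketch below converts segments into numbered SRT cues. It assumes `offset` and `duration` are in seconds, as the sample output in this README suggests. The function names `toSrtTimestamp` and `toSrt` are hypothetical and not the module's actual formatter.

```typescript
// Mirrors the segment shape documented in this README.
interface TranscriptSegment {
  text: string;
  duration: number;
  offset: number;
  lang: string;
}

// Left-pads a number with zeros to the given width.
function pad(value: number, width: number): string {
  let out = String(value);
  while (out.length < width) {
    out = '0' + out;
  }
  return out;
}

// Hypothetical helper: formats seconds as an SRT timestamp (HH:MM:SS,mmm).
function toSrtTimestamp(seconds: number): string {
  // Work in whole milliseconds to avoid floating-point drift.
  const totalMs = Math.round(seconds * 1000);
  const ms = totalMs % 1000;
  const totalSec = Math.floor(totalMs / 1000);
  const s = totalSec % 60;
  const m = Math.floor(totalSec / 60) % 60;
  const h = Math.floor(totalSec / 3600);
  return pad(h, 2) + ':' + pad(m, 2) + ':' + pad(s, 2) + ',' + pad(ms, 3);
}

// Hypothetical helper: one numbered cue per segment, separated by blank lines.
function toSrt(segments: TranscriptSegment[]): string {
  return segments
    .map((segment, index) => {
      const start = toSrtTimestamp(segment.offset);
      const end = toSrtTimestamp(segment.offset + segment.duration);
      return (index + 1) + '\n' + start + ' --> ' + end + '\n' + segment.text;
    })
    .join('\n\n');
}
```

Each cue ends at `offset + duration`, so consecutive cues may overlap slightly when YouTube's segment timings overlap.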

## How It Works

1. **URL Parsing**: Extracts video ID from YouTube URL

2. **API Key Extraction**: Fetches YouTube watch page and extracts Innertube API key

3. **Caption Discovery**: Uses YouTube Innertube API to discover available caption tracks

4. **Transcript Fetching**: Downloads transcript XML from YouTube's caption endpoint

5. **Parsing**: Parses XML into segment objects with text, duration, offset, and language

6. **HTML Entity Decoding**: Automatically decodes HTML entities (e.g., `&#39;` → `'`, `&quot;` → `"`) to ensure clean, readable text

7. **Formatting**: Converts segments to requested format (JSON, text, markdown, or SRT)
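Steps 1, 5, and 6 above can be sketched without any network access. The code below is an illustrative approximation, not the module's source. The function names are hypothetical, and the caption XML shape (`<text start="..." dur="...">`) is an assumption about YouTube's response.

```typescript
interface TranscriptSegment {
  text: string;
  duration: number;
  offset: number;
  lang: string;
}

// Step 1 (sketch): accept a bare 11-character ID or a common YouTube URL form.
function extractVideoId(input: string): string | null {
  const trimmed = input.trim();
  if (/^[A-Za-z0-9_-]{11}$/.test(trimmed)) {
    return trimmed;
  }
  const match = trimmed.match(
    /(?:[?&]v=|youtu\.be\/|\/embed\/|\/shorts\/)([A-Za-z0-9_-]{11})(?![A-Za-z0-9_-])/
  );
  return match ? match[1] : null;
}

// Step 6 (sketch): decode numeric entities and a few named ones.
// Unknown named entities are left untouched. String.fromCharCode limits
// this sketch to code points in the Basic Multilingual Plane.
function decodeHtmlEntities(text: string): string {
  const named: Record<string, string> = {
    amp: '&', lt: '<', gt: '>', quot: '"', apos: "'"
  };
  return text.replace(
    /&(#x[0-9a-fA-F]+|#\d+|[a-zA-Z]+);/g,
    (whole: string, body: string): string => {
      if (body.charAt(0) === '#') {
        const isHex = body.charAt(1).toLowerCase() === 'x';
        const code = parseInt(body.slice(isHex ? 2 : 1), isHex ? 16 : 10);
        return String.fromCharCode(code);
      }
      return Object.prototype.hasOwnProperty.call(named, body) ? named[body] : whole;
    }
  );
}

// Step 5 (sketch): pull start, dur and text out of each <text> element.
// Assumes the attributes appear in this order; a real parser should be more tolerant.
function parseTranscriptXml(xml: string, lang: string): TranscriptSegment[] {
  const pattern = /<text start="([^"]*)" dur="([^"]*)"[^>]*>([\s\S]*?)<\/text>/g;
  const segments: TranscriptSegment[] = [];
  let match: RegExpExecArray | null = pattern.exec(xml);
  while (match !== null) {
    segments.push({
      text: decodeHtmlEntities(match[3]),
      offset: parseFloat(match[1]),
      duration: parseFloat(match[2]),
      lang: lang
    });
    match = pattern.exec(xml);
  }
  return segments;
}
```

A regular expression is enough for a flat, predictable document like this one, but it would break on nested or reordered markup, which is the trade-off for avoiding an XML parser dependency.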

## Error Handling

The module handles various error scenarios:

- **Video Unavailable**: Video doesn't exist or has been removed

- **No Transcript Available**: Video doesn't have captions

- **Transcripts Disabled**: Captions are disabled for the video

- **Too Many Requests**: Rate limiting from YouTube

Each of these errors is returned as an object with `error` and `videoId` fields.
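Because the error arrives as a return value, callers need to tell it apart from the other result shapes. Below is a minimal sketch of a type guard for that check. The names `TranscriptError`, `FileSuccess`, `TranscriptResult`, `isTranscriptError`, and `describeResult` are hypothetical, modelled on the return shapes listed in the API reference rather than taken from the module's exports.

```typescript
interface TranscriptSegment {
  text: string;
  duration: number;
  offset: number;
  lang: string;
}

// Hypothetical names for the documented result shapes.
interface TranscriptError {
  error: string;
  videoId: string;
}

interface FileSuccess {
  success: true;
  file: string;
  format: string;
}

type TranscriptResult = TranscriptSegment[] | string | FileSuccess | TranscriptError;

// Type guard: true only for the { error, videoId } object.
function isTranscriptError(result: TranscriptResult): result is TranscriptError {
  return (
    typeof result === 'object' &&
    result !== null &&
    !Array.isArray(result) &&
    'error' in result
  );
}

// Example of handling every documented shape in one place.
function describeResult(result: TranscriptResult): string {
  if (typeof result === 'string') {
    return 'formatted text (' + result.length + ' characters)';
  }
  if (Array.isArray(result)) {
    return result.length + ' segments';
  }
  if (isTranscriptError(result)) {
    return 'error for ' + result.videoId + ': ' + result.error;
  }
  return 'saved ' + result.format + ' output to ' + result.file;
}
```

Wrapping the check in a guard keeps the repeated `typeof result === 'object' && !Array.isArray(result) && 'error' in result` condition from the usage examples in a single place.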

## Requirements

- Node.js >= 18.0.0 (for built-in fetch API)

- TypeScript >= 5.3.3 (for development)

- @types/node >= 20.10.6 (for TypeScript types)

## Development

To work with the source code:

```bash

# Install dependencies

npm install

# Build TypeScript to JavaScript

npm run build

# Watch mode for development

npm run dev

```

The compiled JavaScript files will be output to the `dist/` directory.

## License

MIT

## Notes

- This module uses YouTube's public caption endpoints - no authentication required

- Supports both manual and auto-generated captions

- Works with any public YouTube video that has captions available

- The module is designed to be simple, modular, and easy to integrate

- Written in TypeScript with full type safety and IntelliSense support

- All types and enums are exported from `src/types.ts` for use in TypeScript projects