https://github.com/dmmeteo/youtube-transcript-scraper

A CLI tool for downloading transcripts for all YouTube videos at a specified channel and creating training datasets for language models.
datasets llm ml scraper transcript youtube

Host: GitHub
URL: https://github.com/dmmeteo/youtube-transcript-scraper
Owner: dmmeteo
License: mit
Created: 2025-03-06T20:24:49.000Z (12 months ago)
Default Branch: main
Last Pushed: 2025-03-06T20:39:47.000Z (12 months ago)
Last Synced: 2025-03-06T21:27:51.594Z (12 months ago)
Topics: datasets, llm, ml, scraper, transcript, youtube
Language: Python
Homepage:
Size: 19.5 KB
Stars: 0
Watchers: 1
Forks: 0
Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE

# YouTube Transcript Scraper

A CLI tool for downloading transcripts for all YouTube videos at a specified channel and creating training datasets for language models.

## Instructions for obtaining an API key for YouTube Data API v3

1. Go to the [Google Cloud Console](https://console.cloud.google.com/).
2. Create a new project or select an existing one.
3. In the left menu, navigate to APIs & Services → Library and find YouTube Data API v3.
4. Click Enable to activate the API for your project.
5. Navigate to APIs & Services → Credentials.
6. Click Create Credentials and select API key.
7. Copy the API key you received and use it when running the script.

## Setup

We use the `uv` package manager for dependency management. To set up the project:

1. Create a virtual environment:
```bash
uv venv --python 3.13
```

2. Install dependencies:
```bash
uv sync
```

3. Activate the virtual environment:
```bash
# On macOS/Linux
source .venv/bin/activate

# On Windows
.venv\Scripts\activate
```

## Running the script

Pass the channel URL, the transcript language code, and your API key as positional arguments:

```bash
python main.py "https://www.youtube.com/channel/UCXXXXXXX" en "$YOUR_API_KEY"
```
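
The channel URL in the command above embeds the channel ID as its last path segment. As an illustration only, a script could extract that ID as sketched below. `extract_channel_id` is a hypothetical helper, not necessarily how `main.py` parses its argument, and it assumes the `/channel/<id>` URL form shown above.

```python
from urllib.parse import urlparse


def extract_channel_id(channel_url: str) -> str:
    """Return the ID from a https://www.youtube.com/channel/<id> URL.

    Hypothetical helper: only the /channel/<id> form is handled.
    """
    # Split the URL path into non-empty segments, e.g. ["channel", "UCXXXXXXX"].
    parts = [p for p in urlparse(channel_url).path.split("/") if p]
    if len(parts) >= 2 and parts[0] == "channel":
        return parts[1]
    raise ValueError(f"Expected a /channel/<id> URL, got: {channel_url!r}")


if __name__ == "__main__":
    print(extract_channel_id("https://www.youtube.com/channel/UCXXXXXXX"))
```

Other URL forms (for example handle-based URLs) are rejected with a `ValueError` in this sketch rather than guessed at.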

## Environment Variables
You can also provide your API key through an environment variable or a `.env` file:

1. Create a `.env` file in the project directory.
2. Add your YouTube API key: `YOUTUBE_API_KEY=your_key_here`

With the key set this way, you can omit it from the command:
```bash
python main.py "https://www.youtube.com/channel/UCXXXXXXX" en
```
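
For readers curious how such a lookup might work, here is a minimal standard-library sketch. The function names `read_dotenv` and `resolve_api_key`, and the precedence order (command-line argument, then environment variable, then `.env` file), are assumptions for illustration. The actual script may use a library such as `python-dotenv` or a different order.

```python
import os
from pathlib import Path
from typing import Optional


def read_dotenv(path: str = ".env") -> dict[str, str]:
    """Parse simple KEY=value lines, skipping blanks and # comments."""
    values: dict[str, str] = {}
    env_file = Path(path)
    if not env_file.is_file():
        return values
    for line in env_file.read_text(encoding="utf-8").splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Strip surrounding whitespace and optional quotes from the value.
        values[key.strip()] = value.strip().strip("'\"")
    return values


def resolve_api_key(cli_key: Optional[str] = None, dotenv_path: str = ".env") -> str:
    """Assumed precedence: CLI argument, then environment, then .env file."""
    key = (
        cli_key
        or os.environ.get("YOUTUBE_API_KEY")
        or read_dotenv(dotenv_path).get("YOUTUBE_API_KEY")
    )
    if not key:
        raise SystemExit("No API key: pass it as an argument or set YOUTUBE_API_KEY")
    return key
```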

## Output Data Structure

Each record in the generated JSON and JSONL files has the following structure:

```json
{
  "created": "2023-05-15T12:34:56Z",
  "name": "Video Title",
  "value": "Transcript text..."
}
```

Where:
- `created`: Publication date of the video
- `name`: Title of the video
- `value`: Transcript text (or "no transcript available" if none exists)
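
The record layout above can be sketched in Python as follows. `make_record` and `to_jsonl_line` are hypothetical names used only for illustration. The field names and the fallback string come from the description above, while the rest is an assumption about how the script might assemble a record.

```python
import json
from typing import Optional

# Fallback text documented above for videos without a transcript.
NO_TRANSCRIPT = "no transcript available"


def make_record(published_at: str, title: str, transcript: Optional[str]) -> dict:
    """Build one dataset record with the three documented fields."""
    return {
        "created": published_at,
        "name": title,
        "value": transcript or NO_TRANSCRIPT,
    }


def to_jsonl_line(record: dict) -> str:
    """Serialize a record as a single JSONL line (without the trailing newline)."""
    # json.dumps escapes newlines inside strings, so the result stays on one line.
    return json.dumps(record, ensure_ascii=False)
```

In the JSONL files each record occupies one line, so a transcript containing line breaks is stored with escaped `\n` sequences.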

## Dataset Files

The script creates the following files:

- `dataset/full_dataset.jsonl`: Complete dataset in JSONL format (one JSON object per line)
- `dataset/train.jsonl`: Training set (approximately 80% of data)
- `dataset/val.jsonl`: Validation set (approximately 10% of data)
- `dataset/test.jsonl`: Test set (approximately 10% of data)
- `dataset/transcripts.json`: Complete dataset as a standard JSON file
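
One way to produce an approximately 80/10/10 split is shown below. This is a sketch under assumptions: the README does not say how the script assigns records to each set, so the seeded shuffle, the `split_records` name, and its parameters are all illustrative.

```python
import random


def split_records(records, seed=42, train_frac=0.8, val_frac=0.1):
    """Shuffle records reproducibly and cut them into train/val/test lists."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    n_train = int(len(shuffled) * train_frac)
    n_val = int(len(shuffled) * val_frac)
    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    # Whatever remains after the train and val cuts goes to the test set.
    test = shuffled[n_train + n_val:]
    return train, val, test
```

Because the cut points are rounded down, the proportions are only approximate for small datasets. This version holds all records in memory; a streaming alternative would assign each record to a set at random as it is written.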

## Memory Efficiency

The script processes videos one at a time and writes each transcript directly to the JSONL file instead of keeping all transcripts in memory. This keeps memory use low even for large channels with many videos.
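
The streaming approach described above can be sketched as follows. `append_jsonl` and `iter_jsonl` are hypothetical helpers, not the script's actual functions. The point is that records are consumed from an iterable and written one by one, so only a single record is held at a time.

```python
import json
from typing import Iterable, Iterator


def append_jsonl(path: str, records: Iterable[dict]) -> int:
    """Write records to a JSONL file one at a time and return how many were written."""
    count = 0
    with open(path, "a", encoding="utf-8") as out:
        for record in records:
            # Each record is serialized and written before the next one is produced.
            out.write(json.dumps(record, ensure_ascii=False) + "\n")
            count += 1
    return count


def iter_jsonl(path: str) -> Iterator[dict]:
    """Read a JSONL file lazily, yielding one record per non-empty line."""
    with open(path, encoding="utf-8") as src:
        for line in src:
            if line.strip():
                yield json.loads(line)
```

Passing a generator as `records` (for example, one that fetches a transcript per video) keeps the whole pipeline lazy from download to disk.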
