https://github.com/naetherm/tedextract

Small script for the extraction of video subtitles of TED videos.

dataset language-identification ted ted-talks text-extraction



Host: GitHub
URL: https://github.com/naetherm/tedextract
Owner: naetherm
License: apache-2.0
Created: 2020-07-28T09:39:04.000Z (almost 6 years ago)
Default Branch: master
Last Pushed: 2020-07-28T09:39:19.000Z (almost 6 years ago)
Last Synced: 2025-07-19T15:39:11.281Z (about 1 year ago)
Topics: dataset, language-identification, ted, ted-talks, text-extraction
Language: Python
Homepage:
Size: 16.6 KB
Stars: 0
Watchers: 0
Forks: 0
Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE

# TEDExtract

A small test crawler for TED talks. For each talk on ted.com, all available transcripts are downloaded and saved to a separate file.

## Requirements

As listed in `requirements.txt`:

- pandas
- bs4

## Usage

```
python3 ./main.py --output=/output/dir --max_pages=76 --delay=5
```

This command saves the transcripts of all talks to the `output` directory, using the format:
```$output/.csv```

The `delay` parameter adds a pause between crawling attempts so that the server does not respond with a 429 (Too Many Requests) error. The default value is 10. If such an error occurs, you don't have to start from the beginning: the script saves a backup pickle file and checks whether a talk's transcript was already downloaded, so just restart the script.
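
The delay-and-resume behaviour described above can be sketched roughly as follows. This is a minimal illustration, not the code in `main.py`: the function `crawl_with_resume`, the injected `fetch` callable, the backup file name `backup.pkl`, and the per-talk file naming are all assumptions made for the example.

```python
import pickle
import time
from pathlib import Path


def crawl_with_resume(talk_ids, fetch, output_dir, delay=10.0, sleep=time.sleep):
    """Fetch each talk once, pausing between requests and checkpointing progress.

    `fetch` is a hypothetical callable that returns the transcript text for a
    talk ID and raises an exception on failure (for example on a 429 response).
    """
    output_dir = Path(output_dir)
    output_dir.mkdir(parents=True, exist_ok=True)
    backup_path = output_dir / "backup.pkl"  # hypothetical backup file name

    # Load the set of talk IDs finished in an earlier run, if a backup exists.
    done = set()
    if backup_path.exists():
        with backup_path.open("rb") as fh:
            done = pickle.load(fh)

    for talk_id in talk_ids:
        if talk_id in done:
            continue  # transcript already downloaded in a previous run
        text = fetch(talk_id)  # an exception here aborts the run; progress is kept
        # Hypothetical file naming, chosen only for this sketch.
        (output_dir / f"{talk_id}.csv").write_text(text, encoding="utf-8")
        done.add(talk_id)
        # Persist progress after every talk so a restart can skip it.
        with backup_path.open("wb") as fh:
            pickle.dump(done, fh)
        sleep(delay)  # pause before the next request
    return done
```

Because progress is written after every successful download, calling the function again with the same arguments only fetches the talks that are still missing.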

An additional script, `combine_csvs.py`, merges the downloaded files into a single CSV file and also writes one CSV file per language.

```
python3 ./combine_csvs.py --input_dir=/output/dir --outname=/final/output/dir/name
```

All files are then saved to the directory `/final/output/dir` with the name `final*`: one file containing all talks, `final.csv`, and one file per language, such as `final.en.csv`, `final.de.csv`, etc.
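
The merge step can be sketched as below. This is only an illustration of the idea and not the contents of `combine_csvs.py`: it uses the standard-library `csv` module instead of pandas to stay dependency-free, and it assumes that every input file shares the same header and has a column named `language`, which is a hypothetical column name.

```python
import csv
from collections import defaultdict
from pathlib import Path


def combine_csvs(input_dir, outname, lang_column="language"):
    """Merge all CSVs in input_dir into outname.csv plus one file per language.

    Assumes every input file has the same header and a `lang_column` column
    (both are assumptions of this sketch). Returns the written paths, sorted.
    """
    rows, fieldnames = [], None
    for path in sorted(Path(input_dir).glob("*.csv")):
        with path.open(newline="", encoding="utf-8") as fh:
            reader = csv.DictReader(fh)
            if fieldnames is None:
                fieldnames = reader.fieldnames  # header of the first file
            rows.extend(reader)
    if not rows:
        return []

    # Group the rows by their language code.
    groups = defaultdict(list)
    for row in rows:
        groups[row[lang_column]].append(row)

    # One target for everything, plus one target per language.
    targets = {f"{outname}.csv": rows}
    for lang, lang_rows in groups.items():
        targets[f"{outname}.{lang}.csv"] = lang_rows

    for target, target_rows in targets.items():
        with open(target, "w", newline="", encoding="utf-8") as fh:
            writer = csv.DictWriter(fh, fieldnames=fieldnames)
            writer.writeheader()
            writer.writerows(target_rows)
    return sorted(targets)
```

In this sketch `outname` is treated as a path prefix, so `combine_csvs("/output/dir", "/final/output/dir/final")` would write `final.csv` and the per-language files next to it.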
