https://github.com/naetherm/tedextract
Small script for the extraction of video subtitles of TED videos.
dataset language-identification ted ted-talks text-extraction
Last synced: 10 months ago
- Host: GitHub
- URL: https://github.com/naetherm/tedextract
- Owner: naetherm
- License: apache-2.0
- Created: 2020-07-28T09:39:04.000Z (almost 6 years ago)
- Default Branch: master
- Last Pushed: 2020-07-28T09:39:19.000Z (almost 6 years ago)
- Last Synced: 2025-07-19T15:39:11.281Z (11 months ago)
- Topics: dataset, language-identification, ted, ted-talks, text-extraction
- Language: Python
- Homepage:
- Size: 16.6 KB
- Stars: 0
- Watchers: 0
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
README
# TEDExtract
A small test crawler for TED talks. For each talk on ted.com, all available transcripts are downloaded and saved to a separate file.
## Requirements
As listed in `requirements.txt`:
- pandas
- bs4
## Usage
```
python3 ./main.py --output=/output/dir --max_pages=76 --delay=5
```
This saves the transcripts of all talks in the `output` directory, in the format:
```$output/.csv```
The `delay` parameter introduces a pause between crawling attempts so that the server does not respond with a 429 (Too Many Requests) error. The default value is 10. If such an error does occur, you don't have to start from the beginning: the script saves a backup pickle file and checks whether a talk transcript has already been downloaded, so just restart the script.
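The resume behaviour described above can be sketched roughly as follows. This is a minimal illustration and not the code in `main.py`: the names `load_done`, `crawl`, `fetch` and `backup_path` are hypothetical, and the sketch assumes the backup file simply holds a pickled set of talk identifiers that have already been processed.

```python
import pickle
import time
from pathlib import Path


def load_done(backup_path):
    """Return the set of talk ids stored in the backup file (empty if absent)."""
    path = Path(backup_path)
    if not path.exists():
        return set()
    with path.open("rb") as fh:
        return pickle.load(fh)


def crawl(talk_ids, fetch, backup_path, delay=10, sleep=time.sleep):
    """Fetch every talk not yet recorded, pausing `delay` after each request.

    `fetch` is a hypothetical callable that downloads one talk and may raise,
    for example when the server answers with HTTP 429.
    """
    done = load_done(backup_path)
    for talk_id in talk_ids:
        if talk_id in done:
            continue  # already downloaded in an earlier run
        fetch(talk_id)
        done.add(talk_id)
        # Persist progress after every talk so an interrupted run can resume.
        with open(backup_path, "wb") as fh:
            pickle.dump(done, fh)
        sleep(delay)
    return done
```

Because progress is written after each successful download, calling `crawl` again with the same `backup_path` skips everything that was finished before the interruption.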
An additional script, `combine_csvs.py`, merges the results into a single CSV file and also creates one CSV file per language.
```
python3 ./combine_csvs.py --input_dir=/output/dir --outname=/final/output/dir/name
```
All files will then be saved to the directory `/final/output/dir` with the name `final*`. There will be one file containing all talks, `final.csv`, and one file for each language: `final.en.csv`, `final.de.csv`, etc.
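The combining step can be illustrated with a short sketch. It is not the actual `combine_csvs.py`: it uses the standard library `csv` module instead of pandas, and it assumes that every input file shares the same header and contains a language column, here given the hypothetical name `lang`. The function name `combine_csvs` and the `out_prefix` parameter are likewise illustrative.

```python
import csv
from collections import defaultdict
from pathlib import Path


def combine_csvs(input_dir, out_prefix, lang_column="lang"):
    """Merge all CSVs in input_dir into one file plus one file per language."""
    rows, fieldnames = [], None
    for path in sorted(Path(input_dir).glob("*.csv")):
        with path.open(newline="", encoding="utf-8") as fh:
            reader = csv.DictReader(fh)
            if fieldnames is None:
                fieldnames = reader.fieldnames  # header of the first file
            rows.extend(reader)
    if fieldnames is None:
        return []  # no usable input files, nothing to write

    # Group rows by the (assumed) language column.
    by_lang = defaultdict(list)
    for row in rows:
        by_lang[row[lang_column]].append(row)

    # One combined file, then one file per language code.
    outputs = {f"{out_prefix}.csv": rows}
    for lang, lang_rows in by_lang.items():
        outputs[f"{out_prefix}.{lang}.csv"] = lang_rows

    for name, content in outputs.items():
        with open(name, "w", newline="", encoding="utf-8") as fh:
            writer = csv.DictWriter(fh, fieldnames=fieldnames)
            writer.writeheader()
            writer.writerows(content)
    return sorted(outputs)
```

Called with an `out_prefix` ending in `final`, this sketch produces files following the `final.csv`, `final.en.csv`, `final.de.csv` pattern described above.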