Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://github.com/zhijing-jin/ted_talk_downloader
Easy-to-use codes for downloading transcripts for TED talks.
- Host: GitHub
- URL: https://github.com/zhijing-jin/ted_talk_downloader
- Owner: zhijing-jin
- License: bsd-3-clause
- Created: 2020-01-11T09:29:31.000Z (about 5 years ago)
- Default Branch: master
- Last Pushed: 2020-01-11T09:56:50.000Z (about 5 years ago)
- Last Synced: 2024-11-16T04:00:13.138Z (2 months ago)
- Language: Python
- Size: 6.84 KB
- Stars: 5
- Watchers: 2
- Forks: 1
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE.txt
# ted_talk_downloader (Python Package)
This is an easy-to-use Python package to download transcripts of TED talks.
## Installation
**Method 1:** Clone this repo and follow the instructions in the [How To Run](#how-to-run) section.

**Method 2:** Install the package with pip:
```bash
pip install --upgrade git+https://github.com/zhijing-jin/ted_talk_downloader.git
```

**Method 3:** For research use (strictly non-commercial), download [this dataset](http://bit.ly/ted-data-zhijing) of all the transcripts of TED talks.
The dataset contains all TED talks up to Jan 9, 2020, in three languages: English, German, and Romanian. Here are some statistics of this dataset:
- English: 3,799 transcripts, 450,326 sentences
- German: 2,625 talks, 338,117 sentences
- Romanian: 2,856 talks, 353,103 sentences
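
To put these figures in perspective, the short sketch below computes the average number of sentences per transcript for each language. The numbers are copied from the list above. The name `average_sentences` is only illustrative, and the sketch assumes that "talks" and "transcripts" in the list both mean one transcript per talk in that language.

```python
# Figures copied from the statistics above: (transcripts, sentences).
# Assumption: "talks" and "transcripts" both count one transcript per talk.
DATASET_STATS = {
    "English": (3799, 450326),
    "German": (2625, 338117),
    "Romanian": (2856, 353103),
}


def average_sentences(stats):
    """Return the mean number of sentences per transcript for each language."""
    return {
        lang: sentences / transcripts
        for lang, (transcripts, sentences) in stats.items()
    }


averages = average_sentences(DATASET_STATS)
for lang, avg in averages.items():
    print(f"{lang}: {avg:.1f} sentences per transcript")
```

Under that assumption, each language averages roughly 120 to 130 sentences per transcript.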
## How to Run
#### Function 1: Download the transcripts for specific talks.
```python
>>> from ted_talk_downloader import TEDTalkDownloader
>>> downloader = TEDTalkDownloader('en')
>>> links = [
...     "https://www.ted.com/talks/edward_tenner_the_paradox_of_efficiency?language=en",
...     "https://www.ted.com/talks/alex_gendler_why_doesn_t_the_leaning_tower_of_pisa_fall_over?language=en",
... ]
>>> downloader.get_all_transcripts(links=links)
Retrieving Webpages: 100%|███████| 2/2 [00:21<00:00, 11.06s/it]
Parsing HTML: 100%|██████████████| 2/2 [00:02<00:00, 1.97s/it]
[Info] Saved 2 links, 4 transcripts, and 190 sentences to "ted_transcripts.json"
# (1) for transcripts, check out the file `ted_transcripts.json`
# (2) for raw webpages, check out the file `ted_raw.json`
```
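
Once the call above has written `ted_transcripts.json`, a quick way to check the result is to load the file and look at its top-level structure. This README does not document the JSON schema, so the sketch below makes no assumption about the fields inside. It reports only the top-level type, the number of entries, and a small sample. The helper name `describe_json` is hypothetical and not part of this package.

```python
import json
from pathlib import Path


def describe_json(path, sample_size=3):
    """Summarize a JSON file without assuming anything about its schema."""
    with Path(path).open(encoding="utf-8") as f:
        data = json.load(f)

    summary = {"type": type(data).__name__}
    if isinstance(data, dict):
        # For an object, show how many keys it has and the first few of them.
        summary["size"] = len(data)
        summary["sample"] = list(data)[:sample_size]
    elif isinstance(data, list):
        # For an array, show its length and the types of the first few items.
        summary["size"] = len(data)
        summary["sample"] = [type(item).__name__ for item in data[:sample_size]]
    return summary


# Inspect whichever output files exist in the current directory.
for name in ("ted_transcripts.json", "ted_raw.json"):
    if Path(name).exists():
        print(name, describe_json(name))
```

Starting from the sample keys or item types, you can then drill into individual entries to find where the transcript text is stored.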
#### Function 2: (For Research Use Only) Download transcripts for all talks.
```python
>>> from ted_talk_downloader import TEDTalkDownloader
>>> downloader = TEDTalkDownloader('en')
>>> downloader.get_all_transcripts()
# (1) for transcripts, check out the file `ted_transcripts.json`
# (2) for raw webpages, check out the file `ted_raw.json`
```
## Contact
If you have any questions, feel free to check out the previous [Q&A](https://github.com/zhijing-jin/ted_talk_downloader/issues?utf8=%E2%9C%93&q=is%3Aissue), or raise a new GitHub issue.

In case of really urgent needs, contact the author [Zhijing Jin (Miss)](mailto:[email protected]).