https://github.com/lkhphuc/ted-transcript-crawler
A crawler to automatically download the transcripts of all TED talks.
- Host: GitHub
- URL: https://github.com/lkhphuc/ted-transcript-crawler
- Owner: lkhphuc
- License: mit
- Created: 2017-08-23T09:42:22.000Z (over 7 years ago)
- Default Branch: master
- Last Pushed: 2017-08-23T09:56:14.000Z (over 7 years ago)
- Last Synced: 2024-11-09T16:39:10.270Z (2 months ago)
- Language: Julia
- Size: 58.6 KB
- Stars: 0
- Watchers: 2
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
README
# ted-transcript-crawler
A crawler to automatically download the transcripts of all TED talks.
This crawler was built with Scrapy, following the tutorial at https://blakeboswell.github.io/2016/scrapy-tedtalk/ and modified to work with the latest version of the TED website.

### To run:
1. Install Scrapy
2. Download or clone the repo
3. Run `cd ted-transcript-crawler/ted`
4. Run `scrapy crawl ted_crawl`

### Output:
The output is stripped of all HTML elements and contains only plain text and whitespace.
It is saved in JSON Lines format (one JSON object per line).
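
To make the output format concrete, here is a minimal sketch in plain Python (standard library only) of the two ideas described above: removing HTML tags while keeping text and whitespace, and reading a JSON Lines file back one record at a time. It is not the crawler's own code. The helper names `strip_html` and `read_json_lines` are hypothetical, and the fields in each record depend on the items the spider defines, which are not documented here.

```python
import json
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collects only the text nodes of an HTML fragment."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        # Called for every run of text between tags; tags themselves are skipped.
        self.parts.append(data)


def strip_html(fragment):
    """Hypothetical helper: return the fragment with tags removed, whitespace kept."""
    parser = _TextExtractor()
    parser.feed(fragment)
    parser.close()  # flush any text still buffered by the parser
    return "".join(parser.parts)


def read_json_lines(path):
    """Hypothetical helper: yield one dict per non-empty line of a JSON Lines file."""
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            if line.strip():
                yield json.loads(line)
```

For example, `strip_html("<p>Hello <b>world</b></p>")` returns `"Hello world"`, and `for record in read_json_lines("transcripts.jl"): print(record)` prints each saved record in turn, where `transcripts.jl` is an assumed file name for the crawler's output.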