Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://github.com/zaneh/ocw-crawler
Crawl MIT OpenCourseWare courses with Kimurai. Not affiliated.
- Host: GitHub
- URL: https://github.com/zaneh/ocw-crawler
- Owner: ZaneH
- License: mit
- Created: 2023-07-24T04:48:37.000Z (over 1 year ago)
- Default Branch: main
- Last Pushed: 2023-11-28T04:54:17.000Z (11 months ago)
- Last Synced: 2024-10-03T08:30:08.158Z (about 1 month ago)
- Topics: crawler, kimurai, mit, ocw, opencourseware, spider
- Language: Ruby
- Homepage:
- Size: 65.4 KB
- Stars: 0
- Watchers: 2
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
# MIT OpenCourseWare Crawler
## Crawl Output
**Last updated**: November 27, 2023
- OCW Video Lectures: [results.csv](https://github.com/ZaneH/ocw-crawler/blob/main/results.csv)
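To inspect the crawl output locally, a minimal sketch along these lines may help. It assumes only that `results.csv` has a header row; it does not assume any particular column names, and `summarize_results` is a hypothetical helper, not part of the crawler.

```ruby
require "csv"

# Summarize crawl output without assuming specific column names.
# `summarize_results` is a hypothetical helper for illustration only.
def summarize_results(csv_text)
  table = CSV.parse(csv_text, headers: true)
  { headers: table.headers, rows: table.size }
end

# Only runs when this file is executed directly and results.csv is present.
if __FILE__ == $PROGRAM_NAME && File.exist?("results.csv")
  summary = summarize_results(File.read("results.csv"))
  puts "Columns: #{summary[:headers].join(', ')}"
  puts "Rows: #{summary[:rows]}"
end
```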
## Description
A simple crawler that saves the courses available on [MIT OpenCourseWare](https://ocw.mit.edu/). It exports the courses that have video lectures to a CSV file.
To crawl courses other than those with video lectures, change `@start_urls` in `crawler.rb`.
## Docker Run (Recommended)
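This is the simplest way to run the crawler. The commands below build the image, run the crawler, and write the results to `results.csv` on the host through a bind mount. If `results.csv` does not exist on the host, create an empty file first; otherwise Docker creates a directory at that path.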
```bash
$ docker build -t ocw-crawl:1.0 .
$ docker run --volume "$(pwd)/results.csv:/app/results.csv" \
    --rm \
    --name ocw-crawl \
    ocw-crawl:1.0
```

---
## Manually Run
To run the crawler without Docker, you'll need to install an older version of Ruby that's compatible with `kimurai`. You'll also need `geckodriver` and Firefox. Read more about setting up `kimurai` [here](https://github.com/vifreefly/kimuraframework#installation) if you run into trouble.
### Setup
Install Ruby 2.5.0 and run `bundle install`.
```bash
$ asdf install ruby 2.5.0
$ asdf global ruby 2.5.0
$ gem install bundler
$ bundle install # install dependencies
```

### Run
```bash
$ ruby crawler.rb
...
```

## Possible Improvements
- Use [OCW Sitemaps](https://ocw.mit.edu/sitemap.xml) to crawl all courses
- Get more information about each course from the sitemap
- Course materials often follow these patterns:
  - Syllabus: `/pages/syllabus/`
  - Course download: `/download/`
  - Resources: `/resources/*/`
    - PDFs, slides, lecture notes, etc.
  - Course pages: `/pages/*/`
    - Readings: `/pages/readings/`
- Turn the data into an app or API
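
The sitemap idea and the URL patterns above could be combined roughly as follows. This is only a sketch under stated assumptions: it expects sitemap XML that has already been downloaded and contains `<loc>` entries, it extracts them with a simple regular expression rather than a full XML parser, and the helper names, labels, and the `example-course` URLs are hypothetical. The real sitemap may be an index of further sitemaps that would need to be followed first.

```ruby
require "uri"

# Path patterns taken from the list above; the labels are illustrative.
# More specific patterns come first so that `/pages/syllabus/` is not
# reported as a generic course page.
MATERIAL_PATTERNS = [
  [:syllabus, %r{/pages/syllabus/?\z}],
  [:readings, %r{/pages/readings/?\z}],
  [:download, %r{/download/?\z}],
  [:resource, %r{/resources/[^/]+/?\z}],
  [:page,     %r{/pages/[^/]+/?\z}]
].freeze

# Pull every <loc> value out of a sitemap XML string (hypothetical helper).
def sitemap_locations(xml)
  xml.scan(%r{<loc>\s*(.*?)\s*</loc>}m).flatten
end

# Label a URL by the first matching pattern, or :other if none match.
def classify_material(url)
  path = URI.parse(url).path.to_s
  match = MATERIAL_PATTERNS.find { |_label, pattern| path.match?(pattern) }
  match ? match.first : :other
end

# Usage with a made-up sitemap fragment.
sample = <<~XML
  <urlset>
    <url><loc>https://ocw.mit.edu/courses/example-course/pages/syllabus/</loc></url>
    <url><loc>https://ocw.mit.edu/courses/example-course/resources/lecture-1/</loc></url>
  </urlset>
XML

sitemap_locations(sample).each do |url|
  puts "#{classify_material(url)}: #{url}"
end
```

Grouping the classified URLs by course would be a natural next step before exposing the data through an app or API.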