https://github.com/zaneh/ocw-crawler

Crawl MIT OpenCourseWare courses with Kimurai. Not affiliated.

# MIT OpenCourseWare Crawler

## Crawl Output

**Last updated**: November 27, 2023

- OCW Video Lectures: [results.csv](https://github.com/ZaneH/ocw-crawler/blob/main/results.csv)
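
To inspect the crawl output programmatically, a minimal sketch using Ruby's standard `csv` library might look like the following. It assumes `results.csv` starts with a header row; the column names in the sample data (`title`, `url`) are hypothetical placeholders and may not match the file's actual columns.

```ruby
require "csv"

# Summarize CSV text: return its header names and the number of data rows.
# Assumes the first line is a header row (an assumption about results.csv).
def summarize_csv(text)
  table = CSV.parse(text, headers: true)
  { headers: table.headers, rows: table.size }
end

# Hypothetical sample standing in for results.csv; the real columns may differ.
SAMPLE = <<~CSV
  title,url
  Example Course A,https://ocw.mit.edu/courses/example-course-a/
  Example Course B,https://ocw.mit.edu/courses/example-course-b/
CSV

summary = summarize_csv(SAMPLE)
puts "Columns: #{summary[:headers].join(', ')}"
puts "Courses: #{summary[:rows]}"
```

To run it against the real file, pass `File.read("results.csv")` instead of `SAMPLE`.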

## Description

A simple crawler that saves the list of courses available on [MIT OpenCourseWare](https://ocw.mit.edu/). It exports the courses that have video lectures as a CSV file.

To crawl courses other than those with video lectures, change `@start_urls` in `crawler.rb`.

## Docker Run (Recommended)

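This is the simplest way to run the crawler. The container runs the crawl and writes the results to `results.csv` in the current directory, which is bind-mounted into the container. Make sure `results.csv` exists before running; otherwise Docker creates a directory at that path.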

```bash
$ docker build -t ocw-crawl:1.0 .
$ docker run --volume "$(pwd)/results.csv:/app/results.csv" \
    --rm \
    --name ocw-crawl \
    ocw-crawl:1.0
```

---

## Manually Run

To run the crawler without Docker, you need an older version of Ruby that is compatible with `kimurai`, along with `geckodriver` and Firefox. If you run into trouble, see the [`kimurai` installation instructions](https://github.com/vifreefly/kimuraframework#installation).

### Setup

Install Ruby 2.5.0 and run `bundle install`.

```bash
$ asdf install ruby 2.5.0
$ asdf global ruby 2.5.0
$ gem install bundler
$ bundle install # install dependencies
```

### Run

```bash
$ ruby crawler.rb
...
```

## Possible Improvements

- Use [OCW Sitemaps](https://ocw.mit.edu/sitemap.xml) to crawl all courses
- Get more information about each course from the sitemap
- Course materials often follow these patterns:
  - Syllabus: `/pages/syllabus/`
  - Course download: `/download/`
  - Resources: `/resources/*/`
    - PDFs, slides, lecture notes, etc.
  - Course pages: `/pages/*/`
  - Readings: `/pages/readings/`
- Turn the data into an app or API
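
As a starting point for the sitemap idea, the sketch below collects every `<loc>` entry from a sitemap document and labels each URL using the path patterns listed above. It is not part of `crawler.rb`: the helper names (`sitemap_locations`, `classify`), the category symbols, and the sample XML are all hypothetical, and it makes no assumption about whether the real OCW sitemap lists pages directly or points to per-course sitemaps.

```ruby
require "rexml/document"
require "uri"

# Path patterns from the list above, checked in order so the specific
# entries (syllabus, readings) win over the generic /pages/*/ pattern.
MATERIAL_PATTERNS = [
  [:syllabus, %r{/pages/syllabus/?\z}],
  [:readings, %r{/pages/readings/?\z}],
  [:download, %r{/download/?\z}],
  [:resource, %r{/resources/[^/]+/?\z}],
  [:page,     %r{/pages/[^/]+/?\z}]
].freeze

# Return the text of every <loc> element in a sitemap or sitemap index.
def sitemap_locations(xml)
  locations = []
  REXML::Document.new(xml).root.each_recursive do |element|
    locations << element.text.to_s.strip if element.name == "loc"
  end
  locations
end

# Label a URL by the first pattern its path matches, or :other if none do.
def classify(url)
  path = URI.parse(url).path
  match = MATERIAL_PATTERNS.find { |_kind, pattern| path.match?(pattern) }
  match ? match.first : :other
end

# Hypothetical sitemap content for illustration only.
SAMPLE_SITEMAP = <<~XML
  <?xml version="1.0" encoding="UTF-8"?>
  <urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
    <url><loc>https://ocw.mit.edu/courses/example-course/</loc></url>
    <url><loc>https://ocw.mit.edu/courses/example-course/pages/syllabus/</loc></url>
    <url><loc>https://ocw.mit.edu/courses/example-course/resources/lecture-1/</loc></url>
    <url><loc>https://ocw.mit.edu/courses/example-course/download/</loc></url>
  </urlset>
XML

sitemap_locations(SAMPLE_SITEMAP).each do |url|
  puts "#{classify(url)}: #{url}"
end
```

The ordered pattern list keeps the matching rules in one place, so adding a new material type means adding one entry rather than another branch.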