https://github.com/rmax/mit-ocw-crawler
MIT's OCW Crawler
- Host: GitHub
- URL: https://github.com/rmax/mit-ocw-crawler
- Owner: rmax
- License: bsd-3-clause
- Created: 2010-07-19T01:35:53.000Z (almost 16 years ago)
- Default Branch: master
- Last Pushed: 2010-07-19T02:49:06.000Z (almost 16 years ago)
- Last Synced: 2025-03-24T12:11:28.777Z (about 1 year ago)
- Language: Python
- Homepage:
- Size: 89.8 KB
- Stars: 4
- Watchers: 3
- Forks: 3
- Open Issues: 0
Metadata Files:
- Readme: README.rst
- License: LICENSE
README
============================
MIT's OpenCourseWare Crawler
============================
:Author: Rolando Espinoza La fuente
About
=====
MIT's `OpenCourseWare`_ is an excellent source of knowledge.
This crawler fetches information about all courses in a department,
such as the download links for course materials.
Requirements
============
- `Scrapy`_
Usage Example
=============
First, choose a department at MIT's `OpenCourseWare`_. Then find its
``DEPARTMENT_ID``, which is part of the department's URL. In this example
we use the `Nuclear Science and Engineering`_ department, whose
``DEPARTMENT_ID`` is ``nuclear-engineering``.
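
As a minimal sketch, the ``DEPARTMENT_ID`` can be read off a department URL
programmatically. The helper name ``department_id_from_url`` is hypothetical,
and the ``/courses/<DEPARTMENT_ID>/`` layout is an assumption inferred from the
single example URL above; it is not part of the crawler.

```python
from urllib.parse import urlsplit


def department_id_from_url(url):
    """Return the path segment that follows ``courses`` in a department URL.

    Assumption: department URLs look like /courses/<DEPARTMENT_ID>/.
    """
    # Split the path into non-empty segments,
    # e.g. ["courses", "nuclear-engineering"].
    segments = [s for s in urlsplit(url).path.split("/") if s]
    if len(segments) < 2 or segments[0] != "courses":
        raise ValueError(f"not a department URL: {url!r}")
    return segments[1]


print(department_id_from_url("http://ocw.mit.edu/courses/nuclear-engineering/"))
# prints: nuclear-engineering
```
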
Finally, run ``scrapy-ctl.py`` to crawl the department and fetch information for all of its courses.

* To crawl all courses without exporting the results::

    $ ./scrapy-ctl.py crawl materials --set DEPARTMENT_ID=nuclear-engineering

* To store the results in a CSV file::

    $ ./scrapy-ctl.py crawl materials --set DEPARTMENT_ID=nuclear-engineering --set EXPORT_FORMAT=csv --set EXPORT_FILE=materials.csv

* To store only the download URLs for later use in a download manager::

    $ ./scrapy-ctl.py crawl materials --set DEPARTMENT_ID=nuclear-engineering --set EXPORT_FORMAT=csv --set EXPORT_FILE=materials.csv --set EXPORT_FIELDS=download_url
    $ wget -i materials.csv

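
``wget -i`` reads one URL per line from the given file. Whether the exported
CSV contains a header row or quoted cells is not stated here, so the following
is only a sketch of an optional clean-up step that keeps the cells that look
like HTTP(S) URLs. The function names and the ``urls.txt`` file name are
hypothetical and not part of the crawler.

```python
import csv


def extract_urls(lines):
    """Return every CSV cell that looks like an HTTP(S) URL, in input order."""
    urls = []
    for row in csv.reader(lines):
        for cell in row:
            cell = cell.strip()
            # Header cells such as "download_url" and empty cells are skipped.
            if cell.startswith(("http://", "https://")):
                urls.append(cell)
    return urls


def write_url_list(csv_path, out_path):
    """Write the URLs found in csv_path to out_path, one per line."""
    with open(csv_path, newline="") as src:
        urls = extract_urls(src)
    with open(out_path, "w") as dst:
        for url in urls:
            dst.write(url + "\n")
    return len(urls)
```

Calling ``write_url_list("materials.csv", "urls.txt")`` would then produce a
plain list that can be passed to ``wget -i urls.txt``.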
.. _Scrapy: http://www.scrapy.org/
.. _OpenCourseWare: http://ocw.mit.edu/
.. _Nuclear Science and Engineering: http://ocw.mit.edu/courses/nuclear-engineering/