https://github.com/samzhang02/mcgill-course-scraper
Scraper written in Python that scrapes all courses from McGill University and their relevant information
Last synced: about 1 year ago
- Host: GitHub
- URL: https://github.com/samzhang02/mcgill-course-scraper
- Owner: SamZhang02
- License: MIT
- Created: 2023-01-25T21:40:23.000Z (over 3 years ago)
- Default Branch: main
- Last Pushed: 2023-02-08T11:39:37.000Z (over 3 years ago)
- Last Synced: 2025-03-26T23:37:08.952Z (about 1 year ago)
- Language: Python
- Size: 27.3 KB
- Stars: 1
- Watchers: 1
- Forks: 2
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
README
# McGill-Course-Scraper
Scraper written in Python that scrapes all courses from McGill University and their relevant information.
Only valid for the 2022-2023 school year for now.
---
"This project is **not** affiliated, endorsed, or vetted by McGill University. It is an open-source tool that uses publicly available information from the university and is intended for research and educational purposes only. Please refer to McGill University's terms of use for details on your rights to use the information downloaded. Remember - the information provided is intended for personal use only."
---
## News
Version 0.2:
- Added multithreading to speed up scraping of individual course pages.
## Requirements
```
pip install -r requirements.txt
```
## Usage
macOS
```
python3 src/main.py --num-threads=<N> [default: 10]
```
Windows
```
py src/main.py --num-threads=<N> [default: 10]
```
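As a rough illustration of how a flag like `--num-threads` with a default of 10 could be handled, here is a minimal sketch using the standard library's `argparse`. The function name `parse_args` and the lower-bound check are assumptions made for this example; the actual `main.py` may parse its arguments differently.

```python
import argparse


def parse_args(argv=None):
    """Parse the command line. Only --num-threads and its default of 10
    come from this README; everything else here is illustrative."""
    parser = argparse.ArgumentParser(description="Course scraper (sketch)")
    parser.add_argument(
        "--num-threads",
        type=int,
        default=10,  # default documented in the usage section
        help="number of worker threads used to request course pages",
    )
    args = parser.parse_args(argv)
    # Assumed sanity check: at least one worker thread is needed.
    if args.num_threads < 1:
        parser.error("--num-threads must be at least 1")
    return args
```

Calling `parse_args(["--num-threads=4"])` yields an object whose `num_threads` attribute is `4`; calling it with an empty list falls back to the default.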
The program starts by scraping the URLs of all courses from McGill University's official website and storing them in a `.txt` file in `/output`. This should take a few minutes.
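The URL-collection step can be pictured roughly as follows. This is only a sketch under stated assumptions: the class and function names, the `/courses/` marker used to recognise course links, and the one-URL-per-line file layout are all hypothetical and not taken from the project's code. It uses only the standard library and works on an HTML string, so no request is made here.

```python
from html.parser import HTMLParser
from pathlib import Path
from urllib.parse import urljoin


class CourseLinkParser(HTMLParser):
    """Collect absolute URLs of <a> tags whose href contains a marker."""

    def __init__(self, base_url, marker):
        super().__init__()
        self.base_url = base_url
        self.marker = marker  # assumed substring identifying course links
        self.urls = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        if href and self.marker in href:
            self.urls.append(urljoin(self.base_url, href))


def extract_course_urls(html, base_url, marker="/courses/"):
    """Return course URLs found in one listing page, without duplicates."""
    parser = CourseLinkParser(base_url, marker)
    parser.feed(html)
    # dict.fromkeys removes duplicates while keeping first-seen order.
    return list(dict.fromkeys(parser.urls))


def save_urls(urls, path):
    """Write one URL per line, creating the output directory if needed."""
    path = Path(path)
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text("\n".join(urls) + "\n", encoding="utf-8")
```

Keeping the URL list in a plain text file means the slower per-course pass can be rerun without repeating the listing crawl.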
The program then requests each URL in the file and parses the individual pages, using 10 threads by default. This should take under 5 minutes, but feel free to lower the number of threads in `main.py` to slow the requests down out of politeness. Progress is printed to the terminal as the program runs.
McGill's website does appear to rate limit, so don't set the number of threads too high.
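To make the threading trade-off concrete, the sketch below fans a list of URLs out over a bounded thread pool with `concurrent.futures.ThreadPoolExecutor`. It is not the project's implementation: `scrape_all` and the injected `fetch_and_parse` callable are hypothetical names, and the per-URL work is passed in as a function so the example stays independent of any HTTP library.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def scrape_all(urls, fetch_and_parse, num_threads=10):
    """Run fetch_and_parse(url) for every URL on at most num_threads threads.

    Returns (results, failures): two dicts keyed by URL, so a single bad
    page does not abort the whole run.
    """
    results, failures = {}, {}
    with ThreadPoolExecutor(max_workers=num_threads) as pool:
        # Map each future back to its URL for progress and error reporting.
        futures = {pool.submit(fetch_and_parse, url): url for url in urls}
        for done, future in enumerate(as_completed(futures), start=1):
            url = futures[future]
            try:
                results[url] = future.result()
            except Exception as exc:  # keep going, record the failure
                failures[url] = exc
            print(f"[{done}/{len(futures)}] {url}")
    return results, failures
```

`max_workers` caps how many requests are in flight at once, which is why a lower thread count is gentler on a server that rate-limits.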
The output will be stored in `/output/courses.json`. See `/docs/structure.json` for a miniature example of what the file will look like.
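Once a run has finished, the output can be loaded with the standard `json` module. The sketch below deliberately assumes nothing about the fields inside the file, since `/docs/structure.json` is the authoritative description; `load_courses` and `summarize` are hypothetical helper names, and the default path assumes the script is run from the repository root.

```python
import json
from pathlib import Path


def load_courses(path="output/courses.json"):
    """Load the scraper's JSON output from disk."""
    with Path(path).open(encoding="utf-8") as f:
        return json.load(f)


def summarize(data):
    """Describe the top-level shape without assuming any field names."""
    kind = type(data).__name__
    size = len(data) if isinstance(data, (list, dict)) else 1
    return f"{kind} with {size} top-level entries"
```

Running `print(summarize(load_courses()))` is a quick check that the file parsed and is not empty.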
## Contributing
Fork the repo and open a PR against `main` with an appropriate title and description.