https://github.com/rigwild/epfl-scraper
Scrape everything from EPFL https://people.epfl.ch/
https://github.com/rigwild/epfl-scraper
epfl scraper
Last synced: about 1 year ago
JSON representation
Scrape everything from EPFL https://people.epfl.ch/
- Host: GitHub
- URL: https://github.com/rigwild/epfl-scraper
- Owner: rigwild
- License: mit
- Created: 2021-09-16T10:44:15.000Z (over 4 years ago)
- Default Branch: main
- Last Pushed: 2021-09-16T11:58:42.000Z (over 4 years ago)
- Last Synced: 2025-02-18T19:50:18.582Z (over 1 year ago)
- Topics: epfl, scraper
- Language: JavaScript
- Homepage:
- Size: 6.84 KB
- Stars: 0
- Watchers: 2
- Forks: 0
- Open Issues: 0
-
Metadata Files:
- Readme: README.md
- License: LICENSE
Awesome Lists containing this project
README
# EPFL Scraper
Scrape everything from https://people.epfl.ch, including email addresses.
# Install dependencies
You must use Node.js v14.8+ as this project uses [Top-level await](https://v8.dev/features/top-level-await) and [ES modules](https://v8.dev/features/modules).
```sh
yarn
```
# Run search scraper
Abuse the EPFL search API from https://search.epfl.ch/ to get all the data.
Works by querying each letter in the alphabet followed by a `*` (`a*`, `b*`), then merging the results.
```sh
node searchScraper.js
# If you already have the data cached
node searchScraper.js --use-cache
```
- Individual search queries will be cached to the `search/` directory
- Output will be generated at `output.min.json` (7.0 MB, 20623 entries)
- The full usernames list is generated at `output.usernames.txt` (20623 entries)
- The full mailing list is generated at `output.emails.txt` (18760 entries)
# Run page scraper
EPFL conveniently gives us its [full website sitemap](https://people.epfl.ch/private/common/sitemap.xml).
This will download every HTML pages to the `output/` directory. No parsing is done there, just downloading.
You should probably be careful with your IP with this. It could be mistakenly seen as a DOS attack tentative.
```sh
node htmlScraper.js
```
# License
```
DO WHAT THE FUCK YOU WANT TO PUBLIC LICENSE
Version 2, December 2004
Copyright (C) 2021 rigwild
Everyone is permitted to copy and distribute verbatim or modified
copies of this license document, and changing it is allowed as long
as the name is changed.
DO WHAT THE FUCK YOU WANT TO PUBLIC LICENSE
TERMS AND CONDITIONS FOR COPYING, DISTRIBUTION AND MODIFICATION
0. You just DO WHAT THE FUCK YOU WANT TO.
```