Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://github.com/j2kun/math-genealogy-scraper
Code for scraping (and a mirror of) the Math Genealogy Database
asyncio database dataset math-genealogy mathematics scraper
- Host: GitHub
- URL: https://github.com/j2kun/math-genealogy-scraper
- Owner: j2kun
- Created: 2017-06-14T05:03:16.000Z (over 7 years ago)
- Default Branch: main
- Last Pushed: 2023-07-20T15:09:13.000Z (over 1 year ago)
- Last Synced: 2024-04-14T06:08:01.620Z (9 months ago)
- Topics: asyncio, database, dataset, math-genealogy, mathematics, scraper
- Language: HTML
- Size: 7.36 MB
- Stars: 14
- Watchers: 5
- Forks: 5
- Open Issues: 2
Metadata Files:
- Readme: README.md
README
# Math Genealogy
Code for scraping the math genealogy website.
A copy of the database, scraped on 2019-06-17, is contained in `data.json`.
Requirements:
- python3.5+
- aiohttp
- beautifulsoup4

## Setup
Optionally create a virtual environment:
```
virtualenv venv
source venv/bin/activate
```

Install required packages:
```
pip install -r requirements.txt
```

Run it!
```
python fetch.py
```

The output, stored in `data.json`, is a dictionary with a single key `nodes`,
mapping to a list of dictionaries (specified by `parse.py`) of the form

```
{
    "students": [
        int, int, ...    <-- refers to the id field
    ],
    "advisors": [
        int, int, ...
    ],
    "name": str,
    "school": str,
    "subject": str,
    "thesis": str,
    "country": str,
    "year": int,
    "id": int,
}
```

Fields that were not found are null, or the empty list, as appropriate.
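
As a rough sketch of how this structure might be consumed, the snippet below indexes the `nodes` list by `id` and resolves a node's advisor ids to names. The helper names (`load_index`, `index_by_id`, `advisor_names`) are hypothetical and not part of this repository, and the sample records are made-up placeholders rather than real entries from `data.json`. It assumes an advisor id may be absent from the file and that `name` may be null.

```python
import json


def index_by_id(data):
    """Map each node's id to the node itself for quick lookups."""
    return {node["id"]: node for node in data["nodes"]}


def load_index(path="data.json"):
    """Read the scraped file from disk and index it by id (hypothetical helper)."""
    with open(path, encoding="utf-8") as f:
        return index_by_id(json.load(f))


def advisor_names(node_id, by_id):
    """Return the names of a node's advisors.

    Advisor ids that are missing from the index, and advisors whose
    name is null, are skipped.
    """
    names = []
    for advisor_id in by_id[node_id]["advisors"]:
        advisor = by_id.get(advisor_id)
        if advisor is not None and advisor["name"] is not None:
            names.append(advisor["name"])
    return names


# Placeholder records for illustration only; not taken from data.json.
sample = {
    "nodes": [
        {"id": 1, "name": "Placeholder Advisor A", "advisors": [], "students": [3]},
        {"id": 2, "name": None, "advisors": [], "students": [3]},
        {"id": 3, "name": "Placeholder Student", "advisors": [1, 2, 99], "students": []},
    ]
}

by_id = index_by_id(sample)
print(advisor_names(3, by_id))
```

With the real file, `load_index("data.json")` would stand in for `index_by_id(sample)`.
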
Example record:
```
{
    "id": 186481,
    "name": "John Anthony Gerard Roberts",
    "thesis": "Order and Chaos in reversible Dynamical Systems",
    "school": "University of Melbourne",
    "country": "Australia",
    "year": 1990,
    "subject": null,
    "advisors": [
        53185,
        116308
    ],
    "students": [
        186482,
        186484,
        186486,
        186485
    ]
}
```

## Details
To be a nice person, this program rate-limits itself to have 5 concurrent
workers hitting the math genealogy website. Downloading the entire database in
this way takes about 6 hours. This repository contains a copy of the entire
dataset in `data.json` so you don't have to hammer their servers. This dataset
was fetched on 2019-06-17.

However, if you insist on being a bad person, you can increase the limit on the
worker semaphore in `fetch.py` and re-run it.

```
sem = asyncio.BoundedSemaphore(5) # 5 workers
```
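
To illustrate how a bounded semaphore caps concurrency, here is a minimal, self-contained sketch. It is not the code in `fetch.py`: the names `fetch_one` and `fetch_all` are hypothetical, and `asyncio.sleep` stands in for the real HTTP request so that nothing touches the network. It uses `asyncio.run`, which requires Python 3.7 or later.

```python
import asyncio


async def fetch_one(item_id, sem, state):
    """Simulate one request while holding a slot in the semaphore."""
    async with sem:  # waits here when all slots are taken
        state["active"] += 1
        state["peak"] = max(state["peak"], state["active"])
        await asyncio.sleep(0.01)  # stand-in for the real HTTP request
        state["active"] -= 1
        return item_id


async def fetch_all(ids, limit=5):
    """Run every simulated fetch, with at most `limit` in flight at once."""
    sem = asyncio.BoundedSemaphore(limit)
    state = {"active": 0, "peak": 0}  # tracks observed concurrency
    results = await asyncio.gather(*(fetch_one(i, sem, state) for i in ids))
    return results, state["peak"]


results, peak = asyncio.run(fetch_all(range(20), limit=5))
print("fetched", len(results), "items; peak concurrency:", peak)
```

Raising `limit` lets more simulated requests overlap, which is the same trade-off as raising the semaphore value above: faster completion at the cost of more load on the server.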