Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://github.com/prirai/rare-diseases-data-scraping
This repository aims to be a central place for all data scraping and analysis related to rare diseases.
- Host: GitHub
- URL: https://github.com/prirai/rare-diseases-data-scraping
- Owner: prirai
- Created: 2022-12-16T10:10:05.000Z (about 2 years ago)
- Default Branch: main
- Last Pushed: 2022-12-21T06:26:27.000Z (about 2 years ago)
- Last Synced: 2023-03-06T19:34:01.186Z (almost 2 years ago)
- Topics: csv, ontology, python, rare-disease
- Language: Jupyter Notebook
- Homepage:
- Size: 26 MB
- Stars: 0
- Watchers: 1
- Forks: 1
- Open Issues: 0
Metadata Files:
- Readme: README.md
README
# rare-diseases-data-scraping
Data scraped from https://rarediseases.info.nih.gov
`data1.csv` contains the detailed descriptions of the diseases; it is currently incomplete.
## Files and their purpose
----
|File|Purpose|
|--|--|
|`links.md`|Links of interest|
|`disease_links_scraper.py`|Extracts the full disease list from https://rarediseases.info.nih.gov/diseases (a sketch follows below the table)|
|`disease_links.csv`|Data extracted using the above script|
|`scrape_specific_page.ipynb`|Scrapes a given page for disease details (names, symptoms and causes)|
|`reqmul.py`|Requests with multithreading, used to save all 5910 pages offline as HTML for easy scraping later|
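A minimal sketch of what `disease_links_scraper.py` might look like, assuming the listing page exposes disease pages as plain anchors; the URL filter and output format here are assumptions for illustration, not the script's actual logic:

```python
# Hypothetical sketch of a link scraper like disease_links_scraper.py.
# The anchor pattern and CSV layout are assumptions, not taken from the repo.
import csv

import requests
from bs4 import BeautifulSoup

LIST_URL = "https://rarediseases.info.nih.gov/diseases"  # assumed entry point

def scrape_disease_links():
    resp = requests.get(LIST_URL, timeout=30)
    resp.raise_for_status()
    soup = BeautifulSoup(resp.text, "html.parser")
    # Keep only anchors that look like individual disease pages (assumed pattern).
    links = sorted(
        {a["href"] for a in soup.find_all("a", href=True) if "/diseases/" in a["href"]}
    )
    with open("disease_links.csv", "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["link"])
        for link in links:
            writer.writerow([link])

if __name__ == "__main__":
    scrape_disease_links()
```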
## Contents of `pages` folder
This directory contains the downloaded HTML files for all links that were already scraped and listed in `disease_links.csv`. The naming convention is simply the part after the last `/` in the corresponding URL.
`reqmul.py` checks for previously downloaded files and does not overwrite them. The current script manages roughly 700 pages per run before hitting a server restriction; previous versions were worse, fetching only 100 to at most 200 pages per run, and slowly at that. wget, axel, aria2 and selenium have already been tried. A minimal sketch of this multithreaded approach is shown below.

The "Disease at a glance" section, people affected, symptoms, categories, ages and causes were scraped for all diseases from the offline HTML pages (code in `page_details_extractor-offline.ipynb`, sketched after the downloader below); the results are saved in `disease_details.csv`.
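A minimal sketch of a multithreaded downloader in the spirit of `reqmul.py`. Only the skip-if-downloaded behaviour and the filename convention (last URL segment) come from the README; the pool size, timeout and function names are assumptions:

```python
# Hypothetical sketch of a multithreaded downloader like reqmul.py.
import csv
import os
from concurrent.futures import ThreadPoolExecutor

import requests

PAGES_DIR = "pages"

def download_page(url):
    # Naming convention from the README: the part after the last '/' in the URL.
    name = url.rstrip("/").rsplit("/", 1)[-1] + ".html"
    path = os.path.join(PAGES_DIR, name)
    if os.path.exists(path):  # don't overwrite previously downloaded files
        return
    resp = requests.get(url, timeout=30)
    resp.raise_for_status()
    with open(path, "w", encoding="utf-8") as f:
        f.write(resp.text)

def download_all(csv_path="disease_links.csv"):
    os.makedirs(PAGES_DIR, exist_ok=True)
    with open(csv_path, newline="") as f:
        urls = [row[0] for row in csv.reader(f)][1:]  # skip the header row
    # A modest pool size; too many workers trips the server restriction sooner.
    with ThreadPoolExecutor(max_workers=8) as pool:
        pool.map(download_page, urls)

if __name__ == "__main__":
    download_all()
```

And a sketch of the offline extraction step described for `page_details_extractor-offline.ipynb`, assuming each detail section sits under a matching `<h2>` heading; the actual page layout and the notebook's selectors may differ:

```python
# Hypothetical sketch of extracting details from the offline HTML pages.
import csv
import glob

from bs4 import BeautifulSoup

FIELDS = ["name", "symptoms", "causes"]  # assumed subset of the scraped fields

def extract_details(path):
    with open(path, encoding="utf-8") as f:
        soup = BeautifulSoup(f, "html.parser")
    name = soup.find("h1")
    details = {"name": name.get_text(strip=True) if name else ""}
    # Assumed layout: each section's text follows an <h2> matching its title.
    for field in ("symptoms", "causes"):
        heading = soup.find("h2", string=lambda s: s and field in s.lower())
        section = heading.find_next("p") if heading else None
        details[field] = section.get_text(strip=True) if section else ""
    return details

with open("disease_details.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=FIELDS)
    writer.writeheader()
    for page in glob.glob("pages/*.html"):
        writer.writerow(extract_details(page))
```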