https://github.com/ejw-data/web-scraping-proteins
A web scrape of PubMed publication data, displayed on a single web page with multiple Plotly charts. The basic structure of the website is driven by an Excel spreadsheet, so people who don't know how to code can update it.
- Host: GitHub
- URL: https://github.com/ejw-data/web-scraping-proteins
- Owner: ejw-data
- Created: 2022-07-22T23:35:08.000Z (over 2 years ago)
- Default Branch: main
- Last Pushed: 2022-07-26T06:39:24.000Z (over 2 years ago)
- Last Synced: 2024-11-21T13:54:20.305Z (2 months ago)
- Topics: beautifulsoup, excel, html-css-javascript, pandas, plotly, python, splinter
- Language: Jupyter Notebook
- Size: 1.91 MB
- Stars: 1
- Watchers: 1
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
# web-scraping-proteins
Author: Erin James Wills, [email protected]
![NIH web scrape banner](./static/images/protein-webscrapte.png)
Photo by National Cancer Institute on Unsplash
## Overview
A web scrape of PubMed publication data, displayed on a single web page with multiple Plotly charts. The basic structure of the website is driven by an Excel spreadsheet, so people who don't know how to code can update it.
> The content of this repo generated the following webpage: http://nrtdp.northwestern.edu/targets/ (as of July 2022)
## Technologies
* Python
* Pandas
* Splinter
* BeautifulSoup
* Plotly
* HTML/CSS/JS
## Data Source
The dataset is generated by scraping PubMed search results for a given protein name, for example:
* [PubMed search "p21"](https://pubmed.ncbi.nlm.nih.gov/?term=p21)
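
As a minimal sketch, the search URL for any protein name can be built the same way as the `p21` link above. The helper name `pubmed_search_url` is hypothetical and is not taken from the repo's notebook; only the base URL and the `term` query parameter come from the example link.

```python
from urllib.parse import urlencode

# Base URL taken from the example search link above
PUBMED_BASE_URL = "https://pubmed.ncbi.nlm.nih.gov/"


def pubmed_search_url(protein_name: str) -> str:
    """Return a PubMed search URL for a protein name (hypothetical helper)."""
    # urlencode escapes spaces and special characters in the search term
    query = urlencode({"term": protein_name.strip()})
    return f"{PUBMED_BASE_URL}?{query}"


print(pubmed_search_url("p21"))
# prints: https://pubmed.ncbi.nlm.nih.gov/?term=p21
```

A URL built this way could then be passed to the browser session that performs the scrape.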
## Setup and Installation
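1. The environment needs the following:
    * Python 3.6+
    * pandas
    * webdriver_manager (used as `webdriver_manager.chrome`)
    * splinter
    * BeautifulSoup
    * time (part of the Python standard library)
    * json (part of the Python standard library)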
1. Activate your environment
1. Clone the repo to your local machine
1. Start Jupyter Notebook from the repo directory with the environment active
1. To run and/or troubleshoot the scraping, run `pubmed_scrape.ipynb`.
1. To view the index page, I suggest using the VS Code extension "Live Server" to open the `index.html` file.
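
Before opening the notebook, the package list from step 1 can be checked with a short script. This is a rough sketch and not part of the repo: the function name `missing_modules` is hypothetical, and the import names assume the usual packaging (BeautifulSoup imported as `bs4`, `webdriver_manager` as the top-level package).

```python
import importlib.util

# Top-level import names for the packages listed in step 1 (assumed names).
# BeautifulSoup is imported as `bs4`; `time` and `json` ship with Python.
REQUIRED_MODULES = ["pandas", "webdriver_manager", "splinter", "bs4", "time", "json"]


def missing_modules(names):
    """Return the names that cannot be found in the current environment."""
    missing = []
    for name in names:
        try:
            found = importlib.util.find_spec(name) is not None
        except (ImportError, ValueError):
            # Raised when a parent package is absent or a spec is unset
            found = False
        if not found:
            missing.append(name)
    return missing


if __name__ == "__main__":
    missing = missing_modules(REQUIRED_MODULES)
    print("Missing:", ", ".join(missing) if missing else "none")
```

Any name reported as missing would need to be installed into the active environment before running `pubmed_scrape.ipynb`.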
## Images
![](./webscrape/images/uniprotkb.jpg)
![](./webscrape/images/pubmed_protein_search.jpg)
![](./webscrape/images/js-table.jpg)
![](./webscrape/images/js-table-filtered.jpg)
![](./webscrape/images/js-plotly-target-selected.jpg)
![](./webscrape/images/js-plotly-menu-closed.jpg)