https://github.com/ejw-data/web-scraping-proteins

# web-scraping-proteins

Author: Erin James Wills

![NIH web scrape banner](./static/images/protein-webscrapte.png)
Photo by National Cancer Institute on Unsplash


## Overview


This project scrapes PubMed publication data and presents it on a single webpage with multiple Plotly charts. The basic structure of the website is updated through an Excel spreadsheet, so people who don't know how to code can maintain it.

> The content of this repo generated the following webpage: http://nrtdp.northwestern.edu/targets/ (as of July 2022)


## Technologies
* Python
* Pandas
* Splinter
* BeautifulSoup
* Plotly
* HTML/CSS/JS


## Data Source

The dataset is generated by scraping the PubMed search results for a given protein name, for example:
* [Pubmed Search "p21"](https://pubmed.ncbi.nlm.nih.gov/?term=p21)
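
The search link above follows a simple pattern, so one URL per protein can be built from a list of names. The following is a minimal sketch using only the standard library; the helper `pubmed_search_url` and the name "cyclin D1" are hypothetical illustrations and are not taken from the repo's notebook.

```python
from urllib.parse import urlencode

# Base address taken from the example search link above.
PUBMED_BASE = "https://pubmed.ncbi.nlm.nih.gov/"


def pubmed_search_url(protein_name: str) -> str:
    """Return the PubMed search URL for one protein name (hypothetical helper)."""
    # urlencode escapes spaces and punctuation so multi-word names stay valid.
    query = urlencode({"term": protein_name.strip()})
    return f"{PUBMED_BASE}?{query}"


if __name__ == "__main__":
    # "cyclin D1" is an illustrative name only; "p21" comes from the link above.
    for name in ["p21", "cyclin D1"]:
        print(pubmed_search_url(name))
```

Each generated URL could then be handed to the browser automation step that loads the page for scraping.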


## Setup and Installation
1. Set up an environment with the following:
    * Python 3.6+
    * pandas
    * webdriver_manager (for its `webdriver_manager.chrome` module)
    * splinter
    * BeautifulSoup (installed as `beautifulsoup4`)
    * `time` and `json` (both part of the Python standard library, so no install is needed)
1. Activate your environment.
1. Clone the repo to your local machine.
1. Start Jupyter Notebook within the environment from the repo directory.
1. To run or troubleshoot the scraping, open and run `pubmed_scrape.ipynb`.
1. To view the index page, I suggest opening the `index.html` file with the VS Code extension "Live Server".
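
Before launching the notebook, it may help to confirm that the requirements from step 1 can be found in the active environment. The sketch below is one way to do that and is not part of the repo; the function `missing_modules` and the list `REQUIRED_MODULES` are hypothetical names, and the list assumes the usual import names (`bs4` for BeautifulSoup).

```python
import importlib.util

# Import names for the packages listed in step 1. BeautifulSoup is imported
# as `bs4`; `time` and `json` ship with Python itself.
REQUIRED_MODULES = [
    "pandas",
    "splinter",
    "bs4",
    "webdriver_manager.chrome",
    "time",
    "json",
]


def missing_modules(names):
    """Return the entries of `names` that cannot be found in this environment."""
    missing = []
    for name in names:
        try:
            found = importlib.util.find_spec(name) is not None
        except ImportError:
            # For dotted names, find_spec imports the parent package first and
            # raises ModuleNotFoundError (an ImportError) when it is absent.
            found = False
        if not found:
            missing.append(name)
    return missing


if __name__ == "__main__":
    absent = missing_modules(REQUIRED_MODULES)
    print("Missing modules:", ", ".join(absent) if absent else "none")
```

Any name reported as missing needs to be installed into the environment before running `pubmed_scrape.ipynb`.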


## Images

![](./webscrape/images/uniprotkb.jpg)


![](./webscrape/images/pubmed_protein_search.jpg)


![](./webscrape/images/js-table.jpg)


![](./webscrape/images/js-table-filtered.jpg)


![](./webscrape/images/js-plotly-target-selected.jpg)


![](./webscrape/images/js-plotly-menu-closed.jpg)