{"id":21293343,"url":"https://github.com/ejw-data/web-scraping-proteins","last_synced_at":"2026-05-11T09:22:23.927Z","repository":{"id":55099172,"uuid":"516912230","full_name":"ejw-data/web-scraping-proteins","owner":"ejw-data","description":"Webscrape of Pubmed publication data that is used in a single webpage with multiple plotly charts.  The basic structure of the website is updated with an excel spreadsheet to help those who don't know how to code.  ","archived":false,"fork":false,"pushed_at":"2022-07-26T06:39:24.000Z","size":2007,"stargazers_count":1,"open_issues_count":0,"forks_count":0,"subscribers_count":1,"default_branch":"main","last_synced_at":"2025-01-22T06:47:17.406Z","etag":null,"topics":["beautifulsoup","excel","html-css-javascript","pandas","plotly","python","splinter"],"latest_commit_sha":null,"homepage":"","language":"Jupyter Notebook","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/ejw-data.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2022-07-22T23:35:08.000Z","updated_at":"2022-08-09T09:55:19.000Z","dependencies_parsed_at":"2022-08-14T11:50:59.461Z","dependency_job_id":null,"html_url":"https://github.com/ejw-data/web-scraping-proteins","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ejw-data%2Fweb-scraping-proteins","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ejw-data%2Fweb-scraping-proteins/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ejw-data%2Fweb-scraping-proteins/releases","manifests_url":"https://repos.ecosyste.ms
/api/v1/hosts/GitHub/repositories/ejw-data%2Fweb-scraping-proteins/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/ejw-data","download_url":"https://codeload.github.com/ejw-data/web-scraping-proteins/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":243762236,"owners_count":20343976,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["beautifulsoup","excel","html-css-javascript","pandas","plotly","python","splinter"],"created_at":"2024-11-21T13:54:23.534Z","updated_at":"2026-05-11T09:22:23.895Z","avatar_url":"https://github.com/ejw-data.png","language":"Jupyter Notebook","funding_links":[],"categories":[],"sub_categories":[],"readme":"# web-scraping-proteins  \n\nAuthor: Erin James Wills, ejw-data@gmail.com  \n\n![NIH web scrape banner](./static/images/protein-webscrapte.png)  \n\u003ccite\u003ePhoto by \u003ca href=\"https://unsplash.com/@nci?utm_source=unsplash\u0026utm_medium=referral\u0026utm_content=creditCopyText\"\u003eNational Cancer Institute\u003c/a\u003e on \u003ca href=\"https://unsplash.com/s/photos/proteins?utm_source=unsplash\u0026utm_medium=referral\u0026utm_content=creditCopyText\"\u003eUnsplash\u003c/a\u003e\u003c/cite\u003e\n\n\u003cbr\u003e\n\n## Overview  \n\u003chr\u003e\n\nThis project scrapes PubMed publication data and displays it on a single webpage with multiple Plotly charts.  The basic structure of the website is updated from an Excel spreadsheet so that people who don't know how to code can maintain it.  
\n\n\n\u003e The content of this repo generated the following webpage:  http://nrtdp.northwestern.edu/targets/  (as of July 2022)\n\n\u003cbr\u003e\n\n## Technologies  \n*  Python\n*  Pandas\n*  Splinter\n*  BeautifulSoup\n*  Plotly\n*  HTML/CSS/JS\n\n\u003cbr\u003e  \n\n## Data Source  \n\nThe dataset is generated by scraping PubMed search results for a protein name:  \n*  [PubMed Search \"p21\"](https://pubmed.ncbi.nlm.nih.gov/?term=p21) \n\n\u003cbr\u003e\n\n## Setup and Installation  \n1. The environment needs the following:  \n    *  Python 3.6+  \n    *  pandas\n    *  webdriver_manager.chrome\n    *  splinter\n    *  BeautifulSoup\n    *  time (standard library)\n    *  json (standard library)\n1. Activate your environment\n1. Clone the repo to your local machine\n1. Start Jupyter Notebook within the environment from the repo\n1. To run and/or troubleshoot the scraping, run `pubmed_scrape.ipynb`.  \n1. To view the index page, I suggest using the VS Code extension \"Live Server\" to open the `index.html` file.  \n\n\u003cbr\u003e\n\n## Images  \n\u003cbr\u003e\n\n![](./webscrape/images/uniprotkb.jpg)\n\n\u003cbr\u003e\n\n![](./webscrape/images/pubmed_protein_search.jpg)\n\n\u003cbr\u003e\n\n![](./webscrape/images/js-table.jpg)\n\n\u003cbr\u003e\n\n![](./webscrape/images/js-table-filtered.jpg)\n\n\u003cbr\u003e\n\n![](./webscrape/images/js-plotly-target-selected.jpg)\n\n\u003cbr\u003e\n\n![](./webscrape/images/js-plotly-menu-closed.jpg)\n\n\u003cbr\u003e\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fejw-data%2Fweb-scraping-proteins","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fejw-data%2Fweb-scraping-proteins","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fejw-data%2Fweb-scraping-proteins/lists"}