https://github.com/jshinm/web-scrapper
Web scraper used to extract NeuroData GitHub repo stats
data-analysis web-scraping
Last synced: about 1 year ago
- Host: GitHub
- URL: https://github.com/jshinm/web-scrapper
- Owner: jshinm
- License: mit
- Created: 2021-12-07T05:12:07.000Z (over 4 years ago)
- Default Branch: main
- Last Pushed: 2021-12-22T13:15:36.000Z (over 4 years ago)
- Last Synced: 2025-02-10T01:15:48.224Z (over 1 year ago)
- Topics: data-analysis, web-scraping
- Language: Jupyter Notebook
- Homepage:
- Size: 936 KB
- Stars: 2
- Watchers: 1
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
README
# Web Scraping GitHub Repositories
To get a comprehensive view of the current status of every repository under our lab, I built this small web scraper to organize my workflow. Repositories are selected according to how much attention each one needs. To gauge that need, I chose to parse out `the number of issues`. The `last updated dates` also matter, because they show whether a repo is actively maintained at this time.

For each page that lists repositories, the page's URL was used to request the HTML output, which was then decoded as UTF-8.
The two attributes of interest described above were parsed from the decoded HTML using `re` and built-in string methods.
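
The request-and-parse step above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the notebook's actual code: the function names, the regular expressions, and the sample markup (the `data-issues` attribute in particular) are invented for the example, and GitHub's real HTML differs and changes over time.

```python
import re
from urllib.request import urlopen


def fetch_html(url: str) -> str:
    """Request a page and decode the raw bytes as UTF-8."""
    with urlopen(url) as response:
        return response.read().decode("utf-8")


# Hypothetical patterns: the attribute names are assumptions for this sketch.
ISSUE_PATTERN = re.compile(r'data-issues="(\d+)"')
DATE_PATTERN = re.compile(r'datetime="([^"]+)"')


def parse_repo(html: str) -> dict:
    """Pull the issue count and last-updated date out of one HTML fragment."""
    issues = ISSUE_PATTERN.search(html)
    updated = DATE_PATTERN.search(html)
    return {
        "issues": int(issues.group(1)) if issues else 0,
        # A built-in string method trims the timestamp down to the date part.
        "last_updated": updated.group(1).split("T")[0] if updated else None,
    }


# Made-up fragment standing in for one repository entry on a listing page.
SAMPLE_HTML = (
    '<li data-issues="12">'
    '<relative-time datetime="2021-11-30T08:00:00Z"></relative-time>'
    "</li>"
)
print(parse_repo(SAMPLE_HTML))
```

In real use, `fetch_html` would be called once per listing page and its result split into per-repository fragments before parsing. The sample string keeps the sketch runnable without network access.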

The following two filters were applied to further narrow down the list.
1. Number of Issues > 0
2. Last Updated Date is no earlier than `2021-09-01`
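
As a rough sketch, the two filters could be applied to parsed records like this. The record layout and the repository names are hypothetical; only the two conditions and the `2021-09-01` cutoff come from the list above.

```python
from datetime import date

# Cutoff taken from filter 2: keep repos updated on or after this date.
CUTOFF = date(2021, 9, 1)


def is_of_interest(repo: dict) -> bool:
    """Apply both filters: open issues exist and the repo was updated recently."""
    has_issues = repo["issues"] > 0
    recently_updated = date.fromisoformat(repo["last_updated"]) >= CUTOFF
    return has_issues and recently_updated


# Hypothetical records; the names and values are placeholders.
repos = [
    {"name": "repo-a", "issues": 3, "last_updated": "2021-10-15"},
    {"name": "repo-b", "issues": 0, "last_updated": "2021-12-01"},
    {"name": "repo-c", "issues": 7, "last_updated": "2021-08-31"},
    {"name": "repo-d", "issues": 1, "last_updated": "2021-09-01"},
]

selected = [repo["name"] for repo in repos if is_of_interest(repo)]
print(selected)  # ['repo-a', 'repo-d']
```

`repo-d` is kept because "no earlier than" includes the cutoff date itself.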

A recent request from our lab was to generate a list of active PRs in the NeuroData organization, so further scraping was conducted to extract the `title of PR`, the `PR's direct URL`, and `the author who made the initial PRs`. The output was exported as an Excel spreadsheet, which was subsequently registered as a NeuroData GitHub issue.
## Example output of the extracted and wrangled web-scraping result
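
The PR extraction and export could look roughly like the sketch below. The markup, the regular expression, and the function names are assumptions made for illustration. The export writes CSV with the standard library as a stand-in for the Excel spreadsheet, which keeps the example free of third-party dependencies; it is not the export method the notebook uses.

```python
import csv
import io
import re

# Hypothetical markup: one anchor per PR, with the author in a data attribute.
PR_PATTERN = re.compile(
    r'<a href="(?P<path>/[^"]+/pull/\d+)" data-author="(?P<author>[^"]+)">'
    r"(?P<title>[^<]+)</a>"
)

COLUMNS = ["Lead", "Repository", "PR Name", "PR Url"]


def extract_prs(html: str, base_url: str = "https://github.com") -> list:
    """Collect author, repository, title, and direct URL for each PR link."""
    rows = []
    for match in PR_PATTERN.finditer(html):
        path = match.group("path")
        # Path layout assumed to be /<org>/<repo>/pull/<number>.
        repository = path.strip("/").split("/")[1]
        rows.append({
            "Lead": match.group("author"),
            "Repository": repository,
            "PR Name": match.group("title").strip(),
            "PR Url": base_url + path,
        })
    return rows


def to_csv(rows: list) -> str:
    """Serialize the rows as CSV text (stand-in for the Excel export)."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=COLUMNS)
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


# Made-up fragment; the org, repo, user, and title are placeholders.
SAMPLE_HTML = (
    '<a href="/example-org/example-repo/pull/1" data-author="example-user">'
    "Add example feature</a>"
)
rows = extract_prs(SAMPLE_HTML)
print(to_csv(rows))
```

The column names match the headers of the example output table, so the resulting file can be pasted into an issue or opened in a spreadsheet program.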
Lead | Repository | PR Name | PR Url
-- | -- | -- | --
adam2392 | scikit-learn | [TEST PR] Adding oblique trees (i.e. Forest-RC) to cythonized tree module | https://github.com//neurodata/scikit-learn/pull/11
adam2392 | scikit-learn | [TEST PR] Oblique forests | https://github.com//neurodata/scikit-learn/pull/10
adam2392 | scikit-learn | Tom/grid to graph 26 | https://github.com//neurodata/scikit-learn/pull/8
LizaNaydanova | ProgLearn | Added streaming capability for ODIN | https://github.com//neurodata/ProgLearn/pull/528
LizaNaydanova | ProgLearn | Added neural network scene segmentation tutorial. | https://github.com//neurodata/ProgLearn/pull/527