{"id":23151666,"url":"https://github.com/jshinm/web-scrapper","last_synced_at":"2025-04-04T15:24:15.932Z","repository":{"id":127464458,"uuid":"435751227","full_name":"jshinm/web-scrapper","owner":"jshinm","description":"Web Scrapper used to extract NeuroData github repo stats","archived":false,"fork":false,"pushed_at":"2021-12-22T13:15:36.000Z","size":958,"stargazers_count":2,"open_issues_count":0,"forks_count":0,"subscribers_count":1,"default_branch":"main","last_synced_at":"2025-02-10T01:15:48.224Z","etag":null,"topics":["data-analysis","web-scraping"],"latest_commit_sha":null,"homepage":"","language":"Jupyter Notebook","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/jshinm.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2021-12-07T05:12:07.000Z","updated_at":"2022-03-18T01:29:41.000Z","dependencies_parsed_at":null,"dependency_job_id":"fb5cad3c-bdfb-4d4a-82ef-452d78909126","html_url":"https://github.com/jshinm/web-scrapper","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/jshinm%2Fweb-scrapper","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/jshinm%2Fweb-scrapper/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/jshinm%2Fweb-scrapper/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/jshinm%2Fweb-scrapper/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/jshinm","download_url":"https://codelo
ad.github.com/jshinm/web-scrapper/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":247199574,"owners_count":20900295,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["data-analysis","web-scraping"],"created_at":"2024-12-17T18:43:05.108Z","updated_at":"2025-04-04T15:24:15.907Z","avatar_url":"https://github.com/jshinm.png","language":"Jupyter Notebook","funding_links":[],"categories":[],"sub_categories":[],"readme":"# Web Scraping GitHub Repositories\n\nTo get a comprehensive view of the current status of all repositories under our lab, I built this brief web scraper to organize my workflow. Repositories are selected according to how much attention each one needs. To gauge that need, I chose to parse out `the number of issues`. Additionally, `last updated dates` indicate whether a repo is actively managed at this time.\n\n\u003ccenter\u003e\u003cimg src='./fig/webpage.png' width=80%\u003e\u003c/center\u003e\n\nFor each page that lists repositories, the page's URL was used to request its HTML output. The HTML output was converted to UTF-8, and the two items were then parsed out of it.\n\nThe aforementioned attributes of interest were parsed using `re` and built-in string methods.\n\n\u003ccenter\u003e\u003cimg src='./fig/output-b4-filtering.jpg' width=80%\u003e\u003c/center\u003e\n\nThe following two filters were applied to further narrow down the list:\n1. Number of Issues \u003e 0\n2. 
Last Updated Date is no earlier than `2021-09-01`\n\n\u003ccenter\u003e\u003cimg src='./fig/output.jpg' width=80%\u003e\u003c/center\u003e\n\nA recent request from our lab was to generate a list of active PRs in the NeuroData organization, so further scraping was conducted to extract `title of PR`, `PR's direct URL`, and `the author who made the initial PRs`. The output was exported as an Excel spreadsheet, which was subsequently registered as a NeuroData GitHub issue.\n\n## Example output of the extracted and wrangled web-scraping result\n\nLead | Repository | PR Name | PR URL\n-- | -- | -- | --\nadam2392 | scikit-learn | [TEST PR] Adding oblique trees (i.e. Forest-RC) to cythonized tree module | https://github.com//neurodata/scikit-learn/pull/11\nadam2392 | scikit-learn | [TEST PR] Oblique forests | https://github.com//neurodata/scikit-learn/pull/10\nadam2392 | scikit-learn | Tom/grid to graph 26 | https://github.com//neurodata/scikit-learn/pull/8\nLizaNaydanova | ProgLearn | Added streaming capability for ODIN | https://github.com//neurodata/ProgLearn/pull/528\nLizaNaydanova | ProgLearn | Added neural network scene segmentation tutorial. | https://github.com//neurodata/ProgLearn/pull/527","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fjshinm%2Fweb-scrapper","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fjshinm%2Fweb-scrapper","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fjshinm%2Fweb-scrapper/lists"}