https://github.com/superskyyy/yet-another-github-miner
dev
- Host: GitHub
- URL: https://github.com/superskyyy/yet-another-github-miner
- Owner: Superskyyy
- License: apache-2.0
- Created: 2021-10-02T17:43:37.000Z (over 3 years ago)
- Default Branch: master
- Last Pushed: 2022-06-13T16:05:05.000Z (almost 3 years ago)
- Last Synced: 2024-12-29T21:26:31.091Z (5 months ago)
- Language: Python
- Size: 5.64 MB
- Stars: 3
- Watchers: 2
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
README
# How to Develop and Use
**THE PROJECT IS FOR EDUCATIONAL PURPOSES ONLY, anyone using this scraper is expected to adhere to GitHub regulations**
## Components
### Scraper
The scraper searches GitHub for repositories that contain the target file at any depth. ~Only the first batch of 1000 results will be returned.~ The sample scraper shows one way to overcome this limitation.
Example file target: `MLProject`
GitHub Query - `filename:MLProject`
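The README does not say how the sample scraper gets past the 1000-result cap. One common workaround, assumed here and not confirmed as this project's method, is to split a single query into several narrower ones with GitHub's `size:` search qualifier, so that each sub-query stays under the cap. The helper names and the bucket boundaries below are hypothetical.

```python
from typing import Iterator, List, Tuple


def size_buckets(boundaries: List[int]) -> Iterator[Tuple[int, int]]:
    """Yield inclusive (low, high) byte ranges between consecutive boundaries."""
    for low, high in zip(boundaries, boundaries[1:]):
        yield low, high - 1


def partitioned_queries(filename: str, boundaries: List[int]) -> List[str]:
    """Build one search query per size bucket, plus an open-ended final bucket.

    Assumption: partitioning by file size is only one possible strategy;
    the boundaries must be tuned so each bucket returns fewer than 1000 hits.
    """
    queries = [
        f"filename:{filename} size:{low}..{high}"
        for low, high in size_buckets(boundaries)
    ]
    # Catch everything at or above the last boundary.
    queries.append(f"filename:{filename} size:>={boundaries[-1]}")
    return queries


if __name__ == "__main__":
    for query in partitioned_queries("MLProject", [0, 200, 500, 1000]):
        print(query)
```

Each generated string can be used in place of the single `filename:MLProject` query; results from all buckets are then merged and de-duplicated.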
## Install
Install dependencies - `pip install -r requirements.txt`
## Run
Rename `sample_credentials.py` to `crendentials.py` after cloning, then fill in your GitHub personal access token.
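As a rough illustration of what the token is for, the sketch below builds (but does not send) an authenticated request to GitHub's code search endpoint using only the standard library. The function name `build_search_request` and the placeholder token are hypothetical; how the project's own scripts read the credentials file and issue requests is not shown in this README.

```python
import urllib.parse
import urllib.request

SEARCH_URL = "https://api.github.com/search/code"


def build_search_request(
    token: str, query: str, page: int = 1, per_page: int = 100
) -> urllib.request.Request:
    """Prepare a GitHub code-search request that carries a personal access token.

    Nothing is sent over the network here; pass the returned object to
    urllib.request.urlopen() to actually perform the search.
    """
    params = urllib.parse.urlencode(
        {"q": query, "page": page, "per_page": per_page}
    )
    headers = {
        "Authorization": f"token {token}",
        "Accept": "application/vnd.github+json",
    }
    return urllib.request.Request(f"{SEARCH_URL}?{params}", headers=headers)


if __name__ == "__main__":
    # "<your-token>" is a placeholder, not a working credential.
    request = build_search_request("<your-token>", "filename:MLProject")
    print(request.full_url)
```

Keeping the token in a separate, untracked file avoids committing it to version control by accident.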
Run `main.py` to filter through repos and paths.
Run `miner_selenium.py` to run the Chrome-based Selenium scraper.
Run `miner_requests.py` to run the GitHub REST API (v3) scraper.
Run `pickle_loader.py` to see the first 1000 results collected.
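For readers unfamiliar with pickled results, here is a minimal sketch of loading a pickle file and keeping at most the first 1000 entries. It assumes the file holds a list-like collection; the actual file name and data layout used by `pickle_loader.py` are not stated here, so `load_results` and the demo data are hypothetical.

```python
import pickle
import tempfile
from pathlib import Path
from typing import Any, List


def load_results(path: Path, limit: int = 1000) -> List[Any]:
    """Load a pickled collection and return at most `limit` entries.

    Only unpickle files you created yourself: pickle can execute
    arbitrary code when loading untrusted data.
    """
    with path.open("rb") as handle:
        results = pickle.load(handle)
    return list(results)[:limit]


if __name__ == "__main__":
    # Demo with made-up data written to a temporary file.
    with tempfile.TemporaryDirectory() as tmp:
        demo_path = Path(tmp) / "results.pkl"
        with demo_path.open("wb") as handle:
            pickle.dump([f"owner/repo-{i}" for i in range(1500)], handle)
        loaded = load_results(demo_path)
        print(len(loaded), loaded[:3])
```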