https://github.com/chickencoding123/source-code-collector
A Python program to scrape source code from GitHub, both live and from the archive.
- Host: GitHub
- URL: https://github.com/chickencoding123/source-code-collector
- Owner: chickencoding123
- License: mit
- Created: 2022-09-12T15:48:46.000Z (over 2 years ago)
- Default Branch: main
- Last Pushed: 2022-09-19T13:15:18.000Z (over 2 years ago)
- Last Synced: 2025-01-22T05:43:13.640Z (4 months ago)
- Language: Python
- Size: 20.5 KB
- Stars: 0
- Watchers: 1
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
# source-code-collector
A Python program to scrape source code from GitHub, both live and from the archive.

> :warning: Although this repository works, it is a work in progress. See [Todos](#todos) for more information about pending tasks.
## Motivation
Training a machine learning model requires a massive amount of training data. This training data is generated by a data-preparation pipeline, and the first step in that pipeline is to collect raw data from one or more sources. This repository contains a scraping program that downloads source code from [GH Archive](https://www.gharchive.org/) as well as live [GitHub](https://docs.github.com/en/graphql) repositories. Filters can be provided to fine-tune the scraping logic:
- language: only download repositories tagged with this language. Must be a valid GitHub language tag (e.g. _typescript_, not _ts_). Example: `['typescript', 'javascript']`
- license: only download repositories with this license. Example: `['mit', 'apache-2.0']`

## Setup
You will need **python3** and **sqlite3** to run this project. You can install them using a package manager or by following one of the many guides online. After installing them:
1. Clone this repository to your local computer
2. Open a terminal and navigate to that directory on your local computer
3. Execute `python3 project-setup.py` inside the terminal, which will create an isolated environment and install dependencies.
4. Add `API_KEY=<your GitHub token>` to a new file named `.secrets`. At runtime the program will use this token to interact with the GitHub API automatically.
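Step 4 above can be sketched in Python: read the token out of `.secrets` and send it as a bearer token, which is how GitHub's GraphQL endpoint expects authentication. The helper names below are illustrative, not this project's actual code:

```python
from pathlib import Path


def load_secrets(path=".secrets"):
    """Parse KEY=VALUE lines from a .secrets file into a dict."""
    secrets = {}
    for line in Path(path).read_text().splitlines():
        line = line.strip()
        # Skip blanks and comments; split on the first "=" only.
        if line and not line.startswith("#") and "=" in line:
            key, _, value = line.partition("=")
            secrets[key.strip()] = value.strip()
    return secrets


def auth_headers(token):
    """Authorization header in the form GitHub's GraphQL API expects."""
    return {"Authorization": f"bearer {token}"}
```

A request would then be POSTed to `https://api.github.com/graphql` with `auth_headers(...)` attached.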
> This project uses the GitHub GraphQL API, which requires an API key. You can create one by following [Creating a Personal Access Token](https://docs.github.com/en/authentication/keeping-your-account-and-data-secure/creating-a-personal-access-token#creating-a-token). Only the read-only scopes `read:packages` and `read:org` are required for crawling. For more information see [Authenticating with GraphQL](https://docs.github.com/en/graphql/guides/forming-calls-with-graphql#authenticating-with-graphql).

## Usage
Two main functions are provided in _API_:
1. `find_repos` will find and save repositories that match the criteria in a sqlite3 database.
2. `download` will download the source code of those saved repositories.

### CLI
Currently there are two _CLI_ commands:
- `crawl` to collect and save repositories that match the given criteria.
- `download` to download source code of previously saved repositories.
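The criteria that `crawl` matches against are the language and license filters from the Motivation section. On GitHub's side such filters are typically expressed as search qualifiers in a query string, which can be passed to the GraphQL `search` field. The helper below is a hypothetical sketch of that mapping, not part of this project's actual API:

```python
def build_search_query(languages=None, licenses=None, base="is:public"):
    """Assemble a GitHub search query string from language/license filters.

    Each filter value becomes one search qualifier, e.g. "language:typescript".
    """
    parts = [base]
    for lang in languages or []:
        parts.append(f"language:{lang}")
    for lic in licenses or []:
        parts.append(f"license:{lic}")
    return " ".join(parts)


print(build_search_query(["typescript", "javascript"], ["mit"]))
# → is:public language:typescript language:javascript license:mit
```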
Run `python3 cli.py` for more information. You can also see command-specific help by adding `--help` (e.g. `python3 cli.py crawl --help`).

## Todos
1. [ ] Performance enhancements; it is currently very slow.
2. [ ] Integrate with a scalable crawling program.
3. [ ] Publish as a package to [pypi.org](https://pypi.org).