https://github.com/chickencoding123/source-code-collector
# source-code-collector
A python program to scrape source code from github live and archive.

> :warning: Although this repository works, it is a work-in-progress. See [todos](#todos) for more information about pending tasks.

## Motivation
Training a machine learning model requires a massive amount of training data. That data is produced by a data preparation pipeline, and the first step in the pipeline is to collect raw data from one or more sources. This repository contains a scraping program that downloads source code from [Github Archive](https://www.gharchive.org/) as well as live [GitHub](https://docs.github.com/en/graphql) repositories. Filters can be provided to fine-tune the scraping logic:
- `language`: only download repositories tagged with one of these languages. Must be a valid GitHub language tag (e.g. _typescript_ instead of _ts_), e.g. `['typescript', 'javascript']`.
- `license`: only download repositories with one of these licenses, e.g. `['mit', 'apache-2.0']`.
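As a rough sketch of how such filters could translate into a GitHub search query, consider the helper below. The function name and the mapping of filters to `language:`/`license:` search qualifiers are illustrative assumptions, not part of this project's API:

```python
def build_search_query(languages=None, licenses=None):
    """Build a GitHub search query string from filter lists.

    Hypothetical helper: GitHub's search syntax supports
    `language:` and `license:` qualifiers, which a crawler
    could pass to the GraphQL `search` field.
    """
    parts = ["is:public"]
    for lang in languages or []:
        parts.append(f"language:{lang}")
    for lic in licenses or []:
        parts.append(f"license:{lic}")
    return " ".join(parts)

query = build_search_query(["typescript"], ["mit"])
# "is:public language:typescript license:mit"
```

Note that how multiple values of the same qualifier combine (AND vs. OR) depends on GitHub's search semantics; a real crawler may need one query per language instead.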

## Setup
You will need **python3** and **sqlite3** to run this project. Install them with your package manager or by following one of the many guides online. Then:
1. Clone this repository to your local computer
2. Open a terminal and navigate to that directory on your local computer
3. Run `python3 project-setup.py` in the terminal; it creates an isolated environment and installs dependencies.
4. Add `API_KEY=<your github token>` to a new file named `.secrets`. At runtime the program uses this token to interact with the GitHub API automatically.
>This project uses the GitHub GraphQL API, which requires an API key. You can create one by following [Creating a Personal Access Token](https://docs.github.com/en/authentication/keeping-your-account-and-data-secure/creating-a-personal-access-token#creating-a-token). Only read-only scopes are required for crawling purposes, namely `read:packages` and `read:org`. For more information see [Authenticating with GraphQL](https://docs.github.com/en/graphql/guides/forming-calls-with-graphql#authenticating-with-graphql).
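A minimal sketch of how the `.secrets` file could be read at runtime. The parsing helper below is an illustration under the assumption of simple `KEY=value` lines; the project may use a dotenv-style loader instead:

```python
from pathlib import Path

def load_secrets(path=".secrets"):
    """Parse simple KEY=value lines from a secrets file.

    Illustrative only, not this project's actual loader.
    Blank lines, comments, and lines without '=' are ignored.
    """
    secrets = {}
    for line in Path(path).read_text().splitlines():
        line = line.strip()
        if line and not line.startswith("#") and "=" in line:
            key, _, value = line.partition("=")
            secrets[key.strip()] = value.strip()
    return secrets

# token = load_secrets()["API_KEY"]
```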

## Usage
The _API_ exposes two main functions:
1. `find_repos` finds repositories that match the criteria and saves them to a sqlite3 database.
2. `download` downloads the source code of those saved repositories.
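The sqlite3 storage step between the two functions can be sketched as follows. The table name, columns, and example row are assumptions for illustration; the project's actual schema may differ:

```python
import sqlite3

# Hypothetical schema: one row per discovered repository.
conn = sqlite3.connect(":memory:")  # the project would use a file on disk
conn.execute(
    """CREATE TABLE IF NOT EXISTS repos (
           full_name  TEXT PRIMARY KEY,   -- e.g. 'owner/name'
           language   TEXT,
           license    TEXT,
           downloaded INTEGER DEFAULT 0   -- flipped once source is fetched
       )"""
)
# `find_repos` would insert each match it discovers:
conn.execute(
    "INSERT OR IGNORE INTO repos (full_name, language, license) VALUES (?, ?, ?)",
    ("octocat/Hello-World", "typescript", "mit"),
)
conn.commit()

# `download` would later select the rows not yet fetched:
pending = conn.execute(
    "SELECT full_name FROM repos WHERE downloaded = 0"
).fetchall()
```

Splitting crawl and download around a durable table like this lets either step be re-run independently without losing progress.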

### CLI
Currently there are two _CLI_ commands:
- `crawl` collects and saves repositories that match the given criteria.
- `download` downloads the source code of previously saved repositories.

Run `python3 cli.py` for more information. Command-specific help is available via `--help` (e.g. `python3 cli.py crawl --help`).

## Todos
1. [ ] Performance enhancement. Currently very slow.
2. [ ] Integrate with a scalable crawling program.
3. [ ] Publish as a package to [pypi.org](https://pypi.org).