https://github.com/chickencoding123/source-code-collector
# source-code-collector
A python program to scrape source code from github live and archive.

> :warning: Although this repository works, it is a work-in-progress. See [todos](#todos) for more information about pending tasks.

## Motivation
Training a machine learning model requires a massive amount of training data. That data is produced by a data preparation pipeline, and the first step in the pipeline is to collect raw data from one or more sources. This repository contains a scraping program that downloads source code from [Github Archive](https://www.gharchive.org/) as well as live [GitHub](https://docs.github.com/en/graphql) repositories. Filters can be provided to fine-tune the scraping logic:
- `language`: only download repositories tagged with one of these languages. Must be a valid GitHub language tag (e.g. _typescript_ instead of _ts_), e.g. `['typescript', 'javascript']`.
- `license`: only download repositories with one of these licenses, e.g. `['mit', 'apache-2.0']`.
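As a rough sketch of how such filters could translate into a GitHub search query, consider the helper below. The function name and the mapping of filters to `language:`/`license:` search qualifiers are illustrative assumptions, not part of this project's API:

```python
def build_search_query(languages=None, licenses=None):
    """Build a GitHub search query string from filter lists.

    Hypothetical helper: GitHub's search syntax supports
    `language:` and `license:` qualifiers, which a crawler
    could pass to the GraphQL `search` field.
    """
    parts = ["is:public"]
    for lang in languages or []:
        parts.append(f"language:{lang}")
    for lic in licenses or []:
        parts.append(f"license:{lic}")
    return " ".join(parts)

query = build_search_query(["typescript"], ["mit"])
# "is:public language:typescript license:mit"
```

Note that how multiple values of the same qualifier combine (AND vs. OR) depends on GitHub's search semantics; a real crawler may need one query per language instead.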

## Setup
You will need **python3** and **sqlite3** to run this project. Install them with your package manager or by following one of the many guides online. Then:
1. Clone this repository to your local computer
2. Open a terminal and navigate to that directory on your local computer
3. Run `python3 project-setup.py` in the terminal; it creates an isolated environment and installs dependencies.
4. Add `API_KEY=<your github token>` to a new file named `.secrets`. At runtime the program uses this token to interact with the GitHub API automatically.
>This project uses the GitHub GraphQL API, which requires an API key. You can create one by following [Creating a Personal Access Token](https://docs.github.com/en/authentication/keeping-your-account-and-data-secure/creating-a-personal-access-token#creating-a-token). Only read-only scopes are required for crawling purposes, namely `read:packages` and `read:org`. For more information see [Authenticating with GraphQL](https://docs.github.com/en/graphql/guides/forming-calls-with-graphql#authenticating-with-graphql).
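A minimal sketch of how the `.secrets` file could be read at runtime. The parsing helper below is an illustration under the assumption of simple `KEY=value` lines; the project may use a dotenv-style loader instead:

```python
from pathlib import Path

def load_secrets(path=".secrets"):
    """Parse simple KEY=value lines from a secrets file.

    Illustrative only, not this project's actual loader.
    Blank lines, comments, and lines without '=' are ignored.
    """
    secrets = {}
    for line in Path(path).read_text().splitlines():
        line = line.strip()
        if line and not line.startswith("#") and "=" in line:
            key, _, value = line.partition("=")
            secrets[key.strip()] = value.strip()
    return secrets

# token = load_secrets()["API_KEY"]
```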

## Usage
The _API_ exposes two main functions:
1. `find_repos` finds repositories that match the criteria and saves them to a sqlite3 database.
2. `download` downloads the source code of those saved repositories.
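The sqlite3 storage step between the two functions can be sketched as follows. The table name, columns, and example row are assumptions for illustration; the project's actual schema may differ:

```python
import sqlite3

# Hypothetical schema: one row per discovered repository.
conn = sqlite3.connect(":memory:")  # the project would use a file on disk
conn.execute(
    """CREATE TABLE IF NOT EXISTS repos (
           full_name  TEXT PRIMARY KEY,   -- e.g. 'owner/name'
           language   TEXT,
           license    TEXT,
           downloaded INTEGER DEFAULT 0   -- flipped once source is fetched
       )"""
)
# `find_repos` would insert each match it discovers:
conn.execute(
    "INSERT OR IGNORE INTO repos (full_name, language, license) VALUES (?, ?, ?)",
    ("octocat/Hello-World", "typescript", "mit"),
)
conn.commit()

# `download` would later select the rows not yet fetched:
pending = conn.execute(
    "SELECT full_name FROM repos WHERE downloaded = 0"
).fetchall()
```

Splitting crawl and download around a durable table like this lets either step be re-run independently without losing progress.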

### CLI
Currently there are two _CLI_ commands:
- `crawl` collects and saves repositories that match the given criteria.
- `download` downloads the source code of previously saved repositories.

Run `python3 cli.py` for more information. Command-specific help is available via `--help` (e.g. `python3 cli.py crawl --help`).

## Todos
1. [ ] Performance enhancement. Currently very slow.
2. [ ] Integrate with a scalable crawling program.
3. [ ] Publish as a package to [pypi.org](https://pypi.org).