https://github.com/gridaco/github-archives
PL Datasource from public github repositories
archives dataset github source-code
- Host: GitHub
- URL: https://github.com/gridaco/github-archives
- Owner: gridaco
- License: mit
- Created: 2022-12-21T12:58:25.000Z (almost 3 years ago)
- Default Branch: main
- Last Pushed: 2022-12-22T18:06:21.000Z (almost 3 years ago)
- Last Synced: 2024-04-14T05:14:35.507Z (over 1 year ago)
- Topics: archives, dataset, github, source-code
- Language: Python
- Homepage:
- Size: 36.1 KB
- Stars: 1
- Watchers: 1
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: readme.md
- License: LICENSE
README
# GitHub public repositories archiver
This is a Python project for archiving selected public repositories from GitHub, mostly for use as machine learning datasets.
## Prerequisites
### Install dependencies
```sh
# deps
brew install libmagic
# venv
pip3 install virtualenv
virtualenv -p python3 venv
source venv/bin/activate
pip3 install -r requirements.txt
```

### Setup: `.env`
```.env
# Set your own GitHub personal access token. See the link below for how to create one.
GITHUB_ACCESS_TOKEN=
# You can configure external storage for the archives. This must be an existing, empty directory.
PUBLIC_GITHUB_ARCHIVES_DIR=
# If not set, the archives directory is used.
PUBLIC_GITHUB_UNARCHIVES_DIR=
```

👉 [How to get a GitHub personal access token](https://docs.github.com/en/authentication/keeping-your-account-and-data-secure/creating-a-personal-access-token)
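
The fallback rule in the `.env` comments can be sketched in a few lines of Python. This is only an illustration, not the project's own loader: the function name `load_config` and the returned dictionary are hypothetical, and the sketch assumes the variables have already been loaded into a mapping such as `os.environ`.

```python
import os
from pathlib import Path


def load_config(env=None):
    """Read the three settings from a mapping (hypothetical helper, defaults to os.environ)."""
    env = os.environ if env is None else env

    token = env.get("GITHUB_ACCESS_TOKEN", "").strip()
    if not token:
        raise ValueError("GITHUB_ACCESS_TOKEN is not set")

    # The archives directory is optional here; when given, it must already exist.
    archives_raw = env.get("PUBLIC_GITHUB_ARCHIVES_DIR", "").strip()
    archives_dir = Path(archives_raw) if archives_raw else None
    if archives_dir is not None and not archives_dir.is_dir():
        raise ValueError(f"{archives_dir} is not an existing directory")

    # Fall back to the archives directory when the unarchives directory is unset.
    unarchives_raw = env.get("PUBLIC_GITHUB_UNARCHIVES_DIR", "").strip()
    unarchives_dir = Path(unarchives_raw) if unarchives_raw else archives_dir

    return {
        "token": token,
        "archives_dir": archives_dir,
        "unarchives_dir": unarchives_dir,
    }
```

The sketch does not check that the archives directory is empty, since that only matters before the first run.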
## How to use
```sh
# The archiver
# The unarchiver
```

## Hardware setups
A full archive of all public repositories takes an enormous amount of storage and is expensive to keep.
For this reason, we also support extracting only specific files from a repository and removing the archive file (.zip / .tar.gz) afterwards. (You might have to customize the code to best fit your pipeline.)
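
As a rough sketch of that extract-then-delete flow, using only the standard library: the function name `extract_selected`, the helper `_safe_target`, and the default `.py` suffix filter are assumptions for illustration and are not taken from the project's code.

```python
import shutil
import tarfile
import zipfile
from pathlib import Path


def _safe_target(dest_dir, member_name):
    """Return the output path for a member, or None if it would escape dest_dir."""
    root = dest_dir.resolve()
    target = (root / member_name).resolve()
    return target if root in target.parents else None


def extract_selected(archive_path, dest_dir, suffixes=(".py",)):
    """Copy out only members ending in `suffixes`, then delete the archive."""
    archive_path, dest_dir = Path(archive_path), Path(dest_dir)
    dest_dir.mkdir(parents=True, exist_ok=True)
    suffixes = tuple(suffixes)
    extracted = []

    def write(name, src):
        target = _safe_target(dest_dir, name)
        if target is None:
            return  # skip suspicious paths such as "../x.py"
        target.parent.mkdir(parents=True, exist_ok=True)
        with open(target, "wb") as dst:
            shutil.copyfileobj(src, dst)
        extracted.append(name)

    if zipfile.is_zipfile(archive_path):
        with zipfile.ZipFile(archive_path) as zf:
            for name in zf.namelist():
                if name.endswith(suffixes):
                    with zf.open(name) as src:
                        write(name, src)
    elif tarfile.is_tarfile(archive_path):
        with tarfile.open(archive_path) as tf:
            for member in tf.getmembers():
                if member.isfile() and member.name.endswith(suffixes):
                    with tf.extractfile(member) as src:
                        write(member.name, src)
    else:
        raise ValueError(f"unsupported archive: {archive_path}")

    archive_path.unlink()  # remove the .zip / .tar.gz once the files are out
    return extracted
```

Filtering by suffix is just one possible selection rule; a pipeline that needs specific paths or size limits would swap in its own predicate.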
## Disclaimer
Use it at your own risk.
### About Licenses of the archives
To keep archiving fast, this project validates the license of each repository after it has been archived. It does not call the GitHub API for this; instead, it looks for LICENSE files inside the repository.
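
A local license lookup of this kind could look roughly like the sketch below. It is not the project's implementation: the file-name stems, the text markers, and the function names `find_license_files` and `guess_license` are hypothetical, and the substring match is a crude heuristic that will misclassify some licenses.

```python
from pathlib import Path

# Hypothetical file stems and text markers; a real pipeline would need a fuller list.
LICENSE_STEMS = {"license", "licence", "copying"}
LICENSE_MARKERS = {
    "MIT": "mit license",
    "Apache-2.0": "apache license",
    "GPL": "gnu general public license",
}


def find_license_files(repo_dir):
    """Return top-level files whose name (without extension) looks like a license file."""
    repo_dir = Path(repo_dir)
    return sorted(
        p for p in repo_dir.iterdir()
        if p.is_file() and p.stem.lower() in LICENSE_STEMS
    )


def guess_license(repo_dir):
    """Return the first marker-matched license name, or None if nothing matches."""
    for path in find_license_files(repo_dir):
        text = path.read_text(encoding="utf-8", errors="ignore").lower()
        for name, marker in LICENSE_MARKERS.items():
            if marker in text:
                return name
    return None
```

A repository for which `guess_license` returns `None` would then need manual review or exclusion from the dataset, depending on the pipeline's policy.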