{"id":19881017,"url":"https://github.com/gridaco/github-archives","last_synced_at":"2026-04-18T19:31:18.057Z","repository":{"id":108796117,"uuid":"580787479","full_name":"gridaco/github-archives","owner":"gridaco","description":"PL Datasource from public github repositories","archived":false,"fork":false,"pushed_at":"2022-12-22T18:06:21.000Z","size":37,"stargazers_count":1,"open_issues_count":0,"forks_count":0,"subscribers_count":1,"default_branch":"main","last_synced_at":"2025-10-30T08:51:37.894Z","etag":null,"topics":["archives","dataset","github","source-code"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/gridaco.png","metadata":{"files":{"readme":"readme.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null}},"created_at":"2022-12-21T12:58:25.000Z","updated_at":"2023-04-06T23:59:59.000Z","dependencies_parsed_at":"2023-03-21T13:36:37.546Z","dependency_job_id":null,"html_url":"https://github.com/gridaco/github-archives","commit_stats":{"total_commits":15,"total_committers":1,"mean_commits":15.0,"dds":0.0,"last_synced_commit":"67e42117fdd994c9a23620227d1c49dbdbba5826"},"previous_names":["gridaco/github-archives"],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/gridaco/github-archives","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/gridaco%2Fgithub-archives","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/gridaco%2Fgithub-archives/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/gridaco%2Fgithub-archives/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/gridaco
%2Fgithub-archives/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/gridaco","download_url":"https://codeload.github.com/gridaco/github-archives/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/gridaco%2Fgithub-archives/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":31982450,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-04-18T17:30:12.329Z","status":"ssl_error","status_checked_at":"2026-04-18T17:29:59.069Z","response_time":103,"last_error":"SSL_read: unexpected eof while reading","robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":false,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["archives","dataset","github","source-code"],"created_at":"2024-11-12T17:13:01.786Z","updated_at":"2026-04-18T19:31:18.037Z","avatar_url":"https://github.com/gridaco.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# GitHub public repositories archiver\n\nThis is a Python project for archiving selected public repositories from GitHub, mostly for use as ML datasets.\n\n## Prerequisites\n\n### Install dependencies\n\n```sh\n# deps\nbrew install libmagic\n# venv\npip3 install virtualenv\nvirtualenv -p python3 venv\nsource venv/bin/activate\npip3 install -r requirements.txt\n```\n\n### Setup : `.env`\n\n```.env\n# you have to set your own github personal access token. 
Read below for more info.\nGITHUB_ACCESS_TOKEN=\u003cpersonal-github-access-token\u003e\n# you can configure external storage for the archives (make sure this is an existing, empty directory)\nPUBLIC_GITHUB_ARCHIVES_DIR=\u003croot-directory-to-save-archives\u003e\n# if not set, the archives directory is used.\nPUBLIC_GITHUB_UNARCHIVES_DIR=\u003croot-directory-to-extract-archives\u003e\n```\n\n👉 [How to get a GitHub personal access token](https://docs.github.com/en/authentication/keeping-your-account-and-data-secure/creating-a-personal-access-token)\n\n## How to use\n\n```sh\n# The archiver\n# The unarchiver\n```\n\n## Hardware setups\n\nA full archive of all public repositories would take a huge amount of storage and be costly.\n\nFor this reason, we also support extracting only specific files from a repository and removing the archive file (.zip / .tar.gz) afterwards. (You might have to customize the code to best fit your pipeline.)\n\n## Disclaimer\n\nUse it at your own risk.\n\n### About Licenses of the archives\n\nFor faster archiving, this project validates the license of each repository after archiving. (It does not use any GitHub API; it looks for LICENSE files in the repository.)\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fgridaco%2Fgithub-archives","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fgridaco%2Fgithub-archives","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fgridaco%2Fgithub-archives/lists"}