{"id":21289282,"url":"https://github.com/chickencoding123/source-code-collector","last_synced_at":"2025-03-15T15:44:59.822Z","repository":{"id":177504989,"uuid":"535739963","full_name":"chickencoding123/source-code-collector","owner":"chickencoding123","description":"A python program to scrape source code from github live and archive.","archived":false,"fork":false,"pushed_at":"2022-09-19T13:15:18.000Z","size":21,"stargazers_count":0,"open_issues_count":0,"forks_count":0,"subscribers_count":1,"default_branch":"main","last_synced_at":"2025-01-22T05:43:13.640Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":null,"language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/chickencoding123.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2022-09-12T15:48:46.000Z","updated_at":"2022-09-12T15:52:07.000Z","dependencies_parsed_at":null,"dependency_job_id":"bcd0498a-dc81-4960-bbfa-16fb8ec39cae","html_url":"https://github.com/chickencoding123/source-code-collector","commit_stats":null,"previous_names":["chickencoding123/source-code-collector"],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/chickencoding123%2Fsource-code-collector","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/chickencoding123%2Fsource-code-collector/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/chickencoding123%2Fsource-code-collector/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/rep
ositories/chickencoding123%2Fsource-code-collector/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/chickencoding123","download_url":"https://codeload.github.com/chickencoding123/source-code-collector/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":243754009,"owners_count":20342537,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2024-11-21T12:38:15.755Z","updated_at":"2025-03-15T15:44:59.800Z","avatar_url":"https://github.com/chickencoding123.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# source-code-collector\nA python program to scrape source code from github live and archive.\n\n\u003e:warning: Although this repository works, it is a work in progress. See [todos](#Todos) for more information about pending tasks.\n\n## Motivation\nTraining a machine learning model requires a massive amount of training data. This training data is generated by a data preparation pipeline, and the first step in that pipeline is to collect raw data from one or more sources. This repository contains a scraping program that downloads source code from [GitHub Archive](https://www.gharchive.org/) as well as live [GitHub](https://docs.github.com/en/graphql) repositories. Filters can be provided to fine-tune the scraping logic:\n- language: only download repositories that are tagged with this language. Must be a valid language tag in GitHub (e.g. _typescript_ instead of _ts_). E.g. 
['typescript', 'javascript']\n- license: only download repositories with this license. E.g. ['mit', 'apache-2.0']\n\n## Setup\nYou will need **python3** and **sqlite3** to run this project. You can install them using a package manager or by following various guides online. After installing them:\n1. Clone this repository to your local computer.\n2. Open a terminal and navigate to that directory on your local computer.\n3. Execute `python3 project-setup.py` inside the terminal, which will create an isolated env and install dependencies.\n4. Add `API_KEY=your github token` inside of a new file named `.secrets`. At runtime the program will automatically use this token to interact with the GitHub API.\n   \u003eThis project uses the GitHub GraphQL API, which requires an API key. You can create one by following [Creating a Personal Access Token](https://docs.github.com/en/authentication/keeping-your-account-and-data-secure/creating-a-personal-access-token#creating-a-token). Only read-only scopes are required for crawling purposes: `read:packages` and `read:org`. For more information see [Authenticating with GraphQL](https://docs.github.com/en/graphql/guides/forming-calls-with-graphql#authenticating-with-graphql).\n\n## Usage\nTwo main functions are provided in _API_:\n1. `find_repos` will find repositories that match the criteria and save them in a sqlite3 database.\n2. `download` will download the source code of those saved repositories.\n\n\n### CLI\nCurrently there are two _CLI_ commands:\n- `crawl` to collect and save repositories that match the given criteria.\n- `download` to download source code of previously saved repositories.\n\nSee `python3 cli.py` for more information. You can also see command-specific help by passing `--help` (e.g. `python3 cli.py crawl --help`).\n\n## Todos\n1. [ ] Performance enhancement. Currently very slow.\n2. [ ] Integrate with a scalable crawling program.\n3. 
[ ] Publish as a package to [pypi.org](https://pypi.org).\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fchickencoding123%2Fsource-code-collector","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fchickencoding123%2Fsource-code-collector","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fchickencoding123%2Fsource-code-collector/lists"}