{"id":13579622,"url":"https://github.com/bugout-dev/mirror","last_synced_at":"2025-12-27T01:36:32.266Z","repository":{"id":51183494,"uuid":"237697830","full_name":"bugout-dev/mirror","owner":"bugout-dev","description":"Software project analysis","archived":false,"fork":false,"pushed_at":"2021-05-20T16:10:47.000Z","size":343,"stargazers_count":21,"open_issues_count":4,"forks_count":8,"subscribers_count":7,"default_branch":"master","last_synced_at":"2024-11-05T18:46:58.957Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":null,"language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/bugout-dev.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2020-02-02T00:30:18.000Z","updated_at":"2024-08-26T07:03:26.000Z","dependencies_parsed_at":"2022-09-19T10:21:35.117Z","dependency_job_id":null,"html_url":"https://github.com/bugout-dev/mirror","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/bugout-dev%2Fmirror","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/bugout-dev%2Fmirror/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/bugout-dev%2Fmirror/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/bugout-dev%2Fmirror/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/bugout-dev","download_url":"https://codeload.github.com/bugout-dev/mirror/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":247415783,"owners_cou
nt":20935383,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2024-08-01T15:01:41.299Z","updated_at":"2025-12-27T01:36:32.219Z","avatar_url":"https://github.com/bugout-dev.png","language":"Python","readme":"# mirror - Tools for software project analysis\n\n## Setup\n- Prepare a Python environment and install the package\n- For development, use `pip install -r requirements.dev.txt`\n- Copy `sample.env` to `dev.env`, fill in the required variables, and source it:\n```bash\nexport GITHUB_TOKEN=\"\u003cyour GitHub token\u003e\"\nexport LANGUAGES_DIR=\"\u003cdirectory with cloned languages repos\u003e\"\nexport MIRROR_CRAWL_INTERVAL_SECONDS=1\nexport MIRROR_CRAWL_MIN_RATE_LIMIT=500  # for the search API, 5 works better\nexport MIRROR_CRAWL_BATCH_SIZE=\"\u003chow often to save data\u003e\"\nexport MIRROR_CRAWL_DIR=\"\u003cwhere to save crawled data\u003e\"\nexport MIRROR_LANGUAGES_FILE=\"\u003cjson file with languages\u003e\"\nexport SNIPPETS_DIR=\"\u003cdir for snippets dataset\u003e\"\n```\n\n- To avoid being blocked by GitHub, run a rate limit watcher:\n```bash\nwatch -d -n 5 'curl https://api.github.com/rate_limit -s -H \"Authorization: Bearer $GITHUB_TOKEN\" -H \"Accept: application/vnd.github.v3+json\"'\n```\n\n### Module commands\n\n```\npython -m mirror.cli --help\n\n  clone              Clone repos from search api to output dir.\n  commits            Read repos json file and upload all commits for that...\n  crawl              Processes arguments as parsed from the command line\n                     and...\n\n  generate_snippets  Create snippets dataset from cloned repos\n  
nextid             Prints ID of most recent repository crawled and\n                     written...\n\n  sample             Writes repositories sampled from a crawl directory to...\n  search             Crawl via search api.\n  validate           Prints ID of most recent repository crawled and\n                     written...\n```\n\n### Extract all repos metadata\n\nRun the `crawl` command to extract metadata for all repositories and save it to a `.json` file.\n\n```bash\npython -m mirror.cli crawl \\\n  --crawldir $MIRROR_CRAWL_DIR \\\n  --interval $MIRROR_CRAWL_INTERVAL_SECONDS \\\n  --min-rate-limit $MIRROR_CRAWL_MIN_RATE_LIMIT \\\n  --batch-size $MIRROR_CRAWL_BATCH_SIZE\n```\n\n### Extract repos metadata via search api\n\nIf you only need a small pool of repositories for analysis, you can set more precise criteria with the `search` command.\n\n```bash\npython -m mirror.cli search --crawldir \"$MIRROR_CRAWL_DIR/search\" -L \"python\" -s \"\u003e500\" -l 5\n```\n\n### Clone repos to local machine for analysis\n\nThe `clone` command uses standard `git clone` to clone the repositories listed in search or crawl results to the local machine.\n\nClone from search:\n```bash\npython -m mirror.cli clone -d $LANGUAGES_DIR -r \"$MIRROR_CRAWL_DIR/search\"\n```\n\nClone from crawl:\n```bash\npython -m mirror.cli clone -d $LANGUAGES_DIR -r \"$MIRROR_CRAWL_DIR\"\n```\n\nStructure of the `$LANGUAGES_DIR` directory:\n\n```\n\u003e $LANGUAGES_DIR\n  \u003e language 1\n    \u003e repo 1\n    \u003e repo 2\n    ...\n  \u003e language 2\n    \u003e repo 1\n    \u003e repo 2\n    ...\n  ...\n```\n\nIt is also possible to load popular repositories with Python code. 
See the example in [ex_clone.py](https://github.com/bugout-dev/mirror/examples/ex_clone.py)\n\n### Create commits from repo search\n\nThe `commits` command extracts all commits from each repository and saves them to a `.json` file per repository.\n\n```bash\npython -m mirror.cli commits -d \"$MIRROR_CRAWL_DIR/commits\" -l 5 -r \"$MIRROR_CRAWL_DIR/search\"\n```\n\n### Convert json data to csv for analysis\n\nThis creates a `.csv` file with a flattened JSON structure.\n\n```bash\npython -m mirror.github.utils --json-files-folder \"$MIRROR_CRAWL_DIR\" --output-csv \"$MIRROR_CRAWL_DIR/output.csv\" --command commits\n```\n\n### Generate snippets dataset from downloaded repo\n```bash\npython -m mirror.github.generate_snippets -r \"$OUTPUT_DIR\" -f \"examples/languages.json\" -L \"$LANGUAGES_DIR\"\n```\n\n### Workflow for generating a snippets dataset from a prepared file with languages and their extensions\n\n1) Create the search result\n```bash\npython -m mirror.cli search -d \"$MIRROR_CRAWL_DIR/search\" -f $MIRROR_LANGUAGES_FILE -s \"\u003e500\" -l 5\n```\n\n2) Clone repos from the search result. This takes time, and it may be a good idea to keep **git clone** stdout out of the terminal.\n```bash\npython -m mirror.cli clone -d $LANGUAGES_DIR -r \"$MIRROR_CRAWL_DIR/search\"\n```\n\n3) Generate snippets\n```bash\npython -m mirror.cli generate_snippets  -d $SNIPPETS_DIR -r $LANGUAGES_DIR\n```\n\nThis returns a SQLite database with the snippets and their metadata.\n\nTo work across an **allrepos** result, **clone** and **commits** accept the optional arguments\n```bash\n--start-id --end-id\n```\nBoth parameters must be set together. These IDs make it possible to process only part of the repositories from an allrepos result.","funding_links":[],"categories":["Python"],"sub_categories":[],"project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fbugout-dev%2Fmirror","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fbugout-dev%2Fmirror","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fbugout-dev%2Fmirror/lists"}