{"id":24341560,"url":"https://github.com/xu-xiang/GithubGather","last_synced_at":"2025-09-28T03:30:57.179Z","repository":{"id":215658557,"uuid":"739469491","full_name":"xu-xiang/GithubGather","owner":"xu-xiang","description":"GitHub API Data Gatherer, Supports multi-token rotation, deep fetching, field filtering, and linked requests. Built as a proxy to the official GitHub API, it's fully compatible with the official API. ⚡ GitHub API数据采集器，支持多Token轮换、深度爬取、字段过滤，以及联动请求等功能。基于GitHub API代理实现，完全兼容官方API","archived":false,"fork":false,"pushed_at":"2024-01-05T18:15:23.000Z","size":16,"stargazers_count":0,"open_issues_count":0,"forks_count":0,"subscribers_count":1,"default_branch":"main","last_synced_at":"2024-01-06T18:43:53.766Z","etag":null,"topics":["github","github-analytics","github-api","github-crawler","github-data-gathering","github-integration","github-proxy","github-security","github-token-pool","python"],"latest_commit_sha":null,"homepage":"https://www.aiflows.io/get-started/integrations/api/githubgather-ni-de-zui-jia-github-shu-ju-cai-ji-gong-ju","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/xu-xiang.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null}},"created_at":"2024-01-05T16:35:40.000Z","updated_at":"2024-01-06T18:05:05.000Z","dependencies_parsed_at":"2024-01-05T18:48:48.262Z","dependency_job_id":null,"html_url":"https://github.com/xu-xiang/GithubGather","commit_stats":null,"previous_names":["xu-xiang/githubgather"],"tags_count":1,"template":null,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/xu-xiang%2FGithubGather"
,"tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/xu-xiang%2FGithubGather/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/xu-xiang%2FGithubGather/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/xu-xiang%2FGithubGather/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/xu-xiang","download_url":"https://codeload.github.com/xu-xiang/GithubGather/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":234480864,"owners_count":18840193,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["github","github-analytics","github-api","github-crawler","github-data-gathering","github-integration","github-proxy","github-security","github-token-pool","python"],"created_at":"2025-01-18T08:00:59.621Z","updated_at":"2025-09-28T03:30:56.755Z","avatar_url":"https://github.com/xu-xiang.png","language":"Python","funding_links":[],"categories":["LLM分析过程"],"sub_categories":[],"readme":"# GitHubGather: Your Ultimate GitHub Data Gathering Tool\n\n## 🚀 Introduction\n\nAn easy-to-use harvester based on the official GitHub API proxy, designed for developers, security engineers, data analysts, and more.\n\nFully compatible with the official GitHub API, **GitHubGather supports automatic switching between multiple tokens**. 
Simple parameter configuration enables complex functionality such as batch data scraping.\n\nWhether you're looking to bulk scrape for corporate code leakage, collect CVE vulnerability intelligence from GitHub, or analyze and track project information, GitHubGather simplifies the process!\n\nCombine it with [aiflows](https://github.com/xu-xiang/aiflows) to create a data source for task orchestration tools like n8n. Use an LLM to intelligently analyze data on GitHub. For instance, automatically gather the latest code containing a CVE vulnerability using GitHubGather, then use large models to explore PoCs.\n\n[Chinese version](doc/README.zh-cn.md)\n\n\n## 🌟 Highlight Features\n\n* API: Fully compatible with the official GitHub API, no worries about unsupported interfaces. Refer to the official API examples.\n* Token Rotation: Handles strict GitHub rate limits with automatic token rotation.\n* Deep Fetch: Automatically paginates to capture all results from endpoints such as code search, which return at most 100 items per page.\n* Field Filtering: Choose to return only the fields you need, for clearer data and improved performance.\n* Linked Requests: Efficient data collection with one request triggering multiple API calls (e.g., crawling all READMEs of a user's repositories).\n* Docker Deployment: Easy one-click start.\n\n## ⚡ Quick Start\n\n### Docker One-Click Start\n\n```shell\ndocker run --name githubgather -d -e GITHUB_TOKENS='token1,token2,token3,token4,token5,token6,token7' -p 9000:9000 registry.cn-hangzhou.aliyuncs.com/aiflows/githubgather:latest\n```\n\n### Docker Local Build\n\n```shell\ndocker build -t githubgather .\ndocker run --name githubgather -d -e GITHUB_TOKENS='token1,token2,token3,token4,token5,token6,token7' -p 9000:9000 githubgather\n```\n\n### Local Quick Start\n\n```shell\ngit clone https://github.com/xu-xiang/GitHubGather.git\ncd GitHubGather\npip install -r requirements.txt\nexport 
GITHUB_TOKENS='token1,token2,token3,token4,token5,token6,token7'\nuvicorn app.main:app --host 0.0.0.0 --port 9000\n```\n\n### Give it a Try\n\nGitHubGather is not only powerful but also super easy to use. With just a few simple HTTP requests, you can easily obtain a large amount of GitHub data. Here are some common examples to try out!\n\n#### Get basic information of a specific user\n\n```http request\nGET http://localhost:9000/users/{{username}}\nContent-Type: application/json\n```\n\n#### Get information of a specific repository\n\n```http request\nGET http://localhost:9000/repos/{{username}}/{{repo}}\nContent-Type: application/json\n```\n\n## 📘 Advanced Features\n\n### Deep Fetch\n\n- **Description**: Automatically paginate to retrieve all data from the target interface, no manual handling of multiple pages required.\n- **Usage**: Add `?deep_fetch=true` to the request.\n- **Applicable Scenarios**: When you need to retrieve all data from a GitHub API interface with multiple pages of data, such as obtaining all repositories or all starred projects of a user.\n\n#### Deep fetch all repositories of a specific user\n\n```http request\nGET http://localhost:9000/users/{{username}}/repos?deep_fetch=true\u0026per_page=100\nAccept: application/json\n```\n\n#### Deep fetch all Starred projects of a user\n\n```http request\nGET http://localhost:9000/users/{{username}}/starred?deep_fetch=true\u0026per_page=100\nContent-Type: application/json\n```\n\n### Pagination Control (Per Page)\n\n- **Description**: Control the amount of data returned in each request.\n- **Usage**: Add `\u0026per_page=\u003cnumber\u003e` to the request, where `\u003cnumber\u003e` is the number of items you want per page.\n- **Default Value**: GitHub API defaults to 30 items per page, but for efficiency, this project uses the maximum value of 100, which can be reduced via this parameter.\n- **Applicable Scenarios**: When using deep fetch, adjusting the number of items per page can speed up data scraping 
or more finely control the amount of data returned.\n\n### Maximum Fetch Pages (Max Pages)\n\n- **Description**: Limit the number of pages fetched in deep fetch mode.\n- **Usage**: Add `\u0026max_pages=\u003cnumber\u003e` to the request, where `\u003cnumber\u003e` is the maximum number of pages to fetch.\n- **Applicable Scenarios**: To cap the amount of data and avoid too many requests when deep fetching a large result set.\n\n#### Limit fetch to a specific number of pages, e.g., 1 item per page, only 2 pages in total\n\n```http request\nGET http://localhost:9000/users/{{username}}/repos?deep_fetch=true\u0026per_page=1\u0026max_pages=2\nAccept: application/json\n```\n\n### Highlighted Code in Search Results\n\n- **Description**: When searching for code, returns snippets with highlighted code, useful in scenarios like code leakage detection.\n- **Usage**: Add `\u0026highlight_code=true` to the request to enable code highlighting, default is false.\n- **Applicable Scenarios**: When you need to quickly locate and understand code in search results.\n\n#### Example: Search for highlighted asynchronous Python code snippets\n\n```http request\nGET http://localhost:9000/search/code?q=python+async\u0026highlight_code=true\nContent-Type: application/json\n```\n\n### Linked Requests\n\n- **Description**: Perform multiple API calls in one request, aggregating data from multiple requests.\n- **Usage**: Add `\u0026linked_\u003cresource\u003e=\u003cendpoint\u003e` to the request, e.g., `\u0026linked_readme=/repos/{full_name}/readme`.\n- **Dynamic Values**: Wrap a field name in `{}` (e.g., `{full_name}`) to fill in that field's value from the results of the first request.\n- **Applicable Scenarios**: To retrieve related data, like getting all repositories of a user along with their README files.\n\n#### Retrieve repositories and their README files\n\n```http request\nGET 
http://localhost:9000/users/{{username}}/repos?linked_readme=/repos/{full_name}/readme\n```\n\n### Field Filtering\n\n- **Description**: Retrieve specific data fields only.\n- **Usage**: Add `\u0026fields=\u003cfield1\u003e,\u003cfield2\u003e` to the request, where `\u003cfield1\u003e`, `\u003cfield2\u003e` are the names of the fields you want to retrieve.\n- **Applicable Scenarios**: To reduce the amount of data returned, focusing on the information you care about, like only retrieving a user's login name and contributions.\n\n#### Search for machine learning projects, deep fetch all results, and return only the name and description fields\n\n```http request\nGET http://localhost:9000/search/repositories?q=machine+learning\u0026deep_fetch=true\u0026fields=items.name,items.description\nAccept: application/json\n```\n\nExample of returned data:\n\n```text\n{\n  \"items\": [\n    {\n      \"description\": \"A curated list of awesome Machine Learning frameworks, libraries and software.\"\n    },\n    {\n      \"description\": \"Basic Machine Learning and Deep Learning\"\n    }\n    ...\n  ]\n}  \n```\n\nFor more examples, refer to: [test_cases.http](tests/test_cases.http)\n\nFor API calls, refer to the official GitHub documentation (all endpoints can be proxied): [GitHubAPI](https://docs.github.com/en/rest/search/search?apiVersion=2022-11-28#search-code)\n\n## 🔧 Configurable\n\n`GITHUB_TOKENS`: A comma-separated list of tokens in the environment variable; GitHubGather switches between them automatically to enhance efficiency.\n\n`DEEP_FETCH_MAX_PAGES`: Set the maximum number of pages for deep fetching, default is 15.\n\n`TOTAL_COUNT_LIMIT`: The upper limit on GitHub API search results, default is 1000. Adjust it only if the official GitHub API allows more; as of 2024-01-05, the official limit is 1000 records.\n\n`MAX_RETRIES`: Maximum number of retries, default is 2.\n\nFor the configuration file, refer to: [config.py](githubgather/config.py)\n\n## 🤝 Join Us\n\nFound a bug? Have a fresh idea? 
Join us to improve GitHubGather!\n\n1. Fork our repository.\n2. Switch to a new branch (`git checkout -b cool-new-feature`).\n3. Commit your changes (`git commit -am 'Add some cool feature'`).\n4. Push to the branch (`git push origin cool-new-feature`).\n5. Submit a Pull Request.\n\n## 📜 License\n\nLicensed under the MIT License, see the LICENSE file for details.\n\n## 💬 Need Help?\n\nHaving issues? Not sure how to use it? Feel free to raise an issue or contact us directly!\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fxu-xiang%2FGithubGather","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fxu-xiang%2FGithubGather","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fxu-xiang%2FGithubGather/lists"}