{"id":28998692,"url":"https://github.com/luminati-io/github-dataset-samples","last_synced_at":"2025-06-25T07:09:45.490Z","repository":{"id":283783967,"uuid":"884792452","full_name":"luminati-io/GitHub-dataset-samples","owner":"luminati-io","description":"A sample dataset of over 1000 GitHub repositories, extracted using the Bright Data API, ideal for developer engagement, community engagement, and advocacy.","archived":false,"fork":false,"pushed_at":"2024-11-07T12:05:33.000Z","size":6569,"stargazers_count":0,"open_issues_count":0,"forks_count":0,"subscribers_count":0,"default_branch":"main","last_synced_at":"2025-03-22T07:02:01.203Z","etag":null,"topics":["datasets","github","github-data","github-dataset","github-repository","github-scraper","web-scraper"],"latest_commit_sha":null,"homepage":"https://brightdata.com/products/datasets/github","language":null,"has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/luminati-io.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2024-11-07T11:58:05.000Z","updated_at":"2024-11-07T12:08:11.000Z","dependencies_parsed_at":"2025-03-22T07:12:08.270Z","dependency_job_id":null,"html_url":"https://github.com/luminati-io/GitHub-dataset-samples","commit_stats":null,"previous_names":["luminati-io/github-dataset-samples"],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/luminati-io/GitHub-dataset-samples","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/luminati-io%2FGitHub-dataset-samples","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/l
uminati-io%2FGitHub-dataset-samples/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/luminati-io%2FGitHub-dataset-samples/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/luminati-io%2FGitHub-dataset-samples/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/luminati-io","download_url":"https://codeload.github.com/luminati-io/GitHub-dataset-samples/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/luminati-io%2FGitHub-dataset-samples/sbom","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":261823775,"owners_count":23215150,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["datasets","github","github-data","github-dataset","github-repository","github-scraper","web-scraper"],"created_at":"2025-06-25T07:09:32.632Z","updated_at":"2025-06-25T07:09:45.455Z","avatar_url":"https://github.com/luminati-io.png","language":null,"readme":"# GitHub-dataset-samples\n\n\u003ch2\u003eA sample dataset of 1001 GitHub repositories\u003c/h2\u003e\n\n![GitHub dataset header](https://github.com/luminati-io/GitHub-dataset-samples/blob/main/github-datasets.PNG)\n\nA GitHub dataset sample of over 1000 repositories. 
The dataset was extracted using the \u003cb\u003eBright Data API\u003c/b\u003e.\n\n\u003ch2\u003eSome of the data points included in the dataset:\u003c/h2\u003e\n\n* ```url```: Repository web address\n* ```id```: Unique repository ID\n* ```code_language```: Main programming language\n* ```code```: Repository source code\n* ```num_lines```: Total lines of code\n* ```user_name```: Repository owner's username\n* ```user_url```: Owner's profile URL\n* ```size```: Repository size\n* ```size_unit```: Repository size units\n* ```size_num```: Repository size number\n* ```breadcrumbs```: Repository navigation path\n* ```num_issues```: Total issues count\n* ```num_pull_requests```: Total pull requests count\n* ```num_projects```: Number of associated projects\n* ```num_fork```: Fork count\n* ```num_stared```: Star count\n* ```last_feature```: Latest feature change\n* ```latest_update```: Date of last update\n\nAnd a lot more.\n\nThis sample is a subset of the \"GitHub Repositories (public data)\"\ndataset, which includes more than \u003cb\u003e2,200,000 repositories\u003c/b\u003e.\n\nAvailable dataset file formats: \u003cb\u003eJSON, NDJSON, JSON Lines, CSV, or Parquet. Optionally, files can be compressed to .gz\u003c/b\u003e.\n\nDataset delivery type options: \u003cb\u003eEmail, API download, Webhook, Amazon S3, Google Cloud Storage, Google Cloud Pub/Sub, Microsoft Azure, Snowflake, SFTP\u003c/b\u003e.\n\nUpdate frequency: \u003cb\u003eOnce, Daily, Weekly, Monthly, Quarterly, or Custom basis\u003c/b\u003e.\n\nData enrichment is available in addition to the extracted data points: \u003cb\u003eBased on request.\u003c/b\u003e\n\n\u003cb\u003e[Get the full GitHub dataset](https://brightdata.com/products/datasets/github)\u003c/b\u003e.\n\n\u003ch2\u003eWhat are the use cases for GitHub datasets?\u003c/h2\u003e\n\n\u003ch3\u003e1. 
Developer Engagement\u003c/h3\u003e\nGain insights into the activity and health of open-source projects by tracking data points like commit histories, pull requests, and issue discussions. This data can help businesses identify high-impact projects, monitor trends, and discover collaboration opportunities in the open-source community.\n\n\u003ch3\u003e2. Community Engagement\u003c/h3\u003e\nEvaluate the popularity and community backing of open-source projects by analyzing metrics such as star and fork counts. This information helps businesses understand which projects are gaining traction, make informed adoption decisions, and identify technology trends.\n\n\u003ch3\u003e3. Community Advocacy\u003c/h3\u003e\nUse public GitHub profile data to foster engagement and advocacy within the open-source community. Identify active users who star, fork, and contribute to repositories in your field to create a network of advocates who can amplify your projects and fuel collaborative innovation.\n\n\u003ch2\u003eFree access to web scraping tools and datasets for academic researchers and NGOs\u003c/h2\u003e\n\nThe Bright Initiative offers access to Bright Data's \u003cb\u003e[Web Scraper APIs](https://brightdata.com/products/web-scraper)\u003c/b\u003e and \u003cb\u003e[ready-to-use datasets](https://brightdata.com/products/datasets)\u003c/b\u003e to leading academic faculties and researchers, as well as NGOs and NPOs promoting environmental and social causes. You can submit an application \u003cb\u003e[here](https://brightinitiative.com)\u003c/b\u003e.\n","funding_links":[],"categories":[],"sub_categories":[],"project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fluminati-io%2Fgithub-dataset-samples","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fluminati-io%2Fgithub-dataset-samples","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fluminati-io%2Fgithub-dataset-samples/lists"}