{"id":22975062,"url":"https://github.com/src-d/datasets","last_synced_at":"2025-04-06T07:14:55.849Z","repository":{"id":57497959,"uuid":"118920552","full_name":"src-d/datasets","owner":"src-d","description":"source{d} datasets (\"big code\") for source code analysis and machine learning on source code","archived":false,"fork":false,"pushed_at":"2019-11-27T16:55:22.000Z","size":49803,"stargazers_count":329,"open_issues_count":26,"forks_count":83,"subscribers_count":19,"default_branch":"master","last_synced_at":"2025-03-30T06:08:25.318Z","etag":null,"topics":["dataset","datasets","git","github","machine-learning","mlosc"],"latest_commit_sha":null,"homepage":"","language":"Jupyter Notebook","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"other","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/src-d.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":"CONTRIBUTING.md","funding":null,"license":"LICENSE.md","code_of_conduct":"CODE_OF_CONDUCT.md","threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2018-01-25T14:07:36.000Z","updated_at":"2025-03-06T14:19:00.000Z","dependencies_parsed_at":"2022-08-28T19:41:37.931Z","dependency_job_id":null,"html_url":"https://github.com/src-d/datasets","commit_stats":null,"previous_names":[],"tags_count":16,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/src-d%2Fdatasets","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/src-d%2Fdatasets/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/src-d%2Fdatasets/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/src-d%2Fdatasets/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/src-d","download_url":"https://codeload.github.com/src-d/datasets/t
ar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":247445681,"owners_count":20939961,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["dataset","datasets","git","github","machine-learning","mlosc"],"created_at":"2024-12-15T00:21:07.468Z","updated_at":"2025-04-06T07:14:55.826Z","avatar_url":"https://github.com/src-d.png","language":"Jupyter Notebook","readme":"# source{d} Datasets [![Build Status](https://travis-ci.com/src-d/datasets.svg?branch=master)](https://travis-ci.com/src-d/datasets) [![Build status](https://ci.appveyor.com/api/projects/status/b2en9yo9142qgadh?svg=true)](https://ci.appveyor.com/project/vmarkovtsev/datasets)\n\nsource{d} datasets for source code analysis and [machine learning on source code (ML on Code)](https://github.com/src-d/awesome-machine-learning-on-source-code).\n\nThis repository contains all the needed tools and scripts to reproduce the datasets, as well as the academic papers they may relate to.\n\n## Available datasets\n\n### Public Git Archive\n\n- [Public Git Archive](PublicGitArchive)\n- Size: 6TB\n- Description: 260k+ top-bookmarked repositories from GitHub, consisting of 136M+ files and ~28 billion lines of code.\n\n### Programming Language Identifiers\n\n- [Programming Language Identifiers](Identifiers)\n- Size: 1GB\n- Description: ~49M distinct identifiers extracted from 10+ programming languages.\n\n### Code duplicates\n\n- [Manually labelled pairs of files and functions](Duplicates)\n- Size: 250MB\n- Description: 2k Java file and 600 Java function pairs labeled as similar 
or different by several programmers.\n\n### Pull Request review comments\n\n- [PR review comments](ReviewComments)\n- Size: 1.5GB\n- Description: 25.3 million GitHub PR review comments from January 2015 to December 2018.\n\n### Commit messages\n\n- [Commit messages](CommitMessages)\n- Size: 46GB\n- Description: 1.3 billion GitHub commit messages up to March 2019.\n\n### Structural commit features\n\n- [Structural commit features](StructuralCommitFeatures)\n- Size: 1.9GB\n- Description: 1.6 million commits in 622 Java repositories on GitHub.\n\n### DockerHub Metadata\n\n- [DockerHub Metadata](DockerHubMetadata)\n- Size: 1.4GB\n- Description: 1.46 million Docker image configuration and manifest files on [DockerHub](https://hub.docker.com/) fetched in June 2019.\n\n### DockerHub Packages\n\n- [DockerHub Packages](DockerHubPackages)\n- Size: 15GB\n- Description: 419,092 analyzed Docker images: lists of native, Python and Node packages on [DockerHub](https://hub.docker.com/) fetched in summer 2019.\n\n### Typos\n\n- [Typos](Typos)\n- Size: 1MB\n- Description: 7,375 typos in source code identifier names found in GitHub repositories.\n\n### NuGet Namespaces\n\n- [NuGet Namespaces](NugetNamespaces)\n- Size: 13MB\n- Description: Information about 681,858 .NET namespaces extracted from 227,839 NuGet packages.\n\n## Contributions\n\nContributions are very welcome; please see [CONTRIBUTING.md](CONTRIBUTING.md) and the [code of conduct](CODE_OF_CONDUCT.md).\n\n## License\n\nThe tools and scripts are licensed under Apache 2.0; see [LICENSE.md](LICENSE.md).\n","funding_links":[],"categories":[],"sub_categories":[],"project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fsrc-d%2Fdatasets","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fsrc-d%2Fdatasets","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fsrc-d%2Fdatasets/lists"}