https://github.com/src-d/datasets
source{d} datasets ("big code") for source code analysis and machine learning on source code
- Host: GitHub
- URL: https://github.com/src-d/datasets
- Owner: src-d
- License: other
- Created: 2018-01-25T14:07:36.000Z (about 7 years ago)
- Default Branch: master
- Last Pushed: 2019-11-27T16:55:22.000Z (about 5 years ago)
- Last Synced: 2025-01-22T14:06:23.765Z (15 days ago)
- Topics: dataset, datasets, git, github, machine-learning, mlosc
- Language: Jupyter Notebook
- Homepage:
- Size: 47.5 MB
- Stars: 325
- Watchers: 20
- Forks: 82
- Open Issues: 26
Metadata Files:
- Readme: README.md
- Contributing: CONTRIBUTING.md
- License: LICENSE.md
- Code of conduct: CODE_OF_CONDUCT.md
README
# source{d} Datasets [![Build Status](https://travis-ci.com/src-d/datasets.svg?branch=master)](https://travis-ci.com/src-d/datasets) [![Build status](https://ci.appveyor.com/api/projects/status/b2en9yo9142qgadh?svg=true)](https://ci.appveyor.com/project/vmarkovtsev/datasets)
source{d} datasets for source code analysis and [machine learning on source code (ML on Code)](https://github.com/src-d/awesome-machine-learning-on-source-code).
This repository contains the tools and scripts needed to reproduce the datasets, along with the academic papers related to them.
## Available datasets
### Public Git Archive
- [Public Git Archive](PublicGitArchive)
- Size: 6TB
- Description: 260k+ top-bookmarked repositories from GitHub, consisting of 136M+ files and ~28 billion lines of code.
### Programming Language Identifiers
- [Programming Language Identifiers](Identifiers)
- Size: 1GB
- Description: ~49M distinct identifiers extracted from 10+ programming languages.
### Code duplicates
- [Manually labelled pairs of files and functions](Duplicates)
- Size: 250MB
- Description: 2k pairs of Java files and 600 pairs of Java functions, labeled as similar or different by several programmers.
### Pull Request review comments
- [PR review comments](ReviewComments)
- Size: 1.5GB
- Description: 25.3 million GitHub PR review comments from January 2015 through December 2018.
### Commit messages
- [Commit messages](CommitMessages)
- Size: 46GB
- Description: 1.3 billion GitHub commit messages up to March 2019.
### Structural commit features
- [Structural commit features](StructuralCommitFeatures)
- Size: 1.9GB
- Description: 1.6 million commits in 622 Java repositories on GitHub.
### DockerHub Metadata
- [DockerHub Metadata](DockerHubMetadata)
- Size: 1.4GB
- Description: 1.46 million Docker image configuration and manifest files from [DockerHub](https://hub.docker.com/), fetched in June 2019.
### DockerHub Packages
- [DockerHub Packages](DockerHubPackages)
- Size: 15GB
- Description: 419,092 analyzed Docker images from [DockerHub](https://hub.docker.com/), fetched in summer 2019, with lists of their native, Python, and Node packages.
### Typos
- [Typos](Typos)
- Size: 1MB
- Description: 7,375 typos in source code identifier names found in GitHub repositories.
### NuGet Namespaces
- [NuGet Namespaces](NugetNamespaces)
- Size: 13MB
- Description: Information about 681,858 .NET namespaces extracted from 227,839 NuGet packages.
## Contributions
Contributions are very welcome; please see [CONTRIBUTING.md](CONTRIBUTING.md) and the [code of conduct](CODE_OF_CONDUCT.md).
## License
The tools and scripts are licensed under Apache 2.0; see [LICENSE.md](LICENSE.md).