https://github.com/src-d/datasets

# source{d} Datasets [![Build Status](https://travis-ci.com/src-d/datasets.svg?branch=master)](https://travis-ci.com/src-d/datasets) [![Build status](https://ci.appveyor.com/api/projects/status/b2en9yo9142qgadh?svg=true)](https://ci.appveyor.com/project/vmarkovtsev/datasets)

source{d} datasets for source code analysis and [machine learning on source code (ML on Code)](https://github.com/src-d/awesome-machine-learning-on-source-code).

This repository contains the tools and scripts needed to reproduce the datasets, along with the academic papers related to them.

## Available datasets

### Public Git Archive

- [Public Git Archive](PublicGitArchive)
- Size: 6TB
- Description: 260k+ top-bookmarked repositories from GitHub, consisting of 136M+ files and ~28 billion lines of code.

### Programming Language Identifiers

- [Programming Language Identifiers](Identifiers)
- Size: 1GB
- Description: ~49M distinct identifiers extracted from 10+ programming languages.

### Code duplicates

- [Manually labeled pairs of files and functions](Duplicates)
- Size: 250MB
- Description: 2k Java file pairs and 600 Java function pairs labeled as similar or different by several programmers.

### Pull Request review comments

- [PR review comments](ReviewComments)
- Size: 1.5GB
- Description: 25.3 million GitHub PR review comments from January 2015 to December 2018.

### Commit messages

- [Commit messages](CommitMessages)
- Size: 46GB
- Description: 1.3 billion GitHub commit messages up to March 2019.

### Structural commit features

- [Structural commit features](StructuralCommitFeatures)
- Size: 1.9GB
- Description: 1.6 million commits in 622 Java repositories on GitHub.

### DockerHub Metadata

- [DockerHub Metadata](DockerHubMetadata)
- Size: 1.4GB
- Description: 1.46 million Docker image configuration and manifest files on [DockerHub](https://hub.docker.com/) fetched in June 2019.

### DockerHub Packages

- [DockerHub Packages](DockerHubPackages)
- Size: 15GB
- Description: 419,092 analyzed Docker images from [DockerHub](https://hub.docker.com/), fetched in summer 2019, with lists of their native, Python and Node packages.

### Typos

- [Typos](Typos)
- Size: 1MB
- Description: 7,375 typos in source code identifier names found in GitHub repositories.

### NuGet Namespaces

- [NuGet Namespaces](NugetNamespaces)
- Size: 13MB
- Description: Information about 681,858 .NET namespaces extracted from 227,839 NuGet packages.
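
To plan disk space before fetching several datasets, the sizes listed above can be added up. The following is a minimal sketch, not part of this repository's tooling: the dictionary keys reuse the directory names linked above, the sizes are the approximate figures from this README, and decimal units (1 GB = 1000 MB) are an assumption. The helper names `size_in_mb` and `total_mb` are hypothetical.

```python
# Approximate sizes copied from the dataset list in this README.
DATASET_SIZES = {
    "PublicGitArchive": "6TB",
    "Identifiers": "1GB",
    "Duplicates": "250MB",
    "ReviewComments": "1.5GB",
    "CommitMessages": "46GB",
    "StructuralCommitFeatures": "1.9GB",
    "DockerHubMetadata": "1.4GB",
    "DockerHubPackages": "15GB",
    "Typos": "1MB",
    "NugetNamespaces": "13MB",
}

# Assumption: decimal units, since the README does not say which convention it uses.
UNIT_MB = {"MB": 1, "GB": 1_000, "TB": 1_000_000}


def size_in_mb(size: str) -> float:
    """Convert a size string such as '1.5GB' to megabytes."""
    number, unit = size[:-2], size[-2:]
    return float(number) * UNIT_MB[unit]


def total_mb(names=None) -> float:
    """Sum the listed sizes of the given datasets (all of them by default)."""
    names = DATASET_SIZES if names is None else names
    return sum(size_in_mb(DATASET_SIZES[name]) for name in names)


if __name__ == "__main__":
    everything = total_mb()
    without_pga = total_mb(n for n in DATASET_SIZES if n != "PublicGitArchive")
    print(f"All datasets: about {everything / UNIT_MB['TB']:.2f} TB")
    print(f"Without Public Git Archive: about {without_pga / UNIT_MB['GB']:.1f} GB")
```

Because the listed sizes are rounded, the totals are only rough estimates. They do show that Public Git Archive dominates the storage requirement, while the remaining datasets together fit in well under 100 GB.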

## Contributions

Contributions are very welcome; please see [CONTRIBUTING.md](CONTRIBUTING.md) and the [code of conduct](CODE_OF_CONDUCT.md).

## License

The tools and scripts are licensed under Apache 2.0; see [LICENSE.md](LICENSE.md).