https://github.com/smola/language-dataset

Dataset for programming language identification.

Host: GitHub
URL: https://github.com/smola/language-dataset
Owner: smola
License: mit
Created: 2018-08-06T14:05:52.000Z (over 7 years ago)
Default Branch: master
Last Pushed: 2023-03-06T05:01:18.000Z (about 3 years ago)
Last Synced: 2025-03-30T23:41:11.335Z (about 1 year ago)
Topics: dataset, language-detection, language-identification, programming-language-identification
Language: Python
Homepage:
Size: 11.7 MB
Stars: 22
Watchers: 3
Forks: 5
Open Issues: 14
Metadata Files:
- Readme: README.md
- Contributing: CONTRIBUTING.md
- License: LICENSE

README

# language-dataset

A dataset for programming language identification.

## Methodology

* Available languages are fetched from [github/linguist](https://github.com/github/linguist/)'s [languages.yml](https://github.com/github/linguist/blob/master/lib/linguist/languages.yml) and [acmeism/RosettaCodeData](https://github.com/acmeism/RosettaCodeData)'s [Lang.yaml](https://github.com/acmeism/RosettaCodeData/blob/master/Meta/Lang.yaml).
* For each language, initial samples are fetched from GitHub as follows:
  * [GitHub Search API](https://developer.github.com/v4/query/#search) is used to get a list of repositories.
  * Each repository is cloned and languages are detected with [github/linguist](https://github.com/github/linguist/).
  * One sample is added from each repository.
* Samples are later reviewed by humans.

Rules for sample inclusion are:

* No more than one sample from each repository.
* Each sample is at least 500 bytes and at most 100 KB.
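
The two inclusion rules can be sketched as a small filter. This is only an illustration, not the logic in `tools/harvest.py`: the `(repository, path, size_bytes)` tuple shape and the function names are hypothetical, and reading the upper limit as 100,000 bytes (rather than 102,400) is an assumption.

```python
# Size limits from the inclusion rules above.
# Assumption: the upper limit is read as 100 * 1000 bytes.
MIN_SIZE_BYTES = 500
MAX_SIZE_BYTES = 100 * 1000


def size_ok(size_bytes):
    """Return True if a sample size falls within the inclusion limits."""
    return MIN_SIZE_BYTES <= size_bytes <= MAX_SIZE_BYTES


def select_samples(candidates):
    """Keep at most one acceptable sample per repository.

    `candidates` is an iterable of (repository, path, size_bytes) tuples.
    This shape is hypothetical and only serves the illustration.
    """
    selected = {}
    for repo, path, size in candidates:
        # Skip repositories that already contributed a sample,
        # and skip files outside the allowed size range.
        if repo in selected or not size_ok(size):
            continue
        selected[repo] = path
    return selected


if __name__ == "__main__":
    candidates = [
        ("octo/alpha", "src/main.py", 2_048),
        ("octo/alpha", "src/util.py", 4_096),  # second file from the same repository
        ("octo/beta", "tiny.rb", 120),         # below the minimum size
        ("octo/gamma", "big.sql", 250_000),    # above the maximum size
        ("octo/delta", "lib/core.go", 9_000),
    ]
    print(select_samples(candidates))
```

Here the first acceptable file seen for a repository wins; any other selection policy (random choice, largest file) would satisfy the rules equally well.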

## Dataset

The dataset is stored in the `data` directory. It contains:

* `meta.yml`: metadata about the dataset and available languages.
* `dataset.yml`: collection of all samples, with pointers to sample paths relative to `data`.

See [REPORT.md](REPORT.md) for a summary of the dataset.
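
Because sample paths are stored relative to `data`, a consumer has to join them with the location of that directory before opening a file. A minimal sketch using only the standard library follows. The helper name `resolve_sample` and the example relative path are hypothetical, and the check against paths escaping the directory is an added precaution, not something the dataset requires.

```python
from pathlib import Path


def resolve_sample(data_dir, relative_path):
    """Resolve a sample path (relative to the data directory) to an absolute path."""
    base = Path(data_dir).resolve()
    target = (base / relative_path).resolve()
    # Added precaution: reject entries that would point outside the data directory.
    if base != target and base not in target.parents:
        raise ValueError(f"{relative_path!r} escapes {base}")
    return target


if __name__ == "__main__":
    # Hypothetical relative path; real values come from data/dataset.yml.
    print(resolve_sample("data", "some/sample/file.txt"))
```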

## Contributing

See [CONTRIBUTING.md](CONTRIBUTING.md).

## Tooling

The `tools` directory contains various Python utilities to maintain the dataset:
* `tools/gen_meta.py`: Generates `data/meta.yml`. This is only needed when upgrading to a new github/linguist or acmeism/RosettaCodeData version.
* `tools/harvest.py`: Fetches samples from GitHub.
* `tools/vote.py`: Updates the `vote` annotation.
* `tools/lint.py`: Checks the dataset for potential problems.
* `tools/prepare_commit.py`: Updates generated files, required before any commit.
* `tools/classify_linguist.py`: Updates linguist labels.
* `tools/classify_pygments.py`: Updates pygments labels.

To run the tools, first create the virtual environment:

```
pip install poetry
poetry install
```

Then run the tool with `python -m`:

```
poetry run python -m tools.gen_meta
```

## License

Each sample in `data` has its own license. Check the origin repository for details.

Everything else is licensed under the [MIT License](LICENSE).
