https://github.com/saltudelft/many-types-4-py-dataset
ManyTypes4Py: A benchmark Python dataset for machine learning-based type inference
- Host: GitHub
- URL: https://github.com/saltudelft/many-types-4-py-dataset
- Owner: saltudelft
- License: apache-2.0
- Created: 2020-09-16T15:21:33.000Z (almost 6 years ago)
- Default Branch: master
- Last Pushed: 2022-03-27T08:52:57.000Z (about 4 years ago)
- Last Synced: 2025-09-05T11:49:41.316Z (10 months ago)
- Topics: benchmark, clean, dataset, machine-learning, manytypes4py, msr, mt4py, python, type-annotations, type-checked, type-inference, visible-type-hints
- Language: Jupyter Notebook
- Size: 14.6 MB
- Stars: 23
- Watchers: 6
- Forks: 5
- Open Issues: 1
Metadata Files:
- Readme: README.md
- Changelog: CHANGELOG.md
- License: LICENSE
- Citation: CITATION.bib
# ManyTypes4Py: A Benchmark Python Dataset for Machine Learning-based Type Inference
[DOI: 10.5281/zenodo.4719447](https://doi.org/10.5281/zenodo.4719447)
- [Intro](#intro)
- [Download](#downloading-dataset)
- [Preparation](#dataset-preparation)
- [Citing MT4Py](#citing-the-dataset)
- [Roadmap](#roadmap)
# Intro
- It has *clean* and *complete* versions (from v0.7):
  - The clean version has 5.1K **type-checked** Python repositories and 1.2M type annotations.
  - The complete version has 5.2K Python repositories and 3.3M type annotations.
- In the clean version, its source files are type-checked using mypy.
- Its projects were processed in JSON-formatted files using the [LibSA4Py](https://github.com/saltudelft/libsa4py) pipeline.
- Its source files were already split into training, validation, and test sets for training ML models.
- It is de-duplicated using [CD4Py](https://github.com/saltudelft/CD4Py).
- It contains **Visible Type Hints** (VTHs), which are the result of a deep, recursive, and dynamic analysis of types from the import statements of source files and their dependencies.
- It was published in the Data Showcase track of the **MSR'21** conference.
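To make the idea of a file-level split concrete, here is a minimal sketch that assigns each source file to a set by hashing its path. It is not the dataset's `split_dataset.py`: the function name `assign_split`, the set labels, the example paths, and the 70/10/20 ratios are all assumptions chosen for illustration.

```python
import hashlib

# Hypothetical ratios for illustration only; not taken from the dataset's scripts.
TRAIN_RATIO, VALID_RATIO = 0.7, 0.1


def assign_split(file_path: str) -> str:
    """Deterministically map a file path to 'train', 'valid', or 'test'."""
    digest = hashlib.sha256(file_path.encode("utf-8")).digest()
    # Turn the first 8 bytes of the hash into a number in [0, 1).
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    if bucket < TRAIN_RATIO:
        return "train"
    if bucket < TRAIN_RATIO + VALID_RATIO:
        return "valid"
    return "test"


# Made-up file paths, only used to show how the files get distributed.
files = ["repos/author/project/module_%d.py" % i for i in range(1000)]
counts = {"train": 0, "valid": 0, "test": 0}
for path in files:
    counts[assign_split(path)] += 1
print(counts)
```

Hashing the path keeps the assignment stable across runs, so a file never moves between sets when the script is executed again.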
# Downloading dataset
The latest version of the dataset is publicly available on [Zenodo](https://doi.org/10.5281/zenodo.4044635).
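After downloading, it can be worth checking the archive against the MD5 checksum that Zenodo lists next to each file. The helper below is a minimal sketch of such a check; the function names are hypothetical, and the expected checksum has to be copied from the record page.

```python
import hashlib
from pathlib import Path


def md5_of_file(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with path.open("rb") as handle:
        # Read fixed-size chunks so large archives do not have to fit in memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_checksum(path: Path, expected_md5: str) -> bool:
    """Compare a file's MD5 digest with the value published for it."""
    return md5_of_file(path) == expected_md5.strip().lower()
```

Calling `matches_checksum(Path(archive), checksum)` with the downloaded archive and the published checksum returns `True` when the file arrived intact.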
# Dataset preparation
We highly recommend downloading the latest version of the dataset as described above. If you want to prepare the dataset manually, follow the instructions below.
## Requirements
* Python 3.5 or newer
* Python dependencies from `scripts/requirements.txt` installed (run `pip install -r scripts/requirements.txt`)
* Install the `libsa4py` package (run `git clone https://github.com/saltudelft/libsa4py.git && cd libsa4py && pip install .`)
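The requirements above can be checked before starting with a few lines of Python. This is only a sketch: the helper name `missing_requirements` is hypothetical, and it assumes the installed package is importable under the module name `libsa4py`.

```python
import importlib.util
import sys

MIN_PYTHON = (3, 5)  # minimum version from the requirements list


def missing_requirements(modules=("libsa4py",)):
    """Return a list of human-readable problems; empty means all checks passed."""
    problems = []
    if sys.version_info < MIN_PYTHON:
        problems.append("Python %d.%d or newer is required" % MIN_PYTHON)
    for name in modules:
        # find_spec returns None when a top-level module cannot be located.
        if importlib.util.find_spec(name) is None:
            problems.append("module %r is not importable" % name)
    return problems


if __name__ == "__main__":
    for problem in missing_requirements():
        print(problem)
```

The check does not cover the packages listed in `scripts/requirements.txt`; their module names can be passed through the `modules` argument.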
## Steps
0. Clone the repositories of the dataset:
```
python -m repo_cloner -i ./mypy-dependents-by-stars.json -o repos
```
1. To reset the cloned repositories to the commits used by ManyTypes4Py, run the following command with `ManyTypes4PyDataset.spec`:
```
./scripts/reset_commits.sh ./ManyTypes4PyDataset.spec repos
```
2. Generate duplicate tokens for the dataset using `cd4py`:
```
cd4py --p repos --ot tokens --od manytypes4py_dataset_duplicates.jsonl.gz --d 1024
```
3. Gather the duplicate files from the `cd4py` output tokens and write them to a single text file (using `collect_dupes.py`):
```
python3 scripts/collect_dupes.py manytypes4py_dataset_duplicates.jsonl.gz manytypes4py_dup_files.txt
```
4. Create a copy of the dataset without the duplicate files collected in the previous step (using `process_dataset.py`):
```
python3 scripts/process_dataset.py repos manytypes4py_dup_files.txt [new dataset path]
```
5. Split the dataset into training, validation, and test sets (using `split_dataset.py`):
```
python3 scripts/split_dataset.py [new dataset path] manytypes4py_split.csv
```
6. To process the Python repositories and produce JSON output files, run the `libsa4py` pipeline as follows:
```
libsa4py process --p [new dataset path] --o [processed projects path] --s manytypes4py_split.csv --j [WORKERS COUNT]
```
Check out the `libsa4py` [README](https://github.com/saltudelft/libsa4py#usage) for more info on its usage.
7. Create a tar archive of the full dataset and its artifacts in one folder:
```
tar -czvf [output path] [dataset artifacts path]
```
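The steps above can also be chained from one script. The following is a sketch of such a driver, not something shipped with the repository: the function names `build_commands` and `run_all` and the example directory names are hypothetical, while the commands themselves are copied from the steps. It defaults to a dry run that only prints the commands.

```python
import shlex
import subprocess


def build_commands(dataset_dir, processed_dir, artifacts_dir, archive_path, workers):
    """Return the preparation commands from the steps above, in order."""
    dupes_gz = "manytypes4py_dataset_duplicates.jsonl.gz"
    dupes_txt = "manytypes4py_dup_files.txt"
    split_csv = "manytypes4py_split.csv"
    return [
        ["python", "-m", "repo_cloner", "-i", "./mypy-dependents-by-stars.json", "-o", "repos"],
        ["./scripts/reset_commits.sh", "./ManyTypes4PyDataset.spec", "repos"],
        ["cd4py", "--p", "repos", "--ot", "tokens", "--od", dupes_gz, "--d", "1024"],
        ["python3", "scripts/collect_dupes.py", dupes_gz, dupes_txt],
        ["python3", "scripts/process_dataset.py", "repos", dupes_txt, dataset_dir],
        ["python3", "scripts/split_dataset.py", dataset_dir, split_csv],
        ["libsa4py", "process", "--p", dataset_dir, "--o", processed_dir,
         "--s", split_csv, "--j", str(workers)],
        ["tar", "-czvf", archive_path, artifacts_dir],
    ]


def run_all(commands, dry_run=True):
    """Print each command; execute it only when dry_run is False."""
    for command in commands:
        print(" ".join(shlex.quote(part) for part in command))
        if not dry_run:
            # check=True stops the pipeline at the first failing step.
            subprocess.run(command, check=True)


if __name__ == "__main__":
    # Placeholder paths and worker count, chosen only for this example.
    run_all(build_commands("dataset_dedup", "processed_projects",
                           "artifacts", "dataset.tar.gz", workers=4))
```

Passing the commands as argument lists rather than shell strings avoids quoting problems when a path contains spaces.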
# Citing the dataset
If you have used the dataset in your research work, please consider citing it.
```
@inproceedings{mt4py2021,
author = {A. M. Mir and E. Latoskinas and G. Gousios},
booktitle = {IEEE/ACM 18th International Conference on Mining Software Repositories (MSR)},
title = {ManyTypes4Py: A Benchmark Python Dataset for Machine Learning-Based Type Inference},
year = {2021},
pages = {585-589},
doi = {10.1109/MSR52588.2021.00079},
publisher = {IEEE Computer Society},
month = {May}
}
```
# Roadmap
- Gather Python projects that depend on type-checkers other than mypy, i.e., pyre, pytype, and pyright.
- Apply type annotations from [typeshed](https://github.com/python/typeshed) to the dataset.