https://github.com/saltudelft/many-types-4-py-dataset

ManyTypes4Py: A benchmark Python dataset for machine learning-based type inference

benchmark clean dataset machine-learning manytypes4py msr mt4py python type-annotations type-checked type-inference visible-type-hints


Host: GitHub
URL: https://github.com/saltudelft/many-types-4-py-dataset
Owner: saltudelft
License: apache-2.0
Created: 2020-09-16T15:21:33.000Z (almost 6 years ago)
Default Branch: master
Last Pushed: 2022-03-27T08:52:57.000Z (over 4 years ago)
Last Synced: 2025-09-05T11:49:41.316Z (10 months ago)
Topics: benchmark, clean, dataset, machine-learning, manytypes4py, msr, mt4py, python, type-annotations, type-checked, type-inference, visible-type-hints
Language: Jupyter Notebook
Homepage:
Size: 14.6 MB
Stars: 23
Watchers: 6
Forks: 5
Open Issues: 1
Metadata Files:
- Readme: README.md
- Changelog: CHANGELOG.md
- License: LICENSE
- Citation: CITATION.bib


# ManyTypes4Py: A Benchmark Python Dataset for Machine Learning-based Type Inference

[![DOI](https://zenodo.org/badge/DOI/10.5281/zenodo.4719447.svg)](https://doi.org/10.5281/zenodo.4719447)

- [Intro](#intro)

- [Download](#downloading-dataset)

- [Preparation](#dataset-preparation)

- [Citing MT4Py](#citing-the-dataset)

- [Roadmap](#roadmap)

# Intro

- It has *clean* and *complete* versions (from v0.7):

  - The clean version has 5.1K **type-checked** Python repositories and 1.2M type annotations.

  - The complete version has 5.2K Python repositories and 3.3M type annotations.

- Its source files are type-checked using mypy (clean version).

- Its projects were processed into JSON-formatted files using the [LibSA4Py](https://github.com/saltudelft/libsa4py) pipeline.

- Its source files were already split into training, validation, and test sets for training ML models.

- It is de-duplicated using [CD4Py](https://github.com/saltudelft/CD4Py).

- It contains **Visible Type Hints** (VTHs), which are obtained through a deep, recursive, and dynamic analysis of the types reachable from the import statements of source files and their dependencies.

- It was published in the Data Showcase track of the **MSR'21** conference.
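
To make the de-duplication and splitting bullets above more concrete, here is a minimal Python sketch. It is not the code used by CD4Py or by the dataset scripts: the cluster format, the helper names `keep_one_per_cluster` and `split_files`, and the 70/10/20 ratio are all assumptions made for illustration.

```python
import random
from typing import Dict, Iterable, List, Set


def keep_one_per_cluster(files: Iterable[str],
                         clusters: Iterable[Set[str]]) -> List[str]:
    """Drop all but one file from every cluster of duplicate files.

    `clusters` is a hypothetical input format: each set holds paths that a
    duplicate detector reported as duplicates of one another.
    """
    to_drop = set()  # type: Set[str]
    for cluster in clusters:
        if not cluster:
            continue
        # Keep the lexicographically smallest path so the result is deterministic.
        keeper = min(cluster)
        to_drop.update(cluster - {keeper})
    return [f for f in files if f not in to_drop]


def split_files(files: List[str], seed: int = 42,
                train_pct: int = 70, valid_pct: int = 10) -> Dict[str, List[str]]:
    """Shuffle files with a fixed seed and cut them into train/valid/test.

    The default 70/10/20 ratio is an assumption, not a value taken from
    the dataset description.
    """
    shuffled = sorted(files)
    random.Random(seed).shuffle(shuffled)
    n_train = len(shuffled) * train_pct // 100
    n_valid = len(shuffled) * valid_pct // 100
    return {
        "train": shuffled[:n_train],
        "valid": shuffled[n_train:n_train + n_valid],
        "test": shuffled[n_train + n_valid:],
    }


if __name__ == "__main__":
    all_files = ["repo_a/x.py", "repo_a/y.py", "repo_b/x_copy.py", "repo_c/z.py"]
    dup_clusters = [{"repo_a/x.py", "repo_b/x_copy.py"}]
    unique = keep_one_per_cluster(all_files, dup_clusters)
    print(unique)
    print(split_files(unique))
```

Removing duplicates before splitting matters because a file that appears in both the training and the test set would inflate the measured accuracy of a model.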

# Downloading dataset

The latest version of the dataset is publicly available on [Zenodo](https://doi.org/10.5281/zenodo.4044635).

# Dataset preparation

We highly recommend downloading the latest version of the dataset as described above. If you want to prepare the dataset manually, follow the instructions below.

## Requirements

* Python 3.5 or newer

* Python dependencies from `scripts/requirements.txt` installed (run `pip install -r scripts/requirements.txt`)

* The `libsa4py` package installed (run `git clone https://github.com/saltudelft/libsa4py.git && cd libsa4py && pip install .`)
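
Before starting, the requirements above can be checked with a short script. This is only a sketch: it assumes that the `libsa4py` package is importable under the module name `libsa4py`, and it does not verify the individual entries of `scripts/requirements.txt`. The function name `check_environment` is hypothetical.

```python
import importlib.util
import sys


def check_environment(min_version=(3, 5), packages=("libsa4py",)):
    """Return a dict telling whether each requirement appears to be met."""
    report = {"python": sys.version_info[:2] >= tuple(min_version)}
    for name in packages:
        # find_spec returns None when a top-level package cannot be found.
        report[name] = importlib.util.find_spec(name) is not None
    return report


if __name__ == "__main__":
    for requirement, ok in check_environment().items():
        print("{}: {}".format(requirement, "ok" if ok else "MISSING"))
```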

## Steps

0. Clone the repositories of the dataset:

    ```
    python -m repo_cloner -i ./mypy-dependents-by-stars.json -o repos
    ```

    

1. To reset the cloned repositories to the state used by ManyTypes4Py, run the following command with the `ManyTypes4PyDataset.spec` file:

    ```
    ./scripts/reset_commits.sh ./ManyTypes4PyDataset.spec repos
    ```

2. Generate tokens and detect duplicates in the dataset using `cd4py`:

    ```
    cd4py --p repos --ot tokens --od manytypes4py_dataset_duplicates.jsonl.gz --d 1024
    ```

3. Collect the duplicate files from the `cd4py` output and write them to a single text file (using `collect_dupes.py`):

    ```
    python3 scripts/collect_dupes.py manytypes4py_dataset_duplicates.jsonl.gz manytypes4py_dup_files.txt
    ```

4. Create a copy of the dataset with the previously collected duplicate files removed (using `process_dataset.py`):

    ```
    python3 scripts/process_dataset.py repos manytypes4py_dup_files.txt [new dataset path]
    ```

5. Split the dataset into training, validation, and test sets (using `split_dataset.py`):

    ```
    python3 scripts/split_dataset.py [new dataset path] manytypes4py_split.csv
    ```

6. To process the Python repositories and produce JSON output files, run the `libsa4py` pipeline as follows:

    ```
    libsa4py process --p [new dataset path] --o [processed projects path] --s manytypes4py_split.csv --j [WORKERS COUNT]
    ```

    Check out the `libsa4py` [README](https://github.com/saltudelft/libsa4py#usage) for more info on its usage.

    

7. Create a tar archive of the full dataset and its artifacts in one folder:

    ```
    tar -czvf [output path] [dataset artifacts path]
    ```
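
The steps above can also be chained from a small Python driver. The sketch below only assembles the commands exactly as they are written in the steps and, by default, prints them without running anything. The function names `build_commands` and `run_pipeline`, their parameters, and the default worker count are hypothetical and not part of the repository.

```python
import shlex
import subprocess
from typing import List


def build_commands(repos: str, dataset: str, processed: str, artifacts: str,
                   archive: str, workers: int = 4) -> List[List[str]]:
    """Return one argv list per preparation step, in the order given above."""
    dupes = "manytypes4py_dataset_duplicates.jsonl.gz"
    dup_files = "manytypes4py_dup_files.txt"
    split = "manytypes4py_split.csv"
    return [
        ["python", "-m", "repo_cloner", "-i", "./mypy-dependents-by-stars.json", "-o", repos],
        ["./scripts/reset_commits.sh", "./ManyTypes4PyDataset.spec", repos],
        ["cd4py", "--p", repos, "--ot", "tokens", "--od", dupes, "--d", "1024"],
        ["python3", "scripts/collect_dupes.py", dupes, dup_files],
        ["python3", "scripts/process_dataset.py", repos, dup_files, dataset],
        ["python3", "scripts/split_dataset.py", dataset, split],
        ["libsa4py", "process", "--p", dataset, "--o", processed,
         "--s", split, "--j", str(workers)],
        ["tar", "-czvf", archive, artifacts],
    ]


def run_pipeline(commands: List[List[str]], dry_run: bool = True) -> List[str]:
    """Print each command; execute it only when dry_run is False."""
    rendered = []
    for argv in commands:
        line = " ".join(shlex.quote(part) for part in argv)
        rendered.append(line)
        print(line)
        if not dry_run:
            # check=True stops the pipeline at the first failing step.
            subprocess.run(argv, check=True)
    return rendered


if __name__ == "__main__":
    cmds = build_commands("repos", "dataset_clean", "processed_projects",
                          "artifacts", "manytypes4py.tar.gz")
    run_pipeline(cmds, dry_run=True)
```

Keeping the dry run as the default makes it easy to review the full command sequence before committing to the long-running cloning and processing steps.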

# Citing the dataset

If you have used the dataset in your research work, please consider citing it.

```
@inproceedings{mt4py2021,
  author = {A. M. Mir and E. Latoskinas and G. Gousios},
  booktitle = {IEEE/ACM 18th International Conference on Mining Software Repositories (MSR)},
  title = {ManyTypes4Py: A Benchmark Python Dataset for Machine Learning-Based Type Inference},
  year = {2021},
  pages = {585-589},
  doi = {10.1109/MSR52588.2021.00079},
  publisher = {IEEE Computer Society},
  month = {May}
}
```

# Roadmap

- Gather Python projects that depend on type checkers other than mypy, i.e., pyre, pytype, and pyright.

- Apply type annotations from [typeshed](https://github.com/python/typeshed) to the dataset.
