{"id":43324412,"url":"https://github.com/saltudelft/many-types-4-py-dataset","last_synced_at":"2026-02-01T23:04:33.621Z","repository":{"id":53579748,"uuid":"296071486","full_name":"saltudelft/many-types-4-py-dataset","owner":"saltudelft","description":"ManyTypes4Py: A benchmark Python dataset for machine learning-based type inference","archived":false,"fork":false,"pushed_at":"2022-03-27T08:52:57.000Z","size":15360,"stargazers_count":23,"open_issues_count":1,"forks_count":5,"subscribers_count":6,"default_branch":"master","last_synced_at":"2025-09-05T11:49:41.316Z","etag":null,"topics":["benchmark","clean","dataset","machine-learning","manytypes4py","msr","mt4py","python","type-annotations","type-checked","type-inference","visible-type-hints"],"latest_commit_sha":null,"homepage":"","language":"Jupyter Notebook","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/saltudelft.png","metadata":{"files":{"readme":"README.md","changelog":"CHANGELOG.md","contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":"CITATION.bib","codeowners":null,"security":null,"support":null}},"created_at":"2020-09-16T15:21:33.000Z","updated_at":"2025-05-05T02:21:17.000Z","dependencies_parsed_at":"2022-09-09T16:30:59.227Z","dependency_job_id":null,"html_url":"https://github.com/saltudelft/many-types-4-py-dataset","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/saltudelft/many-types-4-py-dataset","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/saltudelft%2Fmany-types-4-py-dataset","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/saltudelft%2Fmany-types-4-py-dataset/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/saltudelft%2Fmany-types-4-p
y-dataset/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/saltudelft%2Fmany-types-4-py-dataset/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/saltudelft","download_url":"https://codeload.github.com/saltudelft/many-types-4-py-dataset/tar.gz/refs/heads/master","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/saltudelft%2Fmany-types-4-py-dataset/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":28993792,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-02-01T22:01:47.507Z","status":"ssl_error","status_checked_at":"2026-02-01T21:58:37.335Z","response_time":56,"last_error":"SSL_connect returned=1 errno=0 peeraddr=140.82.121.5:443 state=error: unexpected eof while reading","robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":false,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["benchmark","clean","dataset","machine-learning","manytypes4py","msr","mt4py","python","type-annotations","type-checked","type-inference","visible-type-hints"],"created_at":"2026-02-01T23:04:32.970Z","updated_at":"2026-02-01T23:04:33.608Z","avatar_url":"https://github.com/saltudelft.png","language":"Jupyter Notebook","funding_links":[],"categories":[],"sub_categories":[],"readme":"# ManyTypes4Py: A Benchmark Python Dataset for Machine Learning-based Type Inference\n[![DOI](https://zenodo.org/badge/DOI/10.5281/zenodo.4719447.svg)](https://doi.org/10.5281/zenodo.4719447)\n\n- [Intro](#intro)\n- 
[Download](#downloading-dataset)\n- [Preparation](#dataset-preparation)\n- [Citing MT4Py](#citing-the-dataset)\n- [Roadmap](#roadmap)\n\n# Intro\n- It has *clean* and *complete* versions (from v0.7):\n  - The clean version has 5.1K **type-checked** Python repositories and 1.2M type annotations.\n  - The complete version has 5.2K Python repositories and 3.3M type annotations.\n- Its source files are type-checked using mypy (clean version).\n- Its projects were processed into JSON-formatted files using the [LibSA4Py](https://github.com/saltudelft/libsa4py) pipeline.\n- Its source files are already split into training, validation, and test sets for training ML models.\n- It is de-duplicated using [CD4Py](https://github.com/saltudelft/CD4Py).\n- It contains **Visible Type Hints** (VTHs), which is a deep, recursive, and dynamic analysis of types from the import statements of source files and their dependencies.\n- It was published in the Data Showcase of the **MSR'21** conference.\n\n# Downloading dataset\nThe latest version of the dataset is publicly available on [Zenodo](https://doi.org/10.5281/zenodo.4044635).\n\n# Dataset preparation\nWe highly recommend downloading the latest version of the dataset as described above. If you want to prepare the dataset manually, follow the instructions below.\n\n## Requirements\n\n* Python 3.5 or newer\n* Python dependencies from `scripts/requirements.txt` installed (run `pip install -r scripts/requirements.txt`)\n* The `libsa4py` package installed (run `git clone https://github.com/saltudelft/libsa4py.git \u0026\u0026 cd libsa4py \u0026\u0026 pip install .`)\n\n## Steps\n\n0. Clone the dataset's repositories:\n\n    ```\n    python -m repo_cloner -i ./mypy-dependents-by-stars.json -o repos\n    ```\n    \n1. To reset the cloned repositories to the commits used by ManyTypes4Py, run the following command with `ManyTypes4PyDataset.spec`:\n    \n    ```\n    ./scripts/reset_commits.sh ./ManyTypes4PyDataset.spec repos\n    ``` \n\n2. 
Generate duplicate tokens for the dataset using `cd4py`\n\n    ```\n    cd4py --p repos --ot tokens --od manytypes4py_dataset_duplicates.jsonl.gz --d 1024\n    ```\n\n3. Gather the duplicate files from the `cd4py` output tokens and write them to a single text file (using `collect_dupes.py`)\n\n    ```\n    python3 scripts/collect_dupes.py manytypes4py_dataset_duplicates.jsonl.gz manytypes4py_dup_files.txt\n    ```\n\n4. Create a copy of the dataset with the previously collected duplicate files removed (using `process_dataset.py`)\n\n    ```\n    python3 scripts/process_dataset.py repos manytypes4py_dup_files.txt [new dataset path]\n    ```\n\n5. Split the dataset into training, validation, and test sets (using `split_dataset.py`)\n\n    ```\n    python3 scripts/split_dataset.py [new dataset path] manytypes4py_split.csv\n    ```\n\n6. To process the Python repositories and produce JSON output files, run the `libsa4py` pipeline as follows:\n\n    ```\n    libsa4py process --p [new dataset path] --o [processed projects path] --s manytypes4py_split.csv --j [WORKERS COUNT]\n    ```\n\n    Check out the `libsa4py` [README](https://github.com/saltudelft/libsa4py#usage) for more info on its usage.\n    \n7. Create a tar archive of the full dataset \u0026 artifacts in one folder\n\n    ```\n    tar -czvf [output path] [dataset artifacts path]\n    ```\n\n# Citing the dataset\nIf you have used the dataset in your research work, please consider citing it.\n\n```\n@inproceedings{mt4py2021,\nauthor = {A. M. Mir and E. Latoskinas and G. 
Gousios},\nbooktitle = {IEEE/ACM 18th International Conference on Mining Software Repositories (MSR)},\ntitle = {ManyTypes4Py: A Benchmark Python Dataset for Machine Learning-Based Type Inference},\nyear = {2021},\npages = {585-589},\ndoi = {10.1109/MSR52588.2021.00079},\npublisher = {IEEE Computer Society},\nmonth = {May}\n}\n```\n\n# Roadmap\n- Gather Python projects that depend on type checkers other than mypy, i.e., pyre, pytype, and pyright.\n- Apply type annotations from [typeshed](https://github.com/python/typeshed) to the dataset.","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fsaltudelft%2Fmany-types-4-py-dataset","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fsaltudelft%2Fmany-types-4-py-dataset","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fsaltudelft%2Fmany-types-4-py-dataset/lists"}