{"id":21168562,"url":"https://github.com/jetbrains-research/psiminer","last_synced_at":"2025-07-09T18:31:34.154Z","repository":{"id":42022142,"uuid":"248702388","full_name":"JetBrains-Research/psiminer","owner":"JetBrains-Research","description":"A Tool for Mining Rich Abstract Syntax Trees from Code","archived":false,"fork":false,"pushed_at":"2023-07-26T21:42:17.000Z","size":771,"stargazers_count":58,"open_issues_count":13,"forks_count":12,"subscribers_count":9,"default_branch":"main","last_synced_at":"2025-04-05T04:51:12.483Z","etag":null,"topics":["data-mining","mining","ml4code","ml4se"],"latest_commit_sha":null,"homepage":"","language":"Kotlin","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/JetBrains-Research.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2020-03-20T08:17:32.000Z","updated_at":"2024-11-11T15:03:49.000Z","dependencies_parsed_at":"2023-02-19T10:00:37.348Z","dependency_job_id":null,"html_url":"https://github.com/JetBrains-Research/psiminer","commit_stats":null,"previous_names":[],"tags_count":2,"template":false,"template_full_name":null,"purl":"pkg:github/JetBrains-Research/psiminer","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/JetBrains-Research%2Fpsiminer","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/JetBrains-Research%2Fpsiminer/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/JetBrains-Research%2Fpsiminer/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/JetBrains-Research%2Fpsiminer/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/JetBrain
s-Research","download_url":"https://codeload.github.com/JetBrains-Research/psiminer/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/JetBrains-Research%2Fpsiminer/sbom","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":264502387,"owners_count":23618587,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["data-mining","mining","ml4code","ml4se"],"created_at":"2024-11-20T15:15:09.088Z","updated_at":"2025-07-09T18:31:32.965Z","avatar_url":"https://github.com/JetBrains-Research.png","language":"Kotlin","funding_links":[],"categories":[],"sub_categories":[],"readme":"# `PSIMiner`\n\n[![JetBrains Research](https://jb.gg/badges/research.svg)](https://confluence.jetbrains.com/display/ALL/JetBrains+on+GitHub)\n\n`PSIMiner` is a tool for processing PSI trees from the IntelliJ Platform.\nPSI trees contain code syntax trees as well as functions to work with them,\nand therefore can be used to enrich code representation using the static analysis algorithms of modern IDEs.\n\n`PSIMiner` is a plugin for IntelliJ IDEA that runs it in headless mode and creates datasets for ML pipelines.\n\nThe complete documentation of the different parts is stored in the [docs](./docs) folder.\n\n## Installation\n\n`PSIMiner` requires Java 11 to work correctly.\nMake sure Gradle uses the correct version.\nAll other dependencies will be installed automatically.\n\nUse `./gradlew build` (or `gradlew.bat build` on Windows) to build the tool.\n\n## Usage\n\nThere are predefined run configurations compatible with IntelliJ IDEA.\nOpen or 
import the project in it and run the tool on the test data or start the tests.\nYou can modify these configurations to suit your needs.\n\nIt is also possible to run the tool through the CLI.\nThe predefined shell script is the recommended way (Unix systems only):\n```shell\n./psiminer.sh $dataset_path $output_folder $JSON_config\n```\n\n### Logs\n\n`PSIMiner` automatically stores logs in the user's home directory on each run.\nCheck `~/psiminer.log` (or something like `C:\\Users\\yourusername\\psiminer.log` on Windows) and share it when describing your problem.\n\n## Configuration\n\n`PSIMiner` is configured entirely by JSON.\nCheck the examples in the [configs](configs) folder.\n\nLogically, `PSIMiner` consists of the following parts.\nFull documentation for them is in the [docs](./docs) folder:\n- [Tree transformations](./docs/tree_transormations.md):\nan interface for enriching trees with new information and other useful manipulations,\ne.g. resolving types or excluding whitespace.\n- [Filters](./docs/filters.md):\nan interface for removing *bad* trees from the data, e.g. trees that are too big.\n- [Label extractor](./docs/label_extractors.md):\nan interface that defines how labels are extracted from raw trees,\ne.g. extracting the method name for each method.\n- [Storage](./docs/storages.md):\nan interface that defines how trees are saved to disk,\ne.g. 
code2seq format or JSONL format.\n\nThere are also a few fields that define the parser and pipeline options,\nfor example, `Language`.\n\n## Additional preprocessing\n\nIf you turn on additional preprocessing:\n* ✅ more projects will be opened successfully by IDEA\n* ⚠️ files in your original dataset will be **changed**\n\n[More about additional preprocessing](docs/preprocessing.md)\n\n## Language support\n\nCurrently, `PSIMiner` supports `Java` and `Kotlin` datasets.\nHowever, we designed the tool so that it can be extended to new languages.\nSince `PSI` trees support a large number of languages,\nadding a new language to the tool requires implementing only a few interfaces.\n\nBe aware that multiple tree transformations can't be adapted to new languages automatically\nand therefore require manual work to support a new language.\n\nIf you would like to see new languages supported, don't hesitate to create an issue requesting them,\nor even implement them yourself and create a pull request.\n\n## Use as dependency\n\nYou can reuse different parts of `PSIMiner` inside your own tool, e.g. 
a plugin for model inference.\nTo add the core part of the tool (without the CLI dependency), add the following code to your gradle.kts file:\n```\ndependencies {\n    implementation(\"org.jetbrains.research.psiminer:psiminer-core\") {\n        version {\n            branch = \"main\"\n        }\n    }\n}\n```\n\nRemember that `PSIMiner` is a *plugin* for IntelliJ IDEA and therefore can be integrated only into another plugin.\n\n## Citation\n\nThe [paper](https://ieeexplore.ieee.org/document/9463105)\ndedicated to `PSIMiner` was published at MSR'21.\nIf you use `PSIMiner` in your academic work, please cite it.\n```\n@inproceedings{spirin_psiminer,\n  author={Spirin, Egor and Bogomolov, Egor and Kovalenko, Vladimir and Bryksin, Timofey},\n  booktitle={2021 IEEE/ACM 18th International Conference on Mining Software Repositories (MSR)}, \n  title={PSIMiner: A Tool for Mining Rich Abstract Syntax Trees from Code}, \n  year={2021},\n  pages={13-17},\n  doi={10.1109/MSR52588.2021.00014}\n}\n```\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fjetbrains-research%2Fpsiminer","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fjetbrains-research%2Fpsiminer","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fjetbrains-research%2Fpsiminer/lists"}