https://github.com/jetbrains-research/astminer
A library for mining of path-based representations of code (and more)
- Host: GitHub
- URL: https://github.com/jetbrains-research/astminer
- Owner: JetBrains-Research
- License: mit
- Created: 2018-12-14T16:37:33.000Z (about 6 years ago)
- Default Branch: master
- Last Pushed: 2023-12-11T04:03:46.000Z (about 1 year ago)
- Last Synced: 2024-12-23T23:09:36.181Z (10 days ago)
- Topics: antlr, code2vec, mining
- Language: Kotlin
- Homepage:
- Size: 1.92 MB
- Stars: 285
- Watchers: 8
- Forks: 80
- Open Issues: 12
Metadata Files:
- Readme: README.md
- License: LICENSE
[![JetBrains Research](https://jb.gg/badges/research.svg)](https://confluence.jetbrains.com/display/ALL/JetBrains+on+GitHub)
![astminer version](https://img.shields.io/badge/astminer-v0.9.0-blue)

# `astminer`
A library for mining [path-based representations of code](https://arxiv.org/pdf/1803.09544.pdf) and more
supported by the
[Machine Learning Methods for Software Engineering](https://research.jetbrains.org/groups/ml_methods)
group at [JetBrains Research](https://research.jetbrains.org).

Supported languages of the input:

| | Java | Python | C/C++ | JavaScript | PHP |
|---------|------|--------|-------|------------|-----|
| ANTLR | ✅ | ✅ | | ✅ | ✅ |
| GumTree | ✅ (JDT and srcML) | ✅ | | | |
| Fuzzy | | | ✅ | | |
| JavaParser | ✅ | | | | |
| TreeSitter | ✅ | | | | |
| JavaLang | ✅ | | | | |

## About
`astminer` lets you create an end-to-end pipeline to process code for machine learning models.

Currently, it supports the extraction of:
* Path-based representations of files/methods
* Raw ASTs of files/methods

`astminer` was first implemented as a part of the pipeline in the [code style extraction project](https://arxiv.org/abs/2002.03997) and later converted into a reusable tool.
It is designed to be easily extensible to new languages.

`astminer` allows you to convert source code cloned from VCSs to formats suitable for training.
To achieve that, `astminer` incorporates the following processing modules:
- [Filters](./docs/filters.md) to remove redundant samples from data.
- [Label extractors](./docs/label_extractors.md) to create a label for each tree.
- [Storages](./docs/storages.md) to define the storage format.

## Usage
There are two ways to use `astminer`:

- [As a standalone CLI tool](#using-astminer-cli) with pre-implemented logic for common processing and mining tasks.
- [Integrated](#using-astminer-as-a-dependency) into your Kotlin/Java mining pipelines as a Gradle dependency.

### Using `astminer` CLI
1. [Build the CLI](./docs/cli.md#Getting+started) from the sources.
2. Prepare your inputs and [configure](./docs/cli.md#Configuration) pipeline options. For config examples, see the [configs](./configs) directory.
3. To run the CLI, pass the config to the shell script:
```shell
./cli.sh
```
Alternatively, you can run the tool inside the [Docker image](./docs/cli.md#Docker).

### Using `astminer` as a dependency
#### Import
`astminer` is available in the JetBrains Space package repository. You can add the dependency in your `build.gradle` file:
```
repositories {
    maven {
        url "https://packages.jetbrains.team/maven/p/astminer/astminer"
    }
}

dependencies {
    implementation 'io.github.vovak:astminer:'
}
```

If you use `build.gradle.kts`:
```
repositories {
    maven(url = uri("https://packages.jetbrains.team/maven/p/astminer/astminer"))
}

dependencies {
    implementation("io.github.vovak:astminer:")
}
```

#### Local development
To use a specific version of the library, navigate to the required branch and build a local version of `astminer`:
```shell
./gradlew publishToMavenLocal
```
After that, add `mavenLocal()` to the `repositories` section of your Gradle configuration.

#### Examples
If you want to use `astminer` as a library in your Java/Kotlin-based data mining tool, check the following usage examples:
* Simple standalone [example scripts](src/examples) in Java and Kotlin that call different `astminer` APIs.
* [psiminer](https://github.com/JetBrains-Research/psiminer), a mining tool that uses `astminer` to extract paths from PSI trees. See the [code2seq storage implementation](https://github.com/JetBrains-Research/psiminer/blob/master/psiminer-core/src/main/kotlin/storage/paths/Code2SeqStorage.kt).

Please consider trying Kotlin for your data mining pipelines: in our experience, it is much better suited to data collection and transformation tools than Java.
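
To make the stages described in the About section more concrete, here is a minimal, self-contained Kotlin sketch of a filter, label extractor, and storage pipeline that emits path contexts. It does not use `astminer` itself: `Node`, `PathContext`, `Sample`, `TreeFilter`, `LabelExtractor`, `Storage`, and every function below are hypothetical names introduced only for illustration, and the path format (node types joined by `↑` and `↓` through the lowest common ancestor of two leaves) is one simple reading of the path-based representation, not necessarily the exact format `astminer` produces.

```kotlin
// Illustrative sketch only: none of these types belong to astminer's API.

// A toy AST node: every node has a type, leaves also carry a token.
class Node(val type: String, val token: String? = null, val children: List<Node> = emptyList())

// One path context: start token, path over node types, end token.
data class PathContext(val start: String, val path: String, val end: String)

// A labeled training sample built from one tree.
data class Sample(val label: String, val contexts: List<PathContext>)

// Hypothetical pipeline stages: filter, label extractor, storage.
fun interface TreeFilter { fun accept(root: Node): Boolean }
fun interface LabelExtractor { fun label(root: Node): String? }
fun interface Storage { fun store(sample: Sample) }

// Returns the root-to-leaf chain of nodes for every leaf, left to right.
fun leafChains(node: Node, prefix: List<Node> = emptyList()): List<List<Node>> {
    val chain = prefix + node
    return if (node.children.isEmpty()) listOf(chain)
    else node.children.flatMap { leafChains(it, chain) }
}

// Builds one path context per pair of leaves: up from the first leaf to the
// lowest common ancestor, then down to the second leaf.
// Assumes each Node instance appears only once in the tree.
fun extractPathContexts(root: Node): List<PathContext> {
    val chains = leafChains(root)
    val contexts = mutableListOf<PathContext>()
    for (i in chains.indices) {
        for (j in i + 1 until chains.size) {
            val a = chains[i]
            val b = chains[j]
            // Count shared ancestors; both chains start at the root, so common >= 1.
            var common = 0
            while (common < minOf(a.size, b.size) && a[common] === b[common]) common++
            val up = a.subList(common - 1, a.size).reversed().joinToString("↑") { it.type }
            val down = b.subList(common, b.size).joinToString("") { "↓" + it.type }
            val start = a.last().token ?: a.last().type
            val end = b.last().token ?: b.last().type
            contexts += PathContext(start, up + down, end)
        }
    }
    return contexts
}

fun countNodes(node: Node): Int = 1 + node.children.sumOf { countNodes(it) }

// Applies all filters, extracts a label, and stores the resulting sample.
fun runPipeline(trees: List<Node>, filters: List<TreeFilter>, extractor: LabelExtractor, storage: Storage) {
    for (tree in trees) {
        if (!filters.all { it.accept(tree) }) continue   // drop filtered-out trees
        val label = extractor.label(tree) ?: continue    // skip trees without a label
        storage.store(Sample(label, extractPathContexts(tree)))
    }
}

// Toy tree for a method named `inc` whose body is `x = y + 1`.
fun sampleMethod(): Node = Node("Method", children = listOf(
    Node("Name", "inc"),
    Node("Assign", children = listOf(
        Node("Name", "x"),
        Node("BinOp", children = listOf(Node("Name", "y"), Node("Literal", "1")))
    ))
))

fun main() {
    val stored = mutableListOf<Sample>()
    runPipeline(
        trees = listOf(sampleMethod()),
        filters = listOf(TreeFilter { countNodes(it) <= 100 }),
        extractor = LabelExtractor { it.children.firstOrNull()?.token },
        storage = Storage { stored.add(it) }
    )
    for (sample in stored) {
        for (c in sample.contexts) println("${sample.label}: ${c.start},${c.path},${c.end}")
    }
}
```

Keeping each stage behind a small functional interface means a filter, label extractor, or storage can be swapped without touching the others, which mirrors the modular split described above. For real pipelines, see the linked example scripts and module documentation rather than this sketch.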
## Contribution
We believe that `astminer` can find use beyond our own mining tasks.
Please help make `astminer` easier to use by sharing your use cases. Pull requests are welcome as well.
Support for other languages and documentation are the key areas of improvement.

## Citing `astminer`
A [paper](https://zenodo.org/record/2595271) dedicated to `astminer` (more precisely, to its older version [PathMiner](https://github.com/vovak/astminer/tree/pathminer)) was presented at [MSR'19](https://2019.msrconf.org/).
If you use `astminer` in your academic work, please cite it.
```
@inproceedings{kovalenko2019pathminer,
  title={PathMiner: a library for mining of path-based representations of code},
  author={Kovalenko, Vladimir and Bogomolov, Egor and Bryksin, Timofey and Bacchelli, Alberto},
  booktitle={Proceedings of the 16th International Conference on Mining Software Repositories},
  pages={13--17},
  year={2019},
  organization={IEEE Press}
}
```