https://github.com/src-d/style-analyzer
Lookout Style Analyzer: fixing code formatting and typos during code reviews
- Host: GitHub
- URL: https://github.com/src-d/style-analyzer
- Owner: src-d
- License: agpl-3.0
- Created: 2018-07-04T16:42:30.000Z (over 7 years ago)
- Default Branch: master
- Last Pushed: 2022-11-23T13:27:09.000Z (almost 3 years ago)
- Last Synced: 2025-05-05T05:05:11.995Z (6 months ago)
- Topics: lookout, mloncode
- Language: Jupyter Notebook
- Homepage:
- Size: 107 MB
- Stars: 32
- Watchers: 8
- Forks: 21
- Open Issues: 43
Metadata Files:
- Readme: README.md
- License: LICENSE.md
README

# style-analyzer
Fix code style faults using 🤖
[Documentation](https://readthedocs.org/projects/style-analyzer/)
[Travis CI](https://travis-ci.com/src-d/style-analyzer)
[Codecov](https://codecov.io/github/src-d/style-analyzer)
[Docker Hub](https://hub.docker.com/r/srcd/style-analyzer)
[PyPI](https://pypi.python.org/pypi/lookout-style)
[AGPL-3.0](https://opensource.org/licenses/AGPL-3.0)

[Overview](#overview) • [How To Use](#how-to-use) • [Science](#science) • [Contributions](#contributions) • [License](#license)
## Overview
This is a collection of analyzers for [Lookout](https://github.com/src-d/lookout) - the open source framework for code review intelligence.
You can run them directly on your Git repositories, but most likely you would rather
use the upcoming code review product from [source{d}](https://sourced.tech).
Overall, this project is a mix of research ideas and their applications to solving real problems.
Consider it an experiment at this stage.
Currently, the "format" analyzer is working and the "typos" analyzer is incubating. All current and future
analyzers are based on machine learning and never contain any hidden domain knowledge such as static
code analysis rules or human-written pattern matchers.
* [`lookout.style.format`](lookout/style/format) - mine "white box" code formatting rules with machine learning and validate new code against them.
* [`lookout.style.typos`](lookout/style/typos) - find typos in identifier names, using the dataset of 60 million identifiers already present in open source repositories on GitHub.
The "format" analyzer supports only JavaScript for now, though it is not tied to that language and
is based on the language-agnostic [Babelfish](https://doc.bblf.sh/) parser. Everything is written in Python.
## How To Use
There are several ways to run style-analyzer:
* [Developer's setup](doc/how-to/developer.md)
* [Demonstration setup](doc/how-to/demo.md)
* [Reports](doc/how-to/reports.md)
## Science
The implemented analyzers are driven by bleeding-edge research. One day we will write papers about them,
but first we want to focus on making them work. Below are brief descriptions of how the analyzers
are designed.
#### format
The core of the format analyzer is a language model: we learn without labeled data, simply by modeling the existing formatting in a repository given the surrounding code at each point in a file. We then check whether proposed changes follow the learned formatting conventions.
The training algorithm is summarized below.
1. Represent a file as a linear sequence of "virtual" nodes. Some nodes correspond to the UAST nodes, and some are inserted to mirror the real tokens in the code which are not present in the UAST (e.g. white spaces, keywords, quotes or braces).
2. Identify the nodes which we use as labels - that is, identify the Y-s in the (X, Y) training samples. We have around 50 classes at the moment. Some of the classes are sequences of nodes, e.g. a four-space indentation increase. We also predict NOOP-s: the empty gaps between non-Y nodes.
3. Extract features from the nodes surrounding the Y nodes. We take a fixed-size window and record the internal types, roles, positions and unique identifiers (for tokens which are not present in the UAST) for the left and right siblings and the parent hierarchy (2-3 levels). The features for the left and for the right siblings are different so that we avoid the information "leakage". For example, the difference in offsets between the left and the right neighbor defines the exact length of the predicted token in between.
4. We train a random forest model on the collected (X, Y) dataset. We fine-tune it with Bayesian optimization.
5. We extract the rules - the branches of the trees. We prune them in several steps: first we exclude the rules which do not improve the accuracy, then we remove the rule parts which are redundant.
6. We apply a 93% rule confidence threshold - where confidence is precision on the training set - and discard the rules below it. This gives ~95% joint precision.
7. The remaining rules are our model - the training result.
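
Steps 5 and 6 can be illustrated with a minimal, self-contained sketch. It is not the project's code: the tree is written by hand instead of being trained, the pruning in step 5 is omitted, and every name in it (`Leaf`, `Split`, `Rule`, `extract_rules`, `score_rules`, `filter_rules`, the two feature names) is hypothetical. It only shows how each root-to-leaf branch of a tree becomes a rule, and how rules below a confidence threshold are discarded.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple, Union

Sample = Dict[str, int]
Condition = Tuple[str, str, float]  # (feature, "<=" or ">", threshold)


@dataclass
class Leaf:
    label: str  # predicted class, e.g. a whitespace token


@dataclass
class Split:
    feature: str
    threshold: float
    left: Union["Split", Leaf]   # taken when sample[feature] <= threshold
    right: Union["Split", Leaf]  # taken when sample[feature] > threshold


@dataclass
class Rule:
    conditions: List[Condition]
    label: str
    confidence: float = 0.0  # precision on the training set
    support: int = 0         # number of training samples the rule matches


def extract_rules(node, path=()) -> List[Rule]:
    """Turn every root-to-leaf branch of a tree into one rule."""
    if isinstance(node, Leaf):
        return [Rule(list(path), node.label)]
    left = extract_rules(node.left, path + ((node.feature, "<=", node.threshold),))
    right = extract_rules(node.right, path + ((node.feature, ">", node.threshold),))
    return left + right


def matches(rule: Rule, sample: Sample) -> bool:
    """Check that the sample satisfies every condition of the rule."""
    for feature, op, threshold in rule.conditions:
        below = sample[feature] <= threshold
        if below != (op == "<="):
            return False
    return True


def score_rules(rules: List[Rule], X: List[Sample], y: List[str]) -> List[Rule]:
    """Set each rule's confidence to its precision on the training set."""
    for rule in rules:
        hits = [label for sample, label in zip(X, y) if matches(rule, sample)]
        rule.support = len(hits)
        rule.confidence = hits.count(rule.label) / len(hits) if hits else 0.0
    return rules


def filter_rules(rules: List[Rule], threshold: float = 0.93) -> List[Rule]:
    """Keep only the rules whose confidence reaches the threshold."""
    return [rule for rule in rules if rule.confidence >= threshold]


# Hand-written toy tree over two made-up binary features.
tree = Split("left_is_brace", 0.5,
             Leaf("NOOP"),
             Split("parent_is_block", 0.5, Leaf("SPACE"), Leaf("NEWLINE")))

X = [
    {"left_is_brace": 0, "parent_is_block": 0},
    {"left_is_brace": 0, "parent_is_block": 1},
    {"left_is_brace": 0, "parent_is_block": 0},
    {"left_is_brace": 0, "parent_is_block": 1},
    {"left_is_brace": 1, "parent_is_block": 0},
    {"left_is_brace": 1, "parent_is_block": 0},
    {"left_is_brace": 1, "parent_is_block": 1},
    {"left_is_brace": 1, "parent_is_block": 1},
    {"left_is_brace": 1, "parent_is_block": 1},
]
y = ["NOOP", "NOOP", "NOOP", "SPACE",
     "SPACE", "SPACE", "NEWLINE", "NEWLINE", "NEWLINE"]

rules = score_rules(extract_rules(tree), X, y)
kept = filter_rules(rules)
for rule in kept:
    print(rule.conditions, "->", rule.label, rule.confidence)
```

In this toy data the `NOOP` rule is right on only 3 of its 4 matching samples, so it falls below the 93% threshold and is dropped, while the two remaining rules are kept.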
The application algorithm is much simpler - we take the rules and apply them. However, there are several quirks:
1. In case several rules are triggered, the rule with the highest confidence wins.
2. Some of the predicted tokens are paired, such as quotes. Two rules may contradict each other, so that the left and the right quotes are predicted to be different. We pick the most confident prediction and change the second quote accordingly.
3. We check that the prediction does not break the code. For example, it can insert a newline in the middle of an expression, which can change the AST. We run Babelfish on each changed line to see if the AST remains the same.
4. There is a huge chunk of code to represent the triggered rule in a human-readable format and generate the code for fixes.
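
The first three quirks can be sketched as follows. This is an illustration under stated assumptions, not the analyzer's implementation: `Prediction`, `pick_winner`, `reconcile_quotes` and `same_ast` are hypothetical names, and the AST check uses Python's built-in `ast` module on Python snippets as a stand-in for running Babelfish on JavaScript.

```python
import ast
from typing import List, NamedTuple, Optional, Tuple


class Prediction(NamedTuple):
    position: int      # index of the virtual node being predicted
    label: str         # predicted token class
    confidence: float  # confidence of the rule that fired


def pick_winner(triggered: List[Prediction]) -> Optional[Prediction]:
    """Quirk 1: when several rules fire, the most confident one wins."""
    return max(triggered, key=lambda p: p.confidence, default=None)


def reconcile_quotes(opening: Prediction,
                     closing: Prediction) -> Tuple[Prediction, Prediction]:
    """Quirk 2: make paired quotes agree with the more confident prediction."""
    if opening.label == closing.label:
        return opening, closing
    if opening.confidence >= closing.confidence:
        return opening, closing._replace(label=opening.label)
    return opening._replace(label=closing.label), closing


def same_ast(before: str, after: str) -> bool:
    """Quirk 3 (stand-in): accept a fix only if the syntax tree is unchanged.

    ast.dump() omits line and column attributes by default, so pure
    whitespace changes compare equal.
    """
    try:
        return ast.dump(ast.parse(before)) == ast.dump(ast.parse(after))
    except SyntaxError:
        return False


# Two rules fire at the same position; the more confident one wins.
winner = pick_winner([Prediction(3, "'", 0.95), Prediction(3, '"', 0.99)])

# The closing quote was predicted differently and with lower confidence.
fixed_opening, fixed_closing = reconcile_quotes(winner, Prediction(7, "'", 0.96))

# A newline inside parentheses keeps the tree; one outside splits the statement.
harmless = same_ast("a = (b -\n     c)", "a = (b - c)")
breaking = same_ast("a = b - c", "a = b\n- c")
print(winner.label, fixed_closing.label, harmless, breaking)
```

The `same_ast` check compares whole snippets, whereas the analyzer checks each changed line. The idea is the same: a formatting fix that changes the parsed structure is rejected.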
#### typos
We take the dataset with identifiers extracted from [Public Git Archive](https://github.com/src-d/datasets/tree/master/PublicGitArchive).
We split them into "atoms" (blog post is pending early November). Frequencies are available for each "atom",
so we consider the most frequent ones as ground truth. For each checked "atom", we take its embedding
computed with [fasttext](https://github.com/facebookresearch/fastText), refine it with a deep
fully-connected neural network, generate candidates with [symspell](https://github.com/wolfgarbe/SymSpell)
and rank them with [XGBoost](https://github.com/dmlc/xgboost).
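
As a rough illustration of the candidate generation and ranking stages only, here is a simplified sketch in plain Python. It borrows the symmetric-delete idea behind SymSpell (index every dictionary entry under its one-character deletions) but does not use the SymSpell library, and it skips the final edit-distance verification. Ranking by raw frequency is a placeholder for the XGBoost ranker, and the embedding steps are left out entirely. The vocabulary, its frequencies and all function names are made up for the example.

```python
from collections import defaultdict
from typing import Dict, List, Set


def deletes(word: str) -> Set[str]:
    """The word itself plus every string obtained by deleting one character."""
    return {word} | {word[:i] + word[i + 1:] for i in range(len(word))}


def build_index(frequencies: Dict[str, int]) -> Dict[str, Set[str]]:
    """Map each deletion variant to the known atoms that produce it."""
    index: Dict[str, Set[str]] = defaultdict(set)
    for atom in frequencies:
        for variant in deletes(atom):
            index[variant].add(atom)
    return index


def suggest(atom: str, frequencies: Dict[str, int],
            index: Dict[str, Set[str]], limit: int = 3) -> List[str]:
    """Return correction candidates, most frequent first."""
    if atom in frequencies:
        return []  # frequent atoms are treated as ground truth
    candidates: Set[str] = set()
    for variant in deletes(atom):
        candidates |= index.get(variant, set())
    # Placeholder ranking: by frequency, then alphabetically for stability.
    return sorted(candidates, key=lambda c: (-frequencies[c], c))[:limit]


# Made-up frequencies standing in for the identifier dataset.
frequencies = {"value": 900, "valve": 40, "index": 700, "length": 500}
index = build_index(frequencies)

for atom in ("valeu", "lenght", "index"):
    print(atom, "->", suggest(atom, frequencies, index))
```

Matching on shared deletions also catches transpositions such as `valeu` for `value`, since both reduce to `valu` and `vale` after removing one character.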
## Contributions
Contributions are very welcome and desired! Please follow the [code of conduct](doc/code_of_conduct.md)
and read the [contribution guidelines](doc/contributing.md). If you want to add a new cool style
fixer backed by machine learning, it is always a good idea to discuss it on
[Slack](https://sourced.tech/community/#talk).
## License
AGPL-3.0, see [LICENSE.md](LICENSE.md).