# Suspicious
Catching bugs in code with AI, fully local CLI app. No data leaves your computer.
🤔 Overview • 🪄 Demos • 🔧 Installation • 💻 Usage • 🧠 How it works
-------------------------------------------------------------------
## Overview
This is a CLI application that analyzes a source code file using an AI model. It then shows you parts that look suspicious to it.
It does **not** use rules or static analysis the way a linter tool would. Instead, the model generates its own code suggestions based on the surrounding context. Check out [how it works](#how-does-it-work).
> NB: All processing is done on your hardware and no data is transmitted to the Internet.
## Demo
Here's the output of running the application on its own source files (so meta).
- `cli.py` — [source code](./src/suspicious/cli.py) → [generated output](https://sturdy-dev.github.io/suspicious/demos/cli_py/)
- `render.py` — [source code](./src/suspicious/render.py) → [generated output](https://sturdy-dev.github.io/suspicious/demos/render_py/)
- `sus.py` — [source code](./src/suspicious/sus.py) → [generated output](https://sturdy-dev.github.io/suspicious/demos/sus_py/)
## Have I seen this before?
There was this post, [AI found a bug in my code](https://news.ycombinator.com/item?id=33632610), on Hacker News, which was pretty cool. I wanted to try it on my own code, so I went ahead and built my own implementation of the idea.
## Installation
You can install `sus` via `pip` or from source.
### Pip (macOS, Linux, Windows)
```bash
pip3 install suspicious
```
### From source
```bash
git clone git@github.com:sturdy-dev/suspicious.git
cd suspicious
python -m pip install .
```
## Usage
You can run the program like this:
```bash
sus /path/to/file.py
```
> Note that when you run this for the first time, the application will need to download a model (~500 MB); see the [Model](#model) section for more info.
This will generate and open an `.html` file with the results.
- `grey` means the prediction matches the original token
- `light grey` means the model predicted something different, but with very low confidence
- `light red` means things are looking a little sus
- `red` means the model predicted something different, with higher confidence
### Practical usage
Unclear. You run `sus` on a file and skim over the red stuff; maybe it spots something you missed. Ping me on [twitter](https://twitter.com/krlvi) if you catch something cool with it.
## How does it work?
In a nutshell, it feeds a tokenized representation of your source text into a Transformer model and asks the model to predict one token at a time using [Masked Language Modelling](https://huggingface.co/docs/transformers/tasks/language_modeling#masked-language-modeling).
For a general overview about Transformer models, check out [The Illustrated Transformer](https://jalammar.github.io/illustrated-transformer/) article by Jay Alammar, which helped me out in understanding the core ideas.
`sus` uses a model called [UniXcoder](https://github.com/microsoft/CodeBERT/tree/master/UniXcoder), which has been trained on the [CodeSearchNet](https://huggingface.co/datasets/code_search_net) dataset. To do the MLM (masked language modelling), we add an `lm_head` layer on top.
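For illustration, here is a minimal sketch of loading that checkpoint with an MLM head via the Hugging Face `transformers` API; the actual wiring in `sus` may differ:

```python
# Sketch only: load unixcoder-base-nine with a masked-language-modelling head.
# For RoBERTa-style models, the head's decoder weights are tied to the input
# embeddings by default, so the base checkpoint can be used for MLM.
from transformers import AutoModelForMaskedLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("microsoft/unixcoder-base-nine")
model = AutoModelForMaskedLM.from_pretrained("microsoft/unixcoder-base-nine")
model.eval()  # inference only
```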
When `sus` processes your code, it first tokenizes the text, where a token could be a special character or programming language keyword, English word or part of a word.
Before feeding the sequence of token ids to the model, one or more tokens are replaced with a special `<mask>` token. After feeding the input through the network, we extract just the prediction at the masked location. This masking is done in a loop, once per token, to generate individual predictions.
Since masking one token at a time is impractically slow, `sus` instead masks 10% of the tokens in each pass, making sure that the masked locations are spread out (so that there is sufficient context around each prediction site).
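The per-token loop could then look roughly like this (a simplified sketch building on the snippet above, masking one position per forward pass; as noted, the real implementation masks ~10% of positions per pass to stay fast):

```python
import torch

code = "def add(a, b):\n    return a - b"  # the '-' here should look sus
ids = tokenizer(code, return_tensors="pt")["input_ids"][0]

for i in range(1, len(ids) - 1):           # skip the <s> and </s> specials
    masked = ids.clone()
    original_id = int(masked[i])
    masked[i] = tokenizer.mask_token_id    # hide one token from the model
    with torch.no_grad():
        logits = model(input_ids=masked.unsqueeze(0)).logits[0, i]
    probs = logits.softmax(dim=-1)
    predicted_id = int(probs.argmax())
    if predicted_id != original_id:
        print(f"pos {i}: wrote {tokenizer.decode([original_id])!r}, "
              f"model prefers {tokenizer.decode([predicted_id])!r} "
              f"(p={probs[predicted_id].item():.2f})")
```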
The output of this entire process is a list of structs that contain the original and predicted values for each token. Example:
```json5
{
  "idx": 0,                  // position in the sequence
  "original": "foo",         // as originally written in the source file
  "predicted": "bar",        // what the model predicted
  "cosine_similarity": 0.23, // how close the prediction is to the original in vector space
  "probability": 0.92,       // how confident the model is in its prediction
}
```
This is then fed into an `html` template to be rendered for the user. Easy-peasy.
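As a rough illustration of that last step (the class names and thresholds below are invented for the example; the real ones live in `render.py` and the template):

```python
# Hypothetical mapping from prediction structs to the colour legend above.
import html

tokens = [
    {"idx": 0, "original": "foo", "predicted": "bar",
     "cosine_similarity": 0.23, "probability": 0.92},
]

def css_class(t):
    if t["predicted"] == t["original"]:
        return "same"       # grey: model agrees with the source
    if t["probability"] < 0.5:
        return "low"        # light grey: different prediction, low confidence
    if t["cosine_similarity"] > 0.8:
        return "sus-light"  # light red: different, but semantically close
    return "sus"            # red: confidently different prediction

spans = "".join(
    f'<span class="{css_class(t)}" title="{html.escape(t["predicted"])}">'
    f'{html.escape(t["original"])}</span>'
    for t in tokens
)
```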
### Model
`sus` uses the decoder of [UniXcoder](https://github.com/microsoft/CodeBERT/tree/master/UniXcoder), specifically the [unixcoder-base-nine](https://huggingface.co/microsoft/unixcoder-base-nine) checkpoint. What's cool is that it's only 500 MB and ~120M parameters, which means it's quick to download and fast enough to run locally.
Larger models produce higher-quality output, but you would need to run the inference on a server.
## Supported languages
You can try `sus` on any source file, but you can expect best results with the following languages:
- java
- ruby
- python
- php
- javascript
- go
- c
- c++
- c#
## Bugs and limitations
- Accuracy — `sus` is meant to be executed locally (aka not sending code to a server), which puts some constraints on the AI model size. Larger models would produce higher-quality results, but they can be tens of GB in size, and without a beefy GPU they could take a long time to generate the output. Because of this, `sus` uses a [modestly sized model](#model).
- Large files — The [model](#model) also puts constraints on the input size (the size of the analyzed file). `sus` works around this by batching the input, but as a result, batches are not aware of the context / code that is in other batches. Files are split into batches of 2500 characters, which is super crude and is meant to correspond to ~1024 tokens (see the sketch after this list).
- [Masking](#how-does-it-work) is done on a per-token basis. It could be interesting to first generate a syntax tree from the code and then mask entire nodes instead.
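The batch splitting mentioned above could be as naive as this sketch (the actual splitting logic lives in the `sus` source and may differ):

```python
def chunk(text: str, size: int = 2500) -> list[str]:
    # Fixed-width character batching; ~2500 chars is meant to land near
    # the model's ~1024-token input limit. Chunks share no context.
    return [text[i : i + size] for i in range(0, len(text), size)]
```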
## License
Suspicious is distributed under [AGPL-3.0-only](LICENSE.txt). For Apache-2.0 exceptions —