https://github.com/squareslab/codealign

A tool for finding instruction-level equivalence between two functions.

Host: GitHub
URL: https://github.com/squareslab/codealign
Owner: squaresLab
Created: 2025-01-28T21:26:55.000Z (12 months ago)
Default Branch: main
Last Pushed: 2025-02-24T16:37:28.000Z (11 months ago)
Last Synced: 2025-02-24T17:43:37.158Z (11 months ago)
Language: Python
Size: 92.8 KB
Stars: 1
Watchers: 0
Forks: 1
Open Issues: 0
Metadata Files:
- Readme: README.md

# Introduction

Codealign is a tool for evaluating neural decompilers that computes equivalence between two input functions at the instruction level.

The intended use case is comparing the predictions of a neural decompiler with a reference correct answer like the original source code.

This work was introduced in [Fast, Fine-Grained Equivalence Checking for Neural Decompilers](https://arxiv.org/abs/2501.04811).

# Installation

Codealign is a Python package. Install it from source:

```
git clone https://github.com/squaresLab/codealign.git
cd codealign
pip install .
```

Then, optionally, run the unit tests:

```
python -m unittest
```

# Usage

```python
from codealign import align, Alignment

prediction = """
int write_response(int fd, char *buf, int len) {
    int i;
    for (i = 0; i < len; i += len) {
        if ((i = write(fd, buf + i, len - i)) <= 0)
            return 0;
    }
    return 1;
}
"""

reference = """
int write_response(int fd, char *response, int len) {
    int retval;
    int byteswritten = 0;
    while (byteswritten < len) {
        retval = write(fd, response + byteswritten, len - byteswritten);
        if (retval <= 0) {
            return 0;
        }
        byteswritten += retval;
    }
    return 1;
}
"""

alignment: Alignment = align(prediction, reference, 'c', partial_loops=True)
print(alignment)
```

This prints:

```
Alignment(candidate=write_response, reference=write_response)
  %1 = phi 0 %7
  %1 = phi 0 %8

  %2 = < %1 len
  %3 = < %1 len

  loop %2
  loop %3

  %7 = + %5 len

  %3 = + buf %1
  %4 = + response %1

  %4 = - len %1
  %5 = - len %1

  %5 = write(fd, %3, %4)
  %6 = write(fd, %4, %5)

  %6 = <= %5 0
  %7 = <= %6 0

  if %6
  if %7

  return 0
  return 0

  return 1
  return 1

  %8 = + %1 %6
```

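Equivalent instructions are grouped together: each aligned pair appears on adjacent lines, with the candidate instruction first and the reference instruction second. Instructions with no equivalent, such as `%7 = + %5 len` in the candidate and `%8 = + %1 %6` in the reference, appear on their own.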

Alignment objects can be interacted with programmatically via several methods.

#### IR Representations

```python
alignment.candidate_ir
alignment.reference_ir
```

These attributes give access to each input function in codealign's internal representation (IR).

#### Alignment List Representation

```python
alignment.alignment_list
```

Represents the alignment as pairs of instructions in the order `(candidate_instruction, reference_instruction)`.

If an instruction does not align with anything, the other element of its pair is `None`.

Except in injective mode, an instruction can occur in more than one pair if it aligns with multiple other instructions.
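As a rough sketch of how the pair list might be consumed, the helper below computes the share of reference instructions that are aligned with at least one candidate instruction. It relies only on the `(candidate_instruction, reference_instruction)` layout described above. The name `reference_coverage` is hypothetical, the strings are stand-ins for real instructions, and the code assumes instruction objects are hashable; it is not a metric defined by codealign itself.

```python
from typing import Hashable, Iterable, Optional, Tuple

# One entry of the alignment list: (candidate_instruction, reference_instruction).
Pair = Tuple[Optional[Hashable], Optional[Hashable]]


def reference_coverage(pairs: Iterable[Pair]) -> float:
    """Fraction of distinct reference instructions that have an aligned candidate."""
    seen = set()     # every reference instruction encountered
    aligned = set()  # reference instructions paired with a candidate instruction
    for candidate, reference in pairs:
        if reference is None:
            continue  # candidate instruction with no reference counterpart
        seen.add(reference)
        if candidate is not None:
            aligned.add(reference)
    return len(aligned) / len(seen) if seen else 0.0


# Stand-in data: two aligned pairs, one unaligned candidate, one unaligned reference.
example_pairs = [("c1", "r1"), ("c2", "r2"), ("c3", None), (None, "r3")]
print(reference_coverage(example_pairs))
```

With a real alignment, the same helper could presumably be called as `reference_coverage(alignment.alignment_list)`, provided the hashability assumption holds.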

#### Alignment Lookup Representation

```python
alignment[instruction] # read-only
```

Returns the instruction(s) with which a given instruction is aligned.

Instructions in codealign IR can be found in the `.candidate_ir` and `.reference_ir` attributes.
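One way to picture the lookup is as an index built over the pair list. The sketch below constructs such an index from plain `(candidate, reference)` tuples. `build_lookup` is a hypothetical helper written for illustration, not codealign's implementation, and it assumes instruction objects are hashable and that candidate and reference instructions never compare equal to each other.

```python
from collections import defaultdict


def build_lookup(pairs):
    """Map each aligned instruction to the list of instructions it is paired with."""
    lookup = defaultdict(list)
    for candidate, reference in pairs:
        if candidate is None or reference is None:
            continue  # unaligned instructions get no entry
        lookup[candidate].append(reference)
        lookup[reference].append(candidate)
    return dict(lookup)


# Stand-in data: "c1" aligns with two reference instructions, "c2" with none.
lookup = build_lookup([("c1", "r1"), ("c1", "r2"), ("c2", None)])
print(lookup.get("c1", []))
print(lookup.get("c2", []))
```

Returning an empty list for unaligned instructions via `.get(..., [])` is a choice made for this sketch; what `alignment[instruction]` returns in that case is not specified here.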

## Object Model

Codealign describes code in terms of an internal object model.

Its classes can be imported from `codealign.ir`.

The codealign object model includes, but is not limited to:

- `Function`

- `BasicBlock`

- `SSAOperator`

- `VarOperator`

- `Variable`

#### Accessing the Original AST

Where possible, codealign provides references to the `tree-sitter` AST nodes from which a given instruction was derived.

To access it, use:

```python
from codealign.ir import SSAOperator

instruction: SSAOperator  # any instruction taken from alignment.candidate_ir or alignment.reference_ir
instruction.ast_node
```
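A minimal sketch of how the AST reference might be used to report where an instruction came from in the source. It reads the `type` and `start_point` attributes that `tree-sitter` nodes expose in the Python bindings. The helper name `describe_origin` is hypothetical, and the code assumes that `ast_node` is `None` or absent when no node is available, which the text above does not confirm. Stand-in objects are used so the example runs without codealign installed.

```python
from types import SimpleNamespace


def describe_origin(instruction) -> str:
    """Return a short description of the AST node behind an instruction, if any."""
    # Assumption: ast_node is None (or missing) when codealign has no node to offer.
    node = getattr(instruction, "ast_node", None)
    if node is None:
        return "<no AST node>"
    row, column = node.start_point  # tree-sitter positions are zero-based
    return f"{node.type} at line {row + 1}, column {column + 1}"


# Stand-ins for a tree-sitter node and a codealign instruction.
fake_node = SimpleNamespace(type="call_expression", start_point=(4, 10))
fake_instruction = SimpleNamespace(ast_node=fake_node)
print(describe_origin(fake_instruction))  # call_expression at line 5, column 11
```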
