An open API service indexing awesome lists of open source software.

https://github.com/binaryeq/daleq

datalog-based binary equivalence
https://github.com/binaryeq/daleq

Last synced: 5 months ago
JSON representation

datalog-based binary equivalence

Awesome Lists containing this project

README

          

# DALEQ - Datalog-based Binary Equivalence

## Publications

1. [Jens Dietrich, Behnaz Hassanshahi: DALEQ - Explainable Equivalence for Java Bytecode (preprint)](https://arxiv.org/abs/2508.01530)
2. [Jens Dietrich, Tim White, Behnaz Hassanshahi, Paddy Krishnan: Levels of Binary Equivalence for the Comparison of Binaries from Alternative Builds. ICSME'25](https://arxiv.org/abs/2410.08427)
3. [Jens Dietrich, Tim White, Valerio Terragni, Behnaz Hassanshahi: Towards Cross-Build Differential Testing. ICST'25](https://www.researchgate.net/publication/391937833_Towards_Cross-Build_Differential_Testing)

## Build and Use CLI

DALEQ has been developed and tested with Java 17.
DALEQ can be built with Maven as follows:

`mvn clean package dependency:copy-dependencies`

Then run DALEQ as follows (for windows: change the classpath separator accordingly):

```
java -cp "target/classes:target/dependency/*" io.github.bineq.daleq.cli.Main \

-j1,--jar1 the first jar file to compare (required)
-j2,--jar2 the second jar file to compare (required)
-o,--out the output folder where the report will be generated (required)
-s1,--src1 the first jar file with source code to compare (optional)
-s2,--src2 the second jar file with source code to compare (optional)
-a,--autoopen if set, the generated html report will be opened automatically (optional, don't use this for CI integration)
-dq,--daleq one of {sound,soundy,both} (optional, default is soundy)
```

The jar built can also be used directly as an executable superjar.

```
java -jar target/daleq-.jar
```

Running the program will create a report `report.html` in the specified output folder.

The program returns with the following exit codes:

```
0 - all classes and resources are equal, source codes (if analysed) are equivalent
1 - some classes are not equal but still equivalent, resources are equal, source-codes (if analysed) are equivalent
2 - some classes are not equivalent (wrt daleq) or some resources (e.g. metadata) are not equal or some sources (if analysed) are not equivalent
```

This can be used to integrate the tool into CI processes.

### Locating Souffle

*Daleq* will try to find the *souffle* executable in the path.
This can be overridden by passing it as a JVM argument, i.e. `-DSOUFFLE=`.

## Overview

*Daleq* takes Java bytecode as input and produces a relational database.
This is done in two steps.

### Step 1 - EDB Extraction

```mermaid
flowchart LR
bytecode["bytecode (.class)"] --> asm(("asm")) --> edb["EDB"] --> souffle(("souffle")) --> idb["IDB"]
```

A database representing the bytecode is extracted. An [asm-based](https://asm.ow2.io/) static analysis is used for this purpose.
The result is the EDB (extensional database), a folder with tsv files, each tsv file corresponding to a predicate.
Those facts represent the bytecode, including class name, bytecode version, super types and interfaces,
access flags for method, fields and classes, bytecode instructions in methods, etc.

The representation is low-level, however, there is one significant abstraction (provided by _asm_): constant pool references are
resolved.

Each fact has a unique generated id that is used to provide provenance.

### Step 2 - IDB Computation

The IDB computation applies rules to compute a second set of facts (IDB files) that normalises
the EBD to establish whether two bytecode compared are equivalent or not.

Rules that are applied for this purpose are defined in
`src/main/resources/rules/`, `advanced.souffle` is the default set.

The API to generate the IDB is provided by `io.github.bineq.daleq.Souffle::createIDB`.

### Step 3 - Comparing Bytecode

When trying to establish equivalence between two (compiled) classes, IDBs cannot be compared directly for the following reasons:

1. The provenance (id) terms -- this reflects the process of normalisation that might be different for each of the classes compared
2. the IDB contains additional facts for predicates to keep track of facts that have been removed (i.e. EDB facts that have no counterpart in the IDB), or moved.
3. facts corresponding to bytecode instruction have an `instructioncounter` slot to define the order of instructions within a method. Those might change in the IDB as normalisation may move/remove/insert facts corresponding to instructions.

#### Projection

The purpose of *projection* is to remove the aforementioned parts (facts and slots) from the database.
At the moment normalisation is implemented in Java.

#### Workflow to Compare two Binaries

The classes `io.github.bineq.daleq.idb.IDBReader` and `io.github.bineq.daleq.idb.IDBPrinter` are used to read an IDB,
and convert it into a textual representation suitable for comparison and diffing (using a standard text fiff tool such as _diff_ or _kdiff3_). This also supports projection.

```mermaid
flowchart LR
bytecode1["class1"] --> asm1(("asm")) --> edb1["EDB1"] --> souffle1(("souffle")) --> idb1["IDB1"] --> IDBReader1(("project")) --> file1 --> diff --> result
bytecode2["class2"] --> asm2(("asm")) --> edb2["EDB2"] --> souffle2(("souffle")) --> idb2["IDB2"] --> IDBReader2(("project")) --> file2 --> diff

```

## Normalisation Rules

The datalog rules being used can be found in `src/main/resources/rules/`.
The structure of the rules is as follows:

- `core.souffle` - mostly generated standard rules mapping EDB facts to IDB facts, plus some additional guard predicates that can be used to override default rules
- `commons/access.souffle` - rules to derive facts for access flags such as *final* and *public* from access flags using bit encoding, for improved usability
- `normalisations/sound` - normalisations that have no impact on program behaviour
- `normalisations/soundy` - normalisations that are [soundy](http://soundiness.org/) in the sense that they might have some impact on program behaviour if the program uses reflection or similar dynamic programming techniques to inspect its own code. Those have been vetted for possible exploitability. For a conservative approach, those rules can be excluded from the analysis by using the _sound_ CLI option.

## Provenance

The first slot (term) for each fact is an *id*.
For EDB facts, those ids are generated during the bytecode analysis / extracyion phase.

Example (from `AALOAD.facts`):

```csv
F1428629 org/apache/hadoop/hdfs/HAUtil::getConfForOtherNodes(Lorg/apache/hadoop/conf/Configuration;)Ljava/util/List; 3200
```

When the IDB is created, each IDB fact also has this slot.
The values encode a derivation tree. Example:

```csv
R_AALOAD[F1428629] ...
```

This fact has been computed by applying rule `R_AALOAD` to fact `F1428629`.
I.e. those ids encode a derivation tree:

```mermaid
graph TD;
R_AALOAD --> F1428629;
```

A less trivial example is `R_42[F1,R_43[F2,F3]] ` , this encodes the following derivation tree:

```mermaid
graph TD;
R_42 --> F1;
R_42 --> R_43;
R_43 --> F2;
R_43 --> F3;
```

The project contains a parser to transform those strings into
trees (`io.github.bineq.daleq.souffle.provenance.ProvenanceParser`).

The grammar of the encoded proofs is defined here:
`src/main/antlr4/io/github/bineq/daleq/souffle/provenance/Proof.g4`.

## The Generated Report

This is an example of a generated report. Here, we have compared the jar file for the artifact
_javax.transaction:jta:1.1_ from Maven Central with the corresponding artifact rebuilt by Google's assured open source.
Daleq employs various analysers to compare classes, meta-data and resources within the jars.
The results are displayed in a table, the rows correspond to files within the jar(s), and the columns correspond to the various analysers.
Analysers compare the content of files, specialised analysers are used to compare source code and bytecode. The actual daleq analyser results are displayed in the last column. Please have a look at our paper for more details.

Possible analysis result states are:

- **PASS** the comparison yields true
- **FAIL** the comparison yields false
- **N/A** the comparison cannot be applied (e.g., a source Code comparison cannot be used to compare bytecode)
- **ERROR** the comparison has resulted in an error

image

The markers next to each result are hyperlinks to generated pages with additional provenance.
For non-equivalence statements, those pages are usually diff reports rendered in HTML.
For daleq equivalence statements, advanced diff reports are generated based on the derivations recorded when datalog rules are applied.
This is an example:

image

## Constructing the EDB Only

```
usage: java -cp io.github.bineq.daleq.edb.FactExtractor
-cl the location of compiled classes, a jar file or folder
-f a folder where to create the extension database (input
.facts files), the folder will be created if it does not
exist
-s a file where to create the souffle program (.souffle
file) containing imports and input predicate declarations
-v whether to verify the datalog facts against the schema,
must be true or false, default is true
```

## Notes

### Testing during Builds

Note that tests requires souffle to be set up (see also **IDB Computation** section).
This has to be done locally at the moment, see also [issue 11](https://github.com/binaryeq/daleq/issues/11).
Tests that require souffle will not fail but will be skipped if souffle is not available.

### Performance and Issues with Souffle

Souffle [occasionally fails](https://github.com/binaryeq/daleq/issues/22).
When souffle is used, a `Thread::sleep` instruction is used to pause DALEQ.
This reduces the number of errors (perhaps a souffle race condition?), but makes DALEQ slower.

Souffle issues will be reported by the appplication, and flagged as __error__ .

### Adding Your Own Analyser

Additional analysers can be easily added by following those steps:

1. Write an analyser (say `MyAnalyser`) implementing `io.github.bineq.daleq.cli.Analyser`
2. In your project, add a file `src/main/resources/META-INF/services/io.github.bineq.daleq.cli.Analyser`
3. Add a line with the fully qualified name of `MyAnalyser` to this file
4. Build a jar file
5. When running DALEQ, add this jar file to the classpath
6. Note that the order in which columns are displayed in reports is determined by the property `io.github.bineq.daleq.cli.Analyser::positionHint`, a values between 0 and 100.

## Contributing

To contribute to this project, please follow the steps below to set up your development environment.

First, create a virtual environment to isolate dependencies:

```bash
python3 -m venv .venv
```

Activate the virtual environment and install the `pre-commit` hooks.

```bash
source .venv/bin/activate
pip install -r requirements.txt
pre-commit install
```

Now, the hooks will automatically run on each commit to check for any issues in the code. You can manually run the pre-commit hooks on all files at any time with:

```bash
pre-commit run --all-files
```

## Sponsors

image

[Oracle Labs Australia](https://labs.oracle.com/)