https://github.com/lac-dcc/yali
A framework to analyze a space formed by the combination of program encodings, obfuscation passes and stochastic classification models.
- Host: GitHub
- URL: https://github.com/lac-dcc/yali
- Owner: lac-dcc
- License: gpl-3.0
- Created: 2021-12-11T04:09:13.000Z (about 4 years ago)
- Default Branch: main
- Last Pushed: 2023-08-01T14:16:46.000Z (over 2 years ago)
- Last Synced: 2025-06-07T10:02:14.084Z (7 months ago)
- Topics: machine-learning, obfuscation, ollvm
- Language: LLVM
- Homepage: https://doi.org/10.1145/3579990.3580012
- Size: 47.8 MB
- Stars: 33
- Watchers: 5
- Forks: 4
- Open Issues: 0
- Metadata Files:
  - Readme: README.md
  - License: LICENSE
# Yali
A framework to analyze a space formed by the combination of program encodings, obfuscation passes and stochastic classification models.
## :pushpin: **Contents Table**
* [Introduction](#introduction)
* [Getting Started](#getting-started)
* [Prerequisites](#prerequisites)
* [Setup](#setup)
* [Running](#running)
* [Statistics](#statistics)
* [Structure](#structure)
* [Technical Report](#technical-report)
## :scroll: **Introduction**
Let _D_ be a deep learning model that classifies programs according to the problem they solve. This project aims to evaluate how _D_ behaves with obfuscated code. We want to know how much the accuracy of _D_ is affected. This study also evaluates different types of program representations.
> The top of the image above shows the histogram produced by a specific strategy for program *292*. This program belongs to class 11 of the *POJ-104 dataset*. The bottom of the image shows how each model classifies the variations of program *292*.
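To make the histogram encoding concrete: it counts how often each LLVM opcode appears in a program's IR. The real extraction is done by the `HistogramPass` LLVM pass in this repository; the sketch below is only a rough approximation over textual IR, written for illustration.

```python
from collections import Counter

def ir_histogram(ir_text: str) -> Counter:
    """Count LLVM opcodes in textual IR. Rough approximation: takes
    the token after '=' for value-producing instructions, otherwise
    the first token of the line."""
    hist = Counter()
    for line in ir_text.splitlines():
        line = line.strip()
        # Skip blanks, comments, function headers/footers, and module directives.
        if not line or line.startswith((';', 'define', 'declare', '}', 'target')):
            continue
        tokens = line.split()
        if '=' in tokens:
            idx = tokens.index('=') + 1
            if idx < len(tokens):
                hist[tokens[idx]] += 1
        else:
            hist[tokens[0]] += 1
    return hist

ir = """
define i32 @add(i32 %a, i32 %b) {
  %sum = add i32 %a, %b
  ret i32 %sum
}
"""
print(ir_histogram(ir))  # Counter({'add': 1, 'ret': 1})
```

Obfuscation passes change exactly these counts, which is why the histograms of a program and its obfuscated variants differ.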
## :checkered_flag: **Getting Started**
This section presents the steps to reproduce our experiments.
### **Prerequisites**
You need to install the following packages to run this project:
* [Docker](https://www.docker.com/get-started/) and [Docker Compose](https://docs.docker.com/compose/install/) to run our experiments
* [Python-3](https://www.python.org/downloads/) to plot the results in the project's Jupyter Notebook
* [Wget](https://www.gnu.org/software/wget/), [Tar](https://www.gnu.org/software/tar/) and [Sed](https://www.gnu.org/software/sed/) to run the initial scripts to configure the repository
### **Setup**
First, copy the `.env.example` file at the project's root and rename the copy to `.env`.
You can then set the following environment variables in it:
| Variable | Description | Values |
| --- | --- | --- |
| `REPRESENTATION` | Program embedding used to represent a program. Required. | `histogram`, `ir2vec`, `milepost`, `cfg`, `cfg_compact`, `cdfg`, `cdfg_compact`, `cdfg_plus`, `programl` |
| `MODEL` | Selected machine learning model. Required. If `REPRESENTATION` is `cfg`, `cfg_compact`, `cdfg`, `cdfg_compact`, `cdfg_plus` or `programl`, the model must be `dgcnn` or `gcn`. | `cnn` (Convolutional Neural Network by Lili Mou et al.), `rf` (Random Forest), `svm` (Support Vector Machine), `knn` (K-Nearest Neighbors), `lr` (Logistic Regression), `mlp` (Multilayer Perceptron), `dgcnn` (Deep Graph CNN) |
| `TRAINDATASET` / `TESTDATASET` | Dataset used in the training/testing phase. `TRAINDATASET` is required; leave `TESTDATASET` empty to use the same dataset for both phases. | `OJClone` (POJ-104 dataset used by Lili Mou et al.), `BCF` (OJClone obfuscated by the Bogus Control Flow strategy), `FLA` (OJClone obfuscated by the Control Flow Flattening strategy), `SUB` (OJClone obfuscated by the Instructions Substitution strategy), `OLLVM` (OJClone obfuscated by Control Flow Flattening, Bogus Control Flow and Instructions Substitution, applied in that order), `MCMC` (OJClone obfuscated by the Markov Chain Monte Carlo strategy), `DRLSG` (OJClone obfuscated by the Deep Reinforcement Learning Sequence Generation strategy), `RS` (OJClone obfuscated by the Random-Search strategy) |
| `OPTLEVELTRAIN` / `OPTLEVELTEST` | Optimization level applied to the training/testing dataset. `OPTLEVELTRAIN` is required; `OPTLEVELTEST` must be empty if `TESTDATASET` is empty. | `O0`, `O3` |
| `NUMCLASSES` | The number of classes in the dataset. Required. | |
| `ROUNDS` | The number of rounds to run the model. Required. | |
| `MEMORYPROF` | Whether a memory profiler will be used. Required. | `yes`, `no` |
| `FILTER_HISTOGRAM` | Comma-separated list of opcodes to consider. Only available when `REPRESENTATION=histogram`. | |
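For reference, a complete `.env` for training on the original POJ-104 programs and testing against the Bogus Control Flow variants could look like this. All values below are illustrative choices drawn from the lists above, not defaults.

```shell
# Hypothetical example .env; pick values from the variable descriptions above.
REPRESENTATION=histogram
MODEL=cnn
TRAINDATASET=OJClone
TESTDATASET=BCF
OPTLEVELTRAIN=O0
OPTLEVELTEST=O0
NUMCLASSES=104
ROUNDS=3
MEMORYPROF=no
```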
After that, prepare the environment to run our experiments with the following command line:
```bash
$ ./setup.sh
```
> This will download the datasets, build the docker image and create the necessary folders for the project.
### **Running**
Now, you can run the following command line:
```bash
$ ./run.sh MODE
```
`MODE` accepts the following values:
* **build**: Builds the docker container based on the modifications in the yali project
* **custom**: Runs the project based on the variables set on `.env` file
* **all**: Runs all experiments available in `MODE`
* **speedup**: Runs the speedup analysis with the benchmark game
* **embeddings**: Runs the embedding analysis
* **resources**: Runs only the resources analysis
* **malware**: Runs the experiment to detect classes of malware
* **game0**: Runs the [Game 0](https://doi.org/10.1145/3579990.3580012)
* **game1**: Runs the [Game 1](https://doi.org/10.1145/3579990.3580012)
* **game2**: Runs the [Game 2](https://doi.org/10.1145/3579990.3580012)
* **game3**: Runs the [Game 3](https://doi.org/10.1145/3579990.3580012)
* **discover**: Runs the [Discover Game](https://doi.org/10.1145/3579990.3580012)
* **histogram_ext**: Runs an accuracy analysis with an extended histogram
> This will run the docker container with the configurations in the `.env` file.
## :bar_chart: **Statistics**
The `Statistics` folder contains _Jupyter Notebooks_ that plot the data generated by the experiments. Each notebook describes its charts and the steps to produce them. The following _notebooks_ are available:
* [**EmbeddingResults**](./Statistics/EmbeddingResults.ipynb): Presents information about the accuracy of the dgcnn and cnn models with different representations
* [**GameResults**](./Statistics/GameResults.ipynb): Presents information about the 4 games proposed in our [work](https://doi.org/10.1145/3579990.3580012).
* [**ResourceResults**](./Statistics/ResourceResults.ipynb): Presents information about resource consumption (memory and time) of each model
* [**StrategiesResults**](./Statistics/StrategiesResults.ipynb): Presents the distance between the histograms of the original programs and the histograms generated by the obfuscators
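The `StrategiesResults` notebook compares opcode histograms before and after obfuscation. As a hedged illustration of what such a comparison can look like (the metric below is a plain Euclidean distance chosen for the example; the notebook defines the actual one it plots):

```python
import math

def histogram_distance(h1: dict, h2: dict) -> float:
    """Euclidean distance between two opcode histograms,
    treating opcodes missing from one histogram as zero counts."""
    opcodes = set(h1) | set(h2)
    return math.sqrt(sum((h1.get(op, 0) - h2.get(op, 0)) ** 2 for op in opcodes))

# Toy histograms: the obfuscator added branches and xor instructions.
original = {"add": 10, "ret": 2, "br": 5}
obfuscated = {"add": 10, "ret": 2, "br": 9, "xor": 6}
print(histogram_distance(original, obfuscated))  # 7.211102550927978
```

A larger distance means the obfuscator moved the program further from its original encoding, which is what makes the classifier's job harder.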
## :card_index_dividers: Structure
The repository has the following organization:
```bash
|-- Classification: "Scripts for the classification process"
|-- Compilation: "Scripts for the compilation process"
|-- Docs: "Repository documentation"
|-- Entrypoint: "Container setup"
|-- Extraction: "Script to extract a program representation and convert CSV to Numpy"
|-- HistogramPass: "LLVM pass to get the histograms"
|-- MalwareDataset: "Malware dataset to support experiments in the project"
|-- Representations: "Scripts to extract different program representations"
|-- Statistics: "Jupyter notebooks"
|-- Experiments: "Extra experiments using the yali infrastructure (each one of them has its own ReadME)"
|-- Utils: "Python scripts to support the `Experiments` folder and the Jupyter Notebooks"
|-- Volume: "Volume of the container"
|-- Csv: "CSVs with the histograms"
|-- Embeddings: "Different representations of programs in the Source folder"
    |-- Histograms: "Histograms in the NumPy format"
|-- Irs: "LLVM IRs of the programs"
|-- Results: "Results of the training/testing phase"
|-- Source: "Source code of the programs"
```
## :closed_book: Technical Report
This framework is used in the following published papers:
- [*A Game-Based Framework to Compare Program Classifiers and Evaders*](https://doi.org/10.1145/3579990.3580012). To cite it:
```latex
@inproceedings{damasio23,
author = {Dam\'{a}sio, Tha\'{\i}s and Canesche, Michael and Pacheco, Vin\'{\i}cius and Botacin, Marcus and Faustino da Silva, Anderson and Quint\~{a}o Pereira, Fernando M.},
title = {A Game-Based Framework to Compare Program Classifiers and Evaders},
year = {2023},
publisher = {Association for Computing Machinery},
address = {New York, NY, USA},
url = {https://doi.org/10.1145/3579990.3580012},
doi = {10.1145/3579990.3580012},
booktitle = {Proceedings of the 21st ACM/IEEE International Symposium on Code Generation and Optimization},
pages = {108–121},
numpages = {14},
keywords = {algorithm classification, obfuscation},
location = {Montr\'{e}al, QC, Canada},
series = {CGO 2023}
}
```
- [*Impacto de Ofuscadores e Otimizadores de Código na Acurácia de Classificadores de Programa*](https://doi.org/10.1145/3561320.3561322). To cite it:
```latex
@inproceedings{damasio22,
author = {Dam\'{a}sio, Tha\'{\i}s and Canesche, Michael and Pacheco, Vin\'{\i}cius and Faustino, Anderson and Quintao Pereira, Fernando Magno},
  title = {Impacto de Ofuscadores e Otimizadores de C\'{o}digo na Acur\'{a}cia de Classificadores de Programas},
year = {2022},
publisher = {Association for Computing Machinery},
address = {New York, NY, USA},
url = {https://doi.org/10.1145/3561320.3561322},
doi = {10.1145/3561320.3561322},
booktitle = {Proceedings of the XXVI Brazilian Symposium on Programming Languages},
pages = {68–75},
numpages = {8},
keywords = {neural network, compiler optimizations, obfuscation},
location = {Virtual Event, Brazil},
series = {SBLP '22}
}
```