# Yali

A framework to analyze a space formed by the combination of program encodings, obfuscation passes, and stochastic classification models.

License: GPL v3 · Linting: Pylint

## :pushpin: **Table of Contents**

* [Introduction](#introduction)
* [Getting Started](#getting-started)
  * [Prerequisites](#prerequisites)
  * [Setup](#setup)
  * [Running](#running)
* [Statistics](#statistics)
* [Structure](#structure)
* [Technical Report](#technical-report)

---

## :scroll: **Introduction**

Let _D_ be a deep learning model that classifies programs according to the problem they solve. This project evaluates how _D_ behaves on obfuscated code: we want to know how much obfuscation affects the accuracy of _D_. The study also evaluates different types of program representations.


*Figure: examples of classifications*

> The top of the image above shows the histogram produced by a specific strategy for program *292*. This program belongs to class 11 of the *POJ-104 dataset*. The bottom of the image shows how each model classifies the variations of program *292*.

---

## :checkered_flag: **Getting Started**
This section presents the steps to reproduce our experiments.

### **Prerequisites**
You need to install the following packages to run this project (a quick way to check that they are installed is shown after the list):

* [Docker](https://www.docker.com/get-started/) and [Docker Compose](https://docs.docker.com/compose/install/) to run our experiments
* [Python-3](https://www.python.org/downloads/) to plot the results in the project's Jupyter Notebook
* [Wget](https://www.gnu.org/software/wget/), [Tar](https://www.gnu.org/software/tar/) and [Sed](https://www.gnu.org/software/sed/) to run the initial scripts to configure the repository
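
As a quick sanity check (assuming GNU versions of Tar and Sed), the commands below print the version of each prerequisite; a "command not found" error points to a missing dependency:

```bash
# Print the version of each required tool; a missing command means the
# corresponding prerequisite still needs to be installed.
docker --version
docker compose version       # or `docker-compose --version` on older installations
python3 --version
wget --version | head -n 1
tar --version  | head -n 1
sed --version  | head -n 1
```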

### **Setup**

First, copy the `.env.example` file at the project's root and rename it to `.env`, as shown below. Then set the following environment variables in `.env` (an illustrative example follows the table):
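
One way to do this from the repository root:

```bash
# Create a local .env from the template shipped with the repository.
cp .env.example .env
```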



| Variable | Description | Values |
| --- | --- | --- |
| `REPRESENTATION` | Program embedding used to represent a program. **Required.** | `histogram`, `ir2vec`, `milepost`, `cfg`, `cfg_compact`, `cdfg`, `cdfg_compact`, `cdfg_plus`, `programl` |
| `MODEL` | Selected machine learning model. **Required.** If `REPRESENTATION` is `cfg`, `cfg_compact`, `cdfg`, `cdfg_compact`, `cdfg_plus`, or `programl`, the model must be `dgcnn` or `gcn`. | `cnn` (Convolutional Neural Network by Lili Mou et al.), `rf` (Random Forest), `svm` (Support Vector Machine), `knn` (K-Nearest Neighbors), `lr` (Logistic Regression), `mlp` (Multilayer Perceptron), `dgcnn` (Deep Graph CNN) |
| `TRAINDATASET` / `TESTDATASET` | Datasets used in the training/testing phases. `TRAINDATASET` is required; leave `TESTDATASET` empty to use the same dataset for both training and testing. | |
| `OPTLEVELTRAIN` / `OPTLEVELTEST` | Optimization level applied to the training/testing dataset. `OPTLEVELTRAIN` is required; `OPTLEVELTEST` must be empty if `TESTDATASET` is empty. | `O0`, `O3` |
| `NUMCLASSES` | Number of classes in the dataset. **Required.** | |
| `ROUNDS` | Number of rounds to run the model. **Required.** | |
| `MEMORYPROF` | Whether a memory profiler will be used. **Required.** | `yes`, `no` |
| `FILTER_HISTOGRAM` | Comma-separated list of opcodes to consider. Only available when `REPRESENTATION=histogram`. | |
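
For illustration, a minimal `.env` for a histogram/CNN run could look like the sketch below. The variable names come from the table above, but the concrete values (dataset identifier, number of rounds) are placeholders rather than defaults shipped with the project:

```bash
# Illustrative .env sketch; adjust every value to your own experiment.
REPRESENTATION=histogram      # graph encodings (cfg, cdfg, programl, ...) require dgcnn or gcn
MODEL=cnn
TRAINDATASET=<dataset-name>   # placeholder: a dataset identifier downloaded by setup.sh
TESTDATASET=                  # empty: reuse TRAINDATASET in the testing phase
OPTLEVELTRAIN=O0
OPTLEVELTEST=                 # must stay empty when TESTDATASET is empty
NUMCLASSES=104                # e.g., POJ-104 has 104 classes
ROUNDS=5                      # placeholder number of rounds
MEMORYPROF=no
```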


After that, you need to prepare the environment to run our experiments. Run the following command line:

```bash
$ ./setup.sh
```
> This will download the datasets, build the docker image and create the necessary folders for the project.

### **Running**
Now, you can run the following command line:

```bash
$ ./run.sh MODE
```
`MODE` accepts the following values (an example invocation follows the list):
* **build**: Builds the docker container based on the modifications in the yali project
* **custom**: Runs the project based on the variables set on `.env` file
* **all**: Runs all experiments available in `MODE`
* **speedup**: Runs the speedup analysis with the benchmark game
* **embeddings**: Runs the embedding analysis
* **resources**: Runs only the resources analysis
* **malware**: Runs the experiment to detect classes of malware
* **game0**: Runs the [Game 0](https://doi.org/10.1145/3579990.3580012)
* **game1**: Runs the [Game 1](https://doi.org/10.1145/3579990.3580012)
* **game2**: Runs the [Game 2](https://doi.org/10.1145/3579990.3580012)
* **game3**: Runs the [Game 3](https://doi.org/10.1145/3579990.3580012)
* **discover**: Runs the [Discover Game](https://doi.org/10.1145/3579990.3580012)
* **histogram_ext**: Runs an accuracy analysis with an extended histogram

> This will run the docker container with the configurations in the `.env` file.
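
For example, a typical session after editing `.env` rebuilds the image and then runs the configured experiment (both modes are listed above):

```bash
# Rebuild the image to pick up local changes, then run the .env configuration.
./run.sh build
./run.sh custom
```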

---

## :bar_chart: **Statistics**
The `Statistics` folder contains _Jupyter Notebooks_ that plot the data generated by the experiments. Each notebook describes its charts and the steps used to produce them. The following notebooks are available (a way to open them locally is shown after the list):

* [**EmbeddingResults**](./Statistics/EmbeddingResults.ipynb): Presents information about the accuracy of the dgcnn and cnn models with different representations
* [**GameResults**](./Statistics/GameResults.ipynb): Presents information about the 4 games proposed in our [work](https://doi.org/10.1145/3579990.3580012).
* [**ResourceResults**](./Statistics/ResourceResults.ipynb): Presents information about resource consumption (memory and time) of each model
* [**StrategiesResults**](./Statistics/StrategiesResults.ipynb): Presents the distance between the histograms of the original programs and the histograms generated by the obfuscators
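
One possible way to open the notebooks locally, assuming Jupyter is installed through pip (any other Jupyter installation works as well):

```bash
# Install the Jupyter Notebook front end and open the Statistics folder.
python3 -m pip install --user notebook
jupyter notebook Statistics/
```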

---

## :card_index_dividers: Structure
The repository has the following organization:

```bash
|-- Classification: "Scripts for the classification process"
|-- Compilation: "Scripts for the compilation process"
|-- Docs: "Repository documentation"
|-- Entrypoint: "Container setup"
|-- Extraction: "Scripts to extract a program representation and convert CSV to NumPy"
|-- HistogramPass: "LLVM pass to get the histograms"
|-- MalwareDataset: "Malware dataset to support experiments in the project"
|-- Representations: "Scripts to extract different program representations"
|-- Statistics: "Jupyter notebooks"
|-- Experiments: "Extra experiments using the Yali infrastructure (each one has its own README)"
|-- Utils: "Python scripts to support the Experiments folder and the Jupyter notebooks"
|-- Volume: "Volume of the container"
|   |-- Csv: "CSVs with the histograms"
|   |-- Embeddings: "Different representations of the programs in the Source folder"
|   |-- Histograms: "Histograms in NumPy format"
|   |-- Irs: "LLVM IRs of the programs"
|   |-- Results: "Results of the training/testing phase"
|   |-- Source: "Source code of the programs"
```

---

## :closed_book: Technical Report

This framework is used in the following published papers:

- [*A Game-Based Framework to Compare Program Classifiers and Evaders*](https://doi.org/10.1145/3579990.3580012). To cite it:
```latex
@inproceedings{damasio23,
author = {Dam\'{a}sio, Tha\'{\i}s and Canesche, Michael and Pacheco, Vin\'{\i}cius and Botacin, Marcus and Faustino da Silva, Anderson and Quint\~{a}o Pereira, Fernando M.},
title = {A Game-Based Framework to Compare Program Classifiers and Evaders},
year = {2023},
publisher = {Association for Computing Machinery},
address = {New York, NY, USA},
url = {https://doi.org/10.1145/3579990.3580012},
doi = {10.1145/3579990.3580012},
booktitle = {Proceedings of the 21st ACM/IEEE International Symposium on Code Generation and Optimization},
pages = {108–121},
numpages = {14},
keywords = {algorithm classification, obfuscation},
location = {Montr\'{e}al, QC, Canada},
series = {CGO 2023}
}
```
- [*Impacto de Ofuscadores e Otimizadores de Código na Acurácia de Classificadores de Programas*](https://doi.org/10.1145/3561320.3561322). To cite it:

```latex
@inproceedings{damasio22,
author = {Dam\'{a}sio, Tha\'{\i}s and Canesche, Michael and Pacheco, Vin\'{\i}cius and Faustino, Anderson and Quintao Pereira, Fernando Magno},
title = {Impacto de Ofuscadores e Otimizadores de C\'{o}digo na Acur\'{a}cia de Classificadores de Programas},
year = {2022},
publisher = {Association for Computing Machinery},
address = {New York, NY, USA},
url = {https://doi.org/10.1145/3561320.3561322},
doi = {10.1145/3561320.3561322},
booktitle = {Proceedings of the XXVI Brazilian Symposium on Programming Languages},
pages = {68–75},
numpages = {8},
keywords = {neural network, compiler optimizations, obfuscation},
location = {Virtual Event, Brazil},
series = {SBLP '22}
}
```