https://github.com/lac-dcc/yali
A framework to analyze a space formed by the combination of program encodings, obfuscation passes and stochastic classification models.
- Host: GitHub
- URL: https://github.com/lac-dcc/yali
- Owner: lac-dcc
- License: gpl-3.0
- Created: 2021-12-11T04:09:13.000Z (about 4 years ago)
- Default Branch: main
- Last Pushed: 2023-08-01T14:16:46.000Z (over 2 years ago)
- Last Synced: 2025-06-07T10:02:14.084Z (7 months ago)
- Topics: machine-learning, obfuscation, ollvm
- Language: LLVM
- Homepage: https://doi.org/10.1145/3579990.3580012
- Size: 47.8 MB
- Stars: 33
- Watchers: 5
- Forks: 4
- Open Issues: 0
- Metadata Files:
  - Readme: README.md
  - License: LICENSE
# Yali
A framework to analyze a space formed by the combination of program encodings, obfuscation passes and stochastic classification models.
## :pushpin: **Contents Table**
* [Introduction](#introduction)
* [Getting Started](#getting-started)
* [Prerequisites](#prerequisites)
* [Setup](#setup)
* [Running](#running)
* [Statistics](#statistics)
* [Structure](#structure)
* [Technical Report](#technical-report)
## :scroll: **Introduction**
Let _D_ be a deep learning model that classifies programs according to the problem they solve. This project aims to evaluate how _D_ behaves with obfuscated code. We want to know how much the accuracy of _D_ is affected. This study also evaluates different types of program representations.
> The top of the image above shows the histogram produced by a specific strategy for program *292*. This program belongs to class 11 of the *POJ-104 dataset*. The bottom of the image shows how each model classifies the variations of program *292*.
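To make the histogram encoding concrete: it counts how often each LLVM opcode appears in a program's IR. The real extraction is done by the `HistogramPass` LLVM pass in this repository; the sketch below is only a rough approximation over textual IR, written for illustration.

```python
from collections import Counter

def ir_histogram(ir_text: str) -> Counter:
    """Count LLVM opcodes in textual IR. Rough approximation: takes
    the token after '=' for value-producing instructions, otherwise
    the first token of the line."""
    hist = Counter()
    for line in ir_text.splitlines():
        line = line.strip()
        # Skip blanks, comments, function headers/footers, and module directives.
        if not line or line.startswith((';', 'define', 'declare', '}', 'target')):
            continue
        tokens = line.split()
        if '=' in tokens:
            idx = tokens.index('=') + 1
            if idx < len(tokens):
                hist[tokens[idx]] += 1
        else:
            hist[tokens[0]] += 1
    return hist

ir = """
define i32 @add(i32 %a, i32 %b) {
  %sum = add i32 %a, %b
  ret i32 %sum
}
"""
print(ir_histogram(ir))  # Counter({'add': 1, 'ret': 1})
```

Obfuscation passes change exactly these counts, which is why the histograms of a program and its obfuscated variants differ.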
## :checkered_flag: **Getting Started**
This section presents the steps to reproduce our experiments.
### **Prerequisites**
You need to install the following packages to run this project:
* [Docker](https://www.docker.com/get-started/) and [Docker Compose](https://docs.docker.com/compose/install/) to run our experiments
* [Python-3](https://www.python.org/downloads/) to plot the results in the project's Jupyter Notebook
* [Wget](https://www.gnu.org/software/wget/), [Tar](https://www.gnu.org/software/tar/) and [Sed](https://www.gnu.org/software/sed/) to run the initial scripts to configure the repository
### **Setup**
First, copy the `.env.example` file at the project's root and rename the copy to `.env`.
You can then set the following environment variables in it:
| Variable | Description | Values |
| --- | --- | --- |
| `REPRESENTATION` | Program embedding used to represent a program. Required. | `histogram`, `ir2vec`, `milepost`, `cfg`, `cfg_compact`, `cdfg`, `cdfg_compact`, `cdfg_plus`, `programl` |
| `MODEL` | Selected machine learning model. Required. If `REPRESENTATION` is `cfg`, `cfg_compact`, `cdfg`, `cdfg_compact`, `cdfg_plus` or `programl`, the model must be `dgcnn` or `gcn`. | `cnn` (Convolutional Neural Network by Lili Mou et al.), `rf` (Random Forest), `svm` (Support Vector Machine), `knn` (K-Nearest Neighbors), `lr` (Logistic Regression), `mlp` (Multilayer Perceptron), `dgcnn` (Deep Graph CNN) |
| `TRAINDATASET` / `TESTDATASET` | Dataset used in the training/testing phase. `TRAINDATASET` is required; leave `TESTDATASET` empty to use the same dataset for both phases. | `OJClone` (POJ-104 dataset used by Lili Mou et al.), `BCF` (OJClone obfuscated by the Bogus Control Flow strategy), `FLA` (OJClone obfuscated by the Control Flow Flattening strategy), `SUB` (OJClone obfuscated by the Instructions Substitution strategy), `OLLVM` (OJClone obfuscated by Control Flow Flattening, Bogus Control Flow and Instructions Substitution, applied in that order), `MCMC` (OJClone obfuscated by the Markov Chain Monte Carlo strategy), `DRLSG` (OJClone obfuscated by the Deep Reinforcement Learning Sequence Generation strategy), `RS` (OJClone obfuscated by the Random-Search strategy) |
| `OPTLEVELTRAIN` / `OPTLEVELTEST` | Optimization level applied to the training/testing dataset. `OPTLEVELTRAIN` is required; `OPTLEVELTEST` must be empty if `TESTDATASET` is empty. | `O0`, `O3` |
| `NUMCLASSES` | The number of classes in the dataset. Required. | |
| `ROUNDS` | The number of rounds to run the model. Required. | |
| `MEMORYPROF` | Whether a memory profiler will be used. Required. | `yes`, `no` |
| `FILTER_HISTOGRAM` | Comma-separated list of opcodes to consider. Only available when `REPRESENTATION=histogram`. | |
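For reference, a complete `.env` for training on the original POJ-104 programs and testing against the Bogus Control Flow variants could look like this. All values below are illustrative choices drawn from the lists above, not defaults.

```shell
# Hypothetical example .env; pick values from the variable descriptions above.
REPRESENTATION=histogram
MODEL=cnn
TRAINDATASET=OJClone
TESTDATASET=BCF
OPTLEVELTRAIN=O0
OPTLEVELTEST=O0
NUMCLASSES=104
ROUNDS=3
MEMORYPROF=no
```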
After that, prepare the environment to run our experiments with the following command line:
```bash
$ ./setup.sh
```
> This will download the datasets, build the docker image and create the necessary folders for the project.
### **Running**
Now, you can run the following command line:
```bash
$ ./run.sh MODE
```
`MODE` accepts the following values:
* **build**: Builds the docker container based on the modifications in the yali project
* **custom**: Runs the project based on the variables set on `.env` file
* **all**: Runs all experiments available in `MODE`
* **speedup**: Runs the speedup analysis with the benchmark game
* **embeddings**: Runs the embedding analysis
* **resources**: Runs only the resources analysis
* **malware**: Runs the experiment to detect classes of malware
* **game0**: Runs the [Game 0](https://doi.org/10.1145/3579990.3580012)
* **game1**: Runs the [Game 1](https://doi.org/10.1145/3579990.3580012)
* **game2**: Runs the [Game 2](https://doi.org/10.1145/3579990.3580012)
* **game3**: Runs the [Game 3](https://doi.org/10.1145/3579990.3580012)
* **discover**: Runs the [Discover Game](https://doi.org/10.1145/3579990.3580012)
* **histogram_ext**: Runs an accuracy analysis with an extended histogram
> This will run the docker container with the configurations in the `.env` file.
## :bar_chart: **Statistics**
The `Statistics` folder contains _Jupyter Notebooks_ that plot the data generated by the experiments. Each notebook describes its charts and the steps to produce them. The following _notebooks_ are available:
* [**EmbeddingResults**](./Statistics/EmbeddingResults.ipynb): Presents information about the accuracy of the dgcnn and cnn models with different representations
* [**GameResults**](./Statistics/GameResults.ipynb): Presents information about the 4 games proposed in our [work](https://doi.org/10.1145/3579990.3580012).
* [**ResourceResults**](./Statistics/ResourceResults.ipynb): Presents information about resource consumption (memory and time) of each model
* [**StrategiesResults**](./Statistics/StrategiesResults.ipynb): Presents the distance between the histograms of the original programs and the histograms generated by the obfuscators
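The `StrategiesResults` notebook compares opcode histograms before and after obfuscation. As a hedged illustration of what such a comparison can look like (the metric below is a plain Euclidean distance chosen for the example; the notebook defines the actual one it plots):

```python
import math

def histogram_distance(h1: dict, h2: dict) -> float:
    """Euclidean distance between two opcode histograms,
    treating opcodes missing from one histogram as zero counts."""
    opcodes = set(h1) | set(h2)
    return math.sqrt(sum((h1.get(op, 0) - h2.get(op, 0)) ** 2 for op in opcodes))

# Toy histograms: the obfuscator added branches and xor instructions.
original = {"add": 10, "ret": 2, "br": 5}
obfuscated = {"add": 10, "ret": 2, "br": 9, "xor": 6}
print(histogram_distance(original, obfuscated))  # 7.211102550927978
```

A larger distance means the obfuscator moved the program further from its original encoding, which is what makes the classifier's job harder.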
## :card_index_dividers: Structure
The repository has the following organization:
```bash
|-- Classification: "Scripts for the classification process"
|-- Compilation: "Scripts for the compilation process"
|-- Docs: "Repository documentation"
|-- Entrypoint: "Container setup"
|-- Extraction: "Script to extract a program representation and convert CSV to Numpy"
|-- HistogramPass: "LLVM pass to get the histograms"
|-- MalwareDataset: "Malware dataset to support experiments in the project"
|-- Representations: "Scripts to extract different program representations"
|-- Statistics: "Jupyter notebooks"
|-- Experiments: "Extra experiments using the yali infrastructure (each one of them has its own ReadME)"
|-- Utils: "Python scripts to support the `Experiments` folder and the Jupyter Notebooks"
|-- Volume: "Volume of the container"
|-- Csv: "CSVs with the histograms"
|-- Embeddings: "Different representations of programs in the Source folder"
    |-- Histograms: "Histograms in the NumPy format"
|-- Irs: "LLVM IRs of the programs"
|-- Results: "Results of the training/testing phase"
|-- Source: "Source code of the programs"
```
## :closed_book: Technical Report
This framework is used in the following published papers:
- [*A Game-Based Framework to Compare Program Classifiers and Evaders*](https://doi.org/10.1145/3579990.3580012). To cite it:
```latex
@inproceedings{damasio23,
author = {Dam\'{a}sio, Tha\'{\i}s and Canesche, Michael and Pacheco, Vin\'{\i}cius and Botacin, Marcus and Faustino da Silva, Anderson and Quint\~{a}o Pereira, Fernando M.},
title = {A Game-Based Framework to Compare Program Classifiers and Evaders},
year = {2023},
publisher = {Association for Computing Machinery},
address = {New York, NY, USA},
url = {https://doi.org/10.1145/3579990.3580012},
doi = {10.1145/3579990.3580012},
booktitle = {Proceedings of the 21st ACM/IEEE International Symposium on Code Generation and Optimization},
pages = {108–121},
numpages = {14},
keywords = {algorithm classification, obfuscation},
location = {Montr\'{e}al, QC, Canada},
series = {CGO 2023}
}
```
- [*Impacto de Ofuscadores e Otimizadores de Código na Acurácia de Classificadores de Programa*](https://doi.org/10.1145/3561320.3561322). To cite it:
```latex
@inproceedings{damasio22,
author = {Dam\'{a}sio, Tha\'{\i}s and Canesche, Michael and Pacheco, Vin\'{\i}cius and Faustino, Anderson and Quintao Pereira, Fernando Magno},
  title = {Impacto de Ofuscadores e Otimizadores de C\'{o}digo na Acur\'{a}cia de Classificadores de Programas},
year = {2022},
publisher = {Association for Computing Machinery},
address = {New York, NY, USA},
url = {https://doi.org/10.1145/3561320.3561322},
doi = {10.1145/3561320.3561322},
booktitle = {Proceedings of the XXVI Brazilian Symposium on Programming Languages},
pages = {68–75},
numpages = {8},
keywords = {neural network, compiler optimizations, obfuscation},
location = {Virtual Event, Brazil},
series = {SBLP '22}
}
```