{"id":21244604,"url":"https://github.com/lac-dcc/yali","last_synced_at":"2025-07-10T21:31:00.474Z","repository":{"id":49817392,"uuid":"437184155","full_name":"lac-dcc/yali","owner":"lac-dcc","description":"A framework to analyze a space formed by the combination of program encodings, obfuscation passes and stochastic classification models.","archived":false,"fork":false,"pushed_at":"2023-08-01T14:16:46.000Z","size":50079,"stargazers_count":33,"open_issues_count":0,"forks_count":4,"subscribers_count":5,"default_branch":"main","last_synced_at":"2025-06-07T10:02:14.084Z","etag":null,"topics":["machine-learning","obfuscation","ollvm"],"latest_commit_sha":null,"homepage":"https://doi.org/10.1145/3579990.3580012","language":"LLVM","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"gpl-3.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/lac-dcc.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2021-12-11T04:09:13.000Z","updated_at":"2025-03-27T15:15:54.000Z","dependencies_parsed_at":"2024-11-21T01:39:57.308Z","dependency_job_id":null,"html_url":"https://github.com/lac-dcc/yali","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/lac-dcc/yali","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/lac-dcc%2Fyali","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/lac-dcc%2Fyali/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/lac-dcc%2Fyali/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/lac-dcc%2Fyali/manifests","owner_u
rl":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/lac-dcc","download_url":"https://codeload.github.com/lac-dcc/yali/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/lac-dcc%2Fyali/sbom","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":264666022,"owners_count":23646570,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["machine-learning","obfuscation","ollvm"],"created_at":"2024-11-21T01:29:01.253Z","updated_at":"2025-07-10T21:30:56.787Z","avatar_url":"https://github.com/lac-dcc.png","language":"LLVM","funding_links":[],"categories":[],"sub_categories":[],"readme":"\u003cdiv align=\"center\"\u003e\n    \u003ch1\u003e Yali \u003c/h1\u003e\n    \u003cdiv style=\"font-style: italic\"\u003e\n        A framework to analyze a space formed by the combination of program encodings, obfuscation passes and stochastic classification models.\n    \u003c/div\u003e\n\u003c/div\u003e\n\n\u003cp align=\"center\"\u003e\n  \u003cimg alt=\"logo\" src=\"./Docs/yali.png\" width=\"35%\" height=\"auto\"/\u003e\n\u003c/p\u003e\n\n\u003cp align=\"center\"\u003e\n  \u003ca href=\"https://github.com/thais-damasio/yali/blob/main/LICENSE\"\u003e\u003cimg src=\"https://img.shields.io/badge/license-GPL%203.0%20only-green?style=for-the-badge\" alt=\"License: GPL v3\"\u003e\u003c/a\u003e\n  \u003ca href=\"https://github.com/PyCQA/pylint\"\u003e\u003cimg src=\"https://img.shields.io/badge/linting-pylint-yellowgreen?style=for-the-badge\" alt=\"Linting: Pylint\"\u003e\u003c/a\u003e\n  \u003ca 
href=\"https://github.com/lac-dcc/yali/commits/main\"\u003e\n    \u003cimg src=\"https://img.shields.io/github/last-commit/lac-dcc/yali/main?style=for-the-badge\"\n         alt=\"Last update\"\u003e\n  \u003c/a\u003e\n\u003c/p\u003e\n\n\n## :pushpin: **Contents Table**\n\n* [Introduction](#introduction)\n* [Getting Started](#getting-started)\n    * [Prerequisites](#prerequisites)\n    * [Setup](#setup)\n    * [Running](#running)\n* [Statistics](#statistics)\n* [Structure](#structure)\n* [Technical Report](#technical-report)\n\n\n\n---\n\u003ca id=\"introduction\"\u003e\u003c/a\u003e\n\n## :scroll: **Introduction**\n\nLet _D_ be a deep learning model that classifies programs according to the problem they solve. This project aims to evaluate how _D_ behaves with obfuscated code. We want to know how much the accuracy of _D_ is affected. This study also evaluates different types of program representations.\n\n\u003cp align=\"center\"\u003e\n\u003cimg alt=\"examples of classifications\" src=\"./Docs/examples.gif\" width=\"60%\" height=\"auto\"/\u003e\n\u003c/p\u003e\n\n\u003e The top of the image above shows the histogram produced by a specific strategy for program *292*. This program belongs to class 11 of the *POJ-104 dataset*. 
The bottom of the image shows how each model classifies the variations of program *292*.\n\n---\n\u003ca id=\"getting-started\"\u003e\u003c/a\u003e\n\n## :checkered_flag: **Getting Started**\nThis section lists the steps to reproduce our experiments.\n\n\n\u003ca id=\"prerequisites\"\u003e\u003c/a\u003e\n\n### **Prerequisites**\nYou need to install the following packages to run this project:\n\n* [Docker](https://www.docker.com/get-started/) and [Docker Compose](https://docs.docker.com/compose/install/) to run our experiments\n* [Python 3](https://www.python.org/downloads/) to plot the results in the project's Jupyter Notebooks\n* [Wget](https://www.gnu.org/software/wget/), [Tar](https://www.gnu.org/software/tar/) and [Sed](https://www.gnu.org/software/sed/) to run the initial scripts that configure the repository\n\n\u003ca id=\"setup\"\u003e\u003c/a\u003e\n\n###  **Setup**\n\nFirst, copy the `.env.example` file and rename the copy to `.env`.\nYou can then set environment variables in the `.env` file at the project's root. You can change the following variables:\n\n\u003ctable\u003e\n    \u003ctbody\u003e\n        \u003ctr\u003e\n            \u003cth\u003eVariable\u003c/th\u003e\n            \u003cth\u003eDescription\u003c/th\u003e\n            \u003cth\u003eValue\u003c/th\u003e\n        \u003c/tr\u003e\n        \u003ctr\u003e\n            \u003ctd\u003eREPRESENTATION\u003c/td\u003e\n            \u003ctd\u003eProgram embedding that will be used to represent a program. 
This variable is required.\u003c/td\u003e\n            \u003ctd\u003e\n                \u003cul\u003e\n                    \u003cli\u003ehistogram\u003c/li\u003e\n                    \u003cli\u003eir2vec\u003c/li\u003e\n                    \u003cli\u003emilepost\u003c/li\u003e\n                    \u003cli\u003ecfg\u003c/li\u003e\n                    \u003cli\u003ecfg_compact\u003c/li\u003e\n                    \u003cli\u003ecdfg\u003c/li\u003e\n                    \u003cli\u003ecdfg_compact\u003c/li\u003e\n                    \u003cli\u003ecdfg_plus\u003c/li\u003e\n                    \u003cli\u003eprograml\u003c/li\u003e\n                \u003c/ul\u003e\n            \u003c/td\u003e\n        \u003c/tr\u003e\n        \u003ctr\u003e\n            \u003ctd\u003eMODEL\u003c/td\u003e\n            \u003ctd\u003eSelected machine learning model. This variable is required. If REPRESENTATION is equal to `cfg`, `cfg_compact`, `cdfg`, `cdfg_compact`, `cdfg_plus` or `programl`, the model must be `dgcnn` or `gcn`.\u003c/td\u003e\n            \u003ctd\u003e\n                \u003cul\u003e\n                    \u003cli\u003e\"cnn\" (Convolutional Neural Network by \u003ca href=\"https://dl.acm.org/doi/10.5555/3015812.3016002\"\u003eLili Mou et al.\u003c/a\u003e)\u003c/li\u003e\n                    \u003cli\u003e\"rf\" (Random Forest) \u003c/li\u003e\n                    \u003cli\u003e\"svm\" (Support Vector Machine) \u003c/li\u003e\n                    \u003cli\u003e\"knn\" (K-Nearest Neighbors) \u003c/li\u003e\n                    \u003cli\u003e\"lr\" (Logistic Regression) \u003c/li\u003e\n                    \u003cli\u003e\"mlp\" (Multilayer Perceptron) \u003c/li\u003e\n                    \u003cli\u003e\"dgcnn\" (Deep Graph CNN) \u003c/li\u003e\n                \u003c/ul\u003e\n            \u003c/td\u003e\n        \u003c/tr\u003e\n        \u003ctr\u003e\n            \u003ctd\u003eTRAINDATASET / TESTDATASET\u003c/td\u003e\n            \u003ctd\u003eDataset that will be used in 
the training/testing phase. TRAINDATASET is required, but \u003cb\u003eTESTDATASET must be empty if you want to use the same dataset in the training and testing phases.\u003c/b\u003e\u003c/td\u003e\n            \u003ctd\u003e\n                \u003cul\u003e\n                    \u003cli\u003e\n                        \"OJClone\" (POJ-104 dataset used by \u003ca href=\"https://dl.acm.org/doi/10.5555/3015812.3016002\"\u003eLili Mou et al.\u003c/a\u003e)\n                    \u003c/li\u003e\n                    \u003cli\u003e\n                        \"BCF\" (The OJClone dataset that was obfuscated by the \u003ca href=\"https://github.com/obfuscator-llvm/obfuscator/wiki/Bogus-Control-Flow\"\u003eBogus Control Flow\u003c/a\u003e strategy) \n                    \u003c/li\u003e\n                    \u003cli\u003e\n                        \"FLA\" (The OJClone dataset that was obfuscated by the \u003ca href=\"https://github.com/obfuscator-llvm/obfuscator/wiki/Control-Flow-Flattening\"\u003eControl Flow Flattening\u003c/a\u003e strategy)\n                    \u003c/li\u003e\n                    \u003cli\u003e\n                        \"SUB\" (The OJClone dataset that was obfuscated by the \u003ca href=\"https://github.com/obfuscator-llvm/obfuscator/wiki/Instructions-Substitution\"\u003eInstructions Substitution\u003c/a\u003e strategy)\n                    \u003c/li\u003e\n                    \u003cli\u003e\n                        \"OLLVM\" (The OJClone dataset that was obfuscated by the \u003ca href=\"https://github.com/obfuscator-llvm/obfuscator/wiki/Control-Flow-Flattening\"\u003eControl Flow Flattening\u003c/a\u003e, \u003ca href=\"https://github.com/obfuscator-llvm/obfuscator/wiki/Bogus-Control-Flow\"\u003eBogus Control Flow\u003c/a\u003e and \u003ca href=\"https://github.com/obfuscator-llvm/obfuscator/wiki/Instructions-Substitution\"\u003eInstructions Substitution\u003c/a\u003e strategies, applied in that order)\n                    \u003c/li\u003e\n                    
\u003cli\u003e\n                        \"MCMC\" (The OJClone dataset that was obfuscated by the \u003ca href=\"https://arxiv.org/pdf/2111.10793.pdf\"\u003eMarkov Chain Monte Carlo\u003c/a\u003e strategy)\n                    \u003c/li\u003e\n                    \u003cli\u003e\n                        \"DRLSG\" (The OJClone dataset that was obfuscated by the \u003ca href=\"https://arxiv.org/pdf/2111.10793.pdf\"\u003eDeep Reinforcement Learning Sequence Generation\u003c/a\u003e strategy)\n                    \u003c/li\u003e\n                    \u003cli\u003e\n                        \"RS\" (The OJClone dataset that was obfuscated by the \u003ca href=\"https://arxiv.org/pdf/2111.10793.pdf\"\u003eRandom-Search\u003c/a\u003e strategy)\n                    \u003c/li\u003e\n                \u003c/ul\u003e\n            \u003c/td\u003e\n        \u003c/tr\u003e\n        \u003ctr\u003e\n            \u003ctd\u003eOPTLEVELTRAIN / OPTLEVELTEST\u003c/td\u003e\n            \u003ctd\u003eOptimization level applied to the training/testing dataset. OPTLEVELTRAIN is required, but \u003cb\u003eOPTLEVELTEST must be empty if TESTDATASET is empty.\u003c/b\u003e\u003c/td\u003e\n            \u003ctd\u003e\n                \u003cul\u003e\n                    \u003cli\u003eO0\u003c/li\u003e\n                    \u003cli\u003eO3\u003c/li\u003e\n                \u003c/ul\u003e\n            \u003c/td\u003e\n        \u003c/tr\u003e\n        \u003ctr\u003e\n            \u003ctd\u003eNUMCLASSES\u003c/td\u003e\n            \u003ctd\u003eThe number of classes of the dataset. This variable is required.\u003c/td\u003e\n            \u003ctd\u003e\u003c/td\u003e\n        \u003c/tr\u003e\n        \u003ctr\u003e\n            \u003ctd\u003eROUNDS\u003c/td\u003e\n            \u003ctd\u003eThe number of rounds to run the model. 
This variable is required.\u003c/td\u003e\n            \u003ctd\u003e\u003c/td\u003e\n        \u003c/tr\u003e\n        \u003ctr\u003e\n            \u003ctd\u003eMEMORYPROF\u003c/td\u003e\n            \u003ctd\u003eIndicate whether a memory profiler will be used. This variable is required.\u003c/td\u003e\n            \u003ctd\u003e\n                \u003cul\u003e\n                    \u003cli\u003eyes\u003c/li\u003e\n                    \u003cli\u003eno\u003c/li\u003e\n                \u003c/ul\u003e\n            \u003c/td\u003e\n        \u003c/tr\u003e\n        \u003ctr\u003e\n            \u003ctd\u003eFILTER_HISTOGRAM\u003c/td\u003e\n            \u003ctd\u003eString with a comma separated list of opcodes to consider. Only available if \u003cb\u003eREPRESENTATION=histogram.\u003c/b\u003e\u003c/td\u003e\n            \u003ctd\u003e\u003c/td\u003e\n        \u003c/tr\u003e\n    \u003c/tbody\u003e\n\u003c/table\u003e\n\n\nAfter that, you need to prepare the environment to run our experiments. Run the following command line:\n\n```bash\n$ ./setup.sh\n```\n\u003e This will download the datasets, build the docker image and create the necessary folders for the project. 
\n\n\n\u003ca id=\"running\"\u003e\u003c/a\u003e\n\n### **Running**\nNow, you can run the following command line:\n\n```bash\n$ ./run.sh MODE\n```\n`MODE` accepts the following values:\n* **build**: Builds the Docker container based on the modifications in the yali project\n* **custom**: Runs the project based on the variables set in the `.env` file\n* **all**: Runs all experiments available in `MODE`\n* **speedup**: Runs the speedup analysis with the benchmark game\n* **embeddings**: Runs the embedding analysis\n* **resources**: Runs only the resources analysis\n* **malware**: Runs the experiment to detect classes of malware\n* **game0**: Runs the [Game 0](https://doi.org/10.1145/3579990.3580012)\n* **game1**: Runs the [Game 1](https://doi.org/10.1145/3579990.3580012)\n* **game2**: Runs the [Game 2](https://doi.org/10.1145/3579990.3580012)\n* **game3**: Runs the [Game 3](https://doi.org/10.1145/3579990.3580012)\n* **discover**: Runs the [Discover Game](https://doi.org/10.1145/3579990.3580012)\n* **histogram_ext**: Runs an accuracy analysis with an extended histogram\n\n\u003e This will run the Docker container with the configurations in the `.env` file.\n\n\n\n---\n\u003ca id=\"statistics\"\u003e\u003c/a\u003e\n\n## :bar_chart: **Statistics**\nThe `Statistics` folder contains _Jupyter Notebooks_ that plot the data generated by the experiments. Each notebook describes its charts and the steps to produce them. 
The following _notebooks_ are available:\n\n* [**EmbeddingResults**](./Statistics/EmbeddingResults.ipynb): Presents information about the accuracy of the dgcnn and cnn models with different representations\n* [**GameResults**](./Statistics/GameResults.ipynb): Presents information about the 4 games proposed in our [work](https://doi.org/10.1145/3579990.3580012)\n* [**ResourceResults**](./Statistics/ResourceResults.ipynb): Presents information about resource consumption (memory and time) of each model\n* [**StrategiesResults**](./Statistics/StrategiesResults.ipynb): Presents the distance between the histograms of the original programs and the histograms generated by the obfuscators \n\n\n\n---\n\u003ca id=\"structure\"\u003e\u003c/a\u003e\n\n## :card_index_dividers: Structure\nThe repository has the following organization:\n\n```bash\n|-- Classification: \"Scripts for the classification process\"\n|-- Compilation: \"Scripts for the compilation process\"\n|-- Docs: \"Repository documentation\"\n|-- Entrypoint: \"Container setup\"\n|-- Extraction: \"Script to extract a program representation and convert CSV to NumPy\"\n|-- HistogramPass: \"LLVM pass to get the histograms\"\n|-- MalwareDataset: \"Malware dataset to support experiments in the project\"\n|-- Representations: \"Scripts to extract different program representations\"\n|-- Statistics: \"Jupyter notebooks\"\n    |-- Experiments: \"Extra experiments using the yali infrastructure (each one of them has its own README)\"\n    |-- Utils: \"Python scripts to support the `Experiments` folder and the Jupyter Notebooks\"\n|-- Volume: \"Volume of the container\"\n    |-- Csv: \"CSVs with the histograms\"\n    |-- Embeddings: \"Different representations of programs in the Source folder\"\n    |-- Histograms: \"Histograms in the NumPy format\"\n    |-- Irs: \"LLVM IRs of the programs\"\n    |-- Results: \"Results of the training/testing phase\"\n    |-- Source: \"Source code of the programs\"\n```\n\n\n---\n\u003ca 
id=\"technical-report\"\u003e\u003c/a\u003e\n\n## :closed_book: Technical Report\n\nThis framework is used in the following published papers:\n\n- [*A Game-Based Framework to Compare Program Classifiers and Evaders*](https://doi.org/10.1145/3579990.3580012). To cite it:\n```latex\n@inproceedings{damasio23,\n    author = {Dam\\'{a}sio, Tha\\'{\\i}s and Canesche, Michael and Pacheco, Vin\\'{\\i}cius and Botacin, Marcus and Faustino da Silva, Anderson and Quint\\~{a}o Pereira, Fernando M.},\n    title = {A Game-Based Framework to Compare Program Classifiers and Evaders},\n    year = {2023},\n    publisher = {Association for Computing Machinery},\n    address = {New York, NY, USA},\n    url = {https://doi.org/10.1145/3579990.3580012},\n    doi = {10.1145/3579990.3580012},\n    booktitle = {Proceedings of the 21st ACM/IEEE International Symposium on Code Generation and Optimization},\n    pages = {108–121},\n    numpages = {14},\n    keywords = {algorithm classification, obfuscation},\n    location = {Montr\\'{e}al, QC, Canada},\n    series = {CGO 2023}\n}\n```\n- [*Impacto de Ofuscadores e Otimizadores de Código na Acurácia de Classificadores de Programa*](https://doi.org/10.1145/3561320.3561322). 
To cite it:\n\n```latex\n@inproceedings{damasio22,\n    author = {Dam\\'{a}sio, Tha\\'{\\i}s and Canesche, Michael and Pacheco, Vin\\'{\\i}cius and Faustino, Anderson and Quintao Pereira, Fernando Magno},\n    title = {Impacto de Ofuscadores e Otimizadores de C\\'{o}Digo Na Acur\\'{a}Cia de Classificadores de Programas},\n    year = {2022},\n    publisher = {Association for Computing Machinery},\n    address = {New York, NY, USA},\n    url = {https://doi.org/10.1145/3561320.3561322},\n    doi = {10.1145/3561320.3561322},\n    booktitle = {Proceedings of the XXVI Brazilian Symposium on Programming Languages},\n    pages = {68–75},\n    numpages = {8},\n    keywords = {neural network, compiler optimizations, obfuscation},\n    location = {Virtual Event, Brazil},\n    series = {SBLP '22}\n}\n```\n\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Flac-dcc%2Fyali","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Flac-dcc%2Fyali","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Flac-dcc%2Fyali/lists"}