# FFRI Dataset scripts

This script allows you to create datasets in the same format as the FFRI Dataset.

## Requirements

We recommend using Docker to create datasets. For more information, refer to the [Using Docker](#using-docker) section.

Alternatively, you can run this script natively by installing the following dependencies on [tested platforms](#tested). For detailed instructions, see the [Run This Script Natively](#run-this-script-natively) section.

- Python 3.12
- [Poetry](https://python-poetry.org/) 1.7+

## Using Docker

### Make A CSV File

This script requires a CSV file that contains file information such as labels, dates, and file paths. For example:

```
path,label,date
./data/cleanware/test0.exe,0,2018/01/01
./data/malware/test1.exe,1,2018/01/02
```

Please note that the file paths in the CSV file should be specified as relative paths from the container's working directory.
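
If your samples are already sorted into per-class directories, a short script can generate this CSV for you. The following is only a sketch: the directory names and the fixed date value are assumptions for illustration, so adapt them to your own layout.

```
import csv
from pathlib import Path

# Assumed layout: ./data/cleanware/*.exe (label 0) and ./data/malware/*.exe (label 1).
# The date value is a placeholder; use whatever date is meaningful for your samples.
rows = []
for directory, label in (("data/cleanware", 0), ("data/malware", 1)):
    for path in sorted(Path(directory).glob("*.exe")):
        # Keep paths relative to the container's working directory (/work).
        rows.append({"path": f"./{path.as_posix()}", "label": label, "date": "2018/01/01"})

with open("data/target.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=["path", "label", "date"])
    writer.writeheader()
    writer.writerows(rows)
```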

### Make Datasets

You can create datasets using the following commands:

```
docker build --target production --tag ffridataset-scripts .
docker run -v /testbin:/work/testbin ffridataset-scripts test_main.py
# Note: The data directory should contain a CSV file and the executable files you want to process.
docker run -v /data:/work/data -v /out_dir:/work/out_dir ffridataset-scripts main.py --csv ./data/target.csv --out ./out_dir --log ./dataset.log --ver <version>
```

Please ensure the following:

- The host directory containing the CSV file and executable files is mounted to the container’s `/work/data`.
- The host directory where you want to save the JSON files is mounted to the container’s `/work/out_dir`.
- Replace `<version>` with vYYYY (e.g., use v2024 for the FFRI Dataset 2024).

To process non-PE files, include the `--not-pe-only` flag:
```
docker run -v /data:/work/data -v /out_dir:/work/out_dir ffridataset-scripts main.py --csv ./data/target.csv --out ./out_dir --log ./dataset.log --ver <version> --not-pe-only
```

## Run This Script Natively

### Prepare To Use

**Attention** We recommend running the following commands in the working directory (the ffridataset-scripts directory).
```
export LC_ALL=C.UTF-8
export LANG=C.UTF-8

sudo apt update
sudo apt install -y --no-install-recommends wget git gcc g++ make autoconf libfuzzy-dev unar cmake mlocate libssl-dev libglib2.0-0 curl libboost-regex-dev libboost-program-options-dev libboost-system-dev libboost-filesystem-dev build-essential libpcre2-dev libdouble-conversion-dev
sudo apt install -y --no-install-recommends libqt5core5a libqt5svg5 libqt5gui5 libqt5widgets5 libqt5opengl5 libqt5dbus5 libqt5scripttools5 libqt5script5 libqt5network5 libqt5sql5
sudo apt install -y --no-install-recommends libffi-dev libncurses5-dev zlib1g zlib1g-dev libreadline-dev libbz2-dev libsqlite3-dev liblzma-dev
sudo apt install -y --no-install-recommends software-properties-common gpg-agent gpg clang
wget https://github.com/horsicq/DIE-engine/releases/download/3.09/die_3.09_Ubuntu_22.04_amd64.deb
sudo apt --fix-broken install ./die_3.09_Ubuntu_22.04_amd64.deb
rm die_3.09_Ubuntu_22.04_amd64.deb

wget mark0.net/download/trid_linux_64.zip
unar trid_linux_64.zip
cp trid_linux_64/trid ./
chmod u+x trid
cp triddefs_dir/triddefs-dataset2024.trd triddefs.trd

cd workspace

git clone https://github.com/JPCERTCC/impfuzzy.git
cd impfuzzy
git checkout b30548d005c9d980b3e3630648b39830597293fc
cd ../

git clone https://github.com/JusticeRage/Manalyze.git
cd Manalyze
git checkout b6800ffcf2f7f4e82fe1f94d0eb2736e75e175ec
cmake .
make
cd ../

git clone https://github.com/lief-project/LIEF.git
cd LIEF
git checkout 573c885de5a2bb217d4d0255b54f9b53d9a4d7c9
git apply ../../patches/lief.patch
cd ../

git clone https://github.com/trendmicro/tlsh.git
cd tlsh
git checkout 96536e3f5b9b322b44ce88d36126121685e45a77
./make.sh
cd ../

git clone https://github.com/erocarrera/pefile.git
cd pefile
git checkout ceab92e003b3436d2e52b74e9c903e812a4aeae1
cd ../../

wget https://github.com/ninja-build/ninja/releases/download/v1.12.1/ninja-linux.zip
unar ninja-linux.zip
sudo mv ninja /usr/bin/

poetry install --no-root
```

If something goes wrong, refer to the Dockerfile.
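
As a quick sanity check after running the commands above, you can confirm that the expected files and clones are in place before moving on. This is just a convenience sketch; the paths simply mirror the commands in this section.

```
import shutil
from pathlib import Path

# Files and directories created by the setup commands above.
expected = [
    Path("trid"),
    Path("triddefs.trd"),
    Path("workspace/impfuzzy"),
    Path("workspace/Manalyze"),
    Path("workspace/LIEF"),
    Path("workspace/tlsh"),
    Path("workspace/pefile"),
]
for item in expected:
    print(f"{item}: {'OK' if item.exists() else 'MISSING'}")

# ninja was moved to /usr/bin, so it should be resolvable on PATH.
print("ninja:", shutil.which("ninja") or "MISSING")
```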

### Run Tests

**Attention** Do not store a file named `test.exe` in the working directory. The test script copies `testbin/test.exe` into the directory and then removes it.
```
poetry run python test_main.py
```

### Make Datasets

Before running this script, you need to create a CSV file as described in the [Make A CSV File](#make-a-csv-file) section and specify its file path as an argument. Unlike when using Docker, file paths can be specified as full paths.

**Attention** Do not store malware and cleanware in the working directory. This script copies malware and cleanware into the directory and then removes them.

```
poetry run python main.py --csv <csv file path> --out <output directory> --log <log file path> --ver <version>
```

## Notes About Hashes

- TLSH may sometimes be an empty string. This occurs because a file must possess a sufficient level of complexity to generate a valid TLSH. For more details, visit https://github.com/trendmicro/tlsh/blob/master/README.md.
- The peHashes (crits, endgame, and totalhash) can be null due to bugs in their implementations (a consumer-side check is sketched after this list).
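
When consuming the generated JSON files, it is worth handling both cases explicitly. The snippet below is only a sketch: the field names (`hashes`, `tlsh`, and the peHash keys) are assumptions for illustration, so inspect a generated JSON file to confirm the actual schema before relying on it.

```
import json
from pathlib import Path

# NOTE: the "hashes" key and the hash names below are assumed for illustration;
# check the JSON files produced by main.py for the real field names.
for json_path in sorted(Path("out_dir").glob("**/*.json")):
    record = json.loads(json_path.read_text())
    hashes = record.get("hashes", {})
    if not hashes.get("tlsh"):
        print(f"{json_path.name}: TLSH is empty (file not complex enough)")
    for name in ("crits", "endgame", "totalhash"):
        if name in hashes and hashes[name] is None:
            print(f"{json_path.name}: peHash '{name}' is null")
```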

## Notes About TrID Definition File

- The TrID definition files located in [triddefs_dir](triddefs_dir) are redistributed with permission from the TrID author, Marco Pontello.
- The latest definition file can be obtained from the [TrID website](https://mark0.net/soft-trid-e.html).

## Tested

- Ubuntu 22.04.2 LTS
- Ubuntu 22.04 on WSL2 on Windows 10

## Development

### Profiling Measurement

First, create two folders:
```
mkdir out_dir
mkdir measurement
```

Next, build a Docker image by specifying the measurement target:
```
docker build --target measurement --tag ffridataset-scripts .
```

Then, run the following command to generate executables and a CSV file:
```
docker run -v /testbin:/work/testbin -v /measurement:/work/measurement ffridataset-scripts poetry run python create_measurement_env.py
```

Now you're ready to do profiling. To generate a cProfile result file, run:
```
docker run -v /measurement:/work/data -v /out_dir:/work/out_dir ffridataset-scripts poetry run python -m cProfile -o ./out_dir/profiling.stats main.py --csv ./data/test.csv --out ./out_dir --log ./test.log --ver v2023
```

Then, execute the following command:
```
docker run -v /out_dir:/work/out_dir/ --rm -p 8080:8080 ffridataset-scripts poetry run snakeviz /work/out_dir/profiling.stats -s -p 8080 -H 0.0.0.0
```

Now, you can view the profiling results through your browser.
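
If you prefer a terminal summary over the snakeviz web UI, the same stats file can also be inspected with the standard-library `pstats` module. A minimal sketch, assuming `profiling.stats` is accessible on the host in `out_dir`:

```
import pstats

# Load the cProfile output produced above and print the 20 functions
# with the highest cumulative time.
stats = pstats.Stats("out_dir/profiling.stats")
stats.sort_stats("cumulative").print_stats(20)
```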

## Author

Yuki Mogi. © FFRI, Inc. 2019-2024

Koh M. Nakagawa. © FFRI, Inc. 2019-2024