https://github.com/jahez07/multimodal-fusion-strategy-to-classify-malware

This work focuses on proposing a novel approach towards classifying malware binaries by extracting visual features from malware executables.
deep-learning generative-adversarial-network multimodal-deep-learning python3 research-project

# Multimodal-Fusion-Strategy-to-Classify-Malware

All the resources that were used for this work:

(will be updated soon)
1. Big2015 Binary Dataset :
2. MalHub Binary Dataset : malhub_binary_root

This work focuses on proposing a novel approach towards classifying malware binaries by extracting visual features from malware executables.

The dataset used in this work is from Kaggle Challenge for Malware Classification, Big2015.

The Big2015 Malware Dataset consists of 9 families and 10,868 malware binary samples. It is a highly unbalanced dataset: a few families have more than 2,000 samples, a few have more than 1,000, and the others have fewer than 500.

The first malware visual representation used in this work is the grayscale image, which is generated from the decimal representation of the hex code extracted from the malware executables with a hex dump tool.

*(Image: hexadecimal values and their decimal equivalents.)*

The image above shows an example snippet of the hexadecimal values extracted from the hex code of a malware sample, together with the corresponding decimal values.

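The logic below extracts the hexadecimal byte values from the hex code. (refer to [hex_HDec.py](scripts/hex_hDec.py))
```python
import re

# `contents` holds the text of one sample's hex dump (not defined in this excerpt)
hex_regex = r'\b[0-9A-F]{2}\b'  # a standalone two-character uppercase hex byte
hex_codes = re.findall(hex_regex, contents)

# Join all matched bytes into a single string
hex_str = "".join(hex_codes)
```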

To convert hexadecimal to decimal, we used the code below. (refer to [HDec_Dec.py](scripts/HDec_dec.py))
```python
# Value of each hexadecimal digit
table = {'0': 0, '1': 1, '2': 2, '3': 3,
         '4': 4, '5': 5, '6': 6, '7': 7,
         '8': 8, '9': 9, 'A': 10, 'B': 11,
         'C': 12, 'D': 13, 'E': 14, 'F': 15}

# `hex_list` is the list of hexadecimal strings (not defined in this excerpt)
dec_list = []
for ele in hex_list:
    hexadecimal = ele.strip().upper()
    res = 0
    size = len(hexadecimal) - 1  # exponent of the leftmost digit
    for num in hexadecimal:
        res = res + table[num] * 16**size
        size = size - 1
    dec_list.append(res)
```
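For comparison, Python's built-in `int` accepts a base argument, so the same conversion can be sketched in a single line. The sample values in `hex_list` are made up for illustration and are not taken from the dataset.

```python
# Hypothetical sample of two-character hex byte strings (not from the dataset)
hex_list = ["4D", "5A", "90", "00", "ff"]

# int(text, 16) parses a hexadecimal string; strip() drops stray whitespace
dec_list = [int(ele.strip(), 16) for ele in hex_list]

print(dec_list)  # [77, 90, 144, 0, 255]
```

Every two-character byte maps to a value between 0 and 255, which is the range of an 8-bit grayscale pixel.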

### Grayscale Image (GS)

The extracted hexadecimal values are converted into decimal values, which are then used to generate Grayscale (GS) images. The code that generates the GS images is in [GS_Img.py](scripts/GS_Img.py).
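As a rough illustration of the idea, and not the logic of `GS_Img.py` itself, the decimal values can be laid out row by row so that each value becomes one pixel intensity. The function name `to_grayscale_rows`, the default width of 256, and the zero padding of the last row are all assumptions made for this sketch.

```python
def to_grayscale_rows(dec_values, width=256):
    """Arrange byte values (0-255) into rows of `width` pixels.

    The final partial row is zero-padded so every row has the same length.
    """
    rows = []
    for start in range(0, len(dec_values), width):
        row = list(dec_values[start:start + width])
        row.extend([0] * (width - len(row)))  # pad the last row with black pixels
        rows.append(row)
    return rows

# Hypothetical 10-byte sample laid out 4 pixels wide
pixels = to_grayscale_rows([77, 90, 144, 0, 3, 0, 0, 0, 4, 0], width=4)
print(pixels)  # [[77, 90, 144, 0], [3, 0, 0, 0], [4, 0, 0, 0]]
```

A nested list like this can be converted to a NumPy array of type `uint8` and saved as an 8-bit grayscale image.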

The generated grayscale images are then used to train an independent VGG-16 model. We chose VGG-16 because, among deep convolutional neural network models, we consider it the most lightweight compared to ResNet-50, InceptionNet, etc.

### Entropy Graph (EG)

The Entropy Graph is generated from the same decimal values, which are converted from the hexadecimal values extracted from the hex code of each malware sample. Here, entropy measures how random the byte values within a segment of the file are.

Below is the logic used for entropy extraction from decimal values:
```python
# creating an average entropy list of all the segments in the array
import math
import numpy as np

# `arr` (the decimal values of one sample) and `en` (the entropy helper)
# are not defined in this excerpt.
segment_size = 256
averages = []
for i in range(0, len(arr), segment_size):
    subset = arr[i:i+segment_size]
    entropy = 0
    for element in subset:
        prob = np.unique(element, return_counts=True)
        entropy += en(prob)
    # average entropy of the current segment
    average_entropy = float(entropy / segment_size)
    averages.append(average_entropy)
```
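Because the `en` helper is not shown, the following is a minimal sketch of one common way to compute a per-segment entropy, using the Shannon entropy of the value distribution inside each segment. It is not necessarily what `EntropyGraph.py` does; the function names `shannon_entropy` and `segment_entropies` and the toy data are assumptions.

```python
import math
from collections import Counter

def shannon_entropy(segment):
    """Shannon entropy, in bits, of the values in one segment."""
    counts = Counter(segment)
    total = len(segment)
    # sum over distinct values of p * log2(1 / p), with p = count / total
    return sum((c / total) * math.log2(total / c) for c in counts.values())

def segment_entropies(values, segment_size=256):
    """Entropy of each consecutive, non-overlapping segment of `values`."""
    return [shannon_entropy(values[i:i + segment_size])
            for i in range(0, len(values), segment_size)]

# Hypothetical data: a constant segment, then one with four equally likely values
data = [0] * 8 + [1, 2, 3, 4] * 2
print(segment_entropies(data, segment_size=8))  # [0.0, 2.0]
```

A constant segment has zero entropy, while a segment of 256 distinct byte values would reach the maximum of 8 bits. Plotting the list against the segment index gives an entropy curve for the file.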

Run [EntropyGraph.py](scripts/EntropyGraph.py) to generate Entropy Graph from decimal values of malware samples.

### Simhash Image (SH)

Unlike the Grayscale and Entropy images, Simhash images are not generated from the decimal values. Instead, we use the assembly code of a malware sample to extract its operation codes, which are then passed through a hash function such as MD5 to generate Simhash images.

The assembly code of the malware sample is the starting point, from which we extract the operation codes, or opcodes (e.g. push, mov, call, test). These mnemonics are then used to generate a Simhash signature for each malware sample using the MD5 hash function. Refer to [asm_op.py](scripts/asm_op.py) for the mnemonic extraction logic.

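[op_sim.py](scripts/op_sim.py) generates Simhash signatures from the mnemonics of the malware samples.
The logic for generating a Simhash signature is shown below.
```python
# `keywords` is the list of opcodes, `hash_function` maps an opcode to an
# integer hash, and `v` and `s` are expected to be length-n lists
# initialised to zero; none of them are defined in this excerpt.

# Calculate the hash value for each keyword and update the 'v' vector
for keyword in keywords:
    b = hash_function(keyword)
    for i in range(n):
        if (b >> i) & 1 == 1:
            v[i] += 1
        else:
            v[i] -= 1

# Bit i of the signature is 1 when the accumulated vote v[i] is positive
for i in range(n):
    if v[i] > 0:
        s[i] = 1
    else:
        s[i] = 0
```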
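The voting scheme above can be sketched end to end with the standard library. This is an illustration under assumptions, not a copy of `op_sim.py`: the helper names `md5_int` and `simhash_bits`, the signature length of 128 bits (the size of one MD5 digest), and the opcode lists are all hypothetical.

```python
import hashlib

def md5_int(token):
    """128-bit integer hash of a token, derived from its MD5 digest."""
    return int.from_bytes(hashlib.md5(token.encode("utf-8")).digest(), "big")

def simhash_bits(tokens, n=128):
    """Return the n-bit Simhash signature of `tokens` as a list of 0/1 values."""
    v = [0] * n
    for token in tokens:
        b = md5_int(token)
        for i in range(n):
            # each token votes +1 where its hash bit is set, -1 otherwise
            v[i] += 1 if (b >> i) & 1 else -1
    return [1 if total > 0 else 0 for total in v]

# Hypothetical opcode sequences that differ only in the last opcode
sig_a = simhash_bits(["push", "mov", "call", "test", "jmp"])
sig_b = simhash_bits(["push", "mov", "call", "test", "ret"])

# Number of bit positions where the two signatures disagree
hamming = sum(x != y for x, y in zip(sig_a, sig_b))
print(len(sig_a), hamming)
```

Samples that share most of their opcodes tend to produce signatures with a small Hamming distance, which is why the signature is a useful basis for an image representation.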

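These Simhash signatures are then used to generate Simhash images. (Refer to [SimImg.py](scripts/SimImg.py))

```python
import numpy as np
from PIL import Image as im

# `content` is the text of one signature: whitespace-separated 0/1 bits
# (not defined in this excerpt)
sim = content.split()
sim_list = [int(ele) for ele in sim]

# Arrange the 512 bits as a 16 x 32 grid and scale 0/1 to 0/255
array_2d = np.array(sim_list).reshape(16, 32) * 255

image = im.fromarray(array_2d.astype(np.uint8), mode='L')
```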

The generated Simhash images are non-square and cannot be processed directly by the proposed VGG-16 model, so we resize them without losing their integrity using **Bilinear Interpolation**. In bilinear interpolation, the original image of size *(m × n)* is resized to *(a × b)*, where *a* and *b* are both set to 224 in this work to match the VGG architecture. (Refer to [BilinearInterpolation.py](scripts/BI.py))
```python
import math
import numpy as np

def bl_resize(original_img, new_h, new_w):
    # get dimensions of original image
    old_h, old_w = original_img.shape
    # create an array of the desired shape; values are filled in below
    resized = np.zeros((new_h, new_w))
    # calculate horizontal and vertical scaling factors
    w_scale_factor = old_w / new_w if new_w != 0 else 0
    h_scale_factor = old_h / new_h if new_h != 0 else 0
    for i in range(new_h):
        for j in range(new_w):
            # map the coordinates back to the original image
            x = i * h_scale_factor
            y = j * w_scale_factor
            # calculate the coordinates of the 4 surrounding pixels
            x_floor = math.floor(x)
            x_ceil = min(old_h - 1, math.ceil(x))
            y_floor = math.floor(y)
            y_ceil = min(old_w - 1, math.ceil(y))

            if (x_ceil == x_floor) and (y_ceil == y_floor):
                # the point falls exactly on a source pixel
                q = original_img[int(x), int(y)]
            elif x_ceil == x_floor:
                # interpolate along y only
                q1 = original_img[int(x), int(y_floor)]
                q2 = original_img[int(x), int(y_ceil)]
                q = q1 * (y_ceil - y) + q2 * (y - y_floor)
            elif y_ceil == y_floor:
                # interpolate along x only
                q1 = original_img[int(x_floor), int(y)]
                q2 = original_img[int(x_ceil), int(y)]
                q = (q1 * (x_ceil - x)) + (q2 * (x - x_floor))
            else:
                v1 = original_img[x_floor, y_floor]
                v2 = original_img[x_ceil, y_floor]
                v3 = original_img[x_floor, y_ceil]
                v4 = original_img[x_ceil, y_ceil]

                # interpolate along x, then along y
                q1 = v1 * (x_ceil - x) + v2 * (x - x_floor)
                q2 = v3 * (x_ceil - x) + v4 * (x - x_floor)
                q = q1 * (y_ceil - y) + q2 * (y - y_floor)
            resized[i, j] = q
    return resized.astype(np.uint8)
```


| Family | Grayscale Image | Entropy Graph | Simhash Image |
| --- | --- | --- | --- |
| Gatak | (image) | (image) | (image) |
| Kelihos_ver1 | (image) | (image) | (image) |
| Kelihos_ver3 | (image) | (image) | (image) |

## Proposed Methodology

*(Figure: overview of the proposed process.)*

## Experiment 1:
### Effectiveness of GS, EG, and SH VGG16 models in classifying malware binaries

The primary experiment in this work evaluates the performance of VGG16 models on the individual malware visual features. For this, we designed a new architecture that adds layers on top of VGG16 while freezing its pre-trained weights.

*(Figure: architecture of the proposed model.)*

Each malware visual feature (Grayscale Image, Entropy Graph, and Simhash Image) is trained separately on its own instance of the proposed VGG16 architecture, and the performances are analysed.

The table at the end of this section shows the performance of the independent Grayscale (GS), Entropy Graph (EG), and Simhash (SH) VGG-16 models.

Refer to [VGG_16_Independent](scripts/VGG_16_Independent.ipynb) for the code of the independent VGG16 models trained on the three malware visual features (GS, EG, SH). The code below defines the model that was designed.

```python
import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Conv2D, MaxPooling2D, Flatten, Dense, Dropout, Conv1D
from tensorflow.keras import layers
from keras_tuner.tuners import RandomSearch
from tensorflow import keras
from tensorflow.keras.applications import VGG16

def build_model(hp):

    model_1 = Sequential()

    vgg = VGG16(weights='imagenet', include_top=False, input_shape=(224, 224, 3))

    # Freeze the weights of all layers in the VGG16 model
    for layer in vgg.layers:
        layer.trainable = False

    # Add the VGG16 model to your own model
    model_1.add(vgg)

    # Add a 1D convolutional layer
    model_1.add(Conv1D(filters=32, kernel_size=3, activation='relu'))  # Example parameters, you can tune these

    # Remove the Flatten layer to maintain the spatial structure
    # model_1.add(Flatten())

    # Add the dense layer
    model_1.add(Dense(units=hp.Int('dense_units', min_value=32, max_value=512, step=32),
                      activation='relu'))

    model_1.add(Dropout(hp.Float('dropout', min_value=0.0, max_value=0.5, step=0.1)))

    # adding batch normalization layer
    model_1.add(keras.layers.BatchNormalization())

    model_1.add(Dense(units=hp.Int('extra_dense_units', min_value=32, max_value=512, step=32),
                      activation='relu'))

    # Add another Conv1D layer before the output layer
    model_1.add(Conv1D(filters=64, kernel_size=3, activation='relu'))

    # Flatten the output before the final dense layer
    model_1.add(Flatten())

    # Add the output layer (one unit per malware family)
    model_1.add(Dense(units=9, activation='softmax'))

    # Compile the model
    model_1.compile(optimizer=keras.optimizers.Adam(hp.Choice('learning_rate', values=[1e-2, 1e-3])),
                    loss='categorical_crossentropy',
                    metrics=['accuracy'])

    return model_1
```

| Feature | Accuracy | Precision | Recall | F1-Score | Time |
| --- | --- | --- | --- | --- | --- |
| GS | 0.998 | 1.0 | 0.996 | 0.997 | 0.01 |
| EG | 0.999 | 0.998 | 0.998 | 0.998 | 0.01 |
| SH | 0.996 | 0.966 | 0.963 | 0.963 | 0.01 |

Comparing the results, Entropy Graph is the best-performing of the three malware visual features (GS, EG, SH), with the highest accuracy and F1-score.
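For readers unfamiliar with the columns in the table above, the sketch below shows how precision, recall and F1-score can be computed for a multi-class problem. It assumes macro averaging (an unweighted mean over classes); the README does not state which averaging was used for the reported numbers, and the function name `macro_scores` and the toy labels are hypothetical.

```python
def macro_scores(y_true, y_pred):
    """Macro-averaged precision, recall and F1 over the classes in y_true."""
    labels = sorted(set(y_true))
    precisions, recalls, f1s = [], [], []
    for label in labels:
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        precisions.append(precision)
        recalls.append(recall)
        f1s.append(f1)
    n = len(labels)
    return sum(precisions) / n, sum(recalls) / n, sum(f1s) / n

# Hypothetical labels for a two-family toy problem
y_true = ["Gatak", "Gatak", "Kelihos_ver3", "Kelihos_ver3"]
y_pred = ["Gatak", "Kelihos_ver3", "Kelihos_ver3", "Kelihos_ver3"]
precision, recall, f1 = macro_scores(y_true, y_pred)
print(round(precision, 3), round(recall, 3), round(f1, 3))  # 0.833 0.75 0.733
```

On an unbalanced dataset such as Big2015, macro averaging gives small families the same weight as large ones, so it can differ noticeably from accuracy.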

## Experiment 2:
In this work we go further and merge both the inputs and the models, to see whether the merged model outperforms the independent VGG16 models. The merging is done with the merge layers of the Keras library.

Refer to https://keras.io/api/layers/merging_layers/

There are several merge operators/layers, but this work focuses on four:
1. Concatenate Layer
2. Add Layer
3. Average Layer
4. Maximum Layer

Each layer has its own use cases and advantages. Our results indicate that, for malware detection and classification with merged visual features, the concatenate layer is the most effective.
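To make the difference between the four operators concrete, here is a plain-Python sketch of what each one does to two feature vectors of the same length. The vectors and their names are made up for illustration; the Keras layers apply the same operations to tensors.

```python
# Hypothetical 3-element feature vectors from two modality branches
gs_features = [0.25, 0.75, 0.5]
eg_features = [0.5, 0.25, 0.5]

# Concatenate keeps every value from both branches, so the length doubles
concatenated = gs_features + eg_features

# Add, Average and Maximum combine values element-wise and keep the length
added = [a + b for a, b in zip(gs_features, eg_features)]
averaged = [(a + b) / 2 for a, b in zip(gs_features, eg_features)]
maximum = [max(a, b) for a, b in zip(gs_features, eg_features)]

print(len(concatenated))  # 6
print(added)              # [0.75, 1.0, 1.0]
print(averaged)           # [0.375, 0.5, 0.5]
print(maximum)            # [0.5, 0.75, 0.5]
```

Concatenation is the only one of the four that passes each modality's features through unchanged, at the cost of a wider input to the following layers.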

| Features | GS | EG | SH |
| -- | -- | -- | -- |
| GS |:x:|:white_check_mark:|:white_check_mark:|
| EG |:x:|:x:|:white_check_mark:|
| GS, EG |:x:|:x:|:white_check_mark:|

Each of these combinations is considered for all four operators (Concatenation, Add, Average, Maximum).
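The resulting experiment grid can be enumerated as follows. This is a small sketch for counting the configurations, using the modality and operator names from this README; the variable names are hypothetical.

```python
from itertools import combinations, product

modalities = ["GS", "EG", "SH"]
operators = ["Concatenate", "Add", "Average", "Maximum"]

# All groups of two or three modalities: GS+EG, GS+SH, EG+SH, GS+EG+SH
groups = [combo for size in (2, 3) for combo in combinations(modalities, size)]

# Every (group, operator) pair
experiments = list(product(groups, operators))

print(len(groups), len(experiments))  # 4 16
```

The four groups correspond to the four check marks in the table above, giving 16 merged configurations in total.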

### Concatenation
Concatenates a list of inputs. It takes as input a list of tensors, all of the same shape except for the concatenation axis, and returns a single tensor that is the concatenation of all inputs.

Before concatenating, we load the three modalities of malware visual features and split each of them into training and test sets. We then define a VGG16 model for each modality; these models are concatenated to form the proposed model.

The code below shows a VGG16 branch, followed by the concatenation of the branches.

VGG16:
```python
import tensorflow as tf
from tensorflow.keras.layers import Dense, Flatten

# `hex_in` is the Keras input layer for the hex (grayscale) images;
# it is not defined in this excerpt.

# Designing the first VGG16 model for Hex Images
hex_ = tf.keras.applications.VGG16(weights='imagenet', include_top=False, input_shape=(224, 224, 3))

hex_._name = 'hex_vgg'

# Freeze the layers in the VGG16 model so that they are not trained during training
for layer in hex_.layers:
    layer.trainable = False

# Pass the input through the VGG16 model
hex_vgg_output = hex_(hex_in)

# Add a classifier on top of the model
#hex_model = Flatten(name = 'hex_flatten')(hex_vgg_output)
hex_model = Dense(512, activation='relu', name='hex_dense')(hex_vgg_output)
```

[merge_operations](scripts/merge.py) loads the independent modality models and merges them using the different operators.

Concatenation

```python
from tensorflow.keras.layers import Conv1D, Dense, Flatten, concatenate
from tensorflow.keras.models import Model
from tensorflow.keras.regularizers import l2

# Concatenate the outputs of the three models
merged = concatenate([hex_model, eg_model, sh_model])

# Add one or more dense layers on top of the merged output

# Add 1D convolutional layers
conv1 = Conv1D(filters=515, kernel_size=3, strides=1, activation='relu')(merged)
flatten = Flatten()(conv1)
dense1 = Dense(512, activation='relu')(flatten)
dense2 = Dense(256, activation='relu')(dense1)
dense3 = Dense(128, activation='relu')(dense2)
dense4 = Dense(64, activation='relu', kernel_regularizer=l2(0.01))(dense3)
output = Dense(9, activation='softmax')(dense4)

# Define the model. Every branch's input layer must be listed; `eg_in` and
# `sh_in` are assumed names for the EG and SH inputs, mirroring `hex_in`.
merged_model = Model(inputs=[hex_in, eg_in, sh_in], outputs=output, name='merged_model')

# Compile the model
merged_model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])
```

After analyzing the outputs of all four operators on the different combinations of malware modalities, we conclude that the `concatenation` operator applied to all three modalities (GS, EG, SH) yields the highest-performing model, with an F1-score of 0.99.