https://github.com/jahez07/multimodal-fusion-strategy-to-classify-malware

This work focuses on proposing a novel approach towards classifying malware binaries by extracting visual features from malware executables.
Host: GitHub
URL: https://github.com/jahez07/multimodal-fusion-strategy-to-classify-malware
Owner: jahez07
Created: 2024-04-17T12:19:40.000Z (10 months ago)
Default Branch: main
Last Pushed: 2024-07-29T17:21:40.000Z (6 months ago)
Last Synced: 2024-11-16T03:42:06.792Z (3 months ago)
Topics: deep-learning, generative-adversarial-network, multimodal-deep-learning, python3, research-project
Language: Jupyter Notebook
Homepage:
Size: 257 MB
Stars: 1
Watchers: 1
Forks: 0
Open Issues: 0
Metadata Files:
- Readme: README.md

README

# Multimodal-Fusion-Strategy-to-Classify-Malware

All the resources that were used for this work:


(will be updated soon)

1. Big2015 Binary Dataset :

2. MalHub Binary Dataset : malhub_binary_root

This work focuses on proposing a novel approach towards classifying malware binaries by extracting visual features from malware executables. 

The dataset used in this work is Big2015, from the Kaggle Microsoft Malware Classification Challenge.

The Big2015 malware dataset consists of 10,868 malware binary samples from 9 families. It is highly imbalanced: a few families have more than 2,000 samples, a few have more than 1,000, and the others have fewer than 500.

The first malware visual representation we use in this work is the grayscale image, which is generated from the decimal representation of the hex code extracted from the malware executables using a hex dump tool.





The image above shows an example snippet of the hexadecimal values extracted from the hex codes of a malware sample, together with the corresponding decimal values.

The logic below was used to convert hex codes into hexadecimal values (refer to [hex_HDec.py](scripts/hex_hDec.py)).

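```python
import re

# `contents` is assumed to hold the text of the hex dump read earlier in the script.
# Match standalone two-character uppercase hex tokens (one byte each).
hex_regex = r'\b[0-9A-F]{2}\b'
hex_codes = re.findall(hex_regex, contents)

# Join all byte tokens into a single hexadecimal string.
hex_str = ""
for ele in hex_codes:
    hex_str += ele
```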
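As a small worked example of the extraction step, the snippet below applies the same regular expression to a single line of a hex dump. The sample line is an assumption about the dump layout (an address followed by byte values, with `??` for unknown bytes), and `extract_hex_codes` is a hypothetical helper name rather than a function from `hex_hDec.py`.

```python
import re

# Same pattern as above: a standalone token of exactly two uppercase hex digits.
HEX_BYTE = re.compile(r'\b[0-9A-F]{2}\b')

def extract_hex_codes(contents):
    """Return every standalone two-character uppercase hex token in `contents`."""
    return HEX_BYTE.findall(contents)

# Assumed sample line: an 8-digit address followed by byte values.
sample = "00401000 56 8D 44 24 08 50 8B F1 ?? ??"
print(extract_hex_codes(sample))
# ['56', '8D', '44', '24', '08', '50', '8B', 'F1']
```

The 8-digit address is skipped because the word boundaries in the pattern require the whole token to be exactly two characters long, and `??` is skipped because it contains no hex digits.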

To convert hexadecimal to decimal, we used the code below (refer to [HDec_Dec.py](scripts/HDec_dec.py)).

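```python
# Lookup table mapping each hexadecimal digit to its decimal value.
table = {'0': 0, '1': 1, '2': 2, '3': 3,
         '4': 4, '5': 5, '6': 6, '7': 7,
         '8': 8, '9': 9, 'A': 10, 'B': 11,
         'C': 12, 'D': 13, 'E': 14, 'F': 15}

# `hex_list` is assumed to be the list of hexadecimal strings from the previous step.
dec_list = []
for ele in hex_list:
    hexadecimal = ele.strip().upper()
    res = 0
    size = len(hexadecimal) - 1
    # Add digit * 16**position, starting from the most significant digit.
    for num in hexadecimal:
        res = res + table[num] * 16**size
        size = size - 1
    dec_list.append(res)
```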
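For comparison, Python's built-in `int(text, 16)` performs the same base-16 conversion as the lookup-table loop above. The following is a minimal sketch, not the code in `HDec_dec.py`; `hex_to_decimal` is a hypothetical helper name.

```python
def hex_to_decimal(hex_list):
    """Convert hex strings such as '8D' to integers using base-16 parsing."""
    return [int(ele.strip(), 16) for ele in hex_list]

# Worked example: 8D = 8 * 16 + 13 = 141
print(hex_to_decimal(["56", "8D", "FF", "00"]))  # [86, 141, 255, 0]
```

Because each token is one byte, every resulting value lies between 0 and 255, which is the range needed for an 8-bit grayscale pixel.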

### Grayscale Image (GS)

The extracted hexadecimal values are then converted into decimal values, which are used to generate grayscale (GS) images. The code that generates GS images is in [GS_Img.py](scripts/GS_Img.py).

The generated grayscale images are then used to train an independent VGG-16 model. We chose VGG-16 because, among deep convolutional neural network models, we consider it the most lightweight compared to ResNet-50, InceptionNet, etc.
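The idea of turning byte values into a grayscale image can be sketched as follows. This is not the code in `GS_Img.py`: the fixed row width and the zero padding of the last row are assumptions made for illustration, and `bytes_to_rows` is a hypothetical helper name. Each decimal value (0 to 255) becomes the intensity of one pixel.

```python
def bytes_to_rows(values, width=256):
    """Arrange byte values (0-255) into image rows of `width` pixels.

    The final row is padded with zeros (black) so every row has equal length.
    """
    if width <= 0:
        raise ValueError("width must be positive")
    padding = (-len(values)) % width
    padded = list(values) + [0] * padding
    return [padded[i:i + width] for i in range(0, len(padded), width)]

rows = bytes_to_rows([86, 141, 68, 36, 8], width=4)
print(rows)  # [[86, 141, 68, 36], [8, 0, 0, 0]]
```

The resulting rows can be converted with `numpy.array(rows, dtype=numpy.uint8)` and passed to Pillow's `Image.fromarray` to save an 8-bit grayscale image.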

### Entropy Graph (EG)

The entropy graph is generated from the same decimal values, converted from the hexadecimal values extracted from the hex code of each malware sample. Here, entropy measures how random the byte values in a segment of the file are.

The logic below was used to extract entropy from the decimal values:

```python
# Create a list of the average entropy of every segment in the array.
import math
import numpy as np

# Assumed to be defined elsewhere in EntropyGraph.py:
#   arr - the decimal values of one malware sample
#   en  - the function that computes entropy from the result of np.unique
segment_size = 256
averages = []
for i in range(0, len(arr), segment_size):
    subset = arr[i:i+segment_size]
    entropy = 0
    for element in subset:
        prob = np.unique(element, return_counts=True)
        entropy += en(prob)
    average_entropy = entropy / segment_size
    average_entropy = float(average_entropy)
    averages.append(average_entropy)
```

Run [EntropyGraph.py](scripts/EntropyGraph.py) to generate Entropy Graph from decimal values of malware samples.
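To make the per-segment entropy idea concrete, here is a minimal standard-library sketch that computes the Shannon entropy, H = Σ p · log2(1/p), of each 256-value segment. It is an illustration under stated assumptions, not the logic of `EntropyGraph.py`: the helper names `shannon_entropy` and `segment_entropies` are hypothetical, and the script's own `en` function may be defined differently.

```python
import math
from collections import Counter

def shannon_entropy(segment):
    """Shannon entropy, in bits, of the value distribution in `segment`."""
    total = len(segment)
    if total == 0:
        return 0.0
    counts = Counter(segment)
    # p = c / total, so log2(1 / p) = log2(total / c)
    return sum((c / total) * math.log2(total / c) for c in counts.values())

def segment_entropies(values, segment_size=256):
    """Entropy of each consecutive block of `segment_size` values."""
    return [shannon_entropy(values[i:i + segment_size])
            for i in range(0, len(values), segment_size)]

constant = [0] * 256        # one repeated value: no uncertainty
uniform = list(range(256))  # every byte value once: maximum uncertainty
print(segment_entropies(constant + uniform))  # [0.0, 8.0]
```

Plotting the returned list against the segment index gives an entropy curve; packed or encrypted regions of a binary tend toward the 8-bit maximum, while padding and repeated data stay near zero.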

### Simhash Image (SH)

Unlike the grayscale images and entropy graphs, the Simhash images used in this work are not generated from decimal values. Instead, we use the assembly code of a malware sample to extract its operation codes, which are then passed through a hash function such as MD5 to generate Simhash images.

The assembly code of a malware sample is the starting point, from which we extract the operation codes, or opcodes (e.g., push, mov, call, test). These mnemonics are then used to generate a Simhash signature for each malware sample using the MD5 hash function. Refer to [asm_op.py](scripts/asm_op.py) for the mnemonic extraction logic.
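The opcode extraction step can be sketched as below. This is not the logic of `asm_op.py`: the line layout of the sample, the small `OPCODES` vocabulary, and the helper name `extract_opcodes` are all assumptions made for illustration.

```python
import re

# Hypothetical, deliberately small opcode vocabulary for illustration.
OPCODES = {"push", "mov", "call", "test", "lea", "pop", "ret", "jmp"}

def extract_opcodes(asm_lines):
    """Return the first known mnemonic found on each assembly line."""
    opcodes = []
    for line in asm_lines:
        code = line.split(";", 1)[0]  # drop trailing comments
        for token in re.findall(r"\b[a-z]+\b", code):
            if token in OPCODES:
                opcodes.append(token)
                break  # keep at most one mnemonic per line
    return opcodes

# Assumed sample lines in a disassembler-style layout.
sample = [
    ".text:00401000 56             push    esi",
    ".text:00401001 8D 44 24 08    lea     eax, [esp+8]",
    ".text:00401005 E8 10 00 00 00 call    sub_40101A ; call helper",
]
print(extract_opcodes(sample))  # ['push', 'lea', 'call']
```

Matching against a fixed vocabulary avoids mistaking section names, registers, or comment text for instructions, at the cost of having to list every mnemonic of interest.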

[op_sim.py](scripts/op_sim.py) contains the code that generates Simhash signatures from the mnemonics of malware samples.

The logic for generating a Simhash signature is shown below.

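```python
# Assumed to be defined earlier in op_sim.py:
#   keywords      - the opcode mnemonics of one sample
#   hash_function - returns an integer hash (e.g. derived from MD5)
#   n             - the number of bits in the signature
#   v, s          - lists of n zeros
# Calculate the hash value for each keyword and update the 'v' vector.
for keyword in keywords:
    b = hash_function(keyword)
    for i in range(n):
        if (b >> i) & 1 == 1:
            v[i] += 1
        else:
            v[i] -= 1

# Derive the signature bits from the signs of the accumulated weights.
for i in range(n):
    if v[i] > 0:
        s[i] = 1
    else:
        s[i] = 0
```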
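A self-contained version of the same voting scheme is sketched below using only the standard library. The choice of `n = 128` is an assumption that matches the width of an MD5 digest; the image code in this README reshapes 512 values, so the signature length used in `op_sim.py` may differ. The helper names `md5_int` and `simhash` are hypothetical.

```python
import hashlib

def md5_int(keyword):
    """Return the 128-bit MD5 digest of `keyword` as an integer."""
    return int(hashlib.md5(keyword.encode("utf-8")).hexdigest(), 16)

def simhash(keywords, n=128):
    """Compute an n-bit Simhash signature (list of 0/1) for a keyword list."""
    v = [0] * n
    for keyword in keywords:
        b = md5_int(keyword)
        for i in range(n):
            # Each keyword votes +1 for a set bit and -1 for a clear bit.
            v[i] += 1 if (b >> i) & 1 else -1
    # A bit of the signature is 1 only where the votes are positive.
    return [1 if total > 0 else 0 for total in v]

signature = simhash(["push", "mov", "call", "test", "mov"])
print(len(signature))  # 128
```

Because every keyword only adds to or subtracts from the per-bit totals, the signature depends on which opcodes occur and how often, not on their order, so samples with similar opcode distributions tend to receive similar signatures.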

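These Simhash signatures are then used to generate Simhash images (refer to [SimImg.py](scripts/SimImg.py)).

```python
import numpy as np
from PIL import Image as im  # assumed source of the `im` alias

# `content` is assumed to hold one signature as whitespace-separated 0/1 values.
sim = content.split()
sim_list = []
for ele in sim:
    el = int(ele)
    sim_list.append(el)

# Arrange the 512 bits as a 16 x 32 array and scale 0/1 to 0/255 (black/white).
array_2d = np.array(sim_list).reshape(16, 32) * 255
image = im.fromarray(array_2d.astype(np.uint8), mode='L')
```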

The generated Simhash images are non-square and cannot be processed by the proposed VGG-16 model, so we resize each generated image without losing its integrity using **Bilinear Interpolation**. In bilinear interpolation, the original image of size *(m × n)* is resized to *(a × b)*, where *a* and *b* are both set to 224 in this work to suit the VGG architecture. (Refer to [BilinearInterpolation.py](scripts/BI.py))

```python
import math
import numpy as np

def bl_resize(original_img, new_h, new_w):
    # Get the dimensions of the original image.
    old_h, old_w = original_img.shape
    # Create an array of the desired shape; the values are filled in below.
    resized = np.zeros((new_h, new_w))
    # Calculate the horizontal and vertical scaling factors.
    w_scale_factor = old_w / new_w if new_w != 0 else 0
    h_scale_factor = old_h / new_h if new_h != 0 else 0
    for i in range(new_h):
        for j in range(new_w):
            # Map the coordinates back to the original image.
            x = i * h_scale_factor
            y = j * w_scale_factor
            # Calculate the coordinates of the 4 surrounding pixels.
            x_floor = math.floor(x)
            x_ceil = min(old_h - 1, math.ceil(x))
            y_floor = math.floor(y)
            y_ceil = min(old_w - 1, math.ceil(y))
            if (x_ceil == x_floor) and (y_ceil == y_floor):
                # The point falls exactly on a source pixel.
                q = original_img[int(x), int(y)]
            elif x_ceil == x_floor:
                # Interpolate along the column axis only.
                q1 = original_img[int(x), int(y_floor)]
                q2 = original_img[int(x), int(y_ceil)]
                q = q1 * (y_ceil - y) + q2 * (y - y_floor)
            elif y_ceil == y_floor:
                # Interpolate along the row axis only.
                q1 = original_img[int(x_floor), int(y)]
                q2 = original_img[int(x_ceil), int(y)]
                q = (q1 * (x_ceil - x)) + (q2 * (x - x_floor))
            else:
                # Interpolate along rows first, then along columns.
                v1 = original_img[x_floor, y_floor]
                v2 = original_img[x_ceil, y_floor]
                v3 = original_img[x_floor, y_ceil]
                v4 = original_img[x_ceil, y_ceil]
                q1 = v1 * (x_ceil - x) + v2 * (x - x_floor)
                q2 = v3 * (x_ceil - x) + v4 * (x - x_floor)
                q = q1 * (y_ceil - y) + q2 * (y - y_floor)
            resized[i, j] = q
    return resized.astype(np.uint8)
```
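The four-neighbour weighting used in `bl_resize` can be checked by hand on a tiny image. The sketch below is a compact restatement of the same formula for a single point, written with plain lists; `bilinear_sample` is a hypothetical helper name, and it assumes non-negative coordinates that lie inside the image.

```python
def bilinear_sample(img, x, y):
    """Interpolate `img` (a list of rows) at fractional row x and column y."""
    x0, y0 = int(x), int(y)
    # Clamp the second neighbour so it never leaves the image.
    x1 = min(x0 + 1, len(img) - 1)
    y1 = min(y0 + 1, len(img[0]) - 1)
    dx, dy = x - x0, y - y0
    # Blend along the column axis on both rows, then blend the two rows.
    top = img[x0][y0] * (1 - dy) + img[x0][y1] * dy
    bottom = img[x1][y0] * (1 - dy) + img[x1][y1] * dy
    return top * (1 - dx) + bottom * dx

img = [[0, 10],
       [20, 30]]
print(bilinear_sample(img, 0.5, 0.5))  # 15.0
```

At the centre of the 2 × 2 image, each of the four pixels contributes a weight of 0.25, so the result is the mean of 0, 10, 20, and 30.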

  

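| Family | Grayscale Image | Entropy Graph | Simhash Image |
| --- | --- | --- | --- |
| Gatak | (image) | (image) | (image) |
| Kelihos_ver1 | (image) | (image) | (image) |
| Kelihos_ver3 | (image) | (image) | (image) |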

  

  

## Proposed Methodology 



## Experiment 1: 

### Effectiveness of GS, EG, and SH VGG16 models in classifying malware binaries

The first experiment in this work evaluates the performance of VGG16 models on the individual malware visual features. For this, we designed a new architecture that adds layers on top of the VGG16 architecture while keeping the pre-trained VGG16 weights frozen.

The image below depicts the architecture of the proposed model.



Each malware visual feature (Grayscale Image, Entropy Graph, and Simhash Image) is trained separately on its own instance of the proposed VGG16 architecture, and the performances are analysed.

The table below the code shows the performance of the independent Grayscale (GS), Entropy Graph (EG), and Simhash (SH) VGG-16 models.

Refer to [VGG_16_Independent](scripts/VGG_16_Independent.ipynb) for the code of the independent VGG16 models trained on the three malware visual features (GS, EG, SH). The code below defines the model that was designed.

```python
import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Conv2D, MaxPooling2D, Flatten, Dense, Dropout, Conv1D
from tensorflow.keras import layers
from keras_tuner.tuners import RandomSearch
from tensorflow import keras
from tensorflow.keras.applications import VGG16

def build_model(hp):
    model_1 = Sequential()
    vgg = VGG16(weights='imagenet', include_top=False, input_shape=(224, 224, 3))

    # Freeze the weights of all layers in the VGG16 model
    for layer in vgg.layers:
        layer.trainable = False

    # Add the VGG16 model to your own model
    model_1.add(vgg)

    # Add a 1D convolutional layer
    model_1.add(Conv1D(filters=32, kernel_size=3, activation='relu'))  # Example parameters, you can tune these

    # Remove the Flatten layer to maintain the spatial structure
    # model_1.add(Flatten())

    # Add the dense layer
    model_1.add(Dense(units=hp.Int('dense_units', min_value=32, max_value=512, step=32),
                      activation='relu'))
    model_1.add(Dropout(hp.Float('dropout', min_value=0.0, max_value=0.5, step=0.1)))

    # Add a batch normalization layer
    model_1.add(keras.layers.BatchNormalization())
    model_1.add(Dense(units=hp.Int('extra_dense_units', min_value=32, max_value=512, step=32),
                      activation='relu'))

    # Add another Conv1D layer before the output layer
    model_1.add(Conv1D(filters=64, kernel_size=3, activation='relu'))

    # Flatten the output before the final dense layer
    model_1.add(Flatten())

    # Add the output layer
    model_1.add(Dense(units=9, activation='softmax'))

    # Compile the model
    model_1.compile(optimizer=keras.optimizers.Adam(hp.Choice('learning_rate', values=[1e-2, 1e-3])),
                    loss='categorical_crossentropy',
                    metrics=['accuracy'])
    return model_1
```

| Feature | Accuracy | Precision | Recall | F1-Score | Time |
| --- | --- | --- | --- | --- | --- |
| GS | 0.998 | 1.0 | 0.996 | 0.997 | 0.01 |
| EG | 0.999 | 0.998 | 0.998 | 0.998 | 0.01 |
| SH | 0.996 | 0.966 | 0.963 | 0.963 | 0.01 |

The results show that, among the three features (GS, EG, and SH), Entropy Graph is the best-performing malware visual feature, with the highest accuracy (0.999) and F1-score (0.998).
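For readers who want to see how metrics of this kind are derived from predictions, here is a minimal standard-library sketch. It assumes macro averaging over the classes; this README does not state which averaging was used, so the choice is an assumption. The labels in the example are toy values made up for illustration, not results from this work, and `macro_scores` is a hypothetical helper name.

```python
def macro_scores(y_true, y_pred):
    """Return (accuracy, precision, recall, f1), macro-averaged over classes."""
    classes = sorted(set(y_true) | set(y_pred))
    pairs = list(zip(y_true, y_pred))
    precisions, recalls, f1s = [], [], []
    for c in classes:
        tp = sum(t == c and p == c for t, p in pairs)
        fp = sum(t != c and p == c for t, p in pairs)
        fn = sum(t == c and p != c for t, p in pairs)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        precisions.append(precision)
        recalls.append(recall)
        f1s.append(f1)
    accuracy = sum(t == p for t, p in pairs) / len(pairs)
    n = len(classes)
    return accuracy, sum(precisions) / n, sum(recalls) / n, sum(f1s) / n

# Toy labels for illustration only.
y_true = ["Gatak", "Gatak", "Kelihos_ver1", "Kelihos_ver3"]
y_pred = ["Gatak", "Kelihos_ver1", "Kelihos_ver1", "Kelihos_ver3"]
accuracy, precision, recall, f1 = macro_scores(y_true, y_pred)
print(accuracy)  # 0.75
```

On an imbalanced dataset such as Big2015, macro averaging weights every family equally, so a model can score a high accuracy while its macro F1-score is pulled down by errors on the small families.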

## Experiment 2:

In this experiment we go on to enhance the visual features by merging both the inputs and the models, to see whether the merged model performs better than the independent VGG16 models above. The merging was done using the merging layers in the Keras library.

Refer to https://keras.io/api/layers/merging_layers/

There are several merge operators/layers, but in this work we focused only on:

1. Concatenate Layer

2. Add Layer

3. Average Layer

4. Maximum Layer

Each layer has its own use cases and advantages. In this work we found that, for malware detection and classification with merged visual features, the concatenate layer performs best.
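The difference between the four operators is easiest to see on small vectors. The plain-Python functions below mimic the element-wise behaviour of the corresponding Keras merging layers on one-dimensional features; they are illustrative stand-ins, not the Keras layers themselves, and the input values are made up.

```python
def concatenate(*vectors):
    """Join feature vectors end to end (the output is longer than any input)."""
    return [x for vec in vectors for x in vec]

def add(*vectors):
    """Element-wise sum (all inputs must have the same length)."""
    return [sum(xs) for xs in zip(*vectors)]

def average(*vectors):
    """Element-wise mean."""
    return [sum(xs) / len(xs) for xs in zip(*vectors)]

def maximum(*vectors):
    """Element-wise maximum."""
    return [max(xs) for xs in zip(*vectors)]

gs = [1.0, 4.0]  # made-up features from one modality
eg = [3.0, 2.0]  # made-up features from another modality
print(concatenate(gs, eg))  # [1.0, 4.0, 3.0, 2.0]
print(add(gs, eg))          # [4.0, 6.0]
print(average(gs, eg))      # [2.0, 3.0]
print(maximum(gs, eg))      # [3.0, 4.0]
```

Concatenation is the only one of the four that keeps every input value separately, leaving it to the following layers to learn how to combine the modalities; the other three collapse the inputs into a single vector of the original size.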

| Features | GS | EG | SH |
| --- | --- | --- | --- |
| GS | :x: | :white_check_mark: | :white_check_mark: |
| EG | :x: | :x: | :white_check_mark: |
| GS, EG | :x: | :x: | :white_check_mark: |

Each of these combinations is considered for all four operators (Concatenation, Add, Average, Maximum).
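The resulting experiment grid can be enumerated programmatically. This is a small sketch using the standard library; the names `FEATURES`, `OPERATORS`, and `feature_combinations` are introduced here for illustration only.

```python
from itertools import combinations, product

FEATURES = ["GS", "EG", "SH"]
OPERATORS = ["Concatenation", "Add", "Average", "Maximum"]

def feature_combinations(features):
    """All subsets of two or more features, smallest subsets first."""
    return [combo
            for size in range(2, len(features) + 1)
            for combo in combinations(features, size)]

# Pair every feature combination with every merge operator.
experiments = list(product(feature_combinations(FEATURES), OPERATORS))
print(feature_combinations(FEATURES))
# [('GS', 'EG'), ('GS', 'SH'), ('EG', 'SH'), ('GS', 'EG', 'SH')]
print(len(experiments))  # 16
```

The four combinations correspond to the checked cells of the table above, giving 4 × 4 = 16 merged configurations in total.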

### Concatenation 

Concatenates a list of inputs. It takes as input a list of tensors, all of the same shape except for the concatenation axis, and returns a single tensor that is the concatenation of all inputs.

Before concatenating, we load the three modalities of malware visual features and split each of them into train and test sets separately. We then define a VGG16 model for each modality; these models are concatenated to form the proposed model.

The code below shows one of the VGG16 models, followed by their concatenation.

VGG16:

```python
import tensorflow as tf
from tensorflow.keras.layers import Dense

# `hex_in` is assumed to be the Keras input layer defined earlier in the script.
# Design the first VGG16 model for Hex Images
hex_ = tf.keras.applications.VGG16(weights='imagenet', include_top=False, input_shape=(224, 224, 3))
hex_._name = 'hex_vgg'

# Freeze the layers in the VGG16 model so that they are not trained during training
for layer in hex_.layers:
    layer.trainable = False

# Pass the input through the VGG16 model
hex_vgg_output = hex_(hex_in)

# Add a classifier on top of the model
#hex_model = Flatten(name = 'hex_flatten')(hex_vgg_output)
hex_model = Dense(512, activation='relu', name='hex_dense')(hex_vgg_output)
```

[merge_operations](scripts/merge.py) loads the independent modality models and then merges them using the different operators.

Concatenation

```python
from tensorflow.keras.layers import concatenate, Conv1D, Flatten, Dense
from tensorflow.keras.models import Model
from tensorflow.keras.regularizers import l2

# `eg_model` and `sh_model` are assumed to be built in the same way as `hex_model`.
# Concatenate the outputs of the three models
merged = concatenate([hex_model, eg_model, sh_model])

# Add a 1D convolutional layer and dense layers on top of the merged output
conv1 = Conv1D(filters=515, kernel_size=3, strides=1, activation='relu')(merged)
flatten = Flatten()(conv1)
dense1 = Dense(512, activation='relu')(flatten)
dense2 = Dense(256, activation='relu')(dense1)
dense3 = Dense(128, activation='relu')(dense2)
dense4 = Dense(64, activation='relu', kernel_regularizer=l2(0.01))(dense3)
output = Dense(9, activation='softmax')(dense4)

# Define the model
merged_model = Model(inputs=hex_in, outputs=output, name='merged_model')  # hex_in is the input layer

# Compile the model
merged_model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])
```

After analyzing the outputs of all four operators on the different combinations of malware modalities, we concluded that the `concatenation` operator applied to all three modalities (GS, EG, SH) gives the highest-performing model, with an F1-score of 0.99.