https://github.com/alex-snd/malwareclassifier
👾 Malware Classification using Deep Learning and Cuckoo Sandbox
cuckoo-sandbox cvae data-science deep-learning malware malware-classification malware-detection python pytorch vae
Last synced: 11 months ago
- Host: GitHub
- URL: https://github.com/alex-snd/malwareclassifier
- Owner: alex-snd
- License: mit
- Created: 2021-03-21T09:19:42.000Z (almost 5 years ago)
- Default Branch: master
- Last Pushed: 2022-06-18T08:21:28.000Z (over 3 years ago)
- Last Synced: 2025-04-03T15:44:11.806Z (11 months ago)
- Topics: cuckoo-sandbox, cvae, data-science, deep-learning, malware, malware-classification, malware-detection, python, pytorch, vae
- Language: Python
- Homepage:
- Size: 10.1 MB
- Stars: 14
- Watchers: 1
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
README
# Malware Classifier
This is the code repository for **Malware Classification Research**. All the deep learning models are implemented with Python 3.6+ and PyTorch 1.9.
## Data
The source data consists of JSON reports generated by the dynamic malware analysis system [Cuckoo Sandbox](https://cuckoosandbox.org/).
The reports were analyzed to extract the most useful information about the malicious samples. This analysis selected 3698 features, which serve as the basis for classification. Each malware instance is therefore represented by a binary feature vector of dimension 3698, and its label is the classification assigned by Kaspersky anti-virus. The database contains about 10,000 labeled samples from 8 different types of malware and about 14,000 unlabeled samples.
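As a rough illustration of how a report could become a binary feature vector, the sketch below marks each vocabulary entry as 1 if it was observed in the report and 0 otherwise. The README does not say which features were selected, so the tiny `VOCABULARY`, the use of API call names as features, the `behavior.apistats` report layout, and both helper functions are assumptions made for illustration only.

```python
import json

# Hypothetical vocabulary: the real one has 3698 entries chosen by the author.
VOCABULARY = ["NtCreateFile", "NtWriteFile", "RegSetValueExA", "connect"]


def observed_api_calls(report):
    """Collect the API names seen in a report (assumed layout: behavior.apistats)."""
    apistats = report.get("behavior", {}).get("apistats", {})
    return {api for calls in apistats.values() for api in calls}


def binary_feature_vector(observed, vocabulary):
    """Return 1 for every vocabulary entry that was observed, 0 otherwise."""
    return [1 if name in observed else 0 for name in vocabulary]


# A made-up miniature report, not real sandbox output.
report = json.loads(
    '{"behavior": {"apistats": {"1234": {"NtCreateFile": 3, "connect": 1}}}}'
)
vector = binary_feature_vector(observed_api_calls(report), VOCABULARY)
print(vector)  # [1, 0, 0, 1]
```

With the real vocabulary, the same function would yield a vector of length 3698.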
## Data Visualization
The normalized 3698-dimensional vector is rendered as a 61 × 61 RGB image (61 ≈ √3698), in which the color of each pixel is set by the value of the corresponding feature.
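Since 61 × 61 = 3721 is slightly larger than 3698, the vector has to be padded before it can be laid out as a square. A minimal sketch of that layout follows. The zero padding, the `to_square_grid` and `to_rgb` helpers, and the blue-to-red color mapping are assumptions, not necessarily what the repository does.

```python
import math

N_FEATURES = 3698  # feature count stated in the README


def to_square_grid(vector, fill=0.0):
    """Pad the vector to the next perfect square and split it into rows."""
    side = math.ceil(math.sqrt(len(vector)))  # 61 for 3698 features
    padded = list(vector) + [fill] * (side * side - len(vector))
    return [padded[r * side:(r + 1) * side] for r in range(side)]


def to_rgb(value):
    """Hypothetical colormap: 0.0 maps to blue, 1.0 maps to red."""
    v = min(max(value, 0.0), 1.0)
    return (round(255 * v), 0, round(255 * (1 - v)))


features = [1.0] * N_FEATURES  # placeholder vector, not real data
grid = to_square_grid(features)
image = [[to_rgb(v) for v in row] for row in grid]
print(len(image), len(image[0]))  # 61 61
```

The resulting nested list of RGB tuples can be handed to any image library for display. The last 23 pixels are padding and carry no feature information.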
## Autoencoder
An autoencoder with a latent space dimension of 200 was trained on the unlabeled data so that its pretrained encoder can be reused for malware classification.
AE performance: the first row is the input, the second is the AE output
An autoencoder with a two-dimensional latent space was also trained so that the latent space can be visualized on a plane.
Evolution of the latent space during training
Labeled malware samples displayed in latent space
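The pretraining idea can be sketched in PyTorch as follows. This is a minimal fully connected autoencoder, not the repository's actual model: the layer sizes, `hidden_dim`, the binary cross-entropy loss, the optimizer settings, and the `train_step` helper are all assumptions. Only the input dimension (3698), the latent dimension (200), and the 8 malware types come from the README.

```python
import torch
from torch import nn

N_FEATURES = 3698  # from the README
LATENT_DIM = 200   # from the README (2 for the two-dimensional visualization)
N_CLASSES = 8      # number of malware types in the labeled data


class AutoEncoder(nn.Module):
    """Assumed architecture: one hidden layer on each side of the bottleneck."""

    def __init__(self, n_features=N_FEATURES, latent_dim=LATENT_DIM, hidden_dim=1024):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(n_features, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, latent_dim),
        )
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, n_features),  # outputs logits, one per feature
        )

    def forward(self, x):
        z = self.encoder(x)
        return self.decoder(z), z


def train_step(model, optimizer, batch):
    """One reconstruction step; BCE suits binary input features."""
    optimizer.zero_grad()
    logits, _ = model(batch)
    loss = nn.functional.binary_cross_entropy_with_logits(logits, batch)
    loss.backward()
    optimizer.step()
    return loss.item()


torch.manual_seed(0)
model = AutoEncoder()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
batch = torch.randint(0, 2, (32, N_FEATURES)).float()  # random stand-in data
loss = train_step(model, optimizer, batch)

# Hypothetical reuse: the pretrained encoder followed by a linear head.
classifier = nn.Sequential(model.encoder, nn.Linear(LATENT_DIM, N_CLASSES))
class_logits = classifier(batch)
```

Passing `latent_dim=2` to the same class gives the variant suited to plotting samples on a plane.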
## Classifier
Classifier results: