https://github.com/alex-snd/malwareclassifier

👾 Malware Classification using Deep Learning and Cuckoo Sandbox

cuckoo-sandbox cvae data-science deep-learning malware malware-classification malware-detection python pytorch vae

Host: GitHub
URL: https://github.com/alex-snd/malwareclassifier
Owner: alex-snd
License: mit
Created: 2021-03-21T09:19:42.000Z (over 5 years ago)
Default Branch: master
Last Pushed: 2022-06-18T08:21:28.000Z (about 4 years ago)
Last Synced: 2025-04-03T15:44:11.806Z (over 1 year ago)
Topics: cuckoo-sandbox, cvae, data-science, deep-learning, malware, malware-classification, malware-detection, python, pytorch, vae
Language: Python
Homepage:
Size: 10.1 MB
Stars: 14
Watchers: 1
Forks: 0
Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE

README

# Malware Classifier

This is the code repository for **Malware Classification Research**. All deep learning models are implemented in Python 3.6+ with PyTorch 1.9.

## Data
The source data consists of JSON reports generated by the dynamic malware analysis system [Cuckoo Sandbox](https://cuckoosandbox.org/).
The reports were analyzed to extract the most informative attributes of each malicious sample. This analysis selected 3698 features, which form the basis for classification. Each malware instance is therefore represented by a binary feature vector of dimension 3698, and its label is the verdict assigned by Kaspersky anti-virus. The dataset contains about 10,000 labeled samples covering 8 malware types and about 14,000 unlabeled samples.
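The mapping from a report to a binary feature vector can be sketched as follows. This is a minimal illustration, not the repository's extraction code: the feature names, the `to_feature_vector` helper, and the idea of a flat set of observed tokens are all assumptions made for the example, and the real vocabulary would hold 3698 entries.

```python
import numpy as np


def to_feature_vector(observed, vocabulary):
    """Return a binary vector with a 1 wherever a vocabulary feature was observed."""
    index = {name: i for i, name in enumerate(vocabulary)}
    vector = np.zeros(len(vocabulary), dtype=np.uint8)
    for name in observed:
        position = index.get(name)
        if position is not None:  # tokens outside the vocabulary are ignored
            vector[position] = 1
    return vector


# Hypothetical four-entry vocabulary standing in for the 3698 selected features.
vocabulary = [
    "api:CreateFileW",
    "api:RegSetValueExW",
    "dns:example.com",
    "mutex:Global\\demo",
]

# Hypothetical tokens pulled from one sandbox report.
observed = {"api:CreateFileW", "dns:example.com", "api:NotInVocabulary"}

vector = to_feature_vector(observed, vocabulary)
```

With a fixed vocabulary order, every sample yields a vector of the same length, so labeled and unlabeled samples can share one input format.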

## Data Visualization
The normalized 3698-dimensional vector is rendered as a 61 × 61 RGB image (61 is the smallest integer whose square, 3721, is at least 3698), in which the color of each pixel is determined by the value of the corresponding feature.
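One way to build such an image is sketched below. The README does not say how the 23 pixels left over (3721 − 3698) are filled or which color mapping is used, so the zero padding, the linear blend between a `low` and a `high` color, and the `vector_to_image` name are assumptions for illustration only.

```python
import math

import numpy as np


def vector_to_image(features, low=(0, 0, 0), high=(255, 64, 0)):
    """Lay a feature vector out on a square grid and color each pixel by its value."""
    features = np.asarray(features, dtype=np.float32)
    side = math.ceil(math.sqrt(features.size))  # 61 for 3698 features

    # Assumption: unused trailing pixels are padded with zeros.
    padded = np.zeros(side * side, dtype=np.float32)
    padded[: features.size] = np.clip(features, 0.0, 1.0)
    grid = padded.reshape(side, side, 1)

    # Assumption: color is a linear blend between `low` (value 0) and `high` (value 1).
    low = np.asarray(low, dtype=np.float32)
    high = np.asarray(high, dtype=np.float32)
    return (low + grid * (high - low)).round().astype(np.uint8)


features = np.zeros(3698, dtype=np.float32)
features[0] = 1.0  # a single active feature lands in the top-left pixel
image = vector_to_image(features)
```

The resulting array has shape `(61, 61, 3)` and can be passed to any image library that accepts `uint8` RGB data.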

## Autoencoder
An autoencoder model with a latent space dimension of 200 was trained on the unlabeled data so that its pretrained encoder can be reused for malware classification.
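The pretraining idea can be sketched in PyTorch as below. Only the input size (3698), the latent size (200), and the number of classes (8) come from this README; the fully connected layout, the hidden width of 1024, the binary cross-entropy reconstruction loss, the Adam settings, and the random stand-in batch are assumptions, and the repository's actual architecture may differ.

```python
import torch
from torch import nn


class AutoEncoder(nn.Module):
    """Plain fully connected autoencoder (layer sizes other than 3698 and 200 are assumed)."""

    def __init__(self, n_features=3698, latent_dim=200, hidden_dim=1024):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(n_features, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, latent_dim),
        )
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, n_features),  # outputs logits, one per feature
        )

    def forward(self, x):
        return self.decoder(self.encoder(x))


model = AutoEncoder()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.BCEWithLogitsLoss()  # suits binary inputs; an assumed choice

# Random binary batch standing in for unlabeled feature vectors.
batch = torch.randint(0, 2, (32, 3698)).float()

# One unsupervised training step: reconstruct the input from its latent code.
optimizer.zero_grad()
loss = loss_fn(model(batch), batch)
loss.backward()
optimizer.step()

# Reuse the pretrained encoder under a small head for the 8 malware types.
classifier = nn.Sequential(model.encoder, nn.ReLU(), nn.Linear(200, 8))
logits = classifier(batch)
```

Because `classifier` wraps the same `encoder` module, fine-tuning on the labeled samples starts from the weights learned on the unlabeled data.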