https://github.com/vidhi1290/malware-detection


# Malicious Executable Detection using Cluster Analysis 📊

Welcome to the Malicious Executable Detection project! This repository applies machine learning and cluster analysis to detect malicious executable files. 🔍🤖

## Problem Statement 🎯
In an era where cyber warfare is on the rise, detecting malicious code has become crucial. This project aims to develop a machine learning approach to identify malicious executable files. 💻🦠

## Understanding the Data and Attributes 📚
The dataset contains features extracted from malicious and non-malicious Windows executable files. It has 373 samples, of which 301 are malicious and 72 are non-malicious, so the classes are imbalanced. Each sample has 531 features, named F1, F2, and so on, plus a label column indicating whether the file is malicious or non-malicious. 📈🧐

## Data Preparation 🛠️
- **Imputation**: Rows and columns with more than 70% missing values are removed. 🧹
- **Feature Selection**: Relevant features are chosen for analysis. 🎯
- **Data Standardization**: Features are standardized so that they are on a comparable scale for clustering. 📊
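
The three preparation steps above can be sketched roughly as follows. This is a minimal illustration, not the notebook's actual code: the function name `prepare_features`, the `label` column name, the median fill for the remaining gaps, and the use of a variance filter as the "feature selection" step are all assumptions made for the example.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import VarianceThreshold
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler


def prepare_features(df, label_col="label", max_missing=0.70):
    """Drop sparse rows/columns, fill remaining gaps, and standardize.

    `label_col` and the specific imputation/selection choices are
    illustrative assumptions, not taken from the project.
    """
    features = df.drop(columns=[label_col])

    # Drop columns, then rows, whose share of missing values exceeds the threshold.
    features = features.loc[:, features.isna().mean(axis=0) <= max_missing]
    features = features.loc[features.isna().mean(axis=1) <= max_missing]

    # Assumption: any remaining gaps are filled with the column median.
    imputed = SimpleImputer(strategy="median").fit_transform(features)

    # Assumption: "feature selection" here only removes constant columns.
    selected = VarianceThreshold(threshold=0.0).fit_transform(imputed)

    # Standardize to zero mean and unit variance so no feature dominates distances.
    scaled = StandardScaler().fit_transform(selected)
    return scaled, df.loc[features.index, label_col]
```

Standardization matters here because K-Means relies on Euclidean distance, so a feature measured on a large scale would otherwise outweigh the rest.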

## K-Means Clustering 📈
K-Means clustering is applied to group similar instances together. The silhouette method is used to determine the optimal number of clusters. 🧩
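
As a rough sketch of how the silhouette method can pick the number of clusters, the snippet below fits K-Means for several values of k and keeps the one with the highest mean silhouette score. The data is synthetic (generated with `make_blobs`), and the helper name `pick_k_by_silhouette` and the candidate range of k are assumptions for illustration only.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the standardized feature matrix (not the project's data).
X, _ = make_blobs(
    n_samples=300,
    centers=[[-5, -5], [0, 5], [5, -5]],
    cluster_std=0.6,
    random_state=42,
)
X = StandardScaler().fit_transform(X)


def pick_k_by_silhouette(X, k_values=range(2, 9), random_state=42):
    """Fit K-Means for each candidate k and return the best k plus all scores."""
    scores = {}
    for k in k_values:
        model = KMeans(n_clusters=k, n_init=10, random_state=random_state)
        labels = model.fit_predict(X)
        scores[k] = silhouette_score(X, labels)
    best_k = max(scores, key=scores.get)
    return best_k, scores


best_k, scores = pick_k_by_silhouette(X)
print(f"best k by silhouette: {best_k}")
```

The silhouette score is undefined for a single cluster, which is why the candidate range starts at k = 2.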

## Silhouette Analysis 📊
Silhouette analysis helps evaluate the quality of clustering. A higher silhouette score indicates better clustering. 📈🔍
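
To make the score concrete: for a sample i, let a be its mean distance to the other members of its own cluster and b the smallest mean distance to the members of any other cluster. The silhouette coefficient is s(i) = (b - a) / max(a, b), which ranges from -1 to 1. The sketch below computes it by hand on a tiny made-up dataset and compares the result with scikit-learn's `silhouette_samples`; the helper name `silhouette_by_hand` and the toy points are illustrative assumptions.

```python
import numpy as np
from sklearn.metrics import silhouette_samples


def silhouette_by_hand(X, labels):
    """Compute s(i) = (b - a) / max(a, b) for every sample with plain NumPy."""
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    # Pairwise Euclidean distances between all samples.
    dists = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    scores = np.zeros(len(X))
    for i in range(len(X)):
        same = labels == labels[i]
        if same.sum() == 1:
            continue  # convention: a sample alone in its cluster scores 0
        # a: mean distance to the other members of the sample's own cluster.
        a = dists[i, same].sum() / (same.sum() - 1)
        # b: smallest mean distance to the members of any other cluster.
        others = [c for c in np.unique(labels) if c != labels[i]]
        b = min(dists[i, labels == c].mean() for c in others)
        scores[i] = (b - a) / max(a, b)
    return scores


# Toy data: two tight, well-separated groups.
X = np.array([[0.0, 0.0], [0.0, 1.0], [5.0, 5.0], [5.0, 6.0], [5.5, 5.5]])
labels = np.array([0, 0, 1, 1, 1])

by_hand = silhouette_by_hand(X, labels)
print(np.allclose(by_hand, silhouette_samples(X, labels)))
```

Values near 1 mean a sample sits well inside its cluster, values near 0 mean it lies between clusters, and negative values suggest it may be assigned to the wrong cluster.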

## Cluster Stability Check 🔒
Cluster stability is assessed by comparing the clusters obtained from the full dataset with those obtained from random samples of it. 🔄
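
One possible way to run such a check is sketched below: cluster the full data once, re-cluster several random subsamples, and measure agreement on the shared samples with the adjusted Rand index (which ignores how cluster IDs are numbered). The choice of the adjusted Rand index, the 80% sample fraction, the helper name `stability_score`, and the synthetic data are all assumptions; the notebook may compare clusters differently.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import adjusted_rand_score

# Synthetic stand-in for the standardized feature matrix (not the project's data).
X, _ = make_blobs(
    n_samples=300,
    centers=[[-5, -5], [0, 5], [5, -5]],
    cluster_std=0.6,
    random_state=0,
)


def stability_score(X, n_clusters, sample_frac=0.8, n_rounds=10, seed=0):
    """Mean agreement between full-data clusters and subsample clusters."""
    rng = np.random.default_rng(seed)
    full_labels = KMeans(
        n_clusters=n_clusters, n_init=10, random_state=seed
    ).fit_predict(X)

    scores = []
    for _ in range(n_rounds):
        # Draw a random subsample without replacement and re-cluster it.
        idx = rng.choice(len(X), size=int(sample_frac * len(X)), replace=False)
        sub_labels = KMeans(
            n_clusters=n_clusters, n_init=10, random_state=seed
        ).fit_predict(X[idx])
        # Compare both labelings on the samples they have in common.
        scores.append(adjusted_rand_score(full_labels[idx], sub_labels))
    return float(np.mean(scores))


stability = stability_score(X, n_clusters=3)
print(f"mean adjusted Rand index: {stability:.3f}")
```

A score close to 1 suggests the clusters barely change when part of the data is left out; a much lower score suggests the grouping depends heavily on which samples happen to be included.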

## Categorizing New Samples 🆕
The model is used to predict clusters for new executable files. 📋
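
A minimal sketch of this step, assuming the scaler and the K-Means model are bundled in a scikit-learn pipeline so that new samples are scaled exactly like the training data. The feature values below are random placeholders, not features from real executables, and two clusters are used only to keep the example small.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Placeholder training features: two well-separated groups of 50 samples each.
X_train = np.vstack([
    rng.normal(0.0, 1.0, size=(50, 4)),
    rng.normal(8.0, 1.0, size=(50, 4)),
])

# Fit scaling and clustering together so prediction reuses the training scale.
model = make_pipeline(
    StandardScaler(),
    KMeans(n_clusters=2, n_init=10, random_state=0),
)
model.fit(X_train)

# Hypothetical feature vectors for two previously unseen files.
X_new = np.array([
    [0.1, -0.2, 0.3, 0.0],
    [7.9, 8.2, 8.1, 7.7],
])
new_clusters = model.predict(X_new)
print(new_clusters)
```

Note that the cluster IDs are arbitrary integers. Deciding which cluster corresponds to malicious files still requires inspecting the labeled samples that fall into each cluster.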

## Learning Outcomes 📚
- Implementing cluster analysis in Python
- Pre-processing data for analysis
- Hierarchical clustering and dendrogram visualization
- Implementing K-Means clustering
- Determining the optimal number of clusters
- Cluster stability evaluation
- Predicting clusters for new samples
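
For the hierarchical clustering item, a small sketch using SciPy is shown below. The Ward linkage, the cut into two clusters, and the synthetic data are assumptions chosen for illustration; the notebook may use other settings.

```python
import numpy as np
from scipy.cluster.hierarchy import dendrogram, fcluster, linkage

rng = np.random.default_rng(1)

# Synthetic data: two compact groups of 10 samples with 3 features each.
X = np.vstack([
    rng.normal(0.0, 0.5, size=(10, 3)),
    rng.normal(6.0, 0.5, size=(10, 3)),
])

# Build the merge tree bottom-up; Ward linkage merges the pair of clusters
# that gives the smallest increase in within-cluster variance.
Z = linkage(X, method="ward")

# Cut the tree so that at most two flat clusters remain.
flat_labels = fcluster(Z, t=2, criterion="maxclust")

# Compute the dendrogram layout without drawing it; drop no_plot=True
# (with matplotlib installed) to render the figure.
tree = dendrogram(Z, no_plot=True)
print(flat_labels)
```

In a dendrogram, the height of each merge reflects how dissimilar the merged clusters are, so a large vertical gap is a natural place to cut the tree.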

Feel free to explore the notebooks and the code to dive deeper into the analysis!

## Kaggle Notebook 📊
You can also view this project on [Kaggle](#Kaggle). 📑

## Open in Colab 🚀
Want to run the notebooks in Google Colab? Click [here](#Open-In-Colab) to open them directly! 💡

## Connect with Us 🌐
Join our community and stay updated on our latest projects:

- 🌐 [GitHub](https://github.com/Vidhi1290)
- 🔗 [LinkedIn](https://www.linkedin.com/in/vidhi-waghela-434663198/)
- 🐦 [Twitter](https://twitter.com/VidhiWaghela)
- 📝 [Medium](https://medium.com/@datasciencemeetscybersecurity)

Happy coding! 👩‍💻👨‍💻