https://github.com/vidhi1290/malware-detection
Welcome to the Malicious Executable Detection project! This repository explores the world of machine learning and clustering analysis to detect malicious executable files.
clustering-algorithm cybersecurity hierarchical-clustering k-means-clustering machine-learning malware-detection python silhouette
- Host: GitHub
- URL: https://github.com/vidhi1290/malware-detection
- Owner: Vidhi1290
- Created: 2023-09-16T11:53:38.000Z (over 1 year ago)
- Default Branch: main
- Last Pushed: 2023-09-16T12:01:34.000Z (over 1 year ago)
- Last Synced: 2025-02-02T18:33:26.205Z (4 months ago)
- Topics: clustering-algorithm, cybersecurity, hierarchical-clustering, k-means-clustering, machine-learning, malware-detection, python, silhouette
- Language: Jupyter Notebook
- Homepage:
- Size: 12.7 KB
- Stars: 2
- Watchers: 1
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
README
# Malicious Executable Detection using Cluster Analysis
Welcome to the Malicious Executable Detection project! This repository explores the world of machine learning and clustering analysis to detect malicious executable files.
## Problem Statement
In an era where cyber warfare is on the rise, detecting malicious code has become crucial. This project aims to develop a machine learning approach to identify malicious executable files.
## Understanding the Data and Attributes
The dataset contains features extracted from both malicious and non-malicious Windows executable files. It has 373 samples: 301 malicious and 72 non-malicious, so the classes are imbalanced (roughly 81% malicious). Each sample has 531 features, named F1, F2, and so on, plus a label column indicating whether the file is malicious or non-malicious.
## Data Preparation
- **Imputation**: Rows and columns in which more than 70% of the values are missing are removed.
- **Feature Selection**: Relevant features are chosen for analysis.
- **Data Standardization**: Features are standardized so that they are on a comparable scale for clustering.
## K-Means Clustering
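Before K-Means is run, the preparation steps listed above could look roughly like the following. This is a minimal sketch, not the notebook's actual code: the function name `prepare_features`, the median fill for remaining gaps, and the synthetic `demo` table are all assumptions made for illustration. Feature selection is left out because the README does not state the criterion used.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler


def prepare_features(df: pd.DataFrame, max_missing: float = 0.70) -> pd.DataFrame:
    """Drop sparse columns and rows, fill remaining gaps, then standardize."""
    # Drop columns whose share of missing values exceeds the threshold.
    df = df.loc[:, df.isna().mean(axis=0) <= max_missing]
    # Drop rows whose share of missing values exceeds the threshold.
    df = df.loc[df.isna().mean(axis=1) <= max_missing]
    # Assumption: any remaining gaps are filled with the column median.
    df = df.fillna(df.median())
    # Standardize each feature to zero mean and unit variance.
    scaled = StandardScaler().fit_transform(df)
    return pd.DataFrame(scaled, columns=df.columns, index=df.index)


# Synthetic stand-in for the real feature table (hypothetical values).
rng = np.random.default_rng(0)
demo = pd.DataFrame(rng.normal(size=(20, 4)), columns=["F1", "F2", "F3", "F4"])
demo.loc[:15, "F4"] = np.nan  # F4 is 80% missing, so it is dropped
demo.loc[3, "F1"] = np.nan    # a single gap that gets median-filled
prepared = prepare_features(demo)
```

The label column should be separated from the features before calling a function like this, since clustering is unsupervised and the label would otherwise be scaled along with the features.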
K-Means clustering is applied to group similar instances together. The Silhouette method is used to determine the optimal number of clusters.
## Silhouette Analysis
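Picking the number of clusters by silhouette score, as described above, might be sketched like this. The helper `best_k_by_silhouette`, the candidate range of 2 to 6 clusters, and the synthetic blobs are assumptions for illustration; the notebook may search a different range on the real feature matrix.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score


def best_k_by_silhouette(X, k_values=range(2, 7), seed=0):
    """Return (best_k, scores), where scores maps each k to its mean silhouette score."""
    scores = {}
    for k in k_values:
        labels = KMeans(n_clusters=k, n_init=10, random_state=seed).fit_predict(X)
        scores[k] = silhouette_score(X, labels)
    # The k with the highest mean silhouette score is taken as the best choice.
    best_k = max(scores, key=scores.get)
    return best_k, scores


# Synthetic stand-in data with three well-separated groups (not the malware dataset).
X_demo, _ = make_blobs(
    n_samples=150,
    centers=[[-10, -10], [0, 0], [10, 10]],
    cluster_std=0.5,
    random_state=0,
)
best_k, scores = best_k_by_silhouette(X_demo)
```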
Silhouette analysis helps evaluate the quality of clustering. Scores range from -1 to 1, and a higher score indicates more compact, better-separated clusters.
## Cluster Stability Check
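As a hedged illustration of this check, one possible approach is to cluster the full data, cluster a random subsample, and measure how well the two labelings agree on the shared points. The sketch below assumes the adjusted Rand index as the agreement measure and an 80% subsample; neither choice is confirmed by the README, and `subsample_stability` is a hypothetical helper.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import adjusted_rand_score


def subsample_stability(X, k, frac=0.8, n_rounds=10, seed=0):
    """Mean adjusted Rand index between full-data and subsample clusterings."""
    rng = np.random.default_rng(seed)
    full_labels = KMeans(n_clusters=k, n_init=10, random_state=seed).fit_predict(X)
    scores = []
    for _ in range(n_rounds):
        # Draw a random subsample without replacement and recluster it.
        idx = rng.choice(len(X), size=int(frac * len(X)), replace=False)
        sub_labels = KMeans(n_clusters=k, n_init=10, random_state=seed).fit_predict(X[idx])
        # The index ignores label permutations, so cluster ids need not match.
        scores.append(adjusted_rand_score(full_labels[idx], sub_labels))
    return float(np.mean(scores))


# Synthetic stand-in data (not the malware dataset).
X_demo, _ = make_blobs(
    n_samples=150,
    centers=[[-10, -10], [0, 0], [10, 10]],
    cluster_std=0.5,
    random_state=0,
)
stability = subsample_stability(X_demo, k=3)
```

A value near 1 suggests the clusters survive resampling, while values near 0 suggest the grouping depends heavily on which samples happen to be included.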
Cluster stability is assessed by comparing clusters with and without random sampling of data.
## Categorizing New Samples
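Assigning unseen files to clusters could be sketched as follows. The example assumes that standardization and K-Means are wrapped in a single scikit-learn pipeline so that new samples are scaled with the statistics learned during fitting; the data, the two-cluster setting, and the variable names are hypothetical.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Synthetic training features: two groups far apart (hypothetical stand-in data).
X_train = np.vstack([
    rng.normal(loc=0.0, scale=1.0, size=(50, 5)),
    rng.normal(loc=20.0, scale=1.0, size=(50, 5)),
])

# The pipeline reuses the scaler fitted on the training data for new samples.
model = make_pipeline(
    StandardScaler(),
    KMeans(n_clusters=2, n_init=10, random_state=0),
)
model.fit(X_train)

# Two new, unseen feature vectors, one near each group.
X_new = np.array([[0.5] * 5, [19.5] * 5])
new_clusters = model.predict(X_new)
```

Cluster ids are arbitrary integers, not malicious or non-malicious labels; mapping a cluster to a class requires inspecting which labeled samples fall into it.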
The model is used to predict clusters for new executable files.
## Learning Outcomes
- Implementing cluster analysis in Python
- Pre-processing data for analysis
- Hierarchical clustering and dendrogram visualization
- Implementing K-Means clustering
- Determining the optimal number of clusters
- Cluster stability evaluation
- Predicting clusters for new samples

Feel free to explore the notebooks and the code to dive deeper into the analysis!
## Kaggle Notebook
You can also view this project on [Kaggle](#Kaggle).
## Open in Colab
Want to run the notebooks in Google Colab? Click [here](#Open-In-Colab) to open them directly!
## Connect with Us
Join our community and stay updated on our latest projects:

- [GitHub](https://github.com/Vidhi1290)
- [LinkedIn](https://www.linkedin.com/in/vidhi-waghela-434663198/)
- [Twitter](https://twitter.com/VidhiWaghela)
- [Medium](https://medium.com/@datasciencemeetscybersecurity)

Happy coding!