
# Microsoft: Classifying Cybersecurity Incidents with Machine Learning

## Overview
This repository contains the implementation of a machine learning pipeline designed to classify cybersecurity incidents into three categories: **True Positive (TP)**, **Benign Positive (BP)**, and **False Positive (FP)**. Using the Microsoft GUIDE dataset, the project leverages advanced data preprocessing, feature engineering, and classification techniques to optimize model performance and support Security Operations Centers (SOCs) in automating incident triage.

## Key Features
- **Extensive Data Preprocessing and Feature Engineering**:
  - Null value handling and removal of irrelevant features.
  - Time-based feature extraction (day, hour, etc.) from timestamps.
  - Label encoding for categorical variables.
  - Feature correlation analysis to drop highly correlated features.

- **Machine Learning Model Training and Optimization**:
  - Baseline models: Logistic Regression and Decision Trees.
  - Advanced models: Random Forest, Gradient Boosting, XGBoost, and LightGBM.
  - Techniques to handle class imbalance: SMOTE and class-weight adjustments.
  - Hyperparameter tuning using RandomizedSearchCV.

- **Model Evaluation**:
  - Metrics: Macro-F1 score, precision, recall.
  - SHAP analysis to identify important features.
  - Comparison of models to select the best performer.

- **Deployment-Ready Solution**:
  - Final model saved using `joblib` for easy deployment (see the sketch below).
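
A minimal sketch of the save/load round trip, assuming an already-fitted estimator; the filename `final_xgb_model.pkl` is illustrative, not necessarily the one used in this repository:

```python
import joblib
from xgboost import XGBClassifier

model = XGBClassifier(n_estimators=100, max_depth=6)
# model.fit(X_train, y_train)  # assumed to have been fitted on the processed training data

joblib.dump(model, "final_xgb_model.pkl")          # serialize the estimator to disk
loaded_model = joblib.load("final_xgb_model.pkl")  # reload it later for inference
# predictions = loaded_model.predict(X_new)
```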

## Business Use Cases

#### 1. Security Operations Centers (SOCs)
Automate the triage process to prioritize critical threats efficiently.

#### 2. Incident Response Automation
Enable systems to suggest appropriate actions for incident mitigation.

#### 3. Threat Intelligence
Enhance detection capabilities using historical evidence and customer responses.

#### 4. Enterprise Security Management
Reduce false positives and ensure timely addressing of true threats.

## Dataset
The Microsoft GUIDE dataset provides comprehensive telemetry data across three hierarchies: evidence, alerts, and incidents. It is distributed as two files:

- `GUIDE_train.csv` (2.43 GB)
- `GUIDE_test.csv` (1.09 GB)

[Kaggle Link to Dataset](https://www.kaggle.com/datasets/Microsoft/microsoft-security-incident-prediction)

Sample row:

| Id | OrgId | IncidentId | AlertId | Timestamp | DetectorId | AlertTitle | Category | MitreTechniques | IncidentGrade | ActionGrouped | ActionGranular | EntityType | EvidenceRole | DeviceId | Sha256 | IpAddress | Url | AccountSid | AccountUpn | AccountObjectId | AccountName | DeviceName | NetworkMessageId | EmailClusterId | RegistryKey | RegistryValueName | RegistryValueData | ApplicationId | ApplicationName | OAuthApplicationId | ThreatFamily | FileName | FolderPath | ResourceIdName | ResourceType | Roles | OSFamily | OSVersion | AntispamDirection | SuspicionLevel | LastVerdict | CountryCode | State | City |
|----|--------------|------------|---------|--------------------------|-------------|----------------|---------------|------------------|---------------|---------------|----------------|------------|--------------|----------|--------|------------|-----|-------------|-------------|------------------|--------------|------------|------------------|----------------|--------------|-------------------|-------------------|----------------|----------------|--------------------|--------------|----------|------------|-----------------|--------------|-------|----------|-----------|-------------------|----------------|--------------|-------------|-------|------|
| 0 | 180388628218 | 0 | 612 | 2024-06-04T06:05:15.000Z | 7 | InitialAccess | NaN | TruePositive | NaN | NaN | Ip | Related | 98799 | 138268 | 27 | 160396 | 441377 | 673934 | 425863 | 453297 | 153085 | 529644 | NaN | 1631 | 635 | 860 | 2251 | 3421 | 881 | NaN | 289573 | 117668 | 3586 | NaN | NaN | 5 | 66 | NaN | NaN | 31 | 6 | 3 |

Key highlights include:

- Volume: Contains over 13 million pieces of evidence.
- Annotations: Includes more than 1 million incidents with triage labels and 26,000 incidents labeled with remediation actions.
- Telemetry: Drawn from 6,100+ organizations, covering 441 techniques from the MITRE ATT&CK framework.
- Partitioning: Split into 70% training data and 30% testing data, maintaining stratified representation across triage grades and identifiers.

The dataset has been processed into training and testing sets (`traindata_processed.csv` and `testdata_processed.csv`), which form the backbone of this analysis.
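
As a hedged sketch of how these processed splits can be loaded (file names from this README; `IncidentGrade` is assumed to be the triage label column, as in the sample row above):

```python
import pandas as pd

train_df = pd.read_csv("traindata_processed.csv")
test_df = pd.read_csv("testdata_processed.csv")

# IncidentGrade holds the triage label (TruePositive / BenignPositive / FalsePositive)
print(train_df.shape, test_df.shape)
print(train_df["IncidentGrade"].value_counts(normalize=True))
```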

## Project Workflow

### 1. **Data Preprocessing**
- Removed columns with >50% missing values.
- Engineered features like `Hour`, `Day`, and `Time` from timestamps.
- Encoded categorical features using `LabelEncoder`.
- Handled missing and duplicate values, ensuring clean data.
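
A sketch of these steps, assuming a raw DataFrame `df` with the GUIDE schema (a `Timestamp` column and object-typed categoricals); the 50% threshold is from this README, while the exact column choices are illustrative:

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

def preprocess(df: pd.DataFrame) -> pd.DataFrame:
    # Drop columns with more than 50% missing values
    df = df.loc[:, df.isnull().mean() <= 0.5].copy()

    # Time-based features extracted from the Timestamp column
    ts = pd.to_datetime(df["Timestamp"], errors="coerce")
    df["Hour"] = ts.dt.hour
    df["Day"] = ts.dt.day
    df["Month"] = ts.dt.month
    df = df.drop(columns=["Timestamp"])

    # Remove duplicates, then label-encode the remaining categorical columns
    df = df.drop_duplicates()
    for col in df.select_dtypes(include="object").columns:
        df[col] = LabelEncoder().fit_transform(df[col].astype(str))
    return df
```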

---

### 2. **Exploratory Data Analysis (EDA)**
- Visualized incident distributions across `Hour`, `Day`, `Month`, and `Category`.
![EDA Visualizations](./images/hourly_incidents.png)
![EDA Visualizations](./images/daywise_incidents.png)
![EDA Visualizations](./images/monthly_incidents.png)
![EDA Visualizations](./images/category.png)

- Identified significant class imbalance in target labels (`TP`, `BP`, `FP`).
![EDA Visualizations](./images/target_distribution.png)

- Correlation heatmap to understand collinearity among the features.
![EDA Visualizations](./images/heatmap.png)
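
A plotting and pruning sketch, assuming the `train_df` frame loaded in the Dataset section; the 0.9 correlation threshold is an assumption, not necessarily the cut-off used in this project:

```python
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

corr = train_df.corr(numeric_only=True)
sns.heatmap(corr, cmap="coolwarm", center=0)
plt.title("Feature correlation heatmap")
plt.show()

# Consider each feature pair once (upper triangle) and drop one side of highly correlated pairs
upper = corr.abs().where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
to_drop = [col for col in upper.columns if (upper[col] > 0.9).any()]
train_df = train_df.drop(columns=to_drop)
```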
---

### 3. **Model Training and Evaluation**
- **Baseline Models**: Logistic Regression and Decision Tree for initial benchmarks.
- **Advanced Models**: Random Forest, Gradient Boosting, XGBoost, and LightGBM.
![Model Performance](./images/all_models.png)

- Addressed class imbalance with **SMOTE**, improving F1-scores.
- Selected **XGBoost** with the top 11 features for final evaluation.

![Model Performance](./images/xgb_top11.png)
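
A hedged sketch of this step: oversample only the training fold with SMOTE, fit an XGBoost candidate, and score it by macro-F1. The split ratio and model settings are illustrative, and the reduction to the top 11 features is omitted for brevity:

```python
from imblearn.over_sampling import SMOTE
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from xgboost import XGBClassifier

X = train_df.drop(columns=["IncidentGrade"])
y = train_df["IncidentGrade"]  # label-encoded triage grade
X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.2, stratify=y, random_state=42)

# Resample the training fold only, so the validation fold stays untouched
X_res, y_res = SMOTE(random_state=42).fit_resample(X_tr, y_tr)

model = XGBClassifier(n_estimators=300, max_depth=8, learning_rate=0.1, eval_metric="mlogloss")
model.fit(X_res, y_res)
print("Validation macro-F1:", f1_score(y_val, model.predict(X_val), average="macro"))
```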

---

### 4. **Hyperparameter Tuning**
- Optimized `XGBoost` hyperparameters using **RandomizedSearchCV**.
- Tuned parameters included `max_depth`, `learning_rate`, and `n_estimators`.

![Hyperparameter Tuning](./images/hyperparameter_tuning.png)
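
A sketch of the search setup; the parameter ranges, iteration count, and fold count are assumptions rather than the exact grid used here:

```python
from scipy.stats import randint, uniform
from sklearn.model_selection import RandomizedSearchCV
from xgboost import XGBClassifier

param_distributions = {
    "max_depth": randint(3, 12),
    "learning_rate": uniform(0.01, 0.3),
    "n_estimators": randint(100, 600),
}

search = RandomizedSearchCV(
    XGBClassifier(eval_metric="mlogloss"),
    param_distributions=param_distributions,
    n_iter=20,
    scoring="f1_macro",  # optimize the same metric used for evaluation
    cv=3,
    random_state=42,
    n_jobs=-1,
)
search.fit(X_tr, y_tr)  # training split from the previous section's sketch
print(search.best_params_, search.best_score_)
```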

---

### 5. **Feature Importance**
- Identified top features with **SHAP**, including `OrgId`, `IncidentId`, `DetectorId`, and more.
- Used these features to improve computational efficiency and model accuracy.

![SHAP Analysis](./images/shap.png)
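
A sketch of the SHAP computation, assuming `search.best_estimator_` from the tuning sketch above is the final tree model; exact plot settings may differ from those used to produce the figure:

```python
import shap

best_model = search.best_estimator_
explainer = shap.TreeExplainer(best_model)
shap_values = explainer.shap_values(X_val)

# Bar summary plot ranks features by mean absolute SHAP value across the three classes
shap.summary_plot(shap_values, X_val, plot_type="bar")
```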

---

### 6. **Final Evaluation**
- Tested the final model on unseen data, achieving a high **Macro-F1 score**.
- Delivered a balanced and generalizable model for real-world applications.

![Final Evaluation](./images/test_eval.png)
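
A sketch of the final check, assuming `test_df` from the Dataset section and `best_model` from the tuning sketch, with the test features aligned to the columns used in training:

```python
from sklearn.metrics import classification_report, f1_score

X_test = test_df[X_tr.columns]   # keep the same feature columns as in training
y_test = test_df["IncidentGrade"]

y_pred = best_model.predict(X_test)
print(classification_report(y_test, y_pred))
print("Test macro-F1:", f1_score(y_test, y_pred, average="macro"))
```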

## Results
- **Best Model**: XGBoost with hyperparameter tuning and without SMOTE.
- **Macro-F1 Score**:
  - Validation Set: **0.91**
  - Test Set: **0.90**
- **Feature Importance**:
  - Top features like `OrgId`, `IncidentId`, `AlertTitle`, and `DetectorId` significantly influenced predictions.
- **Model Performance**:
  - Balanced precision and recall for all three classes (`TP`, `BP`, `FP`).
- **Top Features**: Insights from SHAP analysis improved computational efficiency and overall results.
---

## Technologies Used

### Programming Languages
- Python

### Libraries
- Data Processing: `pandas`, `numpy`
- Visualization: `matplotlib`, `seaborn`
- Machine Learning:
  - `scikit-learn` (Logistic Regression, Decision Trees, Random Forest)
  - `XGBoost`
  - `LightGBM`
  - `imbalanced-learn` (SMOTE)
- Feature Analysis: `SHAP`

### Dataset
- Microsoft GUIDE Dataset (processed into `traindata_processed.csv` and `testdata_processed.csv`)

### Additional Tools
- Model Saving: `joblib`

## How to Run

1. Clone the repository:
```bash
git clone https://github.com/pavankethavath/microsoft-classifying-cybersecurity-incidents-with-ml
cd microsoft-classifying-cybersecurity-incidents-with-ml
```
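
2. Install the dependencies. No requirements file is shown in this listing, so as a fallback you can install the libraries named under **Technologies Used**:
```bash
pip install pandas numpy matplotlib seaborn scikit-learn xgboost lightgbm imbalanced-learn shap joblib
```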

## Acknowledgments
- Microsoft for providing the GUIDE dataset.
- Open-source contributors of libraries and tools used in this project.
- The data science and cybersecurity communities for inspiration and knowledge.