https://github.com/spk-22/phish-guard

A comprehensive deep learning framework for phishing detection, utilizing Graph Neural Networks (GraphSAGE) to analyze interconnected web features. Features include temporal graph construction, causal learning for robust time-series analysis, and integrated noise injection testing to evaluate model resilience against data imperfections.
Host: GitHub
URL: https://github.com/spk-22/phish-guard
Owner: spk-22
Created: 2025-05-23T14:58:21.000Z (about 1 year ago)
Default Branch: main
Last Pushed: 2025-05-27T14:56:26.000Z (about 1 year ago)
Last Synced: 2025-06-08T06:42:01.017Z (about 1 year ago)
Topics: casual-sampling, gnn-model, graphsage, ids, phishing-detection, pytorch, temporal-data
Language: Python
Homepage:
Size: 47.9 KB
Stars: 3
Watchers: 0
Forks: 0
Open Issues: 0
Metadata Files:
- Readme: README.md

# Phishing Detection Using GraphSAGE with Causal Sampling (GNNs)

This repository presents a complete workflow for phishing detection leveraging **GraphSAGE**, a type of Graph Neural Network (GNN), with temporal modeling, causal sampling, and robustness testing.

## 🧠 Overview

Phishing attacks often involve subtle patterns that can be better detected using relational and temporal data. This project converts phishing datasets into graphs and applies a GNN model that:

- Respects **causal constraints** in message passing.
- Incorporates **temporal windowing** for realistic data flow.
- Tests **robustness** through noise injection.

## 🛠 Tech Stack

- **Programming Language:** Python
- **Graph Processing:** [PyTorch Geometric](https://pytorch-geometric.readthedocs.io/en/latest/)
- **Machine Learning:** PyTorch, Scikit-learn
- **Data Handling:** pandas, numpy
- **Visualization:** matplotlib

## 📊 Workflow Summary

### 1. **Data Preprocessing**
- Load and clean phishing data from `phish.xlsx`
- One-hot encode categorical features
- Scale numerical features
- Combine features for each URL
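
The encoding, scaling, and combining steps above can be sketched as follows. This is a minimal NumPy-only illustration, not the repository's code: the project lists pandas and scikit-learn for this stage, and the column meanings (`tld`, URL length, subdomain count) are hypothetical toy values rather than the contents of `phish.xlsx`.

```python
import numpy as np

def one_hot(values):
    """One-hot encode a sequence of categorical values (columns sorted by category)."""
    categories = sorted(set(values))
    index = {c: i for i, c in enumerate(categories)}
    encoded = np.zeros((len(values), len(categories)), dtype=float)
    for row, v in enumerate(values):
        encoded[row, index[v]] = 1.0
    return encoded, categories

def standardize(x):
    """Scale each numeric column of a 2-D array to zero mean and unit variance."""
    x = np.asarray(x, dtype=float)
    mean = x.mean(axis=0)
    std = x.std(axis=0)
    std[std == 0] = 1.0  # leave constant columns unscaled instead of dividing by zero
    return (x - mean) / std

# Hypothetical toy records: one categorical and two numeric columns per URL.
tld = ["com", "xyz", "com", "net"]
numeric = [[23, 1], [87, 4], [41, 2], [65, 3]]  # assumed: URL length, subdomain count

cat_features, categories = one_hot(tld)
num_features = standardize(numeric)
features = np.hstack([num_features, cat_features])  # one combined vector per URL
```

Each row of `features` then serves as the node feature vector for one URL.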

### 2. **Graph Construction**
- Create a similarity graph using cosine similarity
- Connect each node to k=5 nearest neighbors
- Partition data into time windows of 10 samples
- Generate PyG `Data` objects for each time window
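
A rough sketch of the graph construction, assuming the k-nearest-neighbor graph is built separately inside each time window (the README does not say whether neighbors are searched per window or across the whole dataset). The function names are illustrative. The edge array follows the `[2, num_edges]` layout that PyG uses for `edge_index`, but it is kept as a plain NumPy array here instead of being wrapped in a `Data` object.

```python
import numpy as np

def knn_cosine_edges(features, k=5):
    """Return a (2, n*k) array of directed edges from each node to its k most similar nodes."""
    x = np.asarray(features, dtype=float)
    norms = np.linalg.norm(x, axis=1, keepdims=True)
    norms[norms == 0] = 1.0                      # guard against all-zero rows
    unit = x / norms
    sim = unit @ unit.T                          # pairwise cosine similarity
    np.fill_diagonal(sim, -np.inf)               # a node is never its own neighbor
    k = min(k, len(x) - 1)                       # small windows cannot supply k neighbors
    neighbors = np.argsort(-sim, axis=1)[:, :k]  # indices of the k highest similarities
    sources = np.repeat(np.arange(len(x)), k)
    return np.stack([sources, neighbors.reshape(-1)])

def time_windows(n_samples, window_size=10):
    """Split sample indices, assumed to be in chronological order, into consecutive windows."""
    return [np.arange(start, min(start + window_size, n_samples))
            for start in range(0, n_samples, window_size)]

# Toy data: 25 samples with 4 features each, already sorted by time.
rng = np.random.default_rng(0)
features = rng.normal(size=(25, 4))

windows = time_windows(len(features), window_size=10)
graphs = [(idx, knn_cosine_edges(features[idx], k=5)) for idx in windows]
```

With 25 samples this gives two full windows and a final window of 5 samples, where `k` is reduced to 4 because only four other nodes are available.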

### 3. **Causal GraphSAGE Model**
- Custom model using `SAGEConv`, `BatchNorm`, `Dropout`
- Enforces **causal message passing** (no future info leakage)
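
To illustrate what "no future info leakage" can mean in practice, the sketch below performs a GraphSAGE-style mean aggregation and drops any edge whose source node is newer than its target. It is a NumPy illustration of the masking idea only, not the repository's `SAGEConv`-based model. The per-node `timestamps` array and the choice to allow edges between nodes with equal timestamps are assumptions.

```python
import numpy as np

def causal_mean_aggregate(x, edge_index, timestamps):
    """Mean-aggregate neighbor features, ignoring edges that point backward in time."""
    x = np.asarray(x, dtype=float)
    src, dst = np.asarray(edge_index)
    t = np.asarray(timestamps)
    keep = t[src] <= t[dst]           # a message may only come from the past or present
    src, dst = src[keep], dst[keep]
    summed = np.zeros_like(x)
    np.add.at(summed, dst, x[src])    # sum the incoming messages for each target node
    counts = np.bincount(dst, minlength=len(x)).astype(float)
    counts[counts == 0] = 1.0         # nodes with no valid neighbors get a zero aggregate
    return summed / counts[:, None]

# Three nodes observed at times 0, 1, 2, with edges given as (source row, target row).
x = [[1.0], [2.0], [4.0]]
timestamps = [0, 1, 2]
edge_index = [[0, 2, 1, 0, 2],
              [1, 1, 2, 2, 0]]

aggregated = causal_mean_aggregate(x, edge_index, timestamps)
```

Here the edges 2→1 and 2→0 are discarded because node 2 is the newest, so node 1 only sees node 0 and node 2 sees the mean of nodes 0 and 1.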

### 4. **Noise Injection for Robustness**
- Add Gaussian noise to node features
- Randomly flip labels to simulate real-world inconsistencies
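
The two perturbations can be sketched as a single helper. The parameter names and default values (`noise_std`, `flip_prob`) are illustrative assumptions; the README does not state the noise levels the project uses.

```python
import numpy as np

def inject_noise(features, labels, noise_std=0.1, flip_prob=0.05, seed=0):
    """Add zero-mean Gaussian noise to features and flip a random subset of binary labels."""
    rng = np.random.default_rng(seed)
    features = np.asarray(features, dtype=float)
    labels = np.asarray(labels, dtype=int)
    noisy_features = features + rng.normal(0.0, noise_std, size=features.shape)
    flip_mask = rng.random(labels.shape) < flip_prob   # True where a label gets flipped
    noisy_labels = np.where(flip_mask, 1 - labels, labels)
    return noisy_features, noisy_labels, flip_mask

# Toy data: 100 nodes with 4 features and alternating 0/1 labels.
clean_x = np.zeros((100, 4))
clean_y = np.arange(100) % 2

noisy_x, noisy_y, flipped = inject_noise(clean_x, clean_y, noise_std=0.1, flip_prob=0.1)
```

Returning `flip_mask` makes it possible to check afterwards how many labels were actually changed.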

### 5. **Training**
- Trained with Binary Cross-Entropy loss and Adam optimizer
- Evaluated using AUC-ROC score and ROC curve visualization
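
For reference, the loss and the evaluation score can be written out directly. This is a plain NumPy sketch of the definitions, not the training loop itself: the project trains with PyTorch, and the listed scikit-learn dependency is the usual source of an AUC-ROC function.

```python
import numpy as np

def binary_cross_entropy(y_true, y_prob, eps=1e-12):
    """Mean binary cross-entropy between 0/1 labels and predicted probabilities."""
    y_true = np.asarray(y_true, dtype=float)
    y_prob = np.clip(np.asarray(y_prob, dtype=float), eps, 1 - eps)  # avoid log(0)
    return float(-np.mean(y_true * np.log(y_prob) + (1 - y_true) * np.log(1 - y_prob)))

def auc_roc(y_true, y_score):
    """AUC as the probability that a random positive outscores a random negative (ties count half)."""
    y_true = np.asarray(y_true).astype(bool)
    y_score = np.asarray(y_score, dtype=float)
    pos, neg = y_score[y_true], y_score[~y_true]
    if len(pos) == 0 or len(neg) == 0:
        raise ValueError("AUC needs at least one positive and one negative sample")
    diff = pos[:, None] - neg[None, :]           # every positive/negative score pair
    return float(((diff > 0).sum() + 0.5 * (diff == 0).sum()) / diff.size)

y_true = [0, 0, 1, 1]
y_score = [0.1, 0.4, 0.35, 0.8]

loss = binary_cross_entropy(y_true, y_score)
auc = auc_roc(y_true, y_score)   # 3 of the 4 positive/negative pairs are ranked correctly
```

The pairwise formulation is quadratic in the number of samples, so it suits small evaluation sets. A rank-based implementation is preferable for large ones.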

## 📈 Evaluation

The model achieved strong performance on phishing detection:

| Metric    | Value  | Notes                    |
|-----------|--------|--------------------------|
| Accuracy  | 86.36% |                          |
| Precision | 86.32% |                          |
| Recall    | 86.36% |                          |
| F1-Score  | 86.14% |                          |
| AUC-ROC   | 0.9023 | Visualized in final plot |
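
The reported recall equals the reported accuracy, which is what support-weighted averaging over classes always produces. The README does not state which averaging was used, so treat that as an inference. The sketch below computes the four classification metrics under that assumption, on made-up labels.

```python
import numpy as np

def weighted_metrics(y_true, y_pred):
    """Accuracy plus support-weighted precision, recall, and F1 over all classes."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    precision = recall = f1 = 0.0
    for c in np.unique(y_true):
        tp = np.sum((y_pred == c) & (y_true == c))
        fp = np.sum((y_pred == c) & (y_true != c))
        fn = np.sum((y_pred != c) & (y_true == c))
        p = tp / (tp + fp) if tp + fp else 0.0
        r = tp / (tp + fn) if tp + fn else 0.0
        f = 2 * p * r / (p + r) if p + r else 0.0
        weight = np.mean(y_true == c)            # share of samples belonging to class c
        precision += weight * p
        recall += weight * r
        f1 += weight * f
    return {
        "accuracy": float(np.mean(y_true == y_pred)),
        "precision": float(precision),
        "recall": float(recall),
        "f1": float(f1),
    }

# Made-up labels: 1 = phishing, 0 = legitimate.
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 0, 0, 0, 1]

metrics = weighted_metrics(y_true, y_pred)
```

With weighted averaging, each class's recall is multiplied by its share of the samples, so the sum reduces to correct predictions divided by total samples, which is accuracy.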

## Visualizations
* Training Loss and Accuracy Over Epochs (Causal GraphSAGE): Visualizes the convergence of the model during causal training, showing decreasing loss and increasing accuracy over epochs.
* Confusion Matrix: Provides a detailed breakdown of true positives, true negatives, false positives, and false negatives from the final evaluation, illustrating the model's classification accuracy for each class.
* ROC Curve: Illustrates the model's trade-off between True Positive Rate and False Positive Rate across various classification thresholds, with the AUC-ROC score quantifying overall performance.
* Training Loss - Phishing Noise Training: Depicts the loss reduction during the training phase where noise was intentionally injected, demonstrating the model's ability to learn effectively despite data imperfections.
* Overall Training Loss/Accuracy: Shows the general learning progression of the model, likely from an initial training phase, with loss decreasing and accuracy increasing.
* Visual Interface: A dashboard that visualizes the data fed to the global (fusion) classifier and to the attack-specific models. It shows class probabilities, a plot of the graph, accuracy metrics, confidence scores for both models, and the probable reason behind each model's classification.

## Dependencies
The project relies on the following key libraries:

- Python 3.x
- torch (PyTorch)
- torch-geometric (PyG)
- torch-scatter
- pandas
- numpy
- scikit-learn
- matplotlib
- gradio

```bash
git clone https://github.com/spk-22/Phish-Guard
```
```bash
pip install -r requirements.txt
# (Or manually install: torch, torch-geometric, scikit-learn, pandas, numpy, matplotlib)
# Ensure torch-geometric, torch-scatter, and torch-sparse versions are compatible with your PyTorch version.
```
```bash
python phish.py
```
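```bash
# If web_app.py is a Gradio app (gradio is the listed dependency), run `python web_app.py` instead.
streamlit run web_app.py
```
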
## 🔍 Use Case

This pipeline is aimed at cybersecurity researchers and engineers who want to detect phishing attempts using relational and temporal patterns in their data.
The AUC-ROC score of 0.9023 indicates strong discriminative power between phishing and legitimate samples, even when the model is trained on noisy data.
