https://github.com/spk-22/phish-guard
A comprehensive deep learning framework for phishing detection, utilizing Graph Neural Networks (GraphSAGE) to analyze interconnected web features. Features include temporal graph construction, causal learning for robust time-series analysis, and integrated noise injection testing to evaluate model resilience against data imperfections.
casual-sampling gnn-model graphsage ids phishing-detection pytorch temporal-data
- Host: GitHub
- URL: https://github.com/spk-22/phish-guard
- Owner: spk-22
- Created: 2025-05-23T14:58:21.000Z (about 1 year ago)
- Default Branch: main
- Last Pushed: 2025-05-27T14:56:26.000Z (about 1 year ago)
- Last Synced: 2025-06-08T06:42:01.017Z (about 1 year ago)
- Topics: casual-sampling, gnn-model, graphsage, ids, phishing-detection, pytorch, temporal-data
- Language: Python
- Homepage:
- Size: 47.9 KB
- Stars: 3
- Watchers: 0
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
# Phishing Detection Using GraphSAGE with Causal Sampling (GNNs)
This repository presents a complete workflow for phishing detection leveraging **GraphSAGE**, a type of Graph Neural Network (GNN), with temporal modeling, causal sampling, and robustness testing.
## 🧠 Overview
Phishing attacks often involve subtle patterns that can be better detected using relational and temporal data. This project converts phishing datasets into graphs and applies a GNN model that:
- Respects **causal constraints** in message passing.
- Incorporates **temporal windowing** for realistic data flow.
- Tests **robustness** through noise injection.
## 🛠 Tech Stack
- **Programming Language:** Python
- **Graph Processing:** [PyTorch Geometric](https://pytorch-geometric.readthedocs.io/en/latest/)
- **Machine Learning:** PyTorch, Scikit-learn
- **Data Handling:** pandas, numpy
- **Visualization:** matplotlib
## 📊 Workflow Summary
### 1. **Data Preprocessing**
- Load and clean phishing data from `phish.xlsx`
- One-hot encode categorical features
- Scale numerical features
- Combine features for each URL
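
The preprocessing steps above can be sketched roughly as follows. This is a minimal illustration, not the repository's code: the column names and the toy rows are hypothetical stand-ins for the contents of `phish.xlsx`, which would normally be loaded with `pd.read_excel("phish.xlsx")`.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the rows of phish.xlsx; all column names are hypothetical.
df = pd.DataFrame({
    "url_length": [54, 23, 112, 23],
    "num_dots": [3, 1, 6, 1],
    "tld": ["com", "org", "xyz", "org"],
    "label": [1, 0, 1, 0],
})

# Basic cleaning: drop exact duplicate rows and rows with missing values.
df = df.drop_duplicates().dropna().reset_index(drop=True)

categorical_cols = ["tld"]
numerical_cols = ["url_length", "num_dots"]

# One-hot encode the categorical columns.
cat_features = pd.get_dummies(df[categorical_cols], dtype=float)

# Scale numerical columns to zero mean and unit variance.
num = df[numerical_cols].astype(float)
std = num.std(ddof=0).replace(0.0, 1.0)  # guard against constant columns
num_features = (num - num.mean()) / std

# Combine everything into one feature matrix with one row per URL.
X = np.hstack([num_features.to_numpy(), cat_features.to_numpy()])
y = df["label"].to_numpy()
```

Standardization is written out by hand here to keep the sketch short; scikit-learn's `StandardScaler` and `OneHotEncoder` are the usual choice when the same transform must be reapplied to unseen data.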
### 2. **Graph Construction**
- Create a similarity graph using cosine similarity
- Connect each node to k=5 nearest neighbors
- Partition data into time windows of 10 samples
- Generate PyG `Data` objects for each time window
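
A rough sketch of the graph construction step, using NumPy only. It assumes that a separate k-nearest-neighbour graph is built inside each time window and that row order stands in for time; the README does not state either detail, and the function names are hypothetical.

```python
import numpy as np

def knn_cosine_edges(x, k=5):
    """Return a [2, E] edge index linking each node to its k most cosine-similar nodes."""
    norms = np.linalg.norm(x, axis=1, keepdims=True)
    unit = x / np.clip(norms, 1e-12, None)       # avoid division by zero
    sim = unit @ unit.T                          # pairwise cosine similarity
    np.fill_diagonal(sim, -np.inf)               # exclude self-loops
    k = min(k, x.shape[0] - 1)                   # small windows have fewer candidates
    neighbors = np.argsort(-sim, axis=1)[:, :k]  # top-k neighbours per node
    targets = np.repeat(np.arange(x.shape[0]), k)
    sources = neighbors.reshape(-1)
    return np.stack([sources, targets])          # row 0: source, row 1: target

def time_windows(x, y, window_size=10):
    """Yield (features, labels, edge_index) for consecutive chunks of rows."""
    for start in range(0, x.shape[0], window_size):
        xw = x[start:start + window_size]
        yw = y[start:start + window_size]
        if xw.shape[0] < 2:                      # a single node has no neighbours
            continue
        yield xw, yw, knn_cosine_edges(xw)

# Random stand-in data: 25 samples with 8 features each.
rng = np.random.default_rng(0)
X = rng.normal(size=(25, 8))
y = rng.integers(0, 2, size=25)
windows = list(time_windows(X, y))
```

Each tuple can then be converted to tensors and wrapped in a `torch_geometric.data.Data(x=..., edge_index=..., y=...)` object.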
### 3. **Causal GraphSAGE Model**
- Custom model using `SAGEConv`, `BatchNorm`, `Dropout`
- Enforces **causal message passing** (no future info leakage)
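
The README does not describe how causality is enforced. One common approach, shown here purely as an assumption, is to drop every edge whose source node is later in time than its target before the edge index reaches `SAGEConv`. The function name and the timestamps are hypothetical.

```python
import numpy as np

def causal_edge_mask(edge_index, timestamps):
    """Keep an edge source -> target only if the source is not later than the target."""
    src, dst = edge_index
    return timestamps[src] <= timestamps[dst]

# Five directed edges over four nodes; node i has timestamp i.
edge_index = np.array([[0, 1, 2, 3, 1],
                       [1, 0, 1, 2, 3]])
timestamps = np.array([0, 1, 2, 3])

mask = causal_edge_mask(edge_index, timestamps)
causal_edges = edge_index[:, mask]  # only 0 -> 1 and 1 -> 3 remain
```

With this filter a node aggregates messages only from nodes at the same or an earlier time, so no information from the future reaches its representation.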
### 4. **Noise Injection for Robustness**
- Add Gaussian noise to node features
- Randomly flip labels to simulate real-world inconsistencies
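
The two perturbations above might look like the following sketch. The noise level `feature_std` and the flip probability `flip_prob` are illustrative values, not the settings used in the repository.

```python
import numpy as np

def inject_noise(x, y, feature_std=0.1, flip_prob=0.05, seed=0):
    """Add Gaussian noise to features and flip binary labels with probability flip_prob."""
    rng = np.random.default_rng(seed)
    x_noisy = x + rng.normal(0.0, feature_std, size=x.shape)
    flipped = rng.random(y.shape[0]) < flip_prob  # True where a label is flipped
    y_noisy = np.where(flipped, 1 - y, y)
    return x_noisy, y_noisy, flipped

# Stand-in data: 100 samples, 4 features, all labelled 0.
x = np.zeros((100, 4))
y = np.zeros(100, dtype=int)
x_noisy, y_noisy, flipped = inject_noise(x, y)
```

Returning the `flipped` mask makes it possible to check afterwards how the model behaves on the corrupted samples specifically.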
### 5. **Training**
- Trained with Binary Cross-Entropy loss and Adam optimizer
- Evaluated using AUC-ROC score and ROC curve visualization
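
For reference, the loss and the evaluation metric named above can be written out in a few lines of NumPy. In a PyTorch project they would normally come from `torch.nn.BCELoss` (or `torch.nn.BCEWithLogitsLoss`) and `sklearn.metrics.roc_auc_score`; the README does not say which variants are used, so this is only a sketch of what those quantities compute.

```python
import numpy as np

def binary_cross_entropy(y_true, p, eps=1e-7):
    """Mean binary cross-entropy between labels and predicted probabilities."""
    p = np.clip(p, eps, 1 - eps)  # avoid log(0)
    return float(-np.mean(y_true * np.log(p) + (1 - y_true) * np.log(1 - p)))

def auc_roc(y_true, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly; ties count half."""
    pos = scores[y_true == 1]
    neg = scores[y_true == 0]
    diff = pos[:, None] - neg[None, :]
    correct = np.sum(diff > 0) + 0.5 * np.sum(diff == 0)
    return float(correct / (pos.size * neg.size))

y_true = np.array([0, 0, 1, 1])
scores = np.array([0.1, 0.4, 0.35, 0.8])

loss = binary_cross_entropy(y_true, scores)
auc = auc_roc(y_true, scores)  # 3 of the 4 positive/negative pairs are ordered correctly
```

The pairwise formulation is quadratic in the number of samples, which is fine for illustration but not for large evaluation sets.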
## 📈 Evaluation
The model achieved strong performance on phishing detection:
| Metric    | Value  | Notes                    |
|-----------|--------|--------------------------|
| Accuracy  | 86.36% |                          |
| Precision | 86.32% |                          |
| Recall    | 86.36% |                          |
| F1-Score  | 86.14% |                          |
| AUC-ROC   | 0.9023 | Visualized in final plot |
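
The table does not say how precision, recall, and F1 are averaged across the two classes. The reported recall equals the accuracy (86.36%), which is consistent with support-weighted averaging, since weighted recall is always equal to accuracy. The sketch below assumes that averaging and uses made-up predictions, not the project's results.

```python
import numpy as np

def weighted_metrics(y_true, y_pred):
    """Accuracy plus support-weighted precision, recall, and F1 over all classes."""
    classes = np.unique(y_true)
    weights, precisions, recalls, f1s = [], [], [], []
    for c in classes:
        tp = np.sum((y_pred == c) & (y_true == c))
        predicted = np.sum(y_pred == c)
        actual = np.sum(y_true == c)
        p = tp / predicted if predicted else 0.0
        r = tp / actual
        precisions.append(p)
        recalls.append(r)
        f1s.append(2 * p * r / (p + r) if (p + r) else 0.0)
        weights.append(actual / y_true.size)  # class share of the samples
    w = np.array(weights)
    return {
        "accuracy": float(np.mean(y_true == y_pred)),
        "precision": float(w @ np.array(precisions)),
        "recall": float(w @ np.array(recalls)),
        "f1": float(w @ np.array(f1s)),
    }

# Made-up labels: 6 legitimate (0) and 4 phishing (1) samples, two misclassified.
y_true = np.array([0, 0, 0, 0, 0, 0, 1, 1, 1, 1])
y_pred = np.array([0, 0, 0, 0, 0, 1, 1, 1, 1, 0])
metrics = weighted_metrics(y_true, y_pred)
```

If the project instead reports metrics for the phishing class only, recall and accuracy would generally differ, so the averaging mode is worth confirming in `phish.py`.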
## Visualizations
* Training Loss and Accuracy Over Epochs (Causal GraphSAGE): Visualizes the convergence of the model during causal training, showing decreasing loss and increasing accuracy over epochs.
* Confusion Matrix: Provides a detailed breakdown of true positives, true negatives, false positives, and false negatives from the final evaluation, illustrating the model's classification accuracy for each class.
* ROC Curve: Illustrates the model's trade-off between True Positive Rate and False Positive Rate across various classification thresholds, with the AUC-ROC score quantifying overall performance.
* Training Loss - Phishing Noise Training: Depicts the loss reduction during the training phase where noise was intentionally injected, demonstrating the model's ability to learn effectively despite data imperfections.
* Overall Training Loss/Accuracy: Shows the general learning progression of the model, likely from an initial training phase, with loss decreasing and accuracy increasing.
* Visual Interface: A dashboard that shows the data fed to the global (fusion classifier) and attack-specific models, together with class probabilities, a graph plot, accuracy metrics, the confidence scores of both models, and the probable reason behind each model's classification.
## Dependencies
The project relies on the following key libraries:
- Python 3.x
- torch (PyTorch)
- torch-geometric (PyG)
- torch-scatter
- pandas
- numpy
- scikit-learn
- matplotlib
- gradio
Clone the repository:
```bash
git clone https://github.com/spk-22/Phish-Guard
```
Install the dependencies:
```bash
pip install -r requirements.txt
# Or install manually: torch, torch-geometric, scikit-learn, pandas, numpy, matplotlib
# Ensure the torch-geometric, torch-scatter, and torch-sparse versions are compatible with your PyTorch version.
```
Run the main script:
```bash
python phish.py
```
Launch the web app:
```bash
streamlit run web_app.py
```
## 🔍 Use Case
This pipeline is ideal for cybersecurity researchers and engineers looking to detect phishing attempts using relational and temporal patterns within data.
An AUC-ROC of 0.9023 indicates strong discriminative power: the model separates phishing samples from legitimate ones well, even when trained on noisy data.