https://github.com/pirocheto/phishing-url-detection
Train a machine learning model for Phishing URL Detection with mlops practices.
ai anti-phishing cybersecurity data-science machine-learning mlops phishing-detection
- Host: GitHub
- URL: https://github.com/pirocheto/phishing-url-detection
- Owner: pirocheto
- Created: 2023-10-29T00:52:20.000Z (over 2 years ago)
- Default Branch: main
- Last Pushed: 2023-12-04T15:12:57.000Z (over 2 years ago)
- Last Synced: 2025-02-13T00:49:55.304Z (over 1 year ago)
- Topics: ai, anti-phishing, cybersecurity, data-science, machine-learning, mlops, phishing-detection
- Language: Python
- Homepage:
- Size: 202 MB
- Stars: 8
- Watchers: 1
- Forks: 1
- Open Issues: 0
Metadata Files:
- Readme: README.md
# Phishing URL Detection with Machine Learning
This repository contains the code for training a machine learning model for phishing URL detection.
The dataset used and the latest model are hosted on Hugging Face:
- Dataset: https://huggingface.co/datasets/pirocheto/phishing-url
- Model: https://huggingface.co/pirocheto/phishing-url-detection
> ℹ️ You can test the model on the demo page [here](https://pirocheto.github.io/phishing-url-detection/).
## Considerations Regarding the Model
The model combines a TF-IDF vectorizer (character n-grams plus word n-grams) for feature extraction with a linear SVM for classification.
- :white_check_mark: **Lightweight**: The model is small enough to embed directly in your application, with no remote server needed to host it.
- :white_check_mark: **Fast**: Inference runs locally, so predictions add no network round trip and very little latency.
- :white_check_mark: **Works Offline**: The model relies only on tokens from the URL itself, so it needs no internet connection.

On the other hand, it may be less accurate than more complex models or models that use external features (features beyond the URL string itself).
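
The architecture described above can be sketched with scikit-learn as follows. This is a minimal illustration, not the project's actual training code: the toy URLs, the n-gram ranges, the `char_wb` analyzer, and `C=1.0` are all assumptions chosen for the example. The real configuration lives in `params.yaml` and `src`.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import FeatureUnion, make_pipeline
from sklearn.svm import LinearSVC

# Toy data invented for illustration only (not taken from the project's dataset).
# Label 0 = legitimate, 1 = phishing.
urls = [
    "https://www.wikipedia.org/",
    "https://github.com/pirocheto/phishing-url-detection",
    "https://docs.python.org/3/library/urllib.html",
    "https://scikit-learn.org/stable/",
    "http://paypal.com.secure-login.example-verify.xyz/account/update",
    "http://192.0.2.10/bank/login.php?session=123",
    "http://free-gift-card.example.top/claim-now/win",
    "http://secure-update.example.info/login/verify-account",
]
labels = [0, 0, 0, 0, 1, 1, 1, 1]

# Two TF-IDF vectorizers whose outputs are concatenated:
# one over character n-grams, one over word n-grams.
# The n-gram ranges here are assumed values, not the project's.
features = FeatureUnion([
    ("char", TfidfVectorizer(analyzer="char_wb", ngram_range=(1, 3))),
    ("word", TfidfVectorizer(analyzer="word", ngram_range=(1, 2))),
])

# Linear SVM on top of the combined TF-IDF features.
model = make_pipeline(features, LinearSVC(C=1.0))
model.fit(urls, labels)

# Predict on the raw URL strings; no network access or external features needed.
predictions = model.predict(urls)
scores = model.decision_function(["http://login.example.xyz/verify"])
```

Because the whole pipeline takes raw URL strings as input, the fitted object can be serialized and shipped with an application, which is what makes the offline use case possible.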
## Reproduce The Model
```bash
# 1. Clone the repository
git clone https://github.com/pirocheto/phishing-url-detection.git
# 2. Go inside the project
cd phishing-url-detection
# 3. Install dependencies
poetry install --no-root
# 4. Run the pipeline
dvc repro -s download_data
dvc repro -s train
```
For more details, see the pipeline in the [dvc.yaml](dvc.yaml) file.
## Project Structure
- `live`: Artifacts created during pipeline execution
- `notebooks`: Contains the code for the exploration phase
- `ressources`: Miscellaneous resources used by scripts
- `tests`: Test files
- `src`: Python scripts
- `params.yaml`: Parameters for the DVC experiment
- `dvc.yaml`: Pipeline to run the experiment and reproduce executions
## Main Tools Used in This Project
- [DVC](https://dvc.org/): Version data and experiments
- [CML](https://cml.dev/): Post a comment to the pull request showing the metrics and parameters of an experiment
- [Scikit-Learn](https://scikit-learn.org/stable/): Framework to train the model
- [Optuna](https://optuna.readthedocs.io/en/stable/): Find the best hyperparameters for the model