https://github.com/javi-aranda/pelusa-server
Backend and ML configuration of PELUSA, the ML engine to detect malicious URLs
docker-compose fastapi hacktoberfest machine-learning pandas python sklearn
Last synced: 2 months ago
- Host: GitHub
- URL: https://github.com/javi-aranda/pelusa-server
- Owner: javi-aranda
- License: mit
- Created: 2023-09-28T21:06:00.000Z (over 2 years ago)
- Default Branch: master
- Last Pushed: 2024-03-19T12:42:23.000Z (over 2 years ago)
- Last Synced: 2025-05-30T01:19:29.758Z (about 1 year ago)
- Topics: docker-compose, fastapi, hacktoberfest, machine-learning, pandas, python, sklearn
- Language: Python
- Homepage:
- Size: 13.1 MB
- Stars: 1
- Watchers: 1
- Forks: 1
- Open Issues: 15
Metadata Files:
- Readme: README.md
- Funding: .github/FUNDING.yml
- License: LICENSE
README
# Pelusa Server

## Description
Pelusa (Predictive Engine for Legitimate & Unverified Site Assessment) is a machine-learning
application that predicts whether a website is legitimate from the URL provided. It is
built with FastAPI and PostgreSQL and deployed with Docker Compose.
## Installation
To get started, clone the repository and run with Docker Compose.
```bash
git clone https://github.com/javi-aranda/pelusa-server
cd pelusa-server
docker-compose up # add -d to run detached; otherwise run the next command from a second terminal
docker-compose exec -T backend alembic upgrade head # apply the Alembic (SQLAlchemy) database migrations
```
The application should then be available at [http://localhost:8000](http://localhost:8000).
## Usage
A detailed API reference is available at [http://localhost:8000/docs](http://localhost:8000/docs).
The main endpoint is `api/v1/analysis`, which accepts a JSON body of the form `{"input": ""}`
and returns the predicted legitimacy of the website (1 means potentially malicious, 0 means potentially safe).
Results are stored in a PostgreSQL database, which could be useful for training the model in the future
or as a persistence mechanism when a URL is submitted multiple times in a short period.
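As an illustration, the request described above could be sent from Python roughly as follows. This is a minimal sketch using only the standard library. It assumes the endpoint is called with `POST` and that the response body is JSON; neither is stated here, so check [http://localhost:8000/docs](http://localhost:8000/docs) for the actual method and response schema. The helper names `build_analysis_request` and `analyze` are hypothetical.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8000"  # address from the Installation section


def build_analysis_request(url_to_check, base_url=BASE_URL):
    """Build the HTTP request for the analysis endpoint (hypothetical helper)."""
    payload = json.dumps({"input": url_to_check}).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/api/v1/analysis",
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",  # assumption: the HTTP method is not stated in this README
    )


def analyze(url_to_check, base_url=BASE_URL, timeout=10):
    """Send the request and return the decoded JSON response as-is."""
    request = build_analysis_request(url_to_check, base_url)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        # Assumption: the server replies with a JSON document.
        return json.loads(response.read().decode("utf-8"))
```

With the Docker Compose stack running, `analyze("http://example.com/login")` would return whatever the server sends back; the sketch deliberately does not assume any field names in that response.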
### Exploring the database
A [pgAdmin](https://www.pgadmin.org/) instance runs on [http://localhost:5050](http://localhost:5050), with credentials
defined in the `.env` file. After connecting to the PostgreSQL server, you can explore the database and run any query you want.
## Dataset
The dataset used for training the model is handmade and consists of 30,000 URLs, 50% legitimate and 50% malicious.
Malicious URLs were randomly sampled from [PhishTank active threats](http://data.phishtank.com/data/online-valid.csv),
and legitimate URLs were sampled from multiple [Kaggle datasets](https://www.kaggle.com/search?q=urls+in%3Adatasets).
After extracting features for both types, the resulting dataset is [phishing_dataset.csv](https://github.com/javi-aranda/pelusa-server/blob/master/backend/app/ml/data/phishing_dataset.csv).
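The 50/50 sampling step could look roughly like the sketch below. It is not the script used to build `phishing_dataset.csv`: the function name, the seed, and the label convention (1 for malicious, 0 for legitimate, mirroring the API output) are assumptions made for illustration, and feature extraction is left out.

```python
import random


def build_balanced_sample(malicious_urls, legitimate_urls, total=30000, seed=0):
    """Return a shuffled list of (url, label) pairs with equal class sizes.

    Hypothetical helper; label 1 marks malicious URLs, 0 legitimate ones.
    """
    per_class = total // 2
    if len(malicious_urls) < per_class or len(legitimate_urls) < per_class:
        raise ValueError("not enough URLs in one of the sources")

    rng = random.Random(seed)  # fixed seed so the sample is reproducible
    rows = [(url, 1) for url in rng.sample(malicious_urls, per_class)]
    rows += [(url, 0) for url in rng.sample(legitimate_urls, per_class)]
    rng.shuffle(rows)  # avoid having all malicious rows first
    return rows
```

Sampling each class separately guarantees the 50/50 split exactly, which a single random draw from the pooled URLs would not.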
## Training
The model is a Random Forest classifier that reaches an accuracy of 94% on the training dataset.
The training code is available as a Jupyter notebook in [train.ipynb](https://github.com/javi-aranda/pelusa-server/blob/master/backend/app/ml/notebooks/train.ipynb).
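For orientation, a Random Forest can be trained and scored with scikit-learn along these lines. This is a generic sketch, not the contents of the notebook: it uses synthetic data from `make_classification` in place of the real features, and the hyperparameters, the 80/20 split, and the function name `train_and_evaluate` are assumptions. It reports accuracy on both the training split and a held-out split, since the former tends to be the more optimistic of the two.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split


def train_and_evaluate(X, y, seed=42):
    """Fit a Random Forest and return (model, train accuracy, test accuracy)."""
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, stratify=y, random_state=seed
    )
    model = RandomForestClassifier(n_estimators=100, random_state=seed)
    model.fit(X_train, y_train)
    train_acc = accuracy_score(y_train, model.predict(X_train))
    test_acc = accuracy_score(y_test, model.predict(X_test))
    return model, train_acc, test_acc


# Synthetic placeholder data; the real project uses features extracted from URLs.
X, y = make_classification(n_samples=1000, n_features=10, random_state=0)
model, train_acc, test_acc = train_and_evaluate(X, y)
print(f"train accuracy: {train_acc:.3f}, held-out accuracy: {test_acc:.3f}")
```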
## Credits
This project uses [FastAPI Starter](https://github.com/gaganpreet/fastapi-starter) as a reference,
but keeps the frontend in a separate repository, [Pelusa React](https://github.com/javi-aranda/pelusa-react).