Ecosyste.ms: Awesome

An open API service indexing awesome lists of open source software.

https://github.com/javi-aranda/pelusa-server

Backend and ML configuration of PELUSA, the ML engine to detect malicious URLs

docker-compose fastapi hacktoberfest machine-learning pandas python sklearn

Last synced: 7 days ago

Host: GitHub
URL: https://github.com/javi-aranda/pelusa-server
Owner: javi-aranda
License: mit
Created: 2023-09-28T21:06:00.000Z (over 1 year ago)
Default Branch: master
Last Pushed: 2023-12-27T18:17:49.000Z (about 1 year ago)
Last Synced: 2023-12-27T21:06:59.954Z (about 1 year ago)
Topics: docker-compose, fastapi, hacktoberfest, machine-learning, pandas, python, sklearn
Language: Python
Homepage:
Size: 13.5 MB
Stars: 1
Watchers: 1
Forks: 1
Open Issues: 13
Metadata Files:
- Readme: README.md
- Funding: .github/FUNDING.yml
- License: LICENSE

README

# Pelusa Server

![GitHub last commit (branch)](https://img.shields.io/github/last-commit/javi-aranda/pelusa-server/master)
[![Build and test](https://github.com/javi-aranda/pelusa-server/actions/workflows/test.yaml/badge.svg)](https://github.com/javi-aranda/pelusa-server/actions/workflows/test.yaml)

## Description
Pelusa (Predictive Engine for Legitimate & Unverified Site Assessment) is a machine-learning
application that predicts the legitimacy of a website based on the URL provided. It is
built with FastAPI and PostgreSQL and deployed with Docker Compose.

## Installation
To get started, clone the repository and run with Docker Compose.

```bash
git clone https://github.com/javi-aranda/pelusa-server
cd pelusa-server
docker-compose up -d # detached, so the next command can run in the same terminal
docker-compose exec -T backend alembic upgrade head # apply the Alembic (SQLAlchemy) database migrations
```

The application should then be running at [http://localhost:8000](http://localhost:8000).

## Usage
A detailed API reference is available at [http://localhost:8000/docs](http://localhost:8000/docs).
The main endpoint is `api/v1/analysis`, which accepts a JSON body of the form `{"input": ""}`
and returns the legitimacy of the website (1 means potentially bad, 0 means potentially safe).
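
As a rough illustration of the request described above, here is a minimal client sketch that uses only the Python standard library. The HTTP method (`POST`), the exact path form (no trailing slash), and the helper names `build_analysis_request` and `analyze` are assumptions, not taken from the project; the interactive docs at `/docs` are the place to confirm the real contract. The response is returned as decoded JSON without assuming its shape.

```python
import json
import urllib.request

# Assumption: the server runs at the default address from the Installation section.
BASE_URL = "http://localhost:8000"


def build_analysis_request(url_to_check: str, base_url: str = BASE_URL) -> urllib.request.Request:
    """Build (but do not send) a request for the api/v1/analysis endpoint."""
    body = json.dumps({"input": url_to_check}).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/api/v1/analysis",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",  # assumed; check /docs for the actual method
    )


def analyze(url_to_check: str, base_url: str = BASE_URL) -> object:
    """Send the request and return the decoded JSON response, whatever its shape."""
    request = build_analysis_request(url_to_check, base_url)
    with urllib.request.urlopen(request, timeout=10) as response:
        return json.load(response)
```

With the containers running, `analyze("http://example.com/login")` would send the request. Building the request separately from sending it makes it possible to inspect the payload without a live server.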

Analysis results are stored in a PostgreSQL database. The stored data could be used to retrain the model in the future,
or as a persistence mechanism when the same URL is submitted several times within a short period.
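
The reuse idea mentioned above, answering a repeated submission from stored results instead of running the model again, could look roughly like the sketch below. It is not the project's implementation: an in-memory dictionary stands in for the PostgreSQL table, and `RecentResults`, `analyze_with_reuse`, the `predict` callable, and the time-to-live value are all hypothetical names and parameters chosen for illustration.

```python
import time
from typing import Callable, Dict, Optional, Tuple


class RecentResults:
    """In-memory stand-in for a results table: url -> (label, stored_at)."""

    def __init__(self, ttl_seconds: float, clock: Callable[[], float] = time.time) -> None:
        self._ttl = ttl_seconds
        self._clock = clock  # injectable so the expiry logic can be tested
        self._rows: Dict[str, Tuple[int, float]] = {}

    def store(self, url: str, label: int) -> None:
        self._rows[url] = (label, self._clock())

    def lookup(self, url: str) -> Optional[int]:
        """Return the stored label if it is not older than the TTL, else None."""
        row = self._rows.get(url)
        if row is None:
            return None
        label, stored_at = row
        if self._clock() - stored_at > self._ttl:
            return None
        return label


def analyze_with_reuse(url: str, results: RecentResults, predict: Callable[[str], int]) -> int:
    """Reuse a recent stored label when available; otherwise predict and store."""
    cached = results.lookup(url)
    if cached is not None:
        return cached
    label = predict(url)
    results.store(url, label)
    return label
```

Passing the clock in as a parameter keeps the expiry rule easy to test; a database-backed version would compare against a stored timestamp column instead.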

### Exploring the database
A [pgAdmin](https://www.pgadmin.org/) instance runs at [http://localhost:5050](http://localhost:5050), with credentials
defined in the `.env` file. After connecting to the PostgreSQL server, you can explore the database and run queries against it.

## Dataset
The dataset used to train the model was assembled by hand. It consists of 30,000 URLs, 50% legitimate and 50% malicious.

Malicious websites were randomly sampled from [PhishTank active threats](http://data.phishtank.com/data/online-valid.csv)
and legitimate URLs were sampled from multiple [Kaggle datasets](https://www.kaggle.com/search?q=urls+in%3Adatasets).
After extracting features for both types, the resulting dataset is [phishing_dataset.csv](https://github.com/javi-aranda/pelusa-server/blob/master/backend/app/ml/data/phishing_dataset.csv).
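
To give a sense of what URL feature extraction can involve, the sketch below computes a few common lexical features with the standard library. The feature names and the function `extract_features` are illustrative assumptions; the columns actually used by the project are those in `phishing_dataset.csv` and may differ.

```python
import ipaddress
from urllib.parse import urlsplit


def extract_features(url: str) -> dict:
    """Compute a handful of illustrative lexical features for one URL."""
    parts = urlsplit(url)
    host = parts.hostname or ""

    # A raw IP address in place of a domain name is a commonly used signal.
    try:
        ipaddress.ip_address(host)
        host_is_ip = 1
    except ValueError:
        host_is_ip = 0

    return {
        "url_length": len(url),
        "host_length": len(host),
        "num_dots": host.count("."),
        "num_digits": sum(ch.isdigit() for ch in url),
        "has_at_symbol": int("@" in url),
        "uses_https": int(parts.scheme == "https"),
        "host_is_ip": host_is_ip,
    }
```

Applying such a function to every sampled URL and adding the 0/1 label would yield one row per URL, which is the general shape a tabular classifier expects.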

## Training
The model is a Random Forest classifier that reaches 94% accuracy on the training dataset.
The training code is available as a Jupyter Notebook in [train.ipynb](https://github.com/javi-aranda/pelusa-server/blob/master/backend/app/ml/notebooks/train.ipynb).
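
For readers unfamiliar with the approach, the following is a minimal scikit-learn sketch of training a Random Forest classifier. It is not the contents of `train.ipynb`: the data is synthetic (generated with `make_classification`), and the hyperparameters and split ratio are assumptions. It reports accuracy on a held-out split as well as on the training data, since accuracy measured on training data alone tends to be optimistic.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the feature table; the real features come from phishing_dataset.csv.
X, y = make_classification(
    n_samples=1000, n_features=10, n_informative=5, random_state=0
)

# Hold out 20% of the rows, keeping the class balance the same in both splits.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

model = RandomForestClassifier(n_estimators=100, random_state=0)
model.fit(X_train, y_train)

train_accuracy = accuracy_score(y_train, model.predict(X_train))
test_accuracy = accuracy_score(y_test, model.predict(X_test))
print(f"train accuracy: {train_accuracy:.3f}")
print(f"held-out accuracy: {test_accuracy:.3f}")
```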

## Credits

This project used [FastAPI Starter](https://github.com/gaganpreet/fastapi-starter) as a reference,
but keeps the frontend in a separate repository, [Pelusa React](https://github.com/javi-aranda/pelusa-react).