Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://github.com/javi-aranda/pelusa-server
Backend and ML configuration of PELUSA, the ML engine to detect malicious URLs
docker-compose fastapi hacktoberfest machine-learning pandas python sklearn
Last synced: 7 days ago
- Host: GitHub
- URL: https://github.com/javi-aranda/pelusa-server
- Owner: javi-aranda
- License: mit
- Created: 2023-09-28T21:06:00.000Z (over 1 year ago)
- Default Branch: master
- Last Pushed: 2023-12-27T18:17:49.000Z (about 1 year ago)
- Last Synced: 2023-12-27T21:06:59.954Z (about 1 year ago)
- Topics: docker-compose, fastapi, hacktoberfest, machine-learning, pandas, python, sklearn
- Language: Python
- Homepage:
- Size: 13.5 MB
- Stars: 1
- Watchers: 1
- Forks: 1
- Open Issues: 13
Metadata Files:
- Readme: README.md
- Funding: .github/FUNDING.yml
- License: LICENSE
README
# Pelusa Server
![GitHub last commit (branch)](https://img.shields.io/github/last-commit/javi-aranda/pelusa-server/master)
[![Build and test](https://github.com/javi-aranda/pelusa-server/actions/workflows/test.yaml/badge.svg)](https://github.com/javi-aranda/pelusa-server/actions/workflows/test.yaml)

## Description
Pelusa (Predictive Engine for Legitimate & Unverified Site Assessment) is a machine-learning
application that predicts the legitimacy of a website from its URL. It is
built with FastAPI and PostgreSQL and deployed with Docker Compose.

## Installation
To get started, clone the repository and run it with Docker Compose.

```bash
git clone https://github.com/javi-aranda/pelusa-server
cd pelusa-server
docker-compose up # add the -d flag to run detached
docker-compose exec -T backend alembic upgrade head # run SQLAlchemy migrations (use a second terminal if not detached)
```

This should start the application at [http://localhost:8000](http://localhost:8000).
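Once the containers are up, one way to confirm that the API responds is to submit a URL to the `api/v1/analysis` endpoint described under Usage. The following is a minimal sketch using only the Python standard library. The helper names `build_analysis_request` and `analyze`, the sample URL, and the use of `POST` are assumptions made for illustration; check [http://localhost:8000/docs](http://localhost:8000/docs) for the actual method and response schema.

```python
import json
import urllib.request

# Address taken from the installation steps above.
BASE_URL = "http://localhost:8000"


def build_analysis_request(url, base_url=BASE_URL):
    """Build a request for the analysis endpoint.

    Hypothetical helper: POST is assumed because the endpoint takes a JSON body.
    """
    payload = json.dumps({"input": url}).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/api/v1/analysis",
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def analyze(url, base_url=BASE_URL):
    """Send the request and return the decoded JSON response unchanged."""
    request = build_analysis_request(url, base_url)
    with urllib.request.urlopen(request, timeout=10) as response:
        return json.loads(response.read().decode("utf-8"))
```

With the stack running, `print(analyze("http://example.com/login"))` prints whatever JSON the service returns for that URL.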
## Usage
A detailed API reference is available at [http://localhost:8000/docs](http://localhost:8000/docs).
The main endpoint is `api/v1/analysis`, which accepts a JSON body of the form `{"input": ""}`
and returns the predicted legitimacy of the website (1 means potentially malicious, 0 means potentially safe).

The results are stored in a PostgreSQL database, which could be useful for training the model in the future
or as a persistence mechanism when the same URL is submitted multiple times in a short period.

### Exploring the database
A [pgAdmin](https://www.pgadmin.org/) instance runs at [http://localhost:5050](http://localhost:5050), with credentials
defined in the `.env` file. After connecting to the PostgreSQL server, you can explore the database and run any query you want.

## Dataset
The dataset used to train the model was assembled by hand. It consists of 30,000 URLs, 50% legitimate and 50% malicious.

Malicious URLs were randomly sampled from [PhishTank active threats](http://data.phishtank.com/data/online-valid.csv),
and legitimate URLs were sampled from multiple [Kaggle datasets](https://www.kaggle.com/search?q=urls+in%3Adatasets).
After extracting features for both types, the resulting dataset is [phishing_dataset.csv](https://github.com/javi-aranda/pelusa-server/blob/master/backend/app/ml/data/phishing_dataset.csv).

## Training
The model is a Random Forest classifier with an accuracy of 94% over the training dataset.
The training code is available as a Jupyter Notebook in [train.ipynb](https://github.com/javi-aranda/pelusa-server/blob/master/backend/app/ml/notebooks/train.ipynb).

## Credits
This project was built with [FastAPI Starter](https://github.com/gaganpreet/fastapi-starter) as a reference,
but the frontend lives in a separate repository, [Pelusa React](https://github.com/javi-aranda/pelusa-react).