Ecosyste.ms: Awesome

An open API service indexing awesome lists of open source software.

Awesome Lists | Featured Topics | Projects

https://github.com/dominodatalab/reference-project-fraud-detection

Last synced: 29 days ago
JSON representation

Host: GitHub
URL: https://github.com/dominodatalab/reference-project-fraud-detection
Owner: dominodatalab
Created: 2021-10-14T07:04:27.000Z (about 3 years ago)
Default Branch: main
Last Pushed: 2021-12-16T21:20:38.000Z (about 3 years ago)
Last Synced: 2023-08-07T03:05:39.306Z (over 1 year ago)
Language: Jupyter Notebook
Size: 64.2 MB
Stars: 3
Watchers: 6
Forks: 5
Open Issues: 1
Metadata Files:
- Readme: README.md

Awesome Lists containing this project

README

        # Pre-canned solution: Credit Card Fraud Detection

Credit card fraud represents a significant problem for financial institutions, and reliable fraud detection is generally challenging.

This project can be used as a template, facilitating the training of a machine learning model on a real-world credit card fraud dataset.

It also employs techniques like oversampling and threshold moving to address class imbalance.

The dataset used in this project has been collected as part of a research collaboration between Worldline and the Machine Learning Group of Université Libre de Bruxelles, and the raw data can be freely downloaded from [Kaggle](https://www.kaggle.com/mlg-ulb/creditcardfraud).

The assets included in the project are:

* **FraudDetection.ipynb** - a notebook that performs exploratory data analysis, data wrangling, hyperparameter optimisation, model training and evaluation. The notebook introduces the usecases and discusses the key techniques needed for implementing a classification model (e.g. oversampling, threshold moving etc.)

* **model_train.py** - a training script that can be operationalised and retrain the model on demand or on schedule. The script can be used as a template. The key elements that need to be customized for other datasets are:

    * *load_data* - data ingestion function

    * *feature_eng* - data wrangling

    * *xgboost_search* - more specifically, the values in *params*, which define the grid search scope

    

* **model_api.py** - a scoring function that exposes the persisted model as Model API. The *score* function accepts as arguments all independent parameters of the dataset and uses the model to compute the fraud probability for the individual transaction.

**Note:** You need to unzip the *dataset/creditcard.csv.zip* file before running any of the above.

## Dockerfile

This project uses a compute environment based on dominodatalab/base:Ubuntu18_DAD_Py3.7_R3.6_20200508

Add the following entries to the Dockerfile:

```

RUN echo "ubuntu    ALL=NOPASSWD: ALL" >> /etc/sudoers

RUN pip install --upgrade pip

RUN pip install imblearn && pip install xgboost

```

## Model API

You can test the Model API using the following observation:

{

  "data": {

    "V1": -0.88, 

    "V2": 0.40, 

    "V3": 0.73, 

    "V4":-1.65, 

    "V5":2.73, 

    "V6":3.41, 

    "V7":0.23, 

    "V8":0.71, 

    "V9":-0.35, 

    "V10":-0.45,

    "V11":-0.16, 

    "V12":-0.36, 

    "V13":-0.10, 

    "V14":-0.06, 

    "V15":0.86, 

    "V16":0.83, 

    "V17":-1.28, 

    "V18":0.14, 

    "V19":-0.27, 

    "V20":0.10,

    "V21":-0.25, 

    "V22":-0.90, 

    "V23":-0.22, 

    "V24":0.98, 

    "V25":0.27,

    "V26":-0.001, 

    "V27":-0.29, 

    "V28":-0.14, 

    "Amount":-68.74, 

    "Hour":5.98

  }

}