Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://github.com/dominodatalab/reference-project-fraud-detection
- Host: GitHub
- URL: https://github.com/dominodatalab/reference-project-fraud-detection
- Owner: dominodatalab
- Created: 2021-10-14T07:04:27.000Z (about 3 years ago)
- Default Branch: main
- Last Pushed: 2021-12-16T21:20:38.000Z (about 3 years ago)
- Last Synced: 2023-08-07T03:05:39.306Z (over 1 year ago)
- Language: Jupyter Notebook
- Size: 64.2 MB
- Stars: 3
- Watchers: 6
- Forks: 5
- Open Issues: 1
Metadata Files:
- Readme: README.md
Awesome Lists containing this project
README
# Pre-canned solution: Credit Card Fraud Detection
Credit card fraud represents a significant problem for financial institutions, and reliable fraud detection is generally challenging.
This project can be used as a template for training a machine learning model on a real-world credit card fraud dataset. It also employs techniques such as oversampling and threshold moving to address class imbalance.

The dataset used in this project was collected as part of a research collaboration between Worldline and the Machine Learning Group of Université Libre de Bruxelles, and the raw data can be freely downloaded from [Kaggle](https://www.kaggle.com/mlg-ulb/creditcardfraud).
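As a rough illustration of those two techniques (a sketch, not code taken from the notebook), the snippet below oversamples the minority class with SMOTE and then classifies with a moved probability threshold. The train/test split, the hyperparameters, the 0.3 cut-off and the omission of any feature engineering are assumptions made for brevity.

```python
# Minimal sketch of oversampling (SMOTE) and threshold moving on the
# Kaggle credit card fraud dataset. Illustrative only; the split,
# hyperparameters and 0.3 threshold are assumptions.
import pandas as pd
from imblearn.over_sampling import SMOTE
from sklearn.model_selection import train_test_split
from xgboost import XGBClassifier

df = pd.read_csv("dataset/creditcard.csv")      # unzip dataset/creditcard.csv.zip first
X, y = df.drop(columns=["Class"]), df["Class"]  # "Class" is the fraud label (1 = fraud)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# Oversample the minority (fraud) class in the training set only
X_res, y_res = SMOTE(random_state=42).fit_resample(X_train, y_train)
model = XGBClassifier(eval_metric="logloss").fit(X_res, y_res)

# Threshold moving: flag fraud above a tuned cut-off instead of the default 0.5
threshold = 0.3                                  # example value; tune on a validation set
fraud_prob = model.predict_proba(X_test)[:, 1]
y_pred = (fraud_prob >= threshold).astype(int)
```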
The assets included in the project are:
* **FraudDetection.ipynb** - a notebook that performs exploratory data analysis, data wrangling, hyperparameter optimisation, model training and evaluation. The notebook introduces the use case and discusses the key techniques needed for implementing a classification model (e.g. oversampling, threshold moving, etc.)
* **model_train.py** - a training script that can be operationalised to retrain the model on demand or on a schedule. The script can be used as a template. The key elements that need to be customized for other datasets are:
  * *load_data* - the data ingestion function
  * *feature_eng* - data wrangling
  * *xgboost_search* - more specifically, the values in *params*, which define the grid search scope
* **model_api.py** - a scoring function that exposes the persisted model as a Model API. The *score* function accepts as arguments all independent variables of the dataset and uses the model to compute the fraud probability for the individual transaction (see the sketch after this list).

**Note:** You need to unzip the *dataset/creditcard.csv.zip* file before running any of the above.
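For orientation, a scoring function with the shape described above might look roughly like the sketch below. This is a hypothetical illustration, not the contents of *model_api.py*: the pickled model path, the feature order and the return format are assumptions.

```python
# Hypothetical sketch of a Model API scoring function.
# The model path, feature order and return format are assumptions,
# not the project's actual model_api.py.
import pickle
import pandas as pd

with open("models/xgb_fraud_model.pkl", "rb") as f:  # assumed location of the persisted model
    model = pickle.load(f)

# V1..V28 plus the Amount/Hour features used by the Model API payload
FEATURES = [f"V{i}" for i in range(1, 29)] + ["Amount", "Hour"]

def score(**kwargs):
    """Accept all independent variables as keyword arguments and return
    the model's fraud probability for a single transaction."""
    row = pd.DataFrame([[kwargs[name] for name in FEATURES]], columns=FEATURES)
    return {"fraud_probability": float(model.predict_proba(row)[:, 1][0])}
```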
## Dockerfile
This project uses a compute environment based on `dominodatalab/base:Ubuntu18_DAD_Py3.7_R3.6_20200508`.
Add the following entries to the Dockerfile:
```
# Allow passwordless sudo for the ubuntu user
RUN echo "ubuntu ALL=NOPASSWD: ALL" >> /etc/sudoers
# Install the additional Python packages used by the notebook and scripts
RUN pip install --upgrade pip
RUN pip install imblearn && pip install xgboost
```

## Model API
You can test the Model API using the following observation:
```
{
    "data": {
        "V1": -0.88,
        "V2": 0.40,
        "V3": 0.73,
        "V4": -1.65,
        "V5": 2.73,
        "V6": 3.41,
        "V7": 0.23,
        "V8": 0.71,
        "V9": -0.35,
        "V10": -0.45,
        "V11": -0.16,
        "V12": -0.36,
        "V13": -0.10,
        "V14": -0.06,
        "V15": 0.86,
        "V16": 0.83,
        "V17": -1.28,
        "V18": 0.14,
        "V19": -0.27,
        "V20": 0.10,
        "V21": -0.25,
        "V22": -0.90,
        "V23": -0.22,
        "V24": 0.98,
        "V25": 0.27,
        "V26": -0.001,
        "V27": -0.29,
        "V28": -0.14,
        "Amount": -68.74,
        "Hour": 5.98
    }
}
```
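A request of this shape can be sent from Python with the *requests* library, roughly as sketched below; the endpoint URL, the token and the token-as-basic-auth scheme are placeholders/assumptions, so substitute the values Domino generates for your deployed Model API.

```python
# Hypothetical call to the deployed Model API; the URL and token are
# placeholders, and the token-as-basic-auth scheme is an assumption.
import requests

MODEL_API_URL = "https://<your-domino-host>/models/<model-id>/latest/model"  # placeholder
AUTH_TOKEN = "<your-model-api-access-token>"                                 # placeholder

payload = {
    "data": {
        "V1": -0.88, "V2": 0.40, "V3": 0.73, "V4": -1.65, "V5": 2.73,
        "V6": 3.41, "V7": 0.23, "V8": 0.71, "V9": -0.35, "V10": -0.45,
        "V11": -0.16, "V12": -0.36, "V13": -0.10, "V14": -0.06, "V15": 0.86,
        "V16": 0.83, "V17": -1.28, "V18": 0.14, "V19": -0.27, "V20": 0.10,
        "V21": -0.25, "V22": -0.90, "V23": -0.22, "V24": 0.98, "V25": 0.27,
        "V26": -0.001, "V27": -0.29, "V28": -0.14, "Amount": -68.74, "Hour": 5.98,
    }
}

response = requests.post(MODEL_API_URL, json=payload, auth=(AUTH_TOKEN, AUTH_TOKEN))
print(response.json())
```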