Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://github.com/emilyfelker/ieee_cis_fraud_detection
Which online transactions are fraudulent? Program that uses machine learning to detect fraud.
- Host: GitHub
- URL: https://github.com/emilyfelker/ieee_cis_fraud_detection
- Owner: emilyfelker
- Created: 2024-12-08T14:08:25.000Z (about 2 months ago)
- Default Branch: master
- Last Pushed: 2024-12-12T16:14:17.000Z (about 2 months ago)
- Last Synced: 2024-12-12T17:26:00.487Z (about 2 months ago)
- Topics: kaggle, logistic-regression, machine-learning, pandas, poetry, python, scikit-learn, sklearn, xgboost
- Language: Python
- Homepage:
- Size: 588 KB
- Stars: 0
- Watchers: 2
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
README
# IEEE-CIS Fraud Detection in Online Payments
## Introduction
This program is designed for the
[IEEE-CIS Fraud Detection](https://www.kaggle.com/competitions/ieee-fraud-detection/) competition on Kaggle.
The challenge is to predict the probability that an online transaction is fraudulent,
based on a large-scale dataset of transactions from a leading online payment processing company.
I implemented the program in a single Python script (`main.py`) for ease of automatic conversion to a
Jupyter notebook for submission to the Kaggle competition.
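The README doesn't specify the conversion tool; as one hedged illustration, a script can be turned into a notebook with the `jupytext` package (an assumption, not necessarily what this repo uses):

```python
# Hypothetical conversion step (this repo may use a different tool):
# parse the script and emit a Jupyter notebook with jupytext.
import jupytext

notebook = jupytext.read("main.py")     # parse the Python script
jupytext.write(notebook, "main.ipynb")  # write it out as a notebook
```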
## Approach

This program contains a pipeline to handle feature processing, model training and evaluation, selection
of the best model, visualization of the model performance, and generation of predictions in the format specified
by Kaggle.

### Feature processing
- **One-hot encoding**: Categorical features are one-hot encoded, with columns created only for categories that appear in the training set above a certain frequency threshold.
- **Z-scaling**: All features are z-scaled to improve performance and convergence of the machine learning algorithms.
- **Imputing missing values**: Missing values are imputed with mean values, since not all model types can handle missing data natively. A sketch of these steps appears after this list.
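As a rough sketch of these three steps using scikit-learn building blocks (the column names and frequency threshold are illustrative assumptions, not taken from the repo):

```python
# Illustrative preprocessing pipeline; column names, the frequency
# threshold, and the exact wiring are assumptions, not the repo's code.
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

categorical = ["ProductCD", "card4"]  # hypothetical categorical columns
numeric = ["TransactionAmt"]          # hypothetical numeric column

preprocess = ColumnTransformer([
    # keep only categories seen at least min_frequency times in training;
    # categories unseen at training time are encoded as all zeros
    ("cat", OneHotEncoder(min_frequency=100, handle_unknown="ignore"), categorical),
    # impute missing values with the mean, then z-scale
    ("num", Pipeline([
        ("impute", SimpleImputer(strategy="mean")),
        ("scale", StandardScaler()),
    ]), numeric),
])
```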
### Models used

I experimented with two types of machine learning models (a minimal instantiation sketch follows this list):

- **Binomial Logistic Regression** using `LogisticRegression` from `scikit-learn`
- **Gradient-boosted decision trees** using `XGBClassifier` from `xgboost`
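A minimal sketch of how the two models could be instantiated; every setting except the XGBoost parameters reported in the next section is an assumption:

```python
# Minimal sketch: instantiate the two model types compared in this project.
# All settings other than the XGBoost parameters reported below are assumptions.
from sklearn.linear_model import LogisticRegression
from xgboost import XGBClassifier

models = {
    "logistic_regression": LogisticRegression(max_iter=1000),
    "xgboost": XGBClassifier(n_estimators=32, max_depth=8),
}
```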
### Evaluating model performance

This program evaluates and compares the area under the ROC curve for models whose names and parameters
are manually specified in a list passed into the `main_model_evaluation` function, and it selects the model
with the best score to generate the final predictions for Kaggle.
The best model so far was an `XGBClassifier` with parameters `{'n_estimators': 32, 'max_depth': 8}`,
whose area under the ROC curve on this program's validation set was **0.9353**.
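A hedged sketch of this selection loop (the repo's `main_model_evaluation` presumably differs in detail; `X_train`, `y_train`, `X_val`, and `y_val` are assumed to exist):

```python
# Sketch of AUC-based model selection; not the repo's actual function.
from sklearn.metrics import roc_auc_score

def pick_best_model(models, X_train, y_train, X_val, y_val):
    scores = {}
    for name, model in models.items():
        model.fit(X_train, y_train)
        proba = model.predict_proba(X_val)[:, 1]  # probability of fraud class
        scores[name] = roc_auc_score(y_val, proba)
    best_name = max(scores, key=scores.get)
    return models[best_name], scores
```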
## Visualizations

### Feature Importance Plot
![Feature Importance](feature_importance.png)
This plot highlights which features had the biggest impact on the model's
predictions. These most important features could be selected to train a neural network in a later stage.
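One plausible way to produce such a plot is `xgboost`'s built-in helper (an assumption, not necessarily how the repo draws it; `best_model` is assumed to be the fitted classifier):

```python
# Hypothetical plotting step using xgboost's importance helper.
import matplotlib.pyplot as plt
from xgboost import plot_importance

plot_importance(best_model, max_num_features=20)  # top 20 features
plt.tight_layout()
plt.savefig("feature_importance.png")
```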
### ROC Curve

![ROC Curve](roc_curve.png)
Receiver Operating Characteristic (ROC) curve for the best model with an Area Under the Curve (AUC) of 0.9353,
indicating strong model performance in distinguishing between classes.
### Confusion Matrix

![Confusion Matrix](confusion_matrix.png)
This plot displays how well the model identifies fraudulent transactions by comparing the true vs. predicted classes.
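Both the ROC curve and the confusion matrix could be generated with scikit-learn's display helpers, as in this hedged sketch (`best_model`, `X_val`, and `y_val` assumed to exist):

```python
# Hypothetical plotting steps; the repo may draw these differently.
import matplotlib.pyplot as plt
from sklearn.metrics import ConfusionMatrixDisplay, RocCurveDisplay

RocCurveDisplay.from_estimator(best_model, X_val, y_val)
plt.savefig("roc_curve.png")

ConfusionMatrixDisplay.from_estimator(best_model, X_val, y_val)
plt.savefig("confusion_matrix.png")
```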
## Next steps

This program could still be improved by:

* More sophisticated feature engineering and selection, and smarter handling of missing data
* Trying additional types of machine learning models, including neural networks
* Exploring the effects of varying model parameters more systematically with a grid search (see the sketch after this list)
* Implementing cross-validation to better assess model generalization and prevent overfitting to the validation set
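To illustrate the grid-search idea, a sketch using scikit-learn's `GridSearchCV` (the parameter grid is an assumption built around the best settings found so far, and `X_train`/`y_train` are assumed to exist):

```python
# Illustration of the proposed grid search with cross-validation;
# the grid values are assumptions, not settings from the repo.
from sklearn.model_selection import GridSearchCV
from xgboost import XGBClassifier

grid = GridSearchCV(
    XGBClassifier(),
    param_grid={"n_estimators": [16, 32, 64], "max_depth": [4, 8, 12]},
    scoring="roc_auc",  # match the AUC metric used above
    cv=5,               # 5-fold cross-validation
)
grid.fit(X_train, y_train)
print(grid.best_params_, grid.best_score_)
```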