https://github.com/sevilaymuni/e-commerce-fraud-detection

Detecting fraudulent transactions in e-commerce data provided by Vesta Corporation.
agglomerative-clustering banking-applications catboost-classifier feature-engineering fraud-detection imbalanced-dataset lgbmclassifier pca-analysis

Host: GitHub
URL: https://github.com/sevilaymuni/e-commerce-fraud-detection
Owner: SevilayMuni
License: mit
Created: 2025-03-03T19:18:34.000Z (8 months ago)
Default Branch: main
Last Pushed: 2025-03-03T19:47:00.000Z (8 months ago)
Last Synced: 2025-03-03T20:29:50.174Z (8 months ago)
Topics: agglomerative-clustering, banking-applications, catboost-classifier, feature-engineering, fraud-detection, imbalanced-dataset, lgbmclassifier, pca-analysis
Language: Jupyter Notebook
Homepage:
Size: 627 KB
Stars: 0
Watchers: 1
Forks: 0
Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE

README

![Cover image](https://github.com/SevilayMuni/e-commerce-fraud-detection/blob/main/images/cover-image.png)
# Fraud Detection from Customer Transactions

This project focuses on detecting fraudulent transactions in e-commerce data provided by Vesta Corporation. The goal is to build a robust machine learning model that accurately classifies transactions as fraudulent or non-fraudulent.

## Key Challenges
- **Highly Imbalanced Dataset:** The dataset is severely imbalanced: fraudulent transactions make up only 3% of it. This requires special handling so that the model does not become biased towards the majority class.
- **High Dimensionality:** The dataset contains **393 features**, many of which are masked or have missing values, adding complexity to the feature engineering process.
- **Temporal Nature:** The data is *time-sensitive*, requiring time-based splits for training and evaluation to avoid data leakage.
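
To illustrate the time-based split mentioned above, here is a minimal sketch using pandas. The column names (`timestamp`, `amount`, `is_fraud`), the toy data, and the 80/20 fraction are assumptions for illustration only; the notebook's actual columns and cutoff may differ.

```python
import pandas as pd


def time_based_split(df, time_col, train_fraction=0.8):
    """Split rows chronologically: the earliest rows train, the latest rows test."""
    ordered = df.sort_values(time_col, kind="stable").reset_index(drop=True)
    cutoff = int(len(ordered) * train_fraction)
    return ordered.iloc[:cutoff], ordered.iloc[cutoff:]


# Toy data; the column names are hypothetical.
transactions = pd.DataFrame({
    "timestamp": [50, 10, 40, 20, 30, 60, 80, 70, 100, 90],
    "amount": [12.0, 5.5, 99.9, 20.0, 7.25, 300.0, 15.0, 42.0, 8.0, 61.0],
    "is_fraud": [0, 0, 1, 0, 0, 0, 0, 1, 0, 0],
})

train, test = time_based_split(transactions, "timestamp", train_fraction=0.8)

# Every training row precedes every test row, so no future information leaks into training.
print(train["timestamp"].max(), test["timestamp"].min())
```

A random shuffle split would mix later transactions into the training set, which is the leakage the time-based split is meant to avoid.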

## Project Structure
**Data Exploration and Preprocessing**
1. Handling missing values by **imputing** numerical columns with their mean and categorical columns with 'Unknown'.
2. **Feature engineering** to create *time-based* features, *aggregated* features, and *interaction* features.
3. *Standardization* and *dimensionality reduction* using **PCA**.
4. **Hierarchical clustering** to group similar features and reduce redundancy.
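
Steps 1 and 3 above can be sketched roughly as follows with pandas and scikit-learn. The column names and the toy values are hypothetical, and a single principal component is chosen only to keep the example small; it is not the setting used in the notebook.

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler


def impute(df):
    """Fill numeric gaps with the column mean and categorical gaps with 'Unknown'."""
    out = df.copy()
    numeric_cols = out.select_dtypes(include="number").columns
    categorical_cols = out.columns.difference(numeric_cols)
    out[numeric_cols] = out[numeric_cols].fillna(out[numeric_cols].mean())
    out[categorical_cols] = out[categorical_cols].fillna("Unknown")
    return out


# Toy data; the column names are hypothetical.
raw = pd.DataFrame({
    "amount": [10.0, np.nan, 30.0, 20.0],
    "distance": [1.0, 2.0, np.nan, 3.0],
    "card_type": ["credit", None, "debit", "credit"],
})

clean = impute(raw)

# Standardize the numeric columns, then project them onto one principal component.
numeric = clean.select_dtypes(include="number")
scaled = StandardScaler().fit_transform(numeric)
components = PCA(n_components=1).fit_transform(scaled)
```

In a real pipeline the column means, the scaler, and the PCA projection would normally be fitted on the training split only and then applied to the test split, in keeping with the time-based evaluation described earlier.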

**Feature Selection**
1. **Baseline Model:** A **LightGBM** model is trained as a baseline, with *feature importance* analysis to identify key predictors.
2. **Feature Selection:** **Correlation matrix** analysis and **Lasso regression** are used to select the most relevant features.
3. **Undersampling:** The majority class is undersampled to balance the dataset, preserving the temporal order of transactions.
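
One possible way to undersample the majority class while keeping transactions in temporal order is sketched below with NumPy. The function name, the `ratio` parameter, and the toy labels are assumptions for illustration, and the sketch assumes the rows are already sorted by transaction time.

```python
import numpy as np


def undersample_majority(labels, ratio=1.0, seed=0):
    """Return sorted row indices: all minority rows plus a random subset of majority rows.

    `labels` is assumed to be ordered by transaction time, so sorting the
    kept indices preserves the original temporal order.
    """
    labels = np.asarray(labels)
    minority_idx = np.flatnonzero(labels == 1)
    majority_idx = np.flatnonzero(labels == 0)
    # Keep `ratio` majority rows per minority row, capped at what is available.
    n_keep = min(len(majority_idx), int(round(len(minority_idx) * ratio)))
    rng = np.random.default_rng(seed)
    kept_majority = rng.choice(majority_idx, size=n_keep, replace=False)
    return np.sort(np.concatenate([minority_idx, kept_majority]))


# Toy labels in chronological order: 1 = fraud, 0 = non-fraud.
labels = np.array([0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 1])
kept = undersample_majority(labels, ratio=1.0)
balanced_labels = labels[kept]
```

Undersampling of this kind is usually applied to the training split only, so that the evaluation data keeps its original class distribution.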

**Modeling and Evaluation**
1. **CatBoost with Optuna:** Hyperparameter tuning is performed using **Optuna** to optimize the **CatBoost** model.
2. The model is evaluated using **time-based cross-validation** to ensure robustness.
3. The final model is evaluated using **ROC-AUC** and **Precision-Recall AUC** scores.
4. A **confusion matrix** is generated to visualize the model's performance in classifying fraudulent and non-fraudulent transactions.
5. **Feature importance** is analyzed to understand the key drivers of the model's predictions.
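
The time-based cross-validation and the two ranking metrics can be sketched with scikit-learn as below. This is only an illustration under stated assumptions: `LogisticRegression` is a lightweight stand-in for the tuned CatBoost model, the data is synthetic, and `average_precision_score` is used as a common approximation of the area under the precision-recall curve.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import average_precision_score, roc_auc_score
from sklearn.model_selection import TimeSeriesSplit

# Synthetic, chronologically ordered data: 10% positives spread evenly over time.
rng = np.random.default_rng(0)
n_samples = 600
y = (np.arange(n_samples) % 10 == 0).astype(int)
X = rng.normal(size=(n_samples, 3))
X[:, 0] += 2.0 * y  # make the first feature informative

scores = []
for train_idx, test_idx in TimeSeriesSplit(n_splits=5).split(X):
    # Each fold trains on earlier rows and evaluates on the rows that follow.
    model = LogisticRegression(max_iter=1000)  # stand-in for the CatBoost model
    model.fit(X[train_idx], y[train_idx])
    proba = model.predict_proba(X[test_idx])[:, 1]
    scores.append({
        "train_end": int(train_idx.max()),
        "test_start": int(test_idx.min()),
        "roc_auc": roc_auc_score(y[test_idx], proba),
        "pr_auc": average_precision_score(y[test_idx], proba),
    })

mean_roc_auc = float(np.mean([s["roc_auc"] for s in scores]))
mean_pr_auc = float(np.mean([s["pr_auc"] for s in scores]))
```

On imbalanced data the precision-recall score is often the more informative of the two, because it is not inflated by the large number of true negatives.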

## Results
| Class | Precision | Recall | F1-Score | Support |
| ------------- | --------- | ------ | -------- | ------- |
| 0 | 0.95 | 1.00 | 0.97 | 37254 |
| 1 | 0.93 | 0.51 | 0.66 | 4072 |
| Accuracy | | | 0.95 | 41326 |
| Macro Avg. | 0.94 | 0.75 | 0.82 | 41326 |
| Weighted Avg. | 0.95 | 0.95 | 0.94 | 41326 |

- The final CatBoost model achieved an **ROC-AUC** score of **0.935** on the test set, indicating strong performance in distinguishing between fraudulent and non-fraudulent transactions.
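
As a quick consistency check on the table, the F1 scores can be recomputed from the reported precision and recall, since F1 is their harmonic mean. The sketch below assumes class 1 is the fraud class (the minority class by support) and uses the rounded values from the table, so the results are approximate.

```python
def f1_from(precision, recall):
    """F1 is the harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


# Rounded precision and recall values taken from the results table.
non_fraud_f1 = f1_from(0.95, 1.00)  # class 0
fraud_f1 = f1_from(0.93, 0.51)      # class 1, assumed to be the fraud class
macro_f1 = (non_fraud_f1 + fraud_f1) / 2

print(round(non_fraud_f1, 2), round(fraud_f1, 2), round(macro_f1, 2))
```

The recomputed values agree with the table to two decimals, and they show how the fraud-class F1 of 0.66 is held down by the recall of 0.51 despite the precision of 0.93.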

## Key Takeaways
- **Handling Imbalanced Data:** Undersampling the majority class helped improve the model's ability to detect fraudulent transactions without introducing significant bias.
- **Feature Engineering:** Time-based interaction features were crucial in improving model performance.
- **Hyperparameter Tuning:** Optuna was instrumental in finding the optimal hyperparameters for the CatBoost model, leading to improved performance.

## 👩‍💻 Author
📌 Developed by Sevilay Munire Girgin
📧 [Contact Me](https://linktr.ee/sevilaymgirgin)
🌐 [Portfolio](https://sevilaymuni.github.io/Girgin)
