https://github.com/sevilaymuni/e-commerce-fraud-detection
Detecting fraudulent transactions in e-commerce data provided by Vesta Corporation.
https://github.com/sevilaymuni/e-commerce-fraud-detection
agglomerative-clustering banking-applications catboost-classifier feature-engineering fraud-detection imbalanced-dataset lgbmclassifier pca-analysis
Last synced: 7 months ago
JSON representation
Detecting fraudulent transactions in e-commerce data provided by Vesta Corporation.
- Host: GitHub
- URL: https://github.com/sevilaymuni/e-commerce-fraud-detection
- Owner: SevilayMuni
- License: mit
- Created: 2025-03-03T19:18:34.000Z (8 months ago)
- Default Branch: main
- Last Pushed: 2025-03-03T19:47:00.000Z (8 months ago)
- Last Synced: 2025-03-03T20:29:50.174Z (8 months ago)
- Topics: agglomerative-clustering, banking-applications, catboost-classifier, feature-engineering, fraud-detection, imbalanced-dataset, lgbmclassifier, pca-analysis
- Language: Jupyter Notebook
- Homepage:
- Size: 627 KB
- Stars: 0
- Watchers: 1
- Forks: 0
- Open Issues: 0
-
Metadata Files:
- Readme: README.md
- License: LICENSE
Awesome Lists containing this project
README
[
](https://github.com/SevilayMuni/e-commerce-fraud-detection/blob/main/images/cover-image.png)
# Fraud Detection from Customer TransactionsThis project focuses on detecting fraudulent transactions in e-commerce data provided by Vesta Corporation. The goal is to build a robust machine learning model to classify transactions as fraudulent or non-fraudulent accurately.
![]()
![]()
![]()
![]()
![]()
## Key Challenges
- **Highly Imbalanced Dataset:** The dataset is severely imbalanced, with only a tiny fraction of fraudulent transactions (3% of dataset). This requires special handling to ensure the model does not become biased towards the majority class.
- **High Dimensionality:** The dataset contains **393 features**, many of which are masked or have missing values, adding complexity to the feature engineering process.
- **Temporal Nature:** The data is *time-sensitive*, requiring time-based splits for training and evaluation to avoid data leakage.## Project Structure
**Data Exploration and Preprocessing**
1. Handling missing values by **imputing** numerical columns with their mean and categorical columns with 'Unknown'.
2. **Feature engineering** to create *time-based* features, *aggregated* features, and *interaction* features.
3. *Standardization* and *dimensionality reduction* using **PCA**.
4. **Hierarchical clustering** to group similar features and reduce redundancy.
**Feature Selection**
1. **Baseline Model:** A **LightGBM** model is trained as a baseline, with *feature importance* analysis to identify key predictors.
2. **Feature Selection:** **Correlation matrix** analysis and **Lasso regression** are used to select the most relevant features.
3. **Undersampling:** The majority class is undersampled to balance the dataset, preserving the temporal order of transactions.
**Modeling and Evaluation**
1. **CatBoost with Optuna:** Hyperparameter tuning is performed using **Optuna** to optimize the **CatBoost** model.
2. The model is evaluated using **time-based cross-validation** to ensure robustness.
3. The final model is evaluated using **ROC-AUC** and **Precision-Recall AUC** scores.
4. A **confusion matrix** is generated to visualize the model's performance in classifying fraudulent and non-fraudulent transactions.
5. **Feature importance** is analyzed to understand the key drivers of the model's predictions.## Results
| | Precision | Recall | F1-Score | Support |
| -------- | -------- | ------- | ------- | ------- |
| 0 | 0.95 | 1.00 | 0.97 | 37254 |
| 1 | 0.93 | 0.51 | 0.66 | 4072 |
| Accuracy | | | 0.95 | 41326 |
| Macro Avg. | 0.94 |0.75 | 0.82 | 41326 |
| Weighted Avg. | 0.95 | 0.95 | 0.94 | 41326 |- The final CatBoost model achieved an **ROC-AUC** score ofΒ **0.935**Β on the test set, indicating strong performance distinguishing between fraudulent and non-fraudulent transactions.
## Key Takeaways
- **Handling Imbalanced Data:** Undersampling the majority class helped improve the model's ability to detect fraudulent transactions without introducing significant bias.
- **Feature Engineering:** Time-based interaction features were crucial in improving model performance.
- **Hyperparameter Tuning:** Optuna was instrumental in finding the optimal hyperparameters for the CatBoost model, leading to improved performance.## π©βπ» Author
π Developed by Sevilay Munire Girgin
π§ [Contact Me](https://linktr.ee/sevilaymgirgin)
π [Portfolio](sevilaymuni.github.io/Girgin)