https://github.com/assem-elqersh/e-commerce-customers-segmentation
This Repository is a Part of MSC EELU Data Science & Machine Learning Bootcamp Final Project
clustering-algorithms customer-segmentation data-science e-commerce machine-learning targeted-marketing
- Host: GitHub
- URL: https://github.com/assem-elqersh/e-commerce-customers-segmentation
- Owner: Assem-ElQersh
- License: gpl-3.0
- Created: 2024-09-17T22:58:30.000Z (9 months ago)
- Default Branch: main
- Last Pushed: 2024-09-17T23:16:02.000Z (9 months ago)
- Last Synced: 2025-01-19T17:58:28.519Z (5 months ago)
- Topics: clustering-algorithms, customer-segmentation, data-science, e-commerce, machine-learning, targeted-marketing
- Language: Jupyter Notebook
- Homepage:
- Size: 517 KB
- Stars: 1
- Watchers: 1
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
README
# *E-Commerce Customers Segmentation*
(This repo is still under development for further improvement)
### **Project Overview**
This project applies **clustering techniques** to segment customers of an E-commerce platform based on their transaction history and demographic data. The goal is to group customers with similar behaviors for marketing and personalization purposes. We compare three clustering algorithms: **K-Means**, **DBSCAN**, and **Hierarchical Clustering**, and evaluate their performance using silhouette scores.

### Dataset Information
The dataset contains customer transaction records along with demographic information, which includes:
- **customers**: Information on customer IDs, gender, and city.
- **transactions**: Records of purchases, including transaction status and coupon usage.
- **genders**: Mapping between gender ID and gender name.
- **cities**: Mapping between city ID and city name.

Key Features:
- `gender_name`: Gender of the customer.
- `city_name`: City where the customer resides.
- `coupon_usage_frequency`: Number of times the customer used coupons.
- `total_transactions`: Total number of transactions made by the customer.

### Project Structure
The project consists of the following key steps:

### 1. **Data Loading and Preprocessing**
- Load data from an Excel file containing multiple sheets.
- Drop irrelevant columns such as `burn_date` from the transaction data.
- Handle missing data using **SimpleImputer**:
  - Categorical columns: Imputed with the most frequent value.
  - Numerical columns: Imputed with the mean value.
- Apply One-Hot Encoding to categorical features such as `gender_name` and `city_name`.

### 2. **Data Merging and Aggregation**
- Merged customer demographic information with transaction data.
- Aggregated transactions by calculating `coupon_usage_frequency` and `total_transactions` for each customer.
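Steps 1 and 2 can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the notebook's actual code: the small DataFrames stand in for the Excel sheets, and the `customer_id`, `transaction_status`, and `coupon_used` columns are hypothetical names chosen for the example. With the real workbook, `pd.read_excel(path, sheet_name=None)` would return one DataFrame per sheet.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Synthetic stand-ins for the "customers" and "transactions" sheets.
# Column names other than gender_name, city_name and burn_date are assumptions.
customers = pd.DataFrame({
    "customer_id": [1, 2, 3, 4],
    "gender_name": ["Male", "Female", np.nan, "Female"],
    "city_name": ["Cairo", "Giza", "Cairo", np.nan],
})
transactions = pd.DataFrame({
    "customer_id": [1, 1, 2, 3, 3, 3],
    "transaction_status": ["burned", "subscribed", "burned",
                           "burned", "subscribed", "burned"],
    "coupon_used": [True, False, True, True, True, False],  # hypothetical flag
    "burn_date": [None] * 6,
})

# Drop the column the README lists as irrelevant.
transactions = transactions.drop(columns=["burn_date"])

# Categorical columns: fill gaps with the most frequent value.
cat_cols = ["gender_name", "city_name"]
cat_imputer = SimpleImputer(strategy="most_frequent")
customers[cat_cols] = cat_imputer.fit_transform(customers[cat_cols].astype(object))

# Aggregate transactions into one row per customer.
agg = (
    transactions.groupby("customer_id")
    .agg(
        total_transactions=("transaction_status", "count"),
        coupon_usage_frequency=("coupon_used", "sum"),
    )
    .reset_index()
)

# Merge demographics with the aggregates; customers without transactions get NaN.
merged = customers.merge(agg, on="customer_id", how="left")

# Numerical columns: fill gaps with the column mean, as the README describes.
# Filling with 0 would be another reasonable choice for customers with no purchases.
num_cols = ["total_transactions", "coupon_usage_frequency"]
num_imputer = SimpleImputer(strategy="mean")
merged[num_cols] = num_imputer.fit_transform(merged[num_cols])

# One-hot encode the categorical features to get the clustering input.
features = pd.get_dummies(merged.drop(columns=["customer_id"]), columns=cat_cols)
print(features)
```

The resulting `features` frame has one row per customer, with the two numeric aggregates followed by one indicator column per gender and city.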
### 3. **Feature Selection and Preprocessing for Clustering**
- Selected features: `gender_name`, `city_name`, `coupon_usage_frequency`, and `total_transactions`.
- Scaled the features for **Hierarchical Clustering** only; **K-Means** and **DBSCAN** were run on unscaled features because scaling negatively impacted their results.

### 4. **Clustering Techniques**
We applied three clustering algorithms to the preprocessed data:

#### a. **K-Means Clustering**
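As a rough sketch of how `k` might be tuned for this step, the loop below records both the inertia (for an elbow plot) and the silhouette score for each candidate `k`. The synthetic blobs, the candidate range, and `random_state` are assumptions made for illustration; the notebook's real feature matrix would take the place of `X`.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic data with four well-separated groups (illustrative only).
X, _ = make_blobs(
    n_samples=300,
    centers=[[0, 0], [10, 0], [0, 10], [10, 10]],
    cluster_std=0.5,
    random_state=42,
)

inertias = {}  # within-cluster sum of squares, used for the elbow plot
scores = {}    # mean silhouette score per k

for k in range(2, 9):
    km = KMeans(n_clusters=k, n_init=10, random_state=42).fit(X)
    inertias[k] = km.inertia_
    scores[k] = silhouette_score(X, km.labels_)

# Pick the k with the highest silhouette score.
best_k = max(scores, key=scores.get)
print(f"best k by silhouette: {best_k} (score={scores[best_k]:.2f})")
```

Plotting `inertias` against `k` gives the elbow curve, while `scores` gives a direct numeric criterion; using both, as described below, guards against relying on a visual judgment alone.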
- Tuned the number of clusters (`k`) using the **Elbow Method** and **Silhouette Scores**.
- The optimal number of clusters was **4**, which gave the highest silhouette score among the tested values of `k` (**0.62**).

#### b. **DBSCAN Clustering**
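A grid search over `eps` and `min_samples` for this step could look like the sketch below. The data, the parameter grid, and the choice to score only non-noise points are assumptions for illustration. Because silhouette scores computed after discarding noise can look very high when most points are labelled as noise, the sketch also reports the noise fraction alongside each score.

```python
from itertools import product

from sklearn.cluster import DBSCAN
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic data with three compact groups (illustrative only).
X, _ = make_blobs(
    n_samples=300,
    centers=[[0, 0], [6, 6], [0, 6]],
    cluster_std=0.5,
    random_state=0,
)

results = []
for eps, min_samples in product([0.3, 0.5, 0.8, 1.0], [3, 5, 10]):
    labels = DBSCAN(eps=eps, min_samples=min_samples).fit_predict(X)
    mask = labels != -1                      # DBSCAN marks noise points as -1
    n_clusters = len(set(labels[mask]))
    if n_clusters < 2:
        continue                             # silhouette needs at least 2 clusters
    results.append({
        "eps": eps,
        "min_samples": min_samples,
        "n_clusters": n_clusters,
        "noise_fraction": float(1 - mask.mean()),
        "silhouette": float(silhouette_score(X[mask], labels[mask])),
    })

best = max(results, key=lambda r: r["silhouette"])
print(best)
```

Reading `silhouette` together with `noise_fraction` and `n_clusters` helps tell a genuinely good configuration apart from one that scores well only because it kept a small, dense subset of the data.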
- Applied grid search to optimize `eps` and `min_samples`.
- Best parameters: `eps=0.5`, `min_samples=5`.
- DBSCAN struggled with varying densities in the dataset, especially in high-dimensional spaces, making it less suitable for this particular problem.

#### c. **Hierarchical Clustering**
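The comparison of linkage methods for this step can be sketched as below, using scikit-learn's `AgglomerativeClustering` on scaled features. The README does not say which scaler was used, so `StandardScaler` is an assumption here, as are the synthetic data and the range of cluster counts.

```python
from sklearn.cluster import AgglomerativeClustering
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler

# Synthetic 4-feature data (illustrative only).
X, _ = make_blobs(n_samples=200, centers=5, n_features=4, random_state=1)

# Scaling is applied for hierarchical clustering only; StandardScaler is assumed.
X_scaled = StandardScaler().fit_transform(X)

scores = {}
for linkage in ("ward", "complete", "average"):
    for n_clusters in range(2, 8):
        model = AgglomerativeClustering(n_clusters=n_clusters, linkage=linkage)
        labels = model.fit_predict(X_scaled)
        scores[(linkage, n_clusters)] = silhouette_score(X_scaled, labels)

# Select the (linkage, n_clusters) pair with the highest silhouette score.
best_linkage, best_n = max(scores, key=scores.get)
print(best_linkage, best_n, round(scores[(best_linkage, best_n)], 2))
```

Each fit builds the full merge tree again, which is one reason this approach costs more than K-Means as the number of customers grows.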
- Tested different linkage methods (Ward, Complete, Average).
- Best performance was achieved with **5 clusters** and **Average Linkage**.
- Hierarchical clustering produced clear groups but was more computationally expensive.

### 5. **Model Comparison**
| Model | Best Params | Silhouette Score |
|--------------|-------------------------------------|------------------------------|
| K-Means | `n_clusters=4` | 0.62 |
| DBSCAN | `eps=0.5`, `min_samples=5` | 0.99 (unstable due to noise) |
| Hierarchical | `n_clusters=5`, `linkage='average'` | 0.80 |

**Key Findings**:
- **K-Means** was judged the best overall performer, identifying 4 distinct customer segments with a silhouette score of **0.62**.
- **DBSCAN** had difficulties with noise and clusters of varying densities, making it less effective.
- **Hierarchical Clustering** was computationally expensive but produced reasonable clusters with an average silhouette score of **0.58**.

### 6. **Dimensionality Reduction for Visualization**
- Applied **PCA (Principal Component Analysis)** to reduce the dataset to 2 dimensions for visualizing the clusters.
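A minimal sketch of this projection is shown below. The synthetic six-feature data and the K-Means labels are assumptions used only to have something to project; in the project, the encoded feature matrix and the labels from each algorithm would be used instead.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.decomposition import PCA

# Synthetic 6-feature data and cluster labels (illustrative only).
X, _ = make_blobs(n_samples=200, centers=4, n_features=6, random_state=0)
labels = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(X)

# Project the features onto the first two principal components.
pca = PCA(n_components=2)
X_2d = pca.fit_transform(X)

# Share of total variance retained by the 2D projection.
explained = float(pca.explained_variance_ratio_.sum())
print(X_2d.shape, f"variance retained: {explained:.2f}")
```

The two columns of `X_2d` can then be passed to a scatter plot, colored by `labels`. The retained variance is worth checking, since clusters that overlap in the 2D view may still be well separated in the original feature space.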
### 7. **Visualization of Clusters**
Visualizations were generated to display the clustering results using **PCA** for dimensionality reduction:
- **K-Means Clustering**: 4 distinct clusters with good separation.
- **DBSCAN**: Struggled with noisy data and did not form clear clusters.
- **Hierarchical Clustering**: Produced 5 clusters with moderate separation.

### How to Run the Project
1. **Clone the Repository**:
```bash
git clone https://github.com/Assem-ElQersh/E-Commerce-Customers-Segmentation.git
cd E-Commerce-Customers-Segmentation
```
2. **Install the Required Dependencies**:
```bash
pip install -r requirements.txt
```
3. **Run the Jupyter Notebook**:
Execute the notebook (`src/Model.ipynb`) to preprocess the data, apply clustering algorithms, and generate visualizations.

#### Repository Structure
```
├── data/
│ ├── E-commerce_data.xlsx
├── src/
│ ├── Model.ipynb
├── README.md
└── requirements.txt
```

### Results and Conclusion
- **K-Means** was the most effective clustering method for this dataset, producing 4 customer segments with a silhouette score of 0.62.
- **DBSCAN** struggled with noise and varying densities, resulting in less reliable clusters.
- **Hierarchical Clustering** could identify clear segments but was less scalable and more computationally expensive than K-Means.

This project demonstrates a comprehensive approach to customer segmentation using various clustering techniques. The analysis offers insights into customer behavior that the E-commerce platform can use for targeted marketing strategies and personalized recommendations.
##
- Please note that this project is part of the **MLSC Data Science Graduation Project**.