https://github.com/anonymo2239/big-data-churn-analyzer

Scalable customer churn prediction using PySpark. Includes EDA, feature engineering, modeling, and real-time inference on new data.
https://github.com/anonymo2239/big-data-churn-analyzer

big-data churn-analysis churn-prediction classification-algorithm data-analysis data-science data-visualization modeling pyspark

Last synced: about 2 months ago
JSON representation

Scalable customer churn prediction using PySpark. Includes EDA, feature engineering, modeling, and real-time inference on new data.

Host: GitHub
URL: https://github.com/anonymo2239/big-data-churn-analyzer
Owner: anonymo2239
License: mit
Created: 2025-04-30T17:37:08.000Z (about 1 year ago)
Default Branch: main
Last Pushed: 2025-05-25T14:41:04.000Z (about 1 year ago)
Last Synced: 2025-12-29T23:32:33.962Z (7 months ago)
Topics: big-data, churn-analysis, churn-prediction, classification-algorithm, data-analysis, data-science, data-visualization, modeling, pyspark
Language: Jupyter Notebook
Homepage:
Size: 1.47 MB
Stars: 0
Watchers: 1
Forks: 0
Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE

Awesome Lists containing this project

README

          # Customer Churn Prediction with PySpark

This project is a **customer churn prediction system** built using **Apache Spark**. The goal is to analyze customer behavior and predict the likelihood of churn using gradient boosting models (GBM) on tabular customer data.

## Project Overview

Using **PySpark**, this project processes a dataset with customer attributes such as age, spending, and account tenure. The model classifies whether a customer will churn (leave) or stay.

## Key Features

* **PySpark-based scalable processing**

* **Exploratory Data Analysis (EDA)** on customer data

* **Feature engineering** (e.g., polynomial features)

* **Gradient Boosted Tree Classifier** via `GBTClassifier`

* **Hyperparameter tuning** with `CrossValidator` and `ParamGridBuilder`

* **Churn prediction** for new (unseen) customer data

## Dataset

![Dataset](readme_images/dataset.png)

Each row in the dataset represents a customer and includes:

| Column            | Description                            |

| ----------------- | -------------------------------------- |

| `names`           | Customer name                          |

| `age`             | Age of the customer                    |

| `total_purchase`  | Total money spent                      |

| `account_manager` | Whether an account manager is assigned |

| `years`           | Number of years as a customer          |

| `num_sites`       | Number of sites they use               |

| `churn`           | 1 if churned, 0 otherwise              |

## Exploratory Data Analysis (EDA)

* Descriptive statistics for numeric columns

* Class balance inspection

* Distribution of variables by churn

* Visualization with `seaborn.pairplot`

![EDA Pairplot](readme_images/eda_pairplot.png)

## Feature Engineering

* Removed missing values

* Added polynomial features (`age_square`)

* Used `VectorAssembler` for machine learning input

## Modeling

* **Train-test split:** 70/30

* **Classifier:** Gradient Boosted Tree (GBT)

* **Tuning:** 10-fold cross-validation over:

  * `maxDepth`: \[2, 4, 6]

  * `maxBins`: \[20, 30]

  * `maxIter`: \[10, 20]

```python

GBTClassifier(maxIter=10, featuresCol="features", labelCol="label")

```

## Results

* Accuracy on test set:  

![test accuracy](readme_images/test_accuracy.png)

## Inference on New Customers

Sample predictions for new customers with unseen data:

| Name             | Prediction | Probability                |

| ---------------- | ---------- | -------------------------- |

| Ali Ahmetoğlu    | 1.0        | \[0.20, 0.80] (will churn) |

| Harika Gündüz    | 0.0        | \[0.85, 0.15] (will stay)  |

| Ata Bakmayan Ali | 0.0        | \[0.99, 0.01] (will stay)  |

## Technologies Used

| Technology  | Purpose                                     |

| ----------- | ------------------------------------------- |

| **PySpark** | Distributed data processing and ML pipeline |

| **Seaborn** | Visualization                               |

| **Pandas**  | Data manipulation                           |

| **Python**  | Core programming language                   |

## Setup Instructions

1. Install Spark and set environment variables (e.g., `SPARK_HOME`)

2. Install dependencies:

```bash

pip install pyspark findspark pandas seaborn

```

3. Place `churn.csv` in your project directory.

4. Run the notebook or script.

## Sample Commands

```python

import findspark

findspark.init("C:\\spark")  # Your Spark installation path

```

## Developers

* **Alperen Arda** – [alperen.arda.adem22@gmail.com](mailto:alperen.arda.adem22@gmail.com)

ecosyste.ms

Data

Tools

Indexes

Applications

Experiments

Awesome

https://github.com/anonymo2239/big-data-churn-analyzer

Awesome Lists containing this project

README