https://github.com/bniladridas/churnprediction

Customers who'll likely stop using subscriptions.
Host: GitHub
URL: https://github.com/bniladridas/churnprediction
Owner: bniladridas
Created: 2024-07-31T19:39:46.000Z (11 months ago)
Default Branch: master
Last Pushed: 2024-09-09T07:27:17.000Z (10 months ago)
Last Synced: 2024-11-20T16:55:38.713Z (7 months ago)
Topics: dataset, keras, numpy, pandas, scikit-learn, seaborn
Language: Jupyter Notebook
Homepage:
Size: 24.1 MB
Stars: 0
Watchers: 1
Forks: 0
Open Issues: 0
Metadata Files:
- Readme: README.md

README

# **🌟 Churn Prediction Challenge for Video Streaming Service 🌟**

## **🔍 Introduction**

Welcome to the **Churn Prediction Challenge**! Dive into this exciting opportunity to showcase your machine learning prowess by predicting subscriber churn for a leading video streaming service. Your mission is to develop a sophisticated model to forecast whether subscribers will maintain their subscription for the upcoming month. Minimizing churn is crucial for subscription-based services, making this challenge both impactful and rewarding.

## **📈 Problem Overview**

Subscription services span diverse industries like fitness, video streaming, and retail. Reducing churn is pivotal for these companies to sustain growth and customer loyalty. In this challenge, you'll utilize a 2021 dataset to identify subscribers likely to cancel their subscription. The dataset provides insights into subscriptions just before cancellation.

### **Possible Reasons for Churn:**

- **Content Saturation**: The subscriber has exhausted their content needs.

- **Temporary Absence**: The subscriber is currently busy but intends to return.

- **Service Mismatch**: The subscriber finds the service inadequate and seeks alternatives.

## **📊 Datasets**

### **1. `train.csv`**

- **Description**: Comprises 70% of the dataset (243,787 subscriptions) with ground truth indicating if the subscription continued into the next month.

- **Target Column**: `Churn` (binary: 0 = No churn, 1 = Churn)

### **2. `test.csv`**

- **Description**: Contains the remaining 30% of the dataset (104,480 subscriptions) without ground truth. Your task is to predict the probability of churn for each of these subscriptions.

## **📁 Submission Format**

Your final submission should be a CSV file named `prediction_df.csv` with the following structure:

- **Columns**:

  - `CustomerID`: Unique identifier for each subscription.

  - `predicted_probability`: Probability of churn for each subscription.

Ensure the file has exactly 104,480 rows and 2 columns. Column names and order are essential for autograding.

## **🔧 Steps to Complete the Challenge**

### **1. Import Required Python Modules**

```python
# Install necessary packages
!pip install pandas numpy scipy scikit-learn keras matplotlib seaborn

# Import essential libraries
import pandas as pd
import numpy as np
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.dummy import DummyClassifier
from matplotlib import pyplot as plt
import seaborn as sns

%matplotlib inline
```

### **2. Load the Data**

```python
# Load the datasets
train_df = pd.read_csv("train.csv")
print('train_df Shape:', train_df.shape)
# print() is needed here: a notebook cell only displays its last expression
print(train_df.head())

test_df = pd.read_csv("test.csv")
print('test_df Shape:', test_df.shape)
test_df.head()
```
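
Before modelling, it can help to confirm that the two files agree on their feature columns. The snippet below is only an illustrative sketch: `check_schema` is a hypothetical helper, and the toy frames (including the `MonthlyCharges` column) are made-up stand-ins for `train.csv` and `test.csv`. It assumes the test file holds the same columns as the training file minus `Churn`.

```python
import pandas as pd


def check_schema(train: pd.DataFrame, test: pd.DataFrame, target: str = "Churn") -> None:
    """Raise AssertionError unless test has the same columns as train minus the target."""
    assert target in train.columns, f"{target} missing from training data"
    assert target not in test.columns, f"{target} should not appear in test data"
    expected = set(train.columns) - {target}
    assert set(test.columns) == expected, "feature columns differ between train and test"


# Toy stand-ins for the real files (hypothetical column names and values)
toy_train = pd.DataFrame({
    "CustomerID": ["A1", "B2"],
    "MonthlyCharges": [9.99, 14.99],
    "Churn": [0, 1],
})
toy_test = pd.DataFrame({"CustomerID": ["C3"], "MonthlyCharges": [11.99]})

check_schema(toy_train, toy_test)
print("Schemas are consistent")
```

With the real data, the call would be `check_schema(train_df, test_df)`.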

### **3. Data Exploration and Cleaning (Optional)**

```python
# Explore and clean the dataset
data = pd.read_csv('train.csv')

# Forward-fill missing values and remove duplicate rows
# (fillna(method='ffill') is deprecated in recent pandas releases)
data = data.ffill().drop_duplicates()
data.info()  # prints its summary directly and returns None
print(data.describe())

# Drop the identifier so it is not one-hot encoded into one column per customer
data = data.drop(columns=['CustomerID'])

# Encode categorical variables
categorical_cols = data.select_dtypes(include=['object']).columns
data = pd.get_dummies(data, columns=categorical_cols, drop_first=True)

# Visualize the data distribution (hist creates its own figure)
data.hist(bins=30, figsize=(20, 15))
plt.show()

# Visualize pairwise correlations
plt.figure(figsize=(10, 8))
sns.heatmap(data.corr(), annot=True, cmap='coolwarm')
plt.show()
```
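
A quick look at the target distribution is a useful complement to the plots above. The following is a minimal sketch on synthetic data rather than the real `train.csv`; the `SubscriptionType` column and the 80/20 class mix are assumptions made purely for illustration.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for train.csv; `SubscriptionType` is a hypothetical column
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "SubscriptionType": rng.choice(["Basic", "Standard", "Premium"], size=1000),
    "Churn": rng.choice([0, 1], size=1000, p=[0.8, 0.2]),
})

# Share of each class; a skewed target makes plain accuracy misleading
class_share = toy["Churn"].value_counts(normalize=True).sort_index()
print(class_share)

# The mean of a 0/1 target within a group is that group's churn rate
churn_by_type = toy.groupby("SubscriptionType")["Churn"].mean()
print(churn_by_type)
```

If the real target turns out to be imbalanced, a ranking metric such as ROC AUC (already imported in step 1) is usually a more informative yardstick than accuracy.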

### **4. Train and Evaluate Your Model**

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import StandardScaler

# Prepare features and target; one-hot encode non-numeric columns,
# because StandardScaler cannot handle string values
X_train = pd.get_dummies(train_df.drop(['CustomerID', 'Churn'], axis=1))
y_train = train_df['Churn']

# Standardize features
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)

# Initialize and fit the model
rf_clf = RandomForestClassifier(n_estimators=100, random_state=42)
rf_clf.fit(X_train_scaled, y_train)

# Prepare test features with the same columns, in the same order, as training
X_test = pd.get_dummies(test_df.drop(['CustomerID'], axis=1))
X_test = X_test.reindex(columns=X_train.columns, fill_value=0)
X_test_scaled = scaler.transform(X_test)

# Predict churn probabilities
predicted_probabilities = rf_clf.predict_proba(X_test_scaled)[:, 1]

# Create submission dataframe
prediction_df = pd.DataFrame({
    'CustomerID': test_df['CustomerID'].values,
    'predicted_probability': predicted_probabilities
})

# Validate submission format (including column order) and save
assert prediction_df.shape == (104480, 2)
assert list(prediction_df.columns) == ['CustomerID', 'predicted_probability']
prediction_df.to_csv('prediction_df.csv', index=False)
print(prediction_df.head(10))
```
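
The block above fits on all of the training data, so it gives no estimate of how well the model generalizes. One possible way to get such an estimate is a stratified hold-out split scored with ROC AUC and compared against a `DummyClassifier` baseline. The sketch below is not the repository's method: it runs on synthetic data, every feature name is hypothetical, and the preprocessing is wrapped in a scikit-learn `Pipeline` with a `ColumnTransformer` as one reasonable design choice.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.dummy import DummyClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Synthetic stand-in for train.csv; all feature names are hypothetical
rng = np.random.default_rng(42)
n = 2000
toy = pd.DataFrame({
    "MonthlyCharges": rng.uniform(5, 20, size=n),
    "ViewingHoursPerWeek": rng.uniform(0, 40, size=n),
    "SubscriptionType": rng.choice(["Basic", "Standard", "Premium"], size=n),
})
# Make churn more likely at low viewing hours so there is signal to learn
p_churn = 1 / (1 + np.exp(0.15 * (toy["ViewingHoursPerWeek"] - 15)))
toy["Churn"] = (rng.uniform(size=n) < p_churn).astype(int)

X = toy.drop(columns="Churn")
y = toy["Churn"]

# Hold out 20% of the labelled rows, preserving the class ratio
X_fit, X_val, y_fit, y_val = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

numeric_cols = X.select_dtypes(include="number").columns.tolist()
categorical_cols = X.select_dtypes(exclude="number").columns.tolist()

# Scale numeric columns and one-hot encode categorical ones in a single step
preprocess = ColumnTransformer([
    ("num", StandardScaler(), numeric_cols),
    ("cat", OneHotEncoder(handle_unknown="ignore"), categorical_cols),
])
model = Pipeline([
    ("preprocess", preprocess),
    ("clf", RandomForestClassifier(n_estimators=100, random_state=42)),
])
model.fit(X_fit, y_fit)
model_auc = roc_auc_score(y_val, model.predict_proba(X_val)[:, 1])

# Baseline that predicts the class prior for every row; it ignores
# feature values, so only the numeric columns are passed
baseline = DummyClassifier(strategy="prior").fit(X_fit[numeric_cols], y_fit)
baseline_auc = roc_auc_score(y_val, baseline.predict_proba(X_val[numeric_cols])[:, 1])

print(f"Baseline ROC AUC: {baseline_auc:.3f}")
print(f"Random forest ROC AUC: {model_auc:.3f}")
```

Because the pipeline learns its scaling and encoding from the fitting rows only, the validation score is not contaminated by statistics from the held-out rows. A constant-prediction baseline scores 0.5 ROC AUC, which gives a floor that any real model should beat.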

### **5. Final Tests**

```python
# Validate submission format
submission = pd.read_csv('prediction_df.csv')

assert isinstance(submission, pd.DataFrame), 'The file should be a dataframe named prediction_df.'
assert submission.columns[0] == 'CustomerID', 'The first column name should be CustomerID.'
assert submission.columns[1] == 'predicted_probability', 'The second column name should be predicted_probability.'
assert submission.shape[0] == 104480, 'The dataframe should have 104,480 rows.'
assert submission.shape[1] == 2, 'The dataframe should have 2 columns.'
```
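
The checks above cover names and shape only. A few content checks can catch other common mistakes, such as missing or out-of-range probabilities. The helper below is a hypothetical addition, not part of the challenge's grader, and is demonstrated on a three-row toy frame; whether the autograder enforces these conditions is not stated in the challenge description.

```python
import pandas as pd


def validate_submission(df: pd.DataFrame, expected_rows: int) -> None:
    """Raise AssertionError if the submission frame looks malformed."""
    assert list(df.columns) == ["CustomerID", "predicted_probability"], \
        "unexpected column names or order"
    assert len(df) == expected_rows, f"expected {expected_rows} rows, got {len(df)}"
    assert df["CustomerID"].is_unique, "duplicate CustomerID values"
    probs = df["predicted_probability"]
    assert probs.notna().all(), "missing probabilities"
    assert probs.between(0, 1).all(), "probabilities outside [0, 1]"


# Toy example with made-up identifiers
toy_submission = pd.DataFrame({
    "CustomerID": ["A1", "B2", "C3"],
    "predicted_probability": [0.1, 0.5, 0.9],
})
validate_submission(toy_submission, expected_rows=3)
print("Submission looks valid")
```

For the real file, the call would be `validate_submission(submission, expected_rows=104480)`.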

## **📚 Resources**

- [Pandas Documentation](https://pandas.pydata.org/docs/)

- [Scikit-learn Documentation](https://scikit-learn.org/stable/)

- [Keras Documentation](https://keras.io/)

- [Matplotlib Documentation](https://matplotlib.org/)

- [Seaborn Documentation](https://seaborn.pydata.org/)

---

**Good luck, and may your model achieve stellar performance! 🚀**
