https://github.com/bniladridas/churnprediction

Customers who'll likely stop using subscriptions.

dataset keras numpy pandas scikit-learn seaborn

# **🌟 Churn Prediction Challenge for Video Streaming Service 🌟**

## **🔍 Introduction**

Welcome to the **Churn Prediction Challenge**! Dive into this exciting opportunity to showcase your machine learning prowess by predicting subscriber churn for a leading video streaming service. Your mission is to develop a sophisticated model to forecast whether subscribers will maintain their subscription for the upcoming month. Minimizing churn is crucial for subscription-based services, making this challenge both impactful and rewarding.

## **📈 Problem Overview**

Subscription services span industries such as fitness, video streaming, and retail. Reducing churn is pivotal for these companies to sustain growth and customer loyalty. In this challenge, you'll use a 2021 dataset to identify subscribers likely to cancel their subscription. The dataset captures subscriptions as they stood just before a possible cancellation.

### **Possible Reasons for Churn:**
- **Content Saturation**: The subscriber has exhausted their content needs.
- **Temporary Absence**: The subscriber is currently busy but intends to return.
- **Service Mismatch**: The subscriber finds the service inadequate and seeks alternatives.

## **📊 Datasets**

### **1. `train.csv`**
- **Description**: Comprises 70% of the dataset (243,787 subscriptions) with ground truth indicating if the subscription continued into the next month.
- **Target Column**: `Churn` (binary: 0 = No churn, 1 = Churn)

### **2. `test.csv`**
- **Description**: Contains the remaining 30% of the dataset (104,480 subscriptions) without ground truth. Your task is to predict the continuation status of these subscriptions.
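
As a quick sanity check on the split described above, the two row counts can be added up and converted to shares. This is only an arithmetic sketch; the constant names are illustrative and not part of the challenge files.

```python
# Row counts quoted in the dataset descriptions above
TRAIN_ROWS = 243_787
TEST_ROWS = 104_480

total_rows = TRAIN_ROWS + TEST_ROWS
train_share = TRAIN_ROWS / total_rows
test_share = TEST_ROWS / total_rows

print(f"total subscriptions: {total_rows:,}")
print(f"train share: {train_share:.1%}, test share: {test_share:.1%}")
```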

## **📁 Submission Format**

Your final submission should be a CSV file named `prediction_df.csv` with the following structure:
- **Columns**:
  - `CustomerID`: Unique identifier for each subscription.
  - `predicted_probability`: Probability of churn for each subscription.

Ensure the file has exactly 104,480 rows and 2 columns. Column names and order are essential for autograding.
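
The format rules above can be checked before the file is saved. The following is a minimal sketch on a toy frame: `check_submission` and `EXPECTED_COLUMNS` are hypothetical names, not part of the challenge materials, and the check that probabilities lie in [0, 1] is an assumption about what a grader would accept rather than a documented rule.

```python
import pandas as pd

# Column names and order required by the submission format
EXPECTED_COLUMNS = ["CustomerID", "predicted_probability"]


def check_submission(df, expected_rows):
    """Return a list of format problems; an empty list means none were found."""
    if list(df.columns) != EXPECTED_COLUMNS:
        # Without the right columns the remaining checks cannot run
        return [f"columns must be {EXPECTED_COLUMNS} in that order, got {list(df.columns)}"]

    problems = []
    if len(df) != expected_rows:
        problems.append(f"expected {expected_rows} rows, got {len(df)}")
    if df["CustomerID"].duplicated().any():
        problems.append("CustomerID values must be unique")

    probs = df["predicted_probability"]
    if probs.isna().any():
        problems.append("predicted_probability contains missing values")
    elif not probs.between(0, 1).all():
        problems.append("predicted_probability must lie between 0 and 1")
    return problems


# Toy data for illustration only; the real file needs 104,480 rows
toy = pd.DataFrame({
    "CustomerID": ["A1", "B2", "C3"],
    "predicted_probability": [0.12, 0.80, 0.05],
})
print(check_submission(toy, expected_rows=3))

# Swapping the column order is reported as a problem
swapped = toy[["predicted_probability", "CustomerID"]]
print(check_submission(swapped, expected_rows=3))
```

For the real submission the call would be `check_submission(prediction_df, expected_rows=104480)`.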

## **🔧 Steps to Complete the Challenge**

### **1. Import Required Python Modules**

```python
# Install necessary packages (the leading "!" is Jupyter/IPython syntax;
# in a terminal, run the same command without it)
!pip install pandas numpy scipy scikit-learn keras matplotlib seaborn

# Import essential libraries
import pandas as pd
import numpy as np
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.dummy import DummyClassifier
from matplotlib import pyplot as plt
import seaborn as sns
# Notebook-only magic: render plots inline (remove when running as a script)
%matplotlib inline
```

### **2. Load the Data**

```python
# Load the datasets
train_df = pd.read_csv("train.csv")
print('train_df Shape:', train_df.shape)
# print() is needed here: only the last expression in a notebook cell is displayed
print(train_df.head())

test_df = pd.read_csv("test.csv")
print('test_df Shape:', test_df.shape)
print(test_df.head())
```

### **3. Data Exploration and Cleaning (Optional)**

```python
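# Explore and clean a separate copy of the training data
data = pd.read_csv('train.csv')

# Handle missing values and remove duplicates
# (DataFrame.ffill() replaces the deprecated fillna(method='ffill'))
data = data.ffill().drop_duplicates()
data.info()  # info() prints its report itself and returns None
print(data.describe())

# Drop the identifier so it is not one-hot encoded into one column per customer
data = data.drop(columns=['CustomerID'])

# Encode categorical variables
categorical_cols = data.select_dtypes(include=['object']).columns
data = pd.get_dummies(data, columns=categorical_cols, drop_first=True)

# Visualize the data distribution (hist() creates its own figure)
data.hist(bins=30, figsize=(20, 15))
plt.show()

# Correlation heatmap of the encoded features
plt.figure(figsize=(10, 8))
sns.heatmap(data.corr(), annot=True, cmap='coolwarm')
plt.show()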
```
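
One detail of the encoding step above: if `train.csv` and `test.csv` are one-hot encoded separately, a category that is absent from one file produces mismatched columns. The sketch below shows one way to align them, using toy frames with hypothetical column names (`plan`, `monthly_charge`) that are not taken from the real dataset. It encodes without `drop_first` so that the alignment stays correct even when the test data lacks a category.

```python
import pandas as pd

# Toy frames with hypothetical columns; the real files have different features
train_toy = pd.DataFrame({
    "plan": ["basic", "premium", "standard", "basic"],
    "monthly_charge": [9.99, 19.99, 14.99, 9.99],
})
test_toy = pd.DataFrame({
    "plan": ["premium", "basic"],  # "standard" never appears here
    "monthly_charge": [19.99, 9.99],
})

# Encode each frame on its own, as would happen with two separate CSV files
train_enc = pd.get_dummies(train_toy, columns=["plan"], dtype=int)
test_enc = pd.get_dummies(test_toy, columns=["plan"], dtype=int)

# Align the test columns to the training columns:
# categories missing from the test data become all-zero columns,
# and categories unseen during training are dropped
test_aligned = test_enc.reindex(columns=train_enc.columns, fill_value=0)

print(list(train_enc.columns))
print(list(test_enc.columns))
print(test_aligned)
```

After alignment, both frames expose the same columns in the same order, which is what a fitted scaler or model expects at prediction time.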

### **4. Train and Evaluate Your Model**

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import StandardScaler

# Prepare features and target
X_train = train_df.drop(['CustomerID', 'Churn'], axis=1)
y_train = train_df['Churn']
X_test = test_df.drop(['CustomerID'], axis=1)

# One-hot encode categorical columns (StandardScaler needs numeric input),
# then align the test columns to the training columns
X_train = pd.get_dummies(X_train)
X_test = pd.get_dummies(X_test).reindex(columns=X_train.columns, fill_value=0)

# Standardize features (fit on the training data only)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Initialize and fit the model
rf_clf = RandomForestClassifier(n_estimators=100, random_state=42)
rf_clf.fit(X_train_scaled, y_train)

# Predict churn probabilities (column 1 is the probability of class 1, churn)
predicted_probabilities = rf_clf.predict_proba(X_test_scaled)[:, 1]

# Create submission dataframe
prediction_df = pd.DataFrame({
    'CustomerID': test_df['CustomerID'].values,
    'predicted_probability': predicted_probabilities
})

# Validate submission format and save (column order matters for autograding)
assert prediction_df.shape == (104480, 2)
assert list(prediction_df.columns) == ['CustomerID', 'predicted_probability']

prediction_df.to_csv('prediction_df.csv', index=False)
print(prediction_df.head(10))
```
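
Because `test.csv` has no ground truth, model quality can only be estimated locally on a hold-out slice of the training data. Step 1 imports `train_test_split`, `DummyClassifier`, and `roc_auc_score` without using them; the sketch below shows one way they might fit together. It runs on synthetic data from `make_classification`, not on the challenge files, and it assumes ROC AUC is the metric of interest, which the README does not state explicitly.

```python
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the training data (about 20% positives, an arbitrary choice)
X, y = make_classification(
    n_samples=2000, n_features=10, n_informative=5,
    weights=[0.8, 0.2], random_state=42,
)

# Hold out 20% of the labelled rows, keeping the class ratio in both parts
X_fit, X_val, y_fit, y_val = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# Baseline: always predicts the class prior, so its scores carry no ranking information
baseline = DummyClassifier(strategy="prior").fit(X_fit, y_fit)
baseline_auc = roc_auc_score(y_val, baseline.predict_proba(X_val)[:, 1])

# Candidate model, scored on the same hold-out rows
model = RandomForestClassifier(n_estimators=100, random_state=42).fit(X_fit, y_fit)
model_auc = roc_auc_score(y_val, model.predict_proba(X_val)[:, 1])

print(f"baseline ROC AUC: {baseline_auc:.3f}")
print(f"random forest ROC AUC: {model_auc:.3f}")
```

With the real data, `X` and `y` would be the encoded training features and the `Churn` column. A model is only worth submitting if it clearly beats the baseline on the hold-out rows.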

### **5. Final Tests**

```python
# Validate submission format
submission = pd.read_csv('prediction_df.csv')
assert isinstance(submission, pd.DataFrame), 'The file should be a dataframe named prediction_df.'
assert submission.columns[0] == 'CustomerID', 'The first column name should be CustomerID.'
assert submission.columns[1] == 'predicted_probability', 'The second column name should be predicted_probability.'
assert submission.shape[0] == 104480, 'The dataframe should have 104,480 rows.'
assert submission.shape[1] == 2, 'The dataframe should have 2 columns.'
```

## **📚 Resources**

- [Pandas Documentation](https://pandas.pydata.org/docs/)
- [Scikit-learn Documentation](https://scikit-learn.org/stable/)
- [Keras Documentation](https://keras.io/)
- [Matplotlib Documentation](https://matplotlib.org/)
- [Seaborn Documentation](https://seaborn.pydata.org/)

---

**Good luck, and may your model achieve stellar performance! 🚀**