Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://github.com/bniladridas/churnprediction
Customers who'll likely stop using subscriptions.
dataset keras numpy pandas scikit-learn seaborn
- Host: GitHub
- URL: https://github.com/bniladridas/churnprediction
- Owner: bniladridas
- Created: 2024-07-31T19:39:46.000Z (6 months ago)
- Default Branch: master
- Last Pushed: 2024-09-09T07:27:17.000Z (5 months ago)
- Last Synced: 2024-11-20T16:55:38.713Z (2 months ago)
- Topics: dataset, keras, numpy, pandas, scikit-learn, seaborn
- Language: Jupyter Notebook
- Homepage:
- Size: 24.1 MB
- Stars: 0
- Watchers: 1
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
# **🌟 Churn Prediction Challenge for Video Streaming Service 🌟**
## **🔍 Introduction**
Welcome to the **Churn Prediction Challenge**! Dive into this exciting opportunity to showcase your machine learning prowess by predicting subscriber churn for a leading video streaming service. Your mission is to develop a sophisticated model to forecast whether subscribers will maintain their subscription for the upcoming month. Minimizing churn is crucial for subscription-based services, making this challenge both impactful and rewarding.
## **📈 Problem Overview**
Subscription services span diverse industries like fitness, video streaming, and retail. Reducing churn is pivotal for these companies to sustain growth and customer loyalty. In this challenge, you'll use a 2021 dataset to identify subscribers likely to cancel their subscription. The dataset describes subscriptions as they stood just before cancellation.
### **Possible Reasons for Churn:**
- **Content Saturation**: The subscriber has exhausted their content needs.
- **Temporary Absence**: The subscriber is currently busy but intends to return.
- **Service Mismatch**: The subscriber finds the service inadequate and seeks alternatives.

## **📊 Datasets**
### **1. `train.csv`**
- **Description**: Comprises 70% of the dataset (243,787 subscriptions) with ground truth indicating if the subscription continued into the next month.
- **Target Column**: `Churn` (binary: 0 = No churn, 1 = Churn)

### **2. `test.csv`**
- **Description**: Contains the remaining 30% of the dataset (104,480 subscriptions) without ground truth. Your task is to predict the continuation status of these subscriptions.

## **📁 Submission Format**
Your final submission should be a CSV file named `prediction_df.csv` with the following structure:
- **Columns**:
  - `CustomerID`: Unique identifier for each subscription.
  - `predicted_probability`: Probability of churn for each subscription.

Ensure the file has exactly 104,480 rows and 2 columns. Column names and order are essential for autograding.
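As a minimal sketch of the structure described above, the snippet below builds a tiny frame with the two required columns in the required order and renders it as CSV text. The IDs and scores are made-up placeholders, not values from the dataset, and the three-row size is only for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical identifiers and scores, used only to illustrate the layout
customer_ids = ["A001", "A002", "A003"]
scores = np.array([0.12, 0.87, 0.40])

prediction_df = pd.DataFrame(
    {"CustomerID": customer_ids, "predicted_probability": scores}
)

# Select the columns explicitly so the order is guaranteed
prediction_df = prediction_df[["CustomerID", "predicted_probability"]]

# to_csv() without a path returns the CSV as a string;
# index=False keeps the file at exactly two columns
csv_text = prediction_df.to_csv(index=False)
print(csv_text)
```

Passing `index=False` matters here: without it pandas writes the row index as an extra unnamed column, and the file would no longer have exactly 2 columns.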
## **🔧 Steps to Complete the Challenge**
### **1. Import Required Python Modules**
```python
# Install necessary packages
!pip install pandas numpy scipy scikit-learn keras matplotlib seaborn

# Import essential libraries
import pandas as pd
import numpy as np
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.dummy import DummyClassifier
from matplotlib import pyplot as plt
import seaborn as sns
%matplotlib inline
```

### **2. Load the Data**
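The row counts quoted in the Datasets section give a quick expectation for the shapes printed when the files are loaded. A small arithmetic sketch, using only those two quoted numbers, confirms they correspond to a 70/30 split:

```python
# Row counts as stated in the Datasets section
train_rows = 243_787
test_rows = 104_480

total_rows = train_rows + test_rows
train_share = train_rows / total_rows
test_share = test_rows / total_rows

print(f"total rows:  {total_rows:,}")
print(f"train share: {train_share:.3f}")
print(f"test share:  {test_share:.3f}")
```

If the shapes reported after loading differ from these counts, the files are probably not the ones the challenge describes.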
```python
# Load the datasets
train_df = pd.read_csv("train.csv")
print('train_df Shape:', train_df.shape)
train_df.head()

test_df = pd.read_csv("test.csv")
print('test_df Shape:', test_df.shape)
test_df.head()
```

### **3. Data Exploration and Cleaning (Optional)**
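One caveat for the encoding step in this section: when one-hot encoding is applied to the training and test data separately, the two frames can end up with different columns if a category is missing from one of them. The sketch below shows one common way to align them with `DataFrame.reindex`. The column names `plan` and `hours` are hypothetical toy examples, not columns from the challenge dataset.

```python
import pandas as pd

# Toy frames with hypothetical columns; the test frame lacks two categories
train = pd.DataFrame(
    {"plan": ["basic", "premium", "standard"], "hours": [1.0, 2.5, 4.0]}
)
test = pd.DataFrame({"plan": ["basic", "basic"], "hours": [3.0, 0.5]})

train_enc = pd.get_dummies(train, columns=["plan"], drop_first=True)
test_enc = pd.get_dummies(test, columns=["plan"], drop_first=True)

# Give the test frame exactly the training columns, in the same order;
# indicator columns the test data never produced are filled with 0
test_enc = test_enc.reindex(columns=train_enc.columns, fill_value=0)

print(list(train_enc.columns))
print(list(test_enc.columns))
```

Reindexing against the training columns also drops any category that appears only in the test data, which keeps the feature matrix compatible with a model fitted on the training set.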
```python
# Explore and clean the dataset
data = pd.read_csv('train.csv')

# Handle missing values and remove duplicates
# (DataFrame.ffill() replaces the deprecated fillna(method='ffill'))
data = data.ffill().drop_duplicates()
data.info()
print(data.describe())

# CustomerID is an identifier, not a feature; drop it so it is not one-hot encoded
data = data.drop(columns=['CustomerID'], errors='ignore')

# Encode categorical variables (integer 0/1 columns keep the plots below simple)
categorical_cols = data.select_dtypes(include=['object']).columns
data = pd.get_dummies(data, columns=categorical_cols, drop_first=True, dtype=int)

# Visualize the data distribution (hist() creates its own figure)
data.hist(bins=30, figsize=(20, 15))
plt.show()

# Visualize pairwise correlations
plt.figure(figsize=(10, 8))
sns.heatmap(data.corr(), annot=True, cmap='coolwarm')
plt.show()
```

### **4. Train and Evaluate Your Model**
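Because `test.csv` has no labels, any evaluation has to happen on a slice held out from the training data. The sketch below illustrates that idea on synthetic data, assuming ROC AUC is the metric of interest (as the `roc_auc_score` import in step 1 suggests) and using `DummyClassifier` as a baseline. The features and labels are randomly generated for illustration and say nothing about performance on the real dataset.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: the label depends on the first feature plus noise
rng = np.random.default_rng(42)
X = rng.normal(size=(1000, 5))
y = (X[:, 0] + rng.normal(scale=0.5, size=1000) > 1.0).astype(int)

# Hold out 20% for validation, keeping the class ratio in both parts
X_fit, X_val, y_fit, y_val = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# Baseline that always predicts the class prior
baseline = DummyClassifier(strategy="prior").fit(X_fit, y_fit)
model = RandomForestClassifier(n_estimators=100, random_state=42).fit(X_fit, y_fit)

baseline_auc = roc_auc_score(y_val, baseline.predict_proba(X_val)[:, 1])
model_auc = roc_auc_score(y_val, model.predict_proba(X_val)[:, 1])

print(f"baseline ROC AUC: {baseline_auc:.3f}")
print(f"model ROC AUC:    {model_auc:.3f}")
```

A constant-score baseline lands at a ROC AUC of 0.5, so the gap between the two numbers gives a rough sense of how much signal the model has picked up.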
```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import StandardScaler

# Prepare features and target
X_train = train_df.drop(['CustomerID', 'Churn'], axis=1)
y_train = train_df['Churn']
X_test = test_df.drop(['CustomerID'], axis=1)

# One-hot encode any text columns (StandardScaler needs numeric input) and
# give the test set the same columns as the training set
X_train = pd.get_dummies(X_train, drop_first=True)
X_test = pd.get_dummies(X_test, drop_first=True).reindex(columns=X_train.columns, fill_value=0)

# Standardize features (fit on the training data only)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Initialize and fit the model
rf_clf = RandomForestClassifier(n_estimators=100, random_state=42)
rf_clf.fit(X_train_scaled, y_train)

# Predict churn probabilities
predicted_probabilities = rf_clf.predict_proba(X_test_scaled)[:, 1]

# Create submission dataframe
prediction_df = pd.DataFrame({
    'CustomerID': test_df['CustomerID'].values,
    'predicted_probability': predicted_probabilities
})

# Validate submission format (including column order) and save
assert prediction_df.shape == (104480, 2)
assert list(prediction_df.columns) == ['CustomerID', 'predicted_probability']
prediction_df.to_csv('prediction_df.csv', index=False)
print(prediction_df.head(10))
```

### **5. Final Tests**
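Beyond the shape and column-name checks in this section, a few extra sanity checks can catch problems early. The helper below is a hypothetical sketch: its name and the additional checks (unique IDs, no missing values, probabilities within [0, 1]) are assumptions, not stated autograder requirements.

```python
import pandas as pd

def find_submission_problems(df, expected_rows):
    """Return a list of human-readable problems; an empty list means none found."""
    problems = []
    if list(df.columns) != ["CustomerID", "predicted_probability"]:
        # Without the right columns the remaining checks cannot run
        problems.append("unexpected column names or order")
        return problems
    if len(df) != expected_rows:
        problems.append(f"expected {expected_rows} rows, found {len(df)}")
    if df["CustomerID"].duplicated().any():
        problems.append("duplicate CustomerID values")
    probs = df["predicted_probability"]
    if probs.isna().any():
        problems.append("missing probabilities")
    if not probs.dropna().between(0, 1).all():
        problems.append("probabilities outside [0, 1]")
    return problems

# Toy frames for illustration only
good = pd.DataFrame({"CustomerID": ["A", "B"], "predicted_probability": [0.1, 0.9]})
bad = pd.DataFrame({"CustomerID": ["A", "A"], "predicted_probability": [1.2, None]})

good_problems = find_submission_problems(good, expected_rows=2)
bad_problems = find_submission_problems(bad, expected_rows=2)
print(good_problems)
print(bad_problems)
```

For the real submission, the same helper could be called on the frame read back from `prediction_df.csv` with `expected_rows=104480`.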
```python
# Validate submission format
submission = pd.read_csv('prediction_df.csv')
assert isinstance(submission, pd.DataFrame), 'The file should be a dataframe named prediction_df.'
assert submission.columns[0] == 'CustomerID', 'The first column name should be CustomerID.'
assert submission.columns[1] == 'predicted_probability', 'The second column name should be predicted_probability.'
assert submission.shape[0] == 104480, 'The dataframe should have 104,480 rows.'
assert submission.shape[1] == 2, 'The dataframe should have 2 columns.'
```

## **📚 Resources**
- [Pandas Documentation](https://pandas.pydata.org/docs/)
- [Scikit-learn Documentation](https://scikit-learn.org/stable/)
- [Keras Documentation](https://keras.io/)
- [Matplotlib Documentation](https://matplotlib.org/)
- [Seaborn Documentation](https://seaborn.pydata.org/)

---
**Good luck, and may your model achieve stellar performance! 🚀**