https://github.com/harris-giki/e-comdataanalysis_ml
E-commerce Customer Analysis with Linear Regression: analyzes customer behavior in an e-commerce setting and predicts yearly customer spending from several features using a linear regression model.
development ecommerce linear-regression machine-learning model prediction-model python scikit-learn
Last synced: about 2 months ago
- Host: GitHub
- URL: https://github.com/harris-giki/e-comdataanalysis_ml
- Owner: Harris-giki
- Created: 2024-11-06T14:05:36.000Z (2 months ago)
- Default Branch: main
- Last Pushed: 2024-11-11T18:47:24.000Z (about 2 months ago)
- Last Synced: 2024-11-23T07:07:30.935Z (about 2 months ago)
- Topics: development, ecommerce, linear-regression, machine-learning, model, prediction-model, python, scikit-learn
- Language: Jupyter Notebook
- Homepage: https://ecom-data-analysis.streamlit.app/
- Size: 864 KB
- Stars: 0
- Watchers: 1
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
README
Project Name: E-commerce Customer Analysis with Linear Regression
Project Purpose
This project predicts how much an e-commerce customer will spend in a year, using data such as the time they spend on the website and how long they have been a member. We load and explore the data, select the most relevant features, and build a linear regression model to make predictions. We then evaluate the model’s accuracy using error metrics, visualize the results, and interpret which features have the most impact on spending. The goal is a model that can predict future spending from customer behavior.
Data Requirements
Ensure that the dataset ecommerce.csv is in the same directory as the code file. The dataset can be downloaded from the repository or from Kaggle if not already included.
Procedure Overview
Data Loading & Exploration: Load the dataset, examine the structure, and perform initial statistical analyses. Visualize key relationships between features and target variables to gain insights.
Feature Engineering and Model Selection: Select relevant features based on correlation analysis and apply a linear regression model using scikit-learn to predict the target variable.
Model Evaluation: Assess model performance using metrics like Mean Absolute Error, Mean Squared Error, and Root Mean Squared Error. Visualize predictions and residuals to analyze the model's performance.
Interpretation and Insights: Interpret model coefficients to understand feature importance. Assess residual distribution to ensure model assumptions hold.
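The correlation-based feature selection mentioned in the overview can be sketched as follows. This is a minimal illustration, not the notebook's actual code: the data frame `demo` is synthetic, its values are made up, and the assumed relationship (spending driven by app time and membership length only) exists purely so the ranking has something to show.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for ecommerce.csv; all values are made up for illustration.
rng = np.random.default_rng(0)
n = 200
demo = pd.DataFrame({
    "Time on App": rng.normal(12, 1, n),
    "Time on Website": rng.normal(37, 1, n),
    "Length of Membership": rng.normal(3.5, 1, n),
})

# Assumed relationship: spending depends on app time and membership only.
demo["Yearly Amount Spent"] = (
    40 * demo["Time on App"]
    + 60 * demo["Length of Membership"]
    + rng.normal(0, 10, n)
)

# Pearson correlation of each feature with the target, strongest first.
target = "Yearly Amount Spent"
corr_with_target = (
    demo.corr()[target]
    .drop(target)
    .sort_values(key=lambda s: s.abs(), ascending=False)
)
print(corr_with_target)
```

Features near the bottom of this ranking are candidates for removal, although a low pairwise correlation alone does not prove a feature is useless in a multivariate model.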
Step-by-Step Guide
Step 1: Import Libraries
Pandas - data handling
Matplotlib & Seaborn - visualization
Scikit-learn - machine learning
SciPy - statistical analysis
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
import scipy.stats as stats
Step 2: Data Loading & Initial Exploration
Load the data and check the structure:
df = pd.read_csv('ecommerce.csv')
df.head()
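Beyond df.head(), a quick structural check can confirm that the columns used later are present and free of missing values. The sketch below is an assumption-laden illustration: it reads two made-up rows from an in-memory string instead of the real ecommerce.csv, and only the column names are taken from the code in this README.

```python
import io
import pandas as pd

# Two made-up rows standing in for ecommerce.csv; only the column names
# come from the project code, the numbers are invented.
sample_csv = io.StringIO(
    "Avg. Session Length,Time on App,Time on Website,"
    "Length of Membership,Yearly Amount Spent\n"
    "34.5,12.7,39.6,4.1,587.9\n"
    "31.9,11.1,37.3,2.7,392.2\n"
)
demo = pd.read_csv(sample_csv)

# Columns the later steps rely on.
expected = [
    "Avg. Session Length",
    "Time on App",
    "Time on Website",
    "Length of Membership",
    "Yearly Amount Spent",
]
missing = [col for col in expected if col not in demo.columns]

print("Missing columns:", missing)
print(demo.dtypes)        # column types
print(demo.isna().sum())  # missing values per column
print(demo.describe())    # summary statistics
```

With the real file, replacing `sample_csv` with 'ecommerce.csv' should give the same kind of report.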
Step 3: Exploratory Data Analysis (EDA)
Visualize relationships with joint plots and pair plots:
sns.jointplot(x='Time on Website', y='Yearly Amount Spent', data=df, alpha=0.5)
sns.pairplot(df, plot_kws={'alpha': 0.4})
Step 4: Data Splitting & Model Training
Split data and train the model:
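# Features (numeric behavior columns) and target
x = df[['Avg. Session Length', 'Time on App', 'Time on Website', 'Length of Membership']]
y = df['Yearly Amount Spent']
# Hold out 30% of the rows for testing; the fixed seed makes the split reproducible
X_train, X_test, y_train, y_test = train_test_split(x, y, test_size=0.3, random_state=42)
# Fit an ordinary least squares model on the training split
lm = LinearRegression()
lm.fit(X_train, y_train)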
Step 5: Model Interpretation
View feature impact with model coefficients:
cdf = pd.DataFrame(lm.coef_, x.columns, columns=['Coeff'])
print(cdf)
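A caveat when reading coefficients as feature importance: each raw coefficient is expressed per unit of its feature, so features measured on different scales are not directly comparable. One common remedy, shown here as a sketch and not as something the notebook does, is to refit on standardized features. The data, feature names, and effect sizes below are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

# Hypothetical data: two features on very different scales.
rng = np.random.default_rng(1)
n = 300
X_demo = pd.DataFrame({
    "feature_small_scale": rng.normal(0, 1, n),    # standard deviation about 1
    "feature_large_scale": rng.normal(0, 100, n),  # standard deviation about 100
})
y_demo = (
    5 * X_demo["feature_small_scale"]
    + 0.2 * X_demo["feature_large_scale"]
    + rng.normal(0, 1, n)
)

# Raw coefficients: change in y per one unit of each feature.
raw = LinearRegression().fit(X_demo, y_demo)

# Standardized coefficients: change in y per one standard deviation.
X_scaled = StandardScaler().fit_transform(X_demo)
scaled = LinearRegression().fit(X_scaled, y_demo)

comparison = pd.DataFrame(
    {"raw_coeff": raw.coef_, "standardized_coeff": scaled.coef_},
    index=X_demo.columns,
)
print(comparison)
```

In this made-up example the large-scale feature has the smaller raw coefficient but the larger standardized one, which is why the raw values alone can mislead.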
Step 6: Predictions and Visualization
Plot predicted values against actual values:
predictions = lm.predict(X_test)
sns.scatterplot(x=predictions, y=y_test)
plt.xlabel('Predicted Yearly Amount Spent')
Step 7: Performance Metrics
Evaluate using MAE, MSE, and RMSE:
from sklearn.metrics import mean_absolute_error, mean_squared_error
import math
print("MAE:", mean_absolute_error(y_test, predictions))
print("MSE:", mean_squared_error(y_test, predictions))
print("RMSE:", math.sqrt(mean_squared_error(y_test, predictions)))
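To make the three metrics concrete, the sketch below computes them by hand on four made-up actual/predicted pairs and prints the scikit-learn results alongside. The numbers are illustrative only and are not results from this project.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error

# Made-up values chosen so the arithmetic is easy to follow.
y_true = np.array([500.0, 450.0, 600.0, 550.0])
y_pred = np.array([510.0, 440.0, 580.0, 550.0])

errors = y_true - y_pred        # [-10, 10, 20, 0]

mae = np.mean(np.abs(errors))   # (10 + 10 + 20 + 0) / 4 = 10.0
mse = np.mean(errors ** 2)      # (100 + 100 + 400 + 0) / 4 = 150.0
rmse = np.sqrt(mse)             # square root of 150, about 12.25

print("MAE :", mae, "| sklearn:", mean_absolute_error(y_true, y_pred))
print("MSE :", mse, "| sklearn:", mean_squared_error(y_true, y_pred))
print("RMSE:", rmse)
```

MAE and RMSE are in the same units as the target (here, currency per year), while MSE is in squared units and penalizes large errors more heavily.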
Step 8: Residual Analysis
Check the distribution of the residuals to assess model fit:
residuals = y_test - predictions
sns.histplot(residuals, bins=30)
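A histogram gives a visual impression of normality; SciPy, which Step 1 already imports, can add a numeric check. The following is a sketch under stated assumptions: `demo_residuals` is randomly generated and stands in for the real residuals, and the choice of the Q-Q correlation plus the Shapiro-Wilk test is one reasonable option, not necessarily what the notebook uses.

```python
import numpy as np
import scipy.stats as stats

# Synthetic residuals standing in for y_test - predictions.
rng = np.random.default_rng(2)
demo_residuals = rng.normal(0, 10, 150)

# probplot returns the Q-Q points and a least-squares fit through them;
# r close to 1 means the points lie near a straight line.
(theoretical_q, ordered_vals), (slope, intercept, r) = stats.probplot(
    demo_residuals, dist="norm"
)

# Shapiro-Wilk test: a small p-value is evidence against normality.
shapiro_stat, p_value = stats.shapiro(demo_residuals)

print(f"Q-Q correlation r = {r:.3f}")
print(f"Shapiro-Wilk p-value = {p_value:.3f}")
```

On the real data, passing `residuals` in place of `demo_residuals` gives the same two numbers for the fitted model; calling stats.probplot with `plot=plt` additionally draws the Q-Q plot.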
Results
The model shows strong predictive performance with meaningful features. Residuals follow a near-normal distribution, supporting model fit.
Applications
Marketing: Predict spending for targeted campaigns.
Customer Retention: Identify high-value customer characteristics.
Business Decisions: Data-driven insights for strategic planning.
Instructions to Run
- Ensure Python and the required libraries (pandas, Matplotlib, Seaborn, scikit-learn, SciPy) are installed.
- Download ecommerce.csv and place it in the project folder.
- Run each section in a Jupyter Notebook or compatible IDE to analyze results.