Statistical Analysis of Socioeconomic Data for Business Insights
https://github.com/lucashomuniz/project-07

# ✅ PROJECT-07

In this project, we provide a comprehensive guide to constructing, training, evaluating, and selecting the best **Regression** models among three approaches: **Benchmark**, **Ridge Regression**, and **LASSO Regression**. We cover the entire process—from defining the business problem to interpreting the models and delivering results to decision-makers. The company under analysis is an **e-commerce** that sells products via a **website** and **mobile application**. Each customer login is recorded with its session duration, and sales data (total monthly spending per customer) is also tracked.

The main objective is to increase **sales**, given a limited budget that restricts investment to either the **website** or the **application**. By enhancing the customer experience, we aim to boost session duration, engagement, and ultimately drive higher sales. Although the dataset is fictitious, it reflects real-world **e-commerce** scenarios. The data were collected over one month of portal operation, and each column in the dataset is self-explanatory.

Keywords: Python Language, Data Visualization, Data Analysis, Linear Regression, Benchmark, Ridge Regression, LASSO Regression, Machine Learning, e-commerce.

# ✅ PROCESS

The **exploratory analysis** commences immediately after data loading. This phase includes **data cleaning** tasks such as removing **duplicate values**, handling **missing values**, and applying necessary **transformations**. The primary goal is to comprehend the **dataframe** by visualizing **numerical** and **categorical variables**, assessing their **distributions**, and addressing **outliers** using **boxplots**, **descriptive statistics**, and **frequency counts**. Ensuring the absence of **duplicate rows** or **columns** is crucial to prevent **redundant information** that could introduce **bias** into the **model**. The objective is to develop a **generalizable model** devoid of unwanted biases.
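The cleaning steps described above can be sketched with pandas on a toy dataframe. The data here are fabricated for illustration, and the column name `app_login_time` is a hypothetical stand-in; only `client_registration_time` and `spend_total_value` appear in the project itself.

```python
import pandas as pd
import numpy as np

# Toy dataframe standing in for the real dataset (values are illustrative)
df = pd.DataFrame({
    "client_registration_time": [1.2, 3.4, 3.4, np.nan, 5.1],
    "app_login_time": [10.0, 12.5, 12.5, 11.0, 40.0],
    "spend_total_value": [200.0, 350.0, 350.0, 300.0, 900.0],
})

df = df.drop_duplicates()   # remove duplicate rows
df = df.dropna()            # handle missing values by dropping incomplete rows

# Flag outliers with the 1.5 * IQR rule that boxplot whiskers use
q1, q3 = df["app_login_time"].quantile([0.25, 0.75])
iqr = q3 - q1
outliers = df[(df["app_login_time"] < q1 - 1.5 * iqr) |
              (df["app_login_time"] > q3 + 1.5 * iqr)]

print(df.describe())        # descriptive statistics for the numeric variables
print(len(outliers), "outlier(s) flagged")
```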

Utilizing a **scatter plot** derived from the **correlation matrix**, we analyze **variable interactions**. Notably, an increase in **application login time** exhibits a **moderate positive correlation** with the **total amount spent**. In a **regression machine learning** project, it is advantageous for **predictor variables** to have a strong correlation with the **target variable** while avoiding high inter-correlations among predictors to prevent **multicollinearity**. Additionally, the **scatter plot** reveals a high positive correlation between **client_registration_time** and **spend_total_value**, indicating that longer **customer registration durations** are associated with higher **spending**, suggesting that **veteran customers** tend to spend more.
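The correlation analysis can be outlined as follows. The data are simulated to mimic the relationships described above (a strong positive effect of `client_registration_time` and a moderate one of application login time on `spend_total_value`); the names `app_login_time` and `website_login_time` are hypothetical, and all coefficients are illustrative.

```python
import pandas as pd
import numpy as np

rng = np.random.default_rng(42)
n = 200

# Synthetic predictors (illustrative units and ranges)
client_registration_time = rng.uniform(0, 6, n)   # years since sign-up
app_login_time = rng.uniform(8, 16, n)            # hours per month in the app
website_login_time = rng.uniform(30, 45, n)       # hours per month on the site

# Target built so registration time dominates, app time matters moderately,
# and website time contributes almost nothing
spend_total_value = (60 * client_registration_time
                     + 35 * app_login_time
                     + 0.5 * website_login_time
                     + rng.normal(0, 20, n))

df = pd.DataFrame({
    "client_registration_time": client_registration_time,
    "app_login_time": app_login_time,
    "website_login_time": website_login_time,
    "spend_total_value": spend_total_value,
})

# Correlation matrix: the basis for both the scatter-plot reading and
# the multicollinearity check among predictors
corr = df.corr()
print(corr["spend_total_value"].sort_values(ascending=False))
```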

The subsequent phase involves **Attribute Engineering**, encompassing advanced **transformations** and the **creation** and **modification** of variables. This stage may include **attribute selection** to identify optimal variables for the **Machine Learning** process. A critical technique here is the development of a **Correlation Table**, enabling the detection of **positive** or **negative relationships** between variables and assessing **multicollinearity**.

Following this, the **pre-processing** step entails converting **textual variables** to **numeric formats**. This phase also involves configuring the entire **Machine Learning model**, which includes selecting the primary **algorithm** and applying **label encoding**, **normalization**, **standardization**, and **scaling**. A common practice during pre-processing is to **split** the **dataframe** into **training** and **testing sets**: the model is trained on the **training set** and evaluated on the **test set**. Evaluating the model on data it has already seen during training is uninformative; to assess performance accurately, it must be tested on **unseen data** with known outcomes.
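A minimal sketch of the split-then-scale step with scikit-learn, assuming numeric predictors only (the data and split ratio are illustrative, not the project's actual settings). The key point is that the scaler is fit on the training set alone, so no information from the test set leaks into pre-processing:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(150, 3))                  # three numeric predictors
y = X @ np.array([60.0, 35.0, 0.5]) + rng.normal(0, 10, 150)

# Hold out 30% of the rows as unseen data for evaluation
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42)

# Fit the scaler on the training split only, then transform both splits
scaler = StandardScaler().fit(X_train)
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)

print(X_train_scaled.mean(axis=0).round(6))            # ≈ 0 after standardization
```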


# ✅ CONCLUSION

There are three prevalent regression algorithms in Machine Learning: **Benchmark Linear Regression**, **Ridge Regression**, and **LASSO Regression**. **Benchmark Linear Regression** is straightforward and interpretable but assumes linearity and is vulnerable to outliers and non-linear relationships. **Ridge Regression** addresses multicollinearity and reduces model complexity through regularization, enhancing performance on unseen data, though it results in less interpretable coefficients and poses challenges in selecting the regularization parameter. **LASSO Regression** combines regularization with automatic variable selection, promoting model sparsity and interpretability, but it remains sensitive to multicollinearity and the choice of the regularization parameter.
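To illustrate how the three approaches differ, the snippet below fits all of them with scikit-learn on synthetic data in which the last predictor is irrelevant. The `alpha` values (the regularization strength) and the data are illustrative, not the project's actual configuration; note how LASSO can drive the irrelevant coefficient toward zero, performing variable selection:

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge, Lasso

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 4))
coef_true = np.array([60.0, 35.0, 0.5, 0.0])   # last predictor has no effect
y = X @ coef_true + rng.normal(0, 5, 200)

models = {
    "benchmark": LinearRegression(),           # plain OLS, no regularization
    "ridge": Ridge(alpha=1.0),                 # L2 penalty shrinks coefficients
    "lasso": Lasso(alpha=1.0),                 # L1 penalty can zero them out
}
for name, model in models.items():
    model.fit(X, y)
    print(name, np.round(model.coef_, 2))
```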

Upon evaluating the models using performance metrics, **LASSO Regression** (Model 3) exhibited a slightly higher Root Mean Squared Error (RMSE) and was consequently discarded. **Benchmark Linear Regression** (Model 1) and **Ridge Regression** (Model 2) showed comparable performance. Favoring simplicity, **Benchmark Linear Regression** was selected as the optimal model due to its ease of interpretation without significant performance loss.
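The RMSE-based comparison can be reproduced in outline as follows, again on synthetic data, so the actual figures from the project will differ. Each model is trained on the training split and scored on the held-out split:

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge, Lasso
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(7)
X = rng.normal(size=(300, 4))
y = X @ np.array([60.0, 35.0, 0.5, 0.0]) + rng.normal(0, 10, 300)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

# Fit each candidate and report RMSE on the unseen test split
for name, model in [("benchmark", LinearRegression()),
                    ("ridge", Ridge(alpha=1.0)),
                    ("lasso", Lasso(alpha=1.0))]:
    model.fit(X_tr, y_tr)
    rmse = np.sqrt(mean_squared_error(y_te, model.predict(X_te)))
    print(f"{name}: RMSE = {rmse:.2f}")
```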

The coefficient analysis revealed that increases in **customer registration time**, **average clicks per session**, and **time logged into the application** significantly boost the **total amount spent** per customer monthly, with respective effects of BRL 63.74, BRL 26.24, and BRL 38.57 per unit increase. By contrast, time logged into the website had a minimal impact (BRL 0.68). These insights suggest that investing in the application would yield higher returns, and that strategies to enhance customer engagement and session duration are beneficial; updating the website is currently not justified, given its negligible expected return.
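As a sketch of how such per-variable effects are read off a fitted linear model, the example below simulates data whose true coefficients mirror the BRL effects reported above and then recovers them. The feature names other than `client_registration_time` are hypothetical, and the data are fabricated:

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(3)
n = 500
features = ["client_registration_time", "avg_clicks_per_session",
            "app_login_time", "website_login_time"]
X = pd.DataFrame(rng.uniform(0, 10, size=(n, 4)), columns=features)

# True effects chosen to mirror the reported BRL values (illustrative)
y = (63.74 * X["client_registration_time"]
     + 26.24 * X["avg_clicks_per_session"]
     + 38.57 * X["app_login_time"]
     + 0.68 * X["website_login_time"]
     + rng.normal(0, 15, n))

model = LinearRegression().fit(X, y)

# Estimated BRL impact per one-unit increase in each predictor
effects = pd.Series(model.coef_, index=features).sort_values(ascending=False)
print(effects.round(2))
```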