https://github.com/mindful-ai-assistants/linear-regression--datascalinganalysis.md
This project demonstrates a complete machine learning workflow for price predictions
https://github.com/mindful-ai-assistants/linear-regression--datascalinganalysis.md
Last synced: 4 months ago
JSON representation
This project demonstrates a complete machine learning workflow for price predictions
- Host: GitHub
- URL: https://github.com/mindful-ai-assistants/linear-regression--datascalinganalysis.md
- Owner: Mindful-AI-Assistants
- License: mit
- Created: 2025-05-10T18:54:54.000Z (5 months ago)
- Default Branch: main
- Last Pushed: 2025-06-16T02:07:23.000Z (4 months ago)
- Last Synced: 2025-06-22T07:35:51.491Z (4 months ago)
- Language: Jupyter Notebook
- Homepage: https://github.com/Mindful-AI-Assistants/Linear-Regression--DataScalingAnalysis.md
- Size: 79.1 KB
- Stars: 1
- Watchers: 1
- Forks: 0
- Open Issues: 4
-
Metadata Files:
- Readme: README.md
- Funding: .github/FUNDING.yml
- License: LICENSE
Awesome Lists containing this project
README
# Linear Regression and Data Scaling Analysis
## Project Overview
This project demonstrates a complete machine learning workflow for price prediction using:
- **Stepwise Regression** for feature selection
- Advanced statistical analysis (ANOVA, R² metrics)
- Full model diagnostics
- Interactive visualization integration[](https://colab.research.google.com/github/yourusername/repo-name/blob/main/price_prediction.ipynb)
---
## Table of Contents
1. [What is Data Normalization/Scaling?](#what-is-data-normalizationscaling)
2. [Common Scaling Methods](#common-scaling-methods)
3. [Why is this Important in Machine Learning?](#why-is-this-important-in-machine-learning)
4. [Practical Example](#practical-example)
5. [Code Example (Python)](#code-example-python)
6. [Linear Regression: Price Prediction Case Study 📈](#linear-regression-price-prediction-case-study)
- [I. Use Case Implementation & Dataset Description](#i-use-case-implementation--dataset-description)
- [II. Methodology (Stepwise Regression)](#ii-methodology-stepwise-regression)
- [III. Statistical Analysis](#iii-statistical-analysis)
- [IV. Full Implementation Code](#iv-full-implementation-code)
- [V. Visualization](#v-visualization)
- [VI. How to Run](#vi-how-to-run)
7. [Linear Regression Analysis Report 📊](#linear-regression-analysis-report)
- [Dataset Overview](#dataset-overview)
- [Key Formulas](#key-formulas)
- [Statistical Results](#statistical-results)
- [Code Implementation](#code-implementation)
- [Stepwise Regression](#stepwise-regression)---
## What is Data Normalization/Scaling?
A preprocessing technique that adjusts numerical values in a dataset to a standardized scale (e.g., \[0, 1\] or \[-1, 1\]). This is essential for:
- **Reducing outlier influence**
- **Ensuring stable performance** in machine learning algorithms (e.g., neural networks, SVM)
- **Enabling fair comparison** between variables with different units or magnitudes---
## Common Scaling Methods
1. **Min-Max Scaling (Normalization)**
- **Formula:**
\[
X_{\text{norm}} = \frac{X - X_{\min}}{X_{\max} - X_{\min}}
\]
- **Result:** Values scaled to the \[0, 1\] interval.2. **Standardization (Z-Score)**
- **Formula:**
\[
X_{\text{std}} = \frac{X - \mu}{\sigma}
\]
- **Where:** \(\mu\) is the mean and \(\sigma\) is the standard deviation.
- **Result:** Data with a mean of 0 and standard deviation of 1.3. **Robust Scaling**
- Uses the median and interquartile range (IQR) to reduce the impact of outliers.
- **Formula:**
\[
X_{\text{robust}} = \frac{X - \text{Median}(X)}{\text{IQR}(X)}
\]---
## Why is this Important in Machine Learning?
- **Scale-sensitive algorithms:** Methods like neural networks, SVM, and KNN rely on the distances between data points; unscaled data can hinder model convergence.
- **Interpretation:** Variables with different scales can distort the weights in linear models (e.g., logistic regression).
- **Optimization Speed:** Gradients in optimization algorithms converge faster with normalized data.
## Practical Example
For a dataset containing:
- **Age:** Values between 18–90 years
- **Salary:** Values between \$1k–\$20kAfter applying **Min-Max Scaling**:
- **Age 30** transforms to approximately \[0.17\]
- **Salary \$5k** transforms to approximately \[0.21\]This process ensures both features contribute equally to the model.
## Code Example (Python) – Data Normalization
```python
from sklearn.preprocessing import MinMaxScaler
import numpy as np# Sample data: Age and Salary
data = np.array([[30], [5000]], dtype=float).reshape(-1, 1)
scaler = MinMaxScaler()
normalized_data = scaler.fit_transform(data)print(normalized_data)
# Expected Output: [[0.17], [0.21]]
```
Linear Regression: Price Prediction Case Study 📈
Dataset: housing_data.xlsx (included in repository) Tech Stack: Python 3.9, Jupyter Notebook, scikit-learn, statsmodels## I. Use Case Implementation & Dataset Description
| Variable | Type | Range | Description |
|----------------|-------|---------------|--------------------------------------|
| `area_sqm` | float | 40–220 | Living area in square meters |
| `bedrooms` | int | 1–5 | Number of bedrooms |
| `distance_km` | float | 0.5–15 | Distance to city center (km) |
| `price` | float | \$50k–\$1.2M | Property price in USD |
## II. Methodology (Stepwise Regression)```python
import statsmodels.api as smdef stepwise_selection(X, y):
"""Automated feature selection using p-values."""
included = []
while True:
changed = False
# Forward step: consider adding each excluded feature
excluded = list(set(X.columns) - set(included))
pvalues = pd.Series(index=excluded)
for new_column in excluded:
model = sm.OLS(y, sm.add_constant(pd.DataFrame(X[included + [new_column]]))).fit()
pvalues[new_column] = model.pvalues[new_column]
best_pval = pvalues.min()
if best_pval < 0.05:
best_feature = pvalues.idxmin()
included.append(best_feature)
changed = True
# Backward step: consider removing features with high p-value
model = sm.OLS(y, sm.add_constant(pd.DataFrame(X[included]))).fit()
pvalues = model.pvalues.iloc[1:] # Exclude intercept
worst_pval = pvalues.max()
if worst_pval > 0.05:
worst_feature = pvalues.idxmax()
included.remove(worst_feature)
changed = True
if not changed:
break
return included# Example usage (assuming X_train and y_train are predefined):
# selected_features = stepwise_selection(X_train, y_train)
```
## III. Statistical Analysis### Key Metrics Table
| Metric | Value | Interpretation |
|----------------|---------|---------------------------------|
| **R²** | 0.872 | 87.2% variance explained |
| **Adj. R²** | 0.865 | Adjusted for feature complexity |
| **F-statistic**| 124.7 | p-value = 2.3e-16 (Significant) |
| **Intercept** | 58,200 | Base price without features |
### Correlation Matrix```python
import seaborn as sns
corr_matrix = df.corr()
sns.heatmap(corr_matrix, annot=True, cmap='coolwarm')
``
## IV. Full Implementation Code
### Model Training & Evaluation
```python
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
import numpy as np# Assuming X_train, y_train, X_test, and y_test are predefined
final_model = LinearRegression()
final_model.fit(X_train[selected_features], y_train)# Predictions on test set
y_pred = final_model.predict(X_test[selected_features])# Calculate performance metrics
mse = mean_squared_error(y_test, y_pred)
rmse = np.sqrt(mse)
r2 = final_model.score(X_test[selected_features], y_test)
```
## V. Visualization – Actual vs Predicted Prices
```python
import matplotlib.pyplot as plt
import seaborn as snsplt.figure(figsize=(10,6))
sns.scatterplot(x=y_test, y=y_pred, hue=X_test['bedrooms'])
plt.plot([y.min(), y.max()], [y.min(), y.max()], 'k--', color='red')
plt.xlabel('Actual Prices')
plt.ylabel('Predicted Prices')
plt.title('Model Performance Visualization')
plt.savefig('results/scatter_plot.png')
plt.show()
```
## VI. How to Run
```
1. Install Dependencies: ```bash pip install -r requirements.txt
```
2. Download Dataset:
* From: data/housing_data.xlsx
* Or use this [dataset link]()
3..Execute Jupyter Notebook:
```bash
jupyter notebook price_prediction.ipynb```
Note: Full statistical outputs and diagnostic plots are available in the notebook.
## Linear Regression Analysis Report 📊
### Dataset Overview -
📌 **Important Note:**
> This dataset is a fictitious example created solely for demonstration and educational purposes. There is no external source for this dataset.
> For real-world datasets, consider exploring sources such as the [UC Machine Learning Repository](https://archive.ics.uci.edu/ml/index.php) or [Kaggle](https://www.kaggle.com/datasets).
| Variable | Type | Range | Description |
|-------------|-------|---------------|--------------------------------------|
| area_sqm | float | 40–220 | Living area in square meters |
| bedrooms | int | 1–5 | Number of bedrooms |
| distance_km | float | 0.5–15 | Distance to city center (km) |
| price | float | \$50k–\$1.2M | Property price in USD |
## Key Formulas
1.Regression Equation
$$
\hat{y} = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_n x_n
$$
2.R-Squared
$$
R^2 = 1 - \frac{\sum (y_i - \hat{y}_i)^2}{\sum (y_i - \bar{y})^2}
$$
3.F-Statistic (ANOVA)
$$
F = \frac{\text{MS}\_\text{model}}{\text{MS}\_\text{residual}}
$$
## Statistical
| Metric | Value | Critical Value | Interpretation |
|-------------|--------|----------------|-----------------------------|
| R² | 0.872 | > 0.7 | Strong explanatory power |
| Adj. R² | 0.865 | > 0.6 | Robust to overfitting |
| F-statistic | 124.7 | 4.89 | p < 0.001 (Significant) |
| Intercept | 58,200 | - | Base property value |## Stepwise Regression
```python
import statsmodels.api as smdef stepwise_selection(X, y, threshold_in=0.05, threshold_out=0.1):
included = []
while True:
changed = False
# Forward step
excluded = list(set(X.columns) - set(included))
new_pval = pd.Series(index=excluded)
for new_column in excluded:
model = sm.OLS(y, sm.add_constant(pd.DataFrame(X[included + [new_column]]))).fit()
new_pval[new_column] = model.pvalues[new_column]
best_pval = new_pval.min()
if best_pval < threshold_in:
best_feature = new_pval.idxmin()
included.append(best_feature)
changed = True
# Backward step
model = sm.OLS(y, sm.add_constant(pd.DataFrame(X[included]))).fit()
pvalues = model.pvalues.iloc[1:]
worst_pval = pvalues.max()
if worst_pval > threshold_out:
worst_feature = pvalues.idxmax()
included.remove(worst_feature)
changed = True
if not changed:
break
return included
```#
Copyright 2025 Mindful AI Assistnts.Code released under the MIT License.