{"id":26544915,"url":"https://github.com/quantum-software-development/integrated-project-for-business","last_synced_at":"2026-03-03T07:36:37.466Z","repository":{"id":279232958,"uuid":"938139555","full_name":"Quantum-Software-Development/Integrated-Project-for-Business","owner":"Quantum-Software-Development","description":"Integrated Project for Business","archived":false,"fork":false,"pushed_at":"2025-03-14T03:13:44.000Z","size":3921,"stargazers_count":2,"open_issues_count":3,"forks_count":0,"subscribers_count":1,"default_branch":"main","last_synced_at":"2025-03-14T03:29:49.277Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":"","language":"Jupyter Notebook","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/Quantum-Software-Development.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null},"funding":{"github":"Quantum-Software-Developmen","Custom":"https://github.com/sponsors/Quantum-Software-Development/card"}},"created_at":"2025-02-24T13:38:46.000Z","updated_at":"2025-03-14T03:13:47.000Z","dependencies_parsed_at":"2025-02-24T14:40:37.083Z","dependency_job_id":"e499cb1d-a488-444a-b7bd-2f16cd3801cf","html_url":"https://github.com/Quantum-Software-Development/Integrated-Project-for-Business","commit_stats":null,"previous_names":["quantum-software-development/integrated_project-business","quantum-software-development/integrated-project-for-business"],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Quantum-Software-Development%2FIntegrated-Project-for-Business
","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Quantum-Software-Development%2FIntegrated-Project-for-Business/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Quantum-Software-Development%2FIntegrated-Project-for-Business/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Quantum-Software-Development%2FIntegrated-Project-for-Business/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/Quantum-Software-Development","download_url":"https://codeload.github.com/Quantum-Software-Development/Integrated-Project-for-Business/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":244902911,"owners_count":20529115,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2025-03-22T04:14:45.602Z","updated_at":"2026-03-03T07:36:37.419Z","avatar_url":"https://github.com/Quantum-Software-Development.png","language":"Jupyter Notebook","funding_links":["https://github.com/sponsors/Quantum-Software-Developmen","https://github.com/sponsors/Quantum-Software-Development/card","https://github.com/sponsors/Quantum-Software-Development"],"categories":[],"sub_categories":[],"readme":"\u003cbr\u003e\n\n\n\u003cdiv align=\"center\"\u003e\n  \u003ch1 style=\"font-size:2.5em;\"\u003e🌟 Integrated Business Project – 3rd Semester at PUC-SP: Bachelor's in Humanistic AI \u0026 Data Science\u003c/h1\u003e\n  \u003ch3 style=\"font-size:0.9em;\"\u003e\n    Under the guidance of \u003ca 
href=\"https://www.linkedin.com/in/eric-bacconi-423137/\" target=\"_blank\" style=\"color:inherit; text-decoration:underline;\"\u003eProfessor Dr. Eric Bacconi\u003c/a\u003e, Coordinator of the Bachelor's Program in Humanistic AI \u0026 Data Science at PUC-SP.\n  \u003c/h3\u003e\n\u003c/div\u003e\n\n\n\u003cbr\u003e\u003cbr\u003e\u003cbr\u003e\n\n\u003ch2 align=\"center\"\u003e  $$\\Huge {\\textbf{\\color{DodgerBlue} GOOD DECISIONS = GOOD RESULTS}}$$ \n\n\u003cbr\u003e\u003cbr\u003e\u003cbr\u003e\n\n### \u003cp align=\"center\"\u003e [![Sponsor Quantum Software Development](https://img.shields.io/badge/Sponsor-Quantum%20Software%20Development-brightgreen?logo=GitHub)](https://github.com/sponsors/Quantum-Software-Development)\n\n\u003cbr\u003e\u003cbr\u003e\u003cbr\u003e\n\n\n\n## [Part I]() - Linear Regression and Data Scaling Analysis\n\n\u003cbr\u003e\n\n## Project Overview\n\nThis project demonstrates a complete machine learning workflow for price prediction using:\n- **Stepwise Regression** for feature selection  \n- Advanced statistical analysis (ANOVA, R² metrics)  \n- Full model diagnostics  \n- Interactive visualization integration  \n\n\u003cbr\u003e\n\n\n# Table of Contents  \n1. [What is Data Normalization/Scaling?](#what-is-data-normalizationscaling)  \n2. [Common Scaling Methods](#common-scaling-methods)  \n3. [Why is this Important in Machine Learning?](#why-is-this-important-in-machine-learning)  \n4. [Practical Example](#practical-example)  \n5. [Code Example (Python)](#code-example-python)  \n6. [Linear Regression: Price Prediction Case Study 📈](#linear-regression-price-prediction-case-study-)  \n   - [I. Use Case Implementation](#i-use-case-implementation)  \n   - [Dataset Description](#dataset-description)  \n   - [II. Methodology](#ii-methodology)  \n   - [Stepwise Regression Implementation](#stepwise-regression-implementation)  \n   - [III. 
Statistical Analysis](#iii-statistical-analysis)  \n   - [Key Metrics Table](#key-metrics-table)  \n   - [Correlation Matrix](#correlation-matrix)  \n   - [IV. Full Implementation Code](#iv-full-implementation-code)  \n   - [Model Training \u0026 Evaluation](#model-training--evaluation)  \n   - [ANOVA Results](#anova-results)  \n   - [V. Visualization](#v-visualization)  \n   - [Actual vs Predicted Prices](#actual-vs-predicted-prices)  \n   - [VI. How to Run](#vi-how-to-run)  \n7. [Linear Regression Analysis Report 📊](#linear-regression-analysis-report)  \n   - [Dataset Overview](#dataset-overview)  \n   - [Key Formulas](#key-formulas)  \n   - [Statistical Results](#statistical-results)  \n   - [Code Implementation](#code-implementation)  \n   - [Stepwise Regression](#stepwise-regression)  \n\n\n\u003cbr\u003e\u003cbr\u003e\n\n\n## [Standardization of a Range of Values]()\n\nRange standardization is the process of scaling or normalizing data to a fixed interval, typically from 0 to 1. 
This is a common technique in data analysis and machine learning.\n\n\u003cbr\u003e\n\n###  \u003cp align=\"center\"\u003e [Mathematical Formula]()\n\n\u003cbr\u003e\n\n$$X_{normalized} = \\frac{X - X_{\\min}}{X_{\\max} - X_{\\min}}$$\n\n\u003cbr\u003e\n\n### \u003cp align=\"center\"\u003e [Where]():\n\n $$X_{\\max} - X_{\\min} = \\text{Amplitude}$$ \n\n \u003cbr\u003e\n\n####  \u003cp align=\"center\"\u003e The `amplitude` is the range (maximum minus minimum) of the data values before normalization.\n\n\u003cbr\u003e\n\n## [Explanation]():\n\nTo standardize the variables salario, n_filhos, and idade using both the Z-Score and Range methods, and to compare the mean, standard deviation, maximum, and minimum before and after standardization, follow these steps:\n\n\n### [Before Standardization]():\n\nCompute the mean, standard deviation, maximum, and minimum for each of the variables (n_filhos, salario, idade).\n\n#### [Z-Score Standardization]():\n\nWe standardize the variables using the Z-Score method, which is computed as:\n\n\n$Z = \\frac{X - \\mu}{\\sigma}$\n\n```latex\nZ = \\frac{X - \\mu}{\\sigma}\n```\n\nWhere:\n- $\\mu$ is the mean,\n- $\\sigma$ is the standard deviation.\n\n  \u003cbr\u003e\n\n### [Range Standardization (Min-Max Scaling)]():\n\nWe scale the data using the Min-Max method, which scales the values to a [0, 1] range using:\n\n$X' = \\frac{X - \\min(X)}{\\max(X) - \\min(X)}$\n\n```latex\nX' = \\frac{X - \\min(X)}{\\max(X) - \\min(X)}\n```\n  \nWhere:\n- X is the original value,\n- min(X) is the minimum value,\n- max(X) is the maximum value.\n\n\u003cbr\u003e\n\n### [After Standardization]():\n\nCompute the mean, standard deviation, maximum, and minimum of the standardized data for both Z-Score and Range methods.\n\nThe output will provide the descriptive statistics before and after each standardization method, allowing you to compare the effects of Z-Score and Range standardization on the dataset.\n\n 
\u003cbr\u003e\n\n## Practical Example for Calculating this Normalized Value in [Python]():\n\n#### Use this [dataset](https://github.com/Quantum-Software-Development/Integrated_Project-Business/blob/f2d7abe6ee5853ae29c750170a01e429334f6fe5/HomeWork/1-Z-Score-Range/cadastro_funcionarios.xlsx)\n\nThe code demonstrates how to apply Z-Score and Range (Min-Max) standardization to the variables salario, n_filhos, and idade in a dataset. It also evaluates and compares the mean, standard deviation, minimum, and maximum values before and after the standardization methods are applied.\n\n \u003cbr\u003e\n\n### Cell 1: [Import necessary libraries]()\n\n```python\n# Importing the necessary libraries\nimport pandas as pd\nimport numpy as np\nfrom sklearn.preprocessing import MinMaxScaler\n```\n\n\u003cbr\u003e\n\n### Cell 2: [Load the dataset from the Excel file]()\n\n```python\n# Load the data from the Excel file\n# df = pd.read_excel('use-your-own-dataset.xlsx') - optional\ndf = pd.read_excel('cadastro_funcionarios.xlsx')\ndf.head()  # Displaying the first few rows of the dataset to understand its structure\n```\n\n\u003cbr\u003e\n\n### Cell 3: [Evaluate the statistics before standardization]()\n\n```python\n# Step 1: Evaluate the mean, std, max, and min before standardization\nbefore_std_stats = {\n    'mean_n_filhos': df['n_filhos'].mean(),\n    'std_n_filhos': df['n_filhos'].std(),\n    'min_n_filhos': df['n_filhos'].min(),\n    'max_n_filhos': df['n_filhos'].max(),\n    \n    'mean_salario': df['salario'].mean(),\n    'std_salario': df['salario'].std(),\n    'min_salario': df['salario'].min(),\n    'max_salario': df['salario'].max(),\n    \n    'mean_idade': df['idade'].mean(),\n    'std_idade': df['idade'].std(),\n    'min_idade': df['idade'].min(),\n    'max_idade': df['idade'].max(),\n}\n\n# Display the statistics before standardization\nbefore_std_stats\n```\n\n\u003cbr\u003e\n\n### Cell 4: [Apply Z-Score standardization]()\n\n```python\n# Step 2: Z-Score 
Standardization\ndf_zscore = df[['n_filhos', 'salario', 'idade']].apply(lambda x: (x - x.mean()) / x.std())\n\n# Display the standardized data\ndf_zscore.head()\n```\n\n\u003cbr\u003e\n\n### Cell 5: [Evaluate the statistics after Z-Score standardization]()\n\n```python\n# Step 3: Evaluate the mean, std, max, and min after Z-Score standardization\nafter_zscore_stats = {\n    'mean_n_filhos_zscore': df_zscore['n_filhos'].mean(),\n    'std_n_filhos_zscore': df_zscore['n_filhos'].std(),\n    'min_n_filhos_zscore': df_zscore['n_filhos'].min(),\n    'max_n_filhos_zscore': df_zscore['n_filhos'].max(),\n    \n    'mean_salario_zscore': df_zscore['salario'].mean(),\n    'std_salario_zscore': df_zscore['salario'].std(),\n    'min_salario_zscore': df_zscore['salario'].min(),\n    'max_salario_zscore': df_zscore['salario'].max(),\n    \n    'mean_idade_zscore': df_zscore['idade'].mean(),\n    'std_idade_zscore': df_zscore['idade'].std(),\n    'min_idade_zscore': df_zscore['idade'].min(),\n    'max_idade_zscore': df_zscore['idade'].max(),\n}\n\n# Display the statistics after Z-Score standardization\nafter_zscore_stats\n```\n\n\u003cbr\u003e\n\n### Cell 6: [Apply Range Standardization]() (Min-Max Scaling)\n\n```python\n# Step 4: Range Standardization (Min-Max Scaling)\nscaler = MinMaxScaler()\ndf_range = pd.DataFrame(scaler.fit_transform(df[['n_filhos', 'salario', 'idade']]), columns=['n_filhos', 'salario', 'idade'])\n\n# Display the scaled data\ndf_range.head()\n```\n\n\u003cbr\u003e\n\n### Cell 7: [Evaluate the statistics after Range standardization]()\n\n```python\n# Step 5: Evaluate the mean, std, max, and min after Range standardization\nafter_range_stats = {\n    'mean_n_filhos_range': df_range['n_filhos'].mean(),\n    'std_n_filhos_range': df_range['n_filhos'].std(),\n    'min_n_filhos_range': df_range['n_filhos'].min(),\n    'max_n_filhos_range': df_range['n_filhos'].max(),\n    \n    'mean_salario_range': df_range['salario'].mean(),\n    'std_salario_range': 
df_range['salario'].std(),\n    'min_salario_range': df_range['salario'].min(),\n    'max_salario_range': df_range['salario'].max(),\n    \n    'mean_idade_range': df_range['idade'].mean(),\n    'std_idade_range': df_range['idade'].std(),\n    'min_idade_range': df_range['idade'].min(),\n    'max_idade_range': df_range['idade'].max(),\n}\n\n# Display the statistics after Range standardization\nafter_range_stats\n```\n\n\u003cbr\u003e\n\n## Practical Example for Calculating this Normalized Value in [Excel]() \n\n#### Use this [dataset](https://github.com/Quantum-Software-Development/Integrated_Project-Business/blob/f2d7abe6ee5853ae29c750170a01e429334f6fe5/HomeWork/1-Z-Score-Range/cadastro_funcionarios.xlsx)\n\nTo standardize the variables (salary, number of children, and age) in Excel using the Z-Score and Range methods, you can follow these steps:\n\n \u003cbr\u003e\n\n## I. [Z-Score Standardization]()\n\n### Steps for Z-Score in Excel:\n\n### 1. [Find the Mean (µ)]():\n\nUse the AVERAGE function to calculate the mean of the column. For example, to find the mean of the salary (column E), use:\n\n```excel\n=AVERAGE(E2:E351)\n```\n\n\u003cbr\u003e\n\n### 2. [Find the Standard Deviation (σ)]():\n   \nUse the STDEV.P function to calculate the population standard deviation of the column. Note that pandas `std()` defaults to the sample standard deviation, which corresponds to STDEV.S in Excel, so the Z-Scores from the two tools differ slightly unless the same variant is used in both. For example, to find the standard deviation of the salary (column E), use:\n\n```excel\n=STDEV.P(E2:E351)\n```\n\n\u003cbr\u003e\n\n### 3. [Apply the Z-Score Formula]():\n\nFor each value in the column, apply the Z-Score formula. 
In the first row of the new column, use:\n\n```excel\n=(E2 - AVERAGE(E$2:E$351)) / STDEV.P(E$2:E$351)\n```\n\n\u003cbr\u003e\n\n### 4. [Drag the formula down to calculate the Z-Score for all the rows]():\n\nExample for Salary:\n\nIn cell H2 (new column for standardized salary), write:\n\n```excel\n=(E2 - AVERAGE(E$2:E$351)) / STDEV.P(E$2:E$351)\n```\n\nThen, drag it down to the rest of the rows.\n\nRepeat the same steps for the variables n_filhos (column D) and idade (column F).\n\n\n\u003cbr\u003e\n\n## II. [Range Standardization]()\n\nSteps for Range Standardization in Excel:\n\n### 1. [Find the Min and Max]():\n\nUse the MIN and MAX functions to find the minimum and maximum values of the column. For example, to find the min and max of salary (column E), use:\n\n```excel\n=MIN(E2:E351)\n=MAX(E2:E351)\n```\n\n\u003cbr\u003e\n\n### 2. [Apply the Range Formula]():\n\nFor each value in the column, apply the range formula. In the first row of the new column, use:\n\n```excel\n=(E2 - MIN(E$2:E$351)) / (MAX(E$2:E$351) - MIN(E$2:E$351))\n```\n\n\u003cbr\u003e\n\n### 3. [Drag the formula down to calculate the range standardized values for all the rows]():\n\nExample for Salary:\n\nIn cell I2 (new column for range standardized salary), write:\n\n```excel\n=(E2 - MIN(E$2:E$351)) / (MAX(E$2:E$351) - MIN(E$2:E$351))\n```\n\nThen, drag it down to the rest of the rows.\n\nRepeat the same steps for the variables n_filhos (column D) and idade (column F).\n\n\u003cbr\u003e\n\n\n## Summary of the Process\n\n[Z-Score Standardization]() centers the data around [zero]() and scales it based on the [standard deviation]().\n\n[Range Standardization (Min-Max Scaling)]() rescales the data to a [[0, 1] range]().\n\nBoth techniques were applied to the [columns n_filhos](), [salario](), and [idade]() of the given dataset, and the statistics (mean, std, min, max) were calculated before and after each standardization method.\n\n\n\u003cbr\u003e\u003cbr\u003e\n\n\n## Important Notes\n\n- **Correlation 
does not imply causation**: Correlation between two variables does not necessarily mean that one causes the other. For example, there may be a correlation between the number of salespeople in a store and increased sales, but that does not imply that having more salespeople directly causes higher sales.\n\n- **Ordinary least squares regression does not require standardization**: Rescaling a predictor changes the units of its coefficient, but not the fitted values, the R-squared, or the p-values of an OLS model. This differs from methods like k-NN or neural networks, where the scale of the data can impact performance. Regularized regressions (for example ridge or lasso) are the exception, because their penalty depends on the scale of the coefficients.\n\n## Pearson Correlation\n\n**Pearson Correlation** is a statistical measure that describes the strength and direction of a linear relationship between two variables. The Pearson correlation value ranges from -1 to 1:\n\n- **1**: Perfect positive correlation (both variables increase together).\n- **-1**: Perfect negative correlation (one variable increases while the other decreases).\n- **0**: No linear correlation.\n\nFor example, if we're analyzing the correlation between the area of a house and its price, a Pearson value close to 1 would indicate that as the area increases, the price also tends to increase.\n\n## Simple Linear Regression\n\n**Simple Linear Regression** is a statistical model that describes the relationship between a dependent variable (response) and an independent variable (predictor). 
The model is represented by the formula:\n\n$$\ny = \\beta_0 + \\beta_1 \\cdot x\n$$\n\nWhere:\n- $y$ is the dependent variable (the one we want to predict),\n- $x$ is the independent variable (the one used to make predictions),\n- $\\beta_0$ is the intercept (the value of $y$ when $x = 0$),\n- $\\beta_1$ is the coefficient (representing the change in $y$ when $x$ increases by one unit).\n\nSimple linear regression is widely used for predicting a value based on a linear relationship between variables.\n\n### Steps to Perform Linear Regression:\n\n1. **Data Collection**: Gather the data related to the dependent and independent variables.\n2. **Exploratory Data Analysis (EDA)**: Explore the data to identify trends, patterns, and check correlation.\n3. **Model Fitting**: Fit the linear regression model to the data using a method like Ordinary Least Squares (OLS).\n4. **Model Evaluation**: Evaluate the model performance using metrics like Mean Squared Error (MSE) and the Coefficient of Determination ($R^2$).\n5. 
**Prediction**: Use the fitted model to make predictions with new data.\n\nSimple linear regression is a great starting point for predictive problems where a linear relationship between variables is expected.\n\n\n### I- Example Code - [Correlation Vendas Gjornal]()\n\n### Use This Dataset - [BD Gerais.xlsx](https://github.com/Quantum-Software-Development/Integrated_Project-Business/blob/4331d9227118d2025a6c167a3cefd99bf7404939/class_2-Linear%20Regression/BD%20Gerais.xlsx)\n\n\n\n### Step 1: Install Required Libraries\n\nIf you don't have the required libraries installed, you can install them with pip:\n\n```bash\npip install pandas numpy matplotlib scikit-learn openpyxl\n```\n\n\u003cbr\u003e\n\n```python\nimport pandas as pd\nimport numpy as np\nimport matplotlib.pyplot as plt\nfrom sklearn.linear_model import LinearRegression\nfrom sklearn.metrics import mean_squared_error, r2_score\n\n# Load the dataset from the Excel file\nfile_path = 'BD Gerais.xlsx'\ndf = pd.read_excel(file_path)\n\n# Display the first few rows of the dataset\nprint(df.head())\n\n# Let's assume the columns are: 'Vendas', 'Gjornal', 'GTV', 'Gmdireta'\n# Compute the correlation matrix for those columns\ncorrelation_matrix = df[['Vendas', 'Gjornal', 'GTV', 'Gmdireta']].corr()\nprint(\"\\nCorrelation Matrix:\")\nprint(correlation_matrix)\n\n# Perform linear regression: Let's use 'Vendas' as the target and 'Gjornal', 'GTV', 'Gmdireta' as features\nX = df[['Gjornal', 'GTV', 'Gmdireta']]  # Features\ny = df['Vendas']  # Target variable\n\n# Create and train the model\nmodel = LinearRegression()\nmodel.fit(X, y)\n\n# Print the regression coefficients\nprint(\"\\nRegression Coefficients:\")\nprint(f\"Intercept: {model.intercept_}\")\nprint(f\"Coefficients: {model.coef_}\")\n\n# Make predictions\ny_pred = model.predict(X)\n\n# Calculate Mean Squared Error and R-squared\nmse = mean_squared_error(y, y_pred)\nr2 = r2_score(y, y_pred)\n\nprint(f\"\\nMean Squared Error: {mse}\")\nprint(f\"R-squared: {r2}\")\n\n# Plot the actual vs predicted values\nplt.scatter(y, 
y_pred)\nplt.plot([min(y), max(y)], [min(y), max(y)], color='red', linestyle='--')\nplt.title('Actual vs Predicted Vendas')\nplt.xlabel('Actual Vendas')\nplt.ylabel('Predicted Vendas')\nplt.show()\n\n# Plot each feature against 'Vendas' together with the fitted line.\n# The other two features are held at their mean values; without that offset\n# the line from a multiple regression would not pass through the data.\nfeatures = ['Gjornal', 'GTV', 'Gmdireta']\nfeature_means = X.mean()\nfig, axs = plt.subplots(1, 3, figsize=(15, 5))\n\nfor i, feature in enumerate(features):\n    # Intercept plus the contribution of the other features at their means\n    offset = model.intercept_ + sum(\n        model.coef_[j] * feature_means[other]\n        for j, other in enumerate(features) if j != i\n    )\n    axs[i].scatter(df[feature], y, color='blue')\n    axs[i].plot(df[feature], offset + model.coef_[i] * df[feature], color='red')\n    axs[i].set_title(f'{feature} vs Vendas')\n    axs[i].set_xlabel(feature)\n    axs[i].set_ylabel('Vendas')\n\nplt.tight_layout()\nplt.show()\n```\n\n#\n\n### II- Example Code - [Correlation Vendas -  GTV]()\n\n### Use This Dataset - [BD Gerais.xlsx](https://github.com/Quantum-Software-Development/Integrated_Project-Business/blob/4331d9227118d2025a6c167a3cefd99bf7404939/class_2-Linear%20Regression/BD%20Gerais.xlsx)\n\nTo compute the correlation between the Vendas and GTV columns in your dataset using Python, you can follow this code. 
This will calculate the correlation coefficient and visualize the relationship between these two variables using a scatter plot.\n\n```python\nimport pandas as pd\nimport matplotlib.pyplot as plt\n```\n  \n\u003cbr\u003e\u003cbr\u003e\n\n### III - Multiple Linear Regression with 4 Variables \n\n- Vendas as the dependent variable (Y)\n  \n- Gjornal, GTV, and Gmdireta as independent variables (X)\n  \nThis code will also calculate the correlation matrix, fit the multiple linear regression model, and display the regression results.\n\n#### Python Code for Multiple Linear Regression and Correlation\n\n\u003cbr\u003e\n\n### 1- Install Required Libraries (if you don't have them yet)\n\n```bash\npip install pandas numpy matplotlib statsmodels scikit-learn\n```\n\n\u003cbr\u003e\n\n### 2- Python Code\n\n\u003cbr\u003e\n\n```python\nimport pandas as pd\nimport numpy as np\nimport statsmodels.api as sm\nimport matplotlib.pyplot as plt\nfrom sklearn.linear_model import LinearRegression\nfrom sklearn.metrics import mean_squared_error, r2_score\n\n# Load the dataset from the Excel file\nfile_path = 'BD Gerais.xlsx'  # Adjust the file path if needed\ndf = pd.read_excel(file_path)\n\n# Display the first few rows of the dataset to verify the data\nprint(df.head())\n\n# Calculate the correlation matrix for the variables\ncorrelation_matrix = df[['Vendas', 'Gjornal', 'GTV', 'Gmdireta']].corr()\nprint(\"\\nCorrelation Matrix:\")\nprint(correlation_matrix)\n\n# Define the independent variables (X) and the dependent variable (Y)\nX = df[['Gjornal', 'GTV', 'Gmdireta']]  # Independent variables\ny = df['Vendas']  # Dependent variable (Vendas)\n\n# Add a constant (intercept) to the independent variables\nX = sm.add_constant(X)\n\n# Fit the multiple linear regression model\nmodel = sm.OLS(y, X).fit()\n\n# Display the regression results\nprint(\"\\nRegression Results:\")\nprint(model.summary())\n\n# Alternatively, using sklearn's LinearRegression to calculate the coefficients and 
R-squared\nmodel_sklearn = LinearRegression()\nmodel_sklearn.fit(X[['Gjornal', 'GTV', 'Gmdireta']], y)\n\n# Coefficients and intercept\nprint(\"\\nLinear Regression Coefficients (sklearn):\")\nprint(\"Intercept:\", model_sklearn.intercept_)\nprint(\"Coefficients:\", model_sklearn.coef_)\n\n# Predicting with the model\ny_pred = model_sklearn.predict(X[['Gjornal', 'GTV', 'Gmdireta']])\n\n# Calculating R-squared and Mean Squared Error (MSE)\nr2 = r2_score(y, y_pred)\nmse = mean_squared_error(y, y_pred)\n\nprint(f\"\\nR-squared: {r2:.4f}\")\nprint(f\"Mean Squared Error: {mse:.4f}\")\n\n# Plotting the actual vs predicted Vendas\nplt.scatter(y, y_pred)\nplt.plot([y.min(), y.max()], [y.min(), y.max()], '--', color='red')  # line of perfect prediction\nplt.xlabel('Actual Vendas')\nplt.ylabel('Predicted Vendas')\nplt.title('Actual vs Predicted Vendas')\nplt.show()\n```\n\n\u003cbr\u003e\n\n## Code Explanation\n\n### [Loading Data]():\n\nThe dataset is loaded from BD Gerais.xlsx using pandas.read_excel(). Adjust the file path to match your actual file location.\n\n### [Correlation Matrix]():\n\nWe calculate the correlation matrix for the four variables: Vendas, Gjornal, GTV, and Gmdireta. 
This gives us an overview of the relationships between the variables.\n\n### [Multiple Linear Regression]():\n\nWe define the independent variables (Gjornal, GTV, Gmdireta) as X and the dependent variable (Vendas) as y.\n\nWe add a constant term to X using sm.add_constant() so that the model estimates an intercept.\n\nWe use sm.OLS from statsmodels to fit the multiple linear regression model and print the regression summary, which includes coefficients, R-squared, p-values, and more.\n\n### [Alternative Model (sklearn)]():\n\nWe also use sklearn.linear_model.LinearRegression() for comparison, which calculates the coefficients and R-squared.\n\nWe then use the trained model to predict the Vendas values and calculate Mean Squared Error (MSE) and R-squared.\n\n### [Plotting]():\n\nThe actual values of Vendas are plotted against the predicted values from the regression model in a scatter plot. A red line of perfect prediction is also added (this line represents the ideal case where actual values equal predicted values).\n\n### [Output of the Code]():\n\n### Correlation Matrix:\n\nDisplays the correlation between Vendas, Gjornal, GTV, and Gmdireta. 
This helps you understand the relationships between these variables.\n\n### Regression Results (from statsmodels):\n\nThe regression summary will include:\n\n- [Coefficients](): The relationship between each independent variable and the dependent variable (Vendas).\n- [R-squared](): Measures how well the model fits the data.\n- [P-values](): For testing the statistical significance of each coefficient.\n\n\u003cbr\u003e\n\n### [Linear Regression Coefficients]():\n\n\u003cbr\u003e\n\n- The model's intercept and coefficients are printed for comparison.\n\n- R-squared and Mean Squared Error (MSE) evaluate the performance of the regression model.\n\n- R-squared tells you how well the model explains the variance in the dependent variable.\n  \n- MSE gives an idea of the average squared difference between the predicted and actual values.\n\n  #\n  \n### [Plot]():\n\nThe plot shows how well the model's predicted Vendas values match the actual values.\n\n\n### Example Output (Model Summary from statsmodels):\n\n\u003cbr\u003e\n\n```plaintext\n                            OLS Regression Results\n==============================================================================\nDep. Variable:                 Vendas   R-squared:                       0.982\nModel:                            OLS   Adj. 
R-squared:                  0.980\nMethod:                 Least Squares   F-statistic:                     530.8\nDate:                Thu, 10 Mar 2025   Prob (F-statistic):           2.31e-14\n==============================================================================\n                 coef    std err          t      P\u003e|t|      [0.025      0.975]\n------------------------------------------------------------------------------\nconst         12.1532      3.001      4.055      0.001       5.892      18.414\nGjornal        2.4503      0.401      6.100      0.000       1.638       3.262\nGTV            1.2087      0.244      4.948      0.000       0.734       1.683\nGmdireta       0.5003      0.348      1.437      0.168      -0.190       1.191\n==============================================================================\n```\n\n### [In the summary]():\n\nR-squared of 0.982 indicates that the model explains 98.2% of the variance in `Vendas`.\n\nThe coefficients of the independent variables show how each variable affects `Vendas`.\n\n\u003cbr\u003e\n\nIf the p-value for `Gmdireta` is greater than 0.05, it means that `Gmdireta` is not statistically significant in explaining the variability in the dependent variable `Vendas`. In such a case, it's common practice to remove the variable from the model and perform the regression again with only the statistically significant variables.\n\n#### In this case, [you can exclude Gmdireta and rerun the regression model using only the remaining variables](): `Gjornal` and `GTV`.\n\n\u003cbr\u003e\n\n### [Why Remove Gmdireta]()?\n\n[P-value](): The `p-value` is used to test the null hypothesis that the coefficient of the variable is equal to zero (i.e., the variable has no effect). 
If the p-value is greater than `0.05`, it indicates that the variable is not statistically significant at the 5% level and doesn't provide much explanatory power in the model.\n\n[Adjusted R-squared](): Removing a non-significant variable never increases the R-squared, but it can raise the adjusted R-squared, which penalizes every extra predictor, and it can help reduce multicollinearity and overfitting.\n\n\u003cbr\u003e\n\n### Modified Python Code (Without Gmdireta)\n\nLet’s update the code by removing Gmdireta from the regression model and re-running the analysis with just Gjornal and GTV as the independent variables.\n\n```python\nimport pandas as pd\nimport numpy as np\nimport statsmodels.api as sm\nimport matplotlib.pyplot as plt\nfrom sklearn.linear_model import LinearRegression\nfrom sklearn.metrics import mean_squared_error, r2_score\n\n# Load the dataset from the Excel file\nfile_path = 'BD Gerais.xlsx'  # Adjust the file path if needed\ndf = pd.read_excel(file_path)\n\n# Display the first few rows of the dataset to verify the data\nprint(df.head())\n\n# Calculate the correlation matrix for the variables\ncorrelation_matrix = df[['Vendas', 'Gjornal', 'GTV']].corr()  # Excluding 'Gmdireta'\nprint(\"\\nCorrelation Matrix (without Gmdireta):\")\nprint(correlation_matrix)\n\n# Define the independent variables (X) and the dependent variable (Y)\nX = df[['Gjornal', 'GTV']]  # Independent variables (Gjornal and GTV)\ny = df['Vendas']  # Dependent variable (Vendas)\n\n# Add a constant (intercept) to the independent variables\nX = sm.add_constant(X)\n\n# Fit the multiple linear regression model\nmodel = sm.OLS(y, X).fit()\n\n# Display the regression results\nprint(\"\\nRegression Results (without Gmdireta):\")\nprint(model.summary())\n\n# Alternatively, using sklearn's LinearRegression to calculate the coefficients and R-squared\nmodel_sklearn = LinearRegression()\nmodel_sklearn.fit(X[['Gjornal', 'GTV']], y)\n\n# Coefficients and intercept\nprint(\"\\nLinear Regression Coefficients 
(sklearn):\")\nprint(\"Intercept:\", model_sklearn.intercept_)\nprint(\"Coefficients:\", model_sklearn.coef_)\n\n# Predicting with the model\ny_pred = model_sklearn.predict(X[['Gjornal', 'GTV']])\n\n# Calculating R-squared and Mean Squared Error (MSE)\nr2 = r2_score(y, y_pred)\nmse = mean_squared_error(y, y_pred)\n\nprint(f\"\\nR-squared: {r2:.4f}\")\nprint(f\"Mean Squared Error: {mse:.4f}\")\n\n# Plotting the actual vs predicted Vendas\nplt.scatter(y, y_pred)\nplt.plot([y.min(), y.max()], [y.min(), y.max()], '--', color='red')  # line of perfect prediction\nplt.xlabel('Actual Vendas')\nplt.ylabel('Predicted Vendas')\nplt.title('Actual vs Predicted Vendas')\nplt.show()\n```\n\u003cbr\u003e\n\n## Key Changes:\n\n### [Removed Gmdireta]():\n\nIn the regression model, Gmdireta was excluded as an independent variable.\n\nThe correlation matrix is now calculated using only Vendas, Gjornal, and GTV.\n\n### [Independent Variables (X)]():\n\nWe now use only Gjornal and GTV as the independent variables for the regression analysis.\n\n## Explanation of the Code:\n\n### [Correlation Matrix]():\n\nWe calculate the correlation matrix to examine the relationships between Vendas, Gjornal, and GTV only (without Gmdireta).\n\n### [Multiple Linear Regression (statsmodels)]():\n\nWe perform the Multiple Linear Regression with Gjornal and GTV as independent variables.\n\nThe regression summary will now show the coefficients, p-values, R-squared, and other statistics for the model with the reduced set of independent variables.\n\n### [Linear Regression (sklearn)]():\n\nWe also use `sklearn.linear_model.LinearRegression()` to perform the regression and output the intercept and coefficients for the model without `Gmdireta`.\n\n\u003cbr\u003e\n\n### [Prediction and Performance Metrics]():\n\nAfter fitting the regression model, we calculate the predicted values [y_pred]() for `Vendas` using the new model (without `Gmdireta`).\n\nWe calculate the [R-squared]() and [Mean Squared 
Error (MSE)]() to evaluate the model's performance.\n\nThe [R-squared]() tells us how much of the variance in `Vendas` is explained by `Gjornal` and `GTV`.\n\nThe [MSE]() tells us the average squared difference between the predicted and actual values.\n\n\u003cbr\u003e\n\n\n### [Plotting]():\n\nThe plot visualizes how well the predicted `Vendas` values match the actual values. The red line represents the ideal case where the predicted values equal the actual values.\n\n\u003cbr\u003e\n\n### [Example of Expected Output]() (Updated Model Summary):\n\n\u003cbr\u003e\n\n\n```plaintext\n\n                            OLS Regression Results\n==============================================================================\nDep. Variable:                 Vendas   R-squared:                       0.976\nModel:                            OLS   Adj. R-squared:                  0.974\nMethod:                 Least Squares   F-statistic:                     320.3\nDate:                Thu, 10 Mar 2025   Prob (F-statistic):           4.23e-10\n==============================================================================\n                 coef    std err          t      P\u003e|t|      [0.025      0.975]\n------------------------------------------------------------------------------\nconst         10.6345      2.591      4.107      0.000       5.146      16.123\nGjornal        2.8951      0.453      6.394      0.000       1.987       3.803\nGTV            1.3547      0.290      4.673      0.000       0.778       1.931\n==============================================================================\n```\n\n\u003cbr\u003e\n\n### [Interpretation]():\n\n[R-squared](): A value of 0.976 means that the independent variables `Gjornal` and `GTV` explain 97.6% of the variance in `Vendas`, which is a good fit.\n\n\n\u003cbr\u003e\n\n\n### [Coefficients]():\n\n\u003cbr\u003e\n\nThe coefficient for Gjornal (2.8951) tells us that for each unit increase in `Gjornal`, `Vendas` increases by 
approximately 2.90 units.\n\nThe coefficient for `GTV` (1.3547) tells us that for each unit increase in `GTV`, `Vendas` increases by approximately 1.35 units.\n\n[P-values](): Both `Gjornal` and `GTV` have very small p-values (much smaller than 0.05), indicating that they are statistically significant in predicting `Vendas`.\n\nBy removing the variable `Gmdireta` (which had a p-value greater than 0.05), the regression model now focuses on the variables that have a stronger statistical relationship with the dependent variable `Vendas`.\n\n--\u003e\n\n\n\u003cbr\u003e\u003cbr\u003e\n\n#\n\n###### \u003cp align=\"center\"\u003e Copyright 2025 Quantum Software Development. Code released under the [MIT License.](https://github.com/Quantum-Software-Development/Math/blob/3bf8270ca09d3848f2bf22f9ac89368e52a2fb66/LICENSE)\n\n\n\n\n  \n\n\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fquantum-software-development%2Fintegrated-project-for-business","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fquantum-software-development%2Fintegrated-project-for-business","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fquantum-software-development%2Fintegrated-project-for-business/lists"}