Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
A Mathematical Intuition behind Linear Regression Algorithm
https://github.com/juzershakir/linear_regression
Last synced: about 1 month ago
A Mathematical Intuition behind Linear Regression Algorithm
- Host: GitHub
- URL: https://github.com/juzershakir/linear_regression
- Owner: JuzerShakir
- Created: 2018-07-07T07:11:48.000Z (over 6 years ago)
- Default Branch: master
- Last Pushed: 2021-11-04T16:43:10.000Z (about 3 years ago)
- Last Synced: 2024-10-09T03:40:58.445Z (3 months ago)
- Topics: algorithm, bias-variance, cost-function, feature-scaling, gradient-descent, house-price-prediction, hypothesis, linear-algebra, linear-equations, linear-regression, machine-learning, matrices, mean-normalization, mean-square-error, mse, multivariate-regression, partial-derivative, regularized-linear-regression, univariate-regressions, vector
- Homepage:
- Size: 296 KB
- Stars: 3
- Watchers: 2
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
README
# Linear Regression
###### Written by [Juzer Shakir](https://juzershakir.github.io/)

## Table of Contents
- [Description](#description)
- [Notations](#notations)
- [Definition](#definition)
- [Flowchart](#flowchart)
- [Univariate Linear Regression](#univariate-linear-regression)
- [Definition](#definition-for-univariate-linear-regression)
- [Formula](#formula-for-univariate-linear-regression)
- [Cost Function](#cost-function-for-univariate-linear-regression)
- [Gradient Descent](#gradient-descent-for-univariate-linear-regression)
- [Multivariate Linear Regression](#multivariate-linear-regression)
- [Definition](#definition-for-multivariate-linear-regression)
- [Formula](#formula-for-multivariate-linear-regression)
- [Cost Function](#cost-function-for-multivariate-linear-regression)
- [Gradient Descent](#gradient-descent-for-multivariate-linear-regression)
- [Feature Scaling and Mean Normalization](#feature-scaling-and-mean-normalization)
- [Bias - Variance](#bias---variance)
- [High Bias](#high-bias)
- [Just Right](#just-right)
- [High Variance](#high-variance)
- [Resolving High Variance](#resolving-high-variance)
- [Cost Function](#cost-function-for-regularization)
- [Gradient Descent](#gradient-descent-for-regularization)

## Description
A mathematical intuition and quick guide to understanding how the Linear Regression algorithm works, with links to other study materials to understand the concepts more concretely.

## Notations
- `m` 👉 Number of Training Examples.
- `x` 👉 "input" variable / features.
- `y` 👉 "output" variable / "target" variable.
- `n` 👉 Number of feature variables `(x)`.
- `(x, y)` 👉 One training example.
- `x(i)` , `y(i)` 👉 i-th training example.
- `x(i)j` 👉 value of the j-th feature in the i-th training example.

-----
## Definition
A linear equation that models a function such that if we give it any `x`, it will predict a value `y`, where `x` and `y` are the input and output variables respectively. These are numerical and continuous values. It is the simplest and most well-known algorithm used in machine learning.
## Flowchart
The flowchart above represents the following: we choose our training set and feed it to an algorithm, which learns the patterns and outputs a function called the `Hypothesis function 'H(x)'`. We can then give any `x` value to that function and it will output an estimated `y` value for it.
For historical reasons, this function `H(x)` is called `hypothesis function.`
-----
## Univariate Linear Regression
### Definition for Univariate Linear Regression
When you have one feature / variable `x` as an input to the function to predict `y`, we call this a `Univariate Linear Regression` problem.

### Formula for Univariate Linear Regression
H(x) = θ0 + θ1x
Another way of representing this formula, in a form we are more familiar with:
H(x) = b + mx
> Where :
>- b = θ0 👉 y intercept
>- m = θ1 👉 slope
>- x = x 👉 feature / input variable
> **Help** ✍🏼
> - Intuition behind linear equation.
> - Need to Practice?

### Cost Function for Univariate Linear Regression
All that said, how do we figure out the best possible straight line for the data that we feed it?

**This is where the `Cost Function` will help us:**

The best-fit line to our data will be the one where we have the least distance between the `predicted 'y' values` and the `actual 'y' values`.
#### Formula for Cost Function
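Written out (a standard form, consistent with the terms explained below), the squared-error cost is:

```latex
J(\theta_0, \theta_1) = \frac{1}{2m} \sum_{i=1}^{m} \left( h(x^{(i)}) - y^{(i)} \right)^2
```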
> Where :
>- h(x(i)) 👉 hypothesis function
>- y(i) 👉 actual values of `y`
>- 1/m 👉 gives the Mean of the Squared Errors
>- 1/2 👉 The mean is halved as a convenience for the computation of `Gradient Descent`.

The formula above takes the differences between the `predicted values` and the `actual values` over the training set, squares them, takes the average, and multiplies it by `1/2`.
This cost function is also called the `Squared Error Function` or `Mean Squared Error (MSE)`.
🙋 Why do we take the squares of the errors?

The `MSE` function is commonly used, is a reasonable choice, and works well for most regression problems.
Let's substitute the `MSE` function into function `J`:
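Substituting the hypothesis gives `J(θ0, θ1) = (1/2m) · Σ (θ0 + θ1·x(i) − y(i))²`. As a concrete sketch of computing this cost (a minimal NumPy example with made-up numbers, not code from this repository):

```python
import numpy as np

def cost(theta0, theta1, x, y):
    """Squared-error cost J(θ0, θ1) = (1/2m) * Σ (h(x(i)) - y(i))²."""
    m = len(x)
    predictions = theta0 + theta1 * x   # hypothesis h(x) = θ0 + θ1·x
    errors = predictions - y            # difference between predicted and actual y
    return (1 / (2 * m)) * np.sum(errors ** 2)

# made-up training data: y roughly follows 1 + 2x
x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([3.1, 4.9, 7.2, 9.0])

print(cost(1.0, 2.0, x, y))  # parameters close to the trend -> small cost
print(cost(0.0, 0.0, x, y))  # parameters far from the trend -> large cost
```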
> **Help** ✍🏼
> - Intuition behind Cost Function.

### Gradient Descent for Univariate Linear Regression
So now we have our hypothesis function and we have a way of measuring how well it fits the data. Next we need to estimate the parameters in the hypothesis function. That's where `Gradient Descent` comes in.
`Gradient Descent` is used to minimize the cost function `J`; minimizing `J` is the same as minimizing the `MSE`, which gives the best possible fit line for our data.

#### Formula for Gradient Descent
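In symbols (a standard form, consistent with the terms explained below), the update rule, repeated until convergence and applied simultaneously to every parameter, is:

```latex
\theta_j := \theta_j - \alpha \frac{\partial}{\partial \theta_j} J(\theta_0, \theta_1)
```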
> Where :
>- `:=` 👉 Is the Assignment Operator
>- `α` 👉 is `Alpha`, the learning rate. If it's too high, gradient descent may fail to converge; if it's too low, descending will be slow.
>- `θj` 👉 the parameter being updated, one per feature / column of the dataset.
> - ∂/(∂θj) J(θ0,θ1) 👉 the partial derivative of the `MSE` cost function with respect to `θj`.
> **Additional Resources** ✍🏼
> - Intuition behind Gradient Descent.
> - Partial Derivative.
**Now let's apply Gradient Descent to minimize our `MSE` function.**
In order to apply `Gradient Descent`, we need to figure out the partial derivative term.
So let's solve partial derivative of cost function `J`.
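Working them out (a standard result for the squared-error cost, using the notation above):

```latex
\frac{\partial}{\partial \theta_0} J(\theta_0, \theta_1) = \frac{1}{m} \sum_{i=1}^{m} \left( h(x^{(i)}) - y^{(i)} \right)

\frac{\partial}{\partial \theta_1} J(\theta_0, \theta_1) = \frac{1}{m} \sum_{i=1}^{m} \left( h(x^{(i)}) - y^{(i)} \right) x^{(i)}
```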
Now let's plug these two values into our `Gradient Descent`:
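That gives the updates (performed simultaneously, in the standard form):

```latex
\theta_0 := \theta_0 - \alpha \frac{1}{m} \sum_{i=1}^{m} \left( h(x^{(i)}) - y^{(i)} \right)

\theta_1 := \theta_1 - \alpha \frac{1}{m} \sum_{i=1}^{m} \left( h(x^{(i)}) - y^{(i)} \right) x^{(i)}
```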
> **Note :** 🚩
> - The Cost Function for Linear Regression is always a convex, or bowl-shaped, function, so it has no local minima, only one global minimum; Gradient Descent therefore always converges to the global minimum.
> - The above hypothesis function has 2 parameters, θ0 & θ1, so Gradient Descent will run on each of them, here two times: once for the feature and once for the base `(y-intercept)`, to get the minimum value of `J`. So if we have `n` features, Gradient Descent will run on all `n+1` parameters.

-----
## Multivariate Linear Regression
> **Linear Algebra** ✍🏼
> - Intro to Vectors & Scalars.
> - Combined Vector Operations.
> - Intro to Matrices.
> - Representing linear systems with matrices.
> - Add & subtract matrices.
> - Multiplying matrices.
> - Identity Matrix.
> - Properties of Matrix Multiplication.
> - Matrix Inverses.
> - Matrix Transpose.

### Definition for Multivariate Linear Regression
It's the same as `Univariate Linear Regression`, except it has more than one feature variable `(x)` to predict the target variable `(y)`.

### Formula for Multivariate Linear Regression
Our hypothesis function for `n` = 4 :
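Written out (the standard linear form the terms below describe):

```latex
h(x) = \theta_0 + \theta_1 x_1 + \theta_2 x_2 + \theta_3 x_3 + \theta_4 x_4
```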
> Where :
>- θ0 👉 y intercept
> - And the rest are the features `x` that help predict the `y` value.
>> **Intuition:**
>>In order to develop an intuition about this function, let's imagine that it represents the price of a house `(y)` based on the given features `(x)`; then we can think of it as:
>> - θ0 as the basic price of a house.
>> - θ1 as price per m².
>> - x1 as the area of the house (m²).
>> - θ2 as price/floor.
>> - x2 as number of floors.
>> - etc _(You get the idea)_
Let's set all the parameters:
θ0, θ1, θ2, θ3.......θn = θ
And let's set all the features:

x0, x1, x2, x3.......xn = x
> Where :
>- θ 👉 will be an `n+1` dimensional vector, because we also have θ0, which is not a feature.
> - x0 👉 is added just for convenience so that we can take the matrix multiplication of `θ` (as θᵀ) and `x`; we set x0 to 1, so this doesn't change the values.
> - x 👉 will also be an `n+1` dimensional vector.
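With x0 = 1, the hypothesis can then be written compactly as the matrix product mentioned above:

```latex
h(x) = \theta^{T} x = \theta_0 x_0 + \theta_1 x_1 + \dots + \theta_n x_n, \qquad x_0 = 1
```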
### Cost Function for Multivariate Linear Regression
So,
J(θ0, θ1, θ2, θ3.......θn) = J(θ)
Where,
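in the same squared-error form as before (a standard form, with `h(x)` now the multivariate hypothesis):

```latex
J(\theta) = \frac{1}{2m} \sum_{i=1}^{m} \left( h(x^{(i)}) - y^{(i)} \right)^2
```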
### Gradient Descent for Multivariate Linear Regression
**Applying Gradient Descent to minimize our `MSE` function, after solving the partial derivative of J(θ), we get:**
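For each parameter θj (with x0(i) = 1, updating all parameters simultaneously), the standard form is:

```latex
\theta_j := \theta_j - \alpha \frac{1}{m} \sum_{i=1}^{m} \left( h(x^{(i)}) - y^{(i)} \right) x_j^{(i)}
\qquad \text{simultaneously for } j = 0, 1, \dots, n
```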
**For example: if we had 2 features, this is how gradient descent would run on each parameter (see the sketch below):**
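As a concrete sketch of these per-parameter updates (a minimal NumPy example with made-up numbers and hypothetical variable names, not code from this repository):

```python
import numpy as np

# made-up training set with 2 features: m = 3 examples, columns are [x0 = 1, x1, x2]
X = np.array([[1.0, 2104.0, 3.0],
              [1.0, 1600.0, 3.0],
              [1.0, 2400.0, 4.0]])
y = np.array([400.0, 330.0, 369.0])

theta = np.zeros(3)     # [θ0, θ1, θ2]
alpha = 1e-8            # tiny learning rate because the features are not scaled yet
m = len(y)

for _ in range(1000):
    errors = X @ theta - y                 # h(x(i)) - y(i) for every example
    gradient = (1 / m) * (X.T @ errors)    # one partial derivative per parameter
    theta = theta - alpha * gradient       # simultaneous update of θ0, θ1, θ2

print(theta)
```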
------
## Feature Scaling and Mean Normalization
We can speed up `Gradient Descent` by having each of our input values in roughly the same range. This is because `θ` will descend quickly on small ranges and slowly on large ranges, and it will oscillate inefficiently down to the optimum (minimum) when the variables are very uneven.
The way to prevent this is to modify the ranges of our input variables so that they are all roughly the same. Ideally between :
-1 ≤ xi ≤ 1
OR
-0.5 ≤ xi ≤ 0.5
These aren't exact requirements; we are only trying to speed things up. The goal is to get all input variables into roughly one of these ranges.

This can make `Gradient Descent` run much faster and converge in far fewer iterations.
Two techniques to help with this are `Feature Scaling` and `Mean Normalization`.

- `Feature Scaling` involves dividing the input values by the range (i.e. max value - min value) of the input variable, resulting in new values.
- `Mean Normalization` involves subtracting the average of a feature variable from the values of the feature and then dividing by the _range_ of values or by the _standard deviation_, resulting in new values (written out below).
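Written out with the symbols explained below, the per-feature adjustment is:

```latex
x_j := \frac{x_j - \mu_j}{s_j}
```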
> Where:
> - μj 👉 Average of a Feature variable `j`.
> - sj 👉 Either the `Range` or the `Standard Deviation` of Feature `j`.
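As a concrete sketch (made-up feature values; `x.std()` could be swapped for the range `x.max() - x.min()`):

```python
import numpy as np

# one feature column, e.g. house sizes in m² (made-up values)
x = np.array([2104.0, 1600.0, 2400.0, 1416.0, 3000.0])

mu = x.mean()              # μj: average of the feature
s = x.std()                # sj: standard deviation (or use x.max() - x.min())
x_scaled = (x - mu) / s    # mean-normalized feature, roughly in the [-1, 1] range

print(x_scaled)
```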
-----

## Bias - Variance
### High Bias
Consider a problem of predicting `y`. The figure below shows the result of fitting a `hypothesis function` θ0 + θ1x to a dataset.
We see that the data doesn't really lie on a straight line and so the fit is not very good.
This figure is an instance of `Underfitting`, in which the data clearly shows structure not captured by the model `h(x)`.
`Underfitting` or `high bias` is when the form of our `h(x)` function maps poorly to the trend of the data. It is usually caused by a function that is too simple or uses too few features.

-----
### Just Right
If we add an extra feature x² and fit the `hypothesis function` θ0 + θ1x + θ2x², then we obtain a slightly better fit to the data.
------
### High Variance
If we add more features, it would intuitively seem that the model could perform much better; however, there is also a danger in adding too many features. Consider the result of fitting a 4ᵗʰ-order polynomial, where our `h(x)` is θ0 + θ1x + θ2x² + θ3x³ + θ4x⁴.
We see that even though the fitted curve passes through the data perfectly, we would not expect this to be a very good predictor of, for example, housing prices `y` for different areas `x`.
`Overfitting` or `High Variance` is caused by an `h(x)` function that fits the available data but does not generalize well (it is unable to predict new data accurately). It is usually caused by using too many features or by a complicated `h(x)` function that creates lots of unnecessary curves and angles unrelated to the data.
### Resolving High Variance
There are 2 main options to address the issue of Overfitting:
- **Regularization**
- Keep all features, but reduce the magnitude of parameters θj.
- It works well when we have a lot of slightly useful features.
- **Reduce number of features**
- Manually select which features to keep.
- Use `model selection` algorithm.
### Cost Function for Regularization
If we have overfitting from our `hypothesis function`, for example like in figure 3, we can reduce the weight that some of the terms in our function carry by increasing their cost.
For example, let's say we wanted to make the following function more quadratic:
θ0 + θ1x + θ2x² + θ3x³ + θ4x⁴
We'll want to eliminate the influence of θ3x³ and θ4x⁴. Without actually getting rid of these features or changing the form of our hypothesis, we can instead modify our cost function.
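For instance, with an illustrative (not prescribed) penalty constant of 1000 on each of those two parameters, the modified objective could look like:

```latex
\min_{\theta} \; \frac{1}{2m} \sum_{i=1}^{m} \left( h(x^{(i)}) - y^{(i)} \right)^2
+ 1000 \cdot \theta_3^2 + 1000 \cdot \theta_4^2
```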
We've added two extra terms at the end to inflate the cost of θ3 and θ4. Now, in order for the cost function to get close to `0`, we will have to reduce the values of θ3 and θ4 to near `0`. This will in turn greatly reduce the values of θ3x³ and θ4x⁴ in our `hypothesis function`.
As a result, we see that the new `H(x)` looks like a quadratic function but fits the data better compared to _figure 3_, thanks to the extra-small terms θ3x³ and θ4x⁴.
We can also regularize all of our parameters in a single summation as:
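In the standard regularized form (consistent with the note below that θ0 is not penalized):

```latex
J(\theta) = \frac{1}{2m} \left[ \sum_{i=1}^{m} \left( h(x^{(i)}) - y^{(i)} \right)^2
+ \lambda \sum_{j=1}^{n} \theta_j^2 \right]
```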
The `λ`, or `lambda`, is the regularization parameter. It determines how much the costs of our `θ` parameters are inflated.
Using the above cost function with the extra summation, we can smooth the output of our hypothesis function to reduce overfitting. If `lambda` is chosen to be too large, it may smooth our function too much, causing `High Bias` or `Underfitting`.

> **Note:**
> We penalize the parameters θ1 .... θn, but we don't penalize θ0; we treat it differently.

------
### Gradient Descent for Regularization
We will modify our gradient descent function to separate out θ0 from the rest of the parameters, because we do not want to penalize θ0.
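Written out, the separated updates (repeated until convergence, with x0(i) = 1) take the standard form:

```latex
\theta_0 := \theta_0 - \alpha \frac{1}{m} \sum_{i=1}^{m} \left( h(x^{(i)}) - y^{(i)} \right) x_0^{(i)}

\theta_j := \theta_j - \alpha \left[ \frac{1}{m} \sum_{i=1}^{m} \left( h(x^{(i)}) - y^{(i)} \right) x_j^{(i)}
+ \frac{\lambda}{m} \theta_j \right] \qquad j = 1, 2, \dots, n
```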
The term λ/m times θj performs our regularization.
With some manipulation our updated Gradient Descent for θj can also be represented as:
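Distributing α and grouping the θj terms, the update can be rearranged as:

```latex
\theta_j := \theta_j \left( 1 - \alpha \frac{\lambda}{m} \right)
- \alpha \frac{1}{m} \sum_{i=1}^{m} \left( h(x^{(i)}) - y^{(i)} \right) x_j^{(i)}
```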
The first term in the equation will always be less than 1. Intuitively, you can see it as reducing the value of θj by some amount on every update. The second term is exactly the same as it was in `Gradient Descent` without regularization.
-----