Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://github.com/s1dewalker/model_validation
Model Management in Python. Steps involved in Model Validation and tuning. Testing Model Assumptions in Factor Analysis with OLS Regression.
assumptions bias-variance cross-validation hyperparameter-tuning linear-regression-models model-validation ols-regression python regression regression-models tuning
Last synced: about 5 hours ago
JSON representation
- Host: GitHub
- URL: https://github.com/s1dewalker/model_validation
- Owner: s1dewalker
- Created: 2024-12-15T03:31:35.000Z (about 2 months ago)
- Default Branch: main
- Last Pushed: 2025-02-06T03:59:37.000Z (about 6 hours ago)
- Last Synced: 2025-02-06T04:41:43.741Z (about 5 hours ago)
- Topics: assumptions, bias-variance, cross-validation, hyperparameter-tuning, linear-regression-models, model-validation, ols-regression, python, regression, regression-models, tuning
- Language: Jupyter Notebook
- Homepage:
- Size: 6.11 MB
- Stars: 0
- Watchers: 1
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
Awesome Lists containing this project
README
## Example 1: Validating the assumptions of OLS regression in a Fama-French 3-factor model
### 1. Checking Multicollinearity of the features (independent variables) w/ a correlation matrix
### 2. Checking Linearity w/ scatter plots
### 3. Checking Independence of residuals w/ the Autocorrelation Function (ACF) and the Durbin-Watson (D-W) test
### 4. Checking Normality of residuals w/ a histogram
### 5. Checking Homoscedasticity (equal variance) of residuals w/ a scatter plot of residuals vs. fitted values (all five checks are sketched below)
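A minimal sketch of the five checks with statsmodels and matplotlib. The synthetic `factors` and `y` below are illustrative stand-ins for the notebook's Fama-French factors and excess returns; the column names are assumptions, not the repo's actual variables.

```python
# Sketch of the five assumption checks; `factors` and `y` are synthetic
# stand-ins for the Fama-French factors and excess returns.
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import statsmodels.api as sm
from statsmodels.graphics.tsaplots import plot_acf
from statsmodels.stats.stattools import durbin_watson

rng = np.random.default_rng(0)
factors = pd.DataFrame(rng.normal(size=(250, 3)), columns=["mkt_rf", "smb", "hml"])
y = 0.8 * factors["mkt_rf"] + 0.3 * factors["smb"] + rng.normal(scale=0.5, size=250)

X = sm.add_constant(factors)
model = sm.OLS(y, X).fit()
resid, fitted = model.resid, model.fittedvalues

# 1. Multicollinearity: pairwise correlations between the features
print(factors.corr())

# 2. Linearity: scatter each feature against the target
for col in factors.columns:
    plt.scatter(factors[col], y, s=5); plt.title(col); plt.show()

# 3. Independence of residuals: ACF plot plus Durbin-Watson statistic
#    (a D-W value near 2 suggests no first-order autocorrelation)
plot_acf(resid); plt.show()
print("Durbin-Watson:", durbin_watson(resid))

# 4. Normality of residuals: histogram
plt.hist(resid, bins=30); plt.show()

# 5. Homoscedasticity: residuals vs. fitted values should show no funnel shape
plt.scatter(fitted, resid, s=5); plt.axhline(0, color="k"); plt.show()
```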
Consequences:
### 1. Multicollinearity = redundancy = it becomes hard to tell which feature actually contributes to predicting the target
### 2. Non-linearity = the model won't capture the relationship closely, leading to large fitting errors
### 3. Autocorrelation in residuals = something important is missing; check for an omitted feature
### 4. Non-normality of residuals = statistical tests that assume normally distributed residuals won't hold; apply transformations to the features
### 5. Heteroscedasticity (non-constant variance) of residuals = less precision in the estimates
### [Check out Model Validation for OLS Regression in Factor analysis in Python](https://github.com/s1dewalker/Model_Validation/blob/main/Multi_Factor_Analysis3.ipynb)
## Example 2: Model validation and tuning in Random Forest Regression, on continuous data
### 1. Get the data
### 2. Define the target (y) and features (X)
### 3. Split the data into training and testing set (validation if required)
### 4. Initialize a model, set its parameters, and fit it on the training set | `X_train, y_train`
### 5. Predict on `X_test`
### 6. Accuracy or Error metrics on `y_test` | Ex: R squared
### 7. Bias-Variance trade-off check | Balancing underfitting and overfitting
### 8. Iterate to tune the model (from step 4)
### 9. Cross Validation | if the model is not generalizing well
### 10. Selecting the best model w/ Hyperparameter tuning (the full workflow is sketched below)
### [Check out Model Validation and Tuning for RFR in Python](https://github.com/s1dewalker/Model_Validation/blob/main/Model_Validation.ipynb)
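An end-to-end sketch of steps 1-10 with scikit-learn. The synthetic dataset and the parameter grid are illustrative assumptions, not the repo's actual data or settings.

```python
# Sketch of the 10-step workflow; dataset and grid values are illustrative.
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score
from sklearn.model_selection import GridSearchCV, cross_val_score, train_test_split

# 1-2. Get the data, define the target (y) and features (X)
X, y = make_regression(n_samples=500, n_features=8, noise=10.0, random_state=0)

# 3. Split into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# 4. Initialize a model, set parameters, and fit on the training set
rf = RandomForestRegressor(n_estimators=200, max_depth=6, random_state=0)
rf.fit(X_train, y_train)

# 5-6. Predict on X_test and score on y_test (R squared here)
y_pred = rf.predict(X_test)
print("Test R^2:", r2_score(y_test, y_pred))

# 7-8. Bias-variance check: compare training vs. testing scores, iterate
print("Train R^2:", rf.score(X_train, y_train))

# 9. Cross-validation if the model is not generalizing well
print("5-fold CV R^2:", cross_val_score(rf, X_train, y_train, cv=5).mean())

# 10. Select the best model via hyperparameter tuning
grid = GridSearchCV(RandomForestRegressor(random_state=0),
                    {"n_estimators": [100, 300], "max_depth": [4, 8, None]},
                    cv=5)
grid.fit(X_train, y_train)
print("Best params:", grid.best_params_)
```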
A few details:
## Bias-Variance trade-off
**Bias = failing to find the relationship b/w the data and the response** = ERROR due to OVERLY SIMPLISTIC models (underfitting)
**Variance = following the training data too closely** = ERROR due to OVERLY COMPLEX models (overfitting) that are SENSITIVE TO FLUCTUATIONS (noise) in the training data
High Bias + Low Variance: Underfitting (simpler models)
Low Bias + High Variance: Overfitting (complex models)
### **Training error high = Underfitting**
### **Testing error >> Training error = Overfitting**
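These two rules can be read off train vs. test scores directly. Here is an illustrative sketch using decision trees of two depths on synthetic data; the depths are hypothetical choices for the demonstration.

```python
# Diagnosing underfitting/overfitting from train vs. test R^2
# (synthetic data; depths chosen for illustration).
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

X, y = make_regression(n_samples=400, n_features=5, noise=15.0, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

for depth in (1, None):  # depth 1 underfits; unlimited depth tends to overfit
    tree = DecisionTreeRegressor(max_depth=depth, random_state=1).fit(X_train, y_train)
    print(f"max_depth={depth}: "
          f"train R^2={tree.score(X_train, y_train):.2f}, "
          f"test R^2={tree.score(X_test, y_test):.2f}")

# Low training score            -> high bias (underfitting)
# Training score >> test score  -> high variance (overfitting)
```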
## Cross Validation
###### (Cross-validation diagram by sharpsightlabs.com)
### Splitting the data into distinct subsets: each subset is used once as the test set while the remaining subsets form the training set, and the results from all splits are averaged.
Why use it?
- Better generalization: when models are not generalizing well (generalization is a model's ability to perform well on new, unseen data, not just the data it was trained on)
- Reliable evaluation
- Efficient use of data (when data is limited)
Types:
1. **cross_val_score** (scikit-learn's k-fold cross-validation helper)
2. **Leave-one-out cross-validation (LOOCV)**: **each data point is used once as the test set** (`cv = X.shape[0]`); use it when data is limited, but it is computationally expensive
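A hedged sketch of both options with scikit-learn; the linear model and synthetic data are illustrative. LOOCV is scored with mean squared error here because R squared is undefined on a single-sample test set.

```python
# k-fold CV via cross_val_score, then LOOCV by setting cv to the sample count.
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

X, y = make_regression(n_samples=60, n_features=4, noise=5.0, random_state=2)
model = LinearRegression()

# 1. k-fold cross-validation: average the score over 5 splits
print("5-fold R^2:", cross_val_score(model, X, y, cv=5).mean())

# 2. LOOCV: each data point is used once as the test set (cv = X.shape[0]);
#    good for limited data but expensive, since it fits n models
print("LOOCV MSE:",
      -cross_val_score(model, X, y, cv=X.shape[0],
                       scoring="neg_mean_squared_error").mean())
```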
##### [LinkedIn](https://www.linkedin.com/in/sujay-bhaumik-d12/) | [email protected] | [Research Works](https://github.com/s1dewalker/Research-Works)