Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://github.com/s1dewalker/model_validation
Model Management in Python. Steps involved in Model Validation and tuning. Testing Model Assumptions in Factor Analysis with OLS Regression.
assumptions bias-variance cross-validation hyperparameter-tuning linear-regression-models model-validation ols-regression python regression regression-models tuning
Last synced: about 5 hours ago
JSON representation
- Host: GitHub
- URL: https://github.com/s1dewalker/model_validation
- Owner: s1dewalker
- Created: 2024-12-15T03:31:35.000Z (about 2 months ago)
- Default Branch: main
- Last Pushed: 2025-02-06T03:59:37.000Z (about 6 hours ago)
- Last Synced: 2025-02-06T04:41:43.741Z (about 5 hours ago)
- Topics: assumptions, bias-variance, cross-validation, hyperparameter-tuning, linear-regression-models, model-validation, ols-regression, python, regression, regression-models, tuning
- Language: Jupyter Notebook
- Homepage:
- Size: 6.11 MB
- Stars: 0
- Watchers: 1
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
Awesome Lists containing this project
README
## Example 1: Validating the assumptions of OLS regression in a Fama-French 3-factor model
### 1. Checking Multicollinearity of the features (independent variables) w/ a correlation matrix
### 2. Checking Linearity w/ scatter plots
### 3. Checking Independence of residuals w/ the Autocorrelation Function (ACF) and the Durbin-Watson (D-W) test
### 4. Checking Normality of residuals w/ a histogram
### 5. Checking Homoscedasticity (equal variance) of residuals w/ a scatter plot of residuals vs. fitted values (all five checks are sketched below)
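A minimal sketch of the five checks with statsmodels and matplotlib. The synthetic `factors` and `y` below are illustrative stand-ins for the notebook's Fama-French factors and excess returns; the column names are assumptions, not the repo's actual variables.

```python
# Sketch of the five assumption checks; `factors` and `y` are synthetic
# stand-ins for the Fama-French factors and excess returns.
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import statsmodels.api as sm
from statsmodels.graphics.tsaplots import plot_acf
from statsmodels.stats.stattools import durbin_watson

rng = np.random.default_rng(0)
factors = pd.DataFrame(rng.normal(size=(250, 3)), columns=["mkt_rf", "smb", "hml"])
y = 0.8 * factors["mkt_rf"] + 0.3 * factors["smb"] + rng.normal(scale=0.5, size=250)

X = sm.add_constant(factors)
model = sm.OLS(y, X).fit()
resid, fitted = model.resid, model.fittedvalues

# 1. Multicollinearity: pairwise correlations between the features
print(factors.corr())

# 2. Linearity: scatter each feature against the target
for col in factors.columns:
    plt.scatter(factors[col], y, s=5); plt.title(col); plt.show()

# 3. Independence of residuals: ACF plot plus Durbin-Watson statistic
#    (a D-W value near 2 suggests no first-order autocorrelation)
plot_acf(resid); plt.show()
print("Durbin-Watson:", durbin_watson(resid))

# 4. Normality of residuals: histogram
plt.hist(resid, bins=30); plt.show()

# 5. Homoscedasticity: residuals vs. fitted values should show no funnel shape
plt.scatter(fitted, resid, s=5); plt.axhline(0, color="k"); plt.show()
```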
Consequences:
### 1. Multicollinearity = redundancy = it becomes hard to tell which feature actually contributes to predicting the target
### 2. Non-linearity = the model won't capture the relationship closely, leading to large fitting errors
### 3. Autocorrelation in residuals = something important is missing; check for an omitted feature
### 4. Non-normality of residuals = statistical tests that assume normally distributed residuals won't hold; apply transformations to the features
### 5. Heteroscedasticity (non-constant variance) of residuals = less precision in the estimates
### [Check out Model Validation for OLS Regression in Factor analysis in Python](https://github.com/s1dewalker/Model_Validation/blob/main/Multi_Factor_Analysis3.ipynb)
## Example 2: Model validation and tuning in Random Forest Regression, on continuous data
### 1. Get the data
### 2. Define the target (y) and features (X)
### 3. Split the data into training and testing set (validation if required)
### 4. Initialize a model, set its parameters, and fit it on the training set | `X_train, y_train`
### 5. Predict on `X_test`
### 6. Accuracy or Error metrics on `y_test` | Ex: R squared
### 7. Bias-Variance trade-off check | Balancing underfitting and overfitting
### 8. Iterate to tune the model (from step 4)
### 9. Cross Validation | if the model is not generalizing well
### 10. Selecting the best model w/ Hyperparameter tuning (the full workflow is sketched below)
### [Check out Model Validation and Tuning for RFR in Python](https://github.com/s1dewalker/Model_Validation/blob/main/Model_Validation.ipynb)
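An end-to-end sketch of steps 1-10 with scikit-learn. The synthetic dataset and the parameter grid are illustrative assumptions, not the repo's actual data or settings.

```python
# Sketch of the 10-step workflow; dataset and grid values are illustrative.
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score
from sklearn.model_selection import GridSearchCV, cross_val_score, train_test_split

# 1-2. Get the data, define the target (y) and features (X)
X, y = make_regression(n_samples=500, n_features=8, noise=10.0, random_state=0)

# 3. Split into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# 4. Initialize a model, set parameters, and fit on the training set
rf = RandomForestRegressor(n_estimators=200, max_depth=6, random_state=0)
rf.fit(X_train, y_train)

# 5-6. Predict on X_test and score on y_test (R squared here)
y_pred = rf.predict(X_test)
print("Test R^2:", r2_score(y_test, y_pred))

# 7-8. Bias-variance check: compare training vs. testing scores, iterate
print("Train R^2:", rf.score(X_train, y_train))

# 9. Cross-validation if the model is not generalizing well
print("5-fold CV R^2:", cross_val_score(rf, X_train, y_train, cv=5).mean())

# 10. Select the best model via hyperparameter tuning
grid = GridSearchCV(RandomForestRegressor(random_state=0),
                    {"n_estimators": [100, 300], "max_depth": [4, 8, None]},
                    cv=5)
grid.fit(X_train, y_train)
print("Best params:", grid.best_params_)
```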
A few details:
## Bias-Variance trade-off
**Bias = failing to find the relationship b/w the data and the response** = ERROR due to OVERLY SIMPLISTIC models (underfitting)
**Variance = following the training data too closely** = ERROR due to OVERLY COMPLEX models (overfitting) that are SENSITIVE TO FLUCTUATIONS (noise) in the training data
High Bias + Low Variance: Underfitting (simpler models)
Low Bias + High Variance: Overfitting (complex models)
### **Training error high = Underfitting**
### **Testing error >> Training error = Overfitting**
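These two rules can be read off train vs. test scores directly. Here is an illustrative sketch using decision trees of two depths on synthetic data; the depths are hypothetical choices for the demonstration.

```python
# Diagnosing underfitting/overfitting from train vs. test R^2
# (synthetic data; depths chosen for illustration).
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

X, y = make_regression(n_samples=400, n_features=5, noise=15.0, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

for depth in (1, None):  # depth 1 underfits; unlimited depth tends to overfit
    tree = DecisionTreeRegressor(max_depth=depth, random_state=1).fit(X_train, y_train)
    print(f"max_depth={depth}: "
          f"train R^2={tree.score(X_train, y_train):.2f}, "
          f"test R^2={tree.score(X_test, y_test):.2f}")

# Low training score            -> high bias (underfitting)
# Training score >> test score  -> high variance (overfitting)
```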
## Cross Validation
###### (Cross-validation diagram by sharpsightlabs.com)
### Splitting the data into distinct subsets: each subset is used once as the test set while the remaining subsets form the training set, and the results from all splits are averaged.
Why use it?
- Better generalization: when models are not generalizing well (generalization is a model's ability to perform well on new, unseen data, not just the data it was trained on)
- Reliable evaluation
- Efficient use of data (when data is limited)
Types:
1. **cross_val_score** (scikit-learn's k-fold cross-validation helper)
2. **Leave-one-out cross-validation (LOOCV)**: **each data point is used once as the test set** (`cv = X.shape[0]`); use it when data is limited, but it is computationally expensive
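A hedged sketch of both options with scikit-learn; the linear model and synthetic data are illustrative. LOOCV is scored with mean squared error here because R squared is undefined on a single-sample test set.

```python
# k-fold CV via cross_val_score, then LOOCV by setting cv to the sample count.
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

X, y = make_regression(n_samples=60, n_features=4, noise=5.0, random_state=2)
model = LinearRegression()

# 1. k-fold cross-validation: average the score over 5 splits
print("5-fold R^2:", cross_val_score(model, X, y, cv=5).mean())

# 2. LOOCV: each data point is used once as the test set (cv = X.shape[0]);
#    good for limited data but expensive, since it fits n models
print("LOOCV MSE:",
      -cross_val_score(model, X, y, cv=X.shape[0],
                       scoring="neg_mean_squared_error").mean())
```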
##### [LinkedIn](https://www.linkedin.com/in/sujay-bhaumik-d12/) | [email protected] | [Research Works](https://github.com/s1dewalker/Research-Works)