https://github.com/markphamm/data-mining
- Host: GitHub
- URL: https://github.com/markphamm/data-mining
- Owner: MarkPhamm
- Created: 2025-01-22T22:09:17.000Z (4 months ago)
- Default Branch: main
- Last Pushed: 2025-02-19T22:54:36.000Z (3 months ago)
- Last Synced: 2025-02-19T23:29:40.997Z (3 months ago)
- Language: Python
- Size: 1.6 MB
- Stars: 0
- Watchers: 1
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
README
# Data-Mining
# Note
- Replace empty cells with plausible values (mean, median, or mode); the goal is not a perfectly precise value, but a better result than discarding the data.
- Encode textual data with a one-hot encoder or a label encoder (see the sketch below).
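A minimal preprocessing sketch, assuming scikit-learn and pandas; the columns (`Country`, `Age`, `Salary`, `Purchased`) are hypothetical placeholders, not data from this repo:

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder, LabelEncoder

df = pd.DataFrame({
    "Country": ["France", "Spain", "Germany", "Spain"],
    "Age": [44.0, np.nan, 30.0, 38.0],
    "Salary": [72000.0, 48000.0, np.nan, 61000.0],
    "Purchased": ["No", "Yes", "No", "Yes"],
})

# Replace empty cells with the column mean (median/mode work the same way).
imputer = SimpleImputer(strategy="mean")
df[["Age", "Salary"]] = imputer.fit_transform(df[["Age", "Salary"]])

# One-hot encode a categorical feature (each level becomes its own 0/1 column).
onehot = OneHotEncoder()
country_dummies = onehot.fit_transform(df[["Country"]]).toarray()

# Label encode the target (maps "No"/"Yes" to 0/1).
label = LabelEncoder()
y = label.fit_transform(df["Purchased"])
```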
## Fit vs Transform
- fit: calculates the parameters (e.g. the mean for an imputer, the categories for an encoder) and stores them inside the object
- transform: applies those stored parameters to data to produce the transformed values
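A short sketch of the difference, using a `StandardScaler` purely as an illustrative transformer (an assumption, not necessarily what this repo uses): fit learns the parameters from the training set only, and transform reuses them on new data.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X = np.arange(20, dtype=float).reshape(-1, 1)
X_train, X_test = train_test_split(X, test_size=0.25, random_state=0)

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # fit (learn mean/std) + transform
X_test_scaled = scaler.transform(X_test)        # transform only, reusing the training mean/std
```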
# Simple Linear Regression
Simple linear regression can be represented by the following equation:

y = β₀ + β₁x + ε
Where:
- y is the dependent variable
- x is the independent variable
- β₀ is the y-intercept
- β₁ is the slope (coefficient)
- ε is the error term
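A minimal fitting sketch with scikit-learn on synthetic data; the true β₀ and β₁ below are chosen only for illustration:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=(50, 1))
y = 3.0 + 2.0 * x[:, 0] + rng.normal(scale=1.0, size=50)  # β₀ = 3, β₁ = 2, plus noise ε

model = LinearRegression().fit(x, y)
print("intercept (β₀):", model.intercept_)
print("slope (β₁):", model.coef_[0])
```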
## Assumptions for Linear Regression
1. Linearity: The relationship between the independent and dependent variables should be linear.
2. Homoscedasticity: The residuals (errors) should have constant variance at every level of the independent variable.
3. Multivariate normality: The residuals should be normally distributed.
4. No multicollinearity: Independent variables should not be too highly correlated with each other.
5. Independence of errors: The residuals should be independent of each other.

## Checking the Assumptions
To ensure these assumptions are met, it is important to:
- Visualize relationships using scatter plots to check for linearity.
- Use residual plots to assess homoscedasticity.
- Conduct normality tests (e.g., Shapiro-Wilk test) on residuals.
- Calculate Variance Inflation Factor (VIF) to detect multicollinearity.
- Analyze residuals over time or across observations to confirm independence (see the diagnostics sketch below).
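A diagnostics sketch, assuming statsmodels and scipy; the data and column names are synthetic placeholders:

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm
from scipy.stats import shapiro
from statsmodels.stats.outliers_influence import variance_inflation_factor

rng = np.random.default_rng(0)
X = pd.DataFrame({"x1": rng.normal(size=100), "x2": rng.normal(size=100)})
y = 1.0 + 2.0 * X["x1"] - 0.5 * X["x2"] + rng.normal(scale=0.5, size=100)

model = sm.OLS(y, sm.add_constant(X)).fit()
residuals = model.resid

# Normality of residuals (Shapiro-Wilk): a small p-value suggests non-normal residuals.
stat, p_value = shapiro(residuals)
print("Shapiro-Wilk p-value:", p_value)

# Multicollinearity: VIF for each predictor (values above roughly 5-10 are a warning sign).
X_const = sm.add_constant(X)
vif = pd.Series(
    [variance_inflation_factor(X_const.values, i) for i in range(1, X_const.shape[1])],
    index=X.columns,
)
print(vif)

# Homoscedasticity / independence: inspect residuals against fitted values or over time,
# e.g. plt.scatter(model.fittedvalues, residuals)
```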
# Multiple Linear Regression
Multiple linear regression can be represented by the following equation:

y = β₀ + β₁x₁ + β₂x₂ + ... + βₙxₙ + ε
Where:
- y is the dependent variable
- x₁, x₂, ..., xₙ are the independent variables
- β₀ is the y-intercept
- β₁, β₂, ..., βₙ are the coefficients
- ε is the error term
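A minimal sketch with scikit-learn, using three synthetic predictors chosen only for illustration:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(1)
X = rng.uniform(0, 10, size=(100, 3))  # x₁, x₂, x₃
y = 5.0 + 1.5 * X[:, 0] - 2.0 * X[:, 1] + 0.5 * X[:, 2] + rng.normal(scale=1.0, size=100)

model = LinearRegression().fit(X, y)
print("intercept (β₀):", model.intercept_)
print("coefficients (β₁..β₃):", model.coef_)
```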
## Model Building Methods
1. **All-in**: Use all the predictors available in the dataset.
2. **Backward Elimination**: Start with all predictors and remove the least significant one at each step.
3. **Forward Selection**: Start with no predictors and add the most significant one at each step.
4. **Bidirectional Elimination**: Combine forward selection and backward elimination.
5. **Score Comparison**: Compare different models using a scoring metric such as AIC or BIC and select the best one (see the sketch below).
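A sketch of score comparison assuming statsmodels, which exposes `aic` and `bic` on a fitted OLS result; the candidate feature sets are hypothetical (lower scores are better):

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm

rng = np.random.default_rng(2)
X = pd.DataFrame({"x1": rng.normal(size=200),
                  "x2": rng.normal(size=200),
                  "x3": rng.normal(size=200)})
y = 1.0 + 2.0 * X["x1"] + 0.5 * X["x2"] + rng.normal(scale=1.0, size=200)  # x3 is irrelevant

# Fit each candidate model and compare AIC/BIC.
for cols in (["x1"], ["x1", "x2"], ["x1", "x2", "x3"]):
    fit = sm.OLS(y, sm.add_constant(X[cols])).fit()
    print(cols, "AIC:", round(fit.aic, 1), "BIC:", round(fit.bic, 1))
```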
### **Backward Elimination**
1. **Step 1**: Select a significance level to stay in the model (e.g. SL = 0.05)
2. **Step 2**: Fit the full model with all possible predictors
3. **Step 3**: Consider the predictor with the highest P-value. If P > SL, remove the predictor
4. **Step 4**: Fit the model without this variable and repeat step 3
5. **Step 5**: Stop when no more predictors can be removed (a sketch follows below)
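A sketch of the backward-elimination loop above, assuming statsmodels for the P-values; the data and column names are synthetic:

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm

SL = 0.05
rng = np.random.default_rng(3)
X = pd.DataFrame({f"x{i}": rng.normal(size=200) for i in range(1, 6)})
y = 2.0 + 1.5 * X["x1"] - 1.0 * X["x3"] + rng.normal(scale=1.0, size=200)  # only x1, x3 matter

selected = list(X.columns)
while selected:
    fit = sm.OLS(y, sm.add_constant(X[selected])).fit()
    pvalues = fit.pvalues.drop("const")   # P-values of the remaining predictors
    worst = pvalues.idxmax()
    if pvalues[worst] > SL:
        selected.remove(worst)            # drop the least significant predictor and refit
    else:
        break                             # all remaining predictors are significant
print("selected predictors:", selected)
```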
### **Forward Selection**
1. **Step 1**: Select a significance level to enter the model (e.g. SL = 0.05)
2. **Step 2**: Fit all simple regression models y ~ xₙ and select the one with the lowest P-value
3. **Step 3**: Keep this variable and fit all possible models with one extra predictor to the one(s) you already have
4. **Step 4**: Consider the predictor with the lowest P-value. If P < SL, go back to Step 3; otherwise finish (FIN) and keep the previous model (a sketch follows below)
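A sketch of the forward-selection loop above, again assuming statsmodels; the data and column names are synthetic:

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm

SL = 0.05
rng = np.random.default_rng(4)
X = pd.DataFrame({f"x{i}": rng.normal(size=200) for i in range(1, 6)})
y = 1.0 + 2.0 * X["x2"] + 0.8 * X["x4"] + rng.normal(scale=1.0, size=200)  # only x2, x4 matter

selected = []
remaining = list(X.columns)
while remaining:
    # P-value of each candidate when added to the current model.
    pvals = {}
    for col in remaining:
        fit = sm.OLS(y, sm.add_constant(X[selected + [col]])).fit()
        pvals[col] = fit.pvalues[col]
    best = min(pvals, key=pvals.get)
    if pvals[best] < SL:
        selected.append(best)
        remaining.remove(best)
    else:
        break  # FIN: no remaining candidate is significant
print("selected predictors:", selected)
```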
### **Bidirectional Elimination**
1. **Step 1**: Select significance levels to enter and to stay in the model, e.g. SLEnter = 0.05, SLStay = 0.05
2. **Step 2**: Perform the steps of Forward Selection:
- Start with no predictors in the model.
- Add predictors one by one based on the lowest P-value, provided it is less than SLEnter.
3. **Step 3**: After adding a new predictor, perform Backward Elimination:
- Check all predictors in the model.
- Remove the predictor with the highest P-value if it is greater than SLStay.
4. **Step 4**: Repeat steps 2 and 3 until no new predictors can be added and no existing predictors can be removed.
5. **Step 5**: Finalize the model when no further changes can be made (a sketch follows below).
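A sketch of the stepwise loop above (one forward step, then a backward check), assuming statsmodels; the data and column names are synthetic:

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm

SL_ENTER, SL_STAY = 0.05, 0.05
rng = np.random.default_rng(5)
X = pd.DataFrame({f"x{i}": rng.normal(size=300) for i in range(1, 7)})
y = 1.0 + 1.5 * X["x1"] - 2.0 * X["x5"] + rng.normal(scale=1.0, size=300)

selected = []
changed = True
while changed:
    changed = False
    # Forward step: try to add the candidate with the lowest P-value.
    remaining = [c for c in X.columns if c not in selected]
    pvals = {c: sm.OLS(y, sm.add_constant(X[selected + [c]])).fit().pvalues[c]
             for c in remaining}
    if pvals:
        best = min(pvals, key=pvals.get)
        if pvals[best] < SL_ENTER:
            selected.append(best)
            changed = True
    # Backward step: drop any included predictor whose P-value exceeds SLStay.
    if selected:
        fit = sm.OLS(y, sm.add_constant(X[selected])).fit()
        pvalues = fit.pvalues.drop("const")
        worst = pvalues.idxmax()
        if pvalues[worst] > SL_STAY:
            selected.remove(worst)
            changed = True
print("selected predictors:", selected)
```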
## Rule: Dealing with Backward Elimination
1. Don't remove coefficients arbitrarily.
2. Keep all dummy variables of a categorical feature together or remove them all at once; only when the P-values of all levels are insignificant can the whole categorical variable be dropped.
# Polynomial Regression
Polynomial regression can be represented by the following equation:

y = β₀ + β₁x + β₂x² + β₃x³ + ... + βₙxⁿ + ε
Where:
- y is the dependent variable
- x is the independent variable
- β₀ is the y-intercept
- β₁, β₂, ..., βₙ are the coefficients of the polynomial
- n is the degree of the polynomial
- ε is the error term
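A sketch assuming scikit-learn: `PolynomialFeatures` expands x into its powers up to degree n, and a plain linear regression is then fitted on the expanded features; the degree and coefficients below are illustrative only.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(6)
x = np.linspace(0, 5, 60).reshape(-1, 1)
y = 1.0 + 0.5 * x[:, 0] - 2.0 * x[:, 0] ** 2 + rng.normal(scale=1.0, size=60)

poly = PolynomialFeatures(degree=2, include_bias=False)  # builds columns x, x²
x_poly = poly.fit_transform(x)

model = LinearRegression().fit(x_poly, y)
print("intercept (β₀):", model.intercept_)
print("coefficients (β₁, β₂):", model.coef_)
```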