https://github.com/chris-santiago/steps
A SciKit-Learn style feature selector using best subsets and stepwise regression.
best-subset-selection data-science python scikit-learn stepwise-selection
- Host: GitHub
- URL: https://github.com/chris-santiago/steps
- Owner: chris-santiago
- License: mit
- Created: 2021-07-31T18:10:06.000Z (almost 5 years ago)
- Default Branch: master
- Last Pushed: 2024-03-27T13:50:15.000Z (about 2 years ago)
- Last Synced: 2025-06-27T19:13:25.783Z (11 months ago)
- Topics: best-subset-selection, data-science, python, scikit-learn, stepwise-selection
- Language: Jupyter Notebook
- Homepage: https://chris-santiago.github.io/steps/
- Size: 781 KB
- Stars: 5
- Watchers: 1
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE.txt
# step-select
A SciKit-Learn style feature selector using best subsets and stepwise regression.
## Install
Create a virtual environment with Python 3.8 and install from PyPI:
```bash
pip install step-select
```
## Use
### Preliminaries
*Note: this example requires two additional packages*: `pandas` and `statsmodels`.
In this example we'll show how the `ForwardSelector` and `SubsetSelector` classes can be used on their own or in conjunction with a Scikit-Learn `Pipeline` object.
```python
import pandas as pd
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LinearRegression
import statsmodels.datasets
from statsmodels.api import OLS
from statsmodels.tools import add_constant
from steps.forward import ForwardSelector
from steps.subset import SubsetSelector
```
We'll download the `auto` dataset via `statsmodels`, using `mpg` as the endogenous variable and the remaining variables as exogenous. We won't use `make`, since encoding it would create several dummy variables and push the number of parameters past 12, which is too many for the `SubsetSelector` class; we'll also drop `price`.
```python
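# Download the `auto` dataset
data = statsmodels.datasets.webuse('auto')
# Encode `foreign` as a 0/1 indicator
data['foreign'] = pd.Series([x == 'Foreign' for x in data['foreign']]).astype(int)
# Replace missing values with 0
data.fillna(0, inplace=True)
data.head()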
```
| |make|price|mpg|rep78|headroom|trunk|weight|length|turn|displacement|gear_ratio|foreign|
|-|----|-----|---|-----|--------|-----|------|------|----|------------|----------|-------|
|0|AMC Concord|4099|22|3.0|2.5|11|2930|186|40|121|3.58|0|
|1|AMC Pacer|4749|17|3.0|3.0|11|3350|173|40|258|2.53|0|
|2|AMC Spirit|3799|22|0.0|3.0|12|2640|168|35|121|3.08|0|
|3|Buick Century|4816|20|3.0|4.5|16|3250|196|40|196|2.93|0|
|4|Buick Electra|7827|15|4.0|4.0|20|4080|222|43|350|2.41|0|
```python
X = data.iloc[:, 3:]
y = data['mpg']
```
### Forward Stepwise Selection
The `ForwardSelector` follows the standard stepwise regression algorithm: begin with a null model, iteratively test each variable and select the one that gives the most statistically significant improvement of the fit, and repeat. This greedy algorithm continues until the fit no longer improves.
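The greedy procedure described above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the `ForwardSelector` implementation: the helper names `ols_aic` and `forward_select` are hypothetical, the data is synthetic, and the AIC is computed from the Gaussian log-likelihood of an OLS fit with an intercept.

```python
import numpy as np


def ols_aic(X, y):
    """AIC of a Gaussian OLS fit with an intercept (hypothetical helper)."""
    n = len(y)
    design = np.column_stack([np.ones(n), X])  # prepend the intercept column
    beta, *_ = np.linalg.lstsq(design, y, rcond=None)
    rss = float(np.sum((y - design @ beta) ** 2))
    # Gaussian log-likelihood at the maximum-likelihood variance rss / n
    loglik = -0.5 * n * (np.log(2 * np.pi) + np.log(rss / n) + 1)
    return 2 * design.shape[1] - 2 * loglik


def forward_select(X, y):
    """Add the feature that lowers AIC the most; stop when none does."""
    n_features = X.shape[1]
    selected = []
    best_aic = ols_aic(X[:, selected], y)  # null (intercept-only) model
    while len(selected) < n_features:
        candidates = [j for j in range(n_features) if j not in selected]
        scores = {j: ols_aic(X[:, selected + [j]], y) for j in candidates}
        best_j = min(scores, key=scores.get)
        if scores[best_j] >= best_aic:  # no candidate improves the fit
            break
        selected.append(best_j)
        best_aic = scores[best_j]
    mask = np.zeros(n_features, dtype=bool)
    mask[selected] = True
    return mask, best_aic


# Synthetic data: only features 0 and 3 carry signal
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 6))
y = 3.0 * X[:, 0] - 2.0 * X[:, 3] + rng.normal(scale=0.5, size=200)

mask, aic = forward_select(X, y)
print(mask, round(aic, 1))
```

The function returns a boolean mask so the result can be used the same way as the `.best_support_` attribute described below. Because the search is greedy, it fits at most a few dozen models here, but it is not guaranteed to find the subset with the globally lowest AIC.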
The `ForwardSelector` is instantiated with two parameters: `normalize` and `metric`. The `normalize` parameter defaults to `False`, assuming that this class is part of a larger pipeline; `metric` defaults to `aic`.
|Parameter|Type|Description|
|---------|----|-----------|
|normalize|bool|Whether to normalize features; default `False`|
|metric|str|Optimization metric to use; must be one of `aic` or `bic`; default `aic`|
The `ForwardSelector` class follows the Scikit-Learn API. After fitting the selector using the `.fit()` method, the selected features can be accessed using the boolean mask under the `.best_support_` attribute.
```python
selector = ForwardSelector(normalize=True, metric='aic')
selector.fit(X, y)
```
ForwardSelector(normalize=True)
```python
X.loc[:, selector.best_support_]
```
| |rep78|weight|length|gear_ratio|foreign|
|-|-----|------|------|----------|-------|
|0|3.0|2930|186|3.58|0|
|1|3.0|3350|173|2.53|0|
|2|0.0|2640|168|3.08|0|
|3|3.0|3250|196|2.93|0|
|4|4.0|4080|222|2.41|0|
|...|...|...|...|...|...|
|69|4.0|2160|172|3.74|1|
|70|5.0|2040|155|3.78|1|
|71|4.0|1930|155|3.78|1|
|72|4.0|1990|156|3.78|1|
|73|5.0|3170|193|2.98|1|

74 rows × 5 columns
### Best Subset Selection
The `SubsetSelector` follows a very simple algorithm: fit every possible model with $k$ predictors, for each model size $k$, and select the model that minimizes the selection criterion. This algorithm is only appropriate for $p \le 12$ features, as it quickly becomes computationally expensive: there are $\binom{p}{k} = \frac{p!}{k!(p-k)!}$ possible models of size $k$, and $2^p$ models in total, where $p$ is the total number of candidate features and $k$ is the number of features included in the model.
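To make the cost of exhaustive search concrete, here is a minimal sketch that scores every non-empty subset of a small synthetic feature matrix and counts the models it fits. It is an illustration only, not the `SubsetSelector` implementation: `ols_aic` and `best_subset` are hypothetical helper names, and whether the library also scores the intercept-only model is not something this sketch assumes.

```python
from itertools import combinations
from math import comb

import numpy as np


def ols_aic(X, y):
    """AIC of a Gaussian OLS fit with an intercept (hypothetical helper)."""
    n = len(y)
    design = np.column_stack([np.ones(n), X])  # prepend the intercept column
    beta, *_ = np.linalg.lstsq(design, y, rcond=None)
    rss = float(np.sum((y - design @ beta) ** 2))
    loglik = -0.5 * n * (np.log(2 * np.pi) + np.log(rss / n) + 1)
    return 2 * design.shape[1] - 2 * loglik


def best_subset(X, y):
    """Score every non-empty feature subset; return the lowest-AIC one."""
    p = X.shape[1]
    best_cols, best_score, n_models = None, np.inf, 0
    for k in range(1, p + 1):
        for cols in combinations(range(p), k):  # comb(p, k) subsets of size k
            n_models += 1
            score = ols_aic(X[:, list(cols)], y)
            if score < best_score:
                best_cols, best_score = cols, score
    return best_cols, best_score, n_models


# Synthetic data: only features 0 and 3 carry signal
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 6))
y = 3.0 * X[:, 0] - 2.0 * X[:, 3] + rng.normal(scale=0.5, size=200)

cols, score, n_models = best_subset(X, y)
print(cols, n_models)  # n_models is 2**6 - 1 = 63 non-empty subsets

# The count doubles with every added feature: 12 features give 4096 subsets
print(sum(comb(12, k) for k in range(13)))
```

With six features the search fits 63 models; each additional feature doubles the work, which is why exhaustive search stops being practical beyond a dozen or so features.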
The `SubsetSelector` is instantiated with two parameters: `normalize` and `metric`. The `normalize` parameter defaults to `False`, assuming that this class is part of a larger pipeline; `metric` defaults to `aic`.
|Parameter|Type|Description|
|---------|----|-----------|
|normalize|bool|Whether to normalize features; default `False`|
|metric|str|Optimization metric to use; must be one of `aic` or `bic`; default `aic`|
The `SubsetSelector` class follows the Scikit-Learn API. After fitting the selector using the `.fit()` method, the selected features can be accessed using the boolean mask under the `.best_support_` attribute.
```python
selector = SubsetSelector(normalize=True, metric='aic')
selector.fit(X, y)
```
SubsetSelector(normalize=True)
```python
X.loc[:, selector.get_support()]
```
| |rep78|weight|length|gear_ratio|foreign|
|-|-----|------|------|----------|-------|
|0|3.0|2930|186|3.58|0|
|1|3.0|3350|173|2.53|0|
|2|0.0|2640|168|3.08|0|
|3|3.0|3250|196|2.93|0|
|4|4.0|4080|222|2.41|0|
|...|...|...|...|...|...|
|69|4.0|2160|172|3.74|1|
|70|5.0|2040|155|3.78|1|
|71|4.0|1930|155|3.78|1|
|72|4.0|1990|156|3.78|1|
|73|5.0|3170|193|2.98|1|

74 rows × 5 columns
### Comparing the full model
Fitting on the features selected by `SubsetSelector` yields a model with four fewer parameters and slightly better AIC (389.3 vs. 394.5) and BIC (403.1 vs. 417.5) values than the full model. The summaries indicate possible multicollinearity in both models, likely caused by `weight`, `length`, `displacement` and other features that are all related to the weight of a vehicle.
*Note: Selection using BIC as the optimization metric yields a model where `weight` is the only selected feature. The Bayesian information criterion penalizes additional parameters more heavily than AIC.*
```python
mod = OLS(endog=y, exog=add_constant(X)).fit()
mod.summary()
```
OLS Regression Results

|Item|Value|Item|Value|
|----|-----|----|-----|
|Dep. Variable|mpg|R-squared|0.720|
|Model|OLS|Adj. R-squared|0.681|
|Method|Least Squares|F-statistic|18.33|
|Date|Sat, 07 Aug 2021|Prob (F-statistic)|1.29e-14|
|Time|15:37:36|Log-Likelihood|-187.23|
|No. Observations|74|AIC|394.5|
|Df Residuals|64|BIC|417.5|
|Df Model|9| | |
|Covariance Type|nonrobust| | |

| |coef|std err|t|P>\|t\||[0.025|0.975]|
|-|----|-------|-|-------|------|------|
|const|39.0871|9.100|4.295|0.000|20.907|57.267|
|rep78|1.0021|0.357|2.809|0.007|0.290|1.715|
|headroom|-0.0167|0.611|-0.027|0.978|-1.237|1.204|
|trunk|-0.0772|0.154|-0.503|0.617|-0.384|0.230|
|weight|-0.0037|0.002|-1.928|0.058|-0.008|0.000|
|length|-0.0752|0.061|-1.229|0.223|-0.197|0.047|
|turn|-0.1762|0.187|-0.941|0.350|-0.550|0.198|
|displacement|0.0131|0.011|1.180|0.243|-0.009|0.035|
|gear_ratio|3.7067|1.751|2.116|0.038|0.208|7.206|
|foreign|-4.4633|1.385|-3.222|0.002|-7.230|-1.696|

|Item|Value|Item|Value|
|----|-----|----|-----|
|Omnibus|28.364|Durbin-Watson|2.523|
|Prob(Omnibus)|0.000|Jarque-Bera (JB)|52.945|
|Skew|1.389|Prob(JB)|3.18e-12|
|Kurtosis|6.074|Cond. No.|7.55e+04|

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 7.55e+04. This might indicate that there are strong multicollinearity or other numerical problems.
```python
mod = OLS(endog=y, exog=add_constant(X.loc[:, selector.best_support_])).fit()
mod.summary()
```
OLS Regression Results

|Item|Value|Item|Value|
|----|-----|----|-----|
|Dep. Variable|mpg|R-squared|0.710|
|Model|OLS|Adj. R-squared|0.688|
|Method|Least Squares|F-statistic|33.25|
|Date|Sat, 07 Aug 2021|Prob (F-statistic)|5.22e-17|
|Time|15:37:40|Log-Likelihood|-188.63|
|No. Observations|74|AIC|389.3|
|Df Residuals|68|BIC|403.1|
|Df Model|5| | |
|Covariance Type|nonrobust| | |

| |coef|std err|t|P>\|t\||[0.025|0.975]|
|-|----|-------|-|-------|------|------|
|const|40.3703|7.860|5.136|0.000|24.687|56.054|
|rep78|0.9040|0.342|2.647|0.010|0.223|1.586|
|weight|-0.0030|0.002|-1.770|0.081|-0.006|0.000|
|length|-0.1058|0.053|-1.990|0.051|-0.212|0.000|
|gear_ratio|2.6905|1.511|1.780|0.079|-0.325|5.706|
|foreign|-4.0123|1.320|-3.040|0.003|-6.646|-1.379|

|Item|Value|Item|Value|
|----|-----|----|-----|
|Omnibus|24.257|Durbin-Watson|2.442|
|Prob(Omnibus)|0.000|Jarque-Bera (JB)|39.774|
|Skew|1.252|Prob(JB)|2.31e-09|
|Kurtosis|5.576|Cond. No.|6.59e+04|

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 6.59e+04. This might indicate that there are strong multicollinearity or other numerical problems.
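The AIC and BIC values in the two summaries can be reproduced from the reported log-likelihoods, which also shows why BIC favors smaller models. The sketch below assumes the usual definitions, AIC = 2k - 2 ln L and BIC = k ln(n) - 2 ln L, with k counting the constant; the function names `aic` and `bic` are illustrative helpers, not part of `step-select`.

```python
from math import log


def aic(loglik, n_params):
    """Akaike information criterion: each parameter costs 2."""
    return 2 * n_params - 2 * loglik


def bic(loglik, n_params, n_obs):
    """Bayesian information criterion: each parameter costs ln(n_obs)."""
    return n_params * log(n_obs) - 2 * loglik


n_obs = 74  # observations in the auto dataset

# (log-likelihood, number of parameters including the constant),
# copied from the two OLS summaries
models = {
    "full": (-187.23, 10),
    "selected": (-188.63, 6),
}

for name, (loglik, k) in models.items():
    print(name, round(aic(loglik, k), 1), round(bic(loglik, k, n_obs), 1))
# full 394.5 417.5
# selected 389.3 403.1
```

With 74 observations each extra parameter costs ln(74) ≈ 4.30 under BIC but only 2 under AIC, so BIC demands a larger gain in log-likelihood before it accepts another feature.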
### Use in Scikit-Learn Pipeline
Both `ForwardSelector` and `SubsetSelector` objects are compatible with Scikit-Learn `Pipeline` objects, and can be used as feature selection steps:
```python
pl = Pipeline([
    ('feature_selection', SubsetSelector(normalize=True)),
    ('regression', LinearRegression())
])
pl.fit(X, y)
```
Pipeline(steps=[('feature_selection', SubsetSelector(normalize=True)),
('regression', LinearRegression())])
```python
pl.score(X, y)
```
0.7097132531085899
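For Scikit-Learn regressors, `score` returns the coefficient of determination $R^2$, which is why the pipeline score of 0.7097 agrees with the R-squared of 0.710 reported in the OLS summary of the selected-feature model. As a rough sketch of what that number measures, the snippet below computes $R^2$ by hand on synthetic data; `r_squared` is a hypothetical helper and the data is made up for illustration.

```python
import numpy as np


def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - np.mean(y_true)) ** 2)
    return 1.0 - ss_res / ss_tot


# Synthetic regression problem
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = X @ np.array([1.5, 0.0, -2.0]) + rng.normal(size=100)

# Ordinary least squares with an intercept
design = np.column_stack([np.ones(len(y)), X])
beta, *_ = np.linalg.lstsq(design, y, rcond=None)

r2 = r_squared(y, design @ beta)
print(r2)
```

Note that this is an in-sample score, like the one above: it is computed on the same data used for fitting and feature selection, so it is likely to overstate performance on new data.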