An open API service indexing awesome lists of open source software.

https://github.com/chris-santiago/steps

A SciKit-Learn style feature selector using best subsets and stepwise regression.
https://github.com/chris-santiago/steps

best-subset-selection data-science python scikit-learn stepwise-selection

Last synced: 11 months ago
JSON representation

A SciKit-Learn style feature selector using best subsets and stepwise regression.

Awesome Lists containing this project

README

          

# step-select
[![image](https://img.shields.io/badge/python-3.7--3.10-blue.svg)](https://www.python.org)
[![Build Status](https://app.travis-ci.com/chris-santiago/steps.svg?branch=master)](https://app.travis-ci.com/chris-santiago/steps)
[![codecov](https://codecov.io/gh/chris-santiago/steps/branch/master/graph/badge.svg?token=RIB2YFGWFX)](https://codecov.io/gh/chris-santiago/steps)

A SciKit-Learn style feature selector using best subsets and stepwise regression.

## Install

Create a virtual environment with Python 3.8 and install from PyPi:

```bash
pip install step-select
```

## Use

### Preliminaries

*Note: this example requires two additional packages*: `pandas` and `statsmodels`.

In this example we'll show how the `ForwardSelector` and `SubsetSelector` classes can be used on their own or in conjuction with a Scikit-Learn `Pipeline` object.

```python
import pandas as pd
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LinearRegression
import statsmodels.datasets
from statsmodels.api import OLS
from statsmodels.tools import add_constant

from steps.forward import ForwardSelector
from steps.subset import SubsetSelector
```

We'll download the `auto` dataset via `Statsmodels`; we'll use `mpg` as the endogenous variable and the remaining variables as exongenous. We won't use `make`, as that will create several dummies and increase the number of paramters to 12+, which is too many for the `SubsetSelector` class; we'll also drop `price`.

```python
data = statsmodels.datasets.webuse('auto')
data['foreign'] = pd.Series([x == 'Foreign' for x in data['foreign']]).astype(int)
data.fillna(0, inplace=True)
data.head()
```




make
price
mpg
rep78
headroom
trunk
weight
length
turn
displacement
gear_ratio
foreign




0
AMC Concord
4099
22
3.0
2.5
11
2930
186
40
121
3.58
0


1
AMC Pacer
4749
17
3.0
3.0
11
3350
173
40
258
2.53
0


2
AMC Spirit
3799
22
0.0
3.0
12
2640
168
35
121
3.08
0


3
Buick Century
4816
20
3.0
4.5
16
3250
196
40
196
2.93
0


4
Buick Electra
7827
15
4.0
4.0
20
4080
222
43
350
2.41
0

```python
X = data.iloc[:, 3:]
y = data['mpg']
```

### Forward Stepwise Selection

The `ForwardSelector` follows the standard stepwise regression algorithm: begin with a null model, iteratively test each variable and select the one that gives the most statistically significant improvement of the fit, and repeat. This greedy algorithm continues until the fit no longer improves.

The `ForwardSelector` is instantiated with two parameters: `normalize` and `metric`. `Normalize` defaults to `False`, assuming that this class is part of a larger pipeline; `metric` defaults to AIC.

|Parameter|Type|Description|
|---------|----|-----------|
|normalize|bool|Whether to normalize features; default `False`|
|metric|str|Optimization metric to use; must be one of `aic` or `bic`; default `aic`|

The `ForwardSelector` class follows the Scikit-Learn API. After fitting the selector using the `.fit()` method, the selected features can be accessed using the boolean mask under the `.best_support_` attribute.

```python
selector = ForwardSelector(normalize=True, metric='aic')
selector.fit(X, y)
```

ForwardSelector(normalize=True)

```python
X.loc[:, selector.best_support_]
```




rep78
weight
length
gear_ratio
foreign




0
3.0
2930
186
3.58
0


1
3.0
3350
173
2.53
0


2
0.0
2640
168
3.08
0


3
3.0
3250
196
2.93
0


4
4.0
4080
222
2.41
0


...
...
...
...
...
...


69
4.0
2160
172
3.74
1


70
5.0
2040
155
3.78
1


71
4.0
1930
155
3.78
1


72
4.0
1990
156
3.78
1


73
5.0
3170
193
2.98
1

74 rows × 5 columns


### Best Subset Selection

The `SubsetSelector` follows a very simple algorithm: compare all possible models with $k$ predictors, and select the model that minimizes our selection criteria. This algorithm is only appropriate for $k<=12$ features, as it becomes computationally expensive: there are $\frac{k!}{(p-k)!}$possible models, where $p$ is the total number of paramters and $k$ is the number of features included in the model.

The `SubsetSelector` is instantiated with two parameters: `normalize` and `metric`. `Normalize` defaults to `False`, assuming that this class is part of a larger pipeline; `metric` defaults to AIC.

|Parameter|Type|Description|
|---------|----|-----------|
|normalize|bool|Whether to normalize features; default `False`|
|metric|str|Optimization metric to use; must be one of `aic` or `bic`; default `aic`|

The `SubsetSelector` class follows the Scikit-Learn API. After fitting the selector using the `.fit()` method, the selected features can be accessed using the boolean mask under the `.best_support_` attribute.

```python
selector = SubsetSelector(normalize=True, metric='aic')
selector.fit(X, y)
```

SubsetSelector(normalize=True)

```python
X.loc[:, selector.get_support()]
```




rep78
weight
length
gear_ratio
foreign




0
3.0
2930
186
3.58
0


1
3.0
3350
173
2.53
0


2
0.0
2640
168
3.08
0


3
3.0
3250
196
2.93
0


4
4.0
4080
222
2.41
0


...
...
...
...
...
...


69
4.0
2160
172
3.74
1


70
5.0
2040
155
3.78
1


71
4.0
1930
155
3.78
1


72
4.0
1990
156
3.78
1


73
5.0
3170
193
2.98
1

74 rows × 5 columns


### Comparing the full model

Using the `SubsetSelector` selected features yields a model with 4 fewer parameters and slightly improved AIC and BIC metrics. The summaries indicate possible multicollinearity in both models, likely caused by `weight`, `length`, `displacement` and other features that are all related to the weight of a vehicle.

*Note: Selection using BIC as the optimization metric yields a model where `weight` is the only selected feature. Bayesian information criteria penalizes additional parameters more then AIC.*

```python
mod = OLS(endog=y, exog=add_constant(X)).fit()
mod.summary()
```

OLS Regression Results

Dep. Variable: mpg R-squared: 0.720

Model: OLS Adj. R-squared: 0.681

Method: Least Squares F-statistic: 18.33

Date: Sat, 07 Aug 2021 Prob (F-statistic): 1.29e-14

Time: 15:37:36 Log-Likelihood: -187.23

No. Observations: 74 AIC: 394.5

Df Residuals: 64 BIC: 417.5

Df Model: 9

Covariance Type: nonrobust

coef std err t P>|t| [0.025 0.975]

const 39.0871 9.100 4.295 0.000 20.907 57.267

rep78 1.0021 0.357 2.809 0.007 0.290 1.715

headroom -0.0167 0.611 -0.027 0.978 -1.237 1.204

trunk -0.0772 0.154 -0.503 0.617 -0.384 0.230

weight -0.0037 0.002 -1.928 0.058 -0.008 0.000

length -0.0752 0.061 -1.229 0.223 -0.197 0.047

turn -0.1762 0.187 -0.941 0.350 -0.550 0.198

displacement 0.0131 0.011 1.180 0.243 -0.009 0.035

gear_ratio 3.7067 1.751 2.116 0.038 0.208 7.206

foreign -4.4633 1.385 -3.222 0.002 -7.230 -1.696

Omnibus: 28.364 Durbin-Watson: 2.523

Prob(Omnibus): 0.000 Jarque-Bera (JB): 52.945

Skew: 1.389 Prob(JB): 3.18e-12

Kurtosis: 6.074 Cond. No. 7.55e+04



Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 7.55e+04. This might indicate that there are
strong multicollinearity or other numerical problems.

```python
mod = OLS(endog=y, exog=add_constant(X.loc[:, selector.best_support_])).fit()
mod.summary()
```

OLS Regression Results

Dep. Variable: mpg R-squared: 0.710

Model: OLS Adj. R-squared: 0.688

Method: Least Squares F-statistic: 33.25

Date: Sat, 07 Aug 2021 Prob (F-statistic): 5.22e-17

Time: 15:37:40 Log-Likelihood: -188.63

No. Observations: 74 AIC: 389.3

Df Residuals: 68 BIC: 403.1

Df Model: 5

Covariance Type: nonrobust

coef std err t P>|t| [0.025 0.975]

const 40.3703 7.860 5.136 0.000 24.687 56.054

rep78 0.9040 0.342 2.647 0.010 0.223 1.586

weight -0.0030 0.002 -1.770 0.081 -0.006 0.000

length -0.1058 0.053 -1.990 0.051 -0.212 0.000

gear_ratio 2.6905 1.511 1.780 0.079 -0.325 5.706

foreign -4.0123 1.320 -3.040 0.003 -6.646 -1.379

Omnibus: 24.257 Durbin-Watson: 2.442

Prob(Omnibus): 0.000 Jarque-Bera (JB): 39.774

Skew: 1.252 Prob(JB): 2.31e-09

Kurtosis: 5.576 Cond. No. 6.59e+04



Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 6.59e+04. This might indicate that there are
strong multicollinearity or other numerical problems.

### Use in Scikit-Learn Pipeline

Both `ForwardSelector` and `SubsetSelector` objects are compatible with Scikit-Learn `Pipeline` objects, and can be used as feature selection steps:

```python
pl = Pipeline([
('feature_selection', SubsetSelector(normalize=True)),
('regression', LinearRegression())
])
pl.fit(X, y)
```

Pipeline(steps=[('feature_selection', SubsetSelector(normalize=True)),
('regression', LinearRegression())])

```python
pl.score(X, y)
```

0.7097132531085899