https://github.com/chris-santiago/steps

A SciKit-Learn style feature selector using best subsets and stepwise regression.
Host: GitHub
URL: https://github.com/chris-santiago/steps
Owner: chris-santiago
License: mit
Created: 2021-07-31T18:10:06.000Z (almost 5 years ago)
Default Branch: master
Last Pushed: 2024-03-27T13:50:15.000Z (over 2 years ago)
Last Synced: 2025-06-27T19:13:25.783Z (about 1 year ago)
Topics: best-subset-selection, data-science, python, scikit-learn, stepwise-selection
Language: Jupyter Notebook
Homepage: https://chris-santiago.github.io/steps/
Size: 781 KB
Stars: 5
Watchers: 1
Forks: 0
Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE.txt

# step-select

[![image](https://img.shields.io/badge/python-3.7--3.10-blue.svg)](https://www.python.org)

[![Build Status](https://app.travis-ci.com/chris-santiago/steps.svg?branch=master)](https://app.travis-ci.com/chris-santiago/steps)

[![codecov](https://codecov.io/gh/chris-santiago/steps/branch/master/graph/badge.svg?token=RIB2YFGWFX)](https://codecov.io/gh/chris-santiago/steps)

A SciKit-Learn style feature selector using best subsets and stepwise regression.

## Install

Create a virtual environment with Python 3.8 and install from PyPI:

```bash
pip install step-select
```

## Use

### Preliminaries

*Note: this example requires two additional packages*: `pandas` and `statsmodels`.

In this example we'll show how the `ForwardSelector` and `SubsetSelector` classes can be used on their own or in conjunction with a Scikit-Learn `Pipeline` object.

```python
import pandas as pd
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LinearRegression
import statsmodels.datasets
from statsmodels.api import OLS
from statsmodels.tools import add_constant

from steps.forward import ForwardSelector
from steps.subset import SubsetSelector
```

We'll download the `auto` dataset via `statsmodels`, using `mpg` as the endogenous variable and the remaining variables as exogenous. We won't use `make`, as encoding it would create several dummy variables and increase the number of parameters to 12+, which is too many for the `SubsetSelector` class; we'll also drop `price`.

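```python
# Download the Stata `auto` dataset
data = statsmodels.datasets.webuse('auto')
# Encode `foreign` as a 0/1 indicator (1 = Foreign)
data['foreign'] = pd.Series([x == 'Foreign' for x in data['foreign']]).astype(int)
# Replace missing values with 0
data.fillna(0, inplace=True)
data.head()
```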



  

    

      

| |make|price|mpg|rep78|headroom|trunk|weight|length|turn|displacement|gear_ratio|foreign|
|-|----|-----|---|-----|--------|-----|------|------|----|------------|----------|-------|
|0|AMC Concord|4099|22|3.0|2.5|11|2930|186|40|121|3.58|0|
|1|AMC Pacer|4749|17|3.0|3.0|11|3350|173|40|258|2.53|0|
|2|AMC Spirit|3799|22|0.0|3.0|12|2640|168|35|121|3.08|0|
|3|Buick Century|4816|20|3.0|4.5|16|3250|196|40|196|2.93|0|
|4|Buick Electra|7827|15|4.0|4.0|20|4080|222|43|350|2.41|0|

```python
X = data.iloc[:, 3:]
y = data['mpg']
```

### Forward Stepwise Selection

The `ForwardSelector` follows the standard stepwise regression algorithm: begin with a null model, iteratively test each variable and select the one that gives the most statistically significant improvement of the fit, and repeat. This greedy algorithm continues until the fit no longer improves.
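The greedy loop described above can be sketched in a few lines of NumPy. This is an illustration only, not the package's implementation: the helper names `gaussian_aic` and `forward_select` are hypothetical, and the score is assumed to be the Gaussian AIC up to an additive constant, $n \ln(\mathrm{RSS}/n) + 2k$, where $k$ counts the fitted coefficients including the intercept.

```python
import numpy as np


def gaussian_aic(X, y, cols):
    """AIC (up to a constant) of an OLS fit with intercept on columns `cols`."""
    n = len(y)
    design = np.column_stack([np.ones(n)] + [X[:, j] for j in cols])
    beta, *_ = np.linalg.lstsq(design, y, rcond=None)
    resid = y - design @ beta
    rss = float(resid @ resid)
    return n * np.log(rss / n) + 2 * design.shape[1]


def forward_select(X, y):
    """Greedy forward selection; returns a boolean mask over the columns of X."""
    remaining = list(range(X.shape[1]))
    selected = []
    best = gaussian_aic(X, y, selected)  # null model: intercept only
    while remaining:
        # Score every candidate that could be added to the current model
        score, j = min((gaussian_aic(X, y, selected + [j]), j) for j in remaining)
        if score >= best:  # no candidate improves the fit: stop
            break
        best = score
        selected.append(j)
        remaining.remove(j)
    return np.isin(np.arange(X.shape[1]), selected)


# Synthetic data (an assumption for the demo): only columns 0 and 2 drive y
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 4))
y_demo = 3.0 * X_demo[:, 0] - 2.0 * X_demo[:, 2] + rng.normal(scale=0.5, size=200)
mask = forward_select(X_demo, y_demo)
print(mask)
```

Because each step keeps the single best addition, the loop fits at most on the order of $p^2$ models, which is why forward selection scales to many more features than an exhaustive search.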

The `ForwardSelector` is instantiated with two parameters: `normalize` and `metric`. `normalize` defaults to `False`, assuming that this class is part of a larger pipeline; `metric` defaults to AIC.

|Parameter|Type|Description|
|---------|----|-----------|
|normalize|bool|Whether to normalize features; default `False`|
|metric|str|Optimization metric to use; must be one of `aic` or `bic`; default `aic`|

The `ForwardSelector` class follows the Scikit-Learn API.  After fitting the selector using the `.fit()` method, the selected features can be accessed using the boolean mask under the `.best_support_` attribute.

```python
selector = ForwardSelector(normalize=True, metric='aic')
selector.fit(X, y)
```

    ForwardSelector(normalize=True)

```python
X.loc[:, selector.best_support_]
```



  

    

      

| |rep78|weight|length|gear_ratio|foreign|
|-|-----|------|------|----------|-------|
|0|3.0|2930|186|3.58|0|
|1|3.0|3350|173|2.53|0|
|2|0.0|2640|168|3.08|0|
|3|3.0|3250|196|2.93|0|
|4|4.0|4080|222|2.41|0|
|...|...|...|...|...|...|
|69|4.0|2160|172|3.74|1|
|70|5.0|2040|155|3.78|1|
|71|4.0|1930|155|3.78|1|
|72|4.0|1990|156|3.78|1|
|73|5.0|3170|193|2.98|1|

74 rows × 5 columns



### Best Subset Selection

The `SubsetSelector` follows a very simple algorithm: compare all possible models with $k$ predictors, for every $k$, and select the model that minimizes our selection criterion. This algorithm is only appropriate for $p \le 12$ features, as it becomes computationally expensive: there are $\binom{p}{k} = \frac{p!}{k!\,(p-k)!}$ possible models with exactly $k$ features, and $2^p$ possible models in total, where $p$ is the total number of candidate features and $k$ is the number of features included in the model.
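To get a feel for how quickly the search space grows, the model counts can be computed with the standard library. The helper name `subset_counts` is made up for this illustration and is not part of the package.

```python
from math import comb


def subset_counts(p):
    """Number of candidate models of each size k = 0..p for p features."""
    return [comb(p, k) for k in range(p + 1)]


for p in (9, 12, 20):
    counts = subset_counts(p)
    # sum(counts) equals 2**p: every feature is either in or out of the model
    print(f"p={p}: {max(counts)} models at the largest size, {sum(counts)} in total")
```

With the 9 features used in this example there are 512 candidate models; at 12 features there are 4,096, and at 20 features over a million.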

The `SubsetSelector` is instantiated with two parameters: `normalize` and `metric`. `normalize` defaults to `False`, assuming that this class is part of a larger pipeline; `metric` defaults to AIC.

|Parameter|Type|Description|
|---------|----|-----------|
|normalize|bool|Whether to normalize features; default `False`|
|metric|str|Optimization metric to use; must be one of `aic` or `bic`; default `aic`|

The `SubsetSelector` class follows the Scikit-Learn API.  After fitting the selector using the `.fit()` method, the selected features can be accessed using the boolean mask under the `.best_support_` attribute.

```python
selector = SubsetSelector(normalize=True, metric='aic')
selector.fit(X, y)
```

    SubsetSelector(normalize=True)

```python
X.loc[:, selector.get_support()]
```



  

    

      

| |rep78|weight|length|gear_ratio|foreign|
|-|-----|------|------|----------|-------|
|0|3.0|2930|186|3.58|0|
|1|3.0|3350|173|2.53|0|
|2|0.0|2640|168|3.08|0|
|3|3.0|3250|196|2.93|0|
|4|4.0|4080|222|2.41|0|
|...|...|...|...|...|...|
|69|4.0|2160|172|3.74|1|
|70|5.0|2040|155|3.78|1|
|71|4.0|1930|155|3.78|1|
|72|4.0|1990|156|3.78|1|
|73|5.0|3170|193|2.98|1|

74 rows × 5 columns



### Comparing the full model 

Fitting OLS on the features selected by `SubsetSelector` yields a model with 4 fewer parameters and slightly better AIC (389.3 vs. 394.5) and BIC (403.1 vs. 417.5). The summaries flag possible multicollinearity in both models (large condition numbers), likely caused by `weight`, `length`, `displacement` and other features that are all related to the size and weight of a vehicle.

*Note: Selection using BIC as the optimization metric yields a model where `weight` is the only selected feature. The Bayesian information criterion (BIC) penalizes additional parameters more heavily than AIC.*
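As a rough check on that penalty difference, both criteria can be recomputed from the log-likelihoods reported in the two OLS summaries in this section, using the standard definitions $\mathrm{AIC} = 2k - 2\ln L$ and $\mathrm{BIC} = k \ln n - 2\ln L$. The function names are illustrative, and $k$ is assumed to count the intercept (10 parameters for the full model, 6 for the reduced one).

```python
from math import log


def aic(log_likelihood, n_params):
    """Akaike information criterion: 2k - 2 ln L."""
    return 2 * n_params - 2 * log_likelihood


def bic(log_likelihood, n_params, n_obs):
    """Bayesian information criterion: k ln(n) - 2 ln L."""
    return n_params * log(n_obs) - 2 * log_likelihood


n_obs = 74  # observations in the auto dataset
models = {
    "full": {"log_likelihood": -187.23, "n_params": 10},
    "reduced": {"log_likelihood": -188.63, "n_params": 6},
}

for name, m in models.items():
    a = aic(m["log_likelihood"], m["n_params"])
    b = bic(m["log_likelihood"], m["n_params"], n_obs)
    print(f"{name}: AIC={a:.1f}, BIC={b:.1f}")

# Per-parameter penalty: 2 for AIC versus ln(74), roughly 4.3, for BIC
print(f"BIC penalty per parameter: {log(n_obs):.2f}")
```

With 74 observations each extra parameter costs about 4.3 under BIC but only 2 under AIC, which is consistent with BIC settling on a much smaller model.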

```python
mod = OLS(endog=y, exog=add_constant(X)).fit()
mod.summary()
```

    OLS Regression Results

    Dep. Variable:           mpg         R-squared:             0.720
    Model:                   OLS         Adj. R-squared:        0.681
    Method:             Least Squares    F-statistic:           18.33
    Date:             Sat, 07 Aug 2021   Prob (F-statistic): 1.29e-14
    Time:                 15:37:36       Log-Likelihood:      -187.23
    No. Observations:          74        AIC:                   394.5
    Df Residuals:              64        BIC:                   417.5
    Df Model:                   9
    Covariance Type:      nonrobust

                       coef    std err          t      P>|t|     [0.025     0.975]
    const           39.0871      9.100      4.295      0.000     20.907     57.267
    rep78            1.0021      0.357      2.809      0.007      0.290      1.715
    headroom        -0.0167      0.611     -0.027      0.978     -1.237      1.204
    trunk           -0.0772      0.154     -0.503      0.617     -0.384      0.230
    weight          -0.0037      0.002     -1.928      0.058     -0.008      0.000
    length          -0.0752      0.061     -1.229      0.223     -0.197      0.047
    turn            -0.1762      0.187     -0.941      0.350     -0.550      0.198
    displacement     0.0131      0.011      1.180      0.243     -0.009      0.035
    gear_ratio       3.7067      1.751      2.116      0.038      0.208      7.206
    foreign         -4.4633      1.385     -3.222      0.002     -7.230     -1.696

    Omnibus:       28.364   Durbin-Watson:         2.523
    Prob(Omnibus):  0.000   Jarque-Bera (JB):     52.945
    Skew:           1.389   Prob(JB):           3.18e-12
    Kurtosis:       6.074   Cond. No.           7.55e+04

    Notes:
    [1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
    [2] The condition number is large, 7.55e+04. This might indicate that there are
    strong multicollinearity or other numerical problems.

```python
mod = OLS(endog=y, exog=add_constant(X.loc[:, selector.best_support_])).fit()
mod.summary()
```

    OLS Regression Results

    Dep. Variable:           mpg         R-squared:             0.710
    Model:                   OLS         Adj. R-squared:        0.688
    Method:             Least Squares    F-statistic:           33.25
    Date:             Sat, 07 Aug 2021   Prob (F-statistic): 5.22e-17
    Time:                 15:37:40       Log-Likelihood:      -188.63
    No. Observations:          74        AIC:                   389.3
    Df Residuals:              68        BIC:                   403.1
    Df Model:                   5
    Covariance Type:      nonrobust

                     coef    std err          t      P>|t|     [0.025     0.975]
    const         40.3703      7.860      5.136      0.000     24.687     56.054
    rep78          0.9040      0.342      2.647      0.010      0.223      1.586
    weight        -0.0030      0.002     -1.770      0.081     -0.006      0.000
    length        -0.1058      0.053     -1.990      0.051     -0.212      0.000
    gear_ratio     2.6905      1.511      1.780      0.079     -0.325      5.706
    foreign       -4.0123      1.320     -3.040      0.003     -6.646     -1.379

    Omnibus:       24.257   Durbin-Watson:         2.442
    Prob(Omnibus):  0.000   Jarque-Bera (JB):     39.774
    Skew:           1.252   Prob(JB):           2.31e-09
    Kurtosis:       5.576   Cond. No.           6.59e+04

    Notes:
    [1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
    [2] The condition number is large, 6.59e+04. This might indicate that there are
    strong multicollinearity or other numerical problems.

### Use in Scikit-Learn Pipeline

Both `ForwardSelector` and `SubsetSelector` objects are compatible with Scikit-Learn `Pipeline` objects, and can be used as feature selection steps:

```python
pl = Pipeline([
    ('feature_selection', SubsetSelector(normalize=True)),
    ('regression', LinearRegression())
])
pl.fit(X, y)
```

    Pipeline(steps=[('feature_selection', SubsetSelector(normalize=True)),
                    ('regression', LinearRegression())])

```python
pl.score(X, y)
```

    0.7097132531085899