https://github.com/transferwise/shap-select

A library for feature selection for gradient boosting models using regression on feature Shapley values

## Overview
`shap-select` implements a heuristic for fast feature selection in tabular regression and classification models.

The basic idea is to run a linear or logistic regression of the target on the Shapley values of
the original features, computed on the validation set,
discard the features with negative coefficients, and rank/filter the rest according to their
statistical significance. For motivation and details, refer to our [research paper](https://arxiv.org/abs/2410.06815) and see the [example notebook](https://github.com/transferwise/shap-select/blob/main/docs/Quick%20feature%20selection%20through%20regression%20on%20Shapley%20values.ipynb).
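The regression step can be illustrated with a minimal, dependency-light sketch. It is not the library's implementation. It makes several assumptions for illustration: a plain least-squares model stands in for the gradient boosting model, its Shapley values are computed in closed form (for a linear model with independent features, feature `j` contributes `w_j * (x_j - mean(x_j))`), p-values use a normal approximation, and the regression is run only once. The helper name `select_by_shap_regression` is hypothetical.

```python
import math
import numpy as np

def select_by_shap_regression(shap_values, y, threshold=0.05):
    """Regress y on per-feature Shapley values (single pass, OLS).

    Returns coefficients, t-values, two-sided p-values and a flag per feature:
    -1 for a negative coefficient, 1 for significant, 0 otherwise.
    """
    n, k = shap_values.shape
    design = np.column_stack([np.ones(n), shap_values])  # intercept + features
    beta, *_ = np.linalg.lstsq(design, y, rcond=None)
    residuals = y - design @ beta
    sigma2 = residuals @ residuals / (n - k - 1)
    cov = sigma2 * np.linalg.pinv(design.T @ design)
    t_values = beta / np.sqrt(np.diag(cov))
    # Normal approximation to the t distribution (assumption; fine for large n)
    p_values = np.array([math.erfc(abs(t) / math.sqrt(2)) for t in t_values])
    coef, t_values, p_values = beta[1:], t_values[1:], p_values[1:]
    selected = np.where(coef < 0, -1, np.where(p_values < threshold, 1, 0))
    return coef, t_values, p_values, selected

# Toy data: two informative features, two pure-noise features
rng = np.random.default_rng(0)
true_w = np.array([2.0, -1.5, 0.0, 0.0])
X_train = rng.normal(size=(2000, 4))
X_val = rng.normal(size=(2000, 4))
y_train = X_train @ true_w + rng.normal(size=2000)
y_val = X_val @ true_w + rng.normal(size=2000)

# Stand-in model: least squares fitted on the train set only
A = np.column_stack([np.ones(len(X_train)), X_train])
w = np.linalg.lstsq(A, y_train, rcond=None)[0][1:]

# Closed-form Shapley values of the linear model, evaluated on the validation set
shap_val = w * (X_val - X_train.mean(axis=0))

coef, t_values, p_values, selected = select_by_shap_regression(shap_val, y_val)
print(selected)
```

For the informative features the coefficient comes out close to 1, because the model's Shapley values already explain the target on the validation set. Noise features get small t-values or negative coefficients.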

Earlier packages also use Shapley values for feature selection; the advantages of this one are:
* Regression on the **validation set** to combat overfitting
* Only a single fit of the original model needed
* A single intuitive hyperparameter for feature selection: statistical significance
* Bonferroni correction for multiclass classification
* Collinearity of (Shapley value) features addressed by repeated (linear/logistic) regression
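
As a rough illustration of the Bonferroni point, the sketch below combines per-class p-values for each feature by taking the smallest one and multiplying it by the number of classes, capped at 1. This is the textbook Bonferroni adjustment. How exactly `shap-select` applies it internally is an assumption here, and the function name `bonferroni_min_p` is hypothetical.

```python
import math

def bonferroni_min_p(p_values_per_class):
    """Combine per-class p-values into one Bonferroni-adjusted p-value per feature.

    p_values_per_class: one list of per-feature p-values for each class.
    """
    n_classes = len(p_values_per_class)
    n_features = len(p_values_per_class[0])
    adjusted = []
    for j in range(n_features):
        smallest = min(p[j] for p in p_values_per_class)
        # Multiply by the number of tests (classes) and cap at 1
        adjusted.append(min(1.0, n_classes * smallest))
    return adjusted

# Three classes, two features (made-up p-values for illustration only)
per_class = [
    [0.01, 0.5],
    [0.20, 0.4],
    [0.03, 0.9],
]
adjusted = bonferroni_min_p(per_class)
threshold = 0.05
significant = [p < threshold for p in adjusted]
print(adjusted, significant)
```

The first feature stays significant after the adjustment (about 0.03), while the second is capped at 1.0.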

## Usage
```python
from shap_select import shap_select
# Here model is any model supported by the shap library, fitted on a different (train) dataset
# Task can be regression, binary, or multiclass
selected_features_df = shap_select(model, X_val, y_val, task="multiclass", threshold=0.05)
```



 
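|   | feature name | t-value | stat.significance | coefficient | selected |
|---|--------------|---------|-------------------|-------------|----------|
| 0 | x5 | 20.211299 | 0.000000 | 1.052030 | 1 |
| 1 | x4 | 18.315144 | 0.000000 | 0.952416 | 1 |
| 2 | x3 | 6.835690 | 0.000000 | 1.098154 | 1 |
| 3 | x2 | 6.457140 | 0.000000 | 1.044842 | 1 |
| 4 | x1 | 5.530556 | 0.000000 | 0.917242 | 1 |
| 5 | x6 | 2.390868 | 0.016827 | 1.497983 | 1 |
| 6 | x7 | 0.901098 | 0.367558 | 2.865508 | 0 |
| 7 | x8 | 0.563214 | 0.573302 | 1.933632 | 0 |
| 8 | x9 | -1.607814 | 0.107908 | -4.537098 | -1 |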
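
A short sketch of how such a result might be consumed, assuming the return value is a pandas DataFrame with exactly the column names shown in the output above. The DataFrame is rebuilt by hand here so the snippet runs without a fitted model. Reading `selected` as 1 (kept), 0 (not significant) and -1 (negative coefficient) is inferred from the description in the Overview.

```python
import pandas as pd

# Hand-built copy of the example output shown above
selected_features_df = pd.DataFrame({
    "feature name": ["x5", "x4", "x3", "x2", "x1", "x6", "x7", "x8", "x9"],
    "t-value": [20.211299, 18.315144, 6.835690, 6.457140, 5.530556,
                2.390868, 0.901098, 0.563214, -1.607814],
    "stat.significance": [0.0, 0.0, 0.0, 0.0, 0.0,
                          0.016827, 0.367558, 0.573302, 0.107908],
    "coefficient": [1.052030, 0.952416, 1.098154, 1.044842, 0.917242,
                    1.497983, 2.865508, 1.933632, -4.537098],
    "selected": [1, 1, 1, 1, 1, 1, 0, 0, -1],
})

# Features to keep: significant with a non-negative coefficient
is_kept = selected_features_df["selected"] == 1
kept = selected_features_df.loc[is_kept, "feature name"].tolist()

# Features discarded because their coefficient is negative
is_negative = selected_features_df["selected"] == -1
dropped_negative = selected_features_df.loc[is_negative, "feature name"].tolist()

print(kept)              # ['x5', 'x4', 'x3', 'x2', 'x1', 'x6']
print(dropped_negative)  # ['x9']
```

The `kept` list can then be used to subset the training columns before refitting the model.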

## Citation

If you use `shap-select` in your research, please cite our paper:

```bibtex
@misc{kraev2024shapselectlightweightfeatureselection,
title={Shap-Select: Lightweight Feature Selection Using SHAP Values and Regression},
author={Egor Kraev and Baran Koseoglu and Luca Traverso and Mohammed Topiwalla},
year={2024},
eprint={2410.06815},
archivePrefix={arXiv},
primaryClass={cs.LG},
url={https://arxiv.org/abs/2410.06815},
}
```