https://github.com/transferwise/shap-select
A library for feature selection for gradient boosting models using regression on feature Shapley values
- Host: GitHub
- URL: https://github.com/transferwise/shap-select
- Owner: transferwise
- License: apache-2.0
- Created: 2024-09-19T08:07:37.000Z (4 months ago)
- Default Branch: main
- Last Pushed: 2024-11-21T13:59:07.000Z (2 months ago)
- Last Synced: 2024-12-21T17:23:55.136Z (about 1 month ago)
- Language: Python
- Size: 159 KB
- Stars: 22
- Watchers: 26
- Forks: 2
- Open Issues: 1
Metadata Files:
- Readme: README.md
- License: LICENSE
- Codeowners: .github/CODEOWNERS
README
## Overview
`shap-select` implements a heuristic for fast feature selection for tabular regression and classification models.

The basic idea is to run a linear or logistic regression of the target on the Shapley values of the original features, using the validation set, discard the features with negative coefficients, and rank and filter the rest according to their statistical significance. For motivation and details, refer to our [research paper](https://arxiv.org/abs/2410.06815) and see the [example notebook](https://github.com/transferwise/shap-select/blob/main/docs/Quick%20feature%20selection%20through%20regression%20on%20Shapley%20values.ipynb).

Earlier packages that use Shapley values for feature selection exist; the advantages of this one are:
* Regression on the **validation set** to combat overfitting
* Only a single fit of the original model needed
* A single intuitive hyperparameter for feature selection: statistical significance
* Bonferroni correction for multiclass classification
* Address collinearity of (Shapley value) features by repeated (linear/logistic) regression

## Usage
```python
from shap_select import shap_select
# Here model is any model supported by the shap library, fitted on a different (train) dataset
# Task can be regression, binary, or multiclass
selected_features_df = shap_select(model, X_val, y_val, task="multiclass", threshold=0.05)
```
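|   | feature name | t-value | stat.significance | coefficient | selected |
|--:|:-------------|--------:|------------------:|------------:|---------:|
| 0 | x5 | 20.211299 | 0.000000 | 1.052030 | 1 |
| 1 | x4 | 18.315144 | 0.000000 | 0.952416 | 1 |
| 2 | x3 | 6.835690 | 0.000000 | 1.098154 | 1 |
| 3 | x2 | 6.457140 | 0.000000 | 1.044842 | 1 |
| 4 | x1 | 5.530556 | 0.000000 | 0.917242 | 1 |
| 5 | x6 | 2.390868 | 0.016827 | 1.497983 | 1 |
| 6 | x7 | 0.901098 | 0.367558 | 2.865508 | 0 |
| 7 | x8 | 0.563214 | 0.573302 | 1.933632 | 0 |
| 8 | x9 | -1.607814 | 0.107908 | -4.537098 | -1 |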
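
For intuition, the heuristic from the overview can be sketched for the plain regression case using only NumPy. This is an illustrative sketch, not the library's implementation. The names `select_by_shap_regression` and `make_data` are hypothetical, and the "model" is a stand-in linear fit whose Shapley values (with features treated as independent) are `w_j * (x_j - mean(x_j))`. The p-values use a normal approximation to the t-distribution, and the `-1`/`0`/`1` encoding of `selected` is an assumption based on the table above. The sketch leaves out the repeated regression for collinearity, the logistic case, and the Bonferroni correction.

```python
import math

import numpy as np


def select_by_shap_regression(shap_values, y, threshold=0.05):
    """OLS of the target on per-feature Shapley values (validation data).

    Hypothetical helper for illustration only; returns one dict per feature.
    """
    n, k = shap_values.shape
    design = np.column_stack([np.ones(n), shap_values])  # add an intercept
    beta = np.linalg.lstsq(design, y, rcond=None)[0]
    residuals = y - design @ beta
    sigma2 = residuals @ residuals / (n - k - 1)
    std_err = np.sqrt(sigma2 * np.diag(np.linalg.inv(design.T @ design)))
    t_values = beta / std_err

    rows = []
    for j in range(1, k + 1):  # skip the intercept at index 0
        # Two-sided p-value, normal approximation to the t-distribution.
        p_value = math.erfc(abs(t_values[j]) / math.sqrt(2.0))
        if beta[j] < 0:
            selected = -1  # negative coefficient: discarded
        elif p_value < threshold:
            selected = 1   # positive and significant: kept
        else:
            selected = 0   # positive but not significant
        rows.append({
            "feature name": f"x{j - 1}",
            "t-value": float(t_values[j]),
            "stat.significance": float(p_value),
            "coefficient": float(beta[j]),
            "selected": selected,
        })
    return rows


rng = np.random.default_rng(0)


def make_data(n):
    # Synthetic data: only x0 and x1 carry signal, x2..x4 are pure noise.
    X = rng.normal(size=(n, 5))
    y = 2.0 * X[:, 0] + 1.0 * X[:, 1] + rng.normal(size=n)
    return X, y


X_train, y_train = make_data(500)
X_val, y_val = make_data(500)

# Stand-in model: linear least squares fitted on the train set only.
A = np.column_stack([np.ones(len(X_train)), X_train])
w = np.linalg.lstsq(A, y_train, rcond=None)[0][1:]

# Shapley values of a linear model, treating features as independent.
shap_val = w * (X_val - X_train.mean(axis=0))

rows = select_by_shap_regression(shap_val, y_val)
for row in rows:
    print(row)
```

In this setup the informative features get coefficients close to 1 with large t-values, while the noise features get unstable coefficients that are usually not significant, which is the same pattern as in the table above.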
## Citation
If you use `shap-select` in your research, please cite our paper:
```bibtex
@misc{kraev2024shapselectlightweightfeatureselection,
title={Shap-Select: Lightweight Feature Selection Using SHAP Values and Regression},
author={Egor Kraev and Baran Koseoglu and Luca Traverso and Mohammed Topiwalla},
year={2024},
eprint={2410.06815},
archivePrefix={arXiv},
primaryClass={cs.LG},
url={https://arxiv.org/abs/2410.06815},
}
```