https://github.com/transferwise/hisel
Feature selection tool based on Hilbert-Schmidt Independence Criterion
- Host: GitHub
- URL: https://github.com/transferwise/hisel
- Owner: transferwise
- License: apache-2.0
- Created: 2023-03-28T18:07:26.000Z (almost 2 years ago)
- Default Branch: trunk
- Last Pushed: 2024-05-03T21:37:35.000Z (8 months ago)
- Last Synced: 2024-08-08T20:56:40.237Z (5 months ago)
- Language: Jupyter Notebook
- Size: 358 KB
- Stars: 2
- Watchers: 3
- Forks: 0
- Open Issues: 2
Metadata Files:
- Readme: README.md
- License: LICENSE.txt
- Codeowners: .github/CODEOWNERS
# HISEL
## Feature selection tool based on Hilbert-Schmidt Independence Criterion
Feature selection is
the machine learning
task
of selecting from a data set
the features
that are relevant
for the prediction of a given target.
The `hisel` package
provides feature selection methods
based on
Hilbert-Schmidt Independence Criterion.
In particular,
it provides an implementation of the HSIC Lasso algorithm of
[Yamada, M. et al. (2012)](https://arxiv.org/abs/1202.0515).
## Why is `hisel` cool?
#### `hisel` is accurate
HSIC Lasso is an excellent algorithm for feature selection.
This makes `hisel` an accurate tool in your machine learning modelling.
Moreover,
`hisel` implements clever routines
that address common causes of poor accuracy in other feature selection methods.

Examples of where `hisel` outperforms the methods in
[sklearn.feature\_selection](https://scikit-learn.org/stable/modules/classes.html#module-sklearn.feature_selection)
are given in the notebooks
`ensemble-example.ipynb`
and
`nonlinear-transform.ipynb`.
#### `hisel` is fast
A crucial step in the HSIC Lasso algorithm
is the computation of
certain Gram matrices.
`hisel` implements such computations
in a highly vectorised and performant way.
Moreover,
`hisel` allows you to
accelerate these computations
using a GPU.
The image below shows
the average run time
of the computations
of Gram matrices
via
`hisel` on CPU,
via
`hisel` on GPU,
and
via
[pyHSICLasso](https://pypi.org/project/pyHSICLasso/).
The performance has been measured
on the computation
of Gram matrices required
by HSIC Lasso
for the selection
from a dataset of 300 features
with as many samples as reported on the x-axis.

![gramtimes](gramtimes.png)
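
To make the role of these Gram matrices more concrete, here is a minimal NumPy sketch of one vectorised way to compute a centred Gram matrix for every feature at once. It is not `hisel`'s actual implementation: the Gaussian kernel, the bandwidth parameter `sigma`, and the function name `centred_gram_matrices` are assumptions chosen for illustration only.

```python
import numpy as np


def centred_gram_matrices(x: np.ndarray, sigma: float = 1.0) -> np.ndarray:
    """Return one centred Gaussian Gram matrix per feature.

    x has shape (n_samples, n_features); the result has shape
    (n_features, n_samples, n_samples).
    """
    n, _ = x.shape
    # Pairwise differences for every feature at once: shape (d, n, n).
    diffs = x.T[:, :, None] - x.T[:, None, :]
    # Gaussian kernel applied elementwise (hypothetical bandwidth sigma).
    grams = np.exp(-diffs ** 2 / (2.0 * sigma ** 2))
    # Centring matrix H = I - (1/n) 1 1^T; H @ K @ H removes row and column means.
    h = np.eye(n) - np.full((n, n), 1.0 / n)
    # matmul broadcasts h over the leading feature axis.
    return h @ grams @ h


rng = np.random.default_rng(0)
x = rng.standard_normal((50, 4))
grams = centred_gram_matrices(x)
print(grams.shape)  # (4, 50, 50)
```

The whole computation is expressed as array operations with no Python loop over features or samples, which is the kind of formulation that also lends itself to GPU array libraries. Memory grows as `n_features * n_samples ** 2`, so a real implementation would likely process features in batches.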
#### `hisel` has a friendly user interface
Getting started with `hisel` is as straightforward as the following code snippet:
```
>>> import pandas as pd
>>> import hisel
>>> df = pd.read_csv('mydata.csv')
>>> xdf = df.iloc[:, :-1]
>>> yser = df.iloc[:, -1]
>>> hisel.feature_selection.select_features(xdf, yser)
['d2', 'd7', 'c3', 'c10', 'c12', 'c24', 'c22', 'c21', 'c5']
```
If you are not interested in more details,
please read no further.
If you would like to
explore more about
how to tune the hyper-parameters used by `hisel`
or
how to have more advanced control on `hisel`'s selection,
please browse the examples in
[examples/](https://github.com/transferwise/hisel/tree/trunk/examples)
and in
[notebooks](https://github.com/transferwise/hisel/tree/trunk/notebooks).
## Installation
### Install via `pip`
The package [hisel](https://pypi.org/project/hisel/) is available from [PyPI](https://pypi.org/).
You can install it via `pip`:
```
pip install hisel
```
If you want to install the extra support for GPU computations, you can do
```
pip install hisel[cudaXXX]
```
where `cudaXXX` is one of the following:
- `cuda102` if you have version 10.2 of cuda-toolkit;
- `cuda110` if you have version 11.0 of cuda-toolkit;
- `cuda111` if you have version 11.1 of cuda-toolkit;
- `cuda11x` if you have version 11.2 - 11.8 of cuda-toolkit;
- `cuda12x` if you have version 12.x of cuda-toolkit.
### Install from source
#### Basic installation:
Check out the repo and navigate to its root directory. Then run
```
poetry install
```
#### Installation with GPU support
You need to have cuda-toolkit installed and you need to know its version.
To find it out, you can run
```
nvidia-smi
```
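and read the CUDA version from the top right corner of the table that is printed out.
Note that `nvidia-smi` reports the highest CUDA version supported by the installed driver;
if `nvcc` is on your path, `nvcc --version` reports the version of the toolkit itself.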
Once you know your version of `cuda`, do
```
poetry install -E cudaXXX
```
where `cudaXXX` is one of the following:
- `cuda102` if you have version 10.2;
- `cuda110` if you have version 11.0;
- `cuda111` if you have version 11.1;
- `cuda11x` if you have version 11.2 - 11.8;
- `cuda12x` if you have version 12.x.
This matches the [installation guide of CuPy](https://docs.cupy.dev/en/stable/install.html#installing-cupy).
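
If you script your environment setup, the version-to-extra mapping listed above can be encoded in a small helper. The sketch below is only an illustration: the function `hisel_extra_for_cuda` is hypothetical and is not part of `hisel`; it simply restates the list above in code.

```python
def hisel_extra_for_cuda(version: str) -> str:
    """Map a cuda-toolkit version such as '11.4' to the matching extra name.

    Hypothetical helper, not part of hisel: it mirrors the list above.
    """
    parts = version.split(".")
    major = int(parts[0])
    # Treat a bare major version such as '12' as minor version 0.
    minor = int(parts[1]) if len(parts) > 1 else 0
    if (major, minor) == (10, 2):
        return "cuda102"
    if (major, minor) == (11, 0):
        return "cuda110"
    if (major, minor) == (11, 1):
        return "cuda111"
    if major == 11 and 2 <= minor <= 8:
        return "cuda11x"
    if major == 12:
        return "cuda12x"
    raise ValueError(f"no extra listed for cuda-toolkit {version}")


print(f"pip install hisel[{hisel_extra_for_cuda('11.4')}]")
# prints: pip install hisel[cuda11x]
```

Versions that do not appear in the list raise a `ValueError` rather than falling back to a guess, so an unsupported toolkit is flagged early.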