Ecosyste.ms: Awesome

An open API service indexing awesome lists of open source software.

Awesome Lists | Featured Topics | Projects

https://github.com/ulf1/maxjoshua

Feature selection for hard voting classifier and NN sparse weight initialization.
https://github.com/ulf1/maxjoshua

binary-classification bootstrapping ensemble-learning feature-selection hard-voting-classifier neural-network-initialization python sparse-neural-network

Last synced: 23 days ago
JSON representation

Feature selection for hard voting classifier and NN sparse weight initialization.

Host: GitHub
URL: https://github.com/ulf1/maxjoshua
Owner: ulf1
License: apache-2.0
Created: 2019-06-29T12:37:47.000Z (over 5 years ago)
Default Branch: main
Last Pushed: 2023-08-12T19:48:45.000Z (over 1 year ago)
Last Synced: 2024-10-07T16:03:58.707Z (about 1 month ago)
Topics: binary-classification, bootstrapping, ensemble-learning, feature-selection, hard-voting-classifier, neural-network-initialization, python, sparse-neural-network
Language: Python
Homepage:
Size: 821 KB
Stars: 2
Watchers: 3
Forks: 1
Open Issues: 0
Metadata Files:
- Readme: README.md
- Funding: .github/FUNDING.yml
- License: LICENSE

Awesome Lists containing this project

README

        [![PyPI version](https://badge.fury.io/py/maxjoshua.svg)](https://badge.fury.io/py/maxjoshua)

# maxjoshua

Feature selection for hard voting classifier and NN sparse weight initialization.

## Preface

I am naming this software package in memory of my late nephew Max Joshua Hamster (* 2005 to † June 18, 2022).

## Usage

### Forward Selection for Hard Voting Classifier

Load toy data set and convert features to binary.

```py

from sklearn.datasets import load_breast_cancer

from sklearn.preprocessing import scale

X = scale(load_breast_cancer().data, axis=0) > 0  # convert to binary features

y = load_breast_cancer().target

```

Select binary features. Each row in the `results` list contains the `n_select` column indices of `X`, the notice if the binary features were negated, and the sum of absolute MCC correlation coeffcients between the selected features.

```py

import maxjoshua as mh

idx, neg, rho, results = mh.binsel(

    X, y, preselect=0.8, oob_score=True, subsample=0.5, 

    n_select=5, unique=True, n_draws=100, random_state=42)

```

**Algorithm**. 

The task is to select e.g. `n_select` features from a pool of many features.

These features might be the prediction of binary classifiers. 

The selected features are then combined into one hard-voting classifier.

A voting classifier should have the following properties

* each voter (a binary feature) should be highly correlated to the target variable

* the selected features should be uncorrelated.

The algorithm works as follows 

1. Generate multiple correlation matrices by bootstrapping. This includes `corr(X_i, X_j)` as well as `corr(Y, X_i)` computation. Also store the oob samples for evaluation.

2. For each correlation matrix do ...

    a. Preselect the `i*` with the highest `abs(corr(Y, X_i))` estimates (e.g. pick the `n_pre=?` highest absolute correlations)

    b. Slice a correlation matrix `corr(X_i*, X_j*)` and find the least correlated combination of `n_select` features. (see [`korr.mincorr`](https://github.com/kmedian/korr/blob/master/korr/mincorr.py))

    c. Compute the out-of-bag (OOB) performance (see step 1) of the hard-voter with the selected `n_select=?` features

3. Select the feature combination with the best OOB performance as final model.

### Forward Selection for Linear Regression

Load toy dataset.

```py

from sklearn.preprocessing import scale

from sklearn.datasets import fetch_california_housing

housing = fetch_california_housing()

X = scale(housing["data"], axis=0)

y = scale(housing["target"])

```

Select real-numbered features. Each row in the `results` list contains the `n_select` column indices of `X`, the ridge regression coefficents `beta` and the RMSE `loss`.

Warning! Please note that the features `X` and the target `y` must be scaled because `mh.fltsel` uses an L2-penalty on `beta` coefficients, and doesn't used an intercept term to shift `y`.

```py

import maxjoshua as mh

from sklearn.preprocessing import scale

idx, beta, loss, results = mh.fltsel(

    scale(X), scale(y), preselect=0.8, oob_score=True, subsample=0.5, 

    n_select=5, unique=True, n_draws=100, random_state=42, l2=0.01)

```

### Initialize Sparse NN Layer

The idea is to run `mh.fltsel` to generate an ensemble of linear models, and combine them in a sparse linear neural network layer, i.e., the number of output neurons is the ensemble size.

In case of small datasets, the sparse NN layer is non-trainable because because each submodel was already estimated and selected with two-way data splits in `mh.fltsel` (see `oob_scores` and `subsample`). 

The sparse NN layers basically produces submodel predictions for meta model in the next layer, i.e., a simple dense linear layer.

The inputs of the sparse NN layer must be normalized for which a layer normalization layers is trained.

```py

import maxjoshua as mh

import tensorflow as tf

import sklearn.preprocessing

# create toy dataset

import sklearn.datasets

X, y = sklearn.datasets.make_regression(

    n_samples=1000, n_features=100, n_informative=20, n_targets=3)

# feature selection

# - always scale the inputs and targets -

indices, values, num_in, num_out = mh.pretrain_submodels(

    sklearn.preprocessing.scale(X), 

    sklearn.preprocessing.scale(y), 

    num_out=64, n_select=3)

# specify the model

model = tf.keras.models.Sequential([

    # sub-models

    mh.SparseLayerAsEnsemble(

        num_in=num_in, 

        num_out=num_out, 

        sp_indices=indices, 

        sp_values=values,

        sp_trainable=False,

        norm_trainable=True,

    ),

    # meta model

    tf.keras.layers.Dense(

        units=3, use_bias=False,

        # kernel_constraint=tf.keras.constraints.NonNeg()

    ),

    # scale up

    mh.InverseTransformer(

        units=3,

        init_bias=y.mean(), 

        init_scale=y.std()

    )

])

model.compile(

    optimizer=tf.keras.optimizers.Adam(

        learning_rate=3e-4, beta_1=.9, beta_2=.999, epsilon=1e-7, amsgrad=True),

    loss='mean_squared_error'

)

# train

history = model.fit(X, y, epochs=3)

```

## Appendix

### Installation

The `maxjoshua` [git repo](http://github.com/ulf1/maxjoshua) is available as [PyPi package](https://pypi.org/project/maxjoshua)

```sh

pip install maxjoshua

```

### Install a virtual environment

```sh

python3.7 -m venv .venv

source .venv/bin/activate

pip install --upgrade pip

pip install -r requirements.txt

pip install -r requirements-dev.txt

pip install -r requirements-demo.txt

```

(If your git repo is stored in a folder with whitespaces, then don't use the subfolder `.venv`. Use an absolute path without whitespaces.)

### Python commands

* Jupyter for the examples: `jupyter lab`

* Check syntax: `flake8 --ignore=F401 --exclude=$(grep -v '^#' .gitignore | xargs | sed -e 's/ /,/g')`

* Run Unit Tests: `pytest`

Publish

```sh

pandoc README.md --from markdown --to rst -s -o README.rst

python setup.py sdist 

twine upload -r pypi dist/*

```

### Clean up 

```

find . -type f -name "*.pyc" | xargs rm

find . -type d -name "__pycache__" | xargs rm -r

rm -r .venv

```

## Support

Please [open an issue](https://github.com/ulf1/maxjoshua/issues/new) for support.

## Contributing

Please contribute using [Github Flow](https://guides.github.com/introduction/flow/). Create a branch, add commits, and [open a pull request](https://github.com/ulf1/maxjoshua/compare/).