https://github.com/simon-larsson/extrakit-learn

scikit-learn inspired extensions for machine learning
Host: GitHub
URL: https://github.com/simon-larsson/extrakit-learn
Owner: simon-larsson
License: mit
Created: 2019-04-25T12:27:53.000Z (over 5 years ago)
Default Branch: master
Last Pushed: 2023-05-31T18:17:00.000Z (over 1 year ago)
Last Synced: 2024-10-13T17:22:14.393Z (3 months ago)
Language: Python
Size: 85.9 KB
Stars: 4
Watchers: 2
Forks: 3
Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
README

# extrakit-learn

[![PyPI version](https://badge.fury.io/py/xklearn.svg)](https://pypi.python.org/pypi/xklearn/) 

[![License](https://img.shields.io/badge/license-MIT-blue.svg)](https://github.com/simon-larsson/extrakit-learn/blob/master/LICENSE)

Machine learning components built to extend scikit-learn. All components use scikit's [object API](https://scikit-learn.org/stable/developers/contributing.html#apis-of-scikit-learn-objects) so that they work interchangeably with scikit components. It is mostly a collection of tools that have been useful for [Kaggle](https://www.kaggle.com) competitions.

## Installation

    pip install xklearn

## Components

- [CategoryEncoder](https://github.com/simon-larsson/extrakit-learn#categoryencoder) - Like scikit's LabelEncoder but supports NaNs and unseen values.

- [CountEncoder](https://github.com/simon-larsson/extrakit-learn#countencoder) - Categorical feature engineering on a column based on value counts.

- [TargetEncoder](https://github.com/simon-larsson/extrakit-learn#targetencoder) - Categorical feature engineering on a column based on target means.

- [MultiColumnEncoder](https://github.com/simon-larsson/extrakit-learn#multicolumnencoder) - Apply a column encoder to multiple columns.

- [FoldEstimator](https://github.com/simon-larsson/extrakit-learn#foldestimator) - K-fold ensemble of scikit estimators wrapped into an estimator.

- [FoldLightGBM](https://github.com/simon-larsson/extrakit-learn#foldlightgbm) - K-fold ensemble of LGBMs wrapped into an estimator.

- [FoldXGBoost](https://github.com/simon-larsson/extrakit-learn#foldxgboost) - K-fold ensemble of XGBoosts wrapped into an estimator.

- [StackClassifier](https://github.com/simon-larsson/extrakit-learn#stackclassifier) - Stack an ensemble of classifiers with a meta classifier.

- [StackRegressor](https://github.com/simon-larsson/extrakit-learn#stackregressor) - Stack an ensemble of regressors with a meta regressor.

- [compress_dataframe](https://github.com/simon-larsson/extrakit-learn#compress_dataframe) - Reduce memory of a Pandas dataframe.

### Hierarchy

    xklearn
    │
    ├── preprocessing
    │   ├── CategoryEncoder
    │   ├── CountEncoder
    │   ├── TargetEncoder
    │   └── MultiColumnEncoder
    │
    ├── models
    │   ├── FoldEstimator
    │   ├── FoldLightGBM
    │   ├── FoldXGBoost
    │   ├── StackClassifier
    │   └── StackRegressor
    │
    └── utils

##### Example

    from xklearn.models import FoldEstimator

### CategoryEncoder

Wraps scikit's LabelEncoder, allowing missing and unseen values to be handled.

#### Arguments

`unseen` - Strategy for handling unseen values. See replacement strategies below for options.

`missing` - Strategy for handling missing values. See replacement strategies below for options.

##### Replacement strategies

`'encode'` - Replace value with -1.

`'nan'` - Replace value with np.nan.

`'error'` - Raise ValueError.

#### Example

```python
from xklearn.preprocessing import CategoryEncoder
...
ce = CategoryEncoder(unseen='nan', missing='nan')
X[:, 0] = ce.fit_transform(X[:, 0])
```
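
The replacement strategies can be illustrated with a small pure-Python sketch. This is not the library's implementation: the helper `encode_labels` and everything inside it are hypothetical, and only the strategy names and their effects are taken from the description above.

```python
import math

def encode_labels(train, values, unseen="encode", missing="encode"):
    """Hypothetical helper: integer-encode `values` using classes seen in `train`."""
    def is_missing(v):
        return v is None or (isinstance(v, float) and math.isnan(v))

    def replace(strategy, kind, value):
        if strategy == "encode":
            return -1
        if strategy == "nan":
            return math.nan
        if strategy == "error":
            raise ValueError(f"{kind} value: {value!r}")
        raise ValueError(f"unknown strategy: {strategy!r}")

    # Sorted unique classes (as LabelEncoder does), ignoring missing values
    classes = sorted({v for v in train if not is_missing(v)})
    index = {c: i for i, c in enumerate(classes)}

    out = []
    for v in values:
        if is_missing(v):
            out.append(replace(missing, "missing", v))
        elif v not in index:
            out.append(replace(unseen, "unseen", v))
        else:
            out.append(index[v])
    return out

train = ["cat", "dog", None, "cat"]
encoded = encode_labels(train, ["dog", "bird", None], unseen="encode", missing="nan")
print(encoded)  # [1, -1, nan]
```

Here `"bird"` was never seen during training and becomes -1, while the missing value becomes NaN because the two strategies are set independently.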

### CountEncoder

Replaces each categorical value with the number of times it occurred in the training data. Classes that occurred only once during training and classes first seen during prediction are encoded as either one or NaN.

#### Arguments

`unseen` - Strategy for handling unseen values. See replacement strategies below for options.

`missing` - Strategy for handling missing values. See replacement strategies below for options.

##### Replacement strategies

`'one'` - Replace value with 1.

`'nan'` - Replace value with np.nan.

`'error'` - Raise ValueError.

#### Example

```python
from xklearn.preprocessing import CountEncoder
...
ce = CountEncoder(unseen='one')
X[:, 0] = ce.fit_transform(X[:, 0])
```
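
To make the idea of count encoding concrete, here is a minimal standalone sketch. The functions `fit_counts` and `transform_counts` are hypothetical names used for illustration and do not reflect how CountEncoder is written internally.

```python
from collections import Counter
import math

def fit_counts(column):
    """Count how often each category occurs in the training column."""
    return Counter(column)

def transform_counts(counts, column, unseen="one"):
    """Replace each category by its training count; unseen values follow the strategy."""
    out = []
    for v in column:
        if v in counts:
            out.append(counts[v])
        elif unseen == "one":
            out.append(1)
        elif unseen == "nan":
            out.append(math.nan)
        else:
            raise ValueError(f"unseen value: {v!r}")
    return out

train_col = ["a", "b", "a", "c", "a", "b"]
counts = fit_counts(train_col)
train_encoded = transform_counts(counts, train_col)
test_encoded = transform_counts(counts, ["b", "d"], unseen="one")
print(train_encoded)  # [3, 2, 3, 1, 3, 2]
print(test_encoded)   # [2, 1]
```

Note that `"c"` (seen once) and `"d"` (never seen) end up with the same code under the `'one'` strategy.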

### TargetEncoder

Performs target mean encoding of categorical features with optional smoothing.

#### Arguments

`smoothing` - Smoothing weight.

`unseen` - Strategy for handling unseen values. See replacement strategies below for options.

`missing` - Strategy for handling missing values. See replacement strategies below for options.

##### Replacement strategies

`'global'` - Replace value with global target mean.

`'nan'` - Replace value with np.nan.

`'error'` - Raise ValueError.

#### Example

```python
from xklearn.preprocessing import TargetEncoder
...
te = TargetEncoder(smoothing=10)
X[:, 0] = te.fit_transform(X[:, 0], y)
```
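
The README does not state the smoothing formula, so the sketch below assumes a common additive form, `(n * category_mean + smoothing * global_mean) / (n + smoothing)`, where `n` is the number of training rows in the category. The function names are hypothetical and the library may compute this differently.

```python
from collections import defaultdict

def fit_target_means(categories, targets, smoothing=0.0):
    """Return (smoothed per-category means, global mean) under additive smoothing."""
    global_mean = sum(targets) / len(targets)
    sums = defaultdict(float)
    counts = defaultdict(int)
    for c, t in zip(categories, targets):
        sums[c] += t
        counts[c] += 1
    # Assumed formula: (n * category_mean + smoothing * global_mean) / (n + smoothing)
    means = {c: (sums[c] + smoothing * global_mean) / (counts[c] + smoothing)
             for c in counts}
    return means, global_mean

def transform_target(means, global_mean, categories):
    """Unseen categories fall back to the global mean (the 'global' strategy)."""
    return [means.get(c, global_mean) for c in categories]

cats = ["a", "a", "b", "b", "b", "c"]
y = [1, 0, 1, 1, 1, 0]
means, global_mean = fit_target_means(cats, y, smoothing=2)
print(means)
print(transform_target(means, global_mean, ["a", "z"]))
```

With a larger smoothing weight, categories with few rows are pulled more strongly toward the global mean, which is the usual motivation for smoothing target encodings.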

### MultiColumnEncoder

Applies a column encoder over multiple columns.

#### Arguments

`enc` - Base encoder that will be applied to the selected columns.

`columns` - Column selection: a boolean mask, a list of indices, or None (default=None).

#### Example

```python
from xklearn.preprocessing import CountEncoder
from xklearn.preprocessing import MultiColumnEncoder
...
columns = [1, 3, 4]
enc = CountEncoder()
mce = MultiColumnEncoder(enc, columns)
X = mce.fit_transform(X)
```
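
As a rough sketch of what applying one encoder over several columns involves, the example below copies a base encoder once per selected column and fits each copy separately. `ToyCountEncoder` and `encode_columns` are hypothetical stand-ins, and treating `None` as "all columns" is an assumption rather than documented behaviour.

```python
from collections import Counter
from copy import deepcopy

class ToyCountEncoder:
    """Hypothetical stand-in for a column encoder exposing fit_transform."""
    def fit_transform(self, column):
        self.counts_ = Counter(column)
        return [self.counts_[v] for v in column]

def encode_columns(enc, rows, columns=None):
    """Apply a fresh copy of `enc` to each selected column of a row-major table."""
    n_cols = len(rows[0])
    if columns is None:
        selected = list(range(n_cols))  # assumption: None selects every column
    elif all(isinstance(c, bool) for c in columns):
        selected = [i for i, keep in enumerate(columns) if keep]  # boolean mask
    else:
        selected = list(columns)  # explicit column indices

    out = [list(r) for r in rows]  # copy so the input stays untouched
    encoders = {}
    for j in selected:
        encoders[j] = deepcopy(enc)  # one independently fitted encoder per column
        encoded = encoders[j].fit_transform([r[j] for r in rows])
        for i, value in enumerate(encoded):
            out[i][j] = value
    return out, encoders

rows = [["x", 10, "u"], ["y", 20, "u"], ["x", 30, "v"]]
encoded_rows, encoders = encode_columns(ToyCountEncoder(), rows, columns=[0, 2])
print(encoded_rows)  # [[2, 10, 2], [1, 20, 2], [2, 30, 1]]
```

Keeping one fitted encoder per column means each column's statistics stay separate, so the same category label in two columns is never mixed up.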

### FoldEstimator

K-fold cross validation wrapped into an estimator: calling fit automatically runs cross validation over the selected folding method. After fit, it can optionally be used as a stacked ensemble of the k fold estimators.

#### Arguments

`est` - Base estimator.

`fold` - Folding cross validation object, e.g. KFold or StratifiedKFold.

`metric` - Evaluation metric.

`refit_full` - Flag indicating post-fit behaviour. True will refit on the full data, False will make it a stacked ensemble trained on the different folds.

`verbose` - Flag for printing fold scores during fit.

#### Example

```python
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import KFold
from xklearn.models import FoldEstimator
...
base = RandomForestRegressor(n_estimators=10)
fold = KFold(n_splits=5)
est = FoldEstimator(base, fold=fold, metric=mean_squared_error, verbose=1)
est.fit(X_train, y_train)
est.predict(X_test)
```

Output:

```
Finished fold 1 with score: 200.8023
Finished fold 2 with score: 261.2365
Finished fold 3 with score: 169.2404
Finished fold 4 with score: 186.7915
Finished fold 5 with score: 205.0894
Finished with a total score of: 204.6813
```
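
The fold-then-ensemble behaviour described above can be sketched in plain Python. Everything here is illustrative: `MeanRegressor`, `fold_fit` and `fold_predict` are hypothetical names, and combining the k fold models by a plain average is an assumption, since the README does not say how the ensemble's predictions are merged.

```python
from copy import deepcopy

class MeanRegressor:
    """Toy base estimator: always predicts the mean of its training targets."""
    def fit(self, X, y):
        self.mean_ = sum(y) / len(y)
        return self

    def predict(self, X):
        return [self.mean_ for _ in X]

def mse(y_true, y_pred):
    return sum((a - b) ** 2 for a, b in zip(y_true, y_pred)) / len(y_true)

def k_fold_indices(n, k):
    """Contiguous, unshuffled folds: yields (train_idx, val_idx) pairs."""
    sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    start = 0
    for size in sizes:
        val = list(range(start, start + size))
        train = [i for i in range(n) if i < start or i >= start + size]
        yield train, val
        start += size

def fold_fit(base, X, y, k, metric, refit_full=False):
    """Fit one copy of `base` per fold and score it on the held-out part."""
    models, scores = [], []
    for train, val in k_fold_indices(len(X), k):
        model = deepcopy(base).fit([X[i] for i in train], [y[i] for i in train])
        scores.append(metric([y[i] for i in val], model.predict([X[i] for i in val])))
        models.append(model)
    if refit_full:
        models = [deepcopy(base).fit(X, y)]  # keep a single model trained on all data
    return models, scores

def fold_predict(models, X):
    """Average the predictions of all retained models (assumed combination rule)."""
    preds = [m.predict(X) for m in models]
    return [sum(col) / len(col) for col in zip(*preds)]

X = [[i] for i in range(6)]
y = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
models, scores = fold_fit(MeanRegressor(), X, y, k=3, metric=mse)
print(len(models), scores)          # 3 [9.25, 0.25, 9.25]
print(fold_predict(models, [[0]]))  # [3.5]
```

With `refit_full=True` the fold scores are still reported, but prediction uses one model trained on all rows instead of the k fold models.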

### FoldLightGBM

K-fold cross validation wrapped into an estimator: calling fit automatically runs cross validation of an LGBM over the selected folding method. After fit, it can optionally be used as a stacked ensemble of the k fold estimators.

#### Arguments

`lgbm` - Base estimator.

`fold` - Folding cross validation object, e.g. KFold or StratifiedKFold.

`metric` - Evaluation metric.

`fit_params` - Dictionary of parameters that should be fed to the fit method.

`refit_full` - Flag indicating post-fit behaviour. True will refit on the full data, False will make it a stacked ensemble trained on the different folds.

`refit_params` - Dictionary of parameters that should be fed to the refit if refit_full=True.

`verbose` - Flag for printing fold scores during fit.

#### Example

```python
from lightgbm import LGBMClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import KFold
from xklearn.models import FoldLightGBM
...
base = LGBMClassifier(n_estimators=1000)
fold = KFold(n_splits=5)
fit_params = {'eval_metric': 'auc',
              'early_stopping_rounds': 50,
              'verbose': 0}

fold_lgbm = FoldLightGBM(base,
                         fold=fold,
                         metric=roc_auc_score,
                         fit_params=fit_params,
                         verbose=1)

fold_lgbm.fit(X_train, y_train)
fold_lgbm.predict(X_test)
```

Output:

```
Finished fold 1 with score: 0.9114
Finished fold 2 with score: 0.9265
Finished fold 3 with score: 0.9419
Finished fold 4 with score: 0.9189
Finished fold 5 with score: 0.9152
Finished with a total score of: 0.9225
```

### FoldXGBoost

K-fold cross validation wrapped into an estimator: calling fit automatically runs cross validation of an XGBoost model over the selected folding method. After fit, it can optionally be used as a stacked ensemble of the k fold estimators.

#### Arguments

`xgb` - Base estimator.

`fold` - Folding cross validation object, e.g. KFold or StratifiedKFold.

`metric` - Evaluation metric.

`fit_params` - Dictionary of parameters that should be fed to the fit method.

`refit_full` - Flag indicating post-fit behaviour. True will refit on the full data, False will make it a stacked ensemble trained on the different folds.

`refit_params` - Dictionary of parameters that should be fed to the refit if refit_full=True.

`verbose` - Flag for printing fold scores during fit.

#### Example

```python
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import KFold
from xgboost import XGBRegressor
from xklearn.models import FoldXGBoost
...
# "reg:squarederror" is the current name of the deprecated "reg:linear" objective
base = XGBRegressor(objective="reg:squarederror", random_state=42)
fold = KFold(n_splits=5)
# XGBoost has no built-in 'mse' eval metric; 'rmse' is the closest built-in one
fit_params = {'eval_metric': 'rmse',
              'early_stopping_rounds': 5,
              'verbose': 0}

fold_xgb = FoldXGBoost(base,
                       fold=fold,
                       metric=mean_squared_error,
                       fit_params=fit_params,
                       verbose=1)

fold_xgb.fit(X_train, y_train)
fold_xgb.predict(X_test)
```

Output:

```
Finished fold 1 with score: 3212.8362
Finished fold 2 with score: 2179.7843
Finished fold 3 with score: 2707.8460
Finished fold 4 with score: 2988.6643
Finished fold 5 with score: 3281.4299
Finished with a total score of: 3274.9001
```

### StackClassifier

Ensemble classifier that stacks an ensemble of classifiers by using their outputs as input features.

#### Arguments

`clfs` - List of classifiers that make up the ensemble.

`meta_clf` - Meta classifier that stacks the predictions of the ensemble.

`keep_features` - Flag to train the meta classifier on the original features too.

`refit` - Flag to retrain the ensemble of classifiers during fit.

#### Example

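```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import RidgeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from xklearn.models import StackClassifier
...
meta_clf = RidgeClassifier()
ensemble = [RandomForestClassifier(), KNeighborsClassifier(), SVC()]
stack_clf = StackClassifier(clfs=ensemble, meta_clf=meta_clf, refit=True)
stack_clf.fit(X_train, y_train)
y_ = stack_clf.predict(X_test)
```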
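
The stacking mechanism, where base model outputs become the meta model's input features, can be sketched with toy estimators. All names below are hypothetical. For brevity the sketch fits the meta model on in-sample base predictions; whether the library uses out-of-fold predictions instead is not stated in this README.

```python
class ThresholdClassifier:
    """Toy base classifier: predicts 1 when feature `col` exceeds its training mean."""
    def __init__(self, col):
        self.col = col

    def fit(self, X, y):
        self.cut_ = sum(row[self.col] for row in X) / len(X)
        return self

    def predict(self, X):
        return [int(row[self.col] > self.cut_) for row in X]

class NearestNeighbour:
    """Toy meta classifier: returns the label of the closest training row."""
    def fit(self, X, y):
        self.X_, self.y_ = [list(r) for r in X], list(y)
        return self

    def predict(self, X):
        def dist(a, b):
            return sum((p - q) ** 2 for p, q in zip(a, b))
        return [self.y_[min(range(len(self.X_)), key=lambda i: dist(self.X_[i], row))]
                for row in X]

def meta_features(models, X, keep_features=False):
    """One column per base model; optionally prepend the original features."""
    rows = [list(p) for p in zip(*[m.predict(X) for m in models])]
    if keep_features:
        rows = [list(x) + r for x, r in zip(X, rows)]
    return rows

def stack_fit(models, meta, X, y, keep_features=False, refit=True):
    if refit:
        for m in models:
            m.fit(X, y)  # retrain the ensemble before stacking
    meta.fit(meta_features(models, X, keep_features), y)
    return models, meta

def stack_predict(models, meta, X, keep_features=False):
    return meta.predict(meta_features(models, X, keep_features))

X_train = [[0, 5], [1, 4], [4, 1], [5, 0]]
y_train = [0, 0, 1, 1]
models, meta = stack_fit([ThresholdClassifier(0), ThresholdClassifier(1)],
                         NearestNeighbour(), X_train, y_train)
stacked = stack_predict(models, meta, [[0.5, 4.5], [4.5, 0.5]])
print(stacked)  # [0, 1]
```

Setting `keep_features=True` in both calls lets the meta model see the original columns next to the base predictions, mirroring the `keep_features` flag described above.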

### StackRegressor

Ensemble regressor that stacks an ensemble of regressors by using their outputs as input features.

#### Arguments

`regs` - List of regressors that make up the ensemble.

`meta_reg` - Meta regressor that stacks the predictions of the ensemble.

`drop_first` - Drop first class probability to avoid multi-collinearity.

`keep_features` - Flag to train the meta regressor on the original features too.

`refit` - Flag to retrain the ensemble of regressors during fit.

#### Example

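```python
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import Ridge
from sklearn.neighbors import KNeighborsRegressor
from sklearn.svm import SVR
from xklearn.models import StackRegressor
...
# scikit-learn's ridge regressor is named Ridge (there is no RidgeRegressor)
meta_reg = Ridge()
ensemble = [RandomForestRegressor(), KNeighborsRegressor(), SVR()]
stack_reg = StackRegressor(regs=ensemble, meta_reg=meta_reg, refit=True)
stack_reg.fit(X_train, y_train)
y_ = stack_reg.predict(X_test)
```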

### compress_dataframe

Reduce memory usage of a Pandas dataframe by finding columns that use larger variable types than necessary.

#### Arguments

`df` - Dataframe for memory reduction.

`verbose` - Flag for printing result of memory reduction.

#### Example

```python
from xklearn.utils import compress_dataframe
...
train = compress_dataframe(train, verbose=1)
```

Output:

```
Dataframe memory decreased to 169.60 MB (64.6% reduction)
```
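
The core idea, picking the narrowest type that still holds every value in a column, can be sketched without Pandas. The helpers `smallest_int_dtype` and `plan_compression` are hypothetical and only cover signed integers; the library's own rules (floats, unsigned types, categories) are not described in this README.

```python
# Value ranges of the signed integer widths offered by NumPy and Pandas
INT_RANGES = [
    ("int8", -2**7, 2**7 - 1),
    ("int16", -2**15, 2**15 - 1),
    ("int32", -2**31, 2**31 - 1),
    ("int64", -2**63, 2**63 - 1),
]

def smallest_int_dtype(values):
    """Return the name of the narrowest signed integer type that fits all values."""
    lo, hi = min(values), max(values)
    for name, dmin, dmax in INT_RANGES:
        if dmin <= lo and hi <= dmax:
            return name
    raise OverflowError("values do not fit in int64")

def plan_compression(columns):
    """Map each integer column name to the narrowest dtype that holds its values."""
    return {name: smallest_int_dtype(vals) for name, vals in columns.items()}

columns = {"age": [23, 41, 67], "zip": [10115, 90210], "views": [3_000_000_000, 12]}
plan = plan_compression(columns)
print(plan)  # {'age': 'int8', 'zip': 'int32', 'views': 'int64'}
```

A mapping like `plan` has the shape that `DataFrame.astype` accepts (column name to dtype), which is one way such a plan could be applied to a real dataframe.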