# feature_transform ![CI](https://github.com/kengz/feature_transform/workflows/CI/badge.svg)

Build ColumnTransformers ([Scikit](https://scikit-learn.org/stable/modules/generated/sklearn.compose.ColumnTransformer.html#sklearn.compose.ColumnTransformer) or [DaskML](https://ml.dask.org/modules/generated/dask_ml.compose.ColumnTransformer.html)) for feature transformation by specifying configs.

> For quickly building PyTorch models, see also [TorchArc](https://github.com/kengz/torcharc).

## Installation

```bash
pip install feature_transform
```

Installing this also installs Scikit-Learn; if you need parallelization, install Dask-ML separately:

```bash
pip install dask-ml
```

## Usage

The ColumnTransformer class of [Scikit](https://scikit-learn.org/stable/modules/generated/sklearn.compose.ColumnTransformer.html#sklearn.compose.ColumnTransformer) / [DaskML](https://ml.dask.org/modules/generated/dask_ml.compose.ColumnTransformer.html) allows us to build a complex pipeline of feature preprocessors/transformers that takes a dataframe as input and outputs numpy arrays. However, using it requires maintaining Python code.
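For comparison, here is roughly what that hand-written code looks like for the iris example used below (a plain scikit-learn sketch; the `sparse` keyword matches the older scikit-learn API used in this README, newer releases call it `sparse_output`):

```python
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

x_cols = ['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)', 'petal width (cm)']
# one ColumnTransformer per "mode": x gets scaled features, y gets a one-hot target
x_col_transfmr = ColumnTransformer(
    [(col, make_pipeline(StandardScaler()), [col]) for col in x_cols], sparse_threshold=0)
y_col_transfmr = ColumnTransformer(
    [('target', make_pipeline(OneHotEncoder(sparse=False, handle_unknown='ignore')), ['target'])],
    sparse_threshold=0)
```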

This project started with the vision of building the entire feature transformation pipeline by just specifying which preprocessors to apply to each of a dataframe's columns.

For example, take the iris dataset with columns `sepal length (cm), sepal width (cm), petal length (cm), petal width (cm), target`. We want the first 4 columns to be the features of our input `x`, each passed through a `StandardScaler`, and `target` to be the feature of our output `y`, one-hot encoded. We then use this spec directly to fit-transform the iris dataframe and obtain numpy arrays `xs, ys`. Here's the code:

```python
from feature_transform import transform
from sklearn import datasets
import pandas as pd

# specify transform for each feature
spec = {
    'dataset': {
        'transform': {'module': 'sklearn', 'n_jobs': 1}
    },
    'transform': {
        'x': {  # the "mode"
            'sepal length (cm)': {'StandardScaler': None},  # the column name and its {preprocessor: kwargs, ...}
            'sepal width (cm)': {'StandardScaler': None},
            'petal length (cm)': {'StandardScaler': None},
            'petal width (cm)': {'StandardScaler': None},
        },
        'y': {
            'target': {'OneHotEncoder': {'sparse': False, 'handle_unknown': 'ignore'}}
        }
    }
}

# load iris dataframe
data_df = pd.concat(datasets.load_iris(return_X_y=True, as_frame=True), axis=1)
# transform into numpy arrays ready for model
mode2data = transform.fit_transform(spec, stage='fit', df=data_df)
xs, ys = mode2data['x'], mode2data['y']

# to reload the fitted transformers for validation/test, specify stage='validate' or 'test'
val_df = data_df.copy()
mode2val_data = transform.fit_transform(spec, stage='validate', df=val_df)
val_xs, val_ys = mode2val_data['x'], mode2val_data['y']

# artifacts to get the column transformers and transformed names directly
artifacts = transform.get_artifacts(spec)
artifacts['mode2col_transfmr']
# {'x': ColumnTransformer(n_jobs=1, sparse_threshold=0, transformers=[('sepal length (cm)', Pipeline(steps=[('standardscaler',...

artifacts['mode2transformed_names']
# {'x': ['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)', 'petal width (cm)'],
# 'y': ['target_0', 'target_1', 'target_2']}
```
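The transformed names line up with the output columns, so they can be used, for instance, to wrap the arrays back into labeled dataframes (a small usage sketch building on the variables above):

```python
x_names = artifacts['mode2transformed_names']['x']
y_names = artifacts['mode2transformed_names']['y']
xs_df = pd.DataFrame(xs, columns=x_names)
ys_df = pd.DataFrame(ys, columns=y_names)
```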

What happens in the background is as follows:

- for each `mode` in `spec.transform`:
  - for each `column` in the mode, create a pipeline of `[preprocessor(**kwargs), ...]`, and compose these into a `ColumnTransformer` for the mode (roughly as sketched below).
- during `fit_transform`, each mode runs its `ColumnTransformer.fit_transform`,
- then it saves the fitted `ColumnTransformer` to `./data/{hash}-{mode}-col_transfmr.pkl`.
- these filenames are logged; they are the files that `transform.get_artifacts` loads for uses such as validation/test.
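Conceptually, each mode's `ColumnTransformer` is composed roughly like this (a simplified sketch in plain scikit-learn; `build_mode_col_transfmr` is a hypothetical helper name, the actual logic lives in [feature_transform/transform.py](./feature_transform/transform.py) and also handles DaskML and custom preprocessors):

```python
from sklearn import preprocessing
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import make_pipeline

def build_mode_col_transfmr(mode_spec: dict, n_jobs=None) -> ColumnTransformer:
    # mode_spec maps column name -> {preprocessor_name: kwargs_or_None, ...}
    transformers = []
    for col, preprocs in mode_spec.items():
        steps = [getattr(preprocessing, name)(**(kwargs or {})) for name, kwargs in preprocs.items()]
        transformers.append((col, make_pipeline(*steps), [col]))
    return ColumnTransformer(transformers, n_jobs=n_jobs, sparse_threshold=0)

x_col_transfmr = build_mode_col_transfmr(spec['transform']['x'], n_jobs=spec['dataset']['transform']['n_jobs'])
```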

### Using YAML config

The goal of this library is to make feature transformation configuration-driven, so let's do the same as above but with a YAML config file. The spec format is:

```yaml
dataset:
  transform:
    module: {str} # options: 'sklearn' (serial-row) or 'dask_ml' (parallel-row)
    n_jobs: {null|int} # parallelization; -1 to use all cores
transform:
  {mode}:
    {column}:
      {preprocessor}: {null|kwargs} # optional kwargs for preprocessor
      {preprocessor}: {null|kwargs}
      ...
```

The `{preprocessor}` key can be the name of any preprocessor class from [Scikit](https://scikit-learn.org/stable/modules/classes.html#module-sklearn.preprocessing) or [DaskML](https://ml.dask.org/modules/api.html#module-dask_ml.preprocessing). Additional custom ones are also registered in [feature_transform/transform.py](./feature_transform/transform.py).

For example, the earlier spec can be rewritten in YAML as:

```yaml
# transform.yaml
dataset:
  transform:
    module: sklearn
    n_jobs: null
transform:
  x:
    sepal length (cm):
      StandardScaler:
    sepal width (cm):
      StandardScaler:
    petal length (cm):
      StandardScaler:
    petal width (cm):
      StandardScaler:
  y:
    target:
      OneHotEncoder:
        sparse: false
        handle_unknown: ignore
```

Now, our code simplifies to:

```python
from feature_transform import transform, util
from sklearn import datasets
import pandas as pd

# convenient method to read YAML
spec = util.read('transform.yaml')
# load iris dataframe
data_df = pd.concat(datasets.load_iris(return_X_y=True, as_frame=True), axis=1)
# transform into numpy arrays ready for model
mode2data = transform.fit_transform(spec, stage='fit', df=data_df)
xs, ys = mode2data['x'], mode2data['y']

# to reload the fitted transformers for validation/test, specify stage='validate' or 'test'
val_df = data_df.copy()
mode2val_data = transform.fit_transform(spec, stage='validate', df=val_df)
val_xs, val_ys = mode2val_data['x'], mode2val_data['y']
```

### Chain Preprocessors

To chain multiple preprocessors, simply add more steps:

```yaml
dataset:
  transform:
    module: sklearn
    n_jobs: null
transform:
  x:
    sepal length (cm):
      Log1pScaler: # custom preprocessor for np.log1p
      StandardScaler:
    sepal width (cm):
      Clipper: # custom preprocessor to clip values
        a_min: 0
        a_max: 10
      StandardScaler:
    petal length (cm):
      StandardScaler:
    petal width (cm):
      StandardScaler:
  y:
    target:
      OneHotEncoder:
        sparse: false
        handle_unknown: ignore
```
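For reference, `Log1pScaler`- and `Clipper`-like behavior can be approximated in plain scikit-learn with `FunctionTransformer` (an illustration only; the library's own custom preprocessors are defined in [feature_transform/transform.py](./feature_transform/transform.py)):

```python
import numpy as np
from sklearn.preprocessing import FunctionTransformer

# log1p scaling, invertible via expm1
log1p_scaler = FunctionTransformer(np.log1p, inverse_func=np.expm1)
# clip values into the range [0, 10]
clipper = FunctionTransformer(np.clip, kw_args={'a_min': 0, 'a_max': 10})
```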

### Specify any module

By default the config refers to classes in the `preprocessing` module of sklearn/dask-ml. Use dot-notation to specify other modules:

```yaml
dataset:
  transform:
    module: sklearn
    n_jobs: null
transform:
  x:
    a_float_column:
      StandardScaler:
    a_column_with_dict_values:
      feature_extraction.DictVectorizer:
    a_column_with_na:
      StandardScaler:
      impute.SimpleImputer: # handle na values
        strategy: constant
        fill_value: -1
  y:
    a_target_column:
      Identity:
```
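A dot-notation entry like `feature_extraction.DictVectorizer` presumably resolves against the configured package, with the `preprocessing` module as the default; a rough sketch of such a lookup (hypothetical helper, not the library's actual code):

```python
import importlib

def resolve_preprocessor(package: str, dotted_name: str):
    # 'feature_extraction.DictVectorizer' -> sklearn.feature_extraction.DictVectorizer
    # 'StandardScaler' -> sklearn.preprocessing.StandardScaler (default module)
    *submodules, cls_name = dotted_name.split('.')
    module = '.'.join(submodules) or 'preprocessing'
    return getattr(importlib.import_module(f'{package}.{module}'), cls_name)

resolve_preprocessor('sklearn', 'impute.SimpleImputer')  # -> sklearn.impute.SimpleImputer
```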

### Specify any modes

The modes can be given any names, not just `x` and `y`:

```yaml
dataset:
  transform:
    module: sklearn
    n_jobs: null
transform:
  foo:
    column_foo_1:
      StandardScaler:
    column_foo_2:
      Log1pScaler:
      StandardScaler:
  bar:
    column_bar_1:
      OneHotEncoder:
  baz:
    column_baz_1:
      Identity:
```
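With such a spec, the dict returned by `fit_transform` is keyed by those mode names (a usage sketch, assuming the spec above is saved as the hypothetical file `modes.yaml`):

```python
spec = util.read('modes.yaml')  # hypothetical filename for the spec above
mode2data = transform.fit_transform(spec, stage='fit', df=data_df)
foos, bars, bazs = mode2data['foo'], mode2data['bar'], mode2data['baz']
```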

### Parallelization

> NOTE: run `pip install dask-ml` first.

```yaml
dataset:
  transform:
    module: dask_ml
    n_jobs: -1 # use all cores
transform:
  # ...
```

## ML Examples

### PyTorch DataLoader

```python
from feature_transform import transform, util
from sklearn import datasets
from torch.utils.data import TensorDataset, DataLoader
import pandas as pd
import torch

spec = util.read('transform.yaml')
# load iris dataframe
data_df = pd.concat(datasets.load_iris(return_X_y=True, as_frame=True), axis=1)
# transform into numpy arrays ready for model
mode2data = transform.fit_transform(spec, stage='fit', df=data_df)
xs, ys = mode2data['x'], mode2data['y']

train_dataset = TensorDataset(torch.from_numpy(xs), torch.from_numpy(ys)) # create your dataset
train_dataloader = DataLoader(train_dataset) # create your dataloader

# suppose this is test/validation set; use stage='validate' or stage='test' to transform
val_df = data_df.copy()
mode2val_data = transform.fit_transform(spec, stage='validate', df=val_df)
val_xs, val_ys = mode2val_data['x'], mode2val_data['y']
val_dataset = TensorDataset(torch.from_numpy(val_xs), torch.from_numpy(val_ys))
val_dataloader = DataLoader(val_dataset) # create your dataloader
```
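From here, the dataloaders plug into any PyTorch training loop. A minimal sketch (an illustration only; the model architecture is arbitrary and not part of feature_transform):

```python
import torch.nn as nn

model = nn.Sequential(nn.Linear(xs.shape[1], 16), nn.ReLU(), nn.Linear(16, ys.shape[1]))
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

for epoch in range(10):
    for batch_x, batch_y in train_dataloader:
        optimizer.zero_grad()
        logits = model(batch_x.float())
        # ys is one-hot encoded, so recover class indices for CrossEntropyLoss
        loss = loss_fn(logits, batch_y.argmax(dim=1))
        loss.backward()
        optimizer.step()
```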

### Scikit Learn example

```python
from feature_transform import transform, util
from sklearn import datasets, metrics
from sklearn.tree import DecisionTreeClassifier
import pandas as pd

spec = util.read('transform.yaml')
# load iris dataframe
data_df = pd.concat(datasets.load_iris(return_X_y=True, as_frame=True), axis=1)
# transform into numpy arrays ready for model
mode2data = transform.fit_transform(spec, stage='fit', df=data_df)
xs, ys = mode2data['x'], mode2data['y']

# train model
model = DecisionTreeClassifier(max_depth=3, random_state=1)
model.fit(xs, ys)
pred_ys = model.predict(xs)
print(f'train accuracy: {metrics.accuracy_score(pred_ys, ys):.3f}')
# train accuracy: 0.973

# suppose this is validation/test data; use stage='validate' or stage='test'
test_df = data_df.copy()
mode2test_data = transform.fit_transform(spec, stage='test', df=test_df)
test_xs, test_ys = mode2test_data['x'], mode2test_data['y']
pred_ys = model.predict(test_xs)
print(f'test accuracy: {metrics.accuracy_score(pred_ys, test_ys):.3f}')
# test accuracy: 0.973
```

## Development

### Setup

```bash
# install the dev dependencies
bin/setup
# activate Conda environment
conda activate transform
```

### Unit Tests

```bash
python setup.py test
```