https://github.com/firmai/tflm
Advanced Transformations and Interactions for Linear Models using Hybrid Machine Learning Models and SHapley Additive exPlanations
- Host: GitHub
- URL: https://github.com/firmai/tflm
- Owner: firmai
- Created: 2019-12-27T05:14:12.000Z (about 6 years ago)
- Default Branch: master
- Last Pushed: 2020-01-27T03:09:31.000Z (almost 6 years ago)
- Last Synced: 2025-05-05T02:51:40.508Z (8 months ago)
- Language: Python
- Homepage: https://www.linkedin.com/company/firmai
- Size: 266 KB
- Stars: 6
- Watchers: 3
- Forks: 5
- Open Issues: 0
Metadata Files:
- Readme: README.md
# Transformations and Interactions for Linear Models
The first automated package for data-driven feature transformation, interaction, and selection for developing fast linear models.
#### Install
```
pip install tflm
```
The package is simple to use and boils down to a single command.
```python
import tflm
target = "target"  # name of the target column in your Train and Test DataFrames
Train_lin_X, Train_lin_y, Test_lin_X, Test_lin_y = tflm.runner(Train, Test, target)

## If you run into obstacles (e.g. too many features for multivariate regression / low memory),
## adjust the following default parameters lower (between 0.0 and 1.0); for now this is experimental:
## contribution_portion=0.9, final_contribution=0.95, deflator=0.7
## and a few others, very experimental: inter_portion=0.8, sqr_portion=0.8, runs=2
```
Now just feed the transformed data into your linear model:
```python
from sklearn import linear_model
lm = linear_model.LinearRegression()
lm = lm.fit(Train_lin_X, Train_lin_y)
```
For now, this only works with regression problems (continuous targets).
### Description
This advanced and automated model generates and selects feature transformations and interactions. An MLP neural network drives four single-standing transformations (power, log, reciprocal and roots), and a Gradient Boosting Model drives two interaction methods (multiplication and division); both use SHapley Additive exPlanations (SHAP) contribution scores as the selection criterion. The benefit is a set of generated features that imitates the behaviour of neural networks and decision trees, a combination known to have synergistic ensembling properties. The final selection, based on a validation set, uses the Least-Angle Regression (LARS) algorithm.
The number of features can inflate considerably depending on the quality of the interactive effects and the benefit obtained from transformations. Although eight hyperparameters are exposed, for now they are chosen automatically based on data characteristics; in their current development state these parameters are fragile, and on top of that hyperparameter search is extremely expensive with this method. Each iteration passes through four selection filters, and the data can pass through the tflm method multiple times; for now the iterations are internally capped at two. This is an extremely slow algorithm by design: the purpose is to spend a lot of time upfront creating good features, so that in the future you can use a fast linear model instead of a slow non-linear model.
In data analysis, transformation is the replacement of a variable by a function of that variable: for example, replacing a variable x by the square root of x or the logarithm of x. In a stronger sense, a transformation is a replacement that changes the shape of a distribution or relationship. Interactions arise when considering the relationship among three or more variables: an interaction describes a situation in which the effect of one variable on an outcome depends on the state of a second variable. Interaction terms can be created in various ways, such as the product of x and y or the ratio of x and y.
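For illustration, hand-built equivalents of such generated columns look roughly like the following pandas sketch. The column names mirror the naming convention used in the "How" section below (POWER_2, LOG, RECIP, SQRT, X, DIV); the data and the exact "plus 1" interpretation are assumptions for illustration, not the package's internal code.
```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"x": [1.0, 4.0, 9.0, 16.0], "y": [2.0, 3.0, 5.0, 7.0]})

# Single-standing transformations (assumed interpretations of the names)
df["x_POWER_2"] = df["x"] ** 2           # square
df["x_LOG"]     = np.log1p(df["x"])      # log of (x + 1)
df["x_RECIP"]   = 1.0 / df["x"]          # reciprocal (x must be non-zero)
df["x_SQRT"]    = np.sqrt(df["x"] + 1)   # square root of (x + 1)

# Pairwise interactions
df["x_X_y"]   = df["x"] * df["y"]        # multiplication
df["x_DIV_y"] = df["x"] / df["y"]        # division (y must be non-zero)
```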
### Use Cases
1. General Automated Feature Generation for Linear Models and Gradient Boosting Models (LightGBM, CatBoost, XGBoost)
1. Transformation of a Higher-Dimensional Feature Space to a Lower-Dimensional Feature Space
1. Features Automatically Generated and Selected to Imitate the Performance of Non-linear models
1. Fast Linear Models for Situations Where Latency Is an Important Concern
### How
1. **MLP** Neural Network Identifies the Most Important Features for Interaction and Selection
1. All Feature Importance and Feature Interaction Values are Based on **SHAP** (SHapley Additive exPlanations)
1. The Most Important Single-Standing Features are Transformed: **POWER_2** (square), **LOG** (log plus 1), **RECIP** (reciprocal), **SQRT** (square root plus 1)
1. **GBM** Gradient Boosting Model uses the **MLP** Identified Important Features to Select a Subset of Important Interaction Pairs
1. The Most Important Interaction Pairs are Interacted: **a_X_b** (multiplication), **c_DIV_h** (division)
1. All Transformations are Fed as Input into an **MLP** model and Selected to **X%** (default 90%) Feature Contribution
1. The Whole Process is Repeated One More Time So That Higher-Dimensional Interactions Can Take Place, e.g. **a_POWER_b_X_c_DIV_h**
1. Finally, a **Lasso** Regression Selects Features on a Validation Set Using the **LARS** Algorithm (a sketch of the selection idea follows below)
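The following is a minimal sketch of the SHAP-contribution selection idea only, not the package internals. It assumes `shap` and `lightgbm` are installed; the helper name `shap_select` is made up for illustration, and `contribution_portion=0.9` echoes the default mentioned earlier.
```python
import numpy as np
import shap
from lightgbm import LGBMRegressor
from sklearn.linear_model import LassoLars

def shap_select(X_train, y_train, contribution_portion=0.9):
    """Keep the smallest feature set covering the requested share of total mean |SHAP|."""
    model = LGBMRegressor(n_estimators=200).fit(X_train, y_train)
    shap_values = shap.TreeExplainer(model).shap_values(X_train)
    importance = np.abs(shap_values).mean(axis=0)           # mean |SHAP| per feature
    order = np.argsort(importance)[::-1]                    # most important first
    cumulative = np.cumsum(importance[order]) / importance.sum()
    keep = order[: np.searchsorted(cumulative, contribution_portion) + 1]
    return list(X_train.columns[keep])

# cols = shap_select(Train_lin_X, Train_lin_y)
# final_model = LassoLars(alpha=0.01).fit(Train_lin_X[cols], Train_lin_y)  # LARS-based pruning
```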
### To Do
1. Current parameter selection is based on data characteristics; Bayesian hyperparameter optimisation could help.
1. The AutoKeras team told me they are working on an automated model for tabular regression problems.
1. Method for undoing interactions and transformations to identify original feature importance.
1. Develop a method for classification tasks.
1. Optimisation for users without access to GPUs (for now, you can use the model="LightGBM" parameter).
1. Make each generation a little less random.
### First Example
We have a dataset of more than 500k songs; the task is to predict the year in which each song was released.
### Second Example
Download Dataset and Activate Runner
```python
import tflm
import pandas as pd
import sklearn.datasets
from sklearn import linear_model

dataset = sklearn.datasets.fetch_california_housing()
X = pd.DataFrame(dataset['data'])
X["target"] = dataset["target"]
first = X.sample(int(len(X) / 2))  # random selection, so scores differ between runs
second = X.drop(first.index)       # the complementary half of the rows
target = "target"
X_train, y_train, X_test, y_test = tflm.runner(first, second, target)
#train_data, train_output, test_data, test_output = runner(first, second, target, contribution_portion=0.7, final_contribution=0.80, deflator=0.6)
```
Modelling and MSE Score
```python
from sklearn import linear_model
from sklearn.metrics import mean_squared_error

lm = linear_model.LinearRegression()
lm = lm.fit(X_train,y_train)
preds = lm.predict(X_test)
mse = mean_squared_error(y_test, preds)
print(mse)
#Score Achieved = 0.43
```
Compare Performance With Untransformed Features
```python
import pandas as pd
from sklearn import preprocessing
def scaler(df):
    x = df.values  # returns a numpy array
    min_max_scaler = preprocessing.MinMaxScaler()
    x_scaled = min_max_scaler.fit_transform(x)
    df = pd.DataFrame(x_scaled)
    return df
add_first_y = first[target]
add_first = scaler(first.drop([target],axis=1))
add_second_y = second[target]
add_second = scaler(second.drop([target],axis=1))
from sklearn import linear_model
#clf = linear_model.Lasso(alpha=0.4)
clf = linear_model.LinearRegression()
preds = clf.fit(add_first,add_first_y).predict(add_second)
mse = mean_squared_error(add_second_y, preds)
print(mse)
#Score Achieved = 0.55
```
That is a performance improvement of more than 20% using exactly the same underlying data!
That does not mean it always performs better than the standard data format; here is a Google Colab [example](https://colab.research.google.com/drive/1oEnsZ37FW266zdRK2Qa7del0T0ly-xKy) where this method performs poorly because of a lack of data. Here it works okay: [Colab](https://colab.research.google.com/drive/1IcTYWvHCAGbNLYJbHSIRHmYZrH2X07UF).
## Reasons
There are many reasons for transformation. In practice, a transformation often works, serendipitously, to do several of these at once, particularly to reduce skewness, to produce nearly equal spreads and to produce a nearly linear or additive relationship. But this is not guaranteed.
1. **Convenience**:
A transformed scale may be as natural as the original scale and more convenient for a specific purpose (e.g. percentages rather than original data, sines rather than degrees).
2. **Reducing skewness**:
A transformation may be used to reduce skewness. A distribution that is symmetric or nearly so is often easier to handle and interpret than a skewed distribution. More specifically, a normal or Gaussian distribution is often regarded as ideal, as it is assumed by many statistical methods. To reduce right skewness, take roots or logarithms or reciprocals (roots are weakest); this is the commonest problem in practice. To reduce left skewness, take squares or cubes or higher powers. A small illustration follows after this list.
3. **Equal spreads**:
A transformation may be used to produce approximately equal spreads, despite marked variations in level, which again makes data easier to handle and interpret. Each data set or subset having about the same spread or variability is a condition called homoscedasticity: its opposite is called heteroscedasticity.
4. **Linear relationships**:
When looking at relationships between variables, it is often far easier to think about patterns that are approximately linear than about patterns that are highly curved. This is vitally important when using linear regression, which amounts to fitting such patterns to data.
5. **Additive relationships**:
Relationships are often easier to analyse when additive rather than (say) multiplicative. Additivity is a vital issue in analysis of variance.
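As a quick, hypothetical illustration of point 2: a log transform typically pulls in a long right tail, while a root does so more weakly. This sketch uses made-up lognormal data and assumes NumPy and SciPy are available.
```python
import numpy as np
from scipy.stats import skew

rng = np.random.default_rng(0)
x = rng.lognormal(mean=0.0, sigma=1.0, size=10_000)  # right-skewed sample

print(skew(x))           # strongly positive
print(skew(np.sqrt(x)))  # smaller but still positive: roots are the weakest fix
print(skew(np.log(x)))   # close to 0: roughly symmetric after the logarithm
```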
## Transformations Implemented
The most useful transformations in introductory data analysis are the
reciprocal, logarithm, cube root, square root, and square. In what
follows, even when it is not emphasised, it is supposed that
transformations are used only over ranges on which they yield (finite)
real numbers as results.
### Reciprocal
The reciprocal, x to 1/x, with its sibling the negative reciprocal, x to
-1/x, is a very strong transformation with a drastic effect on
distribution shape. It can not be applied to zero values. Although it
can be applied to negative values, it is not useful unless all values are
positive. The reciprocal of a ratio may often be interpreted as easily as
the ratio itself: e.g.
- population density (people per unit area) becomes area per person;
- persons per doctor becomes doctors per person;
- rates of erosion become time to erode a unit depth.
(In practice, we might want to multiply or divide the results of taking
the reciprocal by some constant, such as 1000 or 10000, to get numbers
that are easy to manage, but that itself has no effect on skewness or
linearity.)
The reciprocal reverses order among values of the same sign: largest
becomes smallest, etc. The negative reciprocal preserves order among
values of the same sign.
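A quick numeric check of the order point above, with made-up values:
```python
import numpy as np

x = np.array([0.5, 2.0, 8.0])  # increasing, all positive
print(1 / x)    # [2.    0.5   0.125] -> order reversed
print(-1 / x)   # [-2.   -0.5  -0.125] -> order preserved
```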
### Logarithm
The logarithm, x to log base 10 of x, or x to log base e of x (ln x), or
x to log base 2 of x, is a strong transformation with a major effect on
distribution shape. It is commonly used for reducing right skewness and
is often appropriate for measured variables. It can not be applied to
zero or negative values. One unit on a logarithmic scale means a
multiplication by the base of logarithms being used. Exponential growth
or decline
y = a exp(bx)
is made linear by
ln y = ln a + bx
so that the response variable y should be logged. (Here exp() means
raising to the power e, approximately 2.71828, that is the base of
natural logarithms.)
An aside on this exponential growth or decline equation: put x = 0, and
y = a exp(0) = a,
so that a is the amount or count when x = 0. If a and b > 0, then y grows
at a faster and faster rate (e.g. compound interest or unchecked
population growth), whereas if a > 0 and b < 0, y declines at a slower
and slower rate (e.g. radioactive decay).
Power functions y = ax^b are made linear by log y = log a + b log x so
that both variables y and x should be logged.
An aside on such power functions: put x = 0, and for b > 0,
y = ax^b = 0,
so the power function for positive b goes through the origin, which often
makes physical or biological or economic sense. Think: does zero for x
imply zero for y? This kind of power function is a shape that fits many
data sets rather well.
Consider ratios y = p / q where p and q are both positive in practice.
Examples are
- males / females;
- dependants / workers;
- downstream length / downvalley length.
Then y is somewhere between 0 and infinity, or in the last case, between
1 and infinity. If p = q, then y = 1. Such definitions often lead to
skewed data, because there is a clear lower limit and no clear upper
limit. The logarithm, however, namely
log y = log(p / q) = log p - log q,
is somewhere between -infinity and infinity and p = q means that log y =
0. Hence the logarithm of such a ratio is likely to be more symmetrically
distributed.
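As a sketch of the linearising effect described above (with made-up parameters a and b and multiplicative noise), logging y turns the exponential relationship into a straight line, and an ordinary least-squares fit recovers the parameters.
```python
import numpy as np

rng = np.random.default_rng(1)
a, b = 3.0, 0.5
x = np.linspace(0.0, 5.0, 50)
y = a * np.exp(b * x) * rng.lognormal(sigma=0.05, size=x.size)  # noisy exponential growth

# ln y = ln a + b x, so a straight-line fit on (x, ln y) recovers b and ln a
slope, intercept = np.polyfit(x, np.log(y), 1)
print(slope, np.exp(intercept))  # approximately b = 0.5 and a = 3.0
```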
### Cube root
The cube root, x to x^(1/3). This is a fairly strong transformation with
a substantial effect on distribution shape: it is weaker than the
logarithm. It is also used for reducing right skewness, and has the
advantage that it can be applied to zero and negative values. Note that
the cube root of a volume has the units of a length. It is commonly
applied to rainfall data.
Applicability to negative values requires a special note. Consider
(2)(2)(2) = 8 and (-2)(-2)(-2) = -8. These examples show that the cube
root of a negative number has negative sign and the same absolute value
as the cube root of the equivalent positive number. A similar property is
possessed by any other root whose power is the reciprocal of an odd
positive integer (powers 1/3, 1/5, 1/7, etc.).
This property is a little delicate. For example, change the power just a
smidgen from 1/3, and we can no longer define the result as a product of
precisely three terms. However, the property is there to be exploited if
useful.
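A small caveat worth noting when applying this in NumPy: a naive fractional power returns NaN for negative inputs (1/3 is not stored exactly), so the sign-preserving cube root has to be requested explicitly. Illustrative values only:
```python
import numpy as np

x = np.array([-8.0, -1.0, 0.0, 1.0, 8.0])

print(x ** (1 / 3))                       # NaN for the negative entries
print(np.cbrt(x))                         # [-2. -1.  0.  1.  2.]
print(np.sign(x) * np.abs(x) ** (1 / 3))  # same sign-preserving result
```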
### Square root
The square root, x to x^(1/2) = sqrt(x), is a transformation with a
moderate effect on distribution shape: it is weaker than the logarithm
and the cube root. It is also used for reducing right skewness, and also
has the advantage that it can be applied to zero values. Note that the
square root of an area has the units of a length. It is commonly applied
to counted data, especially if the values are mostly rather small.
### Square
The square, x to x^2, has a moderate effect on distribution shape and it
could be used to reduce left skewness. In practice, the main reason for
using it is to fit a response by a quadratic function y = a + b x + c
x^2. Quadratics have a turning point, either a maximum or a minimum,
although the turning point in a function fitted to data might be far
beyond the limits of the observations. The distance of a body from an
origin is a quadratic if that body is moving under constant acceleration,
which gives a very clear physical justification for using a quadratic.
Otherwise quadratics are typically used solely because they can mimic a
relationship within the data region. Outside that region they may behave
very poorly, because they take on arbitrarily large values for extreme
values of x, and unless the intercept a is constrained to be 0, they may
behave unrealistically close to the origin.
Squaring usually makes sense only if the variable concerned is zero or
positive, given that (-x)^2 and x^2 are identical.
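A quick sketch of the quadratic fit mentioned above, with made-up coefficients; the fitted curve's turning point sits at x = -b / (2c).
```python
import numpy as np

rng = np.random.default_rng(2)
x = np.linspace(-2.0, 8.0, 60)
y = 2 + 3 * x - 0.5 * x ** 2 + rng.normal(scale=0.3, size=x.size)  # y = a + bx + cx^2 plus noise

c, b, a = np.polyfit(x, y, 2)  # np.polyfit returns the highest power first
print(a, b, c)                 # close to 2, 3 and -0.5
print(-b / (2 * c))            # turning point, close to 3.0
```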
Additional information on transformations, and the post that partly inspired the use of transformations in this package and the content of this readme, can be found [here](http://fmwww.bc.edu/repec/bocode/t/transint.html).