Ecosyste.ms: Awesome

An open API service indexing awesome lists of open source software.

Awesome Lists | Featured Topics | Projects

https://github.com/jianzhnie/autotabular

Automatic machine learning for tabular data. ⚡🔥⚡
https://github.com/jianzhnie/autotabular

automl catboost data-science deep-learning feature-engineering hpo lightgbm machine-learning pytorch-lightning scikit-learn structured-data tabular-data xgboost

Last synced: about 2 months ago
JSON representation

Automatic machine learning for tabular data. ⚡🔥⚡

Host: GitHub
URL: https://github.com/jianzhnie/autotabular
Owner: jianzhnie
License: apache-2.0
Created: 2021-05-07T11:09:51.000Z (almost 4 years ago)
Default Branch: main
Last Pushed: 2021-12-11T12:04:34.000Z (about 3 years ago)
Last Synced: 2024-05-15T04:34:05.862Z (9 months ago)
Topics: automl, catboost, data-science, deep-learning, feature-engineering, hpo, lightgbm, machine-learning, pytorch-lightning, scikit-learn, structured-data, tabular-data, xgboost
Language: Python
Homepage:
Size: 5.91 MB
Stars: 66
Watchers: 4
Forks: 10
Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE

Awesome Lists containing this project

README

        # AutoTabular

[![Paper](http://img.shields.io/badge/paper-arxiv.1001.2234-B31B1B.svg)](https://www.nature.com/articles/nature14539)

[![Conference](http://img.shields.io/badge/NeurIPS-2019-4b44ce.svg)](https://papers.nips.cc/book/advances-in-neural-information-processing-systems-31-2018)

[![Conference](http://img.shields.io/badge/ICLR-2019-4b44ce.svg)](https://papers.nips.cc/book/advances-in-neural-information-processing-systems-31-2018)

[![Conference](http://img.shields.io/badge/AnyConference-year-4b44ce.svg)](https://papers.nips.cc/book/advances-in-neural-information-processing-systems-31-2018)

AutoTabular automates machine learning tasks enabling you to easily achieve strong predictive performance in your applications.  With just a few lines of code, you can train and deploy high-accuracy machine learning and deep learning models tabular data.

![autotabular](./docs/autotabular.png)

[Toc]

## What's good in it?

- It is using the RAPIDS as back-end support, gives you the ability to execute end-to-end data science and analytics pipelines entirely on GPUs.

- It Supports many anomaly detection models: ,

- It using meta learning to accelerate  model selection and parameter tuning.

- It is using many Deep Learning models for tabular data: `Wide&Deep`,  `DCN(Deep & Cross Network)`, `FM`, `DeepFM`, `PNN` ...

- It is using many machine learning algorithms: `Baseline`, `Linear`, `Random Forest`, `Extra Trees`, `LightGBM`, `Xgboost`, `CatBoost`, and `Nearest Neighbors`.

- It can compute Ensemble based on greedy algorithm from [Caruana paper](http://www.cs.cornell.edu/~alexn/papers/shotgun.icml04.revised.rev2.pdf).

- It can stack models to build level 2 ensemble (available in `Compete` mode or after setting `stack_models` parameter).

- It can do features preprocessing, like: missing values imputation and converting categoricals. What is more, it can also handle target values preprocessing.

- It can do advanced features engineering, like: [Golden Features](https://supervised.mljar.com/features/golden_features/), [Features Selection](https://supervised.mljar.com/features/features_selection/), Text and Time Transformations.

- It can tune hyper-parameters with `not-so-random-search` algorithm (random-search over defined set of values) and hill climbing to fine-tune final models.

##  Installation

The sources for AutoTabular can be downloaded from the `Github repo`.

You can either clone the public repository:

```bash

# clone project

git clone https://apulis-gitlab.apulis.cn/apulis/AutoTabular/autotabular.git

# First, install dependencies

pip install -r requirements.txt

```

Once you have a copy of the source, you can install it with:

```bash

python setup.py install

```

## Example

Next, navigate to any file and run it.

```bash

# module folder

cd example

# run module (example: mnist as your main contribution)

python binary_classifier_Titanic.py

```

### Auto Feature generate & Selection

```python

TODO

```

### Deep Feature Synthesis

```python

import featuretools as ft

import pandas as pd

from sklearn.datasets import load_iris

# Load data and put into dataframe

iris = load_iris()

df = pd.DataFrame(iris.data, columns=iris.feature_names)

df['species'] = iris.target

df['species'] = df['species'].map({

    0: 'setosa',

    1: 'versicolor',

    2: 'virginica'

})

# Make an entityset and add the entity

es = ft.EntitySet()

es.add_dataframe(

    dataframe_name='data', dataframe=df, make_index=True, index='index')

# Run deep feature synthesis with transformation primitives

feature_matrix, feature_defs = ft.dfs(

    entityset=es,

    max_depth=3,

    target_dataframe_name='data',

    agg_primitives=['mode', 'mean', 'max', 'count'],

    trans_primitives=[

        'add_numeric', 'multiply_numeric', 'cum_min', 'cum_mean', 'cum_max'

    ],

    groupby_trans_primitives=['cum_sum'])

print(feature_defs)

print(feature_matrix.head())

print(feature_matrix.ww)

```

### GBDT Feature Generate

```python

from autofe.feature_engineering.gbdt_feature import CatboostFeatureTransformer, GBDTFeatureTransformer, LightGBMFeatureTransformer, XGBoostFeatureTransformer

titanic = pd.read_csv('autotabular/datasets/data/Titanic.csv')

# 'Embarked' is stored as letters, so fit a label encoder to the train set to use in the loop

embarked_encoder = LabelEncoder()

embarked_encoder.fit(titanic['Embarked'].fillna('Null'))

# Record anyone travelling alone

titanic['Alone'] = (titanic['SibSp'] == 0) & (titanic['Parch'] == 0)

# Transform 'Embarked'

titanic['Embarked'].fillna('Null', inplace=True)

titanic['Embarked'] = embarked_encoder.transform(titanic['Embarked'])

# Transform 'Sex'

titanic.loc[titanic['Sex'] == 'female', 'Sex'] = 0

titanic.loc[titanic['Sex'] == 'male', 'Sex'] = 1

titanic['Sex'] = titanic['Sex'].astype('int8')

# Drop features that seem unusable. Save passenger ids if test

titanic.drop(['Name', 'Ticket', 'Cabin'], axis=1, inplace=True)

trainMeans = titanic.groupby(['Pclass', 'Sex'])['Age'].mean()

def f(x):

    if not np.isnan(x['Age']):  # not NaN

        return x['Age']

    return trainMeans[x['Pclass'], x['Sex']]

titanic['Age'] = titanic.apply(f, axis=1)

rows = titanic.shape[0]

n_train = int(rows * 0.77)

train_data = titanic[:n_train, :]

test_data = titanic[n_train:, :]

X_train = titanic.drop(['Survived'], axis=1)

y_train = titanic['Survived']

clf = XGBoostFeatureTransformer(task='classification')

clf.fit(X_train, y_train)

result = clf.concate_transform(X_train)

print(result)

clf = LightGBMFeatureTransformer(task='classification')

clf.fit(X_train, y_train)

result = clf.concate_transform(X_train)

print(result)

clf = GBDTFeatureTransformer(task='classification')

clf.fit(X_train, y_train)

result = clf.concate_transform(X_train)

print(result)

clf = CatboostFeatureTransformer(task='classification')

clf.fit(X_train, y_train)

result = clf.concate_transform(X_train)

print(result)

from sklearn.linear_model import LogisticRegression

from sklearn.model_selection import train_test_split

from sklearn.metrics import roc_auc_score

lr = LogisticRegression()

x_train_gb, x_test_gb, y_train_gb, y_test_gb = train_test_split(

    result, y_train)

x_train, x_test, y_train, y_test = train_test_split(X_train, y_train)

lr.fit(x_train, y_train)

score = roc_auc_score(y_test, lr.predict(x_test))

print('LR with GBDT apply data, train data shape : {0}  auc: {1}'.format(

    x_train.shape, score))

lr = LogisticRegression()

lr.fit(x_train_gb, y_train_gb)

score = roc_auc_score(y_test_gb, lr.predict(x_test_gb))

print('LR with GBDT apply data, train data shape : {0}  auc: {1}'.format(

    x_train_gb.shape, score))

```

### Golden Feature Generate

```python

from autofe import GoldenFeatureTransform

titanic = pd.read_csv('autotabular/datasets/data/Titanic.csv')

embarked_encoder = LabelEncoder()

embarked_encoder.fit(titanic['Embarked'].fillna('Null'))

# Record anyone travelling alone

titanic['Alone'] = (titanic['SibSp'] == 0) & (titanic['Parch'] == 0)

# Transform 'Embarked'

titanic['Embarked'].fillna('Null', inplace=True)

titanic['Embarked'] = embarked_encoder.transform(titanic['Embarked'])

# Transform 'Sex'

titanic.loc[titanic['Sex'] == 'female', 'Sex'] = 0

titanic.loc[titanic['Sex'] == 'male', 'Sex'] = 1

titanic['Sex'] = titanic['Sex'].astype('int8')

# Drop features that seem unusable. Save passenger ids if test

titanic.drop(['Name', 'Ticket', 'Cabin'], axis=1, inplace=True)

trainMeans = titanic.groupby(['Pclass', 'Sex'])['Age'].mean()

def f(x):

    if not np.isnan(x['Age']):  # not NaN

        return x['Age']

    return trainMeans[x['Pclass'], x['Sex']]

titanic['Age'] = titanic.apply(f, axis=1)

X_train = titanic.drop(['Survived'], axis=1)

y_train = titanic['Survived']

print(X_train)

gbdt_model = GoldenFeatureTransform(

    results_path='./', ml_task='BINARY_CLASSIFICATION')

gbdt_model.fit(X_train, y_train)

results = gbdt_model.transform(X_train)

print(results)

```

### Neural Network Embeddings

```python

# data url

"""https://www.kaggle.com/c/house-prices-advanced-regression-techniques."""

data_dir = '/media/robin/DATA/datatsets/structure_data/house_price/train.csv'

data = pd.read_csv(

    data_dir,

    usecols=[

        'SalePrice', 'MSSubClass', 'MSZoning', 'LotFrontage', 'LotArea',

        'Street', 'YearBuilt', 'LotShape', '1stFlrSF', '2ndFlrSF'

    ]).dropna()

categorical_features = [

    'MSSubClass', 'MSZoning', 'Street', 'LotShape', 'YearBuilt'

]

output_feature = 'SalePrice'

label_encoders = {}

for cat_col in categorical_features:

    label_encoders[cat_col] = LabelEncoder()

    data[cat_col] = label_encoders[cat_col].fit_transform(data[cat_col])

dataset = TabularDataset(

    data=data, cat_cols=categorical_features, output_col=output_feature)

batchsize = 64

dataloader = DataLoader(dataset, batchsize, shuffle=True, num_workers=1)

cat_dims = [int(data[col].nunique()) for col in categorical_features]

emb_dims = [(x, min(50, (x + 1) // 2)) for x in cat_dims]

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

model = FeedForwardNN(

    emb_dims,

    no_of_cont=4,

    lin_layer_sizes=[50, 100],

    output_size=1,

    emb_dropout=0.04,

    lin_layer_dropouts=[0.001, 0.01]).to(device)

print(model)

num_epochs = 100

criterion = nn.MSELoss()

optimizer = torch.optim.Adam(model.parameters(), lr=0.1)

for epoch in range(num_epochs):

    for y, cont_x, cat_x in dataloader:

        cat_x = cat_x.to(device)

        cont_x = cont_x.to(device)

        y = y.to(device)

        # Forward Pass

        preds = model(cont_x, cat_x)

        loss = criterion(preds, y)

        # Backward Pass and Optimization

        optimizer.zero_grad()

        loss.backward()

        optimizer.step()

    print('loss:', loss)

```

## License

This library is licensed under the Apache 2.0 License.

## Contributing to AutoTabular

We are actively accepting code contributions to the AutoTabular project. If you are interested in contributing to AutoTabular, please contact me.