# autogbt-alt

An experimental Python package that reimplements AutoGBT using LightGBM and Optuna.

https://github.com/pfnet-research/autogbt-alt
## About

This is an experimental Python package that reimplements [AutoGBT](https://github.com/flytxtds/AutoGBT) using [LightGBM](https://github.com/Microsoft/LightGBM) and [Optuna](https://github.com/pfnet/optuna/). AutoGBT is an automatically tuned machine learning classifier that won first prize at the [NeurIPS'18 AutoML Challenge](https://competitions.codalab.org/competitions/19836). AutoGBT has the following features:

* Automatic Hyperparameter Tuning: the hyperparameters of LightGBM are optimized automatically,
* Automatic Feature Engineering: simple feature engineering is applied to categorical and datetime features, and
* Automatic Sampling: data rows are sampled to handle imbalanced and large datasets.

This implementation has the following differences from the original AutoGBT:

1. it uses Optuna instead of [Hyperopt](https://github.com/hyperopt/hyperopt) for the hyperparameter tuning of LightGBM,
1. it optimizes the k-fold cross-validation AUC score, and
1. it provides a simplified scikit-learn-like API.

## Installation

```
$ pip install git+https://github.com/pfnet-research/autogbt-alt.git
```
or
```
$ pip install git+ssh://[email protected]/pfnet-research/autogbt-alt.git
```
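To verify the install, a quick import check (the top-level module is `autogbt`, matching the imports used below):

```
$ python -c "import autogbt; print(autogbt.__file__)"
```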

## Usage

### Basic Usage: LightGBM with Automatic Hyperparameter Tuning

```python3
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score
from autogbt import AutoGBTClassifier

# load a small binary classification dataset and hold out 10% for validation
X, y = load_breast_cancer(return_X_y=True)
train_X, valid_X, train_y, valid_y = train_test_split(X, y, test_size=0.1)

# fitting runs LightGBM training with automatic hyperparameter tuning
model = AutoGBTClassifier()
model.fit(train_X, train_y)
print('valid AUC: %.3f' % (roc_auc_score(valid_y, model.predict(valid_X))))
print('CV AUC: %.3f' % (model.best_score))
```
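Tuning dominates the runtime (see the benchmarks below). As a hedged sketch, the budget can presumably be reduced via the `n_trials` parameter that the parameter comparison below varies; that it is accepted by the constructor as shown here is an assumption:

```python3
from autogbt import AutoGBTClassifier

# assumption: n_trials caps the number of Optuna trials spent on tuning;
# train_X and train_y are the splits from the example above
model = AutoGBTClassifier(n_trials=10)
model.fit(train_X, train_y)
```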

### Feature Engineering

```python3
from autogbt import Preprocessor

# train_X, valid_X, train_y come from the Basic Usage example above;
# the preprocessor applies the simple categorical/datetime feature
# engineering described in About
preprocessor = Preprocessor(train_frac=0.5, test_frac=0.5)
train_X, valid_X, train_y = preprocessor.transform(train_X, valid_X, train_y)
```
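For context on what the preprocessor targets, here is a minimal sketch of input with one categorical and one datetime column, assuming `Preprocessor.transform` accepts pandas objects (not verified against the package; all names and values are illustrative):

```python3
import pandas as pd
from autogbt import Preprocessor

# hypothetical toy frames with a categorical and a datetime column
train_X = pd.DataFrame({
    'channel': ['web', 'app', 'web', 'store'],
    'signup': pd.to_datetime(['2018-01-01', '2018-02-03',
                              '2018-03-05', '2018-04-07']),
})
valid_X = train_X.copy()
train_y = pd.Series([0, 1, 0, 1])

preprocessor = Preprocessor(train_frac=1.0, test_frac=1.0)
train_X, valid_X, train_y = preprocessor.transform(train_X, valid_X, train_y)
```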

### Training with Sampling

```python3
from autogbt import TrainDataSampler

# train_frac and valid_frac set the fraction of rows sampled for
# training and validation, per the Automatic Sampling feature above;
# test_X is a held-out feature matrix prepared elsewhere
sampler = TrainDataSampler(train_frac=0.5, valid_frac=0.5)
model = AutoGBTClassifier(sampler=sampler)
model.fit(train_X, train_y)
model.predict(test_X)
```
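For intuition, a self-contained sketch on a synthetic, heavily imbalanced dataset generated with scikit-learn (sampler arguments as in the snippet above):

```python3
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from autogbt import AutoGBTClassifier, TrainDataSampler

# synthetic binary problem with roughly 95% negatives
X, y = make_classification(n_samples=10000, weights=[0.95], random_state=0)
train_X, test_X, train_y, test_y = train_test_split(X, y, test_size=0.2)

# sample half of the training and validation rows during fitting
sampler = TrainDataSampler(train_frac=0.5, valid_frac=0.5)
model = AutoGBTClassifier(sampler=sampler)
model.fit(train_X, train_y)
pred = model.predict(test_X)
```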

## Experimental Evaluation

Please see the `benchmark` directory for details.

### Comparison against Vanilla XGBoost and LightGBM

The default values are used for all hyperparameters of AutoGBT, XGBoost and LightGBM.

#### [Airline Dataset](https://www.openml.org/d/1240)

| model    | duration [s]     | CV AUC      |
|:---------|:-----------------|:------------|
| AutoGBT  | 6515.254±340.231 | 0.900±0.001 |
| XGBoost  | 78.561±7.265     | 0.872±0.000 |
| LightGBM | 34.000±2.285     | 0.891±0.000 |

#### [Amazon Challenge](https://www.kaggle.com/c/amazon-employee-access-challenge)

| model    | duration [s]   | CV AUC      |
|:---------|:---------------|:------------|
| AutoGBT  | 359.834±29.188 | 0.832±0.002 |
| XGBoost  | 2.558±0.661    | 0.749±0.002 |
| LightGBM | 1.789±0.165    | 0.834±0.002 |

#### [Avazu CTR Prediction](https://www.kaggle.com/c/avazu-ctr-prediction)

| model    | duration [s]      | CV AUC      |
|:---------|:------------------|:------------|
| AutoGBT  | 20322.601±676.702 | 0.744±0.000 |
| XGBoost  | OoM               | OoM         |
| LightGBM | OoM               | OoM         |

OoM: the run failed with an out-of-memory error.

#### [Bank Marketing Data Set](https://archive.ics.uci.edu/ml/datasets/bank+marketing)

| model    | duration [s]   | CV AUC      |
|:---------|:---------------|:------------|
| AutoGBT  | 372.090±32.857 | 0.925±0.001 |
| XGBoost  | 2.683±0.204    | 0.912±0.001 |
| LightGBM | 2.406±0.236    | 0.927±0.001 |

### Parameter Comparison

Performance for various settings of the `train_frac` and `n_trials` parameters:

![](./assets/param-airline.png)
![](./assets/param-amazon.png)
![](./assets/param-bank.png)

## Testing

```
$ ./test.sh
```

## Reference

Jobin Wilson, Amit Kumar Meher, Bivin Vinodkumar Bindu, Manoj Sharma, Vishakha Pareek, Santanu Chaudhury, and Brejesh Lall. **AutoGBT: Automatically Optimized Gradient Boosting Trees for Classifying Large Volume High Cardinality Data Streams under Concept-Drift.** 2018. https://github.com/flytxtds/AutoGBT