Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://github.com/pfnet-research/autogbt-alt
An experimental Python package that reimplements AutoGBT using LightGBM and Optuna.
https://github.com/pfnet-research/autogbt-alt
automl gradient-boosting kaggle lightgbm machine-learning python
Last synced: 2 months ago
JSON representation
An experimental Python package that reimplements AutoGBT using LightGBM and Optuna.
- Host: GitHub
- URL: https://github.com/pfnet-research/autogbt-alt
- Owner: pfnet-research
- License: mit
- Created: 2019-04-01T05:45:06.000Z (almost 6 years ago)
- Default Branch: master
- Last Pushed: 2019-04-01T06:05:28.000Z (almost 6 years ago)
- Last Synced: 2024-11-02T18:41:44.967Z (3 months ago)
- Topics: automl, gradient-boosting, kaggle, lightgbm, machine-learning, python
- Language: Python
- Size: 171 KB
- Stars: 82
- Watchers: 20
- Forks: 11
- Open Issues: 0
-
Metadata Files:
- Readme: README.md
- License: LICENSE
Awesome Lists containing this project
- automl - Homepage
README
## About
This is an experimental Python package that reimplements [AutoGBT](https://github.com/flytxtds/AutoGBT) using [LightGBM](https://github.com/Microsoft/LightGBM) and [Optuna](https://github.com/pfnet/optuna/). AutoGBT is an automatically tuned machine learning classifier which won the first prize at [NeurIPS'18 AutoML Challenge](https://competitions.codalab.org/competitions/19836). AutoGBT has the following features:
* Automatic Hyperparameter Tuning: the hyperparameters of LightGBM are automatically optimized,
* Automatic Feature Engineering: simple feature engineering is applied for categorical and datetime features, and
* Automatic Sampling: data rows are sampled for handling imbalanced and large datasets.This implementation has the following differences from original AutoGBT:
1. This implementation uses Optuna for the hyperparameter tuning of LightGBM instead of [Hyperopt](https://github.com/hyperopt/hyperopt),
1. it optimizes k-fold cross-validation AUC score, and
1. it equips simplified scikit-learn-like API interface.## Installation
```
$ pip install git+https://github.com/pfnet-research/autogbt-alt.git
```
or
```
$ pip install git+ssh://[email protected]/pfnet-research/autogbt-alt.git
```## Usage
### Basic Usage: LightGBM with Automatic Hyperparameter Tuning
```python3
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score
from autogbt import AutoGBTClassifierX, y = load_breast_cancer(return_X_y=True)
train_X, valid_X, train_y, valid_y = train_test_split(X, y, test_size=0.1)
model = AutoGBTClassifier()
model.fit(train_X, train_y)
print('valid AUC: %.3f' % (roc_auc_score(valid_y, model.predict(valid_X))))
print('CV AUC: %.3f' % (model.best_score))
```### Feature Engineering
```python3
from autogbt import Preprocessorpreprocessor = Preprocessor(train_frac=0.5, test_frac=0.5)
train_X, valid_X, train_y = preprocessor.transform(train_X, valid_X, train_y)
```### Training with Sampling
```python3
from autogbt import TrainDataSamplersampler = TrainDataSampler(train_frac=0.5, valid_frac=0.5)
model = AutoGBTClassifier(sampler=sampler)
model.fit(train_X, train_y)
model.predict(test_X)
```## Experimental Evaluation
Please see `benchmark` directory for the details.
### Comparison against Vanilla XGBoost and LightGBM
The default values are used for all hyperparameters of AutoGBT, XGBoost and LightGBM.
#### [Airline Dataset](https://www.openml.org/d/1240)
| model | duration[s] | CV AUC |
|:---------|:-----------------|:------------|
| AutoGBT | 6515.254±340.231 | 0.900±0.001 |
| Xgboost | 78.561±7.265 | 0.872±0.000 |
| LightGBM | 34.000±2.285 | 0.891±0.000 |#### [Amazon Challenge](https://www.kaggle.com/c/amazon-employee-access-challenge)
| model | duration[s] | CV AUC |
|:---------|:---------------|:------------|
| AutoGBT | 359.834±29.188 | 0.832±0.002 |
| Xgboost | 2.558±0.661 | 0.749±0.002 |
| LightGBM | 1.789±0.165 | 0.834±0.002 |#### [Avazu CTR Prediction](https://www.kaggle.com/c/avazu-ctr-prediction)
| model | duration[s] | CV AUC |
|:---------|:------------------|:------------|
| AutoGBT | 20322.601±676.702 | 0.744±0.000 |
| Xgboost | OoM | OoM |
| LightGBM | OoM | OoM |#### [Bank Marketing Data Set](https://archive.ics.uci.edu/ml/datasets/bank+marketing)
| model | duration[s] | CV AUC |
|:---------|:---------------|:------------|
| AutoGBT | 372.090±32.857 | 0.925±0.001 |
| Xgboost | 2.683±0.204 | 0.912±0.001 |
| LightGBM | 2.406±0.236 | 0.927±0.001 |### Parameter Comparison
Performance on various `train_frac` and `n_trials` parameters
![](./assets/param-airline.png)
![](./assets/param-amazon.png)
![](./assets/param-bank.png)## Testing
```
$ ./test.sh
```## Reference
Jobin Wilson and Amit Kumar Meher and Bivin Vinodkumar Bindu and Manoj Sharma and Vishakha Pareek and Santanu Chaudhury and Brejesh Lall, **AutoGBT: Automatically Optimized Gradient Boosting Trees for Classifying Large Volume High Cardinality Data Streams under Concept-Drift,** 2018, https://github.com/flytxtds/AutoGBT.