https://github.com/aaronjanse/stick-bug-ml

Framework for supervised machine learning systems
Host: GitHub
URL: https://github.com/aaronjanse/stick-bug-ml
Owner: aaronjanse
License: apache-2.0
Created: 2017-07-13T19:38:43.000Z (about 9 years ago)
Default Branch: master
Last Pushed: 2017-07-13T20:45:06.000Z (about 9 years ago)
Last Synced: 2024-10-30T02:58:41.212Z (over 1 year ago)
Topics: decorators, framework, kaggle, machine-learning, organization, python, python3
Language: Jupyter Notebook
Homepage:
Size: 44.9 KB
Stars: 5
Watchers: 3
Forks: 0
Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE

# stick-bug-ml

A framework that eases the burden of organizing the code of a supervised machine learning system.

It provides decorators that manage data and pass it between the common steps of building a machine learning system, such as:

- loading the dataset
- preprocessing
- feature generation
- model definition

While doing this, it keeps the global namespace free of clutter, such as an endless chain of feature and model variables.

In addition, it makes it easy to put new, real-life data through the exact same process that the training data goes through.
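
To make the idea concrete, here is a minimal sketch of how a decorator-based step registry could work. It is not stick-bug-ml's implementation: the `Pipeline` class and its `step` and `process` names are hypothetical and exist only for this illustration.

```python
class Pipeline:
    """Hypothetical registry that records steps in the order they are declared."""

    def __init__(self):
        self._steps = []  # ordered list of (name, function) pairs

    def step(self, name):
        """Decorator factory: register the function under `name` and return it unchanged."""
        def register(func):
            self._steps.append((name, func))
            return func
        return register

    def process(self, data):
        """Run `data` through every registered step, in declaration order."""
        for _name, func in self._steps:
            data = func(data)
        return data


pipeline = Pipeline()


@pipeline.step('strip')
def strip_whitespace(rows):
    return [row.strip() for row in rows]


@pipeline.step('lowercase')
def lowercase(rows):
    return [row.lower() for row in rows]


# Training data and new data go through the same registered steps.
train_rows = pipeline.process(['  Alice ', 'BOB'])
new_rows = pipeline.process([' Carol'])
```

Because the steps live inside the registry object, the calling code only needs one name (`pipeline`) no matter how many steps are added.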

## Installation

Install with `pip` (Python 3):

```bash
$ pip install stickbugml
```

Dependencies:

- Python 3
- sklearn
- pandas
- numpy

## Documentation

The documentation can be found at [docs/README.md](https://github.com/Aaronduino/stick-bug-ml/blob/master/docs/README.md)

## Example

Note: there is also a great [example for use in Jupyter Notebooks](demo.ipynb)

First, import this library:

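```python
import stickbugml
# `preprocess` is used further down; it is assumed to live alongside the other decorators
from stickbugml.decorators import dataset, preprocess, feature, model
```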

Load your dataset:

```python
import seaborn as sns  # seaborn.apionly was removed in seaborn 0.9
import pandas as pd

@dataset(train_valid_test=(0.6, 0.2, 0.2))  # define your train/validation/test data splits
def raw_dataset():
    titanic_dataset = sns.load_dataset('titanic')

    # Drop NaN rows for simplicity
    titanic_dataset.dropna(inplace=True)

    # Extract X and y
    X = titanic_dataset.drop('survived', axis=1)
    y = titanic_dataset['survived']

    return X, y

print(raw_dataset.head())  # yes, this does work! raw_dataset is now a pandas DataFrame
```
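
The `train_valid_test=(0.6, 0.2, 0.2)` argument asks for a three-way split. How the library performs it is not shown here, so the following is only a sketch of one common approach: shuffle the row indices, then cut them at the cumulative fractions. The `split_indices` function and its `seed` parameter are hypothetical.

```python
import random


def split_indices(n_rows, fractions=(0.6, 0.2, 0.2), seed=0):
    """Shuffle row indices and cut them into train/validation/test parts."""
    if abs(sum(fractions) - 1.0) > 1e-9:
        raise ValueError('fractions must sum to 1')
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)  # seeded, so the split is reproducible
    train_end = int(n_rows * fractions[0])
    valid_end = train_end + int(n_rows * fractions[1])
    # The test part takes whatever remains, so no row is lost to rounding.
    return indices[:train_end], indices[train_end:valid_end], indices[valid_end:]


train_idx, valid_idx, test_idx = split_indices(10)
```

Each row index lands in exactly one of the three parts, which is what keeps the test rows out of training.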

(Optionally) do some pre-processing:

```python
@preprocess
def preprocessed_dataset(X):
    # Encode categorical columns
    # Note: 'alive' mirrors the 'survived' target; drop it in a real model to avoid leakage
    categorical_column_names = [
        'sex', 'embarked', 'class',
        'who', 'adult_male', 'deck',
        'embark_town', 'alive', 'alone']

    X = pd.get_dummies(X,
                       columns=categorical_column_names,
                       prefix=categorical_column_names)

    return X

print(preprocessed_dataset.head())  # See the first code block for explanation
```

Generate some features:

```python
from sklearn import decomposition
import numpy as np

@feature('pca')
def pca_feature(X):
    pca = decomposition.PCA(n_components=3)
    pca.fit(X)

    pca_out = pca.transform(X)
    pca_out = np.transpose(pca_out, (1, 0))

    return pd.DataFrame(pca_out)

# Let's preview
print(pca_feature.head())  # See the first code block for explanation

# You can add more features the same way
```

And define your (machine learning) model(s):

```python
import xgboost as xgb

@model('xgboost')
def xgboost_model():
    def define(num_columns):
        return None  # xgboost models aren't pre-defined

    def train(model, params, train, validation):
        params['objective'] = 'binary:logistic'  # Static parameters can be defined here
        params['eval_metric'] = 'logloss'

        d_train = xgb.DMatrix(train['X'], label=train['y'])
        d_valid = xgb.DMatrix(validation['X'], label=validation['y'])

        watchlist = [(d_train, 'train'), (d_valid, 'valid')]
        trained_model = xgb.train(params, d_train, 2000, watchlist,
                                  early_stopping_rounds=50, verbose_eval=10)

        return trained_model

    def predict(model, X):
        return model.predict(xgb.DMatrix(X))

    return define, train, predict
```
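
The `define`/`train`/`predict` triple may be easier to follow with a dependency-free stand-in. The sketch below mirrors that shape with a majority-class baseline. `run_model` is a hypothetical driver written for this illustration, not part of stick-bug-ml, and the way the framework actually calls the three functions may differ.

```python
from collections import Counter


def baseline_model():
    def define(num_columns):
        return None  # nothing to build ahead of time

    def train(model, params, train, validation):
        # "Training" just remembers the most common label in the training split.
        return Counter(train['y']).most_common(1)[0][0]

    def predict(model, X):
        # Predict the remembered label for every row.
        return [model for _ in X]

    return define, train, predict


def run_model(model_factory, params, train, validation, X_new):
    """Hypothetical driver: call define, train, and predict in order."""
    define, fit, predict = model_factory()
    untrained = define(num_columns=len(train['X'][0]))
    trained = fit(untrained, params, train, validation)
    return predict(trained, X_new)


train_split = {'X': [[0], [1], [2]], 'y': [1, 1, 0]}
valid_split = {'X': [[3]], 'y': [1]}
predictions = run_model(baseline_model, {}, train_split, valid_split, [[4], [5]])
```

Returning the three functions from one outer function keeps each model's helpers inside a single scope instead of spreading them across the module.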

Now you can train your model, trying out different parameters if you want:

```python
stickbugml.train('xgboost', {
    'max_depth': 7,
    'eta': 0.01
})
```

The library keeps the test data's ground truth values locked away so your models can't train on them.

After you train your model, have the framework evaluate it for you:

```python
logloss_score = stickbugml.evaluate('xgboost')
print(logloss_score)
```
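
The variable name above suggests the returned score is a log loss. Assuming it is the usual binary log loss, the sketch below shows how that metric is computed from true labels and predicted probabilities. `binary_log_loss` is a hypothetical helper for illustration; the library's own evaluation code may differ.

```python
import math


def binary_log_loss(y_true, y_prob, eps=1e-15):
    """Mean negative log-likelihood of the true labels under the predicted probabilities."""
    total = 0.0
    for label, prob in zip(y_true, y_prob):
        prob = min(max(prob, eps), 1 - eps)  # clip so log(0) never occurs
        total += -(label * math.log(prob) + (1 - label) * math.log(1 - prob))
    return total / len(y_true)


score = binary_log_loss([1, 0, 1], [0.9, 0.2, 0.6])
```

Lower is better: a model that always predicts 0.5 scores about 0.693 (the natural log of 2), so a useful binary classifier should land below that.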

You can add more models and features if so desired.

Since this library is built with reality in mind, you can easily get predictions for new/real-life data:

```python
raw_X = pd.read_csv('2018_titanic_manifesto.csv')  # It will probably sink, but we don't know who will survive

processed_X = stickbugml.process(raw_X)  # Process the data
del raw_X  # Gotta keep that namespace clean, right?

y = stickbugml.predict('xgboost', processed_X)  # Make predictions
print(y)
```

## Contributing & Feedback

If you have any problems, or would like a new feature, submit an Issue.

If you want to help out, feel free to submit a Pull Request.

## License

This project uses the Apache 2.0 License.
