https://github.com/alegonz/baikal
A graph-based functional API for building complex scikit-learn pipelines.
data-science graph-based machine-learning python scikit-learn
- Host: GitHub
- URL: https://github.com/alegonz/baikal
- Owner: alegonz
- License: bsd-3-clause
- Created: 2019-01-21T12:59:02.000Z (over 6 years ago)
- Default Branch: master
- Last Pushed: 2022-12-08T03:38:11.000Z (over 2 years ago)
- Last Synced: 2025-04-21T13:04:19.112Z (27 days ago)
- Topics: data-science, graph-based, machine-learning, python, scikit-learn
- Language: Python
- Homepage: https://baikal.readthedocs.io
- Size: 650 KB
- Stars: 590
- Watchers: 17
- Forks: 30
- Open Issues: 6
Metadata Files:
- Readme: README.md
- Changelog: CHANGELOG.md
- Contributing: CONTRIBUTING.md
- License: LICENSE

# A graph-based functional API for building complex scikit-learn pipelines
**baikal** is written in pure Python. It supports Python 3.5 and above.

Note: **baikal** is still a young project and there might be backward-incompatible changes.
The next development steps and backward-incompatible changes are announced and discussed
in [this issue](https://github.com/alegonz/baikal/issues/16). Please subscribe to it if
you use **baikal**.

### What is baikal?
**baikal is a graph-based, functional API for building complex machine learning pipelines
of objects that implement the** [scikit-learn API](https://scikit-learn.org/stable/developers/contributing.html#different-objects).
It is mostly inspired by the excellent [Keras](https://keras.io) API for Deep Learning,
and borrows a few concepts from the [TensorFlow](https://www.tensorflow.org) framework
and the (perhaps lesser known) [graphkit](https://github.com/yahoo/graphkit) package.

**baikal** aims to provide an API for building complex, non-linear machine learning
pipelines with code that looks like this:
```python
x1 = Input()
x2 = Input()
y_t = Input()

y1 = ExtraTreesClassifier()(x1, y_t)
y2 = RandomForestClassifier()(x2, y_t)
z = PowerTransformer()(x2)
z = PCA()(z)
y3 = LogisticRegression()(z, y_t)

ensemble_features = Stack()([y1, y2, y3])
y = SVC()(ensemble_features, y_t)

model = Model([x1, x2], y, y_t)
```

### What can I do with it?
With **baikal** you can
- build non-linear pipelines effortlessly
- handle multiple inputs and outputs
- add steps that operate on targets as part of the pipeline
- nest pipelines
- use prediction probabilities (or any other kind of output) as inputs to other steps in the pipeline
- query intermediate outputs, easing debugging
- freeze steps that do not require fitting
- define and add custom steps easily
- plot pipelines

All with boilerplate-free, readable code.
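
The features listed above rest on one idea: calling a step on placeholder inputs records an edge in a graph rather than computing anything, and a model object later walks that graph. The toy sketch below illustrates that idea in plain Python. It is not baikal's implementation: `Node`, `toy_input`, `ToyStep`, and `ToyModel` are hypothetical names invented for this illustration, and the sketch covers prediction only, with no fitting.

```python
# Toy illustration only: none of these classes come from baikal.

class Node:
    """A placeholder for a value that flows through the graph."""

    def __init__(self, name, step=None, parents=()):
        self.name = name
        self.step = step              # the step that produces this node, if any
        self.parents = tuple(parents)


def toy_input(name):
    """Create a graph input, i.e. a node with no producing step."""
    return Node(name)


class ToyStep:
    """Wraps a plain function; calling it on nodes records a graph edge."""

    def __init__(self, func, name):
        self.func = func
        self.name = name

    def __call__(self, *nodes):
        return Node(self.name + "/out", step=self, parents=nodes)


class ToyModel:
    """Evaluates the recorded graph from the inputs to a requested node."""

    def __init__(self, inputs, output):
        self.inputs = list(inputs)
        self.output = output

    def predict(self, values, output=None):
        # `values` maps each input node to concrete data.
        missing = [node.name for node in self.inputs if node not in values]
        if missing:
            raise ValueError("missing inputs: " + ", ".join(missing))
        cache = dict(values)
        wanted = self.output if output is None else output

        def evaluate(node):
            # Compute a node once, after computing its parents.
            if node not in cache:
                args = [evaluate(parent) for parent in node.parents]
                cache[node] = node.step.func(*args)
            return cache[node]

        return evaluate(wanted)


x1 = toy_input("x1")
x2 = toy_input("x2")
doubled = ToyStep(lambda a: [2 * v for v in a], "double")(x1)
summed = ToyStep(lambda a, b: [p + q for p, q in zip(a, b)], "add")(doubled, x2)
toy_model = ToyModel([x1, x2], summed)

data = {x1: [1, 2], x2: [10, 20]}
result = toy_model.predict(data)                        # final output
intermediate = toy_model.predict(data, output=doubled)  # intermediate output
```

Because every intermediate value is a named node in the graph, asking for `doubled` instead of the final output needs no extra plumbing, which is the property that makes querying intermediate outputs cheap in a graph-based design.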
### Why baikal?
The pipeline above (to the best of the author's knowledge) cannot be easily built using
[scikit-learn's composite estimators API](https://scikit-learn.org/stable/modules/compose.html#pipelines-and-composite-estimators)
because you run into some limitations:

1. It is aimed at linear pipelines.
    - You could add some step parallelism with the [ColumnTransformer](https://scikit-learn.org/stable/modules/compose.html#columntransformer-for-heterogeneous-data)
      API, but this is limited to transformer objects.
2. Classifiers/Regressors can only be used at the end of the pipeline.
    - This means we cannot use the predicted labels (or their probabilities) as features
      to other classifiers/regressors.
    - You could leverage mlxtend's [StackingClassifier](http://rasbt.github.io/mlxtend/user_guide/classifier/StackingClassifier/#stackingclassifier)
      and come up with some clever combination of the above composite estimators
      (`Pipeline`s, `ColumnTransformer`s, `StackingClassifier`s, etc.), but you might
      end up with code that feels hard to follow and verbose.
3. It cannot handle multiple input/multiple output models.

Perhaps you could instead define a big, composite estimator class that integrates each of
the pipeline steps through composition. This, however, will most likely require

* writing big `__init__` methods to control each of the internal steps' knobs;
* being careful with `get_params` and `set_params` if you want to use, say, `GridSearchCV`;
* and adding some boilerplate code if you want to access the outputs of intermediate
  steps for debugging.

By using **baikal** as shown in the example above, code can be more readable, less verbose
and closer to our mental representation of the pipeline. **baikal** also provides an API
to fit, predict with, and query the entire pipeline with single commands.
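
To make the boilerplate cost of the hand-written composite estimator concrete, here is a minimal sketch of that approach with only two internal steps. Everything in it is hypothetical: `Scaler` and `ThresholdClassifier` are tiny stand-ins written for this illustration (not scikit-learn classes), and `CompositeEstimator` only mimics the `get_params`/`set_params` convention rather than inheriting it from scikit-learn.

```python
# Hypothetical stand-ins for two pipeline steps; not scikit-learn classes.

class Scaler:
    def __init__(self, factor=1.0):
        self.factor = factor

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        return [[value * self.factor for value in row] for row in X]


class ThresholdClassifier:
    def __init__(self, threshold=0.0):
        self.threshold = threshold

    def fit(self, X, y):
        return self

    def predict(self, X):
        return [int(sum(row) > self.threshold) for row in X]


class CompositeEstimator:
    """Hand-written composite that wires the two steps by composition."""

    def __init__(self, scaler_factor=1.0, classifier_threshold=0.0):
        # Every knob of every internal step is re-exposed here by hand.
        self.scaler_factor = scaler_factor
        self.classifier_threshold = classifier_threshold

    def get_params(self, deep=True):
        # Must list exactly the __init__ arguments for parameter search to work.
        return {
            "scaler_factor": self.scaler_factor,
            "classifier_threshold": self.classifier_threshold,
        }

    def set_params(self, **params):
        valid = self.get_params()
        for name, value in params.items():
            if name not in valid:
                raise ValueError("unknown parameter: " + name)
            setattr(self, name, value)
        return self

    def fit(self, X, y):
        self.scaler_ = Scaler(self.scaler_factor).fit(X)
        # Extra attribute kept only so the intermediate output can be inspected.
        self.intermediate_ = self.scaler_.transform(X)
        self.classifier_ = ThresholdClassifier(self.classifier_threshold)
        self.classifier_.fit(self.intermediate_, y)
        return self

    def predict(self, X):
        return self.classifier_.predict(self.scaler_.transform(X))


composite = CompositeEstimator(scaler_factor=2.0, classifier_threshold=5.0)
composite.fit([[1, 1], [2, 2]], [0, 1])
predictions = composite.predict([[1, 1], [2, 2]])
```

Even with two steps and one parameter each, the class needs a parameter mirror in `__init__`, matching `get_params`/`set_params` logic, and a dedicated attribute for debugging; each additional step grows all three.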