https://github.com/CDonnerer/xgboost-distribution

Probabilistic prediction with XGBoost.
https://github.com/CDonnerer/xgboost-distribution

Last synced: 2 months ago
JSON representation

Probabilistic prediction with XGBoost.

Host: GitHub
URL: https://github.com/CDonnerer/xgboost-distribution
Owner: CDonnerer
License: mit
Created: 2021-05-23T21:06:20.000Z (almost 5 years ago)
Default Branch: main
Last Pushed: 2025-03-22T17:50:22.000Z (about 1 year ago)
Last Synced: 2025-03-22T18:28:47.663Z (about 1 year ago)
Language: Python
Size: 470 KB
Stars: 106
Watchers: 4
Forks: 18
Open Issues: 12
Metadata Files:
- Readme: README.rst
- Changelog: CHANGELOG.rst
- License: LICENSE.txt
- Authors: AUTHORS.rst

Awesome Lists containing this project

awesome-gradient-boosting-machines - XGBoost-Distribution - Probabilistic prediction with XGBoost via MLE. Like NGBoost but ~15x faster, with full XGBoost features (monotonic constraints, GPU). ⭐ 120+ (Implementations / Other Frameworks)

README

          .. image:: https://github.com/CDonnerer/xgboost-distribution/actions/workflows/test.yml/badge.svg?branch=main

  :target: https://github.com/CDonnerer/xgboost-distribution/actions/workflows/test.yml

.. image:: https://coveralls.io/repos/github/CDonnerer/xgboost-distribution/badge.svg?branch=main

  :target: https://coveralls.io/github/CDonnerer/xgboost-distribution?branch=main

.. image:: https://img.shields.io/badge/code%20style-black-000000.svg

  :target: https://github.com/psf/black

.. image:: https://readthedocs.org/projects/xgboost-distribution/badge/?version=latest

  :target: https://xgboost-distribution.readthedocs.io/en/latest/?badge=latest

  :alt: Documentation Status

.. image:: https://img.shields.io/pypi/v/xgboost-distribution.svg

  :alt: PyPI-Server

  :target: https://pypi.org/project/xgboost-distribution/

====================

xgboost-distribution

====================

XGBoost for probabilistic prediction. Like `NGBoost`_, but `faster`_, and in the `XGBoost scikit-learn API`_.

.. image:: https://raw.githubusercontent.com/CDonnerer/xgboost-distribution/main/imgs/xgb_dist.png

    :align: center

    :width: 600px

    :alt: XGBDistribution example

Installation

============

.. code-block:: console

    $ pip install xgboost-distribution

    $ pip install xgboost-distribution[gpu]  # for GPU support

`Dependencies`_:

.. code-block::

    python_requires = >=3.10

    install_requires =

        scikit-learn

        xgboost>=3.0.0

Usage

===========

``XGBDistribution`` follows the `XGBoost scikit-learn API`_, with an additional keyword

argument specifying the distribution, which is fit via `Maximum Likelihood Estimation`_:

.. code-block:: python

      from sklearn.datasets import fetch_california_housing

      from sklearn.model_selection import train_test_split

      from xgboost_distribution import XGBDistribution

      data = fetch_california_housing()

      X, y = data.data, data.target

      X_train, X_test, y_train, y_test = train_test_split(X, y)

      model = XGBDistribution(

          distribution="normal",

          n_estimators=500,

          early_stopping_rounds=10

      )

      model.fit(X_train, y_train, eval_set=[(X_test, y_test)])

See the `documentation`_ for all available distributions.

Note: We recommend using early stopping to avoid overfitting, as otherwise the

determined distribution parameters can become unreliable.

After fitting, we can predict the parameters of the distribution:

.. code-block:: python

      preds = model.predict(X_test)

      mean, std = preds.loc, preds.scale

Note that this returned a `namedtuple`_ of `numpy arrays`_ for each parameter of the

distribution (we use the `scipy stats`_ naming conventions for the parameters, see e.g.

`scipy.stats.norm`_ for the normal distribution).

NGBoost performance comparison

===============================

``XGBDistribution`` follows the method shown in the `NGBoost`_ library, using natural

gradients to estimate the parameters of the distribution.

Below, we show a performance comparison of ``XGBDistribution`` and the `NGBoost`_

``NGBRegressor``, using the California Housing dataset, estimating normal distributions.

While the performance of the two models is fairly similar (measured on negative

log-likelihood of a normal distribution and the RMSE), ``XGBDistribution`` is around

**15x faster** (timed on both fit and predict steps):

.. image:: https://raw.githubusercontent.com/CDonnerer/xgboost-distribution/main/imgs/performance_comparison.png

          :align: center

          :width: 600px

          :alt: XGBDistribution vs NGBoost

Please see the `experiments page`_ for results across various datasets.

Full XGBoost features

======================

``XGBDistribution`` offers the full set of XGBoost features available in the

`XGBoost scikit-learn API`_, allowing, for example, probabilistic regression

with `monotonic constraints`_:

.. image:: https://raw.githubusercontent.com/CDonnerer/xgboost-distribution/main/imgs/monotone_constraint.png

          :align: center

          :width: 600px

          :alt: XGBDistribution monotonic constraints

Acknowledgements

=================

This package would not exist without the excellent work from:

- `NGBoost`_ - Which demonstrated how gradient boosting with natural gradients

  can be used to estimate parameters of distributions. Much of the gradient

  calculations code were adapted from there.

- `XGBoost`_ - Which provides the gradient boosting algorithms used here, in

  particular the ``sklearn`` APIs were taken as a blue-print.

.. _pyscaffold-notes:

Note

====

This project has been set up using PyScaffold 4.0.1. For details and usage

information on PyScaffold see https://pyscaffold.org/.

.. _ngboost: https://github.com/stanfordmlgroup/ngboost

.. _faster:  https://xgboost-distribution.readthedocs.io/en/latest/experiments.html

.. _xgboost scikit-learn api: https://xgboost.readthedocs.io/en/latest/python/python_api.html#module-xgboost.sklearn

.. _dependencies: https://github.com/CDonnerer/xgboost-distribution/blob/feature/update-linting/setup.cfg#L37

.. _monotonic constraints: https://xgboost.readthedocs.io/en/latest/tutorials/monotonic.html

.. _scipy.stats.norm: https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.norm.html

.. _LAPACK gesv: https://www.netlib.org/lapack/lug/node71.html

.. _xgboost: https://github.com/dmlc/xgboost

.. _documentation: https://xgboost-distribution.readthedocs.io/en/latest/api/xgboost_distribution.XGBDistribution.html#xgboost_distribution.XGBDistribution

.. _experiments page: https://xgboost-distribution.readthedocs.io/en/latest/experiments.html

.. _numpy arrays: https://numpy.org/doc/stable/reference/generated/numpy.array.html

.. _scipy stats: https://docs.scipy.org/doc/scipy/reference/stats.html

.. _namedtuple: https://docs.python.org/3/library/collections.html#collections.namedtuple

.. _maximum likelihood estimation: https://en.wikipedia.org/wiki/Maximum_likelihood_estimation

ecosyste.ms

Data

Tools

Indexes

Applications

Experiments

Awesome

https://github.com/CDonnerer/xgboost-distribution

Awesome Lists containing this project

README