https://github.com/quantco/glum
High performance Python GLMs with all the features!
- Host: GitHub
- URL: https://github.com/quantco/glum
- Owner: Quantco
- License: bsd-3-clause
- Created: 2020-03-25T19:37:22.000Z (about 5 years ago)
- Default Branch: main
- Last Pushed: 2025-05-14T12:39:48.000Z (about 1 month ago)
- Last Synced: 2025-05-16T13:01:33.748Z (about 1 month ago)
- Topics: elastic-net, gamma, glm, lasso, logit, poisson, ridge, tweedie
- Language: Python
- Homepage: https://glum.readthedocs.io/
- Size: 30 MB
- Stars: 334
- Watchers: 16
- Forks: 28
- Open Issues: 36
Metadata Files:
- Readme: README.md
- Changelog: CHANGELOG.rst
- Contributing: CONTRIBUTING.md
- License: LICENSE
- Codeowners: .github/CODEOWNERS
README
# glum
[CI](https://github.com/Quantco/glum/actions) | [Daily runs](https://github.com/Quantco/glum/actions/workflows/daily.yml) | [Docs](https://glum.readthedocs.io/) | [conda-forge](https://anaconda.org/conda-forge/glum) | [PyPI](https://pypi.org/project/glum) | [DOI](https://doi.org/10.5281/zenodo.14991108)

[Documentation](https://glum.readthedocs.io/en/latest/)
Generalized linear models (GLMs) are a core statistical tool that includes many common methods like least-squares regression, Poisson regression and logistic regression as special cases. At QuantCo, we have used GLMs in e-commerce pricing, insurance claims prediction and more. We have developed `glum`, a fast Python-first GLM library. The development was based on [a fork of scikit-learn](https://github.com/scikit-learn/scikit-learn/pull/9405), so it has a scikit-learn-like API. We are thankful for the starting point provided by Christian Lorentzen in that PR!
The goal of `glum` is to be at least as feature-complete as existing GLM libraries like `glmnet` or `h2o`. It supports:
* Built-in cross validation for optimal regularization, efficiently exploiting a “regularization path” (see the first sketch below)
* L1 regularization, which produces sparse and easily interpretable solutions
* L2 regularization, including variable matrix-valued (Tikhonov) penalties, which are useful in modeling correlated effects
* Elastic net regularization
* Normal, Poisson, logistic, gamma, and Tweedie distributions, plus varied and customizable link functions (see the second sketch below)
* Box constraints, linear inequality constraints, sample weights, offsets

This repo also includes tools for benchmarking GLM implementations in the `glum_benchmarks` module. For details on the benchmarking, [see here](src/glum_benchmarks/README.md). Although the performance of `glum` relative to `glmnet` and `h2o` depends on the specific problem, we find that when N >> K (there are more observations than predictors), it is consistently much faster for a wide range of problems.
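As a first sketch, here is how the built-in cross validation might be used via `GeneralizedLinearRegressorCV`. The constructor arguments below (`family`, `l1_ratio`, `cv`) follow the scikit-learn-style API; consult the API reference for the exact signature.

```python
# A minimal sketch of cross-validated fitting along a regularization path.
# Parameter and attribute names assume glum's scikit-learn-style conventions.
from glum import GeneralizedLinearRegressorCV

cv_model = GeneralizedLinearRegressorCV(
    family="binomial",  # any supported family works here
    l1_ratio=1.0,       # pure L1 (lasso); 0.0 would be pure ridge
    cv=5,               # 5-fold cross validation along the alpha path
)
# cv_model.fit(X, y)   # after fitting, the selected penalty strength
#                      # should be exposed via an attribute such as alpha_
```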

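And as a second sketch, a Tweedie regression with a log link, a setup commonly used for insurance claim amounts. `TweedieDistribution` is part of glum's public API; `power=1.5` is an illustrative choice, and `X_claims` / `y_claims` are placeholder names.

```python
# A minimal sketch of an elastic-net Tweedie GLM with a log link.
from glum import GeneralizedLinearRegressor, TweedieDistribution

tweedie_model = GeneralizedLinearRegressor(
    family=TweedieDistribution(power=1.5),  # between Poisson (1) and gamma (2)
    link="log",
    alpha=0.01,    # overall penalty strength
    l1_ratio=0.5,  # elastic net: a mix of L1 and L2
    # Per the feature list above, matrix-valued (Tikhonov) L2 penalties are
    # also supported; see the docs for the relevant parameter.
)
# tweedie_model.fit(X_claims, y_claims)  # placeholder data
```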
For more information on `glum`, including tutorials and API reference, please see [the documentation](https://glum.readthedocs.io/en/latest/).
Why did we choose the name `glum`? We wanted a name that had the letters GLM and wasn't easily confused with any existing implementation. And we thought glum sounded like a funny name (and not glum at all!). If you need a more professional sounding name, feel free to pronounce it as G-L-um. Or maybe it stands for "Generalized linear... ummm... modeling?"
# A classic example predicting housing prices
```python
>>> import pandas as pd
>>> from sklearn.datasets import fetch_openml
>>> from glum import GeneralizedLinearRegressor
>>>
>>> # This dataset contains house sale prices for King County, which includes
>>> # Seattle. It includes homes sold between May 2014 and May 2015.
>>> # The full version of this dataset can be found at:
>>> # https://www.openml.org/search?type=data&status=active&id=42092
>>> house_data = pd.read_parquet("data/housing.parquet")
>>>
>>> # Use only select features
>>> X = house_data[
... [
... "bedrooms",
... "bathrooms",
... "sqft_living",
... "floors",
... "waterfront",
... "view",
... "condition",
... "grade",
... "yr_built",
... "yr_renovated",
... ]
... ].copy()
>>>
>>> # Model whether a house had an above or below median price via a Binomial
>>> # distribution. We'll be doing L1-regularized logistic regression.
>>> price = house_data["price"]
>>> y = (price < price.median()).values.astype(int)
>>> model = GeneralizedLinearRegressor(
... family='binomial',
... l1_ratio=1.0,
... alpha=0.001
... )
>>>
>>> _ = model.fit(X=X, y=y)
>>>
>>> # get_formatted_diagnostics shows details about the steps taken by the iterative solver.
>>> diags = model.get_formatted_diagnostics(full_report=True)
>>> diags[['objective_fct']]
objective_fct
n_iter
0 0.693091
1 0.489500
2 0.449585
3 0.443681
4 0.443498
5 0.443497
>>>
>>> # Models can also be built with formulas from formulaic.
>>> model_formula = GeneralizedLinearRegressor(
... family='binomial',
... l1_ratio=1.0,
... alpha=0.001,
... formula="bedrooms + np.log(bathrooms + 1) + bs(sqft_living, 3) + C(waterfront)"
... )
>>> _ = model_formula.fit(X=house_data, y=y)
```
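Because the API follows scikit-learn conventions, the fitted model can be inspected and used for prediction in the usual way. A minimal sketch, assuming the scikit-learn-style attributes described above (output omitted):

```python
>>> # Fitted coefficients and intercept follow scikit-learn conventions.
>>> coefficients = pd.Series(model.coef_, index=X.columns)
>>> # For the Binomial family, predict returns expected probabilities.
>>> probabilities = model.predict(X)
```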
# Installation
Please install the package through conda-forge:
```bash
conda install glum -c conda-forge
```
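Since `glum` is also published on PyPI (see the badge above), installing with pip should work as well:

```bash
pip install glum
```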
# Performance
For optimal performance on an x86_64 architecture, we recommend using the MKL library
(`conda install mkl`). By default, conda usually installs the OpenBLAS version, which
is slower but supported on all major architectures and operating systems.