Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://github.com/matthewwardrop/formulaic
A high-performance implementation of Wilkinson formulas for Python.
- Host: GitHub
- URL: https://github.com/matthewwardrop/formulaic
- Owner: matthewwardrop
- License: mit
- Created: 2019-09-02T03:23:35.000Z (over 5 years ago)
- Default Branch: main
- Last Pushed: 2025-01-09T23:18:47.000Z (about 1 month ago)
- Last Synced: 2025-02-09T21:01:13.240Z (11 days ago)
- Language: Python
- Homepage:
- Size: 3.03 MB
- Stars: 367
- Watchers: 12
- Forks: 27
- Open Issues: 19
Metadata Files:
- Readme: README.md
- License: LICENSE
Awesome Lists containing this project
- awesome-list - Formulaic - A high-performance implementation of Wilkinson formulas for Python. (Linear Algebra / Statistics Toolkit / General Purpose Tensor Library)
README
# Formulaic

Formulaic is a high-performance implementation of Wilkinson formulas for Python.

- **Documentation**: https://matthewwardrop.github.io/formulaic
- **Source Code**: https://github.com/matthewwardrop/formulaic
- **Issue tracker**: https://github.com/matthewwardrop/formulaic/issues

It provides:
- high-performance dataframe to model-matrix conversions.
- support for reusing the encoding choices made during conversion of one dataset on other datasets.
- extensible formula parsing.
- extensible data input/output plugins, with implementations for:
  - input:
    - `pandas.DataFrame`
    - `pyarrow.Table`
  - output:
    - `pandas.DataFrame`
    - `numpy.ndarray`
    - `scipy.sparse.csc_matrix`
- support for symbolic differentiation of formulas (and hence model matrices).
- and much more.

## Example code
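```
import pandas
from formulaic import Formula

# Toy dataset: numeric response `y`, categorical `x`, numeric `z`.
df = pandas.DataFrame({
    'y': [0, 1, 2],
    'x': ['A', 'B', 'C'],
    'z': [0.3, 0.1, 0.2],
})

# Build the response (left-hand side) and the model matrix (right-hand side).
y, X = Formula('y ~ x + z').get_model_matrix(df)
```

`y = `
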
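|   | y |
|---|---|
| 0 | 0 |
| 1 | 1 |
| 2 | 2 |

`X = `

|   | Intercept | x[T.B] | x[T.C] | z |
|---|---|---|---|---|
| 0 | 1.0 | 0 | 0 | 0.3 |
| 1 | 1.0 | 1 | 0 | 0.1 |
| 2 | 1.0 | 0 | 1 | 0.2 |
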
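
To make the columns of `X` concrete, here is a minimal plain-Python sketch of the treatment (dummy) coding that produces them, and of what it means to reuse the encoding learned from one dataset on another. It is an illustration only and not Formulaic's implementation: the helper names `learn_levels` and `encode` are hypothetical, and the sketch hard-codes the `x + z` right-hand side.

```python
# Illustration only: hand-rolled treatment coding, NOT Formulaic's implementation.
# `learn_levels` and `encode` are hypothetical helpers invented for this sketch.

def learn_levels(values):
    """Return the sorted distinct categories seen in the original data."""
    return sorted(set(values))


def encode(rows, levels):
    """Build model-matrix rows for `x + z` with an intercept.

    `rows` is a list of (x, z) pairs. The first level is the reference
    category and gets no column, which is why `X` has `x[T.B]` and
    `x[T.C]` columns but no `x[T.A]` column.
    """
    columns = ["Intercept"] + [f"x[T.{lvl}]" for lvl in levels[1:]] + ["z"]
    matrix = []
    for x, z in rows:
        if x not in levels:
            raise ValueError(f"unseen category: {x!r}")
        dummies = [1 if x == lvl else 0 for lvl in levels[1:]]
        matrix.append([1.0] + dummies + [z])
    return columns, matrix


# Encoding choices (the category levels) are learned once...
train = [("A", 0.3), ("B", 0.1), ("C", 0.2)]
levels = learn_levels(x for x, _ in train)
columns, X = encode(train, levels)

# ...and reused on new data, so the columns line up even though "B" is absent.
new = [("C", 0.5), ("A", 0.4)]
_, X_new = encode(new, levels)
```

Because the levels come from the original data rather than from `new`, `X_new` keeps the same four columns as `X`. Formulaic keeps the equivalent state on the `model_spec` attribute of the matrices it generates; see the project documentation for how to apply it to new data.
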
Note that the `Formula(...).get_model_matrix(df)` call can be shortened to:
```
from formulaic import model_matrix
model_matrix('y ~ x + z', df)
```

## Benchmarks
Formulaic typically outperforms R for both dense and sparse model matrices, and vastly outperforms `patsy` (the existing implementation for Python) for dense matrices (`patsy` does not support sparse model matrix output).

For more details, see [here](benchmarks/README.md).
## Related projects and prior art
- [Patsy](https://github.com/pydata/patsy): a prior implementation of Wilkinson formulas for Python, which is widely used (e.g. in statsmodels). It has fantastic documentation (which helped bootstrap this project), and a rich array of features.
- [StatsModels.jl `@formula`](https://juliastats.org/StatsModels.jl/stable/formula/): The implementation of Wilkinson formulas for Julia.
- [R Formulas](https://www.rdocumentation.org/packages/stats/versions/3.6.2/topics/formula): The implementation of Wilkinson formulas for R, which is thoroughly introduced [here](https://cran.r-project.org/web/packages/Formula/vignettes/Formula.pdf). (R itself is an implementation of [S](https://en.wikipedia.org/wiki/S_%28programming_language%29), in which formulas were first made popular.)
- The work that started it all: Wilkinson, G. N., and C. E. Rogers. Symbolic description of factorial models for analysis of variance. Journal of the Royal Statistical Society, Series C 22, pp. 392–399, 1973.

## Used by
Below are some of the projects that use Formulaic:
- [Glum](https://github.com/Quantco/glum): High performance Python GLMs with all the features.
- [Lifelines](https://github.com/camDavidsonPilon/lifelines): Survival analysis in Python.
- [Linearmodels](https://github.com/bashtage/linearmodels): Additional linear models including instrumental variable and panel data models that are missing from statsmodels.
- [Pyfixest](https://github.com/s3alfisc/pyfixest): Fast High-Dimensional Fixed Effects Regression in Python following fixest-syntax.
- [Tabmat](https://github.com/Quantco/tabmat): Efficient matrix representations for working with tabular data.
- Add your project here!