Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://github.com/aseyboldt/covadapt
Low rank adaptation of covariance matrices for nuts sampling in pymc3
- Host: GitHub
- URL: https://github.com/aseyboldt/covadapt
- Owner: aseyboldt
- License: mit
- Created: 2019-08-27T16:13:02.000Z (about 5 years ago)
- Default Branch: master
- Last Pushed: 2022-06-07T07:34:17.000Z (over 2 years ago)
- Last Synced: 2024-08-01T16:48:08.373Z (3 months ago)
- Language: Jupyter Notebook
- Size: 667 KB
- Stars: 14
- Watchers: 5
- Forks: 2
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
README
# Better mass matrices for NUTS
This is an experimental implementation of a low rank approximation of
mass matrices for Hamiltonian MCMC samplers, specifically for PyMC.

*This is for experimentation only! Do not use for actual work (yet)!*
But feel welcome to try it out, and tell me how it worked for your models!
## Install
```
pip install git+https://github.com/aseyboldt/covadapt.git
```

## Usage
See the `notebooks/covadapt_intro.ipynb` notebook.
## (Draft of an) Overview
When we use preconditioning with a mass matrix to improve performance of HMC
based on previous draws, we often ignore information that we already computed:
the gradients of the posterior density at those samples. But those gradients
contain a lot of information about the posterior geometry and as such also
about possible preconditioners. If, for example, we assume that the posterior is
an $n$-dimensional normal distribution, then knowing the gradient at $n + 1$
locations identifies the covariance matrix – and as such the optimal
preconditioner of the posterior – *exactly*.
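
As a quick sanity check of that claim, here is a small NumPy sketch (not part of covadapt): it evaluates the gradient of a Gaussian log density at $n + 1$ points and recovers the covariance exactly, using the fact that the gradient is affine in $x$.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5

# A random SPD covariance and mean playing the role of the posterior.
L = rng.normal(size=(n, n))
Sigma = L @ L.T + n * np.eye(n)
mu = rng.normal(size=n)
Sigma_inv = np.linalg.inv(Sigma)

def grad_logp(x):
    # grad log N(x | mu, Sigma) = -Sigma^{-1} (x - mu)
    return -Sigma_inv @ (x - mu)

# Gradients at n + 1 locations in general position.
xs = rng.normal(size=(n + 1, n))
grads = np.array([grad_logp(x) for x in xs])

# The gradient is affine in x: g = A x + b with A = -Sigma^{-1}.
# Solve for A and b exactly from the n + 1 (location, gradient) pairs.
design = np.hstack([xs, np.ones((n + 1, 1))])   # shape (n + 1, n + 1)
coef = np.linalg.solve(design, grads)           # rows: A^T, then b
A = coef[:n].T
Sigma_recovered = np.linalg.inv(-A)

print(np.allclose(Sigma_recovered, Sigma))      # True
```
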
We can evaluate a preconditioning matrix $\hat{\Sigma}$ by thinking of it, together with a
mean $\hat{\mu}$, as a normal distribution
$q(x) = N(x \mid \hat{\mu}, \hat{\Sigma})$ that approximates the posterior density $p$, such that

$$
F(p \mid q) = \int p(x) \, \lVert \nabla p(x) - \nabla q(x)\rVert_{\hat{\Sigma}}^2 \, dx
$$

is small (where $\lVert x\rVert_{\hat{\Sigma}}$ is the norm defined by the
preconditioner). Equivalently, we can look for an affine transformation
$T(x) = \hat{\Sigma}^{\tfrac{1}{2}}x + \hat{\mu}$
such that

$$
F(p, T) = \int p(x) \, \lVert\nabla T(x) - \nabla N(x\mid 0, I)\rVert ^ 2 \, dx
$$

is minimal.
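
For concreteness, here is a sketch (not covadapt's API; the helper name and the assumption that the gradients are those of the log densities are mine) of how such an objective could be estimated by Monte Carlo from a set of draws and their gradients:

```python
import numpy as np

def estimate_F(draws, grads, mu_hat, Sigma_hat):
    """Monte Carlo estimate of F: draws and grads have shape (n_draws, dim),
    with grads[i] the gradient of the log posterior at draws[i]."""
    Sigma_hat_inv = np.linalg.inv(Sigma_hat)
    # Gradient of log N(x | mu_hat, Sigma_hat) at each draw.
    grads_q = -(draws - mu_hat) @ Sigma_hat_inv
    diff = grads - grads_q
    # ||v||_{Sigma_hat}^2 = v^T Sigma_hat v, averaged over the draws.
    return np.einsum("ij,jk,ik->i", diff, Sigma_hat, diff).mean()
```

For a Gaussian posterior this estimate goes to zero when $\hat{\mu}$ and $\hat{\Sigma}$ match the true mean and covariance.
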
Given an arbitrary but sufficiently nice posterior $p$, this is minimal if
$\hat{\Sigma}$ is the geodesic mean of the covariance of $p$ and the inverse
of the covariance of $\nabla p$. If $p$ is normal, then $\text{Cov}(\nabla p) = \text{Cov}(p)^{-1}$, so the minimum is reached at the covariance matrix.
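
A small NumPy check of the Gaussian case (again assuming the gradients are those of the log density):

```python
import numpy as np

rng = np.random.default_rng(1)
dim, n_draws = 4, 200_000

L = rng.normal(size=(dim, dim))
Sigma = L @ L.T + dim * np.eye(dim)
Sigma_inv = np.linalg.inv(Sigma)

# Draws from N(0, Sigma) and the gradients of the log density at those draws.
draws = rng.multivariate_normal(np.zeros(dim), Sigma, size=n_draws)
grads = -draws @ Sigma_inv

# Up to Monte Carlo error, Cov(grads) is the inverse of Cov(draws).
print(np.abs(np.cov(grads.T) - Sigma_inv).max())   # close to 0
```
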
If we only allow diagonal preconditioning matrices, we can find the minimum
analytically as

$$
C = \text{diag}\left(\sqrt{\frac{\text{Var}(p)}{\text{Var}(\nabla p)}}\right).
$$

This diagonal preconditioner is already implemented in PyMC and nuts-rs.
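
As a sketch of that rule (illustrative only, not covadapt's or PyMC's API), the diagonal entries can be estimated directly from draws and their gradients:

```python
import numpy as np

def diag_preconditioner(draws, grads):
    """Elementwise sqrt(Var(draws) / Var(grads)); draws and grads have shape
    (n_draws, dim), with grads the gradients of the log posterior at the draws."""
    return np.sqrt(draws.var(axis=0) / grads.var(axis=0))
```

For a Gaussian posterior with independent coordinates this returns the posterior variances.
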
If we approximate the integral in $F$ with a finite number of samples using a Monte Carlo estimate, we find that $F$ is minimal if

$$
\text{Cov}(x_i) = \hat{\Sigma} \, \text{Cov}(\nabla x_i) \, \hat{\Sigma}.
$$
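
When there are enough draws for both sample covariances to be full rank, this equation has a closed-form solution: the matrix geometric mean of $\text{Cov}(x_i)$ and $\text{Cov}(\nabla x_i)^{-1}$, consistent with the geodesic-mean statement above. A sketch, not covadapt's implementation:

```python
import numpy as np

def _sqrtm_spd(A):
    """Symmetric square root of a symmetric positive definite matrix."""
    w, V = np.linalg.eigh(A)
    return (V * np.sqrt(w)) @ V.T

def solve_fixed_point(cov_x, cov_grad):
    """Solve cov_x = S @ cov_grad @ S for symmetric positive definite S:
    the matrix geometric mean of cov_x and inv(cov_grad)."""
    B_half = _sqrtm_spd(cov_grad)
    B_half_inv = np.linalg.inv(B_half)
    return B_half_inv @ _sqrtm_spd(B_half @ cov_x @ B_half) @ B_half_inv
```

For a Gaussian posterior, $\text{Cov}(\nabla x_i) \approx \text{Cov}(x_i)^{-1}$, and the solution reduces to the sample covariance itself.
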
If we have more dimensions than draws, this does not have a unique solution,
so we introduce regularization. Some regularization methods based on the logdet or trace of $\Sigma$ or $\Sigma^{-1}$ still allow more or less explicit solutions as algebraic Riccati equations that can sometimes be made to scale reasonably with
the dimension, but in my experiments the geodesic distance to $I$, $R(\hat\Sigma)=\sum\log(\sigma_i)^2$, seems to work better.
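
For reference, this regularizer can be computed from the eigenvalues of the preconditioner (assuming the $\sigma_i$ are the eigenvalues of $\hat\Sigma$); it is the squared geodesic distance from $\hat\Sigma$ to the identity in the affine-invariant metric on SPD matrices:

```python
import numpy as np

def geodesic_regularizer(Sigma_hat):
    """Sum of squared log-eigenvalues of Sigma_hat."""
    eigvals = np.linalg.eigvalsh(Sigma_hat)
    return np.sum(np.log(eigvals) ** 2)
```
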
To avoid quadratic memory and computational costs in the dimensionality,
we write $\hat{\Sigma} = D(I + Q\Sigma Q^T - QQ^T)D$, where $Q\in\mathbb{R}^{N\times k}$ is orthogonal and $D, \Sigma$ are diagonal, so that we can perform
all operations necessary for HMC or NUTS in $O(Nk)$.
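
For illustration, here is how a matrix-vector product with $\hat\Sigma$ can be carried out in $O(Nk)$ under that parametrization (a sketch; the names do not refer to covadapt's internals), storing $D$ and $\Sigma$ as vectors of diagonal entries and $Q$ as an $N \times k$ array:

```python
import numpy as np

def sigma_hat_matvec(d, Q, sigma, v):
    """Compute Sigma_hat @ v with Sigma_hat = D (I + Q Sigma Q^T - Q Q^T) D,
    without ever forming the N x N matrix."""
    x = d * v                             # D v
    proj = Q.T @ x                        # k inner products, O(N k)
    x = x + Q @ ((sigma - 1.0) * proj)    # (I + Q Sigma Q^T - Q Q^T) D v
    return d * x                          # final multiplication by D
```
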
We can now define a Riemannian metric on the space of all $(D, Q, \Sigma)$
as a pullback of the Fisher information metric of $N(0, \hat\Sigma)$
and minimize $F$ using natural gradient descent. If we do this during tuning, we get behavior similar to stochastic natural gradient descent, and
can avoid saddle points during optimization.

## Acknowledgment
A lot of the work that went into this package was done during my time at Quantopian,
while trying to improve sampling of a (pretty awesome) model for portfolio
optimization. Thanks a lot for making that possible!

![Quantopian logo](https://raw.githubusercontent.com/pymc-devs/pymc3/master/docs/quantopianlogo.jpg)