Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://github.com/aseyboldt/covadapt
Low rank adaptation of covariance matrices for nuts sampling in pymc3
- Host: GitHub
- URL: https://github.com/aseyboldt/covadapt
- Owner: aseyboldt
- License: mit
- Created: 2019-08-27T16:13:02.000Z (about 5 years ago)
- Default Branch: master
- Last Pushed: 2022-06-07T07:34:17.000Z (over 2 years ago)
- Last Synced: 2024-08-01T16:48:08.373Z (3 months ago)
- Language: Jupyter Notebook
- Size: 667 KB
- Stars: 14
- Watchers: 5
- Forks: 2
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
README
# Better mass matrices for NUTS
This is an experimental implementation of a low rank approximation of
mass matrices for Hamiltonian MCMC samplers, specifically for PyMC.

*This is for experimentation only! Do not use for actual work (yet)!*
But feel welcome to try it out, and tell me how it worked for your models!
## Install
```
pip install git+https://github.com/aseyboldt/covadapt.git
```

## Usage
See the `notebooks/covadapt_intro.ipynb` notebook.
## (Draft of an) Overview
When we use preconditioning with a mass matrix to improve performance of HMC
based on previous draws, we often ignore information that we already computed:
the gradients of the posterior density at those samples. But those gradients
contain a lot of information about the posterior geometry and as such also
about possible preconditioners. If, for example, we assume that the posterior is
an $n$-dimensional normal distribution, then knowing the gradient at $n + 1$
locations identifies the covariance matrix – and as such the optimal
preconditioner of the posterior – *exactly*.
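
As a quick sanity check of that claim, here is a small NumPy sketch (not part of covadapt): it evaluates the gradient of a Gaussian log density at $n + 1$ points and recovers the covariance exactly, using the fact that the gradient is affine in $x$.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5

# A random SPD covariance and mean playing the role of the posterior.
L = rng.normal(size=(n, n))
Sigma = L @ L.T + n * np.eye(n)
mu = rng.normal(size=n)
Sigma_inv = np.linalg.inv(Sigma)

def grad_logp(x):
    # grad log N(x | mu, Sigma) = -Sigma^{-1} (x - mu)
    return -Sigma_inv @ (x - mu)

# Gradients at n + 1 locations in general position.
xs = rng.normal(size=(n + 1, n))
grads = np.array([grad_logp(x) for x in xs])

# The gradient is affine in x: g = A x + b with A = -Sigma^{-1}.
# Solve for A and b exactly from the n + 1 (location, gradient) pairs.
design = np.hstack([xs, np.ones((n + 1, 1))])   # shape (n + 1, n + 1)
coef = np.linalg.solve(design, grads)           # rows: A^T, then b
A = coef[:n].T
Sigma_recovered = np.linalg.inv(-A)

print(np.allclose(Sigma_recovered, Sigma))      # True
```
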
We can evaluate a preconditioning matrix $\hat{\Sigma}$ by thinking of it, together with a
mean $\hat{\mu}$, as a normal distribution
$q(x) = N(x \mid \hat{\mu}, \hat{\Sigma})$ that approximates the posterior density $p$, such that

$$
F(p \mid q) = \int p(x) \, \lVert \nabla p(x) - \nabla q(x)\rVert_{\hat{\Sigma}}^2 \, dx
$$

is small (where $\lVert x\rVert_{\hat{\Sigma}}$ is the norm defined by the
preconditioner). Equivalently, we can look for an affine transformation
$T(x) = \hat{\Sigma}^{\tfrac{1}{2}}x + \hat{\mu}$
such that

$$
F(p, T) = \int p(x) \, \lVert\nabla T(x) - \nabla N(x\mid 0, I)\rVert ^ 2 \, dx
$$

is minimal.
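
For concreteness, here is a sketch (not covadapt's API; the helper name and the assumption that the gradients are those of the log densities are mine) of how such an objective could be estimated by Monte Carlo from a set of draws and their gradients:

```python
import numpy as np

def estimate_F(draws, grads, mu_hat, Sigma_hat):
    """Monte Carlo estimate of F: draws and grads have shape (n_draws, dim),
    with grads[i] the gradient of the log posterior at draws[i]."""
    Sigma_hat_inv = np.linalg.inv(Sigma_hat)
    # Gradient of log N(x | mu_hat, Sigma_hat) at each draw.
    grads_q = -(draws - mu_hat) @ Sigma_hat_inv
    diff = grads - grads_q
    # ||v||_{Sigma_hat}^2 = v^T Sigma_hat v, averaged over the draws.
    return np.einsum("ij,jk,ik->i", diff, Sigma_hat, diff).mean()
```

For a Gaussian posterior this estimate goes to zero when $\hat{\mu}$ and $\hat{\Sigma}$ match the true mean and covariance.
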
Given an arbitrary but sufficiently nice posterior $p$, this is minimal if
$\hat{\Sigma}$ is the geodesic mean of the covariance of $p$ and the inverse
of the covariance of $\nabla p$. If $p$ is normal, then $\text{Cov}(\nabla p) = \text{Cov}(p)^{-1}$, so the minimum is reached at the covariance matrix.
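
A small NumPy check of the Gaussian case (again assuming the gradients are those of the log density):

```python
import numpy as np

rng = np.random.default_rng(1)
dim, n_draws = 4, 200_000

L = rng.normal(size=(dim, dim))
Sigma = L @ L.T + dim * np.eye(dim)
Sigma_inv = np.linalg.inv(Sigma)

# Draws from N(0, Sigma) and the gradients of the log density at those draws.
draws = rng.multivariate_normal(np.zeros(dim), Sigma, size=n_draws)
grads = -draws @ Sigma_inv

# Up to Monte Carlo error, Cov(grads) is the inverse of Cov(draws).
print(np.abs(np.cov(grads.T) - Sigma_inv).max())   # close to 0
```
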
If we only allow diagonal preconditioning matrices, we can find the minimum
analytically as

$$
C = \text{diag}\left(\sqrt{\frac{\text{Var}(p)}{\text{Var}(\nabla p)}}\right).
$$

This diagonal preconditioner is already implemented in PyMC and nuts-rs.
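
As a sketch of that rule (illustrative only, not covadapt's or PyMC's API), the diagonal entries can be estimated directly from draws and their gradients:

```python
import numpy as np

def diag_preconditioner(draws, grads):
    """Elementwise sqrt(Var(draws) / Var(grads)); draws and grads have shape
    (n_draws, dim), with grads the gradients of the log posterior at the draws."""
    return np.sqrt(draws.var(axis=0) / grads.var(axis=0))
```

For a Gaussian posterior with independent coordinates this returns the posterior variances.
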
If we approximate the integral in $F$ with a finite number of samples using a Monte Carlo estimate, we find that $F$ is minimal if

$$
\text{Cov}(x_i) = \hat{\Sigma} \, \text{Cov}(\nabla x_i) \, \hat{\Sigma}.
$$
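
When there are enough draws for both sample covariances to be full rank, this equation has a closed-form solution: the matrix geometric mean of $\text{Cov}(x_i)$ and $\text{Cov}(\nabla x_i)^{-1}$, consistent with the geodesic-mean statement above. A sketch, not covadapt's implementation:

```python
import numpy as np

def _sqrtm_spd(A):
    """Symmetric square root of a symmetric positive definite matrix."""
    w, V = np.linalg.eigh(A)
    return (V * np.sqrt(w)) @ V.T

def solve_fixed_point(cov_x, cov_grad):
    """Solve cov_x = S @ cov_grad @ S for symmetric positive definite S:
    the matrix geometric mean of cov_x and inv(cov_grad)."""
    B_half = _sqrtm_spd(cov_grad)
    B_half_inv = np.linalg.inv(B_half)
    return B_half_inv @ _sqrtm_spd(B_half @ cov_x @ B_half) @ B_half_inv
```

For a Gaussian posterior, $\text{Cov}(\nabla x_i) \approx \text{Cov}(x_i)^{-1}$, and the solution reduces to the sample covariance itself.
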
If we have more dimensions than draws, this does not have a unique solution,
so we introduce regularization. Some regularization methods based on the logdet or trace of $\Sigma$ or $\Sigma^{-1}$ still allow more or less explicit solutions as algebraic Riccati equations that can sometimes be made to scale reasonably with
the dimension, but in my experiments the geodesic distance to $I$, $R(\hat\Sigma)=\sum\log(\sigma_i)^2$, seems to work better.
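
For reference, this regularizer can be computed from the eigenvalues of the preconditioner (assuming the $\sigma_i$ are the eigenvalues of $\hat\Sigma$); it is the squared geodesic distance from $\hat\Sigma$ to the identity in the affine-invariant metric on SPD matrices:

```python
import numpy as np

def geodesic_regularizer(Sigma_hat):
    """Sum of squared log-eigenvalues of Sigma_hat."""
    eigvals = np.linalg.eigvalsh(Sigma_hat)
    return np.sum(np.log(eigvals) ** 2)
```
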
To avoid quadratic memory and computational costs in the dimensionality,
we write $\hat{\Sigma} = D(I + Q\Sigma Q^T - QQ^T)D$, where $Q\in\mathbb{R}^{N\times k}$ is orthogonal and $D, \Sigma$ are diagonal, so that we can perform
all operations necessary for HMC or NUTS in $O(Nk)$.
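
For illustration, here is how a matrix-vector product with $\hat\Sigma$ can be carried out in $O(Nk)$ under that parametrization (a sketch; the names do not refer to covadapt's internals), storing $D$ and $\Sigma$ as vectors of diagonal entries and $Q$ as an $N \times k$ array:

```python
import numpy as np

def sigma_hat_matvec(d, Q, sigma, v):
    """Compute Sigma_hat @ v with Sigma_hat = D (I + Q Sigma Q^T - Q Q^T) D,
    without ever forming the N x N matrix."""
    x = d * v                             # D v
    proj = Q.T @ x                        # k inner products, O(N k)
    x = x + Q @ ((sigma - 1.0) * proj)    # (I + Q Sigma Q^T - Q Q^T) D v
    return d * x                          # final multiplication by D
```
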
We can now define a Riemannian metric on the space of all $(D, Q, \Sigma)$
as a pullback of the Fisher information metric of $N(0, \hat\Sigma)$
and minimize $F$ using natural gradient descent. If we do this during tuning, we get behavior similar to stochastic natural gradient descent, and
can avoid saddle points during optimization.

## Acknowledgment
A lot of the work that went into this package was done during my time at Quantopian,
while trying to improve sampling of a (pretty awesome) model for portfolio
optimization. Thanks a lot for making that possible!

![Quantopian logo](https://raw.githubusercontent.com/pymc-devs/pymc3/master/docs/quantopianlogo.jpg)