An open API service indexing awesome lists of open source software.

https://github.com/boennecd/mssm

R package for multivariate state space models
https://github.com/boennecd/mssm

Last synced: 4 months ago
JSON representation

R package for multivariate state space models

Awesome Lists containing this project

README

          

---
output:
github_document:
pandoc_args: --webtex=https://latex.codecogs.com/svg.latex?
bibliography: README.bib
nocite: |
@Polyak92
---

# Multivariate State Space Models
[![R-CMD-check](https://github.com/boennecd/mssm/workflows/R-CMD-check/badge.svg)](https://github.com/boennecd/mssm/actions)
[![](https://www.r-pkg.org/badges/version/mssm)](https://www.r-pkg.org/badges/version/mssm)
[![CRAN RStudio mirror downloads](https://cranlogs.r-pkg.org/badges/mssm)](https://cran.r-project.org/package=mssm)

```{r setup, include=FALSE}
knitr::opts_chunk$set(
error = FALSE, cache = "./README-cache/", fig.path = "man/figures/README-",
echo = TRUE)
options(digits = 4, scipen = 7)
.fig_height_small <- 4
source("render_toc.R")
palette(c("#000000", "#009E73", "#e79f00", "#9ad0f3", "#0072B2", "#D55E00",
"#CC79A7", "#F0E442"))
```

This package provides methods to estimate models of the form

$$y_{it} \sim g(\eta_{it}),\qquad i\in I_t$$
$$\eta_{it} = \vec\gamma^\top\vec x_{it} +\vec\beta_t^\top\vec z_{it}$$
$$\vec\beta_t = F\vec\beta_{t-1}+\vec\epsilon_t, \qquad \vec\epsilon_t\sim N(\vec 0, Q)$$

where $g$ is simple distribution, we observe $t=1,\dots,T$ periods, and $I_t$,
$y_{it}$, $\vec x_{it}$, and
$\vec z_{it}$ are known. What is multivariate is
$\vec y_t = \{y_{it}\}_{i\in I_t}$ (though, $\vec \beta_t$ can also be
multivariate) and this package is written to scale well
in the cardinality of $I_t$. The package uses independent
particle filters as suggested by @Lin05. This particular type of filter
can be used in the method suggested by @Poyiadjis11. I will show an example
of how to use the package through the rest of the document and highlight some
implementation details.

The package can be installed from Github e.g., by
calling

```{r github_dl, eval = FALSE}
devtools::install_github("boennecd/mssm")
```

or from CRAN by calling

```{r cran_inst, eval = FALSE}
install.packages("mssm")
```

## Table of Contents

```{r echo = FALSE}
render_toc("README.Rmd", toc_header_name = "Table of Contents", toc_depth = 3L, base_level = 2L)
```

## Poisson Example

We simulate data as follows.

```{r simulate, fig.height = .fig_height_small}
# simulate path of state variables
set.seed(78727269)
n_periods <- 312L
(F. <- matrix(c(.5, .1, 0, .8), 2L))
(Q <- matrix(c(.5^2, .1, .1, .7^2), 2L))
(Q_0 <- matrix(c(0.333, 0.194, 0.194, 1.46), 2L))

betas <- cbind(crossprod(chol(Q_0), rnorm(2L) ),
crossprod(chol(Q ), matrix(rnorm((n_periods - 1L) * 2), 2)))
betas <- t(betas)
for(i in 2:nrow(betas))
betas[i, ] <- betas[i, ] + F. %*% betas[i - 1L, ]
par(mar = c(5, 4, 1, 1))
matplot(betas, lty = 1, type = "l")

# simulate observations
cfix <- c(-1, .2, .5, -1) # gamma
n_obs <- 100L
dat <- lapply(1:n_obs, function(id){
x <- runif(n_periods, -1, 1)
X <- cbind(X1 = x, X2 = runif(1, -1, 1))
z <- runif(n_periods, -1, 1)

eta <- drop(cbind(1, X, z) %*% cfix + rowSums(cbind(1, z) * betas))
y <- rpois(n_periods, lambda = exp(eta))

# randomly drop some
keep <- .2 > runif(n_periods)

data.frame(y = y, X, Z = z, id = id, time_idx = 1:n_periods)[keep, ]
})
dat <- do.call(rbind, dat)

# show some properties
nrow(dat)
head(dat)
table(dat$y)

# quick smooth of number of events vs. time
par(mar = c(5, 4, 1, 1))
plot(smooth.spline(dat$time_idx, dat$y), type = "l", xlab = "Time",
ylab = "Number of events")

# and split by those with `Z` above and below 0
with(dat, {
z_large <- ifelse(Z > 0, "large", "small")
smooths <- lapply(split(cbind(dat, z_large), z_large), function(x){
plot(smooth.spline(x$time_idx, x$y), type = "l", xlab = "Time",
ylab = paste("Number of events -", unique(x$z_large)))
})
})
```

In the above, we simulate `r n_periods` (`n_periods`) with `r n_obs` (`n_obs`)
individuals. Each individual has a fixed covariate, `X2`, and two time-varying
covariates, `X1` and `Z`. One of the time-varying covariates, `Z`, has a
random slope. Further, the intercept is also random.

### Log-Likelihood Approximations

We start by estimating a generalized linear model without random effects.

```{r fit_glm}
glm_fit <- glm(y ~ X1 + X2 + Z, poisson(), dat)
summary(glm_fit)
logLik(glm_fit)
```

Next, we make a log-likelihood approximation with the implemented particle at
the true parameters with the `mssm` function.

```{r fit_mssm}
library(mssm)
ll_func <- mssm(
fixed = y ~ X1 + X2 + Z, family = poisson(), data = dat,
# make it explict that there is an intercept (not needed)
random = ~ 1 + Z, ti = time_idx, control = mssm_control(
n_threads = 5L, N_part = 500L, what = "log_density"))

system.time(
mssm_obj <- ll_func$pf_filter(
cfix = cfix, disp = numeric(), F. = F., Q = Q))

# returns the log-likelihood approximation
logLik(mssm_obj)

# also shown by print
mssm_obj
```

We get a much larger log-likelihood as expected. We can plot the predicted
values of state variables from the filter distribution.

```{r plot_filter, fig.height = .fig_height_small, fig.width = 12L}
# get predicted mean and prediction interval
filter_means <- plot(mssm_obj, do_plot = FALSE)

# plot with which also contains the true paths
for(i in 1:ncol(betas)){
be <- betas[, i]
me <- filter_means$means[i, ]
lb <- filter_means$lbs[i, ]
ub <- filter_means$ubs[i, ]

# dashed: true paths
# continuous: predicted mean from filter distribution
# dotted: prediction interval
par(mar = c(5, 4, 1, 1))
matplot(cbind(be, me, lb, ub), lty = c(2, 1, 3, 3), type = "l",
col = "black", ylab = rownames(filter_means$lbs)[i])
}
```

We can also get predicted values from the smoothing distribution.

```{r show_smooths, fig.height = .fig_height_small, fig.width = 12L}
# get smoothing weights
mssm_obj <- ll_func$smoother(mssm_obj)

# get predicted mean and prediction interval from smoothing distribution
smooth_means <- plot(mssm_obj, do_plot = FALSE, which_weights = "smooth")

for(i in 1:ncol(betas)){
be <- betas[, i]
me <- filter_means$means[i, ]
lb <- filter_means$lbs[i, ]
ub <- filter_means$ubs[i, ]
mes <- smooth_means$means[i, ]
lbs <- smooth_means$lbs[i, ]
ubs <- smooth_means$ubs[i, ]

# dashed: true paths
# continuous: predicted mean from filter distribution
# dotted: prediction interval
#
# smooth predictions are in a different color
par(mar = c(5, 4, 1, 1))
matplot(cbind(be, me, lb, ub, mes, lbs, ubs),
lty = c(2, 1, 3, 3, 1, 3, 3), type = "l",
col = c(rep(1, 4), rep(2, 3)), ylab = rownames(filter_means$lbs)[i])
}

# compare mean square error of the two means
rbind(filter = colMeans((t(filter_means$means) - betas)^2),
smooth = colMeans((t(smooth_means$means) - betas)^2))
```

We can get the effective sample size at each point in time with the `get_ess`
function.

```{r show_ess, fig.height = .fig_height_small}
(ess <- get_ess(mssm_obj))
plot(ess)
```

We can compare this what we get by using a so-called bootstrap (like) filter
instead.

```{r comp_boot, fig.height = .fig_height_small}
local({
ll_boot <- mssm(
fixed = y ~ X1 + X2 + Z, family = poisson(), data = dat,
random = ~ Z, ti = time_idx, control = mssm_control(
n_threads = 5L, N_part = 500L, what = "log_density",
which_sampler = "bootstrap"))

print(system.time(
boot_fit <- ll_boot$pf_filter(
cfix = coef(glm_fit), disp = numeric(), F. = F., Q = Q)))

plot(get_ess(boot_fit))
})
```

The above is not much faster (and maybe slower in this run) as the bulk of
the computation is not in the sampling step. We can also compare the
log-likelihood approximation with what we get if we choose parameters close
to the GLM estimates.

```{r comp_close_glm}
mssm_glm <- ll_func$pf_filter(
cfix = coef(glm_fit), disp = numeric(), F. = diag(1e-8, 2),
Q = diag(1e-4^2, 2))
logLik(mssm_glm)
```

### Antithetic Variables

One way to reduce the variance of the Monte Carlo estimate is to use
[antithetic variables](https://en.wikipedia.org/wiki/Antithetic_variates).
Two types of antithetic variables are implemented as in @Durbin97.
That is, one balanced for location and two balanced for scale. This is currently
only implemented with a t-distribution as the proposal distribution.

We start by giving some details on the locations balanced variable. Suppose
we use a t-distribution with $\nu$ degrees of freedom, a $d$ dimensional mean
of $\mu$ and a scale matrix $\Sigma$. We can then generate a sample by setting

$$\begin{aligned} \vec x &= \vec\mu + C \frac{\vec z}{\sqrt{a / \nu}} & \Sigma &= CC^\top \\ \vec z &\sim N(\vec 0, I) & a &\sim \chi^2_\nu \end{aligned}$$

Then the location balanced variable is

$$\widehat{\vec x} = \vec\mu - C \frac{\vec z}{\sqrt{a / \nu}}$$

For the scaled balanced variables we use that

$$u = \frac{\vec z^\top\vec z/ d}{a / \nu} \sim F(d, \nu)$$

We then define the cumulative distribution function

$$k = P(U \leq u) = Q(u)$$

and set

$$u' = Q^{-1}(1 - k)$$

Then the two scaled balanced variables are

$$\begin{aligned} \widetilde{\vec x}_1 &= \vec\mu + \sqrt{u'/u} \cdot C \frac{\vec z}{\sqrt{a / \nu}} \\ \widetilde{\vec x}_2 &= \vec\mu - \sqrt{u'/u} \cdot C \frac{\vec z}{\sqrt{a / \nu}} \end{aligned}$$

We will illustrate the reduction in variance of the log-likelihood estimate.
To do so, we run the particle filter with and without antithetic variables
multiple times below
to get an estimate of the error of the approximation.

```{r anti_ex, cache = 1}
set.seed(12769550)
compare_anti <- local({
ll_func_no_anti <- mssm(
fixed = y ~ X1 + X2 + Z, family = poisson(), data = dat,
random = ~ Z, ti = time_idx, control = mssm_control(
n_threads = 5L, N_part = 500L, what = "log_density"))

ll_func_anti <- mssm(
fixed = y ~ X1 + X2 + Z, family = poisson(), data = dat,
random = ~ Z, ti = time_idx, control = mssm_control(
n_threads = 5L, N_part = 500L, what = "log_density",
use_antithetic = TRUE))

no_anti <- replicate(
100, {
ti <- system.time(x <- logLik(ll_func_no_anti$pf_filter(
cfix = cfix, disp = numeric(), F. = F., Q = Q, seed = NULL)))
list(ti = ti, x = x)
}, simplify = FALSE)
w_anti <- replicate(
100, {
ti <- system.time(x <- logLik(ll_func_anti$pf_filter(
cfix = cfix, disp = numeric(), F. = F., Q = Q, seed = NULL)))
list(ti = ti, x = x)
}, simplify = FALSE)

list(no_anti = no_anti, w_anti = w_anti)
})
```

The mean estimate of the log-likelihood and standard error of the estimate is
shown below with and without antithetic variables.

```{r show_anti_ex}
sapply(compare_anti, function(x){
x <- sapply(x, "[[", "x")
c(mean = mean(x), se = sd(x) / sqrt(length(x)))
})
```

Using antithetic variables is slower. Below we show summary statistics for the
elapsed time without using antithetic variables and with antithetic variables.

```{r run_anti_ex}
sapply(compare_anti, function(x){
x <- sapply(x, "[[", "ti")
z <- x[c("elapsed"), ]
c(mean = mean(z), quantile(z, c(.5, .25, .75, 0, 1)))
})
```

### Parameter Estimation

We will need to estimate the parameters for real applications. We could do
this e.g., with a Monte Carlo expectation-maximization algorithm or by using
a Monte Carlo approximation of the gradient. Currently, the latter is only
available and the user will have to write a custom function to perform
the estimation.
I will provide an example below. The `sgd` function is not a part of the
package. Instead the package provides a way to approximate the gradient and
allows the user to perform subsequent maximization (e.g., with constraints or
penalties). The definition of the
`sgd` is given at the end of this file as it is somewhat long. We start
by using a Laplace approximation to get the starting values.

```{r mc_grad_est, echo = FALSE}
# Stochastic gradient descent for mssm object. The function assumes that the
# state vector is stationary. The default values are rather arbitrary.
#
# Args:
# object: an object of class mssmFunc.
# n_it: number of iterations.
# lrs: learning rates to use. Must have n_it elements. Like problem specific.
# avg_start: index to start averaging. See Polyak et al. (1992) for arguments
# for averaging.
# cfix: starting values for fixed coefficients.
# F.: starting value for transition matrix in conditional distribution of the
# current state given the previous state.
# Q: starting value for covariance matrix in conditional distribution of the
# current state given the previous.
# verbose: TRUE if output should be printed during estmation.
# disp: starting value for dispersion parameter.
#
# Returns:
# List with estimates and the log-likelihood approximation at each iteration.
sgd <- function(
object, n_it = 150L,
lrs = 1e-2 * (1:n_it)^(-1/2), avg_start = max(1L, as.integer(n_it * 4L / 5L)),
cfix, F., Q, verbose = FALSE, disp = numeric())
{
# make checks
stopifnot(
inherits(object, "mssmFunc"), object$control$what == "gradient", n_it > 0L,
all(lrs > 0.), avg_start > 1L, length(lrs) == n_it)
n_fix <- nrow(object$X)
n_rng <- nrow(object$Z)

# objects for estimates at each iteration, gradient norms, and
# log-likelihood approximations
has_dispersion <-
any(sapply(c("^Gamma", "^gaussian"), grepl, x = object$family))
n_params <-
has_dispersion + n_fix + n_rng * n_rng + n_rng * (n_rng + 1L) / 2L
ests <- matrix(NA_real_, n_it + 1L, n_params)
ests[1L, ] <- c(
cfix, if(has_dispersion) disp else NULL, F.,
Q[lower.tri(Q, diag = TRUE)])
grad_norm <- lls <- rep(NA_real_, n_it)

# indices of the different components
idx_fix <- 1:n_fix
if(has_dispersion)
idx_dip <- n_fix + 1L
idx_F <- 1:(n_rng * n_rng) + n_fix + has_dispersion
idx_Q <- 1:(n_rng * (n_rng + 1L) / 2L) + n_fix + n_rng * n_rng +
has_dispersion

# function to set the parameters
library(matrixcalc) # TODO: get rid of this
dup_mat <- duplication.matrix(ncol(Q))
set_parems <- function(i){
# select whether or not to average
idx <- if(i > avg_start) avg_start:i else i

# set new parameters
cfix <<- colMeans(ests[idx, idx_fix, drop = FALSE])
if(has_dispersion)
disp <<- colMeans(ests[idx, idx_dip, drop = FALSE])
F.[] <<- colMeans(ests[idx, idx_F , drop = FALSE])
Q[] <<- dup_mat %*% colMeans(ests[idx, idx_Q , drop = FALSE])

}

# run algorithm
max_half <- 25L
for(i in 1:n_it + 1L){
# get gradient. First, run the particle filter
filter_out <- object$pf_filter(
cfix = cfix, disp = disp, F. = F., Q = Q, seed = NULL)
lls[i - 1L] <- c(logLik(filter_out))

# then get the gradient associated with each particle and the log
# normalized weight of the particles
grads <- tail(filter_out$pf_output, 1L)[[1L]]$stats
ws <- tail(filter_out$pf_output, 1L)[[1L]]$ws_normalized

# compute the gradient and take a small step
grad <- colSums(t(grads) * drop(exp(ws)))
grad_norm[i - 1L] <- norm(t(grad))

lr_i <- lrs[i - 1L]
k <- 0L
while(k < max_half){
ests[i, ] <- ests[i - 1L, ] + lr_i * grad
set_parems(i)

# check that Q is positive definite and the system is stationary
c1 <- all(abs(eigen(F.)$values) < 1)
c2 <- all(eigen(Q)$values > 0)
c3 <- !has_dispersion || disp > 0.
if(c1 && c2 && c3)
break

# decrease learning rate
lr_i <- lr_i * .5
k <- k + 1L
}

# check if we failed to find a value within our constraints
if(k == max_half)
stop("failed to find solution within constraints")

# print information if requested
if(verbose){
cat(sprintf(
"\nIt %5d: log-likelihood (current, max) %12.2f, %12.2f\n",
i - 1L, logLik(filter_out), max(lls, na.rm = TRUE)),
rep("-", 66), "\n", sep = "")
cat("cfix\n")
print(cfix)
if(has_dispersion)
cat(sprintf("Dispersion: %20.8f\n", disp))
cat("F\n")
print(F.)
cat("Q\n")
print(Q)
cat(sprintf("Gradient norm: %10.4f\n", grad_norm[i - 1L]))
print(get_ess(filter_out))

}
}

list(estimates = ests, logLik = lls, F. = F., Q = Q, cfix = cfix,
disp = disp, grad_norm = grad_norm)
}

# Stochastic gradient descent for mssm object using the Adam algorithm. The
# function assumes that the state vector is stationary.
#
# Args:
# object: an object of class mssmFunc.
# n_it: number of iterations.
# mp: decay rate for first moment.
# vp: decay rate for secod moment.
# lr: learning rate.
# cfix: starting values for fixed coefficients.
# F.: starting value for transition matrix in conditional distribution of the
# current state given the previous state.
# Q: starting value for covariance matrix in conditional distribution of the
# current state given the previous.
# verbose: TRUE if output should be printed during estmation.
# disp: starting value for dispersion parameter.
#
# Returns:
# List with estimates and the log-likelihood approximation at each iteration.
adam <- function(
object, n_it = 150L, mp = .9, vp = .999, lr = .01, cfix, F., Q,
verbose = FALSE, disp = numeric())
{
# make checks
stopifnot(
inherits(object, "mssmFunc"), object$control$what == "gradient", n_it > 0L,
lr > 0., mp > 0, mp < 1, vp > 0, vp < 1)
n_fix <- nrow(object$X)
n_rng <- nrow(object$Z)

# objects for estimates at each iteration, gradient norms, and
# log-likelihood approximations
has_dispersion <-
any(sapply(c("^Gamma", "^gaussian"), grepl, x = object$family))
n_params <-
has_dispersion + n_fix + n_rng * n_rng + n_rng * (n_rng + 1L) / 2L
ests <- matrix(NA_real_, n_it + 1L, n_params)
ests[1L, ] <- c(
cfix, if(has_dispersion) disp else NULL, F.,
Q[lower.tri(Q, diag = TRUE)])
grad_norm <- lls <- rep(NA_real_, n_it)

# indices of the different components
idx_fix <- 1:n_fix
if(has_dispersion)
idx_dip <- n_fix + 1L
idx_F <- 1:(n_rng * n_rng) + n_fix + has_dispersion
idx_Q <- 1:(n_rng * (n_rng + 1L) / 2L) + n_fix + n_rng * n_rng +
has_dispersion

# function to set the parameters
library(matrixcalc) # TODO: get rid of this
dup_mat <- duplication.matrix(ncol(Q))
set_parems <- function(i){
cfix <<- ests[i, idx_fix]
if(has_dispersion)
disp <<- ests[i, idx_dip]
F.[] <<- ests[i, idx_F ]
Q[] <<- dup_mat %*% ests[i, idx_Q ]

}

# run algorithm
max_half <- 25L
m <- NULL
v <- NULL
failed <- FALSE
for(i in 1:n_it + 1L){
# get gradient. First, run the particle filter
filter_out <- object$pf_filter(
cfix = cfix, disp = disp, F. = F., Q = Q, seed = NULL)
lls[i - 1L] <- c(logLik(filter_out))

# then get the gradient associated with each particle and the log
# normalized weight of the particles
grads <- tail(filter_out$pf_output, 1L)[[1L]]$stats
ws <- tail(filter_out$pf_output, 1L)[[1L]]$ws_normalized

# compute the gradient and take a small step
grad <- colSums(t(grads) * drop(exp(ws)))
grad_norm[i - 1L] <- norm(t(grad))

m <- if(is.null(m)) (1 - mp) * grad else mp * m + (1 - mp) * grad
v <- if(is.null(v)) (1 - vp) * grad^2 else vp * v + (1 - vp) * grad^2
mh <- m / (1 - mp^(i - 1))
vh <- v / (1 - vp^(i - 1))
de <- mh / sqrt(vh + 1e-8)

lr_i <- lr
k <- 0L
while(k < max_half){
ests[i, ] <- ests[i - 1L, ] + lr_i * de
set_parems(i)

# check that Q is positive definite, the dispersion parameter is
# positive, and the system is stationary
c1 <- all(abs(eigen(F.)$values) < 1)
c2 <- all(eigen(Q)$values > 0)
c3 <- !has_dispersion || disp > 0.
if(c1 && c2 && c3)
break

# decrease learning rate
lr_i <- lr_i * .5
k <- k + 1L
}

# check if we failed to find a value within our constraints
if(k == max_half){
warning("failed to find solution within constraints")
failed <- TRUE
break
}

# print information if requested
if(verbose){
cat(sprintf(
"\nIt %5d: log-likelihood (current, max) %12.2f, %12.2f\n",
i - 1L, logLik(filter_out), max(lls, na.rm = TRUE)),
rep("-", 66), "\n", sep = "")
cat("cfix\n")
print(cfix)
if(has_dispersion)
cat(sprintf("Dispersion: %20.8f\n", disp))
cat("F\n")
print(F.)
cat("Q\n")
print(Q)
cat(sprintf("Gradient norm: %10.4f\n", grad_norm[i - 1L]))
print(get_ess(filter_out))

}
}

list(estimates = ests, logLik = lls, F. = F., Q = Q, cfix = cfix,
disp = disp, failed = failed, grad_norm = grad_norm)
}
```

```{r laplace, cache = 1}
# setup mssmFunc object to use
ll_func <- mssm(
fixed = y ~ X1 + X2 + Z, family = poisson(), data = dat,
random = ~ Z, ti = time_idx, control = mssm_control(
n_threads = 5L, N_part = 200L, what = "gradient", use_antithetic = TRUE))

# use Laplace approximation to get starting values
system.time(
sta <- ll_func$Laplace(
F. = diag(.5, 2), Q = diag(1, 2), cfix = coef(glm_fit), disp = numeric()))
```

```{r show_laplace}
# the function returns an object with the estimated parameters and
# approximation log-likelihood
sta
sta$Q
```

```{r sgd, cache = 1, dependson = "laplace"}
# use stochastic gradient descent with averaging
set.seed(25164416)
system.time(
res <- sgd(
ll_func, F. = sta$F., Q = sta$Q, cfix = sta$cfix,
lrs = .001 * (1:50)^(-1/2), n_it = 50L, avg_start = 30L))

# use Adam algorithm instead
set.seed(25164416)
system.time(
resa <- adam(
ll_func, F. = sta$F., Q = sta$Q, cfix = sta$cfix,
lr = .01, n_it = 50L))
```

Plots of the approximate log-likelihoods at each iteration is shown below
along with the final estimates.

```{r show_use_sgd, fig.height = .fig_height_small}
print(tail(res$logLik), digits = 6) # final log-likelihood approximations
par(mar = c(5, 4, 1, 1))
plot(res$logLik, type = "l", ylab = "log-likelihood approximation")
plot(res$grad_norm, type = "l", ylab = "approximate gradient norm")

# final estimates
res$F.
res$Q
res$cfix

# compare with output from Adam algorithm
print(tail(resa$logLik), digits = 6) # final log-likelihood approximations
plot(resa$logLik, type = "l", ylab = "log-likelihood approximation")
plot(resa$grad_norm, type = "l", ylab = "approximate gradient norm")
resa$F.
resa$Q
resa$cfix
```

We may want to use more particles towards the end when we estimate the
parameters. To do, we use the approximation described in the next section
at the final estimates that we arrived at before.

```{r cont_est, cache = 1, dependson = "use_sgd"}
ll_func <- mssm(
fixed = y ~ X1 + X2 + Z, family = poisson(), data = dat,
random = ~ Z, ti = time_idx, control = mssm_control(
n_threads = 5L, N_part = 10000L, what = "gradient",
which_ll_cp = "KD", aprx_eps = .01, use_antithetic = TRUE))

set.seed(25164416)
system.time(
res_final <- adam(
ll_func, F. = resa$F., Q = resa$Q, cfix = resa$cfix,
lr = .001, n_it = 25L))
```

```{r show_cont_est}
plot(res_final$logLik, type = "l", ylab = "log-likelihood approximation")
plot(res_final$grad_norm, type = "l", ylab = "approximate gradient norm")
res_final$F.
res_final$Q
res_final$cfix
```

```{r clean_param_ests, echo = FALSE}
# do this to not use them later by an error
rm(res, resa)
```

### Faster Approximation
One drawback with the particle filter we use is that it has $\mathcal{O}(N^2)$
computational complexity where $N$ is the number of particles. We can see
this by changing the number of particles.

```{r comp_w_n_part, cache = 1}
local({
# assign function that returns a function that uses a given number of
# particles
func <- function(N){
ll_func <- mssm(
fixed = y ~ X1 + X2 + Z, family = poisson(), data = dat,
random = ~ Z, ti = time_idx, control = mssm_control(
n_threads = 5L, N_part = N, what = "log_density",
use_antithetic = TRUE))
function()
ll_func$pf_filter(
cfix = coef(glm_fit), disp = numeric(), F. = diag(1e-8, 2),
Q = diag(1e-4^2, 2))

}

f_100 <- func( 100L)
f_200 <- func( 200L)
f_400 <- func( 400L)
f_800 <- func( 800L)
f_1600 <- func(1600L)

# benchmark. Should ĩncrease at ~ N^2 rate
microbenchmark::microbenchmark(
`100` = f_100(), `200` = f_200(), `400` = f_400(), `800` = f_800(),
`1600` = f_1600(), times = 3L)
})
```

A solution is to use the dual k-d tree method I cover later. The computational
complexity is $\mathcal{O}(N \log N)$ for this method which is somewhat
indicated by the run times shown below.

```{r KD_comp_w_n_part, cache = 1}
local({
# assign function that returns a function that uses a given number of
# particles
func <- function(N){
ll_func <- mssm(
fixed = y ~ X1 + X2 + Z, family = poisson(), data = dat,
random = ~ Z, ti = time_idx, control = mssm_control(
n_threads = 5L, N_part = N, what = "log_density",
which_ll_cp = "KD", KD_N_max = 6L, aprx_eps = 1e-2,
use_antithetic = TRUE))
function()
ll_func$pf_filter(
cfix = coef(glm_fit), disp = numeric(), F. = diag(1e-8, 2),
Q = diag(1e-4^2, 2))

}

f_100 <- func( 100L)
f_200 <- func( 200L)
f_400 <- func( 400L)
f_800 <- func( 800L)
f_1600 <- func( 1600L)
f_12800 <- func(12800L) # <-- much larger

# benchmark. Should increase at ~ N log N rate
microbenchmark::microbenchmark(
`100` = f_100(), `200` = f_200(), `400` = f_400(), `800` = f_800(),
`1600` = f_1600(), `12800` = f_12800(), times = 3L)
})
```

The `aprx_eps` controls the size of the error. To be precise about what this
value does then we need to some notation for the complete likelihood
(the likelihood where we observe $\vec\beta_1,\dots,\vec\beta_T$s). This is

$$L = \mu_1(\vec \beta_1)g_1(\vec y_1 \mid \vec \beta_1)\prod_{t=2}^Tf(\vec\beta_t \mid\vec\beta_{t-1})g_t(y_t\mid\beta_t)$$

where $g_t$ is conditional distribution $\vec y_t$ given $\vec\beta_t$, $f$ is
the conditional distribution of $\vec\beta_t$ given $\vec\beta_{t-1}$, and
$\mu$ is the time-invariant distribution of $\vec\beta_t$.
Let $w_t^{(j)}$ be the weight of particle
$j$ at time $t$ and $\vec \beta_t^{(j)}$ be the $j$th particle at time $t$.
Then we ensure the error in our evaluation of terms
$w_{t-1}^{(j)}f(\vec\beta_t^{(i)} \mid \vec\beta_{t-1}^{(j)})$ never
exceeds

$$w_{t-1}^{(j)} \frac{u - l}{(u + l)/2}$$
where $u$ and $l$ are respectively an upper and lower bound of
$f(\vec\beta_t^{(i)} \mid \vec\beta_{t-1}^{(j)})$.
The question is how big the error is.
Thus, we consider the error in the log-likelihood approximation at the
true parameters.

```{r comp_all_vs_aprx, cache = 1, fig.height = .fig_height_small, message = FALSE}
ll_compare <- local({
N_use <- 500L
# we alter the seed in each run. First, the exact method
ll_no_approx <- sapply(1:200, function(seed){
ll_func <- mssm(
fixed = y ~ X1 + X2 + Z, family = poisson(), data = dat,
random = ~ Z, ti = time_idx, control = mssm_control(
n_threads = 5L, N_part = N_use, what = "log_density",
seed = seed, use_antithetic = TRUE))

logLik(ll_func$pf_filter(
cfix = cfix, disp = numeric(), F. = F., Q = Q))
})

# then the approximation
ll_approx <- sapply(1:200, function(seed){
ll_func <- mssm(
fixed = y ~ X1 + X2 + Z, family = poisson(), data = dat,
random = ~ Z, ti = time_idx, control = mssm_control(
n_threads = 5L, N_part = N_use, what = "log_density",
KD_N_max = 6L, aprx_eps = 1e-2, seed = seed,
which_ll_cp = "KD", use_antithetic = TRUE))

logLik(ll_func$pf_filter(
cfix = cfix, disp = numeric(), F. = F., Q = Q))
})

list(ll_no_approx = ll_no_approx, ll_approx = ll_approx)
})
```

```{r show_comp_arell_aprx, fig.height = .fig_height_small}
par(mar = c(5, 4, 1, 1))
hist(
ll_compare$ll_no_approx, main = "", breaks = 20L,
xlab = "Log-likelihood approximation -- no aprox")
hist(
ll_compare$ll_approx , main = "", breaks = 20L,
xlab = "Log-likelihood approximation -- aprox")
```

We can make a t-test for whether there is a difference between the two
log-likelihood estimates

```{r t_test_comp_ll}
with(ll_compare, t.test(ll_no_approx, ll_approx))
```

The fact that there may only be a small difference if any is nice because
now we can get a much better approximation (in terms of variance) quickly
of e.g., the log-likelihood as shown below.

```{r show_ll_quick, cache = 1, message = FALSE}
ll_approx <- sapply(1:10, function(seed){
N_use <- 10000L # many more particles

ll_func <- mssm(
fixed = y ~ X1 + X2 + Z, family = poisson(), data = dat,
random = ~ Z, ti = time_idx, control = mssm_control(
n_threads = 5L, N_part = N_use, what = "log_density",
KD_N_max = 100L, aprx_eps = 1e-2, seed = seed,
which_ll_cp = "KD", use_antithetic = TRUE))

logLik(ll_func$pf_filter(
cfix = cfix, disp = numeric(), F. = F., Q = Q))
})

# approximate log-likelihood
sd(ll_approx)
print(mean(ll_approx), digits = 6)

# compare sd with
sd(ll_compare$ll_no_approx)
print(mean(ll_compare$ll_no_approx), digits = 6)
```

### Approximate Observed Information Matrix

Next, we look at approximating the observed information matrix with the method
suggested by @Poyiadjis11.

```{r apprx_obs_info, cache = 1, dependson=c("use_sgd", "cont_est")}
ll_func <- mssm(
fixed = y ~ X1 + X2 + Z, family = poisson(), data = dat,
random = ~ Z, ti = time_idx, control = mssm_control(
n_threads = 5L, N_part = 10000L, what = "Hessian",
which_ll_cp = "KD", aprx_eps = .01, use_antithetic = TRUE))

set.seed(46658529)
system.time(
mssm_obs_info <- ll_func$pf_filter(
cfix = res_final$cfix, disp = numeric(), F. = res_final$F.,
Q = res_final$Q))
```

We define a function below to get the approximate gradient and approximate
observed information matrix from the returned object. Then we compare
the output to the GLM we estimated and to the true parameters.

```{r show_apprx_obs_info}
# Function to subtract the approximate gradient elements and approximate
# observed information matrix.
#
# Args:
# object: an object of class mssm.
#
# Returns:
# list with the approximate gradient elements and approximate observed
# information matrix.
get_grad_n_obs_info <- function(object){
stopifnot(inherits(object, "mssm"))

# get dimension of the components
n_rng <- nrow(object$Z)
dim_fix <- nrow(object$X)
dim_rng <- n_rng * n_rng + ((n_rng + 1L) * n_rng) / 2L
has_dispersion <-
any(sapply(c("^Gamma", "^gaussian"), grepl, x = object$family))

# get quantities for each particle
quants <- tail(object$pf_output, 1L)[[1L]]$stats
ws <- exp(tail(object$pf_output, 1L)[[1L]]$ws_normalized)

# get mean estimates
meas <- colSums(t(quants) * drop(ws))

# separate out the different components. Start with the gradient
idx <- dim_fix + dim_rng + has_dispersion
grad <- meas[1:idx]
dim_names <- names(grad)

# then the observed information matrix
I1 <- matrix(
meas[-(1:idx)], dim_fix + dim_rng + has_dispersion,
dimnames = list(dim_names, dim_names))

I2 <- matrix(0., nrow(I1), ncol(I1))
for(i in seq_along(ws))
I2 <- I2 + ws[i] * tcrossprod(quants[1:idx, i])

list(grad = grad, obs_info = tcrossprod(grad) - I1 - I2)
}

# use function
out <- get_grad_n_obs_info(mssm_obs_info)

# approximate gradient
out$grad

# approximate standard errors
(ses <- sqrt(diag(solve(out$obs_info))))

# look at output for parameters in the observational equation. First, compare
# with glm's standard errors
sqrt(diag(vcov(glm_fit)))

# and relative to true parameters vs. estimated
rbind(
true = cfix,
glm = coef(glm_fit),
mssm = res_final$cfix,
`standard error` = ses[1:4])

# next look at parameters in state equation. First four are for F.
rbind(
true = c(F.),
mssm = c(res_final$F.),
`standard error` = ses[5:8])

# next three are w.r.t. the lower diagonal part of Q
rbind(
true = Q[lower.tri(Q, diag = TRUE)],
mssm = res_final$Q[lower.tri(Q, diag = TRUE)],
`standard error` = ses[9:11])
```

## Supported Families
The following families are supported:

- The binomial distribution is supported with logit, probit, and cloglog link.
- The Poisson distribution is supported with square root and log link.
- The gamma distribution is supported with log link.
- The normal distribution with identity link (to compare with e.g., a Kalman
filter), log link, and the inverse link function.

## Fast Sum-Kernel Approximation

This package contains a simple implementation of the dual-tree method like the one
suggested by @Gray03 and shown in @Klaas06. The problem we want to solve is
the sum-kernel problem in @Klaas06. Particularly, we consider the situation
where we have $1,\dots,N_q$ query particles denoted by
$\{\vec Y_i\}_{i=1,\dots,N_q}$ and $1,\dots,N_s$ source particles denoted by
$\{\vec X_j\}_{j=1,\dots,N_s}$. For each query particle, we want to compute
the weights

$$W_i = \frac{\tilde W_i}{\sum_{k = 1}^{N_q} \tilde W_i},\qquad \tilde W_i = \sum_{j=1}^{N_s} \bar W_j K(\vec Y_i, \vec X_j)$$

where each source particle has an associated weight $\bar W_j$ and $K$ is a
kernel. Computing the above is $\mathcal{O}(N_sN_q)$ which is major
bottleneck if $N_s$ and $N_q$ is large. However, one can use a
[k-d tree](https://en.wikipedia.org/wiki/K-d_tree) for the query particles
and source particles and exploit that:

- $W_j K(\vec Y_i, \vec X_j)$ is almost zero for some pairs of nodes in the
two k-d trees.
- $K(\cdot, \vec X_j)$ is almost identical for some nodes in the k-d tree
for the source particles.

Thus, a substantial amount of computation can be skipped or approximated
with e.g., the centroid in the source node with only a minor
loss of precision. The dual-tree approximation method is
$\mathcal{O}(N_s\log N_s)$ and $\mathcal{O}(N_q\log N_q)$.
We start by defining a function to simulate the source
and query particles (we will let the two sets be identical for simplicity).
Further, we plot one draw of simulated points and illustrate the leafs in
the k-d tree.

```{r sim_func}
######
# define function to simulate data
mus <- matrix(c(-1, -1,
1, 1,
-1, 1), 3L, byrow = FALSE)
mus <- mus * .75
Sig <- diag(c(.5^2, .25^2))

get_sims <- function(n_per_grp){
# simulate X
sims <- lapply(1:nrow(mus), function(i){
mu <- mus[i, ]
X <- matrix(rnorm(n_per_grp * 2L), nrow = 2L)
X <- t(crossprod(chol(Sig), X) + mu)

data.frame(X, grp = i)
})
sims <- do.call(rbind, sims)
X <- t(as.matrix(sims[, c("X1", "X2")]))

# simulate weights
ws <- exp(rnorm(ncol(X)))
ws <- ws / sum(ws)

list(sims = sims, X = X, ws = ws)
}

#####
# show example
set.seed(42452654)
invisible(list2env(get_sims(5000L), environment()))

# plot points
par(mar = c(5, 4, 1, 1))
plot(as.matrix(sims[, c("X1", "X2")]), col = sims$grp + 1L)

# find k-d tree and add borders
out <- mssm:::test_KD_note(X, 50L)
out$indices <- out$indices + 1L
n_ele <- drop(out$n_elems)
idx <- mapply(`:`, cumsum(c(1L, head(n_ele, -1))), cumsum(n_ele))
stopifnot(all(sapply(idx, length) == n_ele))
idx <- lapply(idx, function(i) out$indices[i])
stopifnot(!anyDuplicated(unlist(idx)), length(unlist(idx)) == ncol(X))

grps <- lapply(idx, function(i) X[, i])

borders <- lapply(grps, function(x) apply(x, 1, range))
invisible(lapply(borders, function(b)
rect(b[1, "X1"], b[1, "X2"], b[2, "X1"], b[2, "X2"])))
```

Next, we compute the run-times for the previous examples and compare the
approximations of the un-normalized log weights, $\log \tilde W_i$, and
normalized weights, $W_i$. The `n_threads` sets the number of threads to
use in the methods.

```{r comp_run_times, cache = 1}
# run-times
microbenchmark::microbenchmark(
`dual tree 1` = mssm:::FSKA (X = X, ws = ws, Y = X, N_min = 10L,
eps = 5e-3, n_threads = 1L),
`dual tree 4` = mssm:::FSKA (X = X, ws = ws, Y = X, N_min = 10L,
eps = 5e-3, n_threads = 4L),
`naive 1` = mssm:::naive(X = X, ws = ws, Y = X, n_threads = 1L),
`naive 4` = mssm:::naive(X = X, ws = ws, Y = X, n_threads = 4L),
times = 10L)

# The functions return the un-normalized log weights. We first compare
# the result on this scale
o1 <- mssm:::FSKA (X = X, ws = ws, Y = X, N_min = 10L, eps = 5e-3,
n_threads = 1L)
o2 <- mssm:::naive(X = X, ws = ws, Y = X, n_threads = 4L)

all.equal(o1, o2)
par(mar = c(5, 4, 1, 1))
hist((o1 - o2)/ abs((o1 + o2) / 2), breaks = 50, main = "",
xlab = "Delta un-normalized log weights")

# then we compare the normalized weights
func <- function(x){
x_max <- max(x)
x <- exp(x - x_max)
x / sum(x)
}

o1 <- func(o1)
o2 <- func(o2)
all.equal(o1, o2)
hist((o1 - o2)/ abs((o1 + o2) / 2), breaks = 50, main = "",
xlab = "Delta normalized log weights")
```

Finally, we compare the run-times as function of $N = N_s = N_q$. The dashed
line is "naive" method, the continuous line is the dual-tree method, and the
dotted line is dual-tree method using 1 thread.

```{r run_times_N, cache = 1}
Ns <- 2^(7:14)
run_times <- lapply(Ns, function(N){
invisible(list2env(get_sims(N), environment()))
microbenchmark::microbenchmark(
`dual-tree` = mssm:::FSKA (X = X, ws = ws, Y = X, N_min = 10L, eps = 5e-3,
n_threads = 4L),
naive = mssm:::naive(X = X, ws = ws, Y = X, n_threads = 4L),
`dual-tree 1` = mssm:::FSKA (X = X, ws = ws, Y = X, N_min = 10L, eps = 5e-3,
n_threads = 1L),
times = 5L)
})

Ns_xtra <- 2^(15:19)
run_times_xtra <- lapply(Ns_xtra, function(N){
invisible(list2env(get_sims(N), environment()))
microbenchmark::microbenchmark(
`dual-tree` = mssm:::FSKA (X = X, ws = ws, Y = X, N_min = 10L, eps = 5e-3,
n_threads = 4L),
times = 5L)
})
```

```{r plot_run_times_N}
library(microbenchmark)
meds <- t(sapply(run_times, function(x) summary(x, unit = "s")[, "median"]))
meds_xtra <-
sapply(run_times_xtra, function(x) summary(x, unit = "s")[, "median"])
meds <- rbind(meds, cbind(meds_xtra, NA_real_, NA_real_))
dimnames(meds) <- list(
N = c(Ns, Ns_xtra) * 3L, method = c("Dual-tree", "Naive", "Dual-tree 1"))
meds
par(mar = c(5, 4, 1, 1))
matplot(c(Ns, Ns_xtra) * 3L, meds, lty = 1:3, type = "l", log = "xy",
ylab = "seconds", xlab = "N", col = "black")
```

## Function Definitions

```{r mc_grad_est, eval = FALSE}
```

## References