https://github.com/boennecd/mssm

R package for multivariate state space models
https://github.com/boennecd/mssm
Last synced: 4 months ago
JSON representation
R package for multivariate state space models
Host: GitHub
URL: https://github.com/boennecd/mssm
Owner: boennecd
Created: 2019-04-17T01:05:07.000Z (about 7 years ago)
Default Branch: master
Last Pushed: 2022-01-31T12:33:06.000Z (over 4 years ago)
Last Synced: 2025-10-22T04:04:42.841Z (8 months ago)
Language: C++
Homepage:
Size: 19.2 MB
Stars: 3
Watchers: 1
Forks: 1
Open Issues: 0
Metadata Files:
- Readme: README.Rmd
Awesome Lists containing this project

README

          ---

output:

  github_document:

    pandoc_args: --webtex=https://latex.codecogs.com/svg.latex?

bibliography: README.bib

nocite: |

  @Polyak92

---

# Multivariate State Space Models

[![R-CMD-check](https://github.com/boennecd/mssm/workflows/R-CMD-check/badge.svg)](https://github.com/boennecd/mssm/actions)

[![](https://www.r-pkg.org/badges/version/mssm)](https://www.r-pkg.org/badges/version/mssm)

[![CRAN RStudio mirror downloads](https://cranlogs.r-pkg.org/badges/mssm)](https://cran.r-project.org/package=mssm)

```{r setup, include=FALSE}

knitr::opts_chunk$set(

  error = FALSE, cache = "./README-cache/", fig.path = "man/figures/README-", 

  echo = TRUE)

options(digits = 4, scipen = 7)

.fig_height_small <- 4

source("render_toc.R")

palette(c("#000000", "#009E73", "#e79f00", "#9ad0f3", "#0072B2", "#D55E00", 

          "#CC79A7", "#F0E442"))

```

This package provides methods to estimate models of the form 

$$y_{it} \sim g(\eta_{it}),\qquad i\in I_t$$ 

$$\eta_{it} = \vec\gamma^\top\vec x_{it} +\vec\beta_t^\top\vec z_{it}$$

$$\vec\beta_t = F\vec\beta_{t-1}+\vec\epsilon_t, \qquad \vec\epsilon_t\sim N(\vec 0, Q)$$

where $g$ is simple distribution, we observe $t=1,\dots,T$ periods, and $I_t$, 

$y_{it}$, $\vec x_{it}$, and 

$\vec z_{it}$ are known. What is multivariate is 

$\vec y_t = \{y_{it}\}_{i\in I_t}$ (though, $\vec \beta_t$ can also be 

multivariate) and this package is written to scale well 

in the cardinality of $I_t$. The package uses independent 

particle filters as suggested by @Lin05. This particular type of filter 

can be used in the method suggested by @Poyiadjis11. I will show an example

of how to use the package through the rest of the document and highlight some 

implementation details. 

The package can be installed from Github e.g., by 

calling

```{r github_dl, eval = FALSE}

devtools::install_github("boennecd/mssm")

```

or from CRAN by calling 

```{r cran_inst, eval = FALSE}

install.packages("mssm")

```

## Table of Contents

```{r echo = FALSE}

render_toc("README.Rmd", toc_header_name = "Table of Contents", toc_depth = 3L, base_level = 2L)

```

## Poisson Example

We simulate data as follows.

```{r simulate, fig.height = .fig_height_small}

# simulate path of state variables 

set.seed(78727269)

n_periods <- 312L

(F. <- matrix(c(.5, .1, 0, .8), 2L))

(Q <- matrix(c(.5^2, .1, .1, .7^2), 2L))

(Q_0 <- matrix(c(0.333, 0.194, 0.194, 1.46), 2L))

betas <- cbind(crossprod(chol(Q_0),        rnorm(2L)                      ), 

               crossprod(chol(Q  ), matrix(rnorm((n_periods - 1L) * 2), 2)))

betas <- t(betas)

for(i in 2:nrow(betas))

  betas[i, ] <- betas[i, ] + F. %*% betas[i - 1L, ]

par(mar = c(5, 4, 1, 1))

matplot(betas, lty = 1, type = "l")

# simulate observations

cfix <- c(-1, .2, .5, -1) # gamma

n_obs <- 100L

dat <- lapply(1:n_obs, function(id){

  x <- runif(n_periods, -1, 1)

  X <- cbind(X1 = x, X2 = runif(1, -1, 1))

  z <- runif(n_periods, -1, 1)

  

  eta <- drop(cbind(1, X, z) %*% cfix + rowSums(cbind(1, z) * betas))

  y <- rpois(n_periods, lambda = exp(eta))

  

  # randomly drop some

  keep <- .2 > runif(n_periods)

  

  data.frame(y = y, X, Z = z, id = id, time_idx = 1:n_periods)[keep, ]

})

dat <- do.call(rbind, dat)

# show some properties 

nrow(dat)

head(dat)

table(dat$y)

# quick smooth of number of events vs. time

par(mar = c(5, 4, 1, 1))

plot(smooth.spline(dat$time_idx, dat$y), type = "l", xlab = "Time", 

     ylab = "Number of events")

# and split by those with `Z` above and below 0

with(dat, {

  z_large <- ifelse(Z > 0, "large", "small")

  smooths <- lapply(split(cbind(dat, z_large), z_large), function(x){

    plot(smooth.spline(x$time_idx, x$y), type = "l", xlab = "Time", 

     ylab = paste("Number of events -", unique(x$z_large)))

  })

})

```

In the above, we simulate `r n_periods` (`n_periods`) with `r n_obs` (`n_obs`)

individuals. Each individual has a fixed covariate, `X2`, and two time-varying

covariates, `X1` and `Z`. One of the time-varying covariates, `Z`, has a 

random slope. Further, the intercept is also random. 

### Log-Likelihood Approximations

We start by estimating a generalized linear model without random effects.

```{r fit_glm}

glm_fit <- glm(y ~ X1 + X2 + Z, poisson(), dat)

summary(glm_fit)

logLik(glm_fit)

```

Next, we make a log-likelihood approximation with the implemented particle at 

the true parameters with the `mssm` function.

```{r fit_mssm}

library(mssm)

ll_func <- mssm(

  fixed = y ~ X1 + X2 + Z, family = poisson(), data = dat, 

  # make it explict that there is an intercept (not needed)

  random = ~ 1 + Z, ti = time_idx, control = mssm_control(

    n_threads = 5L, N_part = 500L, what = "log_density"))

system.time(

  mssm_obj <- ll_func$pf_filter(

    cfix = cfix, disp = numeric(), F. = F., Q = Q))

# returns the log-likelihood approximation

logLik(mssm_obj)

# also shown by print

mssm_obj

```

We get a much larger log-likelihood as expected. We can plot the predicted 

values of state variables from the filter distribution.

```{r plot_filter, fig.height = .fig_height_small, fig.width = 12L}

# get predicted mean and prediction interval 

filter_means <- plot(mssm_obj, do_plot = FALSE)

# plot with which also contains the true paths

for(i in 1:ncol(betas)){

  be <- betas[, i]

  me <- filter_means$means[i, ]

  lb <- filter_means$lbs[i, ]

  ub <- filter_means$ubs[i, ]

  

  #     dashed: true paths

  # continuous: predicted mean from filter distribution 

  #     dotted: prediction interval

  par(mar = c(5, 4, 1, 1))

  matplot(cbind(be, me, lb, ub), lty = c(2, 1, 3, 3), type = "l", 

          col = "black", ylab = rownames(filter_means$lbs)[i])

}

```

We can also get predicted values from the smoothing distribution. 

```{r show_smooths, fig.height = .fig_height_small, fig.width = 12L}

# get smoothing weights

mssm_obj <- ll_func$smoother(mssm_obj)

# get predicted mean and prediction interval from smoothing distribution

smooth_means <- plot(mssm_obj, do_plot = FALSE, which_weights = "smooth")

for(i in 1:ncol(betas)){

  be  <- betas[, i]

  me  <- filter_means$means[i, ]

  lb  <- filter_means$lbs[i, ]

  ub  <- filter_means$ubs[i, ]

  mes <- smooth_means$means[i, ]

  lbs <- smooth_means$lbs[i, ]

  ubs <- smooth_means$ubs[i, ]

  

  #     dashed: true paths

  # continuous: predicted mean from filter distribution 

  #     dotted: prediction interval

  # 

  # smooth predictions are in a different color

  par(mar = c(5, 4, 1, 1))

  matplot(cbind(be, me, lb, ub, mes, lbs, ubs), 

          lty = c(2, 1, 3, 3, 1, 3, 3), type = "l", 

          col = c(rep(1, 4), rep(2, 3)), ylab = rownames(filter_means$lbs)[i])

}

# compare mean square error of the two means

rbind(filter = colMeans((t(filter_means$means) - betas)^2), 

      smooth = colMeans((t(smooth_means$means) - betas)^2))

```

We can get the effective sample size at each point in time with the `get_ess`

function.

```{r show_ess, fig.height = .fig_height_small}

(ess <- get_ess(mssm_obj))

plot(ess)

```

We can compare this what we get by using a so-called bootstrap (like) filter

instead.

```{r comp_boot, fig.height = .fig_height_small}

local({

  ll_boot <- mssm(

    fixed = y ~ X1 + X2 + Z, family = poisson(), data = dat, 

    random = ~ Z, ti = time_idx, control = mssm_control(

      n_threads = 5L, N_part = 500L, what = "log_density", 

      which_sampler = "bootstrap"))

  

  print(system.time(

    boot_fit <- ll_boot$pf_filter(

      cfix = coef(glm_fit), disp = numeric(), F. = F., Q = Q)))

  

  plot(get_ess(boot_fit))

})

```

The above is not much faster (and maybe slower in this run) as the bulk of 

the computation is not in the sampling step. We can also compare the 

log-likelihood approximation with what we get if we choose parameters close 

to the GLM estimates.

```{r comp_close_glm}

mssm_glm <- ll_func$pf_filter(

  cfix = coef(glm_fit), disp = numeric(), F. = diag(1e-8, 2), 

  Q = diag(1e-4^2, 2))

logLik(mssm_glm)

```

### Antithetic Variables

One way to reduce the variance of the Monte Carlo estimate is to use 

[antithetic variables](https://en.wikipedia.org/wiki/Antithetic_variates). 

Two types of antithetic variables are implemented as in @Durbin97. 

That is, one balanced for location and two balanced for scale. This is currently 

only implemented with a t-distribution as the proposal distribution. 

We start by giving some details on the locations balanced variable. Suppose 

we use a t-distribution with $\nu$ degrees of freedom, a $d$ dimensional mean 

of $\mu$ and a scale matrix $\Sigma$. We can then generate a sample by setting

$$\begin{aligned}  \vec x &= \vec\mu + C \frac{\vec z}{\sqrt{a / \nu}} & \Sigma &= CC^\top \\ \vec z &\sim N(\vec 0, I) & a &\sim \chi^2_\nu \end{aligned}$$

Then the location balanced variable is

$$\widehat{\vec x} = \vec\mu - C \frac{\vec z}{\sqrt{a / \nu}}$$

For the scaled balanced variables we use that 

$$u = \frac{\vec z^\top\vec z/ d}{a / \nu} \sim F(d, \nu)$$

We then define the cumulative distribution function

$$k = P(U \leq u) = Q(u)$$

and set 

$$u' = Q^{-1}(1 - k)$$

Then the two scaled balanced variables are

$$\begin{aligned} \widetilde{\vec x}_1 &= \vec\mu + \sqrt{u'/u} \cdot C \frac{\vec z}{\sqrt{a / \nu}} \\ \widetilde{\vec x}_2 &= \vec\mu - \sqrt{u'/u} \cdot C \frac{\vec z}{\sqrt{a / \nu}} \end{aligned}$$

We will illustrate the reduction in variance of the log-likelihood estimate. 

To do so, we run the particle filter with and without antithetic variables

multiple times below

to get an estimate of the error of the approximation.

```{r anti_ex, cache = 1}

set.seed(12769550)

compare_anti <- local({

  ll_func_no_anti <- mssm(

    fixed = y ~ X1 + X2 + Z, family = poisson(), data = dat, 

    random = ~ Z, ti = time_idx, control = mssm_control(

      n_threads = 5L, N_part = 500L, what = "log_density")) 

  

  ll_func_anti <- mssm(

    fixed = y ~ X1 + X2 + Z, family = poisson(), data = dat,

    random = ~ Z, ti = time_idx, control = mssm_control(

      n_threads = 5L, N_part = 500L, what = "log_density", 

      use_antithetic = TRUE)) 

  

  no_anti <- replicate(

    100, {

      ti <- system.time(x <- logLik(ll_func_no_anti$pf_filter(

        cfix = cfix, disp = numeric(), F. = F., Q = Q, seed = NULL)))

      list(ti = ti, x = x)

    }, simplify = FALSE)

  w_anti  <- replicate(

    100, {

      ti <- system.time(x <- logLik(ll_func_anti$pf_filter(

        cfix = cfix, disp = numeric(), F. = F., Q = Q, seed = NULL)))

      list(ti = ti, x = x)

    }, simplify = FALSE)

  

  list(no_anti = no_anti, w_anti = w_anti)

})

```

The mean estimate of the log-likelihood and standard error of the estimate is

shown below with and without antithetic variables.

```{r show_anti_ex}

sapply(compare_anti, function(x){ 

  x <- sapply(x, "[[", "x")

  c(mean = mean(x), se = sd(x) / sqrt(length(x)))

})

```

Using antithetic variables is slower. Below we show summary statistics for the 

elapsed time without using antithetic variables and with antithetic variables.

```{r run_anti_ex}

sapply(compare_anti, function(x){ 

  x <- sapply(x, "[[", "ti")

  z <- x[c("elapsed"), ]

  c(mean = mean(z), quantile(z, c(.5, .25, .75, 0, 1)))

})

```

### Parameter Estimation

We will need to estimate the parameters for real applications. We could do 

this e.g., with a Monte Carlo expectation-maximization algorithm or by using 

a Monte Carlo approximation of the gradient. Currently, the latter is only 

available and the user will have to write a custom function to perform 

the estimation. 

I will provide an example below. The `sgd` function is not a part of the 

package. Instead the package provides a way to approximate the gradient and 

allows the user to perform subsequent maximization (e.g., with constraints or 

penalties). The definition of the 

`sgd` is given at the end of this file as it is somewhat long. We start 

by using a Laplace approximation to get the starting values.

```{r mc_grad_est, echo = FALSE}

# Stochastic gradient descent for mssm object. The function assumes that the 

# state vector is stationary. The default values are rather arbitrary.

# 

# Args:

#   object: an object of class mssmFunc. 

#   n_it: number of iterations. 

#   lrs: learning rates to use. Must have n_it elements. Like problem specific. 

#   avg_start: index to start averaging. See Polyak et al. (1992) for arguments

#              for averaging.

#   cfix: starting values for fixed coefficients. 

#   F.: starting value for transition matrix in conditional distribution of the 

#       current state given the previous state.

#   Q: starting value for covariance matrix in conditional distribution of the 

#      current state given the previous. 

#   verbose: TRUE if output should be printed during estmation. 

#   disp: starting value for dispersion parameter.

# 

# Returns:

#   List with estimates and the log-likelihood approximation at each iteration. 

sgd <- function(

  object, n_it = 150L, 

  lrs = 1e-2 * (1:n_it)^(-1/2), avg_start = max(1L, as.integer(n_it * 4L / 5L)),

  cfix, F., Q,  verbose = FALSE, disp = numeric())

{

  # make checks

  stopifnot(

    inherits(object, "mssmFunc"), object$control$what == "gradient", n_it > 0L, 

    all(lrs > 0.), avg_start > 1L, length(lrs) == n_it)

  n_fix <- nrow(object$X)

  n_rng <- nrow(object$Z)

  

  # objects for estimates at each iteration, gradient norms, and

  # log-likelihood approximations

  has_dispersion <-

    any(sapply(c("^Gamma", "^gaussian"), grepl, x = object$family))

  n_params <-

    has_dispersion + n_fix + n_rng * n_rng + n_rng * (n_rng  + 1L) / 2L

  ests <- matrix(NA_real_, n_it + 1L, n_params)

  ests[1L, ] <- c(

    cfix,  if(has_dispersion) disp else NULL, F.,

    Q[lower.tri(Q, diag = TRUE)])

  grad_norm <- lls <- rep(NA_real_, n_it)

  

  # indices of the different components

  idx_fix   <- 1:n_fix

  if(has_dispersion)

    idx_dip <- n_fix + 1L

  idx_F     <- 1:(n_rng * n_rng) + n_fix + has_dispersion

  idx_Q     <- 1:(n_rng * (n_rng  + 1L) / 2L) + n_fix + n_rng * n_rng +

    has_dispersion

  

  # function to set the parameters

  library(matrixcalc) # TODO: get rid of this

  dup_mat <- duplication.matrix(ncol(Q))

  set_parems <- function(i){

    # select whether or not to average

    idx <- if(i > avg_start) avg_start:i else i

    

    # set new parameters

    cfix   <<-             colMeans(ests[idx, idx_fix, drop = FALSE])

    if(has_dispersion)

      disp <<-             colMeans(ests[idx, idx_dip, drop = FALSE])

    F.[]   <<-             colMeans(ests[idx, idx_F  , drop = FALSE])

    Q[]    <<- dup_mat %*% colMeans(ests[idx, idx_Q  , drop = FALSE])

    

  }

    

  # run algorithm

  max_half <- 25L

  for(i in 1:n_it + 1L){

    # get gradient. First, run the particle filter

    filter_out <- object$pf_filter(

      cfix = cfix, disp = disp, F. = F., Q = Q, seed = NULL)

    lls[i - 1L] <- c(logLik(filter_out))

    

    # then get the gradient associated with each particle and the log 

    # normalized weight of the particles

    grads <- tail(filter_out$pf_output, 1L)[[1L]]$stats

    ws    <- tail(filter_out$pf_output, 1L)[[1L]]$ws_normalized

    

    # compute the gradient and take a small step

    grad <- colSums(t(grads) * drop(exp(ws)))

    grad_norm[i - 1L] <- norm(t(grad))

    

    lr_i <- lrs[i - 1L]

    k <- 0L

    while(k < max_half){

      ests[i, ] <- ests[i - 1L, ] + lr_i * grad 

      set_parems(i)

      

      # check that Q is positive definite and the system is stationary

      c1 <- all(abs(eigen(F.)$values) < 1)

      c2 <- all(eigen(Q)$values > 0)

      c3 <- !has_dispersion || disp > 0.

      if(c1 && c2 && c3)

        break

      

      # decrease learning rate

      lr_i <- lr_i * .5

      k <- k + 1L

    }

    

    # check if we failed to find a value within our constraints

    if(k == max_half)

      stop("failed to find solution within constraints")

    

    # print information if requested 

    if(verbose){

      cat(sprintf(

        "\nIt %5d: log-likelihood (current, max) %12.2f, %12.2f\n",

        i - 1L, logLik(filter_out), max(lls, na.rm = TRUE)),

        rep("-", 66), "\n", sep = "")

      cat("cfix\n")

      print(cfix)

      if(has_dispersion)

        cat(sprintf("Dispersion: %20.8f\n", disp))

      cat("F\n")

      print(F.)

      cat("Q\n")

      print(Q)

      cat(sprintf("Gradient norm: %10.4f\n", grad_norm[i - 1L]))

      print(get_ess(filter_out))

    }

  } 

  

  list(estimates = ests, logLik = lls, F. = F., Q = Q, cfix = cfix, 

       disp = disp, grad_norm = grad_norm)

}

# Stochastic gradient descent for mssm object using the Adam algorithm. The  

# function assumes that the state vector is stationary.

# 

# Args:

#   object: an object of class mssmFunc. 

#   n_it: number of iterations. 

#   mp: decay rate for first moment.

#   vp: decay rate for secod moment.

#   lr: learning rate.

#   cfix: starting values for fixed coefficients. 

#   F.: starting value for transition matrix in conditional distribution of the 

#       current state given the previous state.

#   Q: starting value for covariance matrix in conditional distribution of the 

#      current state given the previous. 

#   verbose: TRUE if output should be printed during estmation. 

#   disp: starting value for dispersion parameter.

# 

# Returns:

#   List with estimates and the log-likelihood approximation at each iteration. 

adam <- function(

  object, n_it = 150L, mp = .9, vp = .999, lr = .01, cfix, F., Q,

  verbose = FALSE, disp = numeric())

{

  # make checks

  stopifnot(

    inherits(object, "mssmFunc"), object$control$what == "gradient", n_it > 0L,

    lr > 0., mp > 0, mp < 1, vp > 0, vp < 1)

  n_fix <- nrow(object$X)

  n_rng <- nrow(object$Z)

  # objects for estimates at each iteration, gradient norms, and

  # log-likelihood approximations

  has_dispersion <-

    any(sapply(c("^Gamma", "^gaussian"), grepl, x = object$family))

  n_params <-

    has_dispersion + n_fix + n_rng * n_rng + n_rng * (n_rng  + 1L) / 2L

  ests <- matrix(NA_real_, n_it + 1L, n_params)

  ests[1L, ] <- c(

    cfix,  if(has_dispersion) disp else NULL, F.,

    Q[lower.tri(Q, diag = TRUE)])

  grad_norm <- lls <- rep(NA_real_, n_it)

  # indices of the different components

  idx_fix   <- 1:n_fix

  if(has_dispersion)

    idx_dip <- n_fix + 1L

  idx_F     <- 1:(n_rng * n_rng) + n_fix + has_dispersion

  idx_Q     <- 1:(n_rng * (n_rng  + 1L) / 2L) + n_fix + n_rng * n_rng +

    has_dispersion

  # function to set the parameters

  library(matrixcalc) # TODO: get rid of this

  dup_mat <- duplication.matrix(ncol(Q))

  set_parems <- function(i){

    cfix   <<-             ests[i, idx_fix]

    if(has_dispersion)

      disp <<-             ests[i, idx_dip]

    F.[]   <<-             ests[i, idx_F  ]

    Q[]    <<- dup_mat %*% ests[i, idx_Q  ]

  }

  # run algorithm

  max_half <- 25L

  m <- NULL

  v <- NULL

  failed <- FALSE

  for(i in 1:n_it + 1L){

    # get gradient. First, run the particle filter

    filter_out <- object$pf_filter(

      cfix = cfix, disp = disp, F. = F., Q = Q, seed = NULL)

    lls[i - 1L] <- c(logLik(filter_out))

    # then get the gradient associated with each particle and the log

    # normalized weight of the particles

    grads <- tail(filter_out$pf_output, 1L)[[1L]]$stats

    ws    <- tail(filter_out$pf_output, 1L)[[1L]]$ws_normalized

    # compute the gradient and take a small step

    grad <- colSums(t(grads) * drop(exp(ws)))

    grad_norm[i - 1L] <- norm(t(grad))

    m <- if(is.null(m)) (1 - mp) * grad   else mp * m + (1 - mp) * grad

    v <- if(is.null(v)) (1 - vp) * grad^2 else vp * v + (1 - vp) * grad^2

    mh <- m / (1 - mp^(i - 1))

    vh <- v / (1 - vp^(i - 1))

    de <- mh / sqrt(vh + 1e-8)

    lr_i <- lr

    k <- 0L

    while(k < max_half){

      ests[i, ] <- ests[i - 1L, ] + lr_i * de

      set_parems(i)

      # check that Q is positive definite, the dispersion parameter is

      # positive, and the system is stationary

      c1 <- all(abs(eigen(F.)$values) < 1)

      c2 <- all(eigen(Q)$values > 0)

      c3 <- !has_dispersion || disp > 0.

      if(c1 && c2 && c3)

        break

      # decrease learning rate

      lr_i <- lr_i * .5

      k <- k + 1L

    }

    # check if we failed to find a value within our constraints

    if(k == max_half){

      warning("failed to find solution within constraints")

      failed <- TRUE

      break

    }

    # print information if requested

    if(verbose){

      cat(sprintf(

        "\nIt %5d: log-likelihood (current, max) %12.2f, %12.2f\n",

        i - 1L, logLik(filter_out), max(lls, na.rm = TRUE)),

        rep("-", 66), "\n", sep = "")

      cat("cfix\n")

      print(cfix)

      if(has_dispersion)

        cat(sprintf("Dispersion: %20.8f\n", disp))

      cat("F\n")

      print(F.)

      cat("Q\n")

      print(Q)

      cat(sprintf("Gradient norm: %10.4f\n", grad_norm[i - 1L]))

      print(get_ess(filter_out))

    }

  }

  list(estimates = ests, logLik = lls, F. = F., Q = Q, cfix = cfix, 

       disp = disp, failed = failed, grad_norm = grad_norm)

}

```

```{r laplace, cache = 1}

# setup mssmFunc object to use 

ll_func <- mssm(  

  fixed = y ~ X1 + X2 + Z, family = poisson(), data = dat, 

  random = ~ Z, ti = time_idx, control = mssm_control(

    n_threads = 5L, N_part = 200L, what = "gradient", use_antithetic = TRUE))

# use Laplace approximation to get starting values

system.time(

  sta <- ll_func$Laplace(

    F. = diag(.5, 2), Q = diag(1, 2), cfix = coef(glm_fit), disp = numeric()))

```

```{r show_laplace}

# the function returns an object with the estimated parameters and  

# approximation log-likelihood

sta

sta$Q

```

```{r sgd, cache = 1, dependson = "laplace"}

# use stochastic gradient descent with averaging

set.seed(25164416)

system.time( 

  res <- sgd(

    ll_func, F. = sta$F., Q = sta$Q, cfix = sta$cfix, 

    lrs = .001 * (1:50)^(-1/2), n_it = 50L, avg_start = 30L))

# use Adam algorithm instead

set.seed(25164416)

system.time( 

  resa <- adam(

    ll_func, F. = sta$F., Q = sta$Q, cfix = sta$cfix, 

    lr = .01, n_it = 50L))

```

Plots of the approximate log-likelihoods at each iteration is shown below 

along with the final estimates. 

```{r show_use_sgd, fig.height = .fig_height_small}

print(tail(res$logLik), digits = 6) # final log-likelihood approximations

par(mar = c(5, 4, 1, 1))

plot(res$logLik, type = "l", ylab = "log-likelihood approximation")

plot(res$grad_norm, type = "l", ylab = "approximate gradient norm")

# final estimates

res$F. 

res$Q

res$cfix

# compare with output from Adam algorithm

print(tail(resa$logLik), digits = 6) # final log-likelihood approximations

plot(resa$logLik, type = "l", ylab = "log-likelihood approximation")

plot(resa$grad_norm, type = "l", ylab = "approximate gradient norm")

resa$F. 

resa$Q

resa$cfix

```

We may want to use more particles towards the end when we estimate the 

parameters. To do, we use the approximation described in the next section 

at the final estimates that we arrived at before. 

```{r cont_est, cache = 1, dependson = "use_sgd"}

ll_func <- mssm(

  fixed = y ~ X1 + X2 + Z, family = poisson(), data = dat, 

  random = ~ Z, ti = time_idx, control = mssm_control(

    n_threads = 5L, N_part = 10000L, what = "gradient",

    which_ll_cp = "KD", aprx_eps = .01, use_antithetic = TRUE))

set.seed(25164416)

system.time( 

  res_final <- adam(

    ll_func, F. = resa$F., Q = resa$Q, cfix = resa$cfix, 

    lr = .001, n_it = 25L))

```

```{r show_cont_est}

plot(res_final$logLik, type = "l", ylab = "log-likelihood approximation")

plot(res_final$grad_norm, type = "l", ylab = "approximate gradient norm")

res_final$F. 

res_final$Q

res_final$cfix

```

```{r clean_param_ests, echo = FALSE}

# do this to not use them later by an error

rm(res, resa)

```

### Faster Approximation

One drawback with the particle filter we use is that it has $\mathcal{O}(N^2)$ 

computational complexity where $N$ is the number of particles. We can see 

this by changing the number of particles. 

```{r comp_w_n_part, cache = 1}

local({

  # assign function that returns a function that uses a given number of 

  # particles

  func <- function(N){

    ll_func <- mssm(

      fixed = y ~ X1 + X2 + Z, family = poisson(), data = dat, 

      random = ~ Z, ti = time_idx, control = mssm_control(

        n_threads = 5L, N_part = N, what = "log_density", 

        use_antithetic = TRUE))

    function()

      ll_func$pf_filter(

        cfix = coef(glm_fit), disp = numeric(), F. = diag(1e-8, 2), 

        Q = diag(1e-4^2, 2))

      

  }

  

  f_100  <- func( 100L)

  f_200  <- func( 200L)

  f_400  <- func( 400L)

  f_800  <- func( 800L)

  f_1600 <- func(1600L)

  

  # benchmark. Should ĩncrease at ~ N^2 rate

  microbenchmark::microbenchmark(

    `100` = f_100(), `200` = f_200(), `400` = f_400(), `800` = f_800(),

    `1600` = f_1600(), times = 3L)

})

```

A solution is to use the dual k-d tree method I cover later. The computational

complexity is $\mathcal{O}(N \log N)$ for this method which is somewhat 

indicated by the run times shown below.

```{r KD_comp_w_n_part, cache = 1}

local({

  # assign function that returns a function that uses a given number of 

  # particles

  func <- function(N){

    ll_func <- mssm(

      fixed = y ~ X1 + X2 + Z, family = poisson(), data = dat, 

      random = ~ Z, ti = time_idx, control = mssm_control(

        n_threads = 5L, N_part = N, what = "log_density", 

        which_ll_cp = "KD", KD_N_max = 6L, aprx_eps = 1e-2, 

        use_antithetic = TRUE))

    function()

      ll_func$pf_filter(

        cfix = coef(glm_fit), disp = numeric(), F. = diag(1e-8, 2), 

        Q = diag(1e-4^2, 2))

      

  }

  

  f_100   <- func(  100L)

  f_200   <- func(  200L)

  f_400   <- func(  400L)

  f_800   <- func(  800L)

  f_1600  <- func( 1600L)

  f_12800 <- func(12800L) # <-- much larger

  

  # benchmark. Should increase at ~ N log N rate

  microbenchmark::microbenchmark(

    `100` = f_100(), `200` = f_200(), `400` = f_400(), `800` = f_800(), 

    `1600` = f_1600(), `12800` = f_12800(), times = 3L)

})

```

The `aprx_eps` controls the size of the error. To be precise about what this

value does then we need to some notation for the complete likelihood 

(the likelihood where we observe $\vec\beta_1,\dots,\vec\beta_T$s). This is

$$L = \mu_1(\vec \beta_1)g_1(\vec y_1 \mid \vec \beta_1)\prod_{t=2}^Tf(\vec\beta_t \mid\vec\beta_{t-1})g_t(y_t\mid\beta_t)$$

where $g_t$ is conditional distribution $\vec y_t$ given $\vec\beta_t$, $f$ is 

the conditional distribution of $\vec\beta_t$ given $\vec\beta_{t-1}$, and

$\mu$ is the time-invariant distribution of $\vec\beta_t$. 

Let $w_t^{(j)}$ be the weight of particle 

$j$ at time $t$ and $\vec \beta_t^{(j)}$ be the $j$th particle at time $t$. 

Then we ensure the error in our evaluation of terms 

$w_{t-1}^{(j)}f(\vec\beta_t^{(i)} \mid \vec\beta_{t-1}^{(j)})$ never 

exceeds

$$w_{t-1}^{(j)} \frac{u - l}{(u + l)/2}$$

where $u$ and $l$ are respectively an upper and lower bound of 

$f(\vec\beta_t^{(i)} \mid \vec\beta_{t-1}^{(j)})$. 

The question is how big the error is. 

Thus, we consider the error in the log-likelihood approximation at the 

true parameters. 

```{r comp_all_vs_aprx, cache = 1, fig.height = .fig_height_small, message = FALSE}

ll_compare <- local({

  N_use <- 500L 

  # we alter the seed in each run. First, the exact method

  ll_no_approx <- sapply(1:200, function(seed){

    ll_func <- mssm(

      fixed = y ~ X1 + X2 + Z, family = poisson(), data = dat,

      random = ~ Z, ti = time_idx, control = mssm_control(

        n_threads = 5L, N_part = N_use, what = "log_density", 

        seed = seed, use_antithetic = TRUE))

    

    logLik(ll_func$pf_filter(

      cfix = cfix, disp = numeric(), F. = F., Q = Q))

  })

  

  # then the approximation

  ll_approx <- sapply(1:200, function(seed){

    ll_func <- mssm(

      fixed = y ~ X1 + X2 + Z, family = poisson(), data = dat,

      random = ~ Z, ti = time_idx, control = mssm_control(

        n_threads = 5L, N_part = N_use, what = "log_density", 

        KD_N_max = 6L, aprx_eps = 1e-2, seed = seed, 

        which_ll_cp = "KD", use_antithetic = TRUE))

    

    logLik(ll_func$pf_filter(

      cfix = cfix, disp = numeric(), F. = F., Q = Q))

  })

  

  list(ll_no_approx = ll_no_approx, ll_approx = ll_approx)

})

```

```{r show_comp_arell_aprx, fig.height = .fig_height_small}

par(mar = c(5, 4, 1, 1))

hist(

  ll_compare$ll_no_approx, main = "", breaks = 20L, 

  xlab = "Log-likelihood approximation -- no aprox")

hist(

  ll_compare$ll_approx   , main = "", breaks = 20L, 

  xlab = "Log-likelihood approximation -- aprox")

```

We can make a t-test for whether there is a difference between the two 

log-likelihood estimates

```{r t_test_comp_ll}

with(ll_compare, t.test(ll_no_approx, ll_approx))

```

The fact that there may only be a small difference if any is nice because 

now we can get a much better approximation (in terms of variance) quickly 

of e.g., the log-likelihood as shown below.

```{r show_ll_quick, cache = 1, message = FALSE}

ll_approx <- sapply(1:10, function(seed){

  N_use <- 10000L # many more particles

  

  ll_func <- mssm(

    fixed = y ~ X1 + X2 + Z, family = poisson(), data = dat,

    random = ~ Z, ti = time_idx, control = mssm_control(

      n_threads = 5L, N_part = N_use, what = "log_density", 

      KD_N_max = 100L, aprx_eps = 1e-2, seed = seed, 

      which_ll_cp = "KD", use_antithetic = TRUE))

  

  logLik(ll_func$pf_filter(

    cfix = cfix, disp = numeric(), F. = F., Q = Q))

}) 

# approximate log-likelihood

sd(ll_approx)

print(mean(ll_approx), digits = 6)

# compare sd with 

sd(ll_compare$ll_no_approx)

print(mean(ll_compare$ll_no_approx), digits = 6)

```

### Approximate Observed Information Matrix

Next, we look at approximating the observed information matrix with the method 

suggested by @Poyiadjis11.

```{r apprx_obs_info, cache = 1, dependson=c("use_sgd", "cont_est")}

ll_func <- mssm(

  fixed = y ~ X1 + X2 + Z, family = poisson(), data = dat,

  random = ~ Z, ti = time_idx, control = mssm_control(

    n_threads = 5L, N_part = 10000L, what = "Hessian",

    which_ll_cp = "KD", aprx_eps = .01, use_antithetic = TRUE))

set.seed(46658529)

system.time(

  mssm_obs_info <- ll_func$pf_filter(

    cfix = res_final$cfix, disp = numeric(), F. = res_final$F., 

    Q = res_final$Q))

```

We define a function below to get the approximate gradient and approximate 

observed information matrix from the returned object. Then we compare 

the output to the GLM we estimated and to the true parameters.

```{r show_apprx_obs_info}

# Function to subtract the approximate gradient elements and approximate 

# observed information matrix. 

# 

# Args:

#   object: an object of class mssm.

#

# Returns:

#   list with the approximate gradient elements and approximate observed 

# information matrix. 

get_grad_n_obs_info <- function(object){

  stopifnot(inherits(object, "mssm"))

  # get dimension of the components

  n_rng   <- nrow(object$Z)

  dim_fix <- nrow(object$X)

  dim_rng <- n_rng * n_rng + ((n_rng + 1L) * n_rng) / 2L

  has_dispersion <-

    any(sapply(c("^Gamma", "^gaussian"), grepl, x = object$family))

  

  # get quantities for each particle

  quants <-     tail(object$pf_output, 1L)[[1L]]$stats

  ws     <- exp(tail(object$pf_output, 1L)[[1L]]$ws_normalized)

  # get mean estimates

  meas <- colSums(t(quants) * drop(ws))

  # separate out the different components. Start with the gradient

  idx <- dim_fix + dim_rng + has_dispersion

  grad <- meas[1:idx]

  dim_names <- names(grad)

  # then the observed information matrix

  I1 <- matrix(

    meas[-(1:idx)], dim_fix + dim_rng + has_dispersion, 

    dimnames = list(dim_names, dim_names))

  

  I2 <- matrix(0., nrow(I1), ncol(I1))

  for(i in seq_along(ws))

    I2 <- I2 + ws[i] * tcrossprod(quants[1:idx, i])

  list(grad = grad, obs_info = tcrossprod(grad) - I1 - I2)

}

# use function

out <- get_grad_n_obs_info(mssm_obs_info)

# approximate gradient

out$grad

# approximate standard errors

(ses <- sqrt(diag(solve(out$obs_info))))

# look at output for parameters in the observational equation. First, compare 

# with glm's standard errors

sqrt(diag(vcov(glm_fit)))

# and relative to true parameters vs. estimated

rbind(

  true             = cfix, 

  glm              = coef(glm_fit), 

  mssm             = res_final$cfix, 

  `standard error` = ses[1:4])

# next look at parameters in state equation. First four are for F.

rbind(

  true             = c(F.), 

  mssm             = c(res_final$F.), 

  `standard error` = ses[5:8])

# next three are w.r.t. the lower diagonal part of Q

rbind(

  true             =           Q[lower.tri(Q, diag = TRUE)], 

  mssm             = res_final$Q[lower.tri(Q, diag = TRUE)], 

  `standard error` = ses[9:11])

```

## Supported Families

The following families are supported:

 - The binomial distribution is supported with logit, probit, and cloglog link. 

 - The Poisson distribution is supported with square root and log link. 

 - The gamma distribution is supported with log link.

 - The normal distribution with identity link (to compare with e.g., a Kalman 

 filter), log link, and the inverse link function.

## Fast Sum-Kernel Approximation

This package contains a simple implementation of the dual-tree method like the one 

suggested by @Gray03 and shown in @Klaas06. The problem we want to solve is

the sum-kernel problem in @Klaas06. Particularly, we consider the situation 

where we have $1,\dots,N_q$ query particles denoted by 

$\{\vec Y_i\}_{i=1,\dots,N_q}$ and $1,\dots,N_s$ source particles denoted by

$\{\vec X_j\}_{j=1,\dots,N_s}$. For each query particle, we want to compute 

the weights 

$$W_i = \frac{\tilde W_i}{\sum_{k = 1}^{N_q} \tilde W_i},\qquad \tilde W_i = \sum_{j=1}^{N_s} \bar W_j K(\vec Y_i, \vec X_j)$$

where each source particle has an associated weight $\bar W_j$ and $K$ is a

kernel. Computing the above is $\mathcal{O}(N_sN_q)$ which is major 

bottleneck if $N_s$ and $N_q$ is large. However, one can use a 

[k-d tree](https://en.wikipedia.org/wiki/K-d_tree) for the query particles

and source particles and exploit that:

- $W_j K(\vec Y_i, \vec X_j)$ is almost zero for some pairs of nodes in the 

two k-d trees. 

- $K(\cdot, \vec X_j)$ is almost identical for some nodes in the k-d tree 

for the source particles. 

Thus, a substantial amount of computation can be skipped or approximated 

with e.g., the centroid in the source node with only a minor

loss of precision. The dual-tree approximation method is 

$\mathcal{O}(N_s\log N_s)$ and $\mathcal{O}(N_q\log N_q)$.

We start by defining a function to simulate the source 

and query particles (we will let the two sets be identical for simplicity). 

Further, we plot one draw of simulated points and illustrate the leafs in 

the k-d tree.

```{r sim_func}

######

# define function to simulate data

mus <- matrix(c(-1, -1, 

                 1, 1, 

                -1, 1), 3L, byrow = FALSE)

mus <- mus * .75

Sig <- diag(c(.5^2, .25^2))

get_sims <- function(n_per_grp){

  # simulate X

  sims <- lapply(1:nrow(mus), function(i){

    mu <- mus[i, ]

    X <- matrix(rnorm(n_per_grp * 2L), nrow = 2L)

    X <- t(crossprod(chol(Sig), X) + mu)

    

    data.frame(X, grp = i)

  })

  sims <- do.call(rbind, sims)

  X <- t(as.matrix(sims[, c("X1", "X2")]))

  

  # simulate weights

  ws <- exp(rnorm(ncol(X)))

  ws <- ws / sum(ws)

  

  list(sims = sims, X = X, ws = ws)

}

#####

# show example 

set.seed(42452654)

invisible(list2env(get_sims(5000L), environment()))

# plot points

par(mar = c(5, 4, 1, 1))

plot(as.matrix(sims[, c("X1", "X2")]), col = sims$grp + 1L)

# find k-d tree and add borders 

out <- mssm:::test_KD_note(X, 50L)

out$indices <- out$indices + 1L

n_ele <- drop(out$n_elems)

idx <- mapply(`:`, cumsum(c(1L, head(n_ele, -1))), cumsum(n_ele))

stopifnot(all(sapply(idx, length) == n_ele))

idx <- lapply(idx, function(i) out$indices[i])

stopifnot(!anyDuplicated(unlist(idx)), length(unlist(idx)) == ncol(X))

grps <- lapply(idx, function(i) X[, i])

borders <- lapply(grps, function(x) apply(x, 1, range))

invisible(lapply(borders, function(b) 

  rect(b[1, "X1"], b[1, "X2"], b[2, "X1"], b[2, "X2"])))

```

Next, we compute the run-times for the previous examples and compare the 

approximations of the un-normalized log weights, $\log \tilde W_i$, and 

normalized weights, $W_i$. The `n_threads` sets the number of threads to 

use in the methods.

```{r comp_run_times, cache = 1}

# run-times

microbenchmark::microbenchmark(

  `dual tree 1` = mssm:::FSKA (X = X, ws = ws, Y = X, N_min = 10L, 

                               eps = 5e-3, n_threads = 1L),

  `dual tree 4` = mssm:::FSKA (X = X, ws = ws, Y = X, N_min = 10L, 

                               eps = 5e-3, n_threads = 4L),

  `naive 1`     = mssm:::naive(X = X, ws = ws, Y = X, n_threads = 1L),

  `naive 4`     = mssm:::naive(X = X, ws = ws, Y = X, n_threads = 4L),

  times = 10L)

# The functions return the un-normalized log weights. We first compare

# the result on this scale

o1 <- mssm:::FSKA  (X = X, ws = ws, Y = X, N_min = 10L, eps = 5e-3, 

                    n_threads = 1L)

o2 <- mssm:::naive(X = X, ws = ws, Y = X, n_threads = 4L)

all.equal(o1, o2)

par(mar = c(5, 4, 1, 1))

hist((o1 - o2)/ abs((o1 + o2) / 2), breaks = 50, main = "", 

     xlab = "Delta un-normalized log weights")

# then we compare the normalized weights

func <- function(x){

  x_max <- max(x)

  x <- exp(x - x_max)

  x / sum(x)

}

o1 <- func(o1)

o2 <- func(o2)

all.equal(o1, o2)

hist((o1 - o2)/ abs((o1 + o2) / 2), breaks = 50, main = "", 

     xlab = "Delta normalized log weights")

```

Finally, we compare the run-times as function of $N = N_s = N_q$. The dashed 

line is "naive" method, the continuous line is the dual-tree method, and the 

dotted line is dual-tree method using 1 thread.

```{r run_times_N, cache = 1}

Ns <- 2^(7:14)

run_times <- lapply(Ns, function(N){ 

  invisible(list2env(get_sims(N), environment()))

  microbenchmark::microbenchmark(

    `dual-tree`   = mssm:::FSKA (X = X, ws = ws, Y = X, N_min = 10L, eps = 5e-3, 

                                 n_threads = 4L),

    naive         = mssm:::naive(X = X, ws = ws, Y = X, n_threads = 4L),

    `dual-tree 1` = mssm:::FSKA (X = X, ws = ws, Y = X, N_min = 10L, eps = 5e-3, 

                                 n_threads = 1L), 

    times = 5L)

}) 

Ns_xtra <- 2^(15:19) 

run_times_xtra <- lapply(Ns_xtra, function(N){

  invisible(list2env(get_sims(N), environment()))

  microbenchmark::microbenchmark(

    `dual-tree` = mssm:::FSKA (X = X, ws = ws, Y = X, N_min = 10L, eps = 5e-3, 

                               n_threads = 4L),

    times = 5L)

}) 

```

 

```{r plot_run_times_N}

library(microbenchmark)

meds <- t(sapply(run_times, function(x) summary(x, unit = "s")[, "median"]))

meds_xtra <- 

  sapply(run_times_xtra, function(x) summary(x, unit = "s")[, "median"])

meds <- rbind(meds, cbind(meds_xtra, NA_real_, NA_real_))

dimnames(meds) <- list(

  N = c(Ns, Ns_xtra) * 3L, method = c("Dual-tree", "Naive", "Dual-tree 1"))

meds

par(mar = c(5, 4, 1, 1))

matplot(c(Ns, Ns_xtra) * 3L, meds, lty = 1:3, type = "l", log = "xy", 

        ylab = "seconds", xlab = "N", col = "black")

```

## Function Definitions

```{r mc_grad_est, eval = FALSE}

```

## References
ecosyste.ms

Data

Tools

Indexes

Applications

Experiments

Awesome

https://github.com/boennecd/mssm

Awesome Lists containing this project

README