An open API service indexing awesome lists of open source software.

https://github.com/jennalandy/causallfo

This R package provides all algorithms discussed in the paper “Causal Inference for Latent Outcomes Learned with Factor Models”.
https://github.com/jennalandy/causallfo

causal-inference latent-factor-model latent-outcomes nonnegative-matrix-factorization r-package

Last synced: 10 months ago
JSON representation

This R package provides all algorithms discussed in the paper “Causal Inference for Latent Outcomes Learned with Factor Models”.

Awesome Lists containing this project

README

          

# `causalLFO`: R package for Causal Inference for Latent Outcomes Learned with Factor Models

This R package provides all algorithms discussed in [the paper “Causal Inference for Latent Outcomes Learned with Factor Models”](https://arxiv.org/abs/2506.20549). Code to reproduce results from our paper can be found in the [jennalandy/causalLFO_PAPER](https://github.com/jennalandy/causalLFO_PAPER/tree/master) repository.

## Installation

``` r
remotes::install_github("jennalandy/causalLFO")
```

``` r
library(NMF)
library(causalLFO)
```

`NMF::nmf()` internally uses `setupLibPaths("NMF")`, which calls `path.package("NMF")`. This requires the NMF package to be attached, not just imported, so the user must library `NMF` as well as `causalLFO`.

Please install `NMF` if you have not yet done so. `NMF` requires the `Biobase` package, which may have to be installed separately from `Bioconductor`.

## Quick Start

This code block simulates a simple dataset with 100 samples, three latent factors, and a true ATE of 1000 on the latent dimension 1, with ATE of 0 for dimensions 2 and 3. We include five outliers in the true untreated latent outcomes for factor 3.

``` r
library(tidyverse)
library(ggridges)

set.seed(321)
N = 100; D = 96; K = 3; ATE = c(1000, 0, 0)

# Simulate treatment assignment
Tr = sample(c(0, 1), N, replace = TRUE)

# Simulate latent factors P
true_P = matrix(rexp(D*K, rate = 1), nrow = D)
# Normalize factors to sum to 1
true_P = sweep(true_P, 2, colSums(true_P), '/')

# Simulate untreated factor loadings C
true_C = matrix(nrow = K, ncol = N)
true_C[1,] <- rgamma(N, shape = 1, scale = 1000) # larger scale for factor 1
true_C[2,] <- rexp(N, rate = 0.01)
true_C[3,] <- rexp(N, rate = 0.01)
true_C[3,sample(1:N, 10)] <- rnorm(10, mean = 1500, sd = 1000) # outliers for factor 3
data.frame(t(true_C)) %>%
pivot_longer(1:K, names_to = 'k', values_to = 'C') %>%
ggplot(aes(x = C, y = as.factor(k))) +
geom_density_ridges() +
theme_bw() +
labs(x = "Untreated latent outcome distribution", y = "Latent dimension")
```

![](README_files/figure-commonmark/unnamed-chunk-3-1.png)

``` r
# Add ATE to loadigns of treated samples
for (k in 1:K) {
true_C[k, Tr == 1] <- true_C[k, Tr == 1] + ATE[k]
}

# Simulate M ~ Poisson(PC)
M = matrix(nrow = D, ncol = N)
for (i in 1:N) {
M[,i] <- rpois(D, lambda = true_P %*% true_C[,i])
}
```

### Run impute and stabilize algorithm once to yield a point estimate.

Providing a `reference_P` does not affect the algorithm, but aligns results at the end.

``` r
impute_and_stabilize_res <- impute_and_stabilize(
M, Tr, rank = 3, reference_P = true_P
)
class(impute_and_stabilize_res)
```

```
[1] "causalLFO_result"
```

``` r
summary(impute_and_stabilize_res)
```

```
ATE
1 944.07472
2 11.60620
3 -52.84368
```

``` r
plot(impute_and_stabilize_res)
```

![](README_files/figure-commonmark/unnamed-chunk-4-1.png)

If you have multiple sets of results, they can be plotted together with `plot_causalLFO_results`. This could be from multiple algorithms as we have here, or alternatively from multiple datasets. This only makes sense when the same `reference_P` is used for all results. If a reference is not available, the resulting `Phat` from the first result.

``` r
all_data_res <- all_data(
M, Tr, rank = 3, reference_P = true_P
)
res_list <- list(
'All Data' = all_data_res,
'Impute and Stabilize' = impute_and_stabilize_res
)
plot_causalLFO_results(res_list)
```

![](README_files/figure-commonmark/unnamed-chunk-5-1.png)

### Run impute and stabilize algorithm with bootstrap resampling to estimate a 95% confidence interval.

When `bootstrap = TRUE`, any of the `causalLFO` algorithms will create three files named according to the `bootstrap_filename` parameter: `examples/impute_and_stabilize.csv` with ATE estimates from each of the 500 bootstrap replicates, `examples/impute_and_stabilize_aligned_Ps.rds` with a list of all 500 aligned factor matrices. We also choose to save the `res` object to a separate `.rds` file for easy access at a later time, and `examples/impute_and_stabilize_res.rds` with the full results object that is also returned by the function

``` r
impute_and_stabilize_bootstrap_res <- impute_and_stabilize(
M, Tr, rank = 3, reference_P = true_P,
bootstrap = TRUE, bootstrap_reps = 30,
bootstrap_filename = "examples/impute_and_stabilize"
# small bootstrap_reps for demonstration purposes only
# we recommend default bootstrap_reps = 500
)
```

When `bootstrap = TRUE`, the `class` is changed from `causalLFO_result` to `causalLFO_bootstrap_result`, resulting in updated `summary` and `plot` methods:

``` r
impute_and_stabilize_bootstrap_res <- readRDS("examples/impute_and_stabilize_res.rds")
class(impute_and_stabilize_bootstrap_res)
```

```
[1] "causalLFO_bootstrap_result"
```

``` r
summary(impute_and_stabilize_bootstrap_res)
```

```
mean lower upper
1 1002.70616 763.47986 1384.10774
2 10.74524 -27.91742 42.69258
3 -33.87980 -132.16468 77.48332
```

``` r
plot(impute_and_stabilize_bootstrap_res)
```

![](README_files/figure-commonmark/unnamed-chunk-7-1.png)

Again, multiple sets of results can be plotted together with `plot_causalLFO_bootstrap_results`.

``` r
all_data_bootstrap_res <- all_data(
M, Tr, rank = 3, reference_P = true_P,
bootstrap = TRUE, bootstrap_reps = 30,
bootstrap_filename = "examples/all_data"
# small bootstrap_reps for demonstration purposes only
# we recommend default bootstrap_reps = 500
)
```

Comparing the All Data and Impute and Stabilize algorithms, recall that the true ATE is 1000 for latent dimension 1 and 0 for dimensions 2 and 3. We see:

- Improved efficiency of Impute and Stabilize, narrower confidence intervals on factors 2 and 3 (especially factor 3 which has outliers in the data generating model)
- Impute and Stabilize corrects the All Data algorithm’s biased estimates for factors 1 and 3

``` r
all_data_bootstrap_res <- readRDS("examples/all_data_res.rds")
summary(all_data_bootstrap_res)
```

```
mean lower upper
1 852.34174 580.46693 1086.67511
2 12.08584 -40.73959 81.13129
3 -140.59002 -339.77022 17.31335
```

``` r
res_list <- list(
'All Data' = all_data_bootstrap_res,
'Impute and Stabilize' = impute_and_stabilize_bootstrap_res
)
plot_causalLFO_bootstrap_results(res_list)
```

![](README_files/figure-commonmark/unnamed-chunk-9-1.png)

## Algorithms

Novel algorithm from “Causal Inference for Latent Outcomes Learned with Factor Models”:

- **Impute and Stabilize** algorithm to estimate ATE on latent factor-modeled outcomes. Imputes counterfactual outcomes under Poisson distributional assumptions, fits NMF on untreated data (mix of observed and imputed), a Poisson non-negative linear model on treated data, then estimates ATE as the mean difference in estimated latent outcomes between treated and untreated.

Ablations of Impute and Stabilize:

- **Impute** algorithm to estimate ATE on latent factor-modeled outcomes. Imputes counterfactual outcomes under Poisson distributional assumptions, fits NMF on observed data, a Poisson non-negative linear model on imputed data, then estimates ATE as the mean difference in estimated latent outcomes between treated and untreated. *Intended as an ablation of impute_and_stabilize and not recommended by the authors.*
- **Stabilize** algorithm to estimate ATE on latent factor-modeled outcomes. Fits NMF on untreated samples, a Poisson non-negative linear model on treated samples, then estimates ATE using estimated latent outcomes. *Intended as an ablation of impute_and_stabilize and not recommended by the authors.*

Baseline Algorithms:

- **All Data** algorithm to estimate ATE on latent factor-modeled outcomes. Fits NMF on all data, then estimates ATE from estimated latent outcomes. *Subject to measurement interference and not recommended by the authors.*
- **Random Split** algorithm to estimate ATE on latent factor-modeled outcomes. Fits NMF on a subset of data, a Poisson non-negative linear model on the rest with fixed factors, then estimates ATE from estimated latent outcomes in the second subset. *Subject to measurement interference and not recommended by the authors.*