https://github.com/jennalandy/causallfo

This R package provides all algorithms discussed in the paper “Causal Inference for Latent Outcomes Learned with Factor Models”.

Host: GitHub
URL: https://github.com/jennalandy/causallfo
Owner: jennalandy
Created: 2025-06-03T19:32:57.000Z (about 1 year ago)
Default Branch: main
Last Pushed: 2025-06-28T23:23:00.000Z (12 months ago)
Last Synced: 2025-06-29T00:24:32.070Z (12 months ago)
Topics: causal-inference, latent-factor-model, latent-outcomes, nonnegative-matrix-factorization, r-package
Language: R
Homepage:
Size: 666 KB
Stars: 0
Watchers: 0
Forks: 0
Open Issues: 0
Metadata Files:
- Readme: README.md

# `causalLFO`: R package for Causal Inference for Latent Outcomes Learned with Factor Models

This R package provides all algorithms discussed in [the paper “Causal Inference for Latent Outcomes Learned with Factor Models”](https://arxiv.org/abs/2506.20549). Code to reproduce results from our paper can be found in the [jennalandy/causalLFO_PAPER](https://github.com/jennalandy/causalLFO_PAPER/tree/master) repository.

## Installation

``` r
remotes::install_github("jennalandy/causalLFO")
```

``` r
library(NMF)
library(causalLFO)
```

`NMF::nmf()` internally uses `setupLibPaths("NMF")`, which calls `path.package("NMF")`. This requires the `NMF` package to be attached, not just imported, so you must call `library(NMF)` in addition to `library(causalLFO)`.

Please install `NMF` if you have not yet done so. `NMF` requires the `Biobase` package, which may have to be installed separately from `Bioconductor`.
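
To see which of these dependencies are still missing, something like the following can help. This is only a sketch: `missing_pkgs()` is a hypothetical helper, not part of `causalLFO`, and the commented install commands use the standard `install.packages()` and `BiocManager::install()` entry points.

``` r
# Hypothetical helper: return the packages in `pkgs` that are not installed
missing_pkgs <- function(pkgs) {
  installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  pkgs[!installed]
}

missing_pkgs(c("Biobase", "NMF", "remotes"))

# Install whatever is reported as missing, for example:
# install.packages(c("remotes", "BiocManager"))
# BiocManager::install("Biobase")   # Biobase lives on Bioconductor
# install.packages("NMF")           # install after Biobase
```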

## Quick Start

This code block simulates a simple dataset with 100 samples, 96 observed features, and three latent factors. The true ATE is 1000 on latent dimension 1 and 0 on dimensions 2 and 3. We include ten outliers in the true untreated latent outcomes for factor 3.

``` r
library(tidyverse)
library(ggridges)

set.seed(321)
N = 100; D = 96; K = 3; ATE = c(1000, 0, 0)

# Simulate treatment assignment
Tr = sample(c(0, 1), N, replace = TRUE)

# Simulate latent factors P
true_P = matrix(rexp(D*K, rate = 1), nrow = D)

# Normalize factors to sum to 1
true_P = sweep(true_P, 2, colSums(true_P), '/')

# Simulate untreated factor loadings C
true_C = matrix(nrow = K, ncol = N)
true_C[1,] <- rgamma(N, shape = 1, scale = 1000) # larger scale for factor 1
true_C[2,] <- rexp(N, rate = 0.01)
true_C[3,] <- rexp(N, rate = 0.01)
true_C[3,sample(1:N, 10)] <- rnorm(10, mean = 1500, sd = 1000) # outliers for factor 3

data.frame(t(true_C)) %>%
  pivot_longer(1:K, names_to = 'k', values_to = 'C') %>%
  ggplot(aes(x = C, y = as.factor(k))) +
  geom_density_ridges() +
  theme_bw() +
  labs(x = "Untreated latent outcome distribution", y = "Latent dimension")
```

![](README_files/figure-commonmark/unnamed-chunk-3-1.png)

``` r
# Add ATE to loadings of treated samples
for (k in 1:K) {
  true_C[k, Tr == 1] <- true_C[k, Tr == 1] + ATE[k]
}

# Simulate M ~ Poisson(PC)
M = matrix(nrow = D, ncol = N)
for (i in 1:N) {
  M[,i] <- rpois(D, lambda = true_P %*% true_C[,i])
}
```

### Run impute and stabilize algorithm once to yield a point estimate.

Providing a `reference_P` does not affect the algorithm, but aligns results at the end.

``` r
impute_and_stabilize_res <- impute_and_stabilize(
  M, Tr, rank = 3, reference_P = true_P
)
class(impute_and_stabilize_res)
```

```
[1] "causalLFO_result"
```

``` r
summary(impute_and_stabilize_res)
```

```
        ATE
1 944.07472
2  11.60620
3 -52.84368
```

``` r
plot(impute_and_stabilize_res)
```

![](README_files/figure-commonmark/unnamed-chunk-4-1.png)

If you have multiple sets of results, they can be plotted together with `plot_causalLFO_results`. The results could come from multiple algorithms, as we have here, or from multiple datasets. This only makes sense when the same `reference_P` is used for all results. If a reference is not available, the resulting `Phat` from the first result can be used as the `reference_P` for the others.

``` r
all_data_res <- all_data(
  M, Tr, rank = 3, reference_P = true_P
)

res_list <- list(
  'All Data' = all_data_res,
  'Impute and Stabilize' = impute_and_stabilize_res
)

plot_causalLFO_results(res_list)
```

![](README_files/figure-commonmark/unnamed-chunk-5-1.png)
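
To illustrate why a shared `reference_P` matters: NMF factors come back in arbitrary order, so estimated factors have to be matched to a common reference before results can be compared. The sketch below shows one generic way to do this, matching columns by cosine similarity over all permutations. It is an assumption for illustration only and not necessarily how `causalLFO` aligns factors; `cosine_sim()`, `all_perms()`, and `align_to_reference()` are hypothetical helpers.

``` r
# Cosine similarity between columns of A and columns of B
cosine_sim <- function(A, B) {
  A <- sweep(A, 2, sqrt(colSums(A^2)), "/")
  B <- sweep(B, 2, sqrt(colSums(B^2)), "/")
  t(A) %*% B
}

# All permutations of a vector (fine for small K)
all_perms <- function(v) {
  if (length(v) <= 1) return(list(v))
  out <- list()
  for (i in seq_along(v)) {
    for (p in all_perms(v[-i])) out[[length(out) + 1]] <- c(v[i], p)
  }
  out
}

# Reorder columns of Phat to best match the columns of reference_P
align_to_reference <- function(Phat, reference_P) {
  K <- ncol(reference_P)
  S <- cosine_sim(reference_P, Phat)  # S[i, j]: reference i vs estimate j
  perms <- all_perms(seq_len(K))
  scores <- vapply(perms, function(p) sum(S[cbind(seq_len(K), p)]), numeric(1))
  best <- perms[[which.max(scores)]]
  list(P = Phat[, best, drop = FALSE], perm = best)
}

# Toy check: shuffle the columns of a reference and recover the order
set.seed(1)
ref <- matrix(rexp(96 * 3), nrow = 96)
ref <- sweep(ref, 2, colSums(ref), "/")
shuffled <- ref[, c(3, 1, 2)] + abs(rnorm(96 * 3, sd = 1e-5))
aligned <- align_to_reference(shuffled, ref)
aligned$perm
```

Brute-force search over permutations is only practical for small ranks; for larger `K`, an assignment solver would be the usual substitute.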

### Run impute and stabilize algorithm with bootstrap resampling to estimate a 95% confidence interval.

When `bootstrap = TRUE`, any of the `causalLFO` algorithms will create three files whose names start with the `bootstrap_filename` parameter. In the example below these are `examples/impute_and_stabilize.csv`, with ATE estimates from each bootstrap replicate; `examples/impute_and_stabilize_aligned_Ps.rds`, with a list of the aligned factor matrices from all replicates; and `examples/impute_and_stabilize_res.rds`, with the full results object that is also returned by the function. Saving the results object to its own `.rds` file gives easy access at a later time.

``` r
impute_and_stabilize_bootstrap_res <- impute_and_stabilize(
  M, Tr, rank = 3, reference_P = true_P,
  bootstrap = TRUE, bootstrap_reps = 30,
  bootstrap_filename = "examples/impute_and_stabilize"
  # small bootstrap_reps for demonstration purposes only
  # we recommend default bootstrap_reps = 500
)
```

When `bootstrap = TRUE`, the `class` is changed from `causalLFO_result` to `causalLFO_bootstrap_result`, resulting in updated `summary` and `plot` methods:

``` r
impute_and_stabilize_bootstrap_res <- readRDS("examples/impute_and_stabilize_res.rds")
class(impute_and_stabilize_bootstrap_res)
```

```
[1] "causalLFO_bootstrap_result"
```

``` r
summary(impute_and_stabilize_bootstrap_res)
```

```
        mean      lower      upper
1 1002.70616  763.47986 1384.10774
2   10.74524  -27.91742   42.69258
3  -33.87980 -132.16468   77.48332
```

``` r
plot(impute_and_stabilize_bootstrap_res)
```

![](README_files/figure-commonmark/unnamed-chunk-7-1.png)
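
For intuition about the `mean`, `lower`, and `upper` columns in the summary above, one common way to summarize per-replicate estimates is a percentile interval. The sketch below is an assumption about how such a table could be computed, not a description of the package's `summary` method. `boot_ATE` is a made-up matrix of replicate estimates (rows are replicates, columns are latent dimensions), and `percentile_summary()` is a hypothetical helper.

``` r
set.seed(1)
reps <- 500

# Made-up bootstrap ATE estimates, one column per latent dimension
boot_ATE <- cbind(
  rnorm(reps, mean = 1000, sd = 150),
  rnorm(reps, mean = 0, sd = 20),
  rnorm(reps, mean = 0, sd = 50)
)

# Mean and percentile interval for each latent dimension
percentile_summary <- function(boot, level = 0.95) {
  alpha <- 1 - level
  data.frame(
    mean  = colMeans(boot),
    lower = apply(boot, 2, quantile, probs = alpha / 2),
    upper = apply(boot, 2, quantile, probs = 1 - alpha / 2)
  )
}

boot_summary <- percentile_summary(boot_ATE)
boot_summary
```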

Again, multiple sets of results can be plotted together with `plot_causalLFO_bootstrap_results`.

``` r
all_data_bootstrap_res <- all_data(
  M, Tr, rank = 3, reference_P = true_P,
  bootstrap = TRUE, bootstrap_reps = 30,
  bootstrap_filename = "examples/all_data"
  # small bootstrap_reps for demonstration purposes only
  # we recommend default bootstrap_reps = 500
)
```

Comparing the All Data and Impute and Stabilize algorithms, recall that the true ATE is 1000 for latent dimension 1 and 0 for dimensions 2 and 3. We see:

-   Impute and Stabilize is more efficient, with narrower confidence intervals on factors 2 and 3 (especially factor 3, which has outliers in the data-generating model).

-   Impute and Stabilize corrects the All Data algorithm’s biased estimates for factors 1 and 3.

``` r
all_data_bootstrap_res <- readRDS("examples/all_data_res.rds")
summary(all_data_bootstrap_res)
```

```
        mean      lower      upper
1  852.34174  580.46693 1086.67511
2   12.08584  -40.73959   81.13129
3 -140.59002 -339.77022   17.31335
```

``` r
res_list <- list(
  'All Data' = all_data_bootstrap_res,
  'Impute and Stabilize' = impute_and_stabilize_bootstrap_res
)

plot_causalLFO_bootstrap_results(res_list)
```

![](README_files/figure-commonmark/unnamed-chunk-9-1.png)

## Algorithms

Novel algorithm from “Causal Inference for Latent Outcomes Learned with Factor Models”:

-   **Impute and Stabilize** algorithm to estimate ATE on latent factor-modeled outcomes. Imputes counterfactual outcomes under Poisson distributional assumptions, fits NMF on untreated data (mix of observed and imputed), a Poisson non-negative linear model on treated data, then estimates ATE as the mean difference in estimated latent outcomes between treated and untreated.

Ablations of Impute and Stabilize:

-   **Impute** algorithm to estimate ATE on latent factor-modeled outcomes. Imputes counterfactual outcomes under Poisson distributional assumptions, fits NMF on observed data, a Poisson non-negative linear model on imputed data, then estimates ATE as the mean difference in estimated latent outcomes between treated and untreated. *Intended as an ablation of impute_and_stabilize and not recommended by the authors.*

-   **Stabilize** algorithm to estimate ATE on latent factor-modeled outcomes. Fits NMF on untreated samples, a Poisson non-negative linear model on treated samples, then estimates ATE using estimated latent outcomes. *Intended as an ablation of impute_and_stabilize and not recommended by the authors.*

Baseline Algorithms:

-   **All Data** algorithm to estimate ATE on latent factor-modeled outcomes. Fits NMF on all data, then estimates ATE from estimated latent outcomes. *Subject to measurement interference and not recommended by the authors.*

-   **Random Split** algorithm to estimate ATE on latent factor-modeled outcomes. Fits NMF on a subset of data, a Poisson non-negative linear model on the rest with fixed factors, then estimates ATE from estimated latent outcomes in the second subset. *Subject to measurement interference and not recommended by the authors.*
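
Several of the algorithms above fit a Poisson non-negative linear model with the factors held fixed and then take a difference in mean latent outcomes. The sketch below illustrates that step in base R, under stated assumptions: the factor matrix is taken as known rather than learned by NMF, and the loadings are fit with multiplicative updates that decrease the KL divergence, which is a generic solver and not necessarily the one `causalLFO` uses. `fit_loadings()` is a hypothetical helper.

``` r
set.seed(42)
D <- 96; K <- 3; N <- 200
ATE <- c(1000, 0, 0)

# Factors with columns summing to 1 (assumed known for this sketch)
true_P <- matrix(rexp(D * K), nrow = D)
true_P <- sweep(true_P, 2, colSums(true_P), "/")

# Untreated loadings, then add the ATE to treated samples
Tr <- rep(c(0, 1), each = N / 2)
true_C <- rbind(rgamma(N, shape = 1, scale = 1000), rexp(N, 0.01), rexp(N, 0.01))
true_C[, Tr == 1] <- true_C[, Tr == 1] + ATE  # ATE[k] is added to row k

# Simulate M ~ Poisson(PC)
M <- matrix(rpois(D * N, lambda = true_P %*% true_C), nrow = D)

# Poisson non-negative linear model with P fixed:
# multiplicative updates for the loadings only
fit_loadings <- function(M, P, n_iter = 500, eps = 1e-12) {
  K <- ncol(P)
  # Start each sample's loadings at an equal share of its total count
  Chat <- matrix(colSums(M) / K, nrow = K, ncol = ncol(M), byrow = TRUE)
  for (it in seq_len(n_iter)) {
    ratio <- M / (P %*% Chat + eps)
    Chat <- Chat * (t(P) %*% ratio) / colSums(P)
  }
  Chat
}

Chat <- fit_loadings(M, true_P)

# ATE as the difference in mean estimated latent outcomes
ate_hat <- rowMeans(Chat[, Tr == 1]) - rowMeans(Chat[, Tr == 0])
# Same difference computed from the true loadings, for comparison
ate_oracle <- rowMeans(true_C[, Tr == 1]) - rowMeans(true_C[, Tr == 0])
round(rbind(estimated = ate_hat, oracle = ate_oracle), 1)
```

Because the factors are fixed here, this sketch sidesteps the question of which samples the factors are learned from, which is where the algorithms listed above differ.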
