https://github.com/jackmwolf/pcsstools

Tools for regression using pre-computed summary statistics

gwas r statistical-genetics

Host: GitHub
URL: https://github.com/jackmwolf/pcsstools
Owner: jackmwolf
License: gpl-3.0
Created: 2019-07-11T20:10:23.000Z (almost 6 years ago)
Default Branch: master
Last Pushed: 2023-09-06T00:51:36.000Z (almost 2 years ago)
Last Synced: 2023-11-20T16:31:05.101Z (over 1 year ago)
Topics: gwas, r, statistical-genetics
Language: R
Homepage:
Size: 1.26 MB
Stars: 5
Watchers: 1
Forks: 0
Open Issues: 3
Metadata Files:
- Readme: README.Rmd
- Changelog: NEWS.md
- License: LICENSE.md

---
output: github_document
---

```{r, include = FALSE}

knitr::opts_chunk$set(

  collapse = TRUE,

  comment = "#>",

  fig.path = "man/figures/README-",

  out.width = "100%"

)

```

# pcsstools  

[![CRAN status](https://www.r-pkg.org/badges/version/pcsstools)](https://CRAN.R-project.org/package=pcsstools)

![CRAN Downloads](https://cranlogs.r-pkg.org/badges/grand-total/pcsstools)

[![R-CMD-check](https://github.com/jackmwolf/pcsstools/workflows/R-CMD-check/badge.svg)](https://github.com/jackmwolf/pcsstools/actions)

## Overview 

pcsstools is an R package to describe various regression models using only pre-computed summary statistics (PCSS) from genome-wide association studies (GWASs) and PCSS repositories such as [GeneAtlas](http://geneatlas.roslin.ed.ac.uk/).

This avoids the logistical, privacy, and access concerns that accompany the use of individual patient-level data (IPD).

The following figure highlights the information typically needed to perform regression analysis on a set of $m$ phenotypes with $p$ covariates when IPD is available, and the PCSS that are commonly needed to approximate this same model in pcsstools.

![Data needed for analysis using IPD compared to that when using PCSS](./man/figures/IPDvsPCSS.png)

Currently, pcsstools supports the linear modeling of complex phenotypes defined via functions of other phenotypes.

Supported functions include:

* linear combinations (e.g. $\phi_1y_1 + \phi_2y_2$)

* products (e.g. $y_1\circ y_2$)

* logical combinations (e.g. $y_1\wedge y_2$ or $y_1\vee y_2$)
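
As a rough illustration of what these three kinds of derived phenotypes look like when individual-level data are available, the sketch below builds each one in base R. The vectors and weights are made up for illustration and are not part of pcsstools.

```r
# Hypothetical continuous phenotypes for three individuals
y1 <- c(1.2, -0.4, 0.7)
y2 <- c(0.5, 2.0, -1.0)

# Linear combination with illustrative weights phi_1 = 0.5 and phi_2 = 2
lin_comb <- 0.5 * y1 + 2 * y2

# Element-wise product
product <- y1 * y2

# Hypothetical binary phenotypes for four individuals
b1 <- c(1, 0, 1, 0)
b2 <- c(1, 1, 0, 0)

# Logical combinations, stored as 0/1 indicators
b_and <- as.numeric(b1 & b2)
b_or  <- as.numeric(b1 | b2)
```

With IPD these responses can be computed directly; pcsstools targets the case where only summary statistics for the original phenotypes are available.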

## Installation

You can install pcsstools from CRAN with

``` r

install.packages("pcsstools")

```

### Development Version

You can install the in-development version of pcsstools from [GitHub](https://github.com/) with

``` r

# install.packages("devtools")

devtools::install_github("jackmwolf/pcsstools")

```

## Examples

We will walk through two examples using pcsstools to model combinations of phenotypes using PCSS and then compare our results to those found using IPD.

```{r}

library(pcsstools)

```

### Principal Component Analysis

Let's model the first principal component score of three phenotypes using PCSS.

First, we'll load in some data. We have minor allele counts for three SNPs (`g1`, `g2`, and `g3`), a continuous covariate (`x1`), and three continuous phenotypes (`y1`, `y2`, and `y3`).

```{r}

dat <- pcsstools_example[c("g1", "g2", "g3", "x1", "y1", "y2", "y3")]

head(dat)

```

Next, we need our assumed summary statistics: means, the full covariance matrix, and our sample size.

```{r}

pcss <- list(

  means = colMeans(dat),

  covs  = cov(dat),

  n     = nrow(dat)

)

```
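
To see why means and covariances can stand in for individual-level data, here is a minimal base R sketch for an ordinary single-phenotype linear model. It is not the package's implementation. The simulated data and all variable names are assumptions made for illustration. It recovers the least-squares coefficients from summary statistics alone and compares them with `lm()`.

```r
# Simulated individual-level data (hypothetical; not pcsstools_example)
set.seed(1)
n <- 200
g <- rbinom(n, size = 2, prob = 0.3)   # minor allele counts
x <- rnorm(n)                          # continuous covariate
y <- 0.5 * g - 0.2 * x + rnorm(n)      # continuous phenotype
sim <- data.frame(g = g, x = x, y = y)

# Summary statistics only
m <- colMeans(sim)
S <- cov(sim)

# Slopes come from the covariance matrix, the intercept from the means
preds <- c("g", "x")
slopes <- solve(S[preds, preds], S[preds, "y"])
intercept <- m[["y"]] - sum(m[preds] * slopes)
beta_pcss <- c(intercept, slopes)

# Same coefficients from the individual-level fit
beta_ipd <- coef(lm(y ~ g + x, data = sim))
all.equal(unname(beta_pcss), unname(beta_ipd))
```

The sample size is not needed for the coefficients themselves, but it would be needed to compute standard errors and test statistics.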

Then, we can calculate the linear model by using `pcsslm()`.

Our `formula` will list all phenotypes as one sum, joined together by `+` operators, and we indicate that we want the first principal component score by setting `comp = 1`.

We also want to center and standardize `y1`, `y2`, and `y3` before computing principal component scores; we will do so by setting `center = TRUE` and `standardize = TRUE`.

```{r}

model_pcss <- pcsslm(y1 + y2 + y3 ~ g1 + g2 + g3 + x1, pcss = pcss, comp = 1,

                     center = TRUE, standardize = TRUE)

model_pcss

```

Here's the same model using individual patient data. 

```{r}

pc_1 <- prcomp(x = dat[c("y1", "y2", "y3")], center = TRUE, scale. = TRUE)$x[, "PC1"]

model_ipd <- lm(pc_1 ~ g1 + g2 + g3 + x1, data = dat)

summary(model_ipd)

```

In this case, our coefficient estimates differ by a factor of -1 because `pcsslm()` picked the opposite vector of principal component weights to `prcomp`.

This distinction in sign is arbitrary (see the note in `?prcomp`).
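
The effect of the sign choice can be checked with a small base R sketch on simulated data (all names here are hypothetical). Negating the principal component score negates every coefficient but leaves the fit unchanged.

```r
# Simulated phenotypes and one SNP (illustrative only)
set.seed(2)
Y <- matrix(rnorm(300), ncol = 3)
g <- rbinom(100, size = 2, prob = 0.3)

# First principal component score and its sign-flipped version
pc1 <- prcomp(Y, center = TRUE, scale. = TRUE)$x[, "PC1"]
pc1_flipped <- -pc1

fit <- lm(pc1 ~ g)
fit_flipped <- lm(pc1_flipped ~ g)

# Coefficients differ only in sign; R-squared is identical
all.equal(coef(fit), -coef(fit_flipped))
all.equal(summary(fit)$r.squared, summary(fit_flipped)$r.squared)
```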

We can also compare this model to a smaller model using `anova` and find the same results when using both PCSS and IPD.

```{r}

model_pcss_reduced <- update(model_pcss, . ~ . - g1 - g2 - g3)

anova(model_pcss_reduced, model_pcss)

model_ipd_reduced <- update(model_ipd, . ~ . - g1 - g2 - g3)

anova(model_ipd_reduced, model_ipd)

```
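
For readers who want to see what the nested-model comparison computes, the following base R sketch reproduces the F statistic reported by `anova()` from the two residual sums of squares. The data are simulated and the variable names are hypothetical. It illustrates the standard F test, not pcsstools internals.

```r
# Simulated data (illustrative only)
set.seed(5)
n <- 150
sim <- data.frame(
  g1 = rbinom(n, size = 2, prob = 0.3),
  g2 = rbinom(n, size = 2, prob = 0.2),
  x1 = rnorm(n)
)
sim$y <- 0.3 * sim$g1 + 0.5 * sim$x1 + rnorm(n)

full    <- lm(y ~ g1 + g2 + x1, data = sim)
reduced <- update(full, . ~ . - g1 - g2)

# F = ((RSS_reduced - RSS_full) / q) / (RSS_full / residual df of full model)
rss_full    <- sum(residuals(full)^2)
rss_reduced <- sum(residuals(reduced)^2)
q <- 2  # number of terms dropped
f_manual <- ((rss_reduced - rss_full) / q) / (rss_full / df.residual(full))

f_anova <- anova(reduced, full)$F[2]
all.equal(f_manual, f_anova)
```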

### Logical Combination

In this example we will approximate a linear model where our response is the logical combination "$y_4$ or $y_5$" ($y_4\vee y_5$).

First we need data with binary phenotypes.

```{r}

dat <- pcsstools_example[c("g1", "g2", "x1", "y4", "y5")]

head(dat)

```

Once again we will organize our assumed PCSS.

In addition to the summary statistics we needed for the previous example, we also need to describe the distribution of each of our predictors through objects of class `predictor`.

(See `?new_predictor`.)

`pcsstools` has shortcut functions to create `predictor` objects for common types of variables, which we will use to create a list of `predictor`s.

```{r}

pcss <- list(

 means = colMeans(dat),

 covs = cov(dat),

 n = nrow(dat),

 predictors = list(

   g1 = new_predictor_snp(maf = mean(dat$g1) / 2),

   g2 = new_predictor_snp(maf = mean(dat$g2) / 2),

   x1 = new_predictor_normal(mean = mean(dat$x1), sd = sd(dat$x1))

 )

)

class(pcss$predictors[[1]])

```
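
The `maf = mean(...) / 2` arguments above use the fact that, for a SNP coded as a count of minor alleles (0, 1, or 2), the sample mean is twice the sample minor allele frequency. The short simulation below checks this. It draws genotypes from a Binomial(2, MAF) distribution, which is an assumption made only for this illustration.

```r
# Simulate minor allele counts with a known (hypothetical) MAF
set.seed(3)
maf_true <- 0.25
g <- rbinom(1e5, size = 2, prob = maf_true)

# Each individual carries two alleles, so MAF = (total minor alleles) / (2 * n)
maf_hat <- mean(g) / 2
maf_hat
```

With only PCSS available, the same quantity would be read off the reported mean of the SNP.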

Then we can approximate the linear model using `pcsslm()`.

```{r}

model_pcss <- pcsslm(y4 | y5 ~ g1 + g2 + x1, pcss = pcss) 

model_pcss

```

And here's the result we would get using IPD:

```{r}

model_ipd <- lm(y4 | y5 ~ g1 + g2 + x1, data = dat)

summary(model_ipd)

```
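
When reading the IPD model, note that `lm()` treats the logical response `y4 | y5` as a 0/1 indicator, so the fit is a linear probability model for "at least one of the two phenotypes is present". The sketch below uses simulated data with hypothetical names to check this. It also checks that, for 0/1 variables, the disjunction can be written using a sum and a product.

```r
# Simulated binary phenotypes and one SNP (illustrative only)
set.seed(4)
n <- 500
g  <- rbinom(n, size = 2, prob = 0.3)
y4 <- rbinom(n, size = 1, prob = 0.2 + 0.1 * g)
y5 <- rbinom(n, size = 1, prob = 0.3)

# Explicit 0/1 indicator for "y4 or y5"
y_or <- as.numeric(y4 | y5)

fit_formula  <- lm(y4 | y5 ~ g)
fit_explicit <- lm(y_or ~ g)
all.equal(unname(coef(fit_formula)), unname(coef(fit_explicit)))

# For 0/1 variables, y4 OR y5 equals y4 + y5 - y4 * y5
all(y_or == y4 + y5 - y4 * y5)
```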

## Future Work

* Support function notation for linear combinations of phenotypes (e.g. `y1 - y2 + 0.5 * y3 ~ 1 + g + x`)  instead of requiring a separate vector of weights

* Support formulas using `.` and `-` on the right-hand side (e.g. `y1 ~ .`, `y1 ~ . - x`)

* Write a vignette

## References

The key references for the functions in this package are:

* Wolf, J.M., Westra, J., and Tintle, N. (2021). Using summary statistics to
  model multiplicative combinations of initially analyzed phenotypes with a
  flexible choice of covariates. *Frontiers in Genetics*, 25, 1962.
  [https://doi.org/10.3389/fgene.2021.745901](https://doi.org/10.3389/fgene.2021.745901).

* Wolf, J.M., Barnard, M., Xueting, X., Ryder, N., Westra, J., and Tintle, N.
  (2020). Computationally efficient, exact, covariate-adjusted genetic principal
  component analysis by leveraging individual marker summary statistics from
  large biobanks. *Pacific Symposium on Biocomputing*, 25, 719-730.
  [https://doi.org/10.1142/9789811215636_0063](https://doi.org/10.1142/9789811215636_0063).

* Gasdaska, A., Friend, D., Chen, R., Westra, J., Zawistowski, M., Lindsey, W.,
  and Tintle, N. (2019). Leveraging summary statistics to make inferences about
  complex phenotypes in large biobanks. *Pacific Symposium on Biocomputing*, 24,
  391-402.
  [https://doi.org/10.1142/9789813279827_0036](https://doi.org/10.1142/9789813279827_0036).
