Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://github.com/lieberinstitute/qsvar
Quality Surrogate Variable Analysis for Degradation Correction
https://github.com/lieberinstitute/qsvar
bioconductor brain degradation human qsva rstats
Last synced: about 1 month ago
JSON representation
Quality Surrogate Variable Analysis for Degradation Correction
- Host: GitHub
- URL: https://github.com/lieberinstitute/qsvar
- Owner: LieberInstitute
- Created: 2021-10-26T19:22:51.000Z (about 3 years ago)
- Default Branch: devel
- Last Pushed: 2024-11-26T20:48:49.000Z (about 1 month ago)
- Last Synced: 2024-11-26T21:24:10.329Z (about 1 month ago)
- Topics: bioconductor, brain, degradation, human, qsva, rstats
- Language: R
- Homepage: http://research.libd.org/qsvaR/
- Size: 9.41 MB
- Stars: 0
- Watchers: 6
- Forks: 2
- Open Issues: 15
-
Metadata Files:
- Readme: README.Rmd
- Changelog: NEWS.md
- Contributing: .github/CONTRIBUTING.md
- Code of conduct: .github/CODE_OF_CONDUCT.md
- Support: .github/SUPPORT.md
Awesome Lists containing this project
README
---
output: github_document
---```{r, include = FALSE}
knitr::opts_chunk$set(
collapse = TRUE,
comment = "#>",
fig.path = "man/figures/README-",
out.width = "100%"
)
```# qsvaR
[![Lifecycle: stable](https://img.shields.io/badge/lifecycle-stable-brightgreen.svg)](https://lifecycle.r-lib.org/articles/stages.html#stable)
[![Bioc release status](http://www.bioconductor.org/shields/build/release/bioc/qsvaR.svg)](https://bioconductor.org/checkResults/release/bioc-LATEST/qsvaR)
[![Bioc devel status](http://www.bioconductor.org/shields/build/devel/bioc/qsvaR.svg)](https://bioconductor.org/checkResults/devel/bioc-LATEST/qsvaR)
[![Bioc downloads rank](https://bioconductor.org/shields/downloads/release/qsvaR.svg)](http://bioconductor.org/packages/stats/bioc/qsvaR/)
[![Bioc support](https://bioconductor.org/shields/posts/qsvaR.svg)](https://support.bioconductor.org/tag/qsvaR)
[![Bioc history](https://bioconductor.org/shields/years-in-bioc/qsvaR.svg)](https://bioconductor.org/packages/release/bioc/html/qsvaR.html#since)
[![Bioc last commit](https://bioconductor.org/shields/lastcommit/devel/bioc/qsvaR.svg)](http://bioconductor.org/checkResults/devel/bioc-LATEST/qsvaR/)
[![Bioc dependencies](https://bioconductor.org/shields/dependencies/release/qsvaR.svg)](https://bioconductor.org/packages/release/bioc/html/qsvaR.html#since)
[![Codecov test coverage](https://codecov.io/gh/LieberInstitute/qsvaR/branch/devel/graph/badge.svg)](https://codecov.io/gh/LieberInstitute/qsvaR?branch=devel)
[![R build status](https://github.com/LieberInstitute/qsvaR/workflows/R-CMD-check-bioc/badge.svg)](https://github.com/LieberInstitute/qsvaR/actions)
[![GitHub issues](https://img.shields.io/github/issues/LieberInstitute/qsvaR)](https://github.com/LieberInstitute/qsvaR/issues)
[![GitHub pulls](https://img.shields.io/github/issues-pr/LieberInstitute/qsvaR)](https://github.com/LieberInstitute/qsvaR/pulls)
[![DOI](https://zenodo.org/badge/421556636.svg)](https://zenodo.org/badge/latestdoi/421556636)Differential expressions analysis requires the ability to normalize complex datasets. In the case of postmortem brain tissue we are tasked with removing the effects of bench degradation. The `qsvaR` package combines an established method for removing the effects of degradation from RNA-seq data with easy to use functions. It is the second iteration of the qSVA framework ([Jaffe et al, PNAS, 2017](https://doi.org/10.1073/pnas.1617384114)).
The first step in the `qsvaR` workflow is to create an [`RangedSummarizedExperiment`](https://www.rdocumentation.org/packages/SummarizedExperiment/versions/1.2.3/topics/RangedSummarizedExperiment-class) object with the transcripts identified in our qSVA experiment. If you already have a [`RangedSummarizedExperiment`](https://www.rdocumentation.org/packages/SummarizedExperiment/versions/1.2.3/topics/RangedSummarizedExperiment-class) of transcripts we can do this with the `getDegTx()` function as shown below.If not this can be generated with the [`SPEAQeasy`](http://research.libd.org/SPEAQeasy/index.html) (a RNA-seq pipeline maintained by our lab) pipeline using the `--qsva` flag. If you already have a [`RangedSummarizedExperiment`](https://www.rdocumentation.org/packages/SummarizedExperiment/versions/1.2.3/topics/RangedSummarizedExperiment-class) object with transcripts then you do not need to run [`SPEAQeasy`](http://research.libd.org/SPEAQeasy/index.html). This flag requires a full path to a text file, containing one Ensembl transcript ID per line for each transcript desired in the final transcripts R output object (called `rse_tx`). The `sig_transcripts` argument in this package should contain the same Ensembl transcript IDs as the text file for the `--qsva` flag.The goal of `qsvaR` is to provide software that can remove the effects of bench degradation from RNA-seq data.
## Installation Instructions
Get the latest stable R release from CRAN. Then install `qsvaR` using from Bioconductor the following code:
```{r "install pkg", eval=FALSE}
if (!requireNamespace("BiocManager", quietly = TRUE)) {
install.packages("BiocManager")
}BiocManager::install("qsvaR")
```And the development version from GitHub with:
```{r "install qsva devel", eval=FALSE}
BiocManager::install("LieberInstitute/qsvaR")
```## Example
This is a basic example which shows how to obtain the quality surrogate variables (qSVs) for the brainseq [phase II dataset](http://eqtl.brainseq.org/phase2). qSVs are essentially principal components from an rna-seq experiment designed to model bench degradation. For more on principal components you can read and introductory article [here](https://towardsdatascience.com/tidying-up-with-pca-an-introduction-to-principal-components-analysis-f876599af383#:~:text=The%20goal%20of%20PCA%20is,eliminate%20ones%20that%20do%20not). At the start of this script we will have an [`RangedSummarizedExperiment`](https://www.rdocumentation.org/packages/SummarizedExperiment/versions/1.2.3/topics/RangedSummarizedExperiment-class) and a list of all the transcripts found in our degradation study. At the end we will have a table with differential expression results that is adjusted for qSVs.
```{r "load pkgs", message = FALSE, warning = FALSE}
## R packages we'll use
library("qsvaR")
library("limma")
``````{r example_qsvs}
library("qsvaR")## We'll download example data from the BrainSeq Phase II project
## described at http://eqtl.brainseq.org/phase2/.
##
## We'll use BiocFileCache to cache these files so you don't have to download
## them again for other examples.
bfc <- BiocFileCache::BiocFileCache()
rse_file <- BiocFileCache::bfcrpath(
"https://s3.us-east-2.amazonaws.com/libd-brainseq2/rse_tx_unfiltered.Rdata",
x = bfc
)## Now that we have the data in our computer, we can load it.
load(rse_file, verbose = TRUE)
```In this next step, we subset to the transcripts associated with degradation.
`qsvaR` provides significant transcripts determined in four different linear
models of transcript expression against degradation time, brain region, and
potentially cell-type proportions:1. `exp ~ DegradationTime + Region`
2. `exp ~ DegradationTime * Region`
3. `exp ~ DegradationTime + Region + CellTypeProp`
4. `exp ~ DegradationTime * Region + CellTypeProp``select_transcripts()` returns degradation-associated transcripts and supports
two parameters. First, `top_n` controls how many significant transcripts to
extract from each model. When `cell_component = TRUE`, all four models are used;
otherwise, just the first two are used. The union of significant transcripts
from all used models is returned.As an example, we'll subset our `RangedSummarizedExperiment` to the union of
the top 1000 significant transcripts derived from each of the four models.```{r select_transcripts}
# Subset 'rse_tx' to the top 1000 significant transcripts from the four
# degradation models
DegTx <- getDegTx(
rse_tx,
sig_transcripts = select_transcripts(top_n = 1000, cell_component = TRUE)
)## Now we can compute the Principal Components (PCs) of the degraded
## transcripts
pcTx <- getPCs(DegTx, "tpm")
```Next we use the `k_qsvs()` function to calculate how many PCs we will need to account for the variation. A model matrix accounting for relevant variables should be used. Common variables such as Age, Sex, Race and Region are often included in the model. Again we are using our `RangedSummarizedExperiment` `DegTx` as the `rse_tx` option. Next we specify the `mod` with our `model.matrix()`. `model.matrix()` creates a design (or model) matrix, e.g., by expanding factors to a set of dummy variables (depending on the contrasts) and expanding interactions similarly. For more information on creating a design matrix for your experiment see the documentation [here](http://bioconductor.org/packages/release/workflows/vignettes/RNAseq123/inst/doc/designmatrices.html). Again we use the `assayname` option to specify the we are using the `tpm` assay, where TPM stands for _transcripts per million_.
```{r select_k}
## Using a simple statistical model we determine the number of PCs needed (k)
mod <- model.matrix(~ Dx + Age + Sex + Race + Region,
data = colData(rse_tx)
)
k <- k_qsvs(DegTx, mod, "tpm")
print(k)
```Now that we have our PCs and the number we need we can generate our qSVs.
```{r example_get_qSVs}
## Obtain the k qSVs
qsvs <- get_qsvs(pcTx, k)
dim(qsvs)
```This can be done in one step with our wrapper function `qSVA` which just combinds all the previous mentioned functions.
```{r "wrapper function"}
## Example use of the wrapper function qSVA()
qsvs_wrapper <- qSVA(
rse_tx = rse_tx,
sig_transcripts = select_transcripts(top_n = 1000, cell_component = TRUE),
mod = mod,
assayname = "tpm"
)
dim(qsvs_wrapper)
```## Differential Expression
Next we can use a standard `limma` package approach to do differential expression on the data. The key here is that we add our qSVs to the statistical model we use through `model.matrix()`.
Here we input our [`Ranged SummarizedExperiment`](https://www.rdocumentation.org/packages/SummarizedExperiment/versions/1.2.3/topics/RangedSummarizedExperiment-class) object and our `model.matrix` with qSVs. Note here that the `Ranged SummarizedExperiment` object is the original object loaded with the full list of transcripts, not the the one we subsetted for qSVs. This is because while PCs can be generated from a subset of genes, differential expression is best done on the full dataset. The expected output is a `sigTx` object that shows the results of differential expression.```{r "perform DE", warning=FALSE}
library("limma")## Add the qSVs to our statistical model
mod_qSVA <- cbind(
mod,
qsvs
)## Extract the transcript expression values and put them in the
## log2(TPM + 1) scale
txExprs <- log2(assays(rse_tx)$tpm + 1)## Run the standard linear model for differential expression
fitTx <- lmFit(txExprs, mod_qSVA)
eBTx <- eBayes(fitTx)## Extract the differential expression results
sigTx <- topTable(eBTx,
coef = 2,
p.value = 1, number = nrow(rse_tx)
)## Explore the top results
head(sigTx)
```Finally, you can compare the resulting t-statistics from your differential expression model against the degradation time t-statistics adjusting for the six different brain regions. This type of plot is called `DEqual` plot and was shown in the initial qSVA framework paper ([Jaffe et al, PNAS, 2017](https://doi.org/10.1073/pnas.1617384114)). We are really looking for two patterns exemplified here in Figure 1 (cartoon shown earlier). A direct positive correlation with degradation shown in Figure 1 on the right tells us that there is signal in the data associated with qSVs. An example of nonconfounded data or data that has been modeled can be seen in Figure 1 on the right with its lack of relationship between the x and y variables.
```{r DEqualCartoon,fig.cap="Cartoon showing patterns in DEqual plots", echo = FALSE}
knitr::include_graphics("./man/figures/DEqual_example.png")
``````{r "DEqual",fig.cap="Result of Differential Expression with qSVA normalization."}
## Generate a DEqual() plot using the model results with qSVs
DEqual(sigTx)
```For comparison, here is the `DEqual()` plot for the model without qSVs.
```{r "DEqual-no-qSVs",fig.cap="Result of Differential Expression without qSVA normalization.", warning=FALSE}
## Generate a DEqual() plot using the model results without qSVs
DEqual(topTable(eBayes(lmFit(txExprs, mod)), coef = 2, p.value = 1, number = nrow(rse_tx)))
```In these two DEqual plots we can see that the first is much better. With a correlation of -0.014 we can effectively conclude that we have removed the effects of degradation from the data. In the second plot after modeling for several common variables we still have a correlation of 0.5 with the degradation experiment. This high correlation shows we still have a large amount of signal from degradation in our data potentially confounding our case-control (SCZD vs neurotypical controls) differential expression results.