https://github.com/mikelove/avgtxlength



## Average transcript length normalization for count based methods

The count of RNA-seq reads that can be assigned to a gene depends on
a number of factors, including the abundance of that gene, the library
size, the length of the gene, sequence biases, and the distribution of
fragment lengths. Comparing counts across samples while accounting
only for library size assumes that the gene-specific factors
other than gene abundance do not change across samples: for example,
the length of the gene or the gene-specific effects of sequence
biases. These factors can instead be incorporated into a
count-based method through a gene-by-sample matrix of
normalization factors, or an offset matrix (on the log scale).
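
To illustrate why such an offset matters, here is a minimal sketch using a Poisson GLM from base R as a stand-in for the negative binomial models in DESeq2 and edgeR. All numbers and variable names are made up for illustration: a single gene with the same abundance in two conditions, whose average transcript length doubles in condition B.

```{r}
# Toy data (hypothetical): one gene, four samples, two per condition.
toy_cond <- factor(c("A", "A", "B", "B"))
toy_len  <- c(1, 1, 2, 2)          # relative average transcript length
toy_cts  <- c(100, 100, 200, 200)  # counts scale with abundance * length

# Ignoring length: the change in length appears as a fold change.
fit_plain <- glm(toy_cts ~ toy_cond, family = poisson())
# With log length as an offset: the length effect is absorbed.
fit_offset <- glm(toy_cts ~ toy_cond, offset = log(toy_len), family = poisson())

lfc_plain  <- unname(coef(fit_plain)["toy_condB"])
lfc_offset <- unname(coef(fit_offset)["toy_condB"])
c(plain = lfc_plain, offset = lfc_offset)
```

Without the offset the estimated log fold change is close to `log(2)`, and with the offset it is close to zero, which matches the unchanged abundance in this toy setup.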

Here we show how to use the per-gene, per-sample estimates of average
transcript length from the output of RSEM and Cufflinks as
normalization for the count-based methods DESeq2 and edgeR.

By gene length, we mean a gene's average transcript
length, averaged with respect to the abundance of each
transcript. For example, if a gene has two transcripts, one of length
1000 which accounts for 1/4 of the abundance, and one of length 2000
which accounts for 3/4 of the abundance, the average transcript length would be:

```{r}
0.25 * 1000 + 0.75 * 2000
```
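
The same calculation for any number of transcripts can be sketched with base R's `weighted.mean`. The helper name `avg_tx_length` is hypothetical and only used for illustration.

```{r}
# Abundance-weighted average transcript length for one gene.
# The weights need not sum to 1; weighted.mean normalizes them.
avg_tx_length <- function(lengths, abundance) {
  weighted.mean(lengths, w = abundance)
}

# Same gene as above, with abundances given as a 1:3 ratio.
avg_tx_length(c(1000, 2000), c(1, 3))
```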

## Experiment files

The files are RNA-seq runs from the UCSD Human Reference Epigenome
Mapping Project. Note that these are individual runs from experiments
which might have had multiple runs per experiment.
There are two samples from each of two tissues: left ventricle and sigmoid colon.

```{r}
samples <- list.files("rsem_out")
condition <- scan("condition.txt",what="char")
ord <- order(condition)
(samples <- samples[ord])
(condition <- condition[ord])
```

## RSEM output

The RSEM `.genes.results` files contain a column
`effective_length`. The effective length is defined by the RSEM
authors as "transcript length - mean fragment length + 1". For the
gene results, the reported effective length is averaged over the
transcripts, weighting by each transcript's share of total abundance, as described above.
All we have to do is collect this column over the samples.

```{r}
library(data.table)
# Collect the gene-level effective length from each sample's RSEM output.
rsem_list <- list()
for (i in seq_along(samples)) {
  cat(i)  # progress indicator
  rsem_raw <- fread(paste0("rsem_out/",samples[i],"/",samples[i],".genes.results"))
  rsem_list[[i]] <- data.frame(eff_length=rsem_raw$effective_length, row.names=rsem_raw$gene_id)
}
head(rsem_list[[1]])
```
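
To make the effective length definition concrete, here is a small sketch with made-up numbers: two transcripts of length 1000 and 2000, a mean fragment length of 200, and abundance shares of 1/4 and 3/4. It only applies the formula quoted above and is not meant to reproduce RSEM's internal calculation.

```{r}
toy_tx_length <- c(1000, 2000)   # hypothetical transcript lengths
toy_mean_frag <- 200             # hypothetical mean fragment length
toy_abund     <- c(0.25, 0.75)   # share of the gene's total abundance

# Per-transcript effective length: length - mean fragment length + 1
toy_eff <- toy_tx_length - toy_mean_frag + 1

# Gene-level effective length: abundance-weighted average over transcripts
toy_gene_eff <- weighted.mean(toy_eff, w = toy_abund)
toy_gene_eff
```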

We combine these columns into one data frame, checking that the
`gene_id` is consistent.

```{r}
for (i in seq_along(samples)[-1]) {
  stopifnot(all(rownames(rsem_list[[i]]) == rownames(rsem_list[[1]])))
}
rsem <- do.call(cbind, rsem_list)
rsem <- as.matrix(rsem)
colnames(rsem) <- samples
head(rsem)
```

## Cufflinks output

For Cufflinks output files, we do not have exactly the same
quantity (the average effective length), but the `isoforms.fpkm_tracking` file does
report the length of each isoform and the FPKM estimate for each
isoform. We can calculate the average gene length by averaging over
the isoforms, weighting each by the ratio of its FPKM to the total
FPKM for all isoforms of the gene.

```{r}
library(dplyr)
cuff_list <- list()
for (i in seq_along(samples)) {
  cat(i)  # progress indicator
  suppressWarnings({cuff_raw <- fread(paste0("cufflinks_out/",samples[i],"/isoforms.fpkm_tracking"))})
  # FPKM-weighted average isoform length; NaN when a gene's total FPKM is 0
  res <- cuff_raw %>% group_by(gene_id) %>% summarize(avg_length=sum(length*FPKM)/sum(FPKM))
  cuff_list[[i]] <- data.frame(avg_length=res$avg_length, row.names=res$gene_id)
}
head(cuff_list[[1]])
```
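
The FPKM-weighted average can be checked on a toy table. This sketch uses base R (`split` and `sapply`) rather than the dplyr pipeline above, and the isoform values are invented for illustration.

```{r}
# Hypothetical isoform table: g1 is expressed, g2 has zero FPKM throughout.
toy_iso <- data.frame(
  gene_id = c("g1", "g1", "g2", "g2"),
  length  = c(1000, 2000, 500, 1500),
  FPKM    = c(5, 15, 0, 0)
)

# Per gene: sum(length * FPKM) / sum(FPKM)
toy_avg <- sapply(split(toy_iso, toy_iso$gene_id),
                  function(d) sum(d$length * d$FPKM) / sum(d$FPKM))
toy_avg
```

Here `g1` gets an average length of 1750, while `g2` evaluates to `NaN` because its total FPKM is zero.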

We combine these columns into one data frame, checking that the
`gene_id` is consistent.

```{r}
for (i in seq_along(samples)[-1]) {
  stopifnot(all(rownames(cuff_list[[i]]) == rownames(cuff_list[[1]])))
}
cuff <- do.call(cbind, cuff_list)
cuff <- as.matrix(cuff)
colnames(cuff) <- samples
head(cuff)
```

## Genes with zero length

We have to remove genes with zero length to continue, since the next step takes the logarithm of the lengths.

```{r}
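# Keep genes whose length is defined and positive in every sample.
# The is.na() guard avoids NA row indices if any length is NaN
# (e.g. a Cufflinks gene with zero total FPKM).
rsem.nz <- rsem[apply(rsem, 1, function(x) all(!is.na(x) & x > 0)),]
cuff.nz <- cuff[apply(cuff, 1, function(x) all(!is.na(x) & x > 0)),]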
```
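
One thing to watch for with this kind of filter is missing values: a comparison such as `NaN > 0` returns `NA`, and an `NA` in a logical row index produces a row of `NA`s rather than dropping the gene. The following sketch, on an invented matrix, contrasts a plain positivity test with one that also checks `is.na`.

```{r}
# Hypothetical lengths: g2 has an undefined value, g3 has a zero.
toy_len_mat <- rbind(g1 = c(1750, 1600),
                     g2 = c(NaN, 900),
                     g3 = c(0, 1200))

keep_plain <- apply(toy_len_mat, 1, function(x) all(x > 0))
keep_safe  <- apply(toy_len_mat, 1, function(x) all(!is.na(x) & x > 0))

keep_plain  # g2 is NA here
keep_safe   # g2 is FALSE here
toy_kept <- toy_len_mat[keep_safe, , drop = FALSE]
toy_kept
```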

## Dividing out the geometric mean

As the count-based methods have an intercept on the log scale, and
the effect of gene length on counts is multiplicative, we can
divide each row by its geometric mean to produce a matrix which is
centered on zero on the log scale.

```{r}
norm.mat <- rsem.nz / exp(rowMeans(log(rsem.nz)))
head(norm.mat)
```
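
As a quick check of the centering, here is a sketch on a made-up length matrix: after dividing each row by its geometric mean, the row means of the logged matrix are zero up to floating point rounding.

```{r}
# Hypothetical lengths: g1 changes between sample groups, g2 is constant.
toy_lengths <- rbind(g1 = c(1000, 1000, 2000, 2000),
                     g2 = c(1500, 1500, 1500, 1500))

# Divide each row by its geometric mean, exp(mean(log(x))).
toy_norm <- toy_lengths / exp(rowMeans(log(toy_lengths)))
toy_norm

# Row means on the log scale are zero up to rounding.
toy_log_means <- rowMeans(log(toy_norm))
toy_log_means
```

A gene whose length does not change across samples, like `g2`, ends up with a factor of 1 in every sample and so has no effect on the model.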

## Input to DESeq2

Here we show the code for including these gene lengths in a
DESeq2 analysis. The `estimateSizeFactors` function corrects for
library size after taking into account the information in `norm.mat`.
For demonstration, we use a matrix of random counts in place of
the actual read counts for each gene and sample.

```{r}
n <- nrow(norm.mat)
m <- length(samples)
library(DESeq2)
cts <- matrix(rpois(n*m,lambda=100),ncol=m)
coldata <- data.frame(condition=factor(condition), row.names=samples)
dds <- DESeqDataSetFromMatrix(cts, coldata, ~ condition)
dds <- estimateSizeFactors(dds, normMatrix=norm.mat)
head(normalizationFactors(dds))
```
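
To give a rough idea of how a length matrix and a library size correction can be combined into one gene-by-sample matrix, here is a base R sketch. It reflects our understanding of the general approach (median-of-ratios size factors computed on length-corrected counts, then re-centering each row) and is an assumption, not DESeq2's actual implementation. The data and the helper `median_of_ratios` are invented for illustration.

```{r}
set.seed(1)
# Hypothetical lengths and counts for 200 genes in 4 samples.
toy_len    <- matrix(runif(200 * 4, 500, 3000), ncol = 4)
toy_nm     <- toy_len / exp(rowMeans(log(toy_len)))
toy_counts <- matrix(rpois(200 * 4, lambda = 100), ncol = 4)

# Simplified median-of-ratios size factors (assumes no zero counts).
median_of_ratios <- function(m) {
  log_geo_means <- rowMeans(log(m))
  apply(m, 2, function(col) exp(median(log(col) - log_geo_means)))
}

# Library size factors estimated after dividing out the length effect.
toy_sf <- median_of_ratios(toy_counts / toy_nm)
# Combine length and library size, then re-center each row on the log scale.
toy_nf <- t(t(toy_nm) * toy_sf)
toy_nf <- toy_nf / exp(rowMeans(log(toy_nf)))
head(toy_nf)
```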

## Input to edgeR

edgeR accepts the normalization factors as an offset matrix, which
must be on the natural log scale. Like the `normalizationFactors`
in DESeq2, the offset matrix needs to account for library size.

```{r}
library(edgeR)
o <- log(calcNormFactors(cts/norm.mat)) + log(colSums(cts/norm.mat))
y <- DGEList(cts)
y$offset <- t(t(log(norm.mat)) + o)
```
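
The structure of such an offset matrix can be seen on a tiny example in base R. In this sketch the TMM factor from `calcNormFactors` is assumed to be 1 for every sample, so only the library size term remains. The matrices are made up.

```{r}
# Hypothetical length factors and counts for 2 genes in 2 samples.
ex_norm   <- rbind(g1 = c(0.8, 1.25), g2 = c(1, 1))
ex_counts <- rbind(g1 = c(80, 250),   g2 = c(100, 200))

# Per-sample library size term on the log scale (TMM factor assumed 1).
ex_lib <- log(colSums(ex_counts / ex_norm))

# t(t(x) + v) adds v[j] to every entry of column j, so each offset is
# log(length factor) + log(library size) for that gene and sample.
ex_offset <- t(t(log(ex_norm)) + ex_lib)
ex_offset
```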

## Exploring RSEM and Cufflinks estimates

Above, we used the RSEM column `effective_length`
(where the average fragment length has been subtracted),
whereas the Cufflinks column we used was the transcript length alone.
In order to compare across software, we reload the RSEM data, this time using the
column `length`.

```{r}
rsem_list <- list()
for (i in seq_along(samples)) {
  cat(i)  # progress indicator
  rsem_raw <- fread(paste0("rsem_out/",samples[i],"/",samples[i],".genes.results"))
  # This time use `length` rather than `effective_length`
  rsem_list[[i]] <- data.frame(length=rsem_raw$length, row.names=rsem_raw$gene_id)
}
head(rsem_list[[1]])
for (i in seq_along(samples)[-1]) {
  stopifnot(all(rownames(rsem_list[[i]]) == rownames(rsem_list[[1]])))
}
rsem <- do.call(cbind, rsem_list)
rsem <- as.matrix(rsem)
colnames(rsem) <- samples
head(rsem)
```

```{r}
idx <- intersect(rownames(rsem), rownames(cuff))
length(idx)
rsem <- rsem[idx,]
cuff <- cuff[idx,]
stopifnot(identical(rownames(rsem), rownames(cuff)))
colnames(rsem) <- condition
colnames(cuff) <- condition
```

```{r scatter}
library(rafalib)
mypar(2,2)
for (i in 1:4) {
  suppressWarnings(plot(rsem[,i], cuff[,i],log="xy",cex=.3,main=samples[i],
                        xlab="rsem", ylab="cufflinks"))
  abline(0,1,col="blue")
}
```

```{r cache=TRUE, rsem_pairs}
suppressWarnings(pairs(rsem,log="xy",cex=.1))
```

```{r cache=TRUE, cuff_pairs}
suppressWarnings(pairs(cuff,log="xy",cex=.1))
```

```{r rsem_ma}
mypar()
rsem_a <- rowMeans(rsem)
rsem_m <- log2(rowMeans(rsem[,condition == "SC"])/rowMeans(rsem[,condition == "LV"]))
plot(rsem_a, rsem_m, log="x", xlab="A", ylab="M", main="rsem", cex=.3)
abline(h=0,col="blue")
```

```{r cuff_ma}
mypar()
cuff_a <- rowMeans(cuff)
cuff_m <- log2(rowMeans(cuff[,condition == "SC"])/rowMeans(cuff[,condition == "LV"]))
plot(cuff_a, cuff_m, log="x", xlab="A", ylab="M", main="cufflinks", cex=.3)
abline(h=0,col="blue")
```
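
For readers unfamiliar with the quantities plotted above, this sketch computes them on an invented length matrix: A is the mean length across all samples, and M is the log2 ratio of the mean length in SC to the mean length in LV.

```{r}
# Hypothetical lengths: g1 is twice as long in SC, g2 does not change.
toy_mat <- cbind(c(1000, 1500), c(1000, 1500), c(2000, 1500), c(2000, 1500))
rownames(toy_mat) <- c("g1", "g2")
toy_tissue <- c("LV", "LV", "SC", "SC")

toy_a <- rowMeans(toy_mat)
toy_m <- log2(rowMeans(toy_mat[, toy_tissue == "SC"]) /
              rowMeans(toy_mat[, toy_tissue == "LV"]))
rbind(A = toy_a, M = toy_m)
```

A gene with M equal to 1 has an average transcript length twice as long in SC as in LV, and a gene with M equal to 0 has the same length in both tissues.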

```{r m_vs_m}
mypar()
plot(rsem_m, cuff_m, xlab="rsem", ylab="cufflinks", main="M vs M", cex=.3)
legend("bottomright",legend=paste0("pearson=",round(cor(rsem_m,cuff_m,use="complete"),2)),inset=.05)
abline(0,1,col="blue")
```

```{r m_minus_m}
mypar()
plot((rsem_a + cuff_a)/2, rsem_m - cuff_m, xlab="(rsem A + cufflinks A)/2", ylab="rsem M - cufflinks M",
main="M minus M", cex=.3, log="x")
abline(h=0,col="blue")
```