https://github.com/vjcitn/lihc450k
demonstrate HDF5 object store backed SE for LIHC 450k data
https://github.com/vjcitn/lihc450k
Last synced: 4 months ago
JSON representation
demonstrate HDF5 object store backed SE for LIHC 450k data
- Host: GitHub
- URL: https://github.com/vjcitn/lihc450k
- Owner: vjcitn
- Created: 2018-03-08T16:51:23.000Z (over 7 years ago)
- Default Branch: master
- Last Pushed: 2018-03-15T16:51:28.000Z (over 7 years ago)
- Last Synced: 2025-01-09T13:46:46.332Z (5 months ago)
- Language: R
- Size: 18.6 KB
- Stars: 1
- Watchers: 3
- Forks: 0
- Open Issues: 0
-
Metadata Files:
- Readme: README.md
Awesome Lists containing this project
README
---
title: "patelGBMSC -- CONQUER quantification of single-cell RNA-seq in glioblastoma"
author: "Vincent J. Carey, stvjc at channing.harvard.edu"
date: "`r format(Sys.time(), '%B %d, %Y')`"
vignette: >
%\VignetteEngine{knitr::rmarkdown}
%\VignetteIndexEntry{patelGBMSC -- a single-cell RNA-seq dataset in glioblastoma}
%\VignetteEncoding{UTF-8}
output:
BiocStyle::html_document:
highlight: pygments
number_sections: yes
theme: united
toc: yes
---```{r setup,echo=FALSE,results="hide"}
suppressPackageStartupMessages({
suppressMessages({
library(patelGBMSC)
})
})
```# Introduction
[Patel et al. 2014](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4123637/)
describe a single-cell RNA-seq study of several glioblastoma
samples. The data were reprocessed with the
[CONQUER](http://imlspenticton.uzh.ch:3838/conquer/)
pipeline (see the [QC report](http://imlspenticton.uzh.ch/robinson_lab/conquer/report-multiqc/GSE57872_multiqc_report.html)).The rds file distributed by CONQUER is large as it includes
multiple gene-level and transcript-level quantifications.
As of Oct 30 2017, the CONQUER distribution does not include
sample-level information beyond the GSM identifier. This
package includes a smaller image of the data (the `count_lstpm`
quantifications, that are estimated counts created using the
salmon algorithm, rescaled to account for library size).
The data image is over 200MB, so the `r Biocpkg("BiocFileCache")`
discipline is used to perform a one-time download, insertion
and bookkeeping in cache; the `loadPatel` function takes
care of the download and retrieval from cache as appropriate.# Quick view of the data
We'll randomly sample 5000 genes to reduce runtime
in this vignette. We filter down to the 430 patient samples
that passed quality control.```{r getdat}
library(patelGBMSC)
patelGeneCount = loadPatel()
#
# use metadata on sample QC to exclude failed samples
#
qdrop = grep("excluded", patelGeneCount$description) # QC issues
patelGeneCount = patelGeneCount[,-qdrop]
#
# drop gliospheres
#
ispat = grep("MGH", patelGeneCount$characteristics_ch1)
patelGeneCount = patelGeneCount[,ispat]
patelERCCCount = patelGeneCount # save for ERCC check later
#
# drop ERCC spikeins
#
patelGeneCount = patelGeneCount[-grep("ERCC", rownames(patelGeneCount)),]
#
# keep ERCC spikeins
#
patelERCCCount = patelERCCCount[grep("ERCC", rownames(patelERCCCount)),]
#
# derive patient code
#
patelGeneCount$sampcode = factor(gsub("patient id: ", "", patelGeneCount$characteristics_ch1))
tcol = as.numeric(tfac <- factor(patelGeneCount$sampcode))
patelERCCCount$sampcode = factor(gsub("patient id: ", "", patelERCCCount$characteristics_ch1))
etcol = as.numeric(tfac <- factor(patelERCCCount$sampcode))
#
# sample 5000 genes for t-SNE
#
set.seed(1234)
samp = assay(patelGeneCount[sample(1:nrow(patelGeneCount), size=5000),])
library(Rtsne)
RTL = Rtsne(t(log(samp+1)))
myd = data.frame(ts1=RTL$Y[,1], ts2=RTL$Y[,2],
code = patelGeneCount$sampcode, tcol=tcol)
library(ggplot2)
ggplot(myd, aes(x=ts1, y=ts2, group=code, colour=code)) + geom_point() +
ggtitle("t-SNE for 5000 randomly chosen genes in five GBM scRNA samples")
```# Some views of ERCC spikeins
## t-SNE
```{r liker}
ercc = assay(patelERCCCount)
set.seed(1234)
ERTL = Rtsne(t(log(ercc+1)))
ed = data.frame(ets1=ERTL$Y[,1], ets2=ERTL$Y[,2],
code = patelERCCCount$sampcode, tcol=etcol)
ggplot(ed, aes(x=ets1, y=ets2, group=code, colour=code)) + geom_point() +
ggtitle("t-SNE for ERCC spikeins in five GBM scRNA samples")
```## PCA
```{r lkr2}
pcs = prcomp(t(log(ercc+1)))
plot(pcs$x[,1], pcs$x[,2], pch=19, col=etcol)
boxplot(split(pcs$x[,1], patelERCCCount$sampcode))
boxplot(split(pcs$x[,2], patelERCCCount$sampcode))
```