Ecosyste.ms: Awesome

An open API service indexing awesome lists of open source software.

Awesome Lists | Featured Topics | Projects

https://github.com/csoneson/rna_velocity_quant

Evaluation of the effect of quantification choices on RNA velocity estimates
https://github.com/csoneson/rna_velocity_quant

Last synced: 16 days ago
JSON representation

Evaluation of the effect of quantification choices on RNA velocity estimates

Awesome Lists containing this project

README

        

# Preprocessing choices affect RNA velocity results for droplet scRNA-seq data

This repository contains the scripts used to generate the results in

* C Soneson, A Srivastava, R Patro, MB Stadler: Preprocessing choices affect RNA velocity results for droplet scRNA-seq data. bioRxiv (2020).

## Repository structure

In this manuscript, five real scRNA-seq data sets generated by 10x Genomics technology were analyzed:

* `Dentate gyrus` ([Hochgerner _et al_. 2018](https://www.ncbi.nlm.nih.gov/pubmed/29335606)) - GEO accession [GSE95315](https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE95315)
* `Pancreas` ([Bastidas-Ponce _et al_. 2019](https://www.ncbi.nlm.nih.gov/pubmed/31160421)) - GEO accession [GSE132188](https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE132188)
* `OldBrain` ([Ximerakis _et al_. 2019](https://pubmed.ncbi.nlm.nih.gov/31551601/)) - GEO accession [GSE129788](https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE129788) (sample accession [GSM3722108](https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSM3722108))
* `PFC` ([Bhattacherjee _et al_. 2019](https://www.ncbi.nlm.nih.gov/pubmed/31519873)) - GEO accession [GSE124952](https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE124952) (sample accession [GSM3559979](https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSM3559979))
* `Spermatogenesis` ([Hermann _et al_. 2018](https://www.ncbi.nlm.nih.gov/pubmed/30404016)) - GEO accession [GSE109033](https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE109033) (sample accession [GSM2928341](https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSM2928341))

For each of these data sets, the corresponding subfolder of this repository contains the Makefile that was used to run the analysis. The R and Python scripts called in these Makefiles are provided in the `scripts/` folder.

## Data preprocessing

The downloaded data was preprocessed as follows before being used in the evaluation:

### Dentate gyrus

* FASTQ files for individual cells from the p12 and p35 time points were downloaded (accession date November 23, 2019) and merged using the [GSE95315_download.sh](dentate_gyrus_mouse/data_preprocessing/GSE95315_download.sh) script.
* The sample annotation file ([`cells_kept_in_scvelo_exampledata.csv`](dentate_gyrus_mouse/data_preprocessing/cells_kept_in_scvelo_exampledata.csv)) was obtained via the [scVelo](https://scvelo.readthedocs.io/) Python package (v0.1.24):

```
import scvelo as scv
adata = scv.datasets.dentategyrus()
adata.obs.to_csv("cells_kept_in_scvelo_exampledata.csv")
```

### Pancreas

* FASTQ files for the E15.5 sample were downloaded from [ENA](https://www.ebi.ac.uk/ena/data/view/SRR9201794) (accession date November 24, 2019)
* The FASTQ files were subsequently renamed, replacing `_X` with `S1_L001_RX_001` (for `X` = 1,2)
* The sample annotation file ([`cells_kept_in_scvelo_exampledata.csv`](pancreas_mouse/data_preprocessing/cells_kept_in_scvelo_exampledata.csv)) was obtained via the [scVelo](https://scvelo.readthedocs.io/) Python package (v0.1.24):

```
import scvelo as scv
aadata = scv.datasets.pancreatic_endocrinogenesis()
aadata.obs.to_csv("cells_kept_in_scvelo_exampledata.csv")
```

### OldBrain

* The BAM file with the reads was downloaded (accession date September 18, 2020) from [SRA](https://sra-pub-src-1.s3.amazonaws.com/SRR8895038/OX1X.bam.1)
* FASTQ files were extracted from the BAM file using [bamtofastq](https://github.com/10XGenomics/bamtofastq) v1.1.2:

```
bamtofastq_1.1.2/bamtofastq --reads-per-fastq=500000000 OX1X.bam FASTQtmp

## Concatenate FASTQ files from all flowcells and lanes
mkdir -p FASTQ
cat FASTQtmp/s9_MissingLibrary_1_HNGW3BGX2/bamtofastq_S1_L001_I1_001.fastq.gz \
FASTQtmp/s9_MissingLibrary_1_HNGW3BGX2/bamtofastq_S1_L002_I1_001.fastq.gz \
FASTQtmp/s9_MissingLibrary_1_HNGW3BGX2/bamtofastq_S1_L003_I1_001.fastq.gz \
FASTQtmp/s9_MissingLibrary_1_HNGW3BGX2/bamtofastq_S1_L004_I1_001.fastq.gz \
FASTQtmp/s9_MissingLibrary_1_HVN2NBGX2/bamtofastq_S1_L001_I1_001.fastq.gz \
FASTQtmp/s9_MissingLibrary_1_HVN2NBGX2/bamtofastq_S1_L002_I1_001.fastq.gz \
FASTQtmp/s9_MissingLibrary_1_HVN2NBGX2/bamtofastq_S1_L003_I1_001.fastq.gz \
FASTQtmp/s9_MissingLibrary_1_HVN2NBGX2/bamtofastq_S1_L004_I1_001.fastq.gz \
FASTQtmp/s9_MissingLibrary_1_HW2M3BGX2/bamtofastq_S1_L001_I1_001.fastq.gz \
FASTQtmp/s9_MissingLibrary_1_HW2M3BGX2/bamtofastq_S1_L002_I1_001.fastq.gz \
FASTQtmp/s9_MissingLibrary_1_HW2M3BGX2/bamtofastq_S1_L003_I1_001.fastq.gz \
FASTQtmp/s9_MissingLibrary_1_HW2M3BGX2/bamtofastq_S1_L004_I1_001.fastq.gz > \
FASTQ/OX1X_S1_L001_I1_001.fastq.gz

cat FASTQtmp/s9_MissingLibrary_1_HNGW3BGX2/bamtofastq_S1_L001_R1_001.fastq.gz \
FASTQtmp/s9_MissingLibrary_1_HNGW3BGX2/bamtofastq_S1_L002_R1_001.fastq.gz \
FASTQtmp/s9_MissingLibrary_1_HNGW3BGX2/bamtofastq_S1_L003_R1_001.fastq.gz \
FASTQtmp/s9_MissingLibrary_1_HNGW3BGX2/bamtofastq_S1_L004_R1_001.fastq.gz \
FASTQtmp/s9_MissingLibrary_1_HVN2NBGX2/bamtofastq_S1_L001_R1_001.fastq.gz \
FASTQtmp/s9_MissingLibrary_1_HVN2NBGX2/bamtofastq_S1_L002_R1_001.fastq.gz \
FASTQtmp/s9_MissingLibrary_1_HVN2NBGX2/bamtofastq_S1_L003_R1_001.fastq.gz \
FASTQtmp/s9_MissingLibrary_1_HVN2NBGX2/bamtofastq_S1_L004_R1_001.fastq.gz \
FASTQtmp/s9_MissingLibrary_1_HW2M3BGX2/bamtofastq_S1_L001_R1_001.fastq.gz \
FASTQtmp/s9_MissingLibrary_1_HW2M3BGX2/bamtofastq_S1_L002_R1_001.fastq.gz \
FASTQtmp/s9_MissingLibrary_1_HW2M3BGX2/bamtofastq_S1_L003_R1_001.fastq.gz \
FASTQtmp/s9_MissingLibrary_1_HW2M3BGX2/bamtofastq_S1_L004_R1_001.fastq.gz > \
FASTQ/OX1X_S1_L001_R1_001.fastq.gz

cat FASTQtmp/s9_MissingLibrary_1_HNGW3BGX2/bamtofastq_S1_L001_R2_001.fastq.gz \
FASTQtmp/s9_MissingLibrary_1_HNGW3BGX2/bamtofastq_S1_L002_R2_001.fastq.gz \
FASTQtmp/s9_MissingLibrary_1_HNGW3BGX2/bamtofastq_S1_L003_R2_001.fastq.gz \
FASTQtmp/s9_MissingLibrary_1_HNGW3BGX2/bamtofastq_S1_L004_R2_001.fastq.gz \
FASTQtmp/s9_MissingLibrary_1_HVN2NBGX2/bamtofastq_S1_L001_R2_001.fastq.gz \
FASTQtmp/s9_MissingLibrary_1_HVN2NBGX2/bamtofastq_S1_L002_R2_001.fastq.gz \
FASTQtmp/s9_MissingLibrary_1_HVN2NBGX2/bamtofastq_S1_L003_R2_001.fastq.gz \
FASTQtmp/s9_MissingLibrary_1_HVN2NBGX2/bamtofastq_S1_L004_R2_001.fastq.gz \
FASTQtmp/s9_MissingLibrary_1_HW2M3BGX2/bamtofastq_S1_L001_R2_001.fastq.gz \
FASTQtmp/s9_MissingLibrary_1_HW2M3BGX2/bamtofastq_S1_L002_R2_001.fastq.gz \
FASTQtmp/s9_MissingLibrary_1_HW2M3BGX2/bamtofastq_S1_L003_R2_001.fastq.gz \
FASTQtmp/s9_MissingLibrary_1_HW2M3BGX2/bamtofastq_S1_L004_R2_001.fastq.gz > \
FASTQ/OX1X_S1_L001_R2_001.fastq.gz
```

* The sample annotation file (`GSE129788_Supplementary_meta_data_Cell_Types_Etc.txt`) was downloaded from the [GEO record](https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE129788). It was processed to retain only cells from sample 37 using the following R code, generating the final [`cells_in_sample37.csv`](oldbrain_mouse/data_preprocessing/cells_in_sample37.csv) file.

```
suppressPackageStartupMessages({
library(dplyr)
})

csv <- read.delim("GSE129788_Supplementary_meta_data_Cell_Types_Etc.txt",
header = FALSE, as.is = TRUE, skip = 2)
colnames(csv) <- read.delim("GSE129788_Supplementary_meta_data_Cell_Types_Etc.txt",
header = FALSE, as.is = TRUE)[1, ]

## Only Sample 37
csv <- csv %>% dplyr::filter(grepl("Aging_mouse_brain_portal_data_37", NAME)) %>%
dplyr::mutate(index = gsub("Aging_mouse_brain_portal_data_37_", "", NAME)) %>%
dplyr::rename(clusters = cell_classes,
clusters_coarse = cluster) %>%
dplyr::select(index, clusters_coarse, clusters, everything())

write.table(csv, file = "cells_in_sample37.csv",
row.names = FALSE, col.names = TRUE,
sep = ",", quote = FALSE)

```

### PFC

* The BAM file with the reads was downloaded (accession date December 19, 2019) from [SRA](https://sra-pub-src-2.s3.amazonaws.com/SRR8433692/PFC_Sample2.bam.1)
* FASTQ files were extracted from the BAM file using [bamtofastq](https://github.com/10XGenomics/bamtofastq) v1.1.2:

```
bamtofastq_1.1.2/bamtofastq --reads-per-fastq=500000000 PFC_Sample2.bam FASTQtmp

## Concatenate FASTQ files from all flowcells and lanes
mkdir -p FASTQ
cat FASTQtmp/PFC_sample2_MissingLibrary_1_HJW35BCXY/bamtofastq_S1_L001_I1_001.fastq.gz \
FASTQtmp/PFC_sample2_MissingLibrary_1_HJW35BCXY/bamtofastq_S1_L002_I1_001.fastq.gz \
FASTQtmp/PFC_sample2_MissingLibrary_1_HMNNFBCXY/bamtofastq_S1_L001_I1_001.fastq.gz \
FASTQtmp/PFC_sample2_MissingLibrary_1_HMNNFBCXY/bamtofastq_S1_L002_I1_001.fastq.gz \
FASTQtmp/PFC_sample2_MissingLibrary_1_HMNTLBCXY/bamtofastq_S1_L001_I1_001.fastq.gz \
FASTQtmp/PFC_sample2_MissingLibrary_1_HMNTLBCXY/bamtofastq_S1_L002_I1_001.fastq.gz \
FASTQtmp/PFC_sample2_MissingLibrary_1_HNGGMBCXY/bamtofastq_S1_L001_I1_001.fastq.gz \
FASTQtmp/PFC_sample2_MissingLibrary_1_HNGGMBCXY/bamtofastq_S1_L002_I1_001.fastq.gz > \
FASTQ/PFCsample2_S1_L001_I1_001.fastq.gz

cat FASTQtmp/PFC_sample2_MissingLibrary_1_HJW35BCXY/bamtofastq_S1_L001_R1_001.fastq.gz \
FASTQtmp/PFC_sample2_MissingLibrary_1_HJW35BCXY/bamtofastq_S1_L002_R1_001.fastq.gz \
FASTQtmp/PFC_sample2_MissingLibrary_1_HMNNFBCXY/bamtofastq_S1_L001_R1_001.fastq.gz \
FASTQtmp/PFC_sample2_MissingLibrary_1_HMNNFBCXY/bamtofastq_S1_L002_R1_001.fastq.gz \
FASTQtmp/PFC_sample2_MissingLibrary_1_HMNTLBCXY/bamtofastq_S1_L001_R1_001.fastq.gz \
FASTQtmp/PFC_sample2_MissingLibrary_1_HMNTLBCXY/bamtofastq_S1_L002_R1_001.fastq.gz \
FASTQtmp/PFC_sample2_MissingLibrary_1_HNGGMBCXY/bamtofastq_S1_L001_R1_001.fastq.gz \
FASTQtmp/PFC_sample2_MissingLibrary_1_HNGGMBCXY/bamtofastq_S1_L002_R1_001.fastq.gz > \
FASTQ/PFCsample2_S1_L001_R1_001.fastq.gz

cat FASTQtmp/PFC_sample2_MissingLibrary_1_HJW35BCXY/bamtofastq_S1_L001_R2_001.fastq.gz \
FASTQtmp/PFC_sample2_MissingLibrary_1_HJW35BCXY/bamtofastq_S1_L002_R2_001.fastq.gz \
FASTQtmp/PFC_sample2_MissingLibrary_1_HMNNFBCXY/bamtofastq_S1_L001_R2_001.fastq.gz \
FASTQtmp/PFC_sample2_MissingLibrary_1_HMNNFBCXY/bamtofastq_S1_L002_R2_001.fastq.gz \
FASTQtmp/PFC_sample2_MissingLibrary_1_HMNTLBCXY/bamtofastq_S1_L001_R2_001.fastq.gz \
FASTQtmp/PFC_sample2_MissingLibrary_1_HMNTLBCXY/bamtofastq_S1_L002_R2_001.fastq.gz \
FASTQtmp/PFC_sample2_MissingLibrary_1_HNGGMBCXY/bamtofastq_S1_L001_R2_001.fastq.gz \
FASTQtmp/PFC_sample2_MissingLibrary_1_HNGGMBCXY/bamtofastq_S1_L002_R2_001.fastq.gz > \
FASTQ/PFCsample2_S1_L001_R2_001.fastq.gz
```

* The sample annotation file (`GSE124952_meta_data.csv`) was downloaded from the [GEO record](https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE124952). It was processed to retain only cells from PFCsample2 using the following R code, generating the final [`cells_in_pfcsample2.csv`](pfc_mouse/data_preprocessing/cells_in_pfcsample2.csv) file:

```
suppressPackageStartupMessages({
library(dplyr)
})

csv <- read.delim("GSE124952_meta_data.csv", header = TRUE, as.is = TRUE, sep = ",")

## Only Sample2
csv <- csv %>% dplyr::filter(Sample == "PFCSample2") %>%
dplyr::mutate(Sample = gsub("Sample", "sample", Sample)) %>%
dplyr::mutate(index = gsub("PFCSample2_", "", X)) %>%
dplyr::rename(clusters = CellType,
clusters_coarse = L2_clusters) %>%
dplyr::select(index, clusters_coarse, clusters, everything())

write.table(csv, file = "cells_in_pfcsample2.csv",
row.names = FALSE, col.names = TRUE,
sep = ",", quote = FALSE)
```

### Spermatogenesis

* The BAM file from the AdultMouse_Rep3 sample was downloaded (accession date January 23, 2020) from [SRA](https://sra-pub-src-1.s3.amazonaws.com/SRR6459157/AdultMouse_Rep3_possorted_genome_bam.bam.1).
* FASTQ files were extracted from the BAM file using [bamtofastq](https://github.com/10XGenomics/bamtofastq) v1.1.2:

```
bamtofastq_1.1.2/bamtofastq --reads-per-fastq=500000000 AdultMouse_Rep3_possorted_genome_bam.bam FASTQtmp
mkdir -p FASTQ
mv FASTQtmp/Ad-Ms-Total-Sorted_20k_count_MissingLibrary_1_HK2GNBBXX/bamtofastq_S1_L006_I1_001.fastq.gz FASTQ/AdultMouseRep3_S1_L001_I1_001.fastq.gz
mv FASTQtmp/Ad-Ms-Total-Sorted_20k_count_MissingLibrary_1_HK2GNBBXX/bamtofastq_S1_L006_R1_001.fastq.gz FASTQ/AdultMouseRep3_S1_L001_R1_001.fastq.gz
mv FASTQtmp/Ad-Ms-Total-Sorted_20k_count_MissingLibrary_1_HK2GNBBXX/bamtofastq_S1_L006_R2_001.fastq.gz FASTQ/AdultMouseRep3_S1_L001_R2_001.fastq.gz
```

* The sample annotation file ([`spermatogenesis_loupe_celltypes_repl3.csv`](spermatogenesis_mouse/data_preprocessing/spermatogenesis_loupe_celltypes_repl3.csv)) was obtained by downloading the loupe file `Mouse Unselected Spermatogenic cells.cloupe` from [here](https://data.mendeley.com/datasets/kxd5f8vpt4/1#file-fe79c10b-c42e-472e-9c7e-9a9873d9b3d8), loading it into the 10x Genomics [Loupe browser](https://support.10xgenomics.com/single-cell-gene-expression/software/visualization/latest/what-is-loupe-cell-browser) and exporting the "Cell Type" labels. The resulting file was subset to only include replicate 3 and cell types with at least 5 cells, using the following R code:

```
suppressPackageStartupMessages({
library(dplyr)
})

csv <- read.delim("Spermatogenesis_Loupe_CellTypes.csv",
header = TRUE, as.is = TRUE, sep = ",")

## Only Replicate 3
csv <- csv %>% dplyr::filter(grepl("-3", Barcode)) %>%
dplyr::mutate(index = gsub("-3", "", Barcode)) %>%
dplyr::mutate(clusters = Cell.Types,
clusters_coarse = Cell.Types) %>%
dplyr::select(index, clusters_coarse, clusters)

## Remove rare cell types (<5 cells)
tbl <- table(csv$clusters)
kp <- names(tbl)[tbl >= 5]
csv <- csv %>% dplyr::filter(clusters %in% kp)

write.table(csv, file = "spermatogenesis_loupe_celltypes_repl3.csv",
row.names = FALSE, col.names = TRUE,
sep = ",", quote = FALSE)
```

## Reference generation

All reference files were obtained from [GENCODE, mouse release M21](https://www.gencodegenes.org/mouse/release_M21.html).
In addition to the reference files and indices generated in the respective Makefiles, the following indices were prepared:

### CellRanger index

The `CellRanger` index was generated as follows (with `CellRanger` v3.0.2):

```
cellranger mkref --nthreads 64 \
--genome=cellranger3_0_2_GRCm38.primary_assembly_gencodeM21_spliced \
--fasta=GRCm38.primary_assembly.genome.fa \
--genes=gencode.vM21.annotation.gtf \
--ref-version=GRCm38.primary_assembly_gencodeM21_spliced
```

### STAR index

The `STAR` index used for quantification was generated using `STAR` v 2.7.3a:

```
mkdir star2_7_3a_GRCm38.primary_assembly_gencodeM21_spliced_sjdb150
STAR \
--runThreadN 48 \
--runMode genomeGenerate \
--genomeDir ./star2_7_3a_GRCm38.primary_assembly_gencodeM21_spliced_sjdb150 \
--genomeFastaFiles GRCm38.primary_assembly.genome.fa \
--genomeSAindexNbases 14 \
--genomeChrBinNbits 18 \
--sjdbGTFfile gencode.vM21.annotation.gtf \
--sjdbOverhang 150
```

### Transcript-to-gene mapping

Finally, the transcript-to-gene mapping files (in `.rds` and `.txt` format) were generated with R v3.6/Bioconductor release 3.9 as follows:

```
suppressPackageStartupMessages({
library(rtracklayer)
library(dplyr)
})

entrez <- read.delim("gencode.vM21.metadata.EntrezGene", header = FALSE, as.is = TRUE) %>%
setNames(c("transcript_id","entrez_id"))
gtf <- rtracklayer::import("gencode.vM21.annotation.gtf")
gtftx <- subset(gtf, type == "transcript")
gtfex <- subset(gtf, type == "exon")

df <- data.frame(gtftx, stringsAsFactors = FALSE) %>%
dplyr::select(transcript_id, seqnames, start, end, strand, source,
gene_id, gene_type, gene_name, level, havana_gene, transcript_type,
transcript_name, transcript_support_level, tag, havana_transcript) %>%
dplyr::left_join(data.frame(gtfex, stringsAsFactors = FALSE) %>%
dplyr::group_by(transcript_id) %>%
dplyr::summarize(transcript_length = sum(width)),
by = "transcript_id") %>%
dplyr::left_join(entrez, by = "transcript_id")

dfsub <- df %>% dplyr::select(transcript_id, gene_id)

write.table(dfsub, file = "gencode.vM21.annotation.tx2gene.txt",
sep = "\t", quote = FALSE, row.names = FALSE,
col.names = FALSE)
saveRDS(dfsub, file = "gencode.vM21.annotation.tx2gene.rds")
saveRDS(df, file = "gencode.vM21.annotation.tx2gene_full.rds")
```