https://github.com/csoneson/rna_velocity_quant

Evaluation of the effect of quantification choices on RNA velocity estimates
Host: GitHub
URL: https://github.com/csoneson/rna_velocity_quant
Owner: csoneson
License: mit
Created: 2019-10-27T16:23:20.000Z (over 5 years ago)
Default Branch: master
Last Pushed: 2021-11-06T16:14:06.000Z (over 3 years ago)
Last Synced: 2025-02-10T14:11:28.140Z (5 months ago)
Language: Makefile
Homepage:
Size: 2.39 MB
Stars: 27
Watchers: 5
Forks: 3
Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE

# Preprocessing choices affect RNA velocity results for droplet scRNA-seq data

This repository contains the scripts used to generate the results in:

* C Soneson, A Srivastava, R Patro, MB Stadler: Preprocessing choices affect RNA velocity results for droplet scRNA-seq data. bioRxiv (2020).

## Repository structure

In this manuscript, five real scRNA-seq data sets generated by 10x Genomics technology were analyzed:

* `Dentate gyrus` ([Hochgerner _et al_. 2018](https://www.ncbi.nlm.nih.gov/pubmed/29335606)) - GEO accession [GSE95315](https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE95315)

* `Pancreas` ([Bastidas-Ponce _et al_. 2019](https://www.ncbi.nlm.nih.gov/pubmed/31160421)) - GEO accession [GSE132188](https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE132188)

* `OldBrain` ([Ximerakis _et al_. 2019](https://pubmed.ncbi.nlm.nih.gov/31551601/)) - GEO accession [GSE129788](https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE129788) (sample accession [GSM3722108](https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSM3722108))

* `PFC` ([Bhattacherjee _et al_. 2019](https://www.ncbi.nlm.nih.gov/pubmed/31519873)) - GEO accession [GSE124952](https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE124952) (sample accession [GSM3559979](https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSM3559979))

* `Spermatogenesis` ([Hermann _et al_. 2018](https://www.ncbi.nlm.nih.gov/pubmed/30404016)) - GEO accession [GSE109033](https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE109033) (sample accession [GSM2928341](https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSM2928341))

For each of these data sets, the corresponding subfolder of this repository contains the Makefile that was used to run the analysis. The R and Python scripts called in these Makefiles are provided in the `scripts/` folder. 

## Data preprocessing

The downloaded data was preprocessed as follows before being used in the evaluation:

### Dentate gyrus

* FASTQ files for individual cells from the p12 and p35 time points were downloaded (accession date November 23, 2019) and merged using the [GSE95315_download.sh](dentate_gyrus_mouse/data_preprocessing/GSE95315_download.sh) script.

* The sample annotation file ([`cells_kept_in_scvelo_exampledata.csv`](dentate_gyrus_mouse/data_preprocessing/cells_kept_in_scvelo_exampledata.csv)) was obtained via the [scVelo](https://scvelo.readthedocs.io/) Python package (v0.1.24):

```
import scvelo as scv

adata = scv.datasets.dentategyrus()
adata.obs.to_csv("cells_kept_in_scvelo_exampledata.csv")
```

### Pancreas

* FASTQ files for the E15.5 sample were downloaded from [ENA](https://www.ebi.ac.uk/ena/data/view/SRR9201794) (accession date November 24, 2019)

* The FASTQ files were subsequently renamed, replacing `_X` with `S1_L001_RX_001` (for `X` = 1,2)

* The sample annotation file ([`cells_kept_in_scvelo_exampledata.csv`](pancreas_mouse/data_preprocessing/cells_kept_in_scvelo_exampledata.csv)) was obtained via the [scVelo](https://scvelo.readthedocs.io/) Python package (v0.1.24):

```
import scvelo as scv

adata = scv.datasets.pancreatic_endocrinogenesis()
adata.obs.to_csv("cells_kept_in_scvelo_exampledata.csv")
```
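The renaming step listed above can be sketched in Python as follows. This is only an illustration, not the script that was actually used: the helper name `rename_fastq` and the input names `SRR9201794_1.fastq.gz` / `SRR9201794_2.fastq.gz` are assumptions, and the rule is applied literally as stated (`_X` becomes `S1_L001_RX_001`). Taken literally, the underscore before `S1` is dropped; if the intended target follows the usual `<sample>_S1_L001_R1_001.fastq.gz` naming, a replacement string with a leading underscore can be passed instead.

```python
import re

# Matches a trailing "_1" or "_2" directly before ".fastq" or ".fastq.gz".
_READ_SUFFIX = re.compile(r"_([12])(?=\.fastq(?:\.gz)?$)")


def rename_fastq(filename, replacement=r"S1_L001_R\1_001"):
    """Hypothetical helper: replace the `_X` read suffix as described in the text.

    The default replacement applies the stated rule literally; pass
    r"_S1_L001_R\\1_001" to keep an underscore before `S1`.
    """
    new_name, n_subs = _READ_SUFFIX.subn(replacement, filename)
    if n_subs != 1:
        raise ValueError(f"unexpected FASTQ file name: {filename}")
    return new_name


if __name__ == "__main__":
    # Assumed input names; the downloaded files may be named differently.
    for name in ("SRR9201794_1.fastq.gz", "SRR9201794_2.fastq.gz"):
        print(name, "->", rename_fastq(name))
```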

### OldBrain

* The BAM file with the reads was downloaded (accession date September 18, 2020) from [SRA](https://sra-pub-src-1.s3.amazonaws.com/SRR8895038/OX1X.bam.1)

* FASTQ files were extracted from the BAM file using [bamtofastq](https://github.com/10XGenomics/bamtofastq) v1.1.2:

```
bamtofastq_1.1.2/bamtofastq --reads-per-fastq=500000000 OX1X.bam FASTQtmp

## Concatenate FASTQ files from all flowcells and lanes
mkdir -p FASTQ

cat FASTQtmp/s9_MissingLibrary_1_HNGW3BGX2/bamtofastq_S1_L001_I1_001.fastq.gz \
FASTQtmp/s9_MissingLibrary_1_HNGW3BGX2/bamtofastq_S1_L002_I1_001.fastq.gz \
FASTQtmp/s9_MissingLibrary_1_HNGW3BGX2/bamtofastq_S1_L003_I1_001.fastq.gz \
FASTQtmp/s9_MissingLibrary_1_HNGW3BGX2/bamtofastq_S1_L004_I1_001.fastq.gz \
FASTQtmp/s9_MissingLibrary_1_HVN2NBGX2/bamtofastq_S1_L001_I1_001.fastq.gz \
FASTQtmp/s9_MissingLibrary_1_HVN2NBGX2/bamtofastq_S1_L002_I1_001.fastq.gz \
FASTQtmp/s9_MissingLibrary_1_HVN2NBGX2/bamtofastq_S1_L003_I1_001.fastq.gz \
FASTQtmp/s9_MissingLibrary_1_HVN2NBGX2/bamtofastq_S1_L004_I1_001.fastq.gz \
FASTQtmp/s9_MissingLibrary_1_HW2M3BGX2/bamtofastq_S1_L001_I1_001.fastq.gz \
FASTQtmp/s9_MissingLibrary_1_HW2M3BGX2/bamtofastq_S1_L002_I1_001.fastq.gz \
FASTQtmp/s9_MissingLibrary_1_HW2M3BGX2/bamtofastq_S1_L003_I1_001.fastq.gz \
FASTQtmp/s9_MissingLibrary_1_HW2M3BGX2/bamtofastq_S1_L004_I1_001.fastq.gz > \
FASTQ/OX1X_S1_L001_I1_001.fastq.gz

cat FASTQtmp/s9_MissingLibrary_1_HNGW3BGX2/bamtofastq_S1_L001_R1_001.fastq.gz \
FASTQtmp/s9_MissingLibrary_1_HNGW3BGX2/bamtofastq_S1_L002_R1_001.fastq.gz \
FASTQtmp/s9_MissingLibrary_1_HNGW3BGX2/bamtofastq_S1_L003_R1_001.fastq.gz \
FASTQtmp/s9_MissingLibrary_1_HNGW3BGX2/bamtofastq_S1_L004_R1_001.fastq.gz \
FASTQtmp/s9_MissingLibrary_1_HVN2NBGX2/bamtofastq_S1_L001_R1_001.fastq.gz \
FASTQtmp/s9_MissingLibrary_1_HVN2NBGX2/bamtofastq_S1_L002_R1_001.fastq.gz \
FASTQtmp/s9_MissingLibrary_1_HVN2NBGX2/bamtofastq_S1_L003_R1_001.fastq.gz \
FASTQtmp/s9_MissingLibrary_1_HVN2NBGX2/bamtofastq_S1_L004_R1_001.fastq.gz \
FASTQtmp/s9_MissingLibrary_1_HW2M3BGX2/bamtofastq_S1_L001_R1_001.fastq.gz \
FASTQtmp/s9_MissingLibrary_1_HW2M3BGX2/bamtofastq_S1_L002_R1_001.fastq.gz \
FASTQtmp/s9_MissingLibrary_1_HW2M3BGX2/bamtofastq_S1_L003_R1_001.fastq.gz \
FASTQtmp/s9_MissingLibrary_1_HW2M3BGX2/bamtofastq_S1_L004_R1_001.fastq.gz > \
FASTQ/OX1X_S1_L001_R1_001.fastq.gz

cat FASTQtmp/s9_MissingLibrary_1_HNGW3BGX2/bamtofastq_S1_L001_R2_001.fastq.gz \
FASTQtmp/s9_MissingLibrary_1_HNGW3BGX2/bamtofastq_S1_L002_R2_001.fastq.gz \
FASTQtmp/s9_MissingLibrary_1_HNGW3BGX2/bamtofastq_S1_L003_R2_001.fastq.gz \
FASTQtmp/s9_MissingLibrary_1_HNGW3BGX2/bamtofastq_S1_L004_R2_001.fastq.gz \
FASTQtmp/s9_MissingLibrary_1_HVN2NBGX2/bamtofastq_S1_L001_R2_001.fastq.gz \
FASTQtmp/s9_MissingLibrary_1_HVN2NBGX2/bamtofastq_S1_L002_R2_001.fastq.gz \
FASTQtmp/s9_MissingLibrary_1_HVN2NBGX2/bamtofastq_S1_L003_R2_001.fastq.gz \
FASTQtmp/s9_MissingLibrary_1_HVN2NBGX2/bamtofastq_S1_L004_R2_001.fastq.gz \
FASTQtmp/s9_MissingLibrary_1_HW2M3BGX2/bamtofastq_S1_L001_R2_001.fastq.gz \
FASTQtmp/s9_MissingLibrary_1_HW2M3BGX2/bamtofastq_S1_L002_R2_001.fastq.gz \
FASTQtmp/s9_MissingLibrary_1_HW2M3BGX2/bamtofastq_S1_L003_R2_001.fastq.gz \
FASTQtmp/s9_MissingLibrary_1_HW2M3BGX2/bamtofastq_S1_L004_R2_001.fastq.gz > \
FASTQ/OX1X_S1_L001_R2_001.fastq.gz
```
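The explicit `cat` commands above can also be expressed as a short loop. The following Python sketch is not part of the original workflow; the function name `concatenate_fastq` and its parameters are hypothetical. It assumes the `FASTQtmp/<flowcell>/bamtofastq_S1_L00N_<read>_001.fastq.gz` layout shown above and relies on the fact that gzip files can be concatenated byte-wise, which is also what `cat` does. Inputs are sorted so that the index, R1 and R2 outputs are assembled in the same flowcell and lane order.

```python
import shutil
from pathlib import Path


def concatenate_fastq(tmp_dir, out_dir, sample, reads=("I1", "R1", "R2")):
    """Hypothetical helper: merge per-flowcell, per-lane FASTQ files per read type."""
    tmp_dir, out_dir = Path(tmp_dir), Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    outputs = {}
    for read in reads:
        # Same sorted order for every read type keeps paired reads aligned.
        inputs = sorted(tmp_dir.glob(f"*/bamtofastq_S1_L*_{read}_001.fastq.gz"))
        if not inputs:
            raise FileNotFoundError(f"no {read} FASTQ files found under {tmp_dir}")
        target = out_dir / f"{sample}_S1_L001_{read}_001.fastq.gz"
        with open(target, "wb") as out:
            for path in inputs:
                # Byte-wise copy; concatenated gzip members form a valid gzip file.
                with open(path, "rb") as src:
                    shutil.copyfileobj(src, out)
        outputs[read] = target
    return outputs
```

Under these assumptions, `concatenate_fastq("FASTQtmp", "FASTQ", "OX1X")` would write the same three output file names as the commands above.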

* The sample annotation file (`GSE129788_Supplementary_meta_data_Cell_Types_Etc.txt`) was downloaded from the [GEO record](https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE129788). It was processed to retain only cells from sample 37 using the following R code, generating the final [`cells_in_sample37.csv`](oldbrain_mouse/data_preprocessing/cells_in_sample37.csv) file:

```
suppressPackageStartupMessages({
  library(dplyr)
})

csv <- read.delim("GSE129788_Supplementary_meta_data_Cell_Types_Etc.txt",
                  header = FALSE, as.is = TRUE, skip = 2)
colnames(csv) <- read.delim("GSE129788_Supplementary_meta_data_Cell_Types_Etc.txt",
                            header = FALSE, as.is = TRUE)[1, ]

## Only Sample 37
csv <- csv %>% dplyr::filter(grepl("Aging_mouse_brain_portal_data_37", NAME)) %>%
  dplyr::mutate(index = gsub("Aging_mouse_brain_portal_data_37_", "", NAME)) %>%
  dplyr::rename(clusters = cell_classes,
                clusters_coarse = cluster) %>%
  dplyr::select(index, clusters_coarse, clusters, everything())

write.table(csv, file = "cells_in_sample37.csv",
            row.names = FALSE, col.names = TRUE,
            sep = ",", quote = FALSE)
```

### PFC

* The BAM file with the reads was downloaded (accession date December 19, 2019) from [SRA](https://sra-pub-src-2.s3.amazonaws.com/SRR8433692/PFC_Sample2.bam.1)

* FASTQ files were extracted from the BAM file using [bamtofastq](https://github.com/10XGenomics/bamtofastq) v1.1.2:

```
bamtofastq_1.1.2/bamtofastq --reads-per-fastq=500000000 PFC_Sample2.bam FASTQtmp

## Concatenate FASTQ files from all flowcells and lanes
mkdir -p FASTQ

cat FASTQtmp/PFC_sample2_MissingLibrary_1_HJW35BCXY/bamtofastq_S1_L001_I1_001.fastq.gz \
FASTQtmp/PFC_sample2_MissingLibrary_1_HJW35BCXY/bamtofastq_S1_L002_I1_001.fastq.gz \
FASTQtmp/PFC_sample2_MissingLibrary_1_HMNNFBCXY/bamtofastq_S1_L001_I1_001.fastq.gz \
FASTQtmp/PFC_sample2_MissingLibrary_1_HMNNFBCXY/bamtofastq_S1_L002_I1_001.fastq.gz \
FASTQtmp/PFC_sample2_MissingLibrary_1_HMNTLBCXY/bamtofastq_S1_L001_I1_001.fastq.gz \
FASTQtmp/PFC_sample2_MissingLibrary_1_HMNTLBCXY/bamtofastq_S1_L002_I1_001.fastq.gz \
FASTQtmp/PFC_sample2_MissingLibrary_1_HNGGMBCXY/bamtofastq_S1_L001_I1_001.fastq.gz \
FASTQtmp/PFC_sample2_MissingLibrary_1_HNGGMBCXY/bamtofastq_S1_L002_I1_001.fastq.gz > \
FASTQ/PFCsample2_S1_L001_I1_001.fastq.gz

cat FASTQtmp/PFC_sample2_MissingLibrary_1_HJW35BCXY/bamtofastq_S1_L001_R1_001.fastq.gz \
FASTQtmp/PFC_sample2_MissingLibrary_1_HJW35BCXY/bamtofastq_S1_L002_R1_001.fastq.gz \
FASTQtmp/PFC_sample2_MissingLibrary_1_HMNNFBCXY/bamtofastq_S1_L001_R1_001.fastq.gz \
FASTQtmp/PFC_sample2_MissingLibrary_1_HMNNFBCXY/bamtofastq_S1_L002_R1_001.fastq.gz \
FASTQtmp/PFC_sample2_MissingLibrary_1_HMNTLBCXY/bamtofastq_S1_L001_R1_001.fastq.gz \
FASTQtmp/PFC_sample2_MissingLibrary_1_HMNTLBCXY/bamtofastq_S1_L002_R1_001.fastq.gz \
FASTQtmp/PFC_sample2_MissingLibrary_1_HNGGMBCXY/bamtofastq_S1_L001_R1_001.fastq.gz \
FASTQtmp/PFC_sample2_MissingLibrary_1_HNGGMBCXY/bamtofastq_S1_L002_R1_001.fastq.gz > \
FASTQ/PFCsample2_S1_L001_R1_001.fastq.gz

cat FASTQtmp/PFC_sample2_MissingLibrary_1_HJW35BCXY/bamtofastq_S1_L001_R2_001.fastq.gz \
FASTQtmp/PFC_sample2_MissingLibrary_1_HJW35BCXY/bamtofastq_S1_L002_R2_001.fastq.gz \
FASTQtmp/PFC_sample2_MissingLibrary_1_HMNNFBCXY/bamtofastq_S1_L001_R2_001.fastq.gz \
FASTQtmp/PFC_sample2_MissingLibrary_1_HMNNFBCXY/bamtofastq_S1_L002_R2_001.fastq.gz \
FASTQtmp/PFC_sample2_MissingLibrary_1_HMNTLBCXY/bamtofastq_S1_L001_R2_001.fastq.gz \
FASTQtmp/PFC_sample2_MissingLibrary_1_HMNTLBCXY/bamtofastq_S1_L002_R2_001.fastq.gz \
FASTQtmp/PFC_sample2_MissingLibrary_1_HNGGMBCXY/bamtofastq_S1_L001_R2_001.fastq.gz \
FASTQtmp/PFC_sample2_MissingLibrary_1_HNGGMBCXY/bamtofastq_S1_L002_R2_001.fastq.gz > \
FASTQ/PFCsample2_S1_L001_R2_001.fastq.gz
```

* The sample annotation file (`GSE124952_meta_data.csv`) was downloaded from the [GEO record](https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE124952). It was processed to retain only cells from PFCsample2 using the following R code, generating the final [`cells_in_pfcsample2.csv`](pfc_mouse/data_preprocessing/cells_in_pfcsample2.csv) file:

```
suppressPackageStartupMessages({
  library(dplyr)
})

csv <- read.delim("GSE124952_meta_data.csv", header = TRUE, as.is = TRUE, sep = ",")

## Only Sample2
csv <- csv %>% dplyr::filter(Sample == "PFCSample2") %>%
  dplyr::mutate(Sample = gsub("Sample", "sample", Sample)) %>%
  dplyr::mutate(index = gsub("PFCSample2_", "", X)) %>%
  dplyr::rename(clusters = CellType,
                clusters_coarse = L2_clusters) %>%
  dplyr::select(index, clusters_coarse, clusters, everything())

write.table(csv, file = "cells_in_pfcsample2.csv",
            row.names = FALSE, col.names = TRUE,
            sep = ",", quote = FALSE)
```

### Spermatogenesis

* The BAM file from the AdultMouse_Rep3 sample was downloaded (accession date January 23, 2020) from [SRA](https://sra-pub-src-1.s3.amazonaws.com/SRR6459157/AdultMouse_Rep3_possorted_genome_bam.bam.1).

* FASTQ files were extracted from the BAM file using [bamtofastq](https://github.com/10XGenomics/bamtofastq) v1.1.2:

```
bamtofastq_1.1.2/bamtofastq --reads-per-fastq=500000000 AdultMouse_Rep3_possorted_genome_bam.bam FASTQtmp

mkdir -p FASTQ
mv FASTQtmp/Ad-Ms-Total-Sorted_20k_count_MissingLibrary_1_HK2GNBBXX/bamtofastq_S1_L006_I1_001.fastq.gz FASTQ/AdultMouseRep3_S1_L001_I1_001.fastq.gz
mv FASTQtmp/Ad-Ms-Total-Sorted_20k_count_MissingLibrary_1_HK2GNBBXX/bamtofastq_S1_L006_R1_001.fastq.gz FASTQ/AdultMouseRep3_S1_L001_R1_001.fastq.gz
mv FASTQtmp/Ad-Ms-Total-Sorted_20k_count_MissingLibrary_1_HK2GNBBXX/bamtofastq_S1_L006_R2_001.fastq.gz FASTQ/AdultMouseRep3_S1_L001_R2_001.fastq.gz
```

* The sample annotation file ([`spermatogenesis_loupe_celltypes_repl3.csv`](spermatogenesis_mouse/data_preprocessing/spermatogenesis_loupe_celltypes_repl3.csv)) was obtained by downloading the loupe file `Mouse Unselected Spermatogenic cells.cloupe` from [here](https://data.mendeley.com/datasets/kxd5f8vpt4/1#file-fe79c10b-c42e-472e-9c7e-9a9873d9b3d8), loading it into the 10x Genomics [Loupe browser](https://support.10xgenomics.com/single-cell-gene-expression/software/visualization/latest/what-is-loupe-cell-browser) and exporting the "Cell Type" labels. The resulting file was subset to only include replicate 3 and cell types with at least 5 cells, using the following R code:

```
suppressPackageStartupMessages({
  library(dplyr)
})

csv <- read.delim("Spermatogenesis_Loupe_CellTypes.csv",
                  header = TRUE, as.is = TRUE, sep = ",")

## Only Replicate 3
csv <- csv %>% dplyr::filter(grepl("-3", Barcode)) %>%
  dplyr::mutate(index = gsub("-3", "", Barcode)) %>%
  dplyr::mutate(clusters = Cell.Types,
                clusters_coarse = Cell.Types) %>%
  dplyr::select(index, clusters_coarse, clusters)

## Remove rare cell types (<5 cells)
tbl <- table(csv$clusters)
kp <- names(tbl)[tbl >= 5]
csv <- csv %>% dplyr::filter(clusters %in% kp)

write.table(csv, file = "spermatogenesis_loupe_celltypes_repl3.csv",
            row.names = FALSE, col.names = TRUE,
            sep = ",", quote = FALSE)
```
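To make the two filtering steps in the R code above easier to follow, here is a small Python sketch of the same logic on in-memory data. It is an illustration only and was not used in the analysis; the function name `filter_annotation` and its parameters are hypothetical. It mirrors the R code: barcodes containing `-3` are kept, the `-3` tag is stripped to form `index`, and cell types with fewer than 5 remaining cells are dropped.

```python
from collections import Counter


def filter_annotation(rows, replicate_tag="-3", min_cells=5):
    """Hypothetical helper mirroring the R filter above.

    rows: iterable of (barcode, cell_type) pairs.
    Returns a list of dicts with keys index, clusters_coarse, clusters.
    """
    # Step 1: keep only barcodes from the requested replicate and strip the tag.
    kept = [
        {
            "index": barcode.replace(replicate_tag, ""),
            "clusters_coarse": cell_type,
            "clusters": cell_type,
        }
        for barcode, cell_type in rows
        if replicate_tag in barcode
    ]
    # Step 2: count cells per type after the replicate filter, drop rare types.
    counts = Counter(row["clusters"] for row in kept)
    return [row for row in kept if counts[row["clusters"]] >= min_cells]
```

Note that the cell type counts are taken after subsetting to the replicate, as in the R code, so a cell type that is common overall but rare in replicate 3 is still removed.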

## Reference generation

All reference files were obtained from [GENCODE, mouse release M21](https://www.gencodegenes.org/mouse/release_M21.html). 

In addition to the reference files and indices generated in the respective Makefiles, the following indices were prepared:

### CellRanger index

The `CellRanger` index was generated as follows (with `CellRanger` v3.0.2):

```
cellranger mkref --nthreads 64 \
--genome=cellranger3_0_2_GRCm38.primary_assembly_gencodeM21_spliced \
--fasta=GRCm38.primary_assembly.genome.fa \
--genes=gencode.vM21.annotation.gtf \
--ref-version=GRCm38.primary_assembly_gencodeM21_spliced
```

### STAR index

The `STAR` index used for quantification was generated using `STAR` v2.7.3a:

```
mkdir star2_7_3a_GRCm38.primary_assembly_gencodeM21_spliced_sjdb150

STAR \
  --runThreadN 48 \
  --runMode genomeGenerate \
  --genomeDir ./star2_7_3a_GRCm38.primary_assembly_gencodeM21_spliced_sjdb150 \
  --genomeFastaFiles GRCm38.primary_assembly.genome.fa \
  --genomeSAindexNbases 14 \
  --genomeChrBinNbits 18 \
  --sjdbGTFfile gencode.vM21.annotation.gtf \
  --sjdbOverhang 150
```

### Transcript-to-gene mapping

Finally, the transcript-to-gene mapping files (in `.rds` and `.txt` format) were generated with R v3.6/Bioconductor release 3.9 as follows: 

```
suppressPackageStartupMessages({
  library(rtracklayer)
  library(dplyr)
})

entrez <- read.delim("gencode.vM21.metadata.EntrezGene", header = FALSE, as.is = TRUE) %>%
  setNames(c("transcript_id","entrez_id"))

gtf <- rtracklayer::import("gencode.vM21.annotation.gtf")
gtftx <- subset(gtf, type == "transcript")
gtfex <- subset(gtf, type == "exon")

df <- data.frame(gtftx, stringsAsFactors = FALSE) %>%
  dplyr::select(transcript_id, seqnames, start, end, strand, source,
                gene_id, gene_type, gene_name, level, havana_gene, transcript_type,
                transcript_name, transcript_support_level, tag, havana_transcript) %>%
  dplyr::left_join(data.frame(gtfex, stringsAsFactors = FALSE) %>%
                     dplyr::group_by(transcript_id) %>%
                     dplyr::summarize(transcript_length = sum(width)),
                   by = "transcript_id") %>%
  dplyr::left_join(entrez, by = "transcript_id")

dfsub <- df %>% dplyr::select(transcript_id, gene_id)

write.table(dfsub, file = "gencode.vM21.annotation.tx2gene.txt",
            sep = "\t", quote = FALSE, row.names = FALSE,
            col.names = FALSE)
saveRDS(dfsub, file = "gencode.vM21.annotation.tx2gene.rds")
saveRDS(df, file = "gencode.vM21.annotation.tx2gene_full.rds")
```
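Given the `write.table` settings above (`sep = "\t"`, `quote = FALSE`, `col.names = FALSE`), the `.txt` output should be a headerless, two-column, tab-separated file with transcript IDs in the first column and gene IDs in the second. A minimal Python sketch for reading such a file into a dictionary could look as follows; the function name `read_tx2gene` is hypothetical and the snippet is not part of the original workflow.

```python
import csv


def read_tx2gene(lines):
    """Hypothetical helper: parse headerless `transcript_id<TAB>gene_id` lines.

    lines: any iterable of text lines, for example an open file handle.
    Returns a dict mapping transcript ID to gene ID.
    """
    mapping = {}
    for row in csv.reader(lines, delimiter="\t"):
        if not row:
            continue  # skip empty lines
        transcript_id, gene_id = row  # exactly two columns are expected
        mapping[transcript_id] = gene_id
    return mapping
```

It could be used as `with open("gencode.vM21.annotation.tx2gene.txt", newline="") as fh: tx2gene = read_tx2gene(fh)`.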