https://github.com/mikelove/fastqctheoreticalgc

R package for generating theoretical GC content curves for the FASTQC module
https://github.com/mikelove/fastqctheoreticalgc

Last synced: 5 months ago
JSON representation

R package for generating theoretical GC content curves for the FASTQC module

Host: GitHub
URL: https://github.com/mikelove/fastqctheoreticalgc
Owner: mikelove
License: mit
Created: 2016-12-20T16:56:48.000Z (over 8 years ago)
Default Branch: master
Last Pushed: 2016-12-21T15:54:30.000Z (over 8 years ago)
Last Synced: 2025-01-01T03:41:46.652Z (5 months ago)
Language: R
Homepage:
Size: 12.7 KB
Stars: 3
Watchers: 2
Forks: 3
Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE

Awesome Lists containing this project

README

        # fastqcTheoreticalGC

R package for generating theoretical GC content curves for the FASTQC module

The `generateDistn` function will create a theoretical distribution

from a genome or transcriptome, and then write this out to a file

which can be read by MultiQC, to be plotted over the GC content 

curves. The function randomly samples n=1 million simulated reads

(default 100bp) to generate the theoretical distribution.

See `?generateDistn` for more details on the function.

See the example file `inst/script/human_mouse.R` for examples

of generating human and mouse genome and transcriptome theoretical

distributions (the output files are also included there).

An example of generating a transcriptome theoretical distribution is:

```{r}

library(BSgenome.Hsapiens.UCSC.hg38)

library(TxDb.Hsapiens.UCSC.hg38.knownGene)

set.seed(1)

hs.txs <- extractTranscriptSeqs(Hsapiens, TxDb.Hsapiens.UCSC.hg38.knownGene)

generateDistn(seqs=hs.txs,

              file="fastqc_theoretical_gc_hg38_txome.txt",

              name="Human Transcriptome (UCSC hg38)")

```

# Assumptions

The function generates reads uniformly from the sequences or chromosomes 

provided. So, for the transcriptome, the example files involve generation of 

reads from all the transcripts (with counts proportional to the length 

of the transcripts). Genes with more isoforms are contributing more reads. 

You could pick a single isoform per gene, or you could use the weights to 

generate reads proportional to estimated transcript expression from an 

RNA-seq experiment.

The reads are generated from all positions of the sequences (allowing 

for the read length). They do not take into account fragment length 

distribution or any kind of positional bias.

For the transcriptome / RNA-seq, the read starts do not take into 

account random hexamer bias, but are uniform across the transcripts.

From my personal research into DNA-seq and RNA-seq sequence bias, 

I believe that these simplifying assumptions are not of practical 

consquence for coming up with a theoretical distribution of GC content 

for quality control testing purposes.

If however, you want to plot the observed over expected GC content 

distribution for RNA-seq taking all of these other biases into acccount, 

you could use the Biconductor package [alpine](http://bioconductor.org/packages/alpine),

which implements the methods of [this paper](http://www.nature.com/nbt/journal/v34/n12/full/nbt.3682.html),

and the `plotGC` function. The estimated fragment rate over GC 

is plotted, conditioned on sample-specific transcript expression, 

positional bias, random hexamer priming bias, etc.

ecosyste.ms

Data

Tools

Indexes

Applications

Experiments

Awesome

https://github.com/mikelove/fastqctheoreticalgc

Awesome Lists containing this project

README