Ecosyste.ms: Awesome

An open API service indexing awesome lists of open source software.

Awesome Lists | Featured Topics | Projects

https://github.com/darwinawardwinner/cd4-csaw

Reproducible reanalysis of a combined ChIP-Seq & RNA-Seq data set
https://github.com/darwinawardwinner/cd4-csaw

bioconductor bioinformatics-pipeline chip-seq r reproducible-research rna-seq

Last synced: 19 days ago
JSON representation

Reproducible reanalysis of a combined ChIP-Seq & RNA-Seq data set

Awesome Lists containing this project

README

        

# Re-analysis of a combined ChIP-Seq & RNA-Seq data set

This is the code for a re-analysis of a [GEO dataset][1] that I
originally analyzed for [this paper][2] using statistical methods that
were not yet available at the time, such as the
[csaw Bioconductor package][3], which provides a principled way to
normalize windowed counts of ChIP-Seq reads and test them for
differential binding. The original paper only analyzed binding within
pre-defined promoter regions. In addition, some improvements have also
been made to the RNA-seq analysis using newer features of [limma][4]
such as quality weights.

This workflow downloads the sequence data and sample metadata from the
public GEO/SRA release, so anyone can download and run this code to
reproduce the full analysis.

## Workflow

![Rule Graph](rulegraphs/rulegraph-all.png "Rule graph of currently implemented workflow")

### Completed components

* ChIP-seq
* Mapping with bowtie2
* Peak calling with MACS2 and Epic
* Fetching of [blacklists][5] from UCSC
* Generation of greylists from ChIP-Seq input samples
* IDR analysis of blacklist-filtered peak calls
* Computation of cross-correlation function for ChIP-Seq samples,
excluding blacklisted regions
* Counting in windows across the genome
* RNA-seq
* Mapping with STAR & HISAT2
* Counting reads aligned to genes
* Alignment-free bias-corrected transcript quantification using Salmon & Kallisto
* Differential gene expression

### Possible TODO components

* Integrating RNA-seq and ChIP-seq
* hiAnnotator: http://bioconductor.org/packages/devel/bioc/html/hiAnnotator.html
* ChIPseeker: http://bioconductor.org/packages/devel/bioc/html/ChIPseeker.html
* mogsa: http://bioconductor.org/packages/release/bioc/html/mogsa.html
* Gene set tests
* ToPASeq: http://bioconductor.org/packages/devel/bioc/html/ToPASeq.html
* mvGST: http://bioconductor.org/packages/devel/bioc/html/mvGST.html
* mgsa: http://bioconductor.org/packages/release/bioc/html/mgsa.html
* QC Stuff
* ChIPQC: http://bioconductor.org/packages/release/bioc/html/ChIPQC.html
* MultiQC: http://multiqc.info/
* Rqc: http://www.bioconductor.org/packages/devel/bioc/html/Rqc.html
* mixOmics: http://mixomics.org/
* ica: https://cran.rstudio.com/web/packages/ica/index.html
* Motif enrichment
* pcaExplorer: https://bioconductor.org/packages/release/bioc/html/pcaExplorer.html

## TODO Code cleanup

* Remove unnecessary library() calls
* Put spaces around equals signs

## TODO Other

* Document how to run the pipeline
* Provide install script for R & Python packages.

## Dependencies

### Command-line tools

* [ascp](http://downloads.asperasoft.com/en/downloads/50) Aspera
download client for downloading SRA files
* [Bedtools](http://bedtools.readthedocs.io/en/latest/)
* [Bowtie2](http://bowtie-bio.sourceforge.net/bowtie2/index.shtml)
aligner
* [Epic](https://github.com/endrebak/epic) peak caller
* [fastq-tools](http://homes.cs.washington.edu/~dcjones/fastq-tools/)
* [HISAT2](https://ccb.jhu.edu/software/hisat2/index.shtml) aligner
* [IDR python script](https://github.com/nboley/idr)
* [Kallisto](https://pachterlab.github.io/kallisto/about) RNA-seq
quantifier
* [MACS2](https://github.com/taoliu/MACS) peak caller
* [Picard tools](https://broadinstitute.github.io/picard/) for various
file manipulation utilities
* [Salmon](http://salmon.readthedocs.io/en/latest/) RNA-seq quantifier
(devel version 0.7.3)
* [Shoal](https://github.com/COMBINE-lab/shoal)
* [Snakemake](https://bitbucket.org/snakemake/snakemake/wiki/Home) for
running the workflow
* [SRA toolkit](https://github.com/ncbi/sra-tools) for extracting
reads from SRA files
* [STAR](https://github.com/alexdobin/STAR) aligner
* [UCSC command-line tools](http://hgdownload.cse.ucsc.edu/downloads.html#source_downloads)
(e.g. liftOver)

### Programming languages and packages

* [R](https://www.r-project.org/),
[Bioconductor](http://bioconductor.org/), and the following R
packages:
* From [CRAN](http://cran.r-project.org/): assertthat, doParallel,
dplyr, future, getopt, GGally, ggforce, ggfortify, ggplot2, ks,
lazyeval, lubridate, magrittr, MASS, Matrix, openxlsx, optparse,
parallel, purrr, RColorBrewer, readr, reshape2, rex, scales,
stringi, stringr
* From [Bioconductor](http://bioconductor.org/): annotate,
Biobase, BiocParallel, BSgenome.Hsapiens.UCSC.hg19,
BSgenome.Hsapiens.UCSC.hg38, ChIPQC, csaw, edgeR,
GenomicFeatures, GenomicRanges, GEOquery, limma, org.Hs.eg.db,
Rsamtools, Rsubread, rtracklayer, S4Vectors, SRAdb,
SummarizedExperiment, TxDb.Hsapiens.UCSC.hg19.knownGene,
tximport
* Installed manually:
[sleuth](http://pachterlab.github.io/sleuth/about),
[wasabi](https://github.com/COMBINE-lab/wasabi)
* [Python 3](https://www.python.org/) and the following Python
packages: biopython, atomicwrites, numpy, pandas, plac, pysam, rpy2,
snakemake

[1]: http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE73214
[2]: http://www.ncbi.nlm.nih.gov/pubmed/27170561
[3]: https://bioconductor.org/packages/release/bioc/html/csaw.html
[4]: https://bioconductor.org/packages/release/bioc/html/limma.html
[5]: http://www.broadinstitute.org/~anshul/projects/encode/rawdata/blacklists/hg19-blacklist-README.pdf