https://github.com/darwinawardwinner/cd4-csaw
Reproducible reanalysis of a combined ChIP-Seq & RNA-Seq data set
bioconductor bioinformatics-pipeline chip-seq r reproducible-research rna-seq
- Host: GitHub
- URL: https://github.com/darwinawardwinner/cd4-csaw
- Owner: DarwinAwardWinner
- Created: 2016-05-03T21:32:00.000Z (over 8 years ago)
- Default Branch: master
- Last Pushed: 2019-08-09T16:53:57.000Z (about 5 years ago)
- Last Synced: 2024-10-03T10:19:39.220Z (about 1 month ago)
- Topics: bioconductor, bioinformatics-pipeline, chip-seq, r, reproducible-research, rna-seq
- Language: R
- Homepage:
- Size: 63.2 MB
- Stars: 16
- Watchers: 3
- Forks: 1
- Open Issues: 0
Metadata Files:
- Readme: README.mkdn
# Re-analysis of a combined ChIP-Seq & RNA-Seq data set
This is the code for a re-analysis of a [GEO dataset][1] that I
originally analyzed for [this paper][2]. The re-analysis uses
statistical methods that were not yet available at the time, such as
the [csaw Bioconductor package][3], which provides a principled way to
normalize windowed counts of ChIP-Seq reads and test them for
differential binding. The original paper only analyzed binding within
pre-defined promoter regions. In addition, some improvements have
been made to the RNA-seq analysis using newer features of [limma][4]
such as quality weights.

This workflow downloads the sequence data and sample metadata from the
public GEO/SRA release, so anyone can download and run this code to
reproduce the full analysis.

## Workflow
![Rule Graph](rulegraphs/rulegraph-all.png "Rule graph of currently implemented workflow")
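
To make the idea of "windowed counts" concrete, here is a toy sketch in plain Python. It is not the csaw implementation and is not code from this repository: the function name, the `width` and `spacing` parameters, and the toy read positions are all made up for illustration, and the real analysis also deals with details such as fragment extension and blacklist filtering that are ignored here.

```python
def count_reads_in_windows(read_positions, chrom_length, width=10, spacing=5):
    """Count read positions falling in each sliding window [start, start + width).

    Hypothetical helper for illustration only. Windows start every
    `spacing` bases and are truncated at the end of the chromosome.
    """
    counts = []
    for start in range(0, chrom_length, spacing):
        end = min(start + width, chrom_length)
        # Quadratic scan: fine for a toy example, far too slow for real data.
        n = sum(1 for pos in read_positions if start <= pos < end)
        counts.append((start, end, n))
    return counts


# Toy data: read start positions on a single 30 bp "chromosome".
toy_reads = [1, 3, 4, 12, 13, 14, 15, 27]
windows = count_reads_in_windows(toy_reads, chrom_length=30, width=10, spacing=5)

for start, end, n in windows:
    print(f"[{start:>2}, {end:>2}): {n}")
```

Because the windows overlap, a single read can contribute to more than one window. This is one reason windowed counts need dedicated normalization and testing methods instead of being treated as independent observations.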
### Completed components
* ChIP-seq
    * Mapping with bowtie2
    * Peak calling with MACS2 and Epic
    * Fetching of [blacklists][5] from UCSC
    * Generation of greylists from ChIP-Seq input samples
    * IDR analysis of blacklist-filtered peak calls
    * Computation of cross-correlation function for ChIP-Seq samples,
      excluding blacklisted regions
    * Counting in windows across the genome
* RNA-seq
    * Mapping with STAR & HISAT2
    * Counting reads aligned to genes
    * Alignment-free bias-corrected transcript quantification using Salmon & Kallisto
    * Differential gene expression

### Possible TODO components
* Integrating RNA-seq and ChIP-seq
    * hiAnnotator: http://bioconductor.org/packages/devel/bioc/html/hiAnnotator.html
    * ChIPseeker: http://bioconductor.org/packages/devel/bioc/html/ChIPseeker.html
    * mogsa: http://bioconductor.org/packages/release/bioc/html/mogsa.html
* Gene set tests
    * ToPASeq: http://bioconductor.org/packages/devel/bioc/html/ToPASeq.html
    * mvGST: http://bioconductor.org/packages/devel/bioc/html/mvGST.html
    * mgsa: http://bioconductor.org/packages/release/bioc/html/mgsa.html
* QC Stuff
    * ChIPQC: http://bioconductor.org/packages/release/bioc/html/ChIPQC.html
    * MultiQC: http://multiqc.info/
    * Rqc: http://www.bioconductor.org/packages/devel/bioc/html/Rqc.html
* mixOmics: http://mixomics.org/
* ica: https://cran.rstudio.com/web/packages/ica/index.html
* Motif enrichment
* pcaExplorer: https://bioconductor.org/packages/release/bioc/html/pcaExplorer.html

## TODO Code cleanup
* Remove unnecessary library() calls
* Put spaces around equals signs

## TODO Other

* Document how to run the pipeline
* Provide install script for R & Python packages.

## Dependencies
### Command-line tools
* [ascp](http://downloads.asperasoft.com/en/downloads/50) Aspera
download client for downloading SRA files
* [Bedtools](http://bedtools.readthedocs.io/en/latest/)
* [Bowtie2](http://bowtie-bio.sourceforge.net/bowtie2/index.shtml)
aligner
* [Epic](https://github.com/endrebak/epic) peak caller
* [fastq-tools](http://homes.cs.washington.edu/~dcjones/fastq-tools/)
* [HISAT2](https://ccb.jhu.edu/software/hisat2/index.shtml) aligner
* [IDR python script](https://github.com/nboley/idr)
* [Kallisto](https://pachterlab.github.io/kallisto/about) RNA-seq
quantifier
* [MACS2](https://github.com/taoliu/MACS) peak caller
* [Picard tools](https://broadinstitute.github.io/picard/) for various
file manipulation utilities
* [Salmon](http://salmon.readthedocs.io/en/latest/) RNA-seq quantifier
(devel version 0.7.3)
* [Shoal](https://github.com/COMBINE-lab/shoal)
* [Snakemake](https://bitbucket.org/snakemake/snakemake/wiki/Home) for
running the workflow
* [SRA toolkit](https://github.com/ncbi/sra-tools) for extracting
reads from SRA files
* [STAR](https://github.com/alexdobin/STAR) aligner
* [UCSC command-line tools](http://hgdownload.cse.ucsc.edu/downloads.html#source_downloads)
(e.g. liftOver)

### Programming languages and packages
* [R](https://www.r-project.org/),
  [Bioconductor](http://bioconductor.org/), and the following R
  packages:
    * From [CRAN](http://cran.r-project.org/): assertthat, doParallel,
      dplyr, future, getopt, GGally, ggforce, ggfortify, ggplot2, ks,
      lazyeval, lubridate, magrittr, MASS, Matrix, openxlsx, optparse,
      parallel, purrr, RColorBrewer, readr, reshape2, rex, scales,
      stringi, stringr
    * From [Bioconductor](http://bioconductor.org/): annotate,
      Biobase, BiocParallel, BSgenome.Hsapiens.UCSC.hg19,
      BSgenome.Hsapiens.UCSC.hg38, ChIPQC, csaw, edgeR,
      GenomicFeatures, GenomicRanges, GEOquery, limma, org.Hs.eg.db,
      Rsamtools, Rsubread, rtracklayer, S4Vectors, SRAdb,
      SummarizedExperiment, TxDb.Hsapiens.UCSC.hg19.knownGene,
      tximport
    * Installed manually:
      [sleuth](http://pachterlab.github.io/sleuth/about),
      [wasabi](https://github.com/COMBINE-lab/wasabi)
* [Python 3](https://www.python.org/) and the following Python
  packages: biopython, atomicwrites, numpy, pandas, plac, pysam, rpy2,
  snakemake

[1]: http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE73214
[2]: http://www.ncbi.nlm.nih.gov/pubmed/27170561
[3]: https://bioconductor.org/packages/release/bioc/html/csaw.html
[4]: https://bioconductor.org/packages/release/bioc/html/limma.html
[5]: http://www.broadinstitute.org/~anshul/projects/encode/rawdata/blacklists/hg19-blacklist-README.pdf
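
Until a proper install script exists, a quick check along the following lines can report which of the dependencies listed above appear to be missing. This is only a sketch and not part of the repository. The executable names (for example `fastq-dump` for the SRA toolkit, or `liftOver` for the UCSC tools) and the mapping from Python package names to import names are assumptions that may not match your installation. Tools that are not normally run as a single executable on the `PATH`, such as Picard, are left out, as are the R packages.

```python
import importlib.util
import shutil

# ASSUMPTION: executable names as they are commonly installed; adjust as needed.
COMMAND_LINE_TOOLS = [
    "ascp", "bedtools", "bowtie2", "epic", "hisat2", "idr", "kallisto",
    "macs2", "salmon", "snakemake", "fastq-dump", "STAR", "liftOver",
]

# ASSUMPTION: package name -> import name (biopython is imported as "Bio").
PYTHON_MODULES = {
    "biopython": "Bio",
    "atomicwrites": "atomicwrites",
    "numpy": "numpy",
    "pandas": "pandas",
    "plac": "plac",
    "pysam": "pysam",
    "rpy2": "rpy2",
    "snakemake": "snakemake",
}


def find_missing_tools(tools):
    """Return the executables from `tools` that are not found on the PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]


def find_missing_modules(modules):
    """Return the package names whose import name cannot be located."""
    return [
        package
        for package, import_name in modules.items()
        if importlib.util.find_spec(import_name) is None
    ]


if __name__ == "__main__":
    missing_tools = find_missing_tools(COMMAND_LINE_TOOLS)
    missing_modules = find_missing_modules(PYTHON_MODULES)
    print("Missing command-line tools:", ", ".join(missing_tools) or "none")
    print("Missing Python packages:", ", ".join(missing_modules) or "none")
```

The check only tests whether each tool or module can be found. It does not verify versions, so a version-specific requirement such as the Salmon devel release mentioned above still needs to be confirmed by hand.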