https://github.com/neurogenomics/epicompare

Comparison, benchmarking & QC of epigenetic datasets
https://github.com/neurogenomics/epicompare

benchmark benchmarking bioconductor bioconductor-package comparison epigenetics genetics html interactive-reporting r-package

Host: GitHub
URL: https://github.com/neurogenomics/epicompare
Owner: neurogenomics
Created: 2021-10-18T11:53:37.000Z (over 3 years ago)
Default Branch: master
Last Pushed: 2024-04-25T16:08:17.000Z (about 1 year ago)
Last Synced: 2024-05-09T13:38:23.107Z (about 1 year ago)
Topics: benchmark, benchmarking, bioconductor, bioconductor-package, comparison, epigenetics, genetics, html, interactive-reporting, r-package
Language: HTML
Homepage: https://doi.org/doi:10.18129/B9.bioc.EpiCompare
Size: 29.7 MB
Stars: 12
Watchers: 3
Forks: 3
Open Issues: 15
Metadata Files:
- Readme: README.Rmd
- Changelog: NEWS.md

README

---
title: "⚖EpiCompare⚖
QC and Benchmarking of Epigenomic Datasets"
author: "`r rworkflows::use_badges(add_doi = 'https://doi.org/10.1101/2022.07.22.501149',
add_bioc_release = TRUE,
add_bioc_download_month = TRUE,
add_bioc_download_total = TRUE,
add_bioc_download_rank = TRUE)`"
date: "

Updated: `r format(Sys.Date(), '%b-%d-%Y')`

"
output:
  github_document
---

```{r, echo=FALSE, include=FALSE}
pkg <- read.dcf("DESCRIPTION", fields = "Package")[1]
title <- read.dcf("DESCRIPTION", fields = "Title")[1]
description <- read.dcf("DESCRIPTION", fields = "Description")[1]
URL <- read.dcf('DESCRIPTION', fields = 'URL')[1]
owner <- tolower(strsplit(URL,"/")[[1]][4])
```

# Introduction

`EpiCompare` is an R package for comparing multiple epigenomic datasets
for quality control and benchmarking purposes. The function outputs a
report in HTML format consisting of three sections:

1. **General Metrics**: Metrics on peaks (percentage of blacklisted and
non-standard peaks, and peak widths) and fragments (duplication
rate) of samples.
2. **Peak Overlap**: Frequency, percentage, statistical significance of
overlapping and non-overlapping peaks. This also includes Upset,
precision-recall and correlation plots.
3. **Functional Annotation**: Functional annotation (ChromHMM, ChIPseeker
and enrichment analysis) of peaks. Also includes peak enrichment
around Transcription Start Site.

*Note*: Peaks located in blacklisted regions and non-standard chromosomes are
removed from the files prior to analysis.
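
To make the note above concrete, here is a toy base-R sketch of the kind of filtering it describes. It is not `EpiCompare`'s implementation (the package works on `GRanges` objects). The data frames, the coordinates and the definition of standard chromosomes as chr1 to chr22, chrX and chrY are assumptions made for illustration.

```r
# Toy peaks and blacklist as plain data frames (made-up coordinates).
peaks <- data.frame(
  chr   = c("chr1", "chr1", "chr2", "chrUn_gl000220"),
  start = c(100, 5000, 300, 50),
  end   = c(200, 5200, 400, 150)
)
blacklist <- data.frame(chr = "chr1", start = 150, end = 1000)

# Assumed definition of standard human chromosomes.
standard_chr <- paste0("chr", c(1:22, "X", "Y"))

# TRUE when a peak overlaps any blacklisted interval on the same chromosome.
overlaps_blacklist <- vapply(seq_len(nrow(peaks)), function(i) {
  any(blacklist$chr == peaks$chr[i] &
        blacklist$start <= peaks$end[i] &
        blacklist$end >= peaks$start[i])
}, logical(1))
is_standard <- peaks$chr %in% standard_chr

# Percentages of the kind reported under General Metrics.
pct_blacklisted  <- 100 * mean(overlaps_blacklist)
pct_non_standard <- 100 * mean(!is_standard)

# Peaks that would go forward to the analysis.
kept <- peaks[is_standard & !overlaps_blacklist, ]
print(kept)
```

In this toy example one peak is dropped for overlapping the blacklist and one for sitting on a non-standard contig, leaving two peaks.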

# Installation

## Standard

To install `EpiCompare` use:

```r
if (!require("BiocManager", quietly = TRUE)) install.packages("BiocManager")
BiocManager::install("EpiCompare")
```

## All dependencies

Installing all *Imports* and *Suggests* will allow you to use the full functionality of `EpiCompare` right away, without having to stop and install extra dependencies later on.

To install these packages as well, use:

```R
BiocManager::install("EpiCompare", dependencies=TRUE)
```

Note that this increases installation time, but you will not need to install
additional R packages later when using functions that rely on suggested
dependencies.

## Development

To install the development version of `EpiCompare`, use:

```R
if (!require("remotes")) install.packages("remotes")
remotes::install_github("neurogenomics/EpiCompare")
```

## Citation

If you use ``r pkg``, please cite:

> `r citation(pkg)$textVersion`

# Documentation

## [EpiCompare website](https://neurogenomics.github.io/EpiCompare)
## [Docker/Singularity container](https://neurogenomics.github.io/EpiCompare/articles/docker)
## [Bioconductor page](https://doi.org/doi:10.18129/B9.bioc.EpiCompare)

### :warning: Note on documentation versioning

The documentation in this README and the [GitHub Pages website](https://neurogenomics.github.io/EpiCompare/)
pertains to the *development* version of `EpiCompare`.
Older versions of `EpiCompare` may have slightly different documentation
(e.g. available functions, parameters). For documentation in older versions of
`EpiCompare`, please see the **Documentation** section of the relevant
version on [Bioconductor](https://doi.org/doi:10.18129/B9.bioc.EpiCompare).

# Usage

Load the package and example datasets:

```r
library(EpiCompare)
data("encode_H3K27ac") # example peakfile
data("CnT_H3K27ac") # example peakfile
data("CnR_H3K27ac") # example peakfile
data("CnT_H3K27ac_picard") # example Picard summary output
data("CnR_H3K27ac_picard") # example Picard summary output
```

Prepare input files:

```r
# create named list of peakfiles
peakfiles <- list("CnT"=CnT_H3K27ac,
                  "CnR"=CnR_H3K27ac)
# set reference file and name
reference <- list("ENCODE_H3K27ac" = encode_H3K27ac)
# create named list of Picard summary
picard_files <- list("CnT"=CnT_H3K27ac_picard,
                     "CnR"=CnR_H3K27ac_picard)
```

Tips on importing user-supplied files:

`EpiCompare::gather_files` is helpful for identifying and importing
peak or Picard files.
```r
# To import BED files as GRanges object
peakfiles <- EpiCompare::gather_files(dir = "path/to/peaks/",
                                      type = "peaks.stringent")
# EpiCompare alternatively accepts paths (to BED files) as input
peakfiles <- list(sample1="/path/to/peaks/file1_peaks.stringent.bed",
                  sample2="/path/to/peaks/file2_peaks.stringent.bed")
# To import Picard summary output txt file as data frame
picard_files <- EpiCompare::gather_files(dir = "path/to/peaks",
                                         type = "picard")
```

Run `EpiCompare()`:

```r
EpiCompare::EpiCompare(peakfiles = peakfiles,
                       genome_build = list(peakfiles="hg19",
                                           reference="hg38"),
                       genome_build_output = "hg19",
                       picard_files = picard_files,
                       reference = reference,
                       run_all = TRUE,
                       output_dir = tempdir())
```

#### Required Inputs

These input parameters must be provided:

- `peakfiles` : Peak files you want to analyse. `EpiCompare` accepts
peak files as `GRanges` objects and/or as paths to BED files. Files must
be listed and named using `list()`.
E.g. `list("name1"=peakfile1, "name2"=peakfile2)`.
- `genome_build` : A named list indicating the human genome build used to
generate each of the following inputs:
    - `peakfiles` : Genome build for the `peakfiles` input. Assumes the genome
    build is the same for each element in the `peakfiles` list.
    - `reference` : Genome build for the `reference` input.
    - `blacklist` : Genome build for the `blacklist` input.

  E.g. `genome_build = list(peakfiles="hg38", reference="hg19", blacklist="hg19")`
- `genome_build_output` : Genome build to standardise all inputs to. Liftovers
will be performed automatically as needed. Default is "hg19".
- `blacklist` : Peak file as a `GRanges` object specifying genomic regions
that have anomalous and/or unstructured signals independent of the
cell line or experiment. For the human hg19 and hg38 genomes, use the
built-in data `data(hg19_blacklist)` and `data(hg38_blacklist)`
respectively. For the mouse mm10 genome, use the built-in data `data(mm10_blacklist)`.
- `output_dir` : Path to the directory where all
`EpiCompare` outputs will be saved.
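
As a rough sketch of how these inputs fit together, the call below passes each of them explicitly, including a blacklist, using the example datasets shipped with the package. It assumes the example peak files are hg19 and the ENCODE reference is hg38, as in the Usage call above. The `reference` argument is optional and is included only so that the `reference` entry of `genome_build` has something to describe. The `interactive()` guard is there only because building the report can take a while.

```r
# Genome build of each input; names mirror the corresponding arguments.
genome_build <- list(peakfiles = "hg19",
                     reference = "hg38",
                     blacklist = "hg19")

# Run only when EpiCompare is installed and the session is interactive.
if (interactive() && requireNamespace("EpiCompare", quietly = TRUE)) {
  library(EpiCompare)
  data("CnT_H3K27ac", "CnR_H3K27ac", "encode_H3K27ac", "hg19_blacklist")

  EpiCompare(peakfiles = list("CnT" = CnT_H3K27ac,
                              "CnR" = CnR_H3K27ac),
             genome_build = genome_build,
             genome_build_output = "hg19",
             blacklist = hg19_blacklist,
             reference = list("ENCODE_H3K27ac" = encode_H3K27ac),
             output_dir = tempdir())
}
```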

#### Optional Inputs

The following input files are optional:

- `picard_files` : A list of summary metrics output from
[Picard](https://broadinstitute.github.io/picard/). *Picard MarkDuplicates*
can be used to identify duplicate reads in the alignment. This tool
generates a summary output, normally with the ending
*.markdup.MarkDuplicates.metrics.txt*. If this input is provided, metrics on
fragments (e.g. mapped fragments and duplication rate) will be included
in the report. Files must be in data.frame format, listed using `list()`
and named using `names()`. To import Picard duplication metrics (.txt file)
into R as a data frame, use
`picard <- read.table("/path/to/picard/output", header = TRUE, fill = TRUE)`.
- `reference` : Reference peak file(s) used in `stat_plot` and
`chromHMM_plot`. Each file must be a `GRanges` object, listed and named
using `list("reference_name" = GRanges_object)`. If more than one reference
is specified, `EpiCompare` outputs an individual report for each reference.
However, please note that this can take a while.
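
When there are several samples, the `read.table()` call above can be wrapped to build the named list in one step. The sketch below is only an illustration: `read_picard_list()` is a hypothetical helper, not an `EpiCompare` function, and the demo metrics file is a made-up, heavily simplified stand-in for real Picard output.

```r
# Hypothetical helper (not part of EpiCompare): read Picard MarkDuplicates
# metrics files into a named list of data frames.
read_picard_list <- function(paths) {
  # Every path needs a name; it becomes the sample label in the list.
  stopifnot(!is.null(names(paths)), all(nzchar(names(paths))))
  lapply(paths, function(p) read.table(p, header = TRUE, fill = TRUE))
}

# Self-contained demo with a made-up, simplified metrics file.
demo_file <- tempfile(fileext = ".txt")
writeLines(c("## METRICS CLASS (lines starting with # are skipped)",
             "LIBRARY\tREAD_PAIRS_EXAMINED\tPERCENT_DUPLICATION",
             "lib1\t1000\t0.12"),
           demo_file)

picard_files <- read_picard_list(c("sample1" = demo_file))
str(picard_files)
```

With real data, the input would be a named character vector of metrics file paths, one per sample, and the names should match those used in `peakfiles`.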

#### Optional Plots

By default, these plots will not be included in the report unless set to `TRUE`.
To turn on all features at once, simply use the `run_all=TRUE` argument:

- `upset_plot` : Upset plot of overlapping peaks between samples.
- `stat_plot` : Included only if a `reference` dataset is provided.
The plot shows the statistical significance (p/q-values) of sample peaks
that overlap or do not overlap with the `reference` dataset.
- `chromHMM_plot` : ChromHMM annotation of peaks. If a `reference`
dataset is provided, ChromHMM annotation of overlapping and
non-overlapping peaks with the `reference` is also included in the
report.
- `chipseeker_plot` : ChIPseeker annotation of peaks.
- `enrichment_plot` : KEGG pathway and GO enrichment analysis of
peaks.
- `tss_plot` : Peak frequency around (+/- 3000bp) transcription
start sites. Note that this plot may take a while to generate for
large sample sizes.
- `precision_recall_plot` : Plot showing the precision-recall score across
the peak calling stringency thresholds.
- `corr_plot` : Plot showing the correlation between the quantiles when the
genome is binned at a set size. These quantiles are based on the intensity
of the peak, dependent on the peak caller used (q-value for MACS2).
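
Instead of `run_all=TRUE`, individual plots can be switched on one by one. The sketch below collects a few of the flags listed above in a list and appends them to the other arguments with `do.call()`. The choice of flags is arbitrary, the genome builds follow the Usage call above, and the `interactive()` guard is there only because the report can take a while to build.

```r
# Plot switches to turn on; flags left out keep their default (off).
plot_flags <- list(upset_plot = TRUE,
                   stat_plot = TRUE,        # needs a reference
                   chipseeker_plot = TRUE)

# Run only when EpiCompare is installed and the session is interactive.
if (interactive() && requireNamespace("EpiCompare", quietly = TRUE)) {
  library(EpiCompare)
  data("CnT_H3K27ac", "CnR_H3K27ac", "encode_H3K27ac")

  base_args <- list(peakfiles = list("CnT" = CnT_H3K27ac,
                                     "CnR" = CnR_H3K27ac),
                    genome_build = list(peakfiles = "hg19",
                                        reference = "hg38"),
                    genome_build_output = "hg19",
                    reference = list("ENCODE_H3K27ac" = encode_H3K27ac),
                    output_dir = tempdir())

  # Combine the base arguments with the selected plot flags.
  do.call(EpiCompare, c(base_args, plot_flags))
}
```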

#### Other Options

- `chromHMM_annotation` : Cell-line annotation for ChromHMM. Default
is K562. Options are:
    - "K562" = K-562 cells
    - "Gm12878" = Cellosaurus cell-line GM12878
    - "H1hesc" = H1 Human Embryonic Stem Cell
    - "Hepg2" = Hep G2 cell
    - "Hmec" = Human Mammary Epithelial Cell
    - "Hsmm" = Human Skeletal Muscle Myoblasts
    - "Huvec" = Human Umbilical Vein Endothelial Cells
    - "Nhek" = Normal Human Epidermal Keratinocytes
    - "Nhlf" = Normal Human Lung Fibroblasts
- `interact` : By default, all heatmaps (percentage overlap and
ChromHMM heatmaps) in the report are interactive. If set to `FALSE`,
all heatmaps will be static. N.B. If `interact=TRUE`, interactive
heatmaps are saved as HTML files, which may take time for larger
sample sizes.
- `output_filename` : By default, the report is named *EpiCompare.html*.
You can specify the file name of the report here.
- `output_timestamp` : Default is `FALSE`. If `TRUE`, the filename of the
report includes the date.
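
The ChromHMM labels above use a specific spelling (e.g. "Gm12878" rather than "GM12878"), so it can help to check a label before starting a long run. The sketch below does this with base R's `match.arg()`. `pick_annotation()` is a hypothetical helper, not part of `EpiCompare`, and whether the package itself is case-sensitive is not stated here.

```r
# Labels exactly as listed above.
chromHMM_options <- c("K562", "Gm12878", "H1hesc", "Hepg2", "Hmec",
                      "Hsmm", "Huvec", "Nhek", "Nhlf")

# Hypothetical helper: returns the label if it is one of the listed
# options, otherwise stops with an error naming the valid choices.
pick_annotation <- function(x = chromHMM_options) {
  match.arg(x, chromHMM_options)
}

# Options that could then be passed on to EpiCompare().
other_options <- list(chromHMM_annotation = pick_annotation("Hepg2"),
                      interact = FALSE,         # static heatmaps
                      output_timestamp = TRUE)  # date in the report name
str(other_options)
```

Calling `pick_annotation()` with no argument returns the first option, "K562", which matches the documented default.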

#### Outputs

`EpiCompare` outputs the following:

1. **HTML report**: A summary of all analyses, saved in the specified
`output_dir`.
2. **EpiCompare_file**: If `save_output=TRUE`, all plots generated by
`EpiCompare` are saved in the *EpiCompare_file* directory, also within the
specified `output_dir`.

An example report comparing ATAC-seq and DNase-seq can be found
[here](https://neurogenomics.github.io/EpiCompare/articles/example_report).
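
After a run, the outputs can be located with a few lines of base R. This sketch assumes the report was written directly into `output_dir` under its default name *EpiCompare.html* (that is, `output_filename` and `output_timestamp` were left at their defaults); adjust the paths if your run used other settings.

```r
# Same directory that was passed to EpiCompare() as output_dir.
output_dir <- tempdir()

# Assumed location of the report under default settings.
report <- file.path(output_dir, "EpiCompare.html")
if (file.exists(report)) {
  if (interactive()) utils::browseURL(report)  # open in the default browser
} else {
  message("No report found at ", report)
}

# Plots are only saved when save_output = TRUE.
plot_dir <- file.path(output_dir, "EpiCompare_file")
saved_plots <- if (dir.exists(plot_dir)) list.files(plot_dir) else character(0)
print(saved_plots)
```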

## Datasets

`EpiCompare` includes several built-in datasets:

- `encode_H3K27ac`: Human H3K27ac peak file generated with ChIP-seq in the K562
cell line. Taken from the [ENCODE](https://www.encodeproject.org/files/ENCFF044JNJ/)
project. For more information, run `?encode_H3K27ac`.
- `CnT_H3K27ac`: Human H3K27ac peak file generated with CUT&Tag in the K562
cell line, from [Kaya-Okur et al. (2019)](https://trace.ncbi.nlm.nih.gov/Traces/sra/?run=SRR8383507). For more
information, run `?CnT_H3K27ac`.
- `CnR_H3K27ac`: Human H3K27ac peak file generated with CUT&Run in the K562
cell line, from [Meers et al. (2019)](https://trace.ncbi.nlm.nih.gov/Traces/sra/?run=SRR8581604).
For more information, run `?CnR_H3K27ac`.

## Contact

### [Neurogenomics Lab](https://www.neurogenomics.co.uk/inst/report/EpiCompare.html)

UK Dementia Research Institute
Department of Brain Sciences
Faculty of Medicine
Imperial College London
[GitHub](https://github.com/neurogenomics)
[DockerHub](https://hub.docker.com/orgs/neurogenomicslab)

## Session Info

```{r Session Info}
utils::sessionInfo()
```
