{"id":13763665,"url":"https://github.com/ay-lab/ATACProc","last_synced_at":"2025-05-10T17:30:51.719Z","repository":{"id":112915460,"uuid":"146371093","full_name":"ay-lab/ATACProc","owner":"ay-lab","description":"ATAC-seq processing pipeline","archived":false,"fork":false,"pushed_at":"2022-04-08T18:49:52.000Z","size":141,"stargazers_count":27,"open_issues_count":2,"forks_count":13,"subscribers_count":4,"default_branch":"master","last_synced_at":"2024-02-12T23:45:45.269Z","etag":null,"topics":["atac-seq","atacseqqc","macs2","qc"],"latest_commit_sha":null,"homepage":null,"language":"Shell","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/ay-lab.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null}},"created_at":"2018-08-28T00:47:21.000Z","updated_at":"2024-01-25T16:40:59.000Z","dependencies_parsed_at":null,"dependency_job_id":"7c75bf2c-7672-48ff-94c5-fdd446901f61","html_url":"https://github.com/ay-lab/ATACProc","commit_stats":null,"previous_names":[],"tags_count":2,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ay-lab%2FATACProc","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ay-lab%2FATACProc/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ay-lab%2FATACProc/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ay-lab%2FATACProc/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/ay-lab","download_url":"https://codeload.github.com/ay-lab/ATACProc/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com",
"kind":"github","repositories_count":253453198,"owners_count":21911055,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["atac-seq","atacseqqc","macs2","qc"],"created_at":"2024-08-03T15:00:55.068Z","updated_at":"2025-05-10T17:30:51.235Z","avatar_url":"https://github.com/ay-lab.png","language":"Shell","funding_links":[],"categories":["Raw data processing pipelines"],"sub_categories":[],"readme":"# ATACProc - a pipeline for processing ATAC-seq data\n\nDevloper: Sourya Bhattacharyya\n\nSupervisors: Dr. Ferhat Ay and Dr. Pandurangan Vijayanand\n\nLa Jolla Institute for Immunology, CA 92037, USA\n\n\n#######################\n\nATACProc is a pipeline to analyze ATAC-seq data. Currently datasets involving one of the four reference genomes, namely hg19, hg38, mm9 and mm10 are supported. Important features of this pipeline are:\n\n1) Supports single or paired-end fastq or BAM formatted data.\n\n2) Generates alignment summary and QC statistics.\n\n3) Peak calls using MACS2, for multiple FDR thresholds (0.01 and 0.05)\n\n4) Generating raw and coverage normalized BigWig tracks for visualizing the data in UCSC genome browser.\n\n5) Irreproducible Discovery Rate (IDR) analysis (https://github.com/nboley/idr) between a set of peak calls or even a set of input alignment (BAM) files (in which case, peaks are estimated first) corresponding to a set of biological or technical ATAC-seq replicates. 
\n\n6) **New in version 2.0:** Support discarding reads falling in blacklisted genomic regions\n\n7) **New in version 2.0:** Support extracting nucleosome free reads (NFR), one or more nucleosome containing regions (denoted as +1M), for TF footprinting analysis.\n\n8) **New in version 2.0:** Compatibility to the package ATAQV (https://github.com/ParkerLab/ataqv) for generating summary statistics across a set of samples.\n\n#######################\n\nRelease notes\n-----------------\n\n**Version 2.2 - April 2022**\n\nAdded -F option - corresponds to using different types of reads for footprinting. \n\nDefault = 1, means footprinting with nucleosome free reads (NFR) will be done.\n\nBest for standard ATAC-seq protocols (Li et al. Genome Biology, 2019)\n\nIf -F option is 2, footprinting with nucleosome reads will also be separately computed in addition to the NFR based footprints (two different footprinting outputs).\n\nIf -F option is 3, footprinting with all the reads will also be separately computed in addition to the NFR based and nucleosome read based footprints (three different footprinting outputs).\n\n**Version 2.1 - July 2020**\n\nMinor change of picard duplicate removal syntax, according to the picard tool version 2.8.14 \nWe recommend using this (or later) versions\n\n**Version 2.0 - November 2019**\n\n1) Included TF footprinting, optional discarding of blacklisted genomic regions, motif analysis\n\n2) Updated summary statistics incorporating support for ATAQV package (https://github.com/ParkerLab/ataqv)\n\n3) Discarded R package ATACseqQC (https://bioconductor.org/packages/release/bioc/html/ATACseqQC.html) and corresponding operations, mainly due to its time complexity and reliability issues.\n\n\n*Version 1.0 - July 2018:*\n\n1) Released first version of ATAC-seq pipeline, supporting generation of QC metrics, peak calls, signal tracks for visualizing in UCSC genome browser. 
\n\n2) Also supports IDR between a set of peaks / alignments for a set of replicates.\n\n\nTheory\n----------\n\nPapers / links for understanding ATAC-seq QCs:\n\n1) https://github.com/crazyhottommy/ChIP-seq-analysis  (very useful; contains many papers \nand links for understanding ChIP-seq and ATAC-seq data)\n\n2) https://www.encodeproject.org/data-standards/terms/#library\n\n3) https://www.biostars.org/p/187204/\n\n4) http://seqanswers.com/forums/archive/index.php/t-59219.html\n\n5) https://github.com/kundajelab/atac_dnase_pipelines\n\n6) https://github.com/ParkerLab/bioinf525#sifting\n\n7) https://github.com/taoliu/MACS/issues/145\n\n8) https://www.biostars.org/p/207318/\n\n9) https://www.biostars.org/p/209592/\n\n10) https://www.biostars.org/p/205576/\n\n\nUnderstanding peak calling\n\n1) https://genomebiology.biomedcentral.com/articles/10.1186/gb-2008-9-9-r137\n\nUnderstanding TF footprinting\n\n1) https://genomebiology.biomedcentral.com/articles/10.1186/s13059-019-1642-2\n\nUnderstanding IDR analysis\n\n1) https://github.com/nboley/idr\n\n\n\nInstallation\n-------------\n\nThe following packages / libraries should be installed before running this pipeline:\n\n1) Python 2.7 \n\n2) R environment (we have used 3.4.3)\n\n\tThe user should also install the following R packages, by running the following command at the R prompt:\n\n\tinstall.packages(c(\"optparse\", \"ggplot2\", \"data.table\", \"plotly\"))\n\n\tThe user also needs to install the Bioconductor package GenomicRanges \u003chttps://bioconductor.org/packages/release/bioc/html/GenomicRanges.html\u003e\n\n3) Bowtie2 (we have used version 2.3.3.1) \u003chttp://bowtie-bio.sourceforge.net/bowtie2/index.shtml\u003e\n\n4) samtools (we have used version 1.6) \u003chttp://samtools.sourceforge.net/\u003e\n\n5) PICARD tools (we have used 2.8.14 version now; previously we were using version 2.7.1) \u003chttps://broadinstitute.github.io/picard/\u003e\n\n6) Utilities \"bedGraphToBigWig\", \"bedSort\", \"bigBedToBed\", \"hubCheck\" and 
\"fetchChromSizes\" - to be downloaded from UCSC repository \u003chttp://hgdownload.soe.ucsc.edu/admin/exe/linux.x86_64/\u003e\n\n7) deepTools (we have used version 2.0) \u003chttps://deeptools.readthedocs.io/en/develop/\u003e\n\n8) MACS2 (we have used version 2.1.1) https://github.com/taoliu/MACS\n\n9) HOMER (we recommend using the latest version) http://homer.ucsd.edu/homer/\n\n10) The package *ataqv* (https://github.com/ParkerLab/ataqv). User needs to download the GitHub release (.tar.gz) file in a convenient location, extract it, and provide corresponding path in a configuration file (mentioned below).\n\n11) Regulatory genomics toolbox (https://www.regulatory-genomics.org/) \n\n\tFirst user needs to install the module *RGT* using the following commands:\n\n\t\tpip install --user cython numpy scipy\n\t\tpip install --user RGT\n\n\tA folder *rgtdata* would be created inside the home directory. Next step is to configure that folder by typing the following commands:\n\n\t\tcd ~/rgtdata\n\t\tpython setupGenomicData.py --hg19\n\t\tpython setupGenomicData.py --hg38\n\t\tpython setupGenomicData.py --mm9\n\t\tpython setupGenomicData.py --mm10\n\n\t\t(Note: it is better to run the last four commands together in a qsub / cluster environment, otherwise it'll be time consuming).\n\n\n\tThen, user needs to set up the motif configuration data, via executing the following commands (preferable to run in qsub / cluster environment)\n\n\t\tcd ~/rgtdata\n\t\tpython setupLogoData.py --all\n\n\n**User should include the PATH of above mentioned libraries / packages inside their SYSTEM PATH variable. 
Alternatively, installation PATHS for some of these packages are to be mentioned in a separate configuration file (described below)**\n\n**The following packages / libraries are to be installed for executing IDR code**\n\n9) sambamba (we have used version 0.6.7) \u003chttp://lomereiter.github.io/sambamba/\u003e\n\n10) IDRCode (https://drive.google.com/file/d/0B_ssVVyXv8ZSX3luT0xhV3ZQNWc/view?usp=sharing). The user should unzip the archive and store it in a convenient location. The path of the extracted archive is to be provided for executing IDR code.\n\n\n\nExecution\n----------\n\nThe user should first clone this pipeline in a convenient location, using the following command: \n\ngit clone https://github.com/ay-lab/ATACProc.git\n\nA sample script \"pipeline_exec.sh\" contains basic execution commands, to invoke the main executable \"pipeline.sh\" (located inside the folder \"bin\"). The executable has the following command line options:\n\nOptions:\n\nMandatory parameters:\n\n\t-C  ConfigFile\t\t    \n         \tConfiguration file to be separately provided. Mandatory parameter. The current package includes four sample configuration files named \"configfile_*\" corresponding to the reference genomes hg19, hg38, mm9 and mm10. A detailed description of the entries in this configuration file is given later.\n\t              \n\t-f  FASTQ1          \n        \tRead 1 (or forward strand) of paired-end sequencing data  [.fq|.gz|.bz2]. Alternatively, an alignment (.bam file; single or paired end) can be provided.\n\t        \n\t-r  FASTQ2          \n            R2 of paired-end sequencing data [.fq|.gz|.bz2]. If not provided, and the -f parameter is not a BAM file, the input is assumed to be single ended.\n\n\t-n  PREFIX           \n            Prefix string of output files. For example, -n \"TEST\" means that the output filenames start with the string \"TEST\". Generally, sample names with run ID, lane information, etc. 
can be used as a prefix string.\n\n\t-g  BOWTIE2_GENOME   \n            Bowtie2 indexed reference genome. Basically, the folder containing bwt2 indices (corresponding to the reference genome) is to be provided. Mandatory parameter if the user provides fastq files as input (-f and -r options). If the user provides a .bam file as input (-f option) then this field is optional.\n\n\t-d  OutDir \t\t\t  \n            Output directory to store the results for the current sample.\n\n\t-c  CONTROLBAM\t\t \n         \tControl file(s) used for peak calling using MACS2. One or more alignment files can be provided to be used as a control. It may not be specified at all, in which case MACS2 operates without any control. Control file can be either in *BAM* or in *tagalign.gz* format (the standalone script *bin/TagAlign.sh* in this repository converts BAM file to tagalign.gz format). Multiple control files are all required to be of the same format (i.e. either all BAM or all tagalign.gz). Example: -c control1.bam -c control2.bam supplies two control files to MACS2.\n\t\t\n\t-w BigWigGenome\t \n\t\t\tReference genome as a string. Allowed values are hg19 (default), hg38, mm9 and mm10. If -g option is enabled (i.e. the Bowtie2 index genome is provided), this field is optional. Otherwise, mandatory parameter.\t\t\t\t\n\t\t\n\t-D  DEBUG_TXT\t\t \n\t\t\tBinary variable. If 1 (recommended), dumps QC statistics. For a set of samples, those QC statistics can be used later to profile QC variation among different samples.\t\t\t\t\n\t\t\n\t-O \tOverwrite\t\t \n\t\t\tBinary variable. If 1, overwrites the existing files (if any). Default = 0.\n\n\t-F \tFootprint \t \t\n\t\t\tThis flag specifies the footprinting option. Value can be 1 (default), 2, or 3\n\t\t\t1: footprint using the nucleosome free reads (NFR) will be computed. \n\t\t\t   Default setting. Best for default ATAC-seq protocol (check Li et al. 
Genome Biology 2019)\n\t\t\t2: footprint using the nucleosome free reads (NFR) and also the nucleosome containing reads (NFR + 1N + 2N + 3N ...) \n\t\t\t   will be computed (two different footprint outputs - time consuming). \n\t\t\t   Best for Omni-ATAC protocol (check Li et al. Genome Biology 2019)\n \t\t\t3: footprint using NFR, NFR with nucleosome reads, and all reads will be computed \n\t\t\t   (three different footprint outputs - highly time consuming).\t\n\t\t\t   \nOptional parameters:\n\t-q  MAPQ_THR\t\t \n\t\t\tMapping quality threshold for bowtie2 alignment. Aligned reads with quality below this threshold are discarded. Default = 30. \n\t\t \n\t-t  NUMTHREADS              \n\t\t\tNumber of threads used for sorting and Bowtie2 mapping [Default = 1]. If multiple cores are available, the user should specify a value \u003e 1, such as 4 or 8, for faster execution of Bowtie2.\n\t\t\n\t-m  MAX_MEM          \n\t\t\tSet max memory used for PICARD duplicate removal [Default = 8G].\n\t\t\n\t-a  ALIGNVALIDMAX\t \n\t\t\tSet the number of (max) valid alignments which will be searched [Default = 4] for Bowtie2.\n\t\t\n\t-l  MAXFRAGLEN \t\t \n\t\t\tSet the maximum fragment length to be used for Bowtie2 alignment [Default = 2000]\n\t\t\t\n\nEntries in the configuration file (first parameter)\n---------------------------------------------------\n\nThe configuration file follows the format parameter=value\n\nAnd is to be filled with the following entries:\n\n\tpicardexec=\n\t\tPath of Picard tool executable\n\t\tExample: /home/sourya/packages/picard-tools/picard-tools-2.7.1/picard.jar\n\n\tHOMERPath=\n\t\tPath of HOMER (after installation)\n\t\tExample: /home/sourya/packages/HOMER/bin/\n\n\tDeepToolsDir=\n\t\tPath of deepTools executable\n\t\tExample: /home/sourya/packages/deepTools/deepTools2.0/bin/\t\n\n\tNarrowPeakASFile=\n\t\tfile (SQL) required to convert the narrowPeak file to the bigBed format\n\t\tDownload the file from this link (and 
save):\n\t\thttps://genome-source.gi.ucsc.edu/gitlist/kent.git/blob/master/src/hg/lib/encode/narrowPeak.as\n\t\tSpecify the location of this downloaded file:\n\t\tExample: /home/sourya/genomes/chrsize/narrowPeak.as\n\n\tBigNarrowPeakASFile=\n\t\tfile (SQL) required to convert the bignarrowPeak file to the bigBed format\n\t\tDownload the file from this link (and save):\n\t\thttps://genome.ucsc.edu/goldenPath/help/examples/bigNarrowPeak.as\n\t\tSpecify the location of this downloaded file:\n\t\tExample: /home/sourya/genomes/chrsize/bigNarrowPeak.as\n\t\t\n\tBroadPeakASFile=\n\t\tfile (SQL) required to convert the broadPeak file to the bigBed format\n\t\tDownload the file from this link (and save):\n\t\thttps://genome-source.gi.ucsc.edu/gitlist/kent.git/blob/master/src/hg/lib/encode/broadPeak.as\n\t\tSpecify the location of this downloaded file:\n\t\tExample: /home/sourya/genomes/chrsize/broadPeak.as\n\t\t\n\tRefChrSizeFile=\n\t\tfiles containing chromosome size information\n\t\ttwo column file storing the size of individual chromosomes\n\t\tDownloaded from the link (depends on the reference Chromosome employed):\n\t\tFor example, the hg38.chrom.sizes file for the hg38 database is located at \n\t\thttp://hgdownload.soe.ucsc.edu/goldenPath/hg38/bigZips/hg38.chrom.sizes.\n\t\tAlternatively, Use the \"fetchChromSizes\" script from the UCSC repository \n\t\tto get the appropriate chromosome size file.\n\t\tSpecify the location of this downloaded file:\n\t\tExample: /home/sourya/genomes/chrsize/hg38.chrom.sizes\n\t\t\n\tRefChrFastaFile=\n\t\tFasta file of the reference Chromosome. \n\t\tCan be downloaded from the link: http://hgdownload.cse.ucsc.edu/downloads.html\n\t\tExample: /home/sourya/genomes/Complete_Genome/hg38/hg38.fa\n\t\t\n\tRefChrAnnotFile=\n\t\tfile containing reference genome specific annotation (.gtf format). 
\n\t\tTo be downloaded from the following links:\n\t\thg38: ftp://ftp.ensembl.org/pub/release-98/gtf/homo_sapiens/\n\t\thg19: ftp://ftp.ensembl.org/pub/grch37/current/gtf/homo_sapiens/\n\t\tmm9: ftp://ftp.ensembl.org/pub/release-67/gtf/mus_musculus/\n\t\tmm10: ftp://ftp.ensembl.org/pub/release-97/gtf/mus_musculus/\n\t\tExample: /home/sourya/genomes/Annotation/hg38/hg38.gtf\n\n\tBlackListFile=\n\t\tfile containing blacklisted regions corresponding to the reference genome. \n\t\tTo be downloaded from the link: https://github.com/Boyle-Lab/Blacklist/tree/master/lists (v2)\n\t\tFile can be gzipped or normal text format.\n\t\t*Note: This parameter is optional.*\n\t\tExample: /home/sourya/genomes/BlackListed_Regions/hg38-blacklist.v2.bed\n\n\tATAQVPath=\n\t\tPath of ataqv package (https://github.com/ParkerLab/ataqv) executable. \n\t\tUser needs to download the GitHub release (.tar.gz) file, extract it, and provide the ataqv executable path here.\n\t\tExample: /home/sourya/packages/ataqv/ataqv-1.0.0/bin/ataqv\n\n\tTSSFile=\n\t\tFile containing TSS information for the reference genome. Obtained using the gene annotation (GTF) file.\n\t\tExample: /home/sourya/genomes/Annotation/hg38/hg38_TSS.gtf\n\n\t\n\tThe last parameter, *TSSFile*, needs a special mention. 
User can apply the following awk script to the reference genome annotation file (indicated in the parameter *RefChrAnnotFile*) to produce a file with TSS information.\n\n\tAssuming user has downloaded the reference genome specific gene annotation file using one of the ftp links provided above, when the reference genome is either hg19, hg38 or mm10, user can apply the following awk script to obtain a TSS file (input_TSS.gtf) from the gene annotation file (input.gtf) (Note: it is always best to check the .gtf file format) :\n\n\t\tawk -F'[\\t]' '{if ((substr($1,1,1)!=\"#\") \u0026\u0026 ($3==\"transcript\")) {if ($7==\"+\") {print \"chr\"$1\"\\t\"$4\"\\t\"$4\"\\t\"$3\"\\t\"$4\"\\t\"$5\"\\t\"$7\"\\t\"$9} else {print \"chr\"$1\"\\t\"$5\"\\t\"$5\"\\t\"$3\"\\t\"$4\"\\t\"$5\"\\t\"$7\"\\t\"$9}}}' input.gtf \u003e input_TSS.gtf\n\n\tWhen the reference genome is mm9, user can apply the following script (it is best to check the .gtf file format):\n\n\t\tawk -F'[\\t]' '{if ((substr($1,1,1)!=\"#\") \u0026\u0026 ($3==\"exon\")) {if ($7==\"+\") {print \"chr\"$1\"\\t\"$4\"\\t\"$4\"\\t\"$3\"\\t\"$4\"\\t\"$5\"\\t\"$7\"\\t\"$9} else {print \"chr\"$1\"\\t\"$5\"\\t\"$5\"\\t\"$3\"\\t\"$4\"\\t\"$5\"\\t\"$7\"\\t\"$9}}}' mm9.gtf \u003e mm9_TSS.gtf\n\nDescribing output of ATAC-seq pipeline\n-----------------------------------------\n\nWithin the folder *OutDir* (specified by the configuration option -d) following files (f) and folders (F) exist:\n\n\tF1: Alignment_MAPQ${MAPQ_THR}\n\n\t\tf1-1: Bowtie2_Init_Align.sam\n\t\t\tInitial alignment by Bowtie2 (if fastq files are provided as the input.)\n\t\tf1-2: UniqMappedRead.bam\n\t\t\tUniquely mapped reads.\n\t\tf1-3: Bowtie2_del_Random.bam\n\t\t\tAlignment after excluding reads from chromosomes other than autosomal chromosomes, chrX, and chrM.\n\t\tf1-4: Bowtie2_del_Mitch.bam: \n\t\t\tAlignment after excluding reads from chrM.\n\t\tf1-5: ${PREFIX}.align.sort.MAPQ${MAPQ_THR}.bam\n\t\t\tSorted, and MAPQ thresholded alignment.\n\t\tf1-6: 
${PREFIX}.align.sort.MAPQ${MAPQ_THR}.rmdup.bam\n\t\t\tDe-duplicated alignment (used for subsequent operations)\n\t\tf1-7: ${PREFIX}.align.sort.MAPQ${MAPQ_THR}.picard_metrics.txt\n\t\t\tPICARD metrics log file corresponding to the duplicate removal operation.\n\t\tf1-8: ${PREFIX}.align.sort.MAPQ${MAPQ_THR}_TN5_Shift.bam\n\t\t\t**New in version 2.0:** De-duplicated reads, with forward strand reads shifted by +4bp and reverse strand reads shifted by -5bp to account for the Tn5 transposase insertion. Used to extract the nucleosome free and nucleosome containing regions.\n\t\tf1-9: ${PREFIX}.align.sort.MAPQ${MAPQ_THR}_TN5_Shift.bed\n\t\t\t**New in version 2.0:** BED converted version of f1-8, used for MACS2 peak calling.\n\t\tf1-10: NucleosomeFree.bam\n\t\t\t**New in version 2.0:** Alignment with nucleosome free regions (NFR)\n\t\tf1-11: mononucleosome.bam\n\t\t\t**New in version 2.0:** Alignment with mononucleosome fragments\n\t\tf1-12: dinucleosome.bam\n\t\t\t**New in version 2.0:** Alignment with dinucleosome fragments\n\t\tf1-13: trinucleosome.bam\n\t\t\t**New in version 2.0:** Alignment with trinucleosome fragments\n\t\tf1-14: Merged_nucleosome.bam\n\t\t\t**New in version 2.0:** File containing fragments of nucleosome free and one or more nucleosomes (denoted as NFR +1M, in the HINT-ATAC Genome Biology paper). Generated by merging files f1-10 to f1-13.\n\n\tF2: Out_BigWig\n\t\tf2-1: ${PREFIX}.bw \n\t\t\tbigwig file for track visualization.\n\n\tF3: Out_BigWig_NormCov:\n\t\tf3-1: ${PREFIX}_NormCov.bw\n\t\t\tbigwig file for track visualization (after normalizing the coverage). Recommended to use this file for visualizing tracks in UCSC genome browser.\n\n\tF4: MACS2_Ext_*\n\t\tContains peaks employing MACS2 with the parameters:\n\t\t\t--nomodel --nolambda --shift -100 --extsize 200 --keep-dup all --call-summits\n\t\t*Note: these parameters are recommended for ATAC-seq, and are used in most existing studies.*\n\n\t\tIf the folder name is \"*_No_Control\", no control BAM file was used to infer the peaks. 
Otherwise, if the folder name is \"*_With_Control\", one or more control alignment files were used for inferring the peaks.\n\n\t\t\tf4-1: *.narrowPeak: narrow peaks with p-value threshold of 0.01\n\t\t\tf4-2: *.narrowPeak_Q0.05filt: narrow peaks with FDR (q-value) threshold = 0.05\n\t\t\tf4-3: *.narrowPeak_Q0.01filt: narrow peaks with FDR threshold = 0.01\n\t\t\tf4-4: *.broadPeak: broad peaks with p-value threshold of 0.01\n\t\t\tf4-5: *.broadPeak_Q0.05filt: broad peaks with FDR threshold = 0.05\n\t\t\tf4-6: *.broadPeak_Q0.01filt: broad peaks with FDR threshold = 0.01\n\t\t\tf4-7: out_FRiP.txt: FRiP (fraction of reads in peaks) statistics for the narrow and broad peaks.\n\t\t\tf4-8: Peak_Statistics.txt: number of peaks in different settings.\n\t\t\tF4-9: Peak_Annotate_Q*:\n\t\t\t\tHOMER based annotations corresponding to the narrow peaks inferred by the corresponding FDR threshold (0.01 or 0.05). Contains the following files:\n\t\t\t\tf4-9-1: Out_Summary.log: summary text file containing HOMER annotation.\n\t\t\t\tf4-9-2: Annotated_Peak_Q*filt.txt: Detailed HOMER annotation of the corresponding peaks.\n\t\t\t\tf4-9-3: Pie_Chart_Peak_Annotation.pdf: pie chart of peaks containing different annotations.\n\t\t\t\tf4-9-4: Peak_TSS_Distance.pdf: Histogram of distance between peaks and closest TSS\n\t\t\tf4-10: Files of *.bb extension are big-bed formatted peaks, used to visualize those peaks in UCSC tracks.\n\n\tF5: MACS2_Default_*\n\t\tContains peaks employing default MACS2 parameters. 
(generally not used for ATAC-seq processing, but we've kept it for comparison).\n\t\tFile and folder structure is similar to F4.\n\n\tf8: out_NRF_MAPQ${MAPQ_THR}.txt\n\t\tNRF (non-redundant fraction) metric\n\t\t\n\tf9: Read_Count_Stat.txt\n\t\tRead count statistics.\n\n\tF10: QC_ataqv_ParkerLab_Test\n\t\t**New in version 2.0:** Folder containing the summary .json files generated by the package ATAQV. The files for different samples can be combined into a summary statistic and displayed in a Web browser.\n\n\tF11: TSS_Enrichment_Peaks\n\t\t**New in version 2.0:** Processes the narrow peaks from the folder F4, and computes the TSS enrichment of these peaks. The underlying file structure is:\n\n\t\tMACS2_Ext_*${CONTROLSTR}/macs2_narrowPeak_Q${FDRTHR}filt_Offset_${OFFSETVAL}/${PEAKTYPE}/*.pdf\n\n\t\twhere, \n\t\t\t${CONTROLSTR}: \"*_No_Control\" or \"*_With_Control\", depending on the use of control BAM file in inferring the peaks.\n\t\t\t${FDRTHR}: FDR threshold. Can be either 0.01 or 0.05\n\t\t\t${OFFSETVAL}: can be either 1000 (1 Kb) or 5000 (5Kb) (1 Kb or 5 Kb regions surrounding TSS are checked for computing TSS enrichment).\n\t\t\t${PEAKTYPE}: can be either \"Complete_Peaks\" (means the complete set of peaks is analyzed), \"Promoter_Peaks\" (means only peaks located within 5 Kb of a TSS site are considered), or \"Enhancer_Peaks\" (peaks excluding the promoter peaks).\n\n\n\tF12: Motif_MACS2_Ext_*${CONTROLSTR}_narrowPeak_Q${FDRTHR}filt\n\t\t**New in version 2.0:** TF footprinting analysis corresponding to the peaks stored in F4. Here, ${CONTROLSTR} is either \"*_No_Control\" or \"*_With_Control\", depending on the use of control BAM file in inferring the peaks. 
${FDRTHR} is either 0.01 or 0.05.\n\n\t\tThe principle is to extract the peak summits and surroundings (by some bp, defined as an offset) and compute the TF footprinting regions and underlying motifs within these regions.\n\n\t\tWithin this folder, the file structure is as follows:\n\t\tMotif_${PEAKS_ANALYZED}_SummitOffset_${OFFSET}/Footprint_HINT_ATAC/${READTYPE}/footprints_HINT_ATAC.bed\n\n\t\twhere, \n\t\t\t${PEAKS_ANALYZED}: can be \"Complete_Peaks\" (means complete set of peaks) or \"Peaks_PvalThr_50\" (means peaks with -log10(p-value) \u003e 50 are only considered).\n\t\t\t${OFFSET}: can be either 200 or 500, means the summit +/- offset bp regions are accounted for TF footprinting.\n\t\t\t${READTYPE}: can be one of the following:\n\t\t\t \t\"all\" (means all de-duplicated reads in the file f1-8 considered), \n\t\t\t \t\"NFR\" (means only nucleosome free reads in the file f1-10 are considered), \n\t\t\t \t\"NFRANDNucl\" (means NFR regions and +1M reads, indicated by the file f1-14, are considered).\n\n\t\t \tThe output file in each occasion, \"footprints_HINT_ATAC.bed\", contains the TF footprinting regions.\n\n\nSummarizing a set of ATAC-seq samples\n---------------------------------------\n\nSuppose, a directory \"/home/sourya/Results\" contain within it, the following folders: \n1, 2, 3, 4, ... each corresponding to the output for processing individual ATAC-seq samples.\n\nTo get a summarized list of performance metrics for these samples, use the script *Analysis/ResSummary.r*, using the following syntax.\n\n\tRscript ResSummary.r --BaseDir ${BaseDir} --OutDir ${OutDir}\n\n\twhere,\n\t1) ${BaseDir}: \n\t\tDirectory containing results of all ATAC-seq sample analysis \t\t\n\t\t(like /home/sourya/Results as mentioned above). Mandatory parameter.\n\n\t2) ${OutDir}: \n\t\tOutput directory to contain the summarized results. 
Default: current working directory.\n\n\tFor details of ATAC-seq QC measures, user may check this link:\n\thttps://www.encodeproject.org/atac-seq/\n\n\tUpon executing the R script, the following files are created within the specified ${OutDir}:\n\n\t\t1) Results_All_Samples_Summary.txt: summarized statistics for all samples\n\t\t2) Field_Description.txt: Summary description of individual fields / parameters.\n\t\t3) TotalReadCount_Distribution.html: To be loaded in any web browser. Plot depicting the distribution of total reads for all samples.\n\t\t4) Fraction_MappableReadCount_Distribution.html: Fraction of mappability for all samples.\n\t\t5) Fraction_MitochondrialReadCount_Distribution.html: Fraction of mitochondrial reads for all samples.\n\t\t6) Fraction_UniqueMappReadCount_Distribution.html: Fraction of unique mappability for all samples.\n\t\t7) Fraction_LowQualReadCount_Distribution.html: Fraction of low quality reads for all samples.\n\t\t8) Fraction_DuplicateReadCount_Distribution.html: Fraction of duplicate reads for all samples.\n\t\t9) NRF_Distribution.html: NRF for all samples.\n\t\t10) M1_Distribution.html: M1 metric for all samples.\n\t\t11) M2_Distribution.html: M2 metric for all samples.\n\t\t12) PBC1_Distribution.html: PBC1 metric for all samples.\n\t\t13) PBC2_Distribution.html: PBC2 metric for all samples.\n\t\t14) FRiP_Def_NoCtrl_Distribution.html: FRiP statistics for MACS2 peaks with default command, and without using any control BAM files.\n\t\t15) NumPeak_Def_NoCtrl_Distribution.html: Number of MACS2 peaks with default command, and without using any control BAM files.\n\t\t16) FRiP_Ext_NoCtrl_Distribution.html: FRiP statistics for MACS2 peaks with --Extsize option (recommended), and without using any control BAM files.\n\t\t17) NumPeak_Ext_NoCtrl_Distribution.html: Number of MACS2 peaks with --Extsize option (recommended), and without using any control BAM files.\n\t\t18) FRiP_Def_Ctrl_Distribution.html: FRiP statistics for MACS2 peaks 
with default command, and when one or more control BAM files are used.\n\t\t19) NumPeak_Def_Ctrl_Distribution.html: Number of MACS2 peaks with default command, and when one or more control BAM files are used.\n\t\t20) FRiP_Ext_Ctrl_Distribution.html: FRiP statistics for MACS2 peaks with --Extsize option (recommended), and when one or more control BAM files are used.\n\t\t21) NumPeak_Ext_Ctrl_Distribution.html: Number of MACS2 peaks with --Extsize option (recommended), and when one or more control BAM files are used.\n\nCommand for executing IDR codes\n---------------------------------\n\nCurrent pipeline supports IDR analysis between either a list of ATAC-seq peak files \nor between a list of alignment (BAM) files. In the second case, first the BAM files \nare analyzed and subsampled to contain equal number of reads (minimum number of reads \ncontained in the inputs), and subsequently, peaks are estimated from these \n(subsampled) BAM files using MACS2. These peaks are then applied for IDR analysis.\n\nThe script \"sample_IDRScript.sh\" included within this package \nshows calling following two functions (both are included within the folder \n\"IDR_Codes\"):\n\n\t1) IDRMain.sh\n\n\t2) IDR_SubSampleBAM_Main.sh\n\n\tThe first script, IDRMain.sh, performs IDR between two or more \n\tinput peak files (we have used peaks estimated from MACS2). The parameters \n\tcorresponding to this script are as follows:\n\n\t-I  InpFile        \t \n\t\t\tA list of input peak files (obtained from MACS2 - in .narrowPeak or .narrowPeak.gz format). \n\t\t\tAt least two peak files are required. \n\t\n\t-P \tPathIDRCode\t\t \n\t\t\tPath of the IDRCode package (Kundaje et. al. after its installation). \n\t\t\tPlease check the \"Required packages\" section for the details.\n\n\t-d  OutDir \t\t \t \n\t\t\tOutput directory (absolute path preferred) which will store the IDR results.\n\n\t-n \tPREFIX \t\t\t \n\t\t\tPrefix of output files. 
Default 'IDR_ATAC'.\n\n\tA sample execution of this script is as follows:\n\n\t./IDRMain.sh -I peak1.narrowPeak -I peak2.narrowPeak -I peak3.narrowPeak -P /home/sourya/packages/idrCode/ -d /home/sourya/OutDir_IDR -n 'IDR_test'\n\n\n\n\tThe second script, IDR_SubSampleBAM_Main.sh, takes input of two or more BAM files, \n\testimates peaks from these BAM files, and then performs IDR analysis. The parameters \n\tcorresponding to this script are as follows:\n\n\t-I  InpFile        \t \n\t\t\tA list of input BAM files. At least two BAM files are required. \n\t\n\t-P \tPathIDRCode\t\t \n\t\t\tPath of the IDRCode package (Kundaje et al. after its installation). \n\t\t\tPlease check the \"Installation\" section for the details.\n\n\t-d  OutDir \t\t \t \n\t\t\tOutput directory (absolute path preferred) which will store the IDR results.\n\n\t-n \tPREFIX \t\t\t \n\t\t\tPrefix of output files. Default 'IDR_ATAC'.\n\n\t-c  CountPeak\t\t \n\t\t\tNumber of peaks in both replicates that will be compared for IDR analysis.\n\t\t\tDefault 25000.\n\t\t\n\t-C  CONTROLBAM\t\t \n\t\t\tControl file (in either .BAM or tagalign file in .gz format)\t\n\t\t\tused to estimate the peaks from MACS2. User may leave this field \n\t\t\tblank if no control file is available.\n\n\tA sample execution of this script is as follows:\n\n\t./IDR_SubSampleBAM_Main.sh -I inpfile1.bam -I inpfile2.bam -P /home/sourya/packages/idrCode/ -d /home/sourya/OutDir_IDR -n 'IDR_test' -c 25000 -C control.bam\n\n\nDescribing output of IDR analysis\n----------------------------------\n\nIn the specified output directory \"OutDir\" mentioned in the IDR script, the following \nfiles (f) and folders (F) exist:\n\n\tF1: Folders of the name $i$_and_$j$ where 0 \u003c= i \u003c j \u003c N, where N is \n\tthe number of replicates analyzed. Individual folders contain results for \n\tpairwise IDR analysis. 
For example, folder 0_and_1 contains IDR analysis \n\tfor the sample 0 (first replicate) and the sample 1 (second replicate).\n\n\tf1 : \"Replicate_Names.txt\" : names of the replicate samples used for IDR analysis.\n\n\tf2: Input_Peak_Statistics.txt: number of peaks and the peak containing replicates.\n\n\tf3: IDR_Batch_Plot-plot.pdf: final IDR plot. Here individual pairs (whose results \n\t\tare stored in the above mentioned folders) are numbered 1, 2, ...\n\t\tConsidering N = 3, the number of pairs possible is also 3. Here, \n\t\tthe number 1 denotes the folder (pair) 0_and_1, \n\t\t2 denotes the folder (pair) 0_and_2, and 3 denotes the \n\t\tfolder (pair) 1_and_2.\n\n\nContact\n-----------\n\nFor any queries, please generate a GitHub issue, or alternatively, e-mail us:\n\nSourya Bhattacharyya (sourya@lji.org)\n\nFerhat Ay (ferhatay@lji.org)\n\nPandurangan Vijayanand (vijay@lji.org)\n\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fay-lab%2FATACProc","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fay-lab%2FATACProc","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fay-lab%2FATACProc/lists"}