{"id":20109410,"url":"https://github.com/crazyhottommy/pyflow-chipseq","last_synced_at":"2025-05-06T10:31:19.901Z","repository":{"id":47027126,"uuid":"89386223","full_name":"crazyhottommy/pyflow-ChIPseq","owner":"crazyhottommy","description":"a snakemake pipeline to process ChIP-seq files from GEO or in-house ","archived":false,"fork":false,"pushed_at":"2020-04-17T18:40:27.000Z","size":432,"stargazers_count":107,"open_issues_count":0,"forks_count":40,"subscribers_count":9,"default_branch":"master","last_synced_at":"2025-05-03T09:53:29.354Z","etag":null,"topics":["chip-seq","python","snakemake"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/crazyhottommy.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2017-04-25T17:05:43.000Z","updated_at":"2025-04-01T06:03:33.000Z","dependencies_parsed_at":"2022-08-03T04:06:06.530Z","dependency_job_id":null,"html_url":"https://github.com/crazyhottommy/pyflow-ChIPseq","commit_stats":null,"previous_names":[],"tags_count":1,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/crazyhottommy%2Fpyflow-ChIPseq","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/crazyhottommy%2Fpyflow-ChIPseq/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/crazyhottommy%2Fpyflow-ChIPseq/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/crazyhottommy%2Fpyflow-ChIPseq/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/crazyhottommy","download_url":"https://codeload.github.com/c
razyhottommy/pyflow-ChIPseq/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":252665949,"owners_count":21785174,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["chip-seq","python","snakemake"],"created_at":"2024-11-13T18:08:08.764Z","updated_at":"2025-05-06T10:31:19.364Z","avatar_url":"https://github.com/crazyhottommy.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# pyflow-ChIPseq\na snakemake pipeline to process ChIP-seq files from GEO\n\nI have so many people asking me to process public GEO ChIP-seq data sets for them. I hate repeating the same steps, so I decided to make a pipeline for it.\n\nSwitch to a different branch to see the code. For now I have the `shark` branch for the LSF system.\n\n**UPDATE** 05/30/2017. Now, the pipeline can handle in-house data as well.\n\nFor now this works on LSF; I will add another branch for Torque.\n\n### Citation\n\nI created a doi on [zenodo](https://zenodo.org/).\nYou can cite:\n\n[![DOI](https://zenodo.org/badge/89386223.svg)](https://zenodo.org/badge/latestdoi/89386223)\n\nThe pipeline was [used in JOVE](https://www.jove.com/video/56972/an-integrated-platform-for-genome-wide-mapping-chromatin-states-using)! Please cite the following\nas well:\n\n```\nTerranova, C., Tang, M., Orouji, E., Maitituoheti, M., Raman, A., Amin, S., et al. An Integrated Platform for Genome-wide Mapping of Chromatin States Using High-throughput ChIP-sequencing in Tumor Tissues. J. Vis. Exp. 
(134), e56972, doi:10.3791/56972 (2018).\n```\n\n### work flow of the pipeline\n\n![](./rulegraph.png)\n\nIn the `config.yaml` file you can change settings, e.g. the path to a different genome to align to, or p value cut-offs. `target_reads` is the number of reads to downsample to. I set 15 million as the default. If the original bam file has fewer reads than `target_reads`, the pipeline keeps all of its reads.\n\n### Dependencies\n\n* [snakemake](https://bitbucket.org/snakemake/snakemake). snakemake is python3\n* R \u003e 3.3.0\nyou will need the `optparse` package: `install.packages(\"optparse\")`\nand `SRAdb`:\n\n```r\nsource(\"https://bioconductor.org/biocLite.R\")\nbiocLite(\"SRAdb\")\n```\n\n```\n Rscript sraDownload.R  -a 'ascp -QT -l 300m -i ~/.aspera/connect/etc/asperaweb_id_dsa.openssh' -t fastq SRR3144652\n```\n\nThis script will download the meta file for each SRR id as well.\n\n* aspera for downloading\n\ncheck this blog post by MARK ZIEMANN http://genomespot.blogspot.com/2015/05/download-sra-data-with-aspera-command.html\n\n```bash\nsh \u003c(curl -s aspera-connect-3.6.2.117442-linux-64.sh)\n```\n\n`sraDownload.R` is under the `scripts` folder from [Luke Zappia](https://github.com/lazappi):\n\n```bash\n## single quote your ascp command, otherwise R will be confused\nRscript sraDownload.R  -a 'ascp -QT -l 300m -i ~/.aspera/connect/etc/asperaweb_id_dsa.openssh' -t fastq SRR3144652\n\n```\n\n* [fastqc](https://www.bioinformatics.babraham.ac.uk/projects/fastqc/)\n* `bamCoverage` v2.3.3 from [deeptools](https://github.com/fidelram/deepTools) for making RPKM normalized and input subtracted bigwig files\n* [bowtie1](http://bowtie-bio.sourceforge.net/index.shtml) for aligning short reads (\u003c 50bp)\n* [samblaster](https://github.com/GregoryFaust/samblaster) v0.1.22 to remove duplicates and downsample.\n* [samtools](http://www.htslib.org/) v1.3.1\n* [ROSE](http://younglab.wi.mit.edu/super_enhancer_code.html) for calling superEnhancer. 
ROSE has to be run inside the installation folder. For now the path is hard-coded in the Snakefile (you will have to change it to the ROSE directory on your cluster). Todo: expose the path in the `config.yaml` file so that one can change it.\n* [macs1](https://pypi.python.org/pypi/MACS/1.4.2) v1.4.2 and [macs2](https://github.com/taoliu/MACS) v2.1.1 for calling peaks (macs2 for broad peaks).\n* [multiQC](http://multiqc.info/)\n* [phantompeakqual](https://github.com/kundajelab/phantompeakqualtools)\n\n\n`macs1`, `macs2` and `ROSE` are python2.x, see the thread [Using Snakemake and Macs2](https://groups.google.com/forum/#!searchin/snakemake/macs%7Csort:relevance/snakemake/60txGSq81zE/NzCUTdJ_AQAJ) in the snakemake google group.\n\nIf you look inside the `Snakefile`, I use `source activate root` to switch back to python2.x before running macs1 and macs2.\n\nThere will be [Integration of conda package management into Snakemake](https://bitbucket.org/snakemake/snakemake/pull-requests/92/wip-integration-of-conda-package/diff)\n\n\n### How to distribute workflows\n\nRead the [doc](https://snakemake.readthedocs.io/en/stable/snakefiles/deployment.html)\n\n```bash\nssh shark.mdanderson.org\n\n# start a screen session\nscreen\n\n# make a folder, name it yourself, I named it workdir for demo\nmkdir /rsch2/genomic_med/krai/workdir/\n\ncd /rsch2/genomic_med/krai/workdir/\n\ngit clone https://github.com/crazyhottommy/pyflow-ChIPseq\n\ncd pyflow-ChIPseq\n\n## go to downsampling branch. shark is LSF system\ngit checkout shark\n\n## edit the config.yaml file as needed, e.g. set mouse or human for ref genome, p value cut off for peak calling, the number of reads you want to downsample to\nnano config.yaml\n\n## skip this if on Shark, samir has py351 set up for you. 
see below STEPS\nconda create -n snakemake python=3 -c bioconda multiqc snakemake deeptools\nsource activate snakemake\n```\n\n## STEPS for fastq files from GEO\n\n### Download the sra files\n\nPrepare a txt file `SRR.txt` which has three columns: sample_name, fastq_name, and factor:\n\ne.g.\n\n```bash\ncat SRR.txt\n\nsample_name fastq_name  factor\nMOLM-14_DMSO1_5 SRR2518123   BRD4\nMOLM-14_DMSO1_5 SRR2518124  Input\nMOLM-14_DMSO2_6 SRR2518125  BRD4\nMOLM-14_DMSO2_6 SRR2518126  Input\n\n\n```\n\nYou can have multiple different factors for the same sample_name.\n\n`sample_name_factor` will be used to name the output, e.g. `MOLM-14_DMSO1_5_BRD4.sorted.bam`\n\n### download the sra files using the R script\n\n```bash\ncd pyflow-ChIPseq\nmkdir fastqs\ncd fastqs\n## fastq-dump only converts the sra files to fastq in the current folder\n```\n\nmake a shell script:\n`download.sh`\n\nDownload the sqlite database from http://dl.dropbox.com/u/51653511/SRAmetadb.sqlite.gz and unzip it. Place it in the `scripts` folder.\n\nUPDATE (2018-04-28). 
You may want to upgrade the `sradb` or download the database here https://github.com/seandavi/SRAdb/blob/master/README.md#raw-database-downloads\n\nsee issue https://github.com/seandavi/SRAdb/issues/10\n\n```bash\n#!/bin/bash\nset -euo pipefail\n\n## you will need to change the ascp command to get the right path\nRscript ../scripts/sraDownload.R -a 'ascp -QT -l 300m -i ~/.aspera/connect/etc/asperaweb_id_dsa.openssh' -d ../scripts/SRAmetadb.sqlite $1\n```\n\n`chmod u+x download.sh`\n\n```bash\n# inside the pyflow-ChIPseq/fastqs folder:\ncat ../SRR.txt | sed '1d' | cut -f2 | sort | uniq \u003e srr_unique.txt\n\n## only have 4 jobs in parallel, good behavior on a cluster\ncat srr_unique.txt | parallel -j 4 ./download.sh {}\n\n# all the sra files will be downloaded in the current fastqs folder.\n```\n\nNow you have all `sra` files downloaded into the `fastqs` folder, proceed below:\n\n### convert `sra` to `fastqs` and compress to .gz files\n\n```bash\n\n## you can use a for loop to fastq-dump the downloaded sra files.\nfind *sra| parallel -j 4  fastq-dump {}\n\nfind *fastq | parallel -j 4  bgzip {}\n\n## save some space\nrm *sra\n\n# go back to the pyflow-ChIPseq folder\ncd ..\n\npython3 sample2json.py --fastq_dir fastqs/ --meta SRR.txt\n```\n\nA `samples.json` file will be created and some information will be printed out on the screen.\n\n### start the pipeline\n\n```bash\n## dry run\nsnakemake -np\n\n## test for one sample\n./pyflow-ChIPseq.sh  07bigwig/SRR2518123.bw\n\n```\n\nIf there are no errors, proceed below.\n\n### Using [DRMAA](https://www.drmaa.org/)\n\n[job control through drmaa](http://drmaa-python.readthedocs.io/en/latest/tutorials.html#controlling-a-job)\n\nDRMAA is only supported on `Shark` (LSF system).\n\n```bash\nmodule load drmma\n./pyflow-drmaa-ChIPseq.sh\n```\n\nWhen using `drmaa`, you can press `control + c` to stop the current run.\n\nDependent jobs are submitted one by one; if a job fails, the pipeline will stop. 
Good for initial testing.\n\n### submit all jobs to the cluster\n\n```bash\n./pyflow-ChIPseq.sh\n```\n\nAll jobs will be submitted to the cluster queue at once. This is useful if you know most of your jobs will succeed, since the jobs wait in the queue and gain priority.\n\n## process the custom data produced by the sequencing core.\n\nDifferent people have different naming conventions. To accommodate this, I require them to give me a tab-delimited `meta.txt` file with the sample information.\n\nThe `sample2json.py` script assumes that each fastq_name in the `meta.txt` file appears in the names of the fastq files. Only the first three columns will be used.\n`factor`s from the same `sample_name` will be made into one group.\n\nSet the `control` in the `config.yaml` file to the one you are going to use for peak calling, e.g. Input, IgG\n\n\n```bash\ncd pyflow-ChIPseq\ncat meta.txt\nsample_name     fastq_name      factor  reference\nsample1 Li-Lane-1-1A-062817     CST-CHD1        mouse\nsample1 Li-Lane-1-1C-062817     Bethal-CHD1     mouse\nsample1 Li-Lane-1-1E-062817     IgG     mouse\nsample1 Li-Lane-4-7E-062817     Input   mouse\nsample2 Li-Lane-1-1B-062817     CST-CHD1        mouse\nsample2 Li-Lane-1-1D-062817     Bethal-CHD1     mouse\nsample2 Li-Lane-1-1F-062817     IgG     mouse\nsample2 Li-Lane-4-7F-062817     Input   mouse\nsample3 Li-Lane-2-3C-062817     SOX2    mouse\nsample3 Li-Lane-2-3D-062817     H3K27Ac mouse\nsample3 Li-Lane-4-7G-062817     Input   mouse\nsample4 Li-Lane-2-3E-062817     H3K9me3 mouse\nsample4 Li-Lane-3-5A-062817     H3K27Ac mouse\nsample4 Li-Lane-3-5E-062817     MYC     mouse\nsample4 Li-Lane-2-7H-062817     Input   mouse\nsample5 Li-Lane-2-3F-062817     H3K9me3 mouse\nsample5 Li-Lane-3-5B-062817     H3K27Ac mouse\nsample5 Li-Lane-3-9A-062817     Input   mouse\nsample6 Li-Lane-2-3G-062817     H3K9me3 mouse\nsample6 Li-Lane-3-5C-062817     H3K27Ac mouse\nsample6 Li-Lane-4-9B-062817     Input   mouse\nsample7 Li-Lane-2-3H-062817     H3K9me3 
mouse\nsample7 Li-Lane-3-5D-062817     H3K27Ac mouse\nsample7 Li-Lane-3-5H-062817     MYC     mouse\nsample7 Li-Lane-4-9C-062817     Input   mouse\n\n\n## only the first 3 columns are required.\n\n## make a samples.json file\npython3 sample2json.py --fastq_dir dir/to/fastqs/ --meta meta.txt\n```\n\nThe real name of the fastq files:\n\n`/rsrch2/genomic_med/krai/zheng-ChIPseq-2/Sample_Li-Lane-1-1C-062817/Li-Lane-1-1C-062817_S24_L004_R1_001.fastq.gz`\n\nCheck the example `samples.json` file in the repo.\n\nNow, follow the same steps as for processing the `sra` files:\n\n```bash\n# dry run\n./pyflow-ChIPseq.sh  -np\n```\n\n\n### Extra notes on file names\nIf one sets up a lab, it is necessary to have consistent file naming across all the projects. The `TCGA` project is a great example for us to follow. A [barcode system](https://wiki.nci.nih.gov/display/TCGA/TCGA+barcode) can make your life a lot easier for downstream analysis.\n\n![](./TCGA_barcode.jpg)\n\nSimilarly, for a ChIP-seq project, it is important to have consistent naming.\nIn Dr. Kunal Rai's lab, we adopted a barcoding system similar to TCGA:\n\ne.g.\n`TCGA-SKCM-M028-11-P008-A-NC-CJT-T`\n\n`TCGA` is the big project name;\n`SKCM` is the tumor name;\n`M028` is the sample name (this should be a unique identifier);\n`11` is the sequencing tag;\nWe use `11` to denote first time IP, first time sequencing. If the number of reads is too low but the IP worked, we just need to resequence the same library, and the resequenced sample gets `12`. If the total number of reads is still too low, `13` can be used. `21` is second time IP and first time sequencing, etc.\n`P008` is the plate number of that IP experiment. We now use 96-well plates for ChIP-seq, and we use this id to track which plate the samples are from.\n`A` is the chromatin mark name or transcription factor name. 
We have a naming map for this:\n`A` is H3K4me1, `B` is H3K9me3 and `G` is for Input etc.\n\nThe other barcode areas can be used for other information. `NC` means the samples were sequenced in north campus.\n\nIt saves me a lot of work in the downstream processing. The barcode can be captured by a universal regular expression from the fastq.gz files.\n\nA real experiment produces a fastq.gz name like this:\n\n`TCGA-SKCM-M028-R1-P008-C-NC-CJT-T_TTAGGC_L001_R1_006.fastq.gz`\n\nMultiplexing is very common nowadays, so the sequencing reads for the same sample may come from different lanes. After de-multiplexing, multiple files for the same sample will be put in the same folder. If you name your files in a consistent way, you can easily merge all the fastq files before mapping. (For DNA sequencing, it is recommended to map the fastq files independently and then merge the mapped bam files, with read groups to identify which lane each read is from.)\n\nIt also helps for merging the fastq files from two different rounds of sequencing. I know sequencing tags `11` and `12` with the same sample name and chromatin mark name are for the same sample, so I can merge them together programmatically.\n\nI also know that `G` is an Input control sample, so I can then call peaks, make Input subtracted bigwigs etc. using an IP vs Input pattern (A_vs_G, B_vs_G). The same idea can be used for `tumor` and `control` for whole genome sequencing when calling mutations and copy number.\n\nMany people outside our lab ask me to process their data, and I cannot enforce the naming of the files before they carry out the experiments. 
That's why I require them to give me a `meta.txt` file instead.\n\n### job control\n\nTo kill all of your pending jobs you can use the command:\n\n```bash\nbkill `bjobs -u krai |grep PEND |cut -f1 -d\" \"`\n```\n\nother useful commands:\n\n```\nbjobs -pl\nDisplay detailed information of all pending jobs of the invoker.\n\nbjobs -ps\nDisplay only pending and suspended jobs.\n\nbjobs -u all -a\nDisplay all jobs of all users.\n\nbjobs -d -q short -m apple -u mtang1\nDisplay all the recently finished jobs submitted by mtang1 to the\nqueue short, and executed on the host apple.\n```\n\n### rerun some of the jobs\n\n```bash\n\n# specify the name of the rule, all files associated with that rule will be rerun. e.g. rerun macs2 calling peaks rule,\n./pyflow-ChIPseq.sh -R call_peaks_macs2\n\n## rerun one sample, just specify the name of the target file\n\n./pyflow-ChIPseq.sh -R 02aln/SRR3144652.sorted.bam\n\n# check snakemake -f, -R, --until options\n./pyflow-ChIPseq.sh -f call_peaks_macs2\n```\n\n### checking results after the run finishes\n\n```bash\n\nsnakemake --summary | sort -k1,1 | less -S\n\n# or detailed summary will give you the commands used to generate the output and what input is used\nsnakemake --detailed-summary | sort -k1,1 \u003e snakemake_run_summary.txt\n```\n\n\n### clean the folders\n\nI use echo to see what will be removed first, then you can remove all later.\n\n```\nfind . -maxdepth 1 -type d -name \"[0-9]*\" | xargs echo rm -rf\n```\n\n\n### Snakemake does not trigger re-runs if I add additional input files. What can I do?\n\nSnakemake has a kind of “lazy” policy about added input files if their modification date is older than that of the output files. One reason is that information what to do cannot be inferred just from the input and output files. You need additional information about the last run to be stored. 
Since behaviour would be inconsistent between cases where that information is available and where it is not, this functionality has been encoded as an extra switch. To trigger updates for jobs with changed input files, you can use the command line argument `--list-input-changes` in the following way:\n\n```bash\nsnakemake -n -R `snakemake --list-input-changes`\n\n```\n\n### How do I trigger re-runs for rules with updated code or parameters?\n\n```bash\nsnakemake -n -R `snakemake --list-params-changes`\n```\n\nand\n\n```bash\nsnakemake -n -R `snakemake --list-code-changes`\n```\n\n\n### TO DO list\n\n* **provide an output directory** now everything is output in the current pyflow-ChIPseq directory in a structured fashion: `00log`, `01seq`, `02fqc`, `03aln` etc\n* **work for paired-end ChIPseq as well** now only for single-end.\n* **put everything in docker**\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fcrazyhottommy%2Fpyflow-chipseq","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fcrazyhottommy%2Fpyflow-chipseq","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fcrazyhottommy%2Fpyflow-chipseq/lists"}