{"id":20425053,"url":"https://github.com/databio/pepatac_paper_data","last_synced_at":"2026-05-29T06:31:26.412Z","repository":{"id":86287787,"uuid":"302157265","full_name":"databio/pepatac_paper_data","owner":"databio","description":"Information on reproducing plots from the PEPATAC paper","archived":false,"fork":false,"pushed_at":"2021-05-17T16:40:01.000Z","size":45,"stargazers_count":2,"open_issues_count":0,"forks_count":0,"subscribers_count":2,"default_branch":"master","last_synced_at":"2026-02-23T14:48:41.710Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":null,"language":"R","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/databio.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2020-10-07T20:50:59.000Z","updated_at":"2024-01-17T20:53:29.000Z","dependencies_parsed_at":"2023-03-13T09:28:31.095Z","dependency_job_id":null,"html_url":"https://github.com/databio/pepatac_paper_data","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/databio/pepatac_paper_data","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/databio%2Fpepatac_paper_data","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/databio%2Fpepatac_paper_data/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/databio%2Fpepatac_paper_data/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/databio%2Fpepatac_paper_data/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/da
tabio","download_url":"https://codeload.github.com/databio/pepatac_paper_data/tar.gz/refs/heads/master","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/databio%2Fpepatac_paper_data/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":33640627,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-05-26T15:22:16.424Z","status":"online","status_checked_at":"2026-05-29T02:00:06.066Z","response_time":107,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2024-11-15T07:12:08.493Z","updated_at":"2026-05-29T06:31:26.391Z","avatar_url":"https://github.com/databio.png","language":"R","funding_links":[],"categories":[],"sub_categories":[],"readme":"# PEPATAC paper analyses\n\n## Get the data files\n\nThe data used for the paper is available from public resources. 
Included in the `metadata/` subfolder is the \n\"paper_sra_accessions.txt\" file containing a list of sequence read archive accession numbers.\n\nTo obtain the files en masse, you can provide the entire file to `fasterq-dump` from NCBI's sra-tools like so:\n```\ncat paper_sra_accessions.txt | xargs -n1 fasterq-dump -p -O /path/to/output_dir\n```\n\nTo simplify use of downstream configuration files, you can also create an environment variable (`SRAFQ`) that points to this output directory containing your fastq files.\n\n```\nexport SRAFQ=/path/to/output_dir\n```\n\n## Run the pipeline\n\nAfter downloading, process the samples with the pipeline:\n```\nlooper run paper_config.yaml\nlooper run paper_none_config.yaml\n```\n\nAfter the runs complete, generate summary statistics:\n```\nlooper report paper_config.yaml\nlooper report paper_none_config.yaml\n```\n\nThis produces two sets of output, with and without prealignments, for downstream comparisons.\n\nThe [included R markdown file](src/PEPATAC_paper_plots.Rmd) may be followed to reproduce the plots in R from the paper.\n\n## Prealignment comparisons\nProducing the prealignment timing comparison plots requires five primary steps.\n\n### 1. Obtain source files\n\nThe mitochondrial (mtDNA) and human nuclear genome (hg38) aligning reads are originally derived from the following GEO accessions:\n - GSM2471255\n - GSM2471300\n - GSM2471249\n - GSM2471269\n - GSM2471245\n\n### 2. Run source files through PEPATAC and *keep* prealignment BAM files\n\nWe want to extract mitochondrial reads, so we will keep all prealignment files. The default is to remove them to save disk space. The included \"source_library_config.yaml\" is our PEP for these samples.\n\n```\nlooper run source_library_config.yaml -x \" --keep\"\n```\n\n### 3. 
Extract mitochondrial and nuclear genome aligning reads\n\nAfter these samples finish, we want to generate the total read counts of both mtDNA and hg38 aligning reads needed to combine them in various ratios, producing 10-100% mixtures from 10M to 200M total reads per mixture.\n\n`./generate_libraries.sh \"mtDNA_reads\" \"hg38_reads\" \"/path/to/source_library_output/results_pipeline/\"`\n\nThis is best accomplished on a cluster or a machine with upwards of 100GB of available RAM.\n\n### 4. Analyze with and without prealignments\n\nSet an environment variable named `DATA` that points to the directory containing your generated libraries.\n\n`export DATA=/your/path/to/mtDNA_reads/`\n\nRun each version of the PEP project using the same compute resources.\n\n```\nlooper run prealignment_config.yaml --compute cores=8 mem=16000\nlooper run prealignment_none_config.yaml --compute cores=8 mem=16000\n```\n\n### 5. Produce comparison plots\n\n```\nRscript PEPATAC_profile_aggregator.R /path/to/prealignment_config.yaml /path/to/prealignment_none_config.yaml $PROCESSED/pepatac/prealignment_comparison/yes/results_pipeline/ $PROCESSED/pepatac/prealignment_comparison/no/results_pipeline/ /path/to/your/output_dir\n```","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fdatabio%2Fpepatac_paper_data","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fdatabio%2Fpepatac_paper_data","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fdatabio%2Fpepatac_paper_data/lists"}