{"id":27452188,"url":"https://github.com/epigen/fetch_ngs","last_synced_at":"2025-09-14T16:36:22.174Z","repository":{"id":281830793,"uuid":"943940069","full_name":"epigen/fetch_ngs","owner":"epigen","description":"Workflow to Fetch Public Sequencing Data and Metadata Using iSeq and MrBiomics Module.","archived":false,"fork":false,"pushed_at":"2025-06-23T12:50:56.000Z","size":52,"stargazers_count":13,"open_issues_count":1,"forks_count":0,"subscribers_count":2,"default_branch":"main","last_synced_at":"2025-09-05T00:25:38.216Z","etag":null,"topics":["bam","database","fastq","genomics","next-generation-sequencing","ngs","repository"],"latest_commit_sha":null,"homepage":"https://epigen.github.io/fetch_ngs/","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/epigen.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":"CITATION.cff","codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null,"notice":null,"maintainers":null,"copyright":null,"agents":null,"dco":null,"cla":null}},"created_at":"2025-03-06T14:17:17.000Z","updated_at":"2025-09-02T12:05:45.000Z","dependencies_parsed_at":"2025-09-05T00:17:43.217Z","dependency_job_id":"8ac8144e-fc72-4ff2-8826-bb569cf0a1ba","html_url":"https://github.com/epigen/fetch_ngs","commit_stats":null,"previous_names":["epigen/fetch_ngs"],"tags_count":6,"template":false,"template_full_name":null,"purl":"pkg:github/epigen/fetch_ngs","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/epigen%2Ffetch_ngs","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/epigen%2Ffetch_ngs/tags","releases_url":"https://repos.ecosyste.m
s/api/v1/hosts/GitHub/repositories/epigen%2Ffetch_ngs/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/epigen%2Ffetch_ngs/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/epigen","download_url":"https://codeload.github.com/epigen/fetch_ngs/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/epigen%2Ffetch_ngs/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":273713621,"owners_count":25154614,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","status":"online","status_checked_at":"2025-09-05T02:00:09.113Z","response_time":402,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["bam","database","fastq","genomics","next-generation-sequencing","ngs","repository"],"created_at":"2025-04-15T11:41:48.819Z","updated_at":"2025-09-14T16:36:22.143Z","avatar_url":"https://github.com/epigen.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"[![MrBiomics](https://img.shields.io/badge/MrBiomics-red)](https://github.com/epigen/MrBiomics/)\n[![DOI](https://zenodo.org/badge/943940069.svg)](https://doi.org/10.5281/zenodo.15005419)\n[![](https://tokei.rs/b1/github/epigen/fetch_ngs?category=code)]() \n[![](https://tokei.rs/b1/github/epigen/fetch_ngs?category=files)]()\n[![GitHub 
license](https://img.shields.io/github/license/epigen/fetch_ngs)](https://github.com/epigen/fetch_ngs/blob/main/LICENSE)\n![GitHub Release](https://img.shields.io/github/v/release/epigen/fetch_ngs)\n[![Snakemake](https://img.shields.io/badge/Snakemake-\u003e=8.20.1-green)](https://snakemake.readthedocs.io/en/stable/)\n\n# Fetch Public Sequencing Data and Metadata Using iSeq\nA [Snakemake 8](https://snakemake.readthedocs.io/en/stable/) workflow to fetch (download) and process public sequencing data and metadata from **[GSA](https://ngdc.cncb.ac.cn/gsa/)**, **[SRA](https://www.ncbi.nlm.nih.gov/sra/)**, **[ENA](https://www.ebi.ac.uk/ena/)**, **[GEO](https://www.ncbi.nlm.nih.gov/geo/)** and **[DDBJ](https://www.ddbj.nig.ac.jp/)** databases using [iSeq](https://github.com/BioOmics/iSeq).\n\n\u003e [!NOTE]  \n\u003e This workflow adheres to the module specifications of [MrBiomics](https://github.com/epigen/MrBiomics), an effort to augment research by modularizing (biomedical) data science. For more details, instructions, and modules check out the project's repository.\n\u003e\n\u003e ⭐️ **Star and share modules you find valuable** 📤 - help others discover them, and guide our future work!\n\n\u003e [!IMPORTANT]  \n\u003e **If you use this workflow in a publication, please don't forget to give credit to the authors by citing it using this DOI [10.5281/zenodo.15005419](https://doi.org/10.5281/zenodo.15005419).**\n\n![Workflow Rulegraph](./workflow/dags/rulegraph.svg)\n\n# 🖋️ Authors\n- [Stephan Reichl](https://github.com/sreichl)\n- [Christoph Bock](https://github.com/chrbock)\n\n\n# 💿 Software\nThis project wouldn't be possible without the following software and their dependencies.\n\n| Software | Reference (DOI) |\n| :---: | :---: |\n| iSeq | https://github.com/BioOmics/iSeq |\n| pandas         | https://doi.org/10.5281/zenodo.3509134            |\n| Picard | https://broadinstitute.github.io/picard/ |\n| Snakemake | https://doi.org/10.12688/f1000research.29032.2 |\n\n\n# 
🔬 Methods\nThis is a template for the Methods section of a scientific publication and is intended to serve as a starting point. Only retain paragraphs relevant to your analysis. References [ref] to the respective publications are curated in the software table above. Versions (ver) have to be read out from the respective conda environment specifications (`workflow/envs/*.yaml` files) or post-execution in the result directory (`{module}/envs/*.yaml`). Parameters that have to be adapted depending on the data or workflow configurations are denoted in square brackets, e.g., [X].\n\n__Data Acquisition \u0026 Processing.__ Public sequencing data were retrieved from [GSA|SRA|ENA|DDBJ] under the accession(s) [accession_ids] using iSeq (ver) [ref]. The data were downloaded as FASTQ files (and converted to unmapped BAM ([uBAM](https://gatk.broadinstitute.org/hc/en-us/articles/360035532132-uBAM-Unmapped-BAM-Format)) files using Picard FastqToSam (ver) [ref], preserving sample information and read groups while supporting both single-end and paired-end sequencing data).  
Metadata for each dataset was collected and merged into a single comprehensive reference file.\n\n**The data acquisition and processing described here were performed using a publicly available Snakemake (ver) [ref] workflow [10.5281/zenodo.15005419](https://doi.org/10.5281/zenodo.15005419).**\n\n# 🚀 Features\nThe workflow performs the following steps that produce the outlined results:\n\n- Data Acquisition\n  - Downloads sequencing data from public repositories **[GSA](https://ngdc.cncb.ac.cn/gsa/)**, **[SRA](https://www.ncbi.nlm.nih.gov/sra/)**, **[ENA](https://www.ebi.ac.uk/ena/)**, and **[DDBJ](https://www.ddbj.nig.ac.jp/)** using various [accession ID types](https://github.com/BioOmics/iSeq/blob/main/README.md#1--i---input)\n  - Extracts comprehensive metadata for each dataset\n  - Supports parallel downloading for improved performance using threads\n- Data Processing\n  - Automatic handling of both single-end and paired-end sequencing data\n  - Creation of a unified comprehensive metadata file with accession IDs and file paths\n  - Optional conversion from `FASTQ` (as `*.fastq.gz`) to [unmapped BAM](https://gatk.broadinstitute.org/hc/en-us/articles/360035532132-uBAM-Unmapped-BAM-Format) (as `*.bam`) format using [Picard's](https://broadinstitute.github.io/picard/) [FastqToSam](https://gatk.broadinstitute.org/hc/en-us/articles/360036351132-FastqToSam-Picard)\n- Metadata-only mode for quick exploration without downloading sequence files (`metadata_only: 1`)\n- Considerations\n  - Dependent on iSeq's supported repositories and accession types\n  - Requires internet connectivity and sufficient storage space for downloaded data\n\nThe workflow produces the following directory structure:\n\n```\n{result_path}/\n└── fetch_ngs/\n    ├── metadata.csv                # merged metadata for all accessions\n    ├── .fastq_to_bam/              # processing marker files\n    │   └── [accession].done\n    └── [accession]/                # one directory per accession\n        
├── [accession].metadata.csv  # metadata for this accession\n        └── [sample].[bam/fastq.gz]   # sequence files\n```\n\n# 🛠️ Usage\nHere are some tips for the usage of this workflow:\n- Run your workflow with `snakemake --resources parallel_downloads=3` to restrict concurrent download jobs to three (worked well for me), thereby reducing the risk of triggering IP blacklisting from excessive parallel FTP connections. This can also be achieved by using the workflow's [profile](./workflow/profiles/default/config.yaml). When using the workflow as a module, put the parameter into the parent workflow's profile.\n- Specify accession IDs in the configuration file as a list to download multiple datasets in one run\n- Use `metadata_only: 1` for a quick preview of available data before committing to full downloads\n- Choose between `FASTQ` or `BAM` output formats based on your downstream analysis needs\n- For large datasets, consider increasing the `threads` and `mem` parameters\n- For super series (e.g., GSE) or projects containing many samples, start by running in `metadata_only: 1` mode to extract run accession IDs. 
Then use these IDs in the config to enable maximum parallelization, avoiding sequential download and conversion.\n- The merged metadata file can be used as a basis for sample annotation files downstream\n- BAM output format (`output_format: bam`) is recommended for direct integration with BAM compatible downstream analysis workflows\n\n# ⚙️ Configuration\nDetailed specifications can be found here [./config/README.md](./config/README.md)\n\n# 📖 Examples\nExplore detailed examples showcasing module usage in our comprehensive end-to-end [MrBiomics Recipes](https://github.com/epigen/MrBiomics?tab=readme-ov-file#-recipes), including data, configuration, annotation and results:\n- [ATAC-seq Analysis Recipe](https://github.com/epigen/MrBiomics/wiki/ATAC%E2%80%90seq-Analysis-Recipe)\n- [RNA-seq Analysis Recipe](https://github.com/epigen/MrBiomics/wiki/RNA%E2%80%90seq-Analysis-Recipe)\n\n# 🔗 Links\n- [GitHub Repository](https://github.com/epigen/fetch_ngs/)\n- [GitHub Page](https://epigen.github.io/fetch_ngs/)\n- [Zenodo Repository](https://doi.org/10.5281/zenodo.15005419)\n- [Snakemake Workflow Catalog Entry](https://snakemake.github.io/snakemake-workflow-catalog?usage=epigen/fetch_ngs)\n\n# 📚 Resources\n- Recommended compatible [MrBiomics Modules](https://github.com/epigen/MrBiomics/#-modules) for downstream analyses:\n  - [ATAC-seq Data Processing \u0026 Quantification Pipeline](https://github.com/epigen/atacseq_pipeline) for processing, quantification and annotation of chromatin accessibility.\n  - [Genome Browser Track Visualization](https://github.com/epigen/genome_tracks/) for quality control and visual inspection/analysis of genomic regions/genes of interest or top hits.\n  - [\u003cins\u003eSp\u003c/ins\u003elit, F\u003cins\u003eilter\u003c/ins\u003e, Norma\u003cins\u003elize\u003c/ins\u003e and \u003cins\u003eIntegrate\u003c/ins\u003e Sequencing Data](https://github.com/epigen/spilterlize_integrate/) after count quantification.\n  - [Differential Analysis with 
limma](https://github.com/epigen/dea_limma) to identify and visualize statistically significantly different features (e.g., genes or genomic regions) between sample groups.\n  - [Enrichment Analysis](https://github.com/epigen/enrichment_analysis) for biomedical interpretation of (differential) analysis results using prior knowledge.\n  - [Unsupervised Analysis](https://github.com/epigen/unsupervised_analysis) to understand and visualize similarities and variations between cells/samples, including dimensionality reduction and cluster analysis. Useful for all tabular data including single-cell and bulk sequencing data.\n\n\n# 📑 Publications\nThe following publications successfully used this module for their analyses.\n- [FirstAuthors et al. (202X) Journal Name - Paper Title.](https://doi.org/10.XXX/XXXX)\n- ...\n\n# ⭐ Star History\n\n[![Star History Chart](https://api.star-history.com/svg?repos=epigen/fetch_ngs\u0026type=Date)](https://star-history.com/#epigen/fetch_ngs\u0026Date)\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fepigen%2Ffetch_ngs","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fepigen%2Ffetch_ngs","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fepigen%2Ffetch_ngs/lists"}