{"id":30019630,"url":"https://github.com/veupathdb/ngs-samples-nextflow","last_synced_at":"2026-02-09T00:31:10.862Z","repository":{"id":275165246,"uuid":"913362760","full_name":"VEuPathDB/ngs-samples-nextflow","owner":"VEuPathDB","description":"Fetch ngs samples from SRA if needed and/or prepare samplesheet for further processing","archived":false,"fork":false,"pushed_at":"2025-10-16T14:49:50.000Z","size":41,"stargazers_count":0,"open_issues_count":0,"forks_count":0,"subscribers_count":14,"default_branch":"main","last_synced_at":"2025-10-17T17:47:54.181Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":null,"language":"Nextflow","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/VEuPathDB.png","metadata":{"files":{"readme":"README.org","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null,"notice":null,"maintainers":null,"copyright":null,"agents":null,"dco":null,"cla":null}},"created_at":"2025-01-07T14:40:36.000Z","updated_at":"2025-10-16T14:49:55.000Z","dependencies_parsed_at":"2025-01-31T16:35:43.451Z","dependency_job_id":"20791bdd-7ff0-405b-82a9-1ac58ad68de4","html_url":"https://github.com/VEuPathDB/ngs-samples-nextflow","commit_stats":null,"previous_names":["veupathdb/ngs-samples-nextflow"],"tags_count":1,"template":false,"template_full_name":null,"purl":"pkg:github/VEuPathDB/ngs-samples-nextflow","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/VEuPathDB%2Fngs-samples-nextflow","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/VEuPathDB%2Fngs-samples-nextflow/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hos
ts/GitHub/repositories/VEuPathDB%2Fngs-samples-nextflow/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/VEuPathDB%2Fngs-samples-nextflow/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/VEuPathDB","download_url":"https://codeload.github.com/VEuPathDB/ngs-samples-nextflow/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/VEuPathDB%2Fngs-samples-nextflow/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":29251463,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-02-08T22:49:53.206Z","status":"ssl_error","status_checked_at":"2026-02-08T22:49:51.384Z","response_time":57,"last_error":"SSL_read: unexpected eof while reading","robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":false,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2025-08-06T01:20:09.310Z","updated_at":"2026-02-09T00:31:10.856Z","avatar_url":"https://github.com/VEuPathDB.png","language":"Nextflow","funding_links":[],"categories":[],"sub_categories":[],"readme":"#+title: NGS Samples Nextflow Pipeline\n#+author: VEuPathDB\n#+date: 2024\n\n* Overview\n\nThis Nextflow pipeline prepares NGS (Next Generation Sequencing) FASTQ files and creates standardized samplesheets for downstream bioinformatics analysis. The pipeline handles two primary use cases:\n\n1. *SRA Download*: Downloads FASTQ files from NCBI's Sequence Read Archive (SRA) \n2. 
*Local Files*: Processes existing local FASTQ files\n\nThe pipeline includes intelligent file concatenation for multi-replicate samples and adaptive read subsampling based on assay type and genome size.\n\n* Key Features\n\n** Multi-Sample Concatenation\n- Automatically detects and concatenates FASTQ files from multiple rows with the same sample ID\n- Maintains proper pairing for paired-end sequencing data\n- Supports both single-end and paired-end data formats\n\n** Intelligent Read Subsampling\n- Calculates optimal read counts based on assay type (DNASeq/RNASeq) and genome size\n- DNASeq: targets 30× coverage\n- RNASeq: targets 50× coverage (higher due to expression variation)\n- Bounded to 1M-100M reads per sample\n- Uses seqtk for reproducible random subsampling\n\n** Flexible Input Sources\n- Downloads from SRA using prefetch and fasterq-dump\n- Processes local FASTQ files with path normalization\n- Handles compressed (.gz) and uncompressed FASTQ files\n\n* Quick Start\n\n** Prerequisites\n- Nextflow (≥21.04.0)\n- Docker or Singularity/Apptainer\n- Input samplesheet in CSV format\n\n** Basic Usage\n\n#+begin_src bash\n# Download from SRA\nnextflow run main.nf \\\n  --fromSra true \\\n  --input /path/to/samplesheet \\\n  --outDir /path/to/output\n\n# Process local files\nnextflow run main.nf \\\n  --input /path/to/samplesheet \\\n  --outDir /path/to/output\n\n# RNA-seq with custom genome size\nnextflow run main.nf \\\n  --assayType RNASeq \\\n  --genomeSize 120000000 \\\n  --input /path/to/samplesheet\n#+end_src\n\n* Input Format\n\nThe pipeline expects a CSV samplesheet with the following columns:\n\n| Column | Description | Example |\n|--------|-------------|---------|\n| sample | Sample identifier (can repeat for replicates) | =sample1= |\n| fastq_1 | Path to R1 FASTQ or SRA accession | =data/sample1_R1.fastq.gz= |\n| fastq_2 | Path to R2 FASTQ (paired-end only; leave empty for single-end) | =data/sample1_R2.fastq.gz= |\n| var1 | Additional sample metadata | 
=treatment_A= |\n\n** Example Samplesheet\n#+begin_src csv\nsample,fastq_1,fastq_2,var1\nsample1,SRR123456,,control\nsample1,SRR123457,,control\nsample2,data/sample2_R1.fastq.gz,data/sample2_R2.fastq.gz,treatment\n#+end_src\n\n*Note*: FASTQ files from multiple rows that share a sample ID are concatenated automatically.\n\n* Parameters\n\n** Core Parameters\n- =--input=: Directory containing the samplesheet (default: =$launchDir/data/samplesheet=)\n- =--samplesheetName=: Name of the samplesheet file (default: =samplesheet.csv=)\n- =--fromSra=: Download from SRA instead of processing local files (default: =false=)\n- =--outDir=: Output directory (default: =$launchDir/ngs-samples-output=)\n\n** Subsampling Parameters\n- =--assayType=: Sequencing assay type - \"DNASeq\" or \"RNASeq\" (default: =\"DNASeq\"=)\n- =--genomeSize=: Target genome size in base pairs (default: =\"3000000000\"= for human)\n- =--maxReads=: Manual override for maximum reads per sample (default: =null=)\n\n* Output\n\nThe pipeline produces:\n- *Processed FASTQ files*: Concatenated and subsampled as needed\n- *Standardized samplesheet*: CSV with absolute paths to processed files\n- *Process logs*: Detailed execution information\n\nOutput samplesheet format:\n#+begin_src csv\nsample,fastq_1,fastq_2,var1\nsample1,/abs/path/sample1_subsampled.fastq.gz,,control\nsample2,/abs/path/sample2_1_subsampled.fastq.gz,/abs/path/sample2_2_subsampled.fastq.gz,treatment\n#+end_src\n\n* Pipeline Architecture\n\nThe pipeline follows this processing flow:\n\n1. *Input parsing*: Read and group samplesheet by sample ID\n2. *File acquisition*: Download from SRA or validate local files\n3. *Concatenation*: Merge replicates per sample (if multiple entries)\n4. *Subsampling*: Limit reads based on coverage targets\n5. 
*Formatting*: Generate final samplesheet with absolute paths\n\n** Container Images\n- SRA tools: =biocontainers/sra-tools=\n- Subsampling: =staphb/seqtk:1.4=\n- File processing: =docker.io/veupathdb/alpine_bash:1.0.0=\n\n* Configuration\n\n** Docker (default)\n#+begin_src bash\nnextflow run main.nf -c conf/docker.config\n#+end_src\n\n** Singularity\n#+begin_src bash\nnextflow run main.nf -c conf/singularity.config\n#+end_src\n\n** HPC with LSF\n#+begin_src bash\nnextflow run main.nf -c conf/lsf.config\n#+end_src\n\n* Examples\n\n** Human Whole Genome Sequencing\n#+begin_src bash\nnextflow run main.nf \\\n  --assayType DNASeq \\\n  --genomeSize 3000000000 \\\n  --input data/human_wgs\n#+end_src\n\n** Yeast RNA-seq\n#+begin_src bash\nnextflow run main.nf \\\n  --assayType RNASeq \\\n  --genomeSize 12000000 \\\n  --input data/yeast_rnaseq\n#+end_src\n\n** Custom Read Limit\n#+begin_src bash\nnextflow run main.nf \\\n  --maxReads 5000000 \\\n  --input data/pilot_study\n#+end_src\n\n* Development\n\n** Testing\n#+begin_src bash\n# Test individual modules (requires nf-test)\nnf-test test modules/nf-core/sratools/fasterqdump/tests/main.nf.test\nnf-test test modules/nf-core/sratools/prefetch/tests/main.nf.test\n#+end_src\n\n** Contributing\nSee =CLAUDE.md= for detailed development guidance and architecture information.\n\n* Support\n\nFor issues and feature requests, please contact the VEuPathDB development team or create an issue in the project repository.","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fveupathdb%2Fngs-samples-nextflow","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fveupathdb%2Fngs-samples-nextflow","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fveupathdb%2Fngs-samples-nextflow/lists"}