{"id":19749465,"url":"https://github.com/mskcc/nf-fastq-plus","last_synced_at":"2026-06-17T14:31:54.908Z","repository":{"id":40703140,"uuid":"277581320","full_name":"mskcc/nf-fastq-plus","owner":"mskcc","description":"Generate IGO fastqs, bams, stats and fingerprinting","archived":false,"fork":false,"pushed_at":"2023-01-24T05:28:07.000Z","size":2797,"stargazers_count":1,"open_issues_count":35,"forks_count":0,"subscribers_count":6,"default_branch":"master","last_synced_at":"2026-06-05T11:33:27.340Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":null,"language":"Shell","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/mskcc.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2020-07-06T15:35:33.000Z","updated_at":"2024-01-09T08:36:16.000Z","dependencies_parsed_at":"2023-02-06T16:30:38.493Z","dependency_job_id":null,"html_url":"https://github.com/mskcc/nf-fastq-plus","commit_stats":null,"previous_names":[],"tags_count":3,"template":false,"template_full_name":null,"purl":"pkg:github/mskcc/nf-fastq-plus","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/mskcc%2Fnf-fastq-plus","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/mskcc%2Fnf-fastq-plus/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/mskcc%2Fnf-fastq-plus/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/mskcc%2Fnf-fastq-plus/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/mskcc","download_url":"https://codeload.github.com/mskcc/nf-fastq-plus/tar.gz/refs/heads/master","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitH
ub/repositories/mskcc%2Fnf-fastq-plus/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":34453431,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-05-26T15:22:16.424Z","status":"online","status_checked_at":"2026-06-17T02:00:05.408Z","response_time":127,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2024-11-12T02:26:36.286Z","updated_at":"2026-06-17T14:31:54.891Z","avatar_url":"https://github.com/mskcc.png","language":"Shell","funding_links":[],"categories":[],"sub_categories":[],"readme":"# nf-fastq-plus\nGenerate IGO fastqs, bams, stats and fingerprinting\n\n## Run\nThere are two options for running the modules in this pipeline - \n* [Demultiplex and Stats](#demultiplex-and-stats): Includes all demultiplexing and stats for a sequencing run\n* [Stats Only](#stats-only): Runs only the stats on a specified demultiplexed directory\n\n**Links for Developers**\n* [Project Structure](#project-structure)\n* [Testing](#testing)\n* [Nextflow Config](#nextflow-config)\n* [Crontab Setup](#crontab-setup)\n* [Docker Container](#docker-container)\n* [Troubleshooting/Common-Issues](#debug)\n\n### Demultiplex and Stats\n**Description**: Runs end-to-end pipeline of demultiplexing and stats. 
The input is the name of the sequencing \nrun.\n```\n# Basic\nnextflow main.nf --run ${RUN}\n\n# Skip demultiplexing\nnextflow main.nf --run ${RUN} --force true\n\n# Run demux and stats only on one request\nnextflow main.nf --run ${RUN} --filter ${REQUEST_ID}\n\n# Run in background\nnohup nextflow main.nf --run ${RUN} --force true -bg\n\n# Push pipeline updates to nf-dashboard\nnohup nextflow main.nf --run ${RUN} --force true -with-weblog 'http://dlviigoweb1:4500/api/nextflow/receive-nextflow-event' -bg  \n```\n\n#### Arguments `(--arg)`\n* `--run`: string (required), directory name of the sequencing directory \n  \u003e Eg: `210406_JOHNSAWYERS_0277_000000000-G7H54`\n* `--force`: string (optional), skips the demultiplexing if already completed\n  \u003e Eg: `true`,  `false`\n* `--filter`: string (optional), only run requests that match this filter\n  \u003e Eg: `10000_B` (only runs samples in 10000_B request), `10000` (only runs samples in 10000 request, i.e. NOT `10000_B` samples)\n\n#### Options `(-opt)`\n* `-bg`: run process in background \n* `-with-weblog`: publish events to an API\n\n### Stats Only\n**Description**: Runs stats given a demultiplex output\n```\n# Basic\nnextflow samplesheet_stats_main.nf --ss ${SAMPLE_SHEET} --dir ${DEMULTIPLEX_DIRECTORY} \n\n# Run stats only on one request\nnextflow samplesheet_stats_main.nf --ss ${SAMPLE_SHEET} --dir ${DEMULTIPLEX_DIRECTORY} --filter ${REQUEST_ID}\n\n# Run in background\nnohup nextflow samplesheet_stats_main.nf --ss ${SAMPLE_SHEET} --dir ${DEMULTIPLEX_DIRECTORY} -bg  \n```\n\n#### Arguments `(--arg)`\n* `--dir`: string (required), Absolute path to the demultiplexed directory \n  \u003e Eg: `/igo/work/FASTQ/DIANA_0333_BH53GNDRXY_i7`\n* `--ss`: string (required), Absolute path to the sample sheet that CREATED the value of `--dir`\n  \u003e Eg: `/home/igo/DividedSampleSheets/SampleSheet_210407_DIANA_0333_BH53GNDRXY_i7.csv`\n* `--filter`: string (optional), only run requests that match this 
filter\n  \u003e Eg: `10000_B` (only runs samples in 10000_B request), `10000` (only runs samples in 10000 request, i.e. NOT `10000_B` samples)\n\n#### Options `(-opt)`\n* `-bg`: run process in background \n         \n### Re-running Pipeline\n* Demultiplexing will fail if the FASTQ directory already exists.\n    * If demultiplexing is required, remove the FASTQ directory\n    * If demultiplexing can be skipped, add the `--force true` option\n* Stats will be skipped if the final BAM for that sample has been written to `${STATS_DIR}/${RUNNAME}/${RUN_TAG}.bam`\n    * If stats need to be re-run, remove relevant BAMs from the `${STATS_DIR}` folder specified in `nextflow.config`\n\n## For Development\n### Please Read:\n* Create a `feature/{YOUR_CHANGE}` branch for new features or `hotfix/{YOUR_FIX}` for future development\n* Before merging your branch into `master`, wait for the GitHub Actions to run and verify that all checks pass. **Do not\n merge changes if there are failed tests**. Either talk to the IGO Data Team or fix the tests.\n\n### Project Structure\n* Follow the project structure below -\n```\n.\n├── README.md\n├── main.nf\n├── modules\n│   └── m1.nf\n├── nextflow.config\n└── templates\n    └── process.sh\n```\n* `templates`: Where all scripts (bash, python, etc.) will go. Don't rename this directory because nextflow is set up to \nlook for a directory of this name where the nextflow script is run\n* `modules`: Directory containing nextflow modules that can be imported into `main.nf`\n\n### Adding a new workflow\n* Passing sample-specific parameters (e.g. Reference Genome, Recipe, etc.) is done via a params file w/ `key=value` \nspace-delimited values. To use this file, make sure that a `{PREVIOUS_WORKFLOW}.out.PARAMS` file is passed to the \nworkflow and specified as a path-type channel. Make sure to use the `.out.PARAMS` of the workflow that the `next_wkflw` \nshould be dependent on. I've noticed that nextflow won't pass all outputs of a workflow together (e.g. 
BAM of one task \nand the run params folder of another task) \n\n**Steps for Adding a New Module**\n1) Add module\n```\n├── modules\n│   └── process.nf\n```\n```\nprocess {PROCESS_NAME} {\n  [ directives ]\n\n  output:\n  ...\n  stdout()\n\n  shell:\n  template '{PROCESS_SCRIPT}'\n}\n```\n* You don't need to import the template script. From the documentation, \"Nextflow looks for the template file in the \ndirectory templates that must exist in the same folder where the Nextflow script file is located\"\n* Note: Add `stdout()` as an output if you would like to log the output to the configured log file\n\n2) Add template\n```\n└── templates\n    └── process.sh\n```\n* Write a script with the appropriate header (e.g. `#!/bin/bash`) that includes the following\n\t* `Nextflow Inputs`: Inputs defined as nextflow `Input` values. Add (Input) if defined in process `Input` or \n\t(Config) if defined in `nextflow.config`\n\t* `Nextflow Outputs`: Outputs that will be defined in the execution context\n\t* `Run`: A sample command of how to run the script\n```\n#!/bin/bash\n# Submits demultiplexing jobs\n# Nextflow Inputs:\n#   RUN_TO_DEMUX_DIR: Absolute path to directory of the Run to demux (Defined as input in nextflow process)\n# Nextflow Outputs:\n#   DEMUXED_RUN: Name of run to demux, given by the name of @RUN_TO_DEMUX_DIR\n# Run:\n#   RUN_TO_DEMUX_DIR=/igo/sequencers/michelle/200814_MICHELLE_0249_AHMNCJDRXX ./demultiplex.sh\n```\n\n3) Emit PARAM file (Only if downstream processes are dependent on the output)\n* If your process is a dependency of downstream processes, emit the PARAMS file (`nextflow.config` parameter \n`RUN_PARAMS_FILE`) so that it can be read directly by the receiving channel along w/ the process's value.\n```\nprocess task {\n  ...\n\n  input:\n  path PARAMS\n  path INPUT\n \n  output:\n  path \"${RUN_PARAMS_FILE}\", emit: PARAMS       // Emit the same params value passed into the task\n  path '*.bam', emit: VALUE\n\n  shell:\n  template 
'task.sh'\n}\n\nworkflow wkflw {\n  take:\n    PARAMS\n    INPUT\n\n  main:\n    task( PARAMS, INPUT )\n  \n  emit:\n    PARAMS = task.out.PARAMS                    // Assign PARAMS so that it's available in the main.nf\n    VALUE = task.out.VALUE\n}\n```\n* **Why?** Nextflow channels emit asynchronously. This means that upstream processes will emit and pass to the next \navailable process and not necessarily the expected one. For instance, if process A emits parameters used by all \ndownstream processes and process B emits the value that will be transformed by that parameter, process C will not \nnecessarily receive the process A parameters that apply to the value emitted by process B because each process has an \nindependent, asynchronous channel.\n \n4) (Optional) Add logging\n\nIn the modules, convert the exported member to a workflow that calls an included `log_out` process to log everything \nsent to stdout by the process. See below:\n```\ninclude log_out as out from './log_out'\n\nprocess task {\n  output:\n  stdout()\t\t// Add this to your outputs\n  ...\n\n  shell:\n  '''\n  echo \"Hello World\" \t# Example sent to STD OUT\n  ...\n  '''\n}\n\nworkflow task_wkflw { \t// This is what will actually be exported\n  main:\n    task | out\n}\n```\n\n#### Logging\nThere are three files that log information - \n* `LOG_FILE`: All output is logged here (except commands)\n* `CMD_FILE`: All stat commands are logged to this file\n* `DEMUX_LOG_FILE`: All demultiplexing commands are logged here\n\n### Testing \nDocker Container Actions run our integration tests on GitHub. 
To test changes, please build the Dockerfile from the root\n and verify no errors are generated from the `samplesheet_stats_main_test_hwg.sh` and `cellranger_demux_stats.sh` \n scripts.\n```\ndocker image build -t nf-fastq-plus-playground .\n\n# Test stats-only workflow\ndocker run --entrypoint /nf-fastq-plus/testPipeline/e2e/samplesheet_stats_main_test_hwg.sh -v $(pwd)/../nf-fastq-plus:/nf-fastq-plus nf-fastq-plus-playground\n\n# Test e2e (demux \u0026 stats)\ndocker run --entrypoint /nf-fastq-plus/testPipeline/e2e/cellranger_demux_stats.sh -v $(pwd)/../nf-fastq-plus:/nf-fastq-plus nf-fastq-plus-playground\n```\n\n## Nextflow Config\nModify directory locations, binaries, etc. in the `nextflow.config` file\n\n### Important Files\n``` \nLOG_FILE        # Logs all output from the pipeline\nCMD_FILE        # Logs all commands from the pipeline (e.g. was bcl2fastq run w/ 1 or 0 mismatches?)\nDEMUX_LOG_FILE  # Logs output of bcl2fastq commands\n```\n\n### Important Directories\n```\nSTATS_DIR                   # Where final BAMs are written to\nSTATSDONEDIR                # Where stat (.txt) files \u0026 cellranger output are written to\nPROCESSED_SAMPLE_SHEET_DIR  # Where split samplesheets go (these are used for demuxing and stats)\nLAB_SAMPLE_SHEET_DIR        # Original source of samplesheets\nCOPIED_SAMPLE_SHEET_DIR     # Where original samplesheets are copied to\nCROSSCHECK_DIR              # Directory used for fingerprinting\nSHARED_SINGLE_CELL_DIR      # Directory used by DLP process to create metadata.yaml (should happen automatically)\n```\n\n### Other\n```\nLOCAL_MEM                   # GB of memory to give a process (e.g. demultiplexing) if executor=local\n```\n\n## Crontab Setup\nThe pipeline can be kicked off automatically by the `crontab/detect_copied_sequencers.sh` script. 
Add the following\nto enable the crontab\n```\n# crontab -e\nSHELL=/bin/bash\n\n# Add path to bsub executable\nPATH=${PATH}:/igoadmin/lsfigo/lsf10/10.1/linux3.10-glibc2.17-x86_64/bin\n\n# Load the LSF profile prior to running the command\n* * * * * . /igoadmin/lsfigo/lsf10/conf/profile.lsf; lsload; bhosts; /PATH/TO/detect_copied_sequencers.sh \u003e\u003e /PATH/TO/nf-fastq-plus.log 2\u003e\u00261\n```\n\n## Docker Container\n* v1.0.0 - First Release\n* v1.0.1 - Testing against \"ncbi-genomes-2021-09-23\" for GRCh38.p13\n* v1.0.2 - Testing against \"GRCh38.p13.dna.primary.assembly.fa\" for GRCh38.p13\n\nNote - Changes to the github actions using the docker file need to be tagged, e.g.\n```\nVERSION_NUMBER=...      # e.g. \"v2\"\ngit add ...\ngit commit -m \"My change\"\ngit tag -a -m \"very important change\" ${VERSION_NUMBER} \ngit push --follow-tags\n```\n[REF](https://docs.github.com/en/actions/creating-actions/creating-a-docker-container-action)\n\n## Debug\n### Demultiplexing\n\nI) Samplesheet index length doesn't match `RunInfo.xml` length\n\nBCL2FASTQ\n* Use `use-bases-mask`\n\nDRAGEN   \n    ```\n    ERROR: Sample Sheet Error: SampleSheet.csv sample #1 (index 'GGTGAACC') has an index of length 8 bases, \n    but a length of 10 was expected based upon RunInfo.xml or the OverrideCycles setting.\n    ```\nSolution: Mask the indices by adding the `OverrideCycles` option to the SampleSheet \n    \n    ```\n    [Settings],,,,,,,,\n    OverrideCycles,Y151;I8N2;I8N2;Y151\n    ,,,,,,,,\n    ```\n    \nII) Only the `___MD.txt` file is available in the `STATSDONEDIR` for a sample\n* Delete the \"final sample BAM\" in `STATSDONEDIR`, e.g.\n    ```\n    RUN_NAME=...        # ROSALIND_0001_AHNKNJDSX2\n    SAMPLE_NAME=....    # 01001_A_1\n    rm /igo/staging/stats/${RUN_NAME}/*${SAMPLE_NAME}*.ba*\n    ```\n* Why? 
The final sample bam was created, but the stats for that BAM did not finish or were interrupted.\n    * This can happen when the pipeline is interrupted after writing the \"final\" sample `.bam` to `SAMPLE_BAM_DIR`. The \n    `generate_run_params_wkflw` checks whether this `.bam` file is written\n    * Issue to address this [nf-fastq-plus issue 202](https://github.com/mskcc/nf-fastq-plus/issues/202)\n    \nIII) My github action change isn't reflecting in the integrated tests\n* See tag notes in [Docker Container](#docker-container)\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fmskcc%2Fnf-fastq-plus","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fmskcc%2Fnf-fastq-plus","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fmskcc%2Fnf-fastq-plus/lists"}