{"id":48633306,"url":"https://github.com/sanjaysgk/ipg","last_synced_at":"2026-05-18T01:10:09.998Z","repository":{"id":279138449,"uuid":"937825275","full_name":"sanjaysgk/ipg","owner":"sanjaysgk","description":"Immunopeptidogenomics — cryptic peptide database construction from RNA-seq. An nf-core-style port of the 31-step IPG pipeline (Scull et al. Mol Cell Proteomics 2021), built for the Li and Purcell Labs at Monash University.","archived":false,"fork":false,"pushed_at":"2026-04-20T04:38:54.000Z","size":789,"stargazers_count":0,"open_issues_count":0,"forks_count":0,"subscribers_count":1,"default_branch":"main","last_synced_at":"2026-04-20T04:43:58.710Z","etag":null,"topics":["bioinformatics","cancer-immunotherapy","cancer-research","cryptic-peptides","gatk4","hla","immunopeptidogenomics","immunopeptidomics","mass-spectrometry","mhc","monash-university","mutect2","neoantigens","nextflow","nf-core","pixi","reproducible-research","rnaseq","star-aligner","stringtie"],"latest_commit_sha":null,"homepage":"https://doi.org/10.1016/j.mcpro.2021.100143","language":"C","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/sanjaysgk.png","metadata":{"files":{"readme":"README.md","changelog":"CHANGELOG.md","contributing":".github/CONTRIBUTING.md","funding":null,"license":"LICENSE","code_of_conduct":"CODE_OF_CONDUCT.md","threat_model":null,"audit":null,"citation":"CITATIONS.md","codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null,"notice":null,"maintainers":null,"copyright":null,"agents":null,"dco":null,"cla":null}},"created_at":"2025-02-24T00:46:46.000Z","updated_at":"2026-04-15T01:26:08.000Z","dependencies_parsed_at":null,"dependency_job_id":"f3e0a953-fcec-4a01-a563-b0de283e1949","html_url":"https://github.com/sanjaysgk/ipg","commit_stats":n
ull,"previous_names":["sanpme66/ipg","sanjaysgk/ipg"],"tags_count":3,"template":false,"template_full_name":null,"purl":"pkg:github/sanjaysgk/ipg","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/sanjaysgk%2Fipg","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/sanjaysgk%2Fipg/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/sanjaysgk%2Fipg/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/sanjaysgk%2Fipg/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/sanjaysgk","download_url":"https://codeload.github.com/sanjaysgk/ipg/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/sanjaysgk%2Fipg/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":33161412,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-05-17T22:39:12.733Z","status":"ssl_error","status_checked_at":"2026-05-17T22:39:10.741Z","response_time":107,"last_error":"SSL_read: unexpected eof while 
reading","robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":false,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["bioinformatics","cancer-immunotherapy","cancer-research","cryptic-peptides","gatk4","hla","immunopeptidogenomics","immunopeptidomics","mass-spectrometry","mhc","monash-university","mutect2","neoantigens","nextflow","nf-core","pixi","reproducible-research","rnaseq","star-aligner","stringtie"],"created_at":"2026-04-09T06:03:47.890Z","updated_at":"2026-05-18T01:10:09.993Z","avatar_url":"https://github.com/sanjaysgk.png","language":"C","funding_links":[],"categories":[],"sub_categories":[],"readme":"# sanjaysgk/ipg\n\n[![GitHub Actions CI Status](https://github.com/sanjaysgk/ipg/actions/workflows/ci.yml/badge.svg)](https://github.com/sanjaysgk/ipg/actions/workflows/ci.yml)\n[![nf-test](https://img.shields.io/badge/unit_tests-nf--test-337ab7.svg)](https://www.nf-test.com)\n[![Nextflow](https://img.shields.io/badge/nextflow%20DSL2-%E2%89%A525.10.4-23aa62.svg)](https://www.nextflow.io/)\n[![pixi](https://img.shields.io/badge/dev_env-pixi-yellow.svg)](https://pixi.sh)\n[![run with singularity](https://img.shields.io/badge/run%20with-singularity-1d355c.svg?labelColor=000000)](https://sylabs.io/docs/)\n[![run with docker](https://img.shields.io/badge/run%20with-docker-0db7ed?labelColor=000000\u0026logo=docker)](https://www.docker.com/)\n[![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](LICENSE)\n\n\u003e **Immunopeptidogenomics — cryptic peptide database construction from RNA-seq.**\n\u003e An [nf-core](https://nf-co.re)-style port of the 31-step bash pipeline used by 
the\n\u003e [Li Lab](https://research.monash.edu/en/persons/chen-li) and\n\u003e [Purcell Lab](https://research.monash.edu/en/persons/anthony-purcell) at Monash University,\n\u003e implementing the cryptic peptide discovery method described in\n\u003e [Scull et al. _Mol Cell Proteomics_ 2021](https://doi.org/10.1016/j.mcpro.2021.100143).\n\n## What this pipeline does\n\n`sanjaysgk/ipg` turns paired-end RNA-seq reads into a **cryptic peptide search database**\nsuitable for downstream MS/MS immunopeptidomics searches. Starting from FASTQs, it:\n\n1. Aligns reads with **STAR** (two-pass) and infers strandedness with **RSeQC**.\n2. Assembles novel transcripts with **StringTie** and reconciles them with the reference\n   annotation via **gffcompare** (`-R -V -C` for the consensus combined GTF).\n3. Builds a variant-calling-ready BAM via the GATK4 RNA-seq Best Practices chain\n   (FastqToSam → MergeBamAlignment → MarkDuplicates → SplitNCigarReads).\n4. Recalibrates base qualities with two passes of **BaseRecalibrator + ApplyBQSR**.\n5. Calls somatic variants with **Mutect2** in tumour-only mode against a gnomAD-style\n   germline allele-frequency database, then **CalculateContamination** /\n   **FilterMutectCalls** / **SelectVariants** for clean PASS-only sites-only VCFs.\n6. 
Builds the cryptic peptide FASTA database via the IPG custom C tools\n   ([`curate_vcf`](https://github.com/sanjaysgk/immunopeptidogenomics),\n   [`revert_headers`](https://github.com/sanjaysgk/immunopeptidogenomics),\n   [`alt_liftover`](https://github.com/sanjaysgk/immunopeptidogenomics),\n   [`triple_translate`](https://github.com/sanjaysgk/immunopeptidogenomics),\n   [`squish`](https://github.com/sanjaysgk/immunopeptidogenomics)) plus\n   `gff3sort` and `gffread` from bioconda.\n\nOptionally, a **post-MS analysis** step (`--step post_ms`) runs the two-phase\n`db_compare` + `origins` workflow on PEAKS (or another search engine) results to\nidentify and annotate cryptic-only peptides.\n\nAn **MS search** step (`--step ms_search`) runs up to four search\nengines (**MSFragger**, **Comet**, **Sage**, **PEAKS**) in parallel against the\ncryptic FASTA, applies **mokapot** FDR control per engine, rescores PSMs with\n**MS2Rescore**, and merges results at 1% peptide-level FDR. Optional\nimmunoinformatics gates (`--run_netmhcpan`, `--run_netmhciipan`,\n`--run_gibbscluster`, `--run_flashlfq`, `--run_blastp_host`) pick up the\nintegrated peptide table and emit a per-sample HTML report summarising HLA\nbinding, motif clusters, quantification, and host-background hits. 
See\n[`docs/usage.md`](docs/usage.md) for the full invocation.\n\nThe 31 legacy steps are grouped into **seven typed nf-core subworkflows**:\n\n```mermaid\n%%{init: {\n  \"theme\": \"base\",\n  \"themeVariables\": {\n    \"primaryColor\":         \"#e0f2fe\",\n    \"primaryTextColor\":     \"#0c4a6e\",\n    \"primaryBorderColor\":   \"#0369a1\",\n    \"lineColor\":            \"#0369a1\",\n    \"secondaryColor\":       \"#fef3c7\",\n    \"tertiaryColor\":        \"#dcfce7\",\n    \"fontFamily\":           \"ui-sans-serif, system-ui, -apple-system, sans-serif\",\n    \"fontSize\":             \"14px\"\n  }\n}}%%\nflowchart LR\n    %% ----- nodes -----\n    INPUT([\"fa:fa-dna \u003cb\u003ePaired-end FASTQ\u003c/b\u003e\u003cbr/\u003e+ samplesheet.csv\"]):::input\n\n    subgraph QC1[\"1. ALIGN_QC \u0026nbsp; \u003cspan style='color:#64748b;font-size:11px'\u003e(steps 1–3)\u003c/span\u003e\"]\n        direction TB\n        STAR1[\"STAR 2-pass align\"]\n        SORT1[\"samtools sort + index\"]\n        RSEQC[\"RSeQC infer_experiment\"]\n        STAR1 --\u003e SORT1 --\u003e RSEQC\n    end\n\n    subgraph TA[\"2. TRANSCRIPT_ASSEMBLY \u0026nbsp; \u003cspan style='color:#64748b;font-size:11px'\u003e(steps 4–5)\u003c/span\u003e\"]\n        direction TB\n        STRINGTIE[\"StringTie\"]\n        GFFCOMPARE[\"gffcompare\u003cbr/\u003e(-R -V -C)\"]\n        STRINGTIE --\u003e GFFCOMPARE\n    end\n\n    subgraph BP[\"3. 
BAM_PREP \u0026nbsp; \u003cspan style='color:#64748b;font-size:11px'\u003e(steps 6–12)\u003c/span\u003e\"]\n        direction TB\n        SORTQN[\"samtools sort -n\"]\n        F2S[\"GATK4 FastqToSam\"]\n        MERGE[\"GATK4 MergeBamAlignment\"]\n        MARKDUP[\"GATK4 MarkDuplicates\"]\n        SPLIT[\"GATK4 SplitNCigarReads\"]\n        VAL[\"GATK4 ValidateSamFile\u003cbr/\u003e\u003cspan style='color:#64748b;font-size:11px'\u003e(audit only)\u003c/span\u003e\"]\n        SORTQN --\u003e F2S --\u003e MERGE --\u003e MARKDUP --\u003e SPLIT --\u003e VAL\n    end\n\n    subgraph BQ[\"4. BQSR \u0026nbsp; \u003cspan style='color:#64748b;font-size:11px'\u003e(steps 13–16)\u003c/span\u003e\"]\n        direction TB\n        BR1[\"BaseRecalibrator (1st)\"]\n        APPLY[\"ApplyBQSR\"]\n        BR2[\"BaseRecalibrator (2nd)\"]\n        AC[\"AnalyzeCovariates\u003cbr/\u003e\u003cspan style='color:#64748b;font-size:11px'\u003e(QC plot)\u003c/span\u003e\"]\n        BR1 --\u003e APPLY --\u003e BR2 --\u003e AC\n    end\n\n    subgraph MC[\"5. MUTECT_CALLING \u0026nbsp; \u003cspan style='color:#64748b;font-size:11px'\u003e(steps 17–23)\u003c/span\u003e\"]\n        direction TB\n        M2[\"Mutect2 tumour-only\"]\n        LOM[\"LearnReadOrientationModel\"]\n        GPS[\"GetPileupSummaries\"]\n        CC[\"CalculateContamination\"]\n        FMC[\"FilterMutectCalls\"]\n        SV[\"SelectVariants\u003cbr/\u003e(PASS, sites-only)\"]\n        CV[\"curate_vcf\u003cbr/\u003e\u003cspan style='color:#64748b;font-size:11px'\u003e(IPG)\u003c/span\u003e\"]\n        M2 --\u003e LOM\n        M2 --\u003e GPS --\u003e CC\n        LOM --\u003e FMC\n        CC --\u003e FMC\n        M2 --\u003e FMC --\u003e SV --\u003e CV\n    end\n\n    subgraph DC[\"6. 
DB_CONSTRUCT \u0026nbsp; \u003cspan style='color:#64748b;font-size:11px'\u003e(steps 24–31)\u003c/span\u003e\"]\n        direction TB\n        GFF3[\"gff3sort\"]\n        GFR[\"gffread\"]\n        TT[\"triple_translate\u003cbr/\u003e\u003cspan style='color:#64748b;font-size:11px'\u003e(IPG)\u003c/span\u003e\"]\n        VAR{{\"--include_variant_peptides\u003cbr/\u003etrue?\"}}:::flag\n        IFR[\"IndexFeatureFile + FARM\u003cbr/\u003e+ revert_headers + alt_liftover\u003cbr/\u003e\u003cspan style='color:#64748b;font-size:11px'\u003e(unmasked + indel branches)\u003c/span\u003e\"]\n        SQUISH[\"squish\u003cbr/\u003e\u003cspan style='color:#64748b;font-size:11px'\u003e(IPG)\u003c/span\u003e\"]\n        GFF3 --\u003e GFR --\u003e TT --\u003e SQUISH\n        VAR --\u003e|yes| IFR --\u003e SQUISH\n    end\n\n    QC2[\"FastQC\"]:::qc\n    MQC[\"fa:fa-chart-bar \u003cb\u003eMultiQC report\u003c/b\u003e\"]:::report\n    OUT([\"fa:fa-database \u003cb\u003ecryptic_peptide.fasta\u003c/b\u003e\u003cbr/\u003e(MS/MS-ready DB)\"]):::deliverable\n\n    %% ----- edges -----\n    INPUT --\u003e QC2\n    INPUT --\u003e STAR1\n    SORT1 --\u003e STRINGTIE\n    SORT1 --\u003e SORTQN\n    SPLIT --\u003e BR1\n    APPLY --\u003e M2\n    GFFCOMPARE --\u003e GFF3\n    SV --\u003e IFR\n    SQUISH --\u003e OUT\n\n    QC2 --\u003e MQC\n    RSEQC --\u003e MQC\n    MARKDUP --\u003e MQC\n    AC --\u003e MQC\n    GFFCOMPARE --\u003e MQC\n    FMC --\u003e MQC\n\n    %% ----- classes -----\n    classDef input        fill:#fef9c3,stroke:#ca8a04,stroke-width:2px,color:#713f12\n    classDef deliverable  fill:#bbf7d0,stroke:#16a34a,stroke-width:3px,color:#14532d\n    classDef report       fill:#e9d5ff,stroke:#9333ea,stroke-width:2px,color:#581c87\n    classDef qc           fill:#fce7f3,stroke:#db2777,stroke-width:2px,color:#831843\n    classDef flag         fill:#fef3c7,stroke:#d97706,stroke-width:2px,color:#92400e\n\n    style QC1 fill:#f0f9ff,stroke:#0284c7,stroke-width:2px\n    style TA  
fill:#f0fdf4,stroke:#16a34a,stroke-width:2px\n    style BP  fill:#fef2f2,stroke:#dc2626,stroke-width:2px\n    style BQ  fill:#fefce8,stroke:#ca8a04,stroke-width:2px\n    style MC  fill:#faf5ff,stroke:#9333ea,stroke-width:2px\n    style DC  fill:#fff7ed,stroke:#ea580c,stroke-width:2px\n```\n\n## Quick start\n\n### 1. Install the dev environment\n\nThe repository ships a [pixi](https://pixi.sh) project that pins every tool\n(nextflow 25.10.4, nf-core 3.5.2, nf-test 0.9.5, GATK4, STAR, samtools, bcftools,\nstringtie, gffcompare, gffread, RSeQC, FastQC, MultiQC, OpenJDK 17, plus the\nlinting toolchain) to bit-for-bit reproducible versions via `pixi.lock`.\n\n```bash\ngit clone https://github.com/sanjaysgk/ipg.git\ncd ipg\npixi install\n```\n\nIf you don't have pixi: `curl -fsSL https://pixi.sh/install.sh | bash`.\n\n### 2. Build the chr22 test bundle (~5 minutes, one time)\n\n```bash\npixi run bash bin/build_test_bundle.sh\n```\n\nBy default this writes to `/fs04/scratch2/xy86/sanjay/ipg-test-data/`. Override\non a non-Monash machine:\n\n```bash\nTEST_BUNDLE_DIR=/some/path \\\nREFERENCE_DIR=/path/to/GRCh38 \\\nVARIANT_DIR=/path/to/variant-resources \\\nFASTQ_R1=/path/to/sample_R1.fq.gz \\\nFASTQ_R2=/path/to/sample_R2.fq.gz \\\npixi run bash bin/build_test_bundle.sh\n```\n\nThe bundle is **never committed to git** — only the build script is. See\n[`bin/build_test_bundle.sh`](bin/build_test_bundle.sh) for the exact recipe.\n\n### 3. 
Run the pipeline against the test bundle\n\n| Profile combination                | Container engine             | Use when                                                                                  |\n| ---------------------------------- | ---------------------------- | ----------------------------------------------------------------------------------------- |\n| `-profile test,pixi`               | none — tools from local PATH | Fastest iteration on a workstation/HPC compute node where pixi can run all tools natively |\n| `-profile test,singularity`        | apptainer / singularity      | Standard HPC; pulls biocontainers from `community.wave.seqera.io`                         |\n| `-profile test,docker`             | docker                       | Laptops, cloud, GitHub Actions CI                                                         |\n| `-profile test,monash,singularity` | singularity                  | Monash M3 SLURM cluster (`comp` partition, `xy86` account)                                |\n\n```bash\n# Fastest (no containers, all tools from pixi):\npixi run nextflow run . -profile test,pixi --outdir results\n```\n\nThe pipeline takes **~2 minutes** end-to-end against the chr22 test bundle on a\nsingle Linux box and produces the cryptic peptide database at\n`results/squish/\u003csample\u003e_cryptic.fasta`.\n\n### 4. Run on real data\n\nCreate a samplesheet `samplesheet.csv`:\n\n```csv\nsample,fastq_1,fastq_2,strandedness\nD100_liver,/path/to/D100-liver_R1.fastq.gz,/path/to/D100-liver_R2.fastq.gz,reverse\nD101_pancreas,/path/to/D101-pancreas_R1.fastq.gz,/path/to/D101-pancreas_R2.fastq.gz,reverse\n```\n\nThen run with explicit reference paths:\n\n```bash\npixi run nextflow run . 
\\\n    -profile monash,singularity \\\n    --input            samplesheet.csv \\\n    --outdir           /fs04/scratch2/xy86/sanjay/ipg-results \\\n    --fasta            /path/to/GRCh38.primary_assembly.genome.fa \\\n    --fasta_fai        /path/to/GRCh38.primary_assembly.genome.fa.fai \\\n    --fasta_dict       /path/to/GRCh38.primary_assembly.genome.dict \\\n    --gtf              /path/to/gencode.v44.primary_assembly.annotation.gtf \\\n    --star_index       /path/to/human_genome_index_GRCh38 \\\n    --rseqc_bed        /path/to/gencode_assembly.bed \\\n    --dbsnp            /path/to/dbsnp138.vcf \\\n    --dbsnp_tbi        /path/to/dbsnp138.vcf.idx \\\n    --known_indels     /path/to/Homo_sapiens_assembly38.known_indels.vcf.gz \\\n    --known_indels_tbi /path/to/Homo_sapiens_assembly38.known_indels.vcf.gz.tbi \\\n    --mills            /path/to/Mills_and_1000G_gold_standard.indels.hg38.vcf.gz \\\n    --mills_tbi        /path/to/Mills_and_1000G_gold_standard.indels.hg38.vcf.gz.tbi \\\n    --germline_resource     /path/to/small_exac_common_3.hg38.vcf.gz \\\n    --germline_resource_tbi /path/to/small_exac_common_3.hg38.vcf.gz.tbi\n```\n\n## The `--include_variant_peptides` flag\n\nBy default the cryptic peptide DB is built from the **reference assembly only**,\nmatching the legacy Scull et al. 2021 / D122_Lung run (verified empirically by\ninspecting the legacy `squish.log`). Pass `--include_variant_peptides true` to\nalso fold the alt-reference variant-derived peptide branches (unmasked + indel)\ninto the final database:\n\n```bash\npixi run nextflow run . -profile monash,singularity \\\n    --input samplesheet.csv \\\n    --include_variant_peptides true \\\n    [other reference args]\n```\n\n\u003e [!IMPORTANT]\n\u003e `--include_variant_peptides` does **not** switch the variant caller to matched\n\u003e tumour-normal mode. 
Variant calling is **always** performed in tumour-only\n\u003e Mutect2 mode against a gnomAD-style germline allele-frequency database; this\n\u003e pipeline does **not** support matched tumour-normal calling. The flag controls\n\u003e only whether the discovered variants get folded into the final cryptic peptide\n\u003e DB. Set `true` only when the sample is expected to harbour biologically\n\u003e meaningful somatic variants (e.g. tumour tissue, hypermutated cell lines,\n\u003e MMR-deficient samples). Leave at `false` for normal tissue, cell lines, or any\n\u003e sample where variant peptides would mostly add noise. This is a pipeline-level\n\u003e flag — to mix modes for a heterogeneous cohort, run the pipeline twice.\n\n## Post-MS analysis (`--step post_ms`)\n\nAfter running the DB construction pipeline, the cryptic peptide FASTA is searched\nagainst MS/MS data using PEAKS Online (or another search engine such as\nMSFragger, Comet, or Sage). The resulting PSM CSVs are then analysed with the\ntwo-phase `db_compare` + `origins` workflow\n([Scull et al. 
2021](https://doi.org/10.1016/j.mcpro.2021.100143)):\n\n```\nPhase 1: db_compare_v2.R  →  cryptic_only.txt\n         origins -s        →  origins_discard.txt + origins_unconventional.txt\n\nPhase 2: db_compare_v2.R (with -j discard -u unconventional)\n         →  unambiguous_unconventional.txt\n         origins (full Ensembl mode)  →  deep origin annotation\n```\n\n### Post-MS samplesheet\n\nCreate a CSV with one row per sample:\n\n```csv\nsample,cryptic_psm_csv,uniprot_psm_csv,cryptic_decoy_score,uniprot_decoy_score\nD122_liver,/path/to/D122_Liver_Cryptic_DB.db.psms.csv,/path/to/D122_Liver_Uniprot_DB.db.psms.csv,44.43,36.48\nD101_heart,/path/to/D101_Heart_Cryptic_DB.db.psms.csv,/path/to/D101_Heart_Uniprot_DB.db.psms.csv,3.64,3.25\n```\n\nThe `cryptic_decoy_score` and `uniprot_decoy_score` columns are the `-10lgP`\ndecoy thresholds from the respective PEAKS searches.\n\n### Running post-MS analysis\n\n```bash\npixi run nextflow run . -profile pixi,monash \\\n    --step post_ms \\\n    --post_ms_input       post_ms_samplesheet.csv \\\n    --uniprot_fasta       /path/to/uniprotkb_human_canonical_isoform.fasta \\\n    --transcriptome_fasta /path/to/D122_liver_transcriptome.fa \\\n    --prefix_tracking     /path/to/prefix.tracking \\\n    --outdir              results_post_ms\n```\n\nThe `--transcriptome_fasta` and `--prefix_tracking` files are outputs from the\nDB construction step (`gffread` and `gffcompare` respectively). 
You can find them\nin the pipeline output directory from the previous run.\n\n### Post-MS output\n\n```\nresults_post_ms/\n├── post_ms/\n│   ├── phase1/\n│   │   ├── db_compare/     Phase 1 cryptic-only peptide lists + plots\n│   │   └── origins/        Phase 1 origins (simple mode) — discard + unconventional lists\n│   └── phase2/\n│       ├── db_compare/     Phase 2 refined unambiguous unconventional peptides\n│       └── origins/        Phase 2 origins (full Ensembl annotation)\n└── pipeline_info/\n```\n\n## Output\n\n```\nresults/\n├── star/                    STAR alignment BAMs + .Log.final.out\n├── samtools/                sorted/indexed BAMs\n├── rseqc/                   strandedness inference reports\n├── stringtie/               assembled transcript GTFs\n├── gffcompare/              prefix.combined.gtf, prefix.tracking, .stats\n├── gatk4/                   Mutect2 / BQSR / contamination tables\n├── revert/                  alt-reference FASTAs (only when --include_variant_peptides=true)\n├── gff3sort/                sorted assembly GTF\n├── tt/                      triple_translate per-branch peptide FASTAs\n├── squish/\n│   └── \u003csample\u003e_cryptic.fasta   ← THE DELIVERABLE\n├── multiqc/\n│   └── multiqc_report.html      aggregated QC report\n└── pipeline_info/\n    ├── execution_report_\u003ctimestamp\u003e.html\n    ├── execution_timeline_\u003ctimestamp\u003e.html\n    └── pipeline_dag_\u003ctimestamp\u003e.html\n```\n\n## Profiles\n\n| Profile                     | Purpose                                                                                                  |\n| --------------------------- | -------------------------------------------------------------------------------------------------------- |\n| `test`                      | Use the chr22 test bundle (built by `bin/build_test_bundle.sh`)                                          |\n| `pixi`                      | Run every process from the local pixi env, no containers            
                                     |\n| `singularity` / `apptainer` | Pull biocontainers via singularity/apptainer (HPC default)                                               |\n| `docker`                    | Pull biocontainers via docker (laptop / cloud / CI default)                                              |\n| `monash`                    | SLURM executor on the Monash M3 `comp` partition under the `xy86` project, with shared singularity cache |\n\n## Architecture\n\n```\nsanjaysgk/ipg/\n├── main.nf                              entry point\n├── workflows/ipg.nf                     main workflow — chains the 6 subworkflows\n├── subworkflows/local/\n│   ├── align_qc/                        steps 1-3\n│   ├── transcript_assembly/             steps 4-5\n│   ├── bam_prep/                        steps 6-12\n│   ├── bqsr/                            steps 13-16\n│   ├── mutect_calling/                  steps 17-23\n│   ├── db_construct/                    steps 24-31 (branches on --include_variant_peptides)\n│   └── post_ms_analysis/               --step post_ms: 2-phase db_compare + origins\n├── modules/\n│   ├── nf-core/                         23 upstream nf-core modules (STAR, samtools, GATK4, etc.)\n│   └── local/                           10 local modules:\n│       ├── curate_vcf/                  IPG custom C tool (kescull)\n│       ├── revert_headers/              IPG custom C tool (kescull)\n│       ├── alt_liftover/                IPG custom C tool (kescull)\n│       ├── triple_translate/            IPG custom C tool (kescull)\n│       ├── squish/                      IPG custom C tool (kescull)\n│       ├── origins/                     IPG custom C tool — peptide origin annotation (kescull)\n│       ├── db_compare/                  R script — cryptic vs UniProt PSM comparison (kescull)\n│       ├── gff3sort/                    bioconda gff3sort wrapper\n│       ├── gatk4_validatesamfile/       missing-from-upstream wrapper\n│       └── 
gatk4_fastaalternatereferencemaker/  missing-from-upstream wrapper\n├── containers/\n│   └── ipg-tools/                       Reproducible Docker build of the kescull C tools\n│                                        from a pinned commit SHA → ghcr.io/sanjaysgk/ipg-tools\n├── conf/\n│   ├── base.config\n│   ├── modules.config                   per-process ext.args / ext.prefix / errorStrategy\n│   ├── test.config                      chr22 test bundle paths\n│   └── monash.config                    Monash M3 SLURM\n├── bin/\n│   └── build_test_bundle.sh             reproducible chr22 test bundle builder\n├── tests/                               nf-test pipeline-level tests\n├── pixi.toml                            dev env definition (committed)\n└── pixi.lock                            dev env lockfile (committed, bit-reproducible)\n```\n\n## Authors\n\n- **Sanjay SG Krishna** ([@sanjaysgk](https://github.com/sanjaysgk)) — pipeline port,\n  Li Lab, Monash University\n- **Kate Scull** — original IPG method + custom C tools, Purcell Lab, Monash University\n  (see [`kescull/immunopeptidogenomics`](https://github.com/kescull/immunopeptidogenomics))\n- **Chen Li** — supervision, Li Lab, Monash University\n- **Anthony W. Purcell** — supervision, Purcell Lab, Monash University\n\n## Citation\n\nIf you use `sanjaysgk/ipg` in your research, please cite the original method paper:\n\n\u003e **Scull KE, Pandey K, Ramarathinam SH, Purcell AW.**\n\u003e _Immunopeptidogenomics: harnessing RNA-seq to illuminate the dark immunopeptidome._\n\u003e Mol Cell Proteomics. 
2021;20:100143.\n\u003e doi: [10.1016/j.mcpro.2021.100143](https://doi.org/10.1016/j.mcpro.2021.100143)\n\nA full reference list for every tool in the pipeline lives in [`CITATIONS.md`](CITATIONS.md).\n\nThis pipeline is built on top of [Nextflow](https://www.nextflow.io) and the\n[nf-core](https://nf-co.re) framework:\n\n\u003e **The nf-core framework for community-curated bioinformatics pipelines.**\n\u003e Ewels PA, Peltzer A, Fillinger S, Patel H, Alneberg J, Wilm A, Garcia MU,\n\u003e Di Tommaso P, Nahnsen S.\n\u003e _Nat Biotechnol._ 2020 Mar;38(3):276-278.\n\u003e doi: [10.1038/s41587-020-0439-x](https://doi.org/10.1038/s41587-020-0439-x).\n\n## License\n\nMIT — see [LICENSE](LICENSE).\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fsanjaysgk%2Fipg","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fsanjaysgk%2Fipg","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fsanjaysgk%2Fipg/lists"}