{"id":47193418,"url":"https://github.com/ylab-hi/atroplex","last_synced_at":"2026-05-26T00:02:12.275Z","repository":{"id":339238167,"uuid":"1114895742","full_name":"ylab-hi/atroplex","owner":"ylab-hi","description":"pan-transcriptome splicing isoform analysis","archived":false,"fork":false,"pushed_at":"2026-05-19T16:38:55.000Z","size":767,"stargazers_count":1,"open_issues_count":6,"forks_count":0,"subscribers_count":0,"default_branch":"main","last_synced_at":"2026-05-19T19:52:39.951Z","etag":null,"topics":["index","isoforms","pan-transcriptome","transcript"],"latest_commit_sha":null,"homepage":"","language":"C++","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"gpl-3.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/ylab-hi.png","metadata":{"files":{"readme":"README.md","changelog":"CHANGELOG.md","contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null,"notice":null,"maintainers":null,"copyright":null,"agents":null,"dco":null,"cla":null}},"created_at":"2025-12-12T03:37:01.000Z","updated_at":"2026-05-19T16:38:48.000Z","dependencies_parsed_at":null,"dependency_job_id":null,"html_url":"https://github.com/ylab-hi/atroplex","commit_stats":null,"previous_names":["ylab-hi/atroplex"],"tags_count":1,"template":false,"template_full_name":null,"purl":"pkg:github/ylab-hi/atroplex","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ylab-hi%2Fatroplex","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ylab-hi%2Fatroplex/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ylab-hi%2Fatroplex/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ylab-hi%2Fatroplex/man
ifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/ylab-hi","download_url":"https://codeload.github.com/ylab-hi/atroplex/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ylab-hi%2Fatroplex/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":33497930,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-05-25T14:31:05.219Z","status":"ssl_error","status_checked_at":"2026-05-25T14:31:02.878Z","response_time":57,"last_error":"SSL_read: unexpected eof while reading","robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":false,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["index","isoforms","pan-transcriptome","transcript"],"created_at":"2026-03-13T11:33:19.337Z","updated_at":"2026-05-26T00:02:12.267Z","avatar_url":"https://github.com/ylab-hi.png","language":"C++","funding_links":[],"categories":[],"sub_categories":[],"readme":"# Atroplex\n\nA pan-transcriptome indexing and analysis toolkit for long-read sequencing data. 
Atroplex builds a spatial index with graph structure from multiple annotation sources, discovers novel transcripts from BAM input, classifies transcripts against the index, exports per-sample GTF files from the index, and provides detailed analysis of exon/segment sharing, splicing complexity, and isoform diversity.\n\n## Installation\n\n### Docker (recommended)\n\n```bash\ndocker build -t atroplex .\ndocker run --rm -v $(pwd):/data atroplex --help\n```\n\n### From source\n\nRequires CMake 3.14+, a C++20 compiler, and htslib installed via pkg-config.\n\n```bash\ncmake -B build -S .\ncmake --build build\n./build/atroplex --help\n```\n\n### Dependencies\n\n- **cxxopts** (v3.3.1): Command-line argument parsing (fetched automatically)\n- **genogrove** (v0.21.0): Genomic interval data structures and graph structure (fetched automatically)\n- **htslib**: Reading BAM/SAM files (system dependency via pkg-config)\n- **zlib**: Compression support\n\n## Quick Start\n\n```bash\n# Build index from a single annotation file\natroplex build -b annotation.gtf\n\n# Build pan-transcriptome from multiple sources via manifest\natroplex build -m manifest.tsv\n\n# Build with expression filter\n# (manifest's `expression_attribute` column declares which GFF attribute\n# to read on each sample — here: `counts` for TALON samples)\natroplex build -m manifest.tsv --min-counts 3\n\n# Build only specific chromosomes (for targeted analysis)\natroplex build -m manifest.tsv --chromosomes chr1,chr22,chrX\n\n# Run full per-sample inspection (sharing, splicing hubs, diversity)\natroplex inspect -m manifest.tsv -o results/\n\n# Classify transcripts against the index\natroplex query -i transcripts.gtf -m manifest.tsv -o results/\n\n# Classify with differential transcript usage\natroplex query -i transcripts.gtf -m manifest.tsv --contrast treated:control\n\n# Discover novel transcripts from long-read data\natroplex discover -i reads.bam -m manifest.tsv -o results/\n\n# Export per-sample GTF files from a 
built index\natroplex export -g index.ggx -o export/\n\n# Compact a built index by physically removing absorbed segments\natroplex compact -g index_dir/ -o compacted/\n```\n\n## Subcommands\n\n### `atroplex build` — Build pan-transcriptome index\n\nBuilds a genogrove index from one or more annotation files or BAM files. Each input file is treated as a separate entry in the pan-transcriptome, with features deduplicated across files and sample/source provenance tracked on every exon and segment.\n\n```bash\n# From manifest (full metadata per sample)\natroplex build -m ENCODE/manifest.tsv -o results/\n\n# From annotation files directly (metadata parsed from GFF headers)\natroplex build -b gencode.gtf -b sample1.gtf -b sample2.gtf\n\n# With expression filtering — thresholds are per-attribute and each sample\n# declares which attributes to read via the manifest's `expression_attribute`\n# column. A TALON manifest row with `expression_attribute = counts` is\n# filtered by --min-counts; a StringTie row with `expression_attribute = cov,TPM`\n# is filtered by both --min-cov and --min-TPM (AND semantics, missing\n# attributes on a given transcript are pass-through).\natroplex build -m manifest.tsv --min-counts 3\natroplex build -m manifest.tsv --min-cov 1 --min-TPM 0.5\n\n# Build only specific chromosomes\natroplex build -m manifest.tsv --chromosomes chr1,chr22\n\n# Drop sample transcripts at novel loci (keep only annotated gene regions)\natroplex build -m manifest.tsv --annotated-loci-only\n\n# Disable ISM absorption\natroplex build -m manifest.tsv --no-absorb\n\n# Physical tombstone removal (slower, produces smaller .ggx for distribution)\natroplex build -m manifest.tsv --prune-tombstones\n```\n\nBuild always produces a serialized index (`.ggx`) and a build summary (`.ggx.summary`).\n\n### `atroplex inspect` — Full pan-transcriptome inspection\n\nPerforms detailed per-sample inspection: overview / per-source / biotype\nbreakdowns, exon \u0026 segment sharing, 
conserved-exon detail, and splicing hubs\n(per-sample PSI + entropy + branch fan-out). Splicing events (cassette / alt-5′ / alt-3′ / IR / alt-terminal / mutually-exclusive) are opt-in via `--events`.\n\n```bash\n# Full inspection from manifest\natroplex inspect -m ENCODE/manifest.tsv -o results/\n\n# Full inspection from annotation files\natroplex inspect -b gencode.gtf -b sample1.gtf -o results/\n\n# Filter to segments in \u003e= 5 samples (annotations always kept)\natroplex inspect -m manifest.tsv -o results/ --min-samples 5\n\n# Relax the conservation threshold: conserved == present in \u003e= 95% of samples\natroplex inspect -m manifest.tsv -o results/ --conserved-fraction 0.95\n\n# Narrow the hub catalog (only exons with \u003e= 20 distinct downstream targets)\natroplex inspect -m manifest.tsv -o results/ --min-hub-branches 20\n\n# Include splicing event catalog (cassette, alt-5'/3', IR, etc.)\natroplex inspect -m manifest.tsv -o results/ --events\n```\n\nInspect-specific options: `--min-samples` (skip segments in \u003c N samples, annotations always kept), `--conserved-fraction \u003c(0,1]\u003e` (fraction of sample-typed entries a feature must appear in to be classified as conserved; default `1.0` = strict \"in every sample\"; relax for a dropout-tolerant conserved core), `--min-hub-branches \u003cN\u003e` (minimum unique downstream targets for an exon to register as a splicing hub; must be `\u003e= 2`; default `10`; raise to narrow the catalog when hubs are abundant, lower to surface less-branched events), `--events` (write per-gene splicing event catalog, off by default)\n\n### `atroplex query` — Classify transcripts against the index\n\nClassifies input transcripts (GTF/GFF) against the pan-transcriptome index using SQANTI-like structural categories (FSM, ISM, NIC, NNC, etc.). 
Optionally performs differential transcript usage (DTU) analysis between sample groups.\n\n```bash\n# Classify transcripts\natroplex query -i transcripts.gtf -m manifest.tsv -o results/\n\n# With differential transcript usage between groups (chi-squared + BH-FDR)\n# group_a and group_b must be values of the manifest's `group` column\n# (or auto-inferred from the `_repNN` suffix on sample IDs). DTU requires\n# a .qtx sidecar to have been written during build.\natroplex query -i transcripts.gtf -m manifest.tsv --contrast treated:control --fdr 0.05\n```\n\n### `atroplex discover` — Discover novel transcripts\n\nClusters aligned long reads by splice junction signature and matches them against the index to classify transcripts as known, compatible, or novel.\n\n```bash\natroplex discover -i reads.bam -m manifest.tsv -o results/\n```\n\n### `atroplex export` — Reconstruct per-sample GTFs from the index\n\nWalks a built index and writes one GTF file per sample with gene, transcript, and exon lines. Expression values (when available) are emitted as GTF attributes. 
Supports filters to restrict output by sample, gene, region, biotype, source, or sample frequency.\n\n```bash\n# Export all samples from a pre-built index (pass the directory containing the .ggx)\natroplex export -g index_dir/ -o export/\n\n# Export a specific sample, protein-coding genes only\natroplex export -g index_dir/ --sample HL60_M1_rep1 --biotype protein_coding\n\n# Export conserved features in a genomic region\natroplex export -g index_dir/ --region chr22:10000000-15000000 --conserved-only\n\n# Export the dropout-tolerant conserved core (segments in \u003e= 95% of samples)\natroplex export -g index_dir/ --conserved-only --conserved-fraction 0.95\n\n# Export features present in at least 3 samples, from HAVANA\natroplex export -g index_dir/ --min-samples 3 --source HAVANA\n```\n\nExport-specific options: `--sample`, `--gene`, `--region chr:start-end`, `--min-samples`, `--conserved-only`, `--conserved-fraction \u003c(0,1]\u003e` (sample-typed-entry fraction the segment must hit for `--conserved-only`; default `1.0` = strict; mirrors the inspect convention with annotations excluded from the denominator), `--biotype`, `--source` (all filters are AND'd).\n\n### `atroplex compact` — Compact a built index\n\nPhysically removes absorbed (tombstoned) segments from an existing `.ggx`, producing a smaller index without changing query semantics. Use this when an index was built without `--prune-tombstones` and you want to reclaim space later. The companion `.qtx` is already remapped against live segments at build time, so it is copied through unchanged alongside `.ggx.summary`.\n\n```bash\n# Compact an existing index into a separate output directory\natroplex compact -g index_dir/ -o compacted/\n\n# Compact a structure-only index (no .qtx alongside the .ggx)\natroplex compact -g index_dir/ -o compacted/ --no-qtx\n```\n\nBy default, `compact` fails fast if no `.qtx` is found alongside the input `.ggx` — pass `--no-qtx` to opt out. 
It also refuses to write to the input directory, so a partial write can never overwrite the original.\n\nCompact-specific options: `--no-qtx`.\n\n### Common Options\n\n| Option | Description |\n|--------|-------------|\n| `-o, --output-dir` | Output directory (default: input file directory) |\n| `-p, --prefix` | Output file prefix (default: derived from manifest or first input) |\n| `-t, --threads` | Number of threads (default: 1) |\n| `--progress` | Show progress output |\n| `-m, --manifest` | Sample manifest file (TSV) |\n| `-b, --build-from` | Build from GFF/GTF file(s) |\n| `-g, --genogrove` | Directory containing a pre-built genogrove index (`.ggx` + optional `.qtx` sidecar) |\n| `-k, --order` | Genogrove tree order (default: 3) |\n| `--min-counts` | Minimum `counts` value for transcripts whose sample declares `counts` in its manifest `expression_attribute` column (default: -1, disabled) |\n| `--min-TPM` | Minimum `TPM` value for transcripts whose sample declares `TPM` (default: -1, disabled) |\n| `--min-FPKM` | Minimum `FPKM` value for transcripts whose sample declares `FPKM` (default: -1, disabled) |\n| `--min-RPKM` | Minimum `RPKM` value for transcripts whose sample declares `RPKM` (default: -1, disabled) |\n| `--min-cov` | Minimum `cov` value for transcripts whose sample declares `cov` (default: -1, disabled) |\n| `--no-absorb` | Disable ISM segment absorption into longer parent segments |\n| `--fuzzy-tolerance` | Max bp difference for fuzzy exon boundary matching (default: 5) |\n| `--prune-tombstones` | Physically remove absorbed segments from the grove post-build (slower, smaller .ggx) |\n| `--include-scaffolds` | Keep transcripts on unplaced scaffolds, alt contigs, fix patches, and decoy sequences. Default: off — GFF/BAM ingest is filtered to canonical main chromosomes only (`chr1..chr22`, `chrX`, `chrY`, `chrM`). Enable for non-human/non-mouse species or when you specifically need scaffold contributions. 
|\n| `--chromosomes` | Restrict index to specific chromosomes (comma-separated, e.g. `chr1,chr22,chrX`). Accepts both prefixed and bare names. Default: all chromosomes. |\n| `--annotated-loci-only` | Only keep sample transcripts that overlap an annotation segment. Novel intergenic loci are discarded; novel isoforms at annotated loci inherit the annotation gene identity. |\n\n## Input Files\n\n### Sample Manifest (TSV)\n\nA tab-separated file specifying input files with metadata. This is the recommended way to build a pan-transcriptome with full sample tracking.\n\n```tsv\nfile\tid\ttype\tassay\tbiosample\tcondition\tspecies\tplatform\tpipeline\texpression_attribute\tdescription\ngencode.v49.gtf\tGENCODE_v49\tannotation\t.\t.\t.\tHomo sapiens\t.\tGENCODE\t.\tReference annotation\nsample1.gtf\tHL60_M1_rep1\tsample\tRNA-seq\tHL-60\tM1 macrophage\tHomo sapiens\tPacBio Sequel II\tTALON\tcounts\tHL-60 M1 replicate 1\nsample2.gtf\tHL60_M1_rep2\tsample\tRNA-seq\tHL-60\tM1 macrophage\tHomo sapiens\tPacBio Sequel II\tTALON\tcounts\tHL-60 M1 replicate 2\nstrtie1.gtf\tSTRTIE_01\tsample\tRNA-seq\tbrain\thealthy\tHomo sapiens\tIllumina NovaSeq\tStringTie\tcov,TPM\tStringTie sample with both cov and TPM\n```\n\n- Tab-separated, `\".\"` for empty values (VCF convention)\n- `file` column is required; all others optional\n- `type`: `\"annotation\"` for reference annotations, `\"sample\"` for experimental data (default: `\"sample\"`)\n- Column names are case-insensitive\n- Relative paths resolved from manifest directory\n- `id` auto-generated from filename if not provided\n- Optional `group` column for replicate grouping; if absent, groups are auto-inferred by stripping `_repNN` suffix from sample IDs\n- Optional `expression_attribute` column declares which GFF attributes carry quantitative expression for this sample: `counts`, `TPM`, `FPKM`, `RPKM`, `cov`, or a comma-separated list like `cov,TPM` for StringTie samples that carry multiple quantifications. 
Empty or `.` means no expression filtering for the sample. The FIRST declared attribute is what gets stored per-feature and appears in per-sample output column headers (e.g., `SAMPLE1.counts`, `STRTIE_01.cov`)\n\nThe `type` field matters: only entries with `type = \"sample\"` count toward \"conserved\" thresholds. By default a feature is conserved when it is present in every sample-typed entry; relax with `--conserved-fraction` (e.g. `0.95` for a dropout-tolerant core). Reference annotations participate in structural analysis but don't inflate sample counts.\n\n### GFF/GTF Files\n\nWhen using `--build-from` without a manifest, metadata is parsed from GFF/GTF headers:\n\n```\n##id: GENCODE_v49_GRCh38\n##description: Evidence-based annotation of the human genome\n##species: Homo sapiens\n##annotation_source: GENCODE\n##annotation_version: v49\n##type: annotation\n```\n\n#### Required Attributes\n\nEach exon feature must have `gene_id` and `transcript_id` in column 9.\n\n#### Expression Attributes\n\nEach sample declares which GFF attribute(s) carry quantitative expression via the manifest's `expression_attribute` column. Supported values: `counts`, `TPM`, `FPKM`, `RPKM`, `cov`. Multiple may be listed comma-separated (e.g., `cov,TPM` for StringTie samples). Empty or `.` disables expression filtering and storage for that sample.\n\n- The **first** declared attribute is what gets stored per-feature via `expression_store` and appears as the column header suffix in per-sample outputs (e.g., `ENCSR_01.counts`, `STRTIE_01.cov`).\n- All declared attributes are **evaluated against their matching CLI threshold** (`--min-counts`, `--min-TPM`, etc.) with **AND semantics** — a transcript is kept only if every (declared attribute that has an active threshold) meets it. 
Missing attributes on a given transcript are pass-through.\n- Samples that declare no `expression_attribute` (or use `.`) are never filtered on expression — same pass-through semantics as annotations.\n- A CLI threshold that no manifest sample declares emits a warning at build time so you know the filter had no effect.\n\nExample: a mixed ENCODE+StringTie build where TALON samples are filtered on raw read counts and StringTie samples are filtered on StringTie coverage, while GENCODE passes through unfiltered:\n\n```tsv\nfile             id          type        expression_attribute\ngencode.gtf      GENCODE     annotation  .\nencsr_01.gtf     ENCSR_01    sample      counts\nencsr_02.gtf     ENCSR_02    sample      counts\nstrtie_01.gtf    STRTIE_01   sample      cov,TPM\n```\n\n```bash\natroplex build -m manifest.tsv --min-counts 3 --min-cov 1 --min-TPM 0.5\n```\n\nUse `scripts/inject_expression.py` to add TALON quantification counts into GTF files before building.\n\n### BAM/SAM Files\n\nBAM files can be used as input to `build` (via manifest or `--build-from`). Reads are clustered by splice junction signature, and clusters are converted to exon chains and segments using the same absorption rules as GFF input. Read counts serve as expression values.\n\n#### Chromosome Name Normalization\n\nAtroplex normalizes chromosome names to UCSC/GENCODE style automatically:\n\n| Input | Normalized |\n|-------|------------|\n| `1`, `2`, ... `22` | `chr1`, `chr2`, ... `chr22` |\n| `X`, `Y` | `chrX`, `chrY` |\n| `MT` | `chrM` |\n| `chr1` (already prefixed) | `chr1` (unchanged) |\n\nThis allows mixing annotations from different sources (e.g., GENCODE + Ensembl).\n\n## ISM Absorption\n\nDuring index construction, Incomplete Splice Match (ISM) transcripts — truncated versions of full-length transcripts — are absorbed into their parent segments rather than creating separate entries. 
This reduces noise from degradation artifacts and technical truncation in long-read data.\n\nAbsorption rules (in execution order):\n\n| Rule | Pattern | Action |\n|------|---------|--------|\n| 0 | Identical exon structure (FSM) | Merge metadata |\n| 5 | Same intron chain, TSS/TES within 50bp | Absorb |\n| 6 | Mono-exon overlapping gene, no intron crossing | Drop |\n| 7 | Mono-exon spanning exon-intron-exon (intron retention) | Keep |\n| 8 | Mono-exon intergenic | Drop |\n| 1 | 5' ISM (contiguous subset at 5' end) | Keep |\n| 2 | 3' ISM (1-2 missing exons from 5' end) | Absorb |\n| 3 | 3' degradation (3+ missing from 5' end) | Drop vs ref, Keep vs sample |\n| 4 | Internal fragment (both ends missing) | Drop vs ref, Keep vs sample |\n\nAfter creating a new segment, reverse absorption applies the same rules to existing shorter segments. Matching uses pointer identity first, then fuzzy coordinate matching within `--fuzzy-tolerance` bp. Absorption can be disabled with `--no-absorb`.\n\n## Output Files\n\n### Build output (all subcommands that build a grove)\n\n| File | Description |\n|------|-------------|\n| `{prefix}.ggx` | Serialized grove index |\n| `{prefix}.ggx.summary` | Build summary (genes, transcripts, segments, exons, biotypes, per-chromosome) |\n\n### `atroplex inspect` output\n\nOutputs land directly under `{output-dir}/`, grouped by category\nsubfolders (no wrapper folder — matches the convention of the other\nsubcommands):\n\n```\n{output-dir}/\n  overview/\n    {basename}.overview.tsv          Global counts (genes, transcripts, segments, exons)\n    {basename}.per_sample.tsv        Per-sample metrics (one row per sample)\n    {basename}.per_source.tsv        Per-GFF-source metrics (HAVANA / ENSEMBL / TALON / ...)\n    {basename}.biotype.tsv           Gene + transcript biotype breakdown (long-form)\n  sharing/\n    {basename}.exon_sharing.tsv      Exon sharing summary (metrics × samples)\n    {basename}.segment_sharing.tsv   Segment sharing summary 
(metrics × samples)\n    {basename}.conserved_exons.tsv   Per-exon detail for exons meeting the conservation threshold (`--conserved-fraction`, default = all samples)\n  splicing_hubs/\n    {basename}.splicing_hubs.tsv     Hub exons (≥10 downstream branches by default) with per-sample PSI + entropy\n    {basename}.branch_details.tsv    Per-(hub × target) branch fraction + expression\n  splicing_events/                   (only with --events)\n    {basename}.splicing_events.tsv   Classified events: cassette / alt-5′ / alt-3′ / IR / alt-terminal / mutex\n```\n\n#### Exon Sharing Summary (`.exon_sharing.tsv`)\n\nMetrics as rows, samples as columns, plus a `total` column:\n\n| Metric | Description |\n|--------|-------------|\n| `total` | Total exons in this sample |\n| `exclusive` | Exons only in this sample |\n| `shared` | Exons in 2+ but not all samples |\n| `conserved` | Exons meeting the conservation threshold (`--conserved-fraction`, default = all samples) |\n| `constitutive` | Exons in all transcripts of their gene |\n| `alternative` | Exons in only some transcripts of their gene |\n\n#### Segment Sharing Summary (`.segment_sharing.tsv`)\n\nSame format: `total`, `exclusive`, `shared`, `conserved`.\n\n#### Conserved Exons Detail (`.conserved_exons.tsv`)\n\nOne row per exon meeting the conservation threshold (`--conserved-fraction`; by default, present in **all samples**). Per-sample columns include transcript counts and expression values (expression columns only for sample-type entries).\n\n#### Splicing Hubs (`.splicing_hubs.tsv`)\n\nExons with at least `--min-hub-branches` (default `10`) unique downstream exon targets, indicating complex alternative splicing decision points. 
Per-sample columns include branch counts, shared/unique classification, transcript counts, Shannon entropy, PSI, and expression.\n\n#### Branch Details (`.branch_details.tsv`)\n\nOne row per (hub exon, downstream target) pair with per-sample branch usage fractions and expression.\n\n### `atroplex query` output\n\n| File | Description |\n|------|-------------|\n| `{basename}.query.tsv` | Per-transcript classification with per-sample presence/expression |\n| `{basename}.query.summary.txt` | Classification summary (counts per category) |\n| `{basename}.{group_a}_vs_{group_b}.dtu.tsv` | DTU results (when `--contrast` is provided) |\n\nOnly matched transcripts are emitted in `query.tsv` — unmatched transcripts (intergenic, antisense) are counted in the summary but omitted from the TSV since they carry no per-sample or structural information.\n\n#### Classification columns (`query.tsv`)\n\n| Column | Description |\n|--------|-------------|\n| `transcript_id` | Query transcript identifier (from input GTF `transcript_id` attribute) |\n| `gene_id` | Gene ID of the best-matching segment in the index (may be an annotated gene like ENSG… or a tool-generated ID like MSTRG.* for novel loci) |\n| `gene_name` | Gene name of the best-matching segment (`.` when no gene name is available) |\n| `structural_category` | SQANTI-like classification: FSM, ISM, NIC, NNC, genic_intron, genic_genomic |\n| `subcategory` | Refinement: ISM → 5prime_fragment / 3prime_fragment / internal_fragment; NIC → combination / intron_retention / exon_skipping / alternative_3end / alternative_5end; NNC → novel_donor / novel_acceptor / novel_both / novel_exon |\n| `junction_match_score` | Fraction of query splice junctions matching the best reference segment (0.0–1.0) |\n| `matching_junctions` | Number of query junctions that match the reference |\n| `query_junctions` | Total splice junctions in the query transcript |\n| `ref_junctions` | Total splice junctions in the best-matching reference segment |\n| 
`known_donors` | Query donor sites (exon 3' ends) matching any known donor in the index |\n| `known_acceptors` | Query acceptor sites (exon 5' starts) matching any known acceptor in the index |\n| `novel_donors` | Query donor sites not found in the index |\n| `novel_acceptors` | Query acceptor sites not found in the index |\n| `n_samples` | Number of samples in the index that contain the matched segment |\n| `{sample_id}.present` | Per-sample presence (1/0) — one column per entry in the index |\n| `{sample_id}.{expr_type}` | Per-sample expression at the matched segment (only when `.qtx` sidecar is available; `.` = no data) |\n\n#### DTU columns (`*.dtu.tsv`)\n\n| Column | Description |\n|--------|-------------|\n| `gene_id` | Gene identifier |\n| `gene_name` | Gene name |\n| `transcript_id` | Transcript being tested |\n| `prop_{group_a}` | Proportion of gene expression from this transcript in group A |\n| `prop_{group_b}` | Proportion of gene expression from this transcript in group B |\n| `delta_proportion` | Difference in proportions (group A − group B) |\n| `p_value` | Chi-squared test p-value (Wilson-Hilferty approximation) |\n| `fdr` | Benjamini-Hochberg adjusted p-value |\n| `significant` | `yes` / `no` based on `--fdr` threshold |\n\n### `atroplex discover` output\n\n| File | Description |\n|------|-------------|\n| `{input}.atroplex.tsv` | Per-cluster match results (SQANTI-like) |\n| `{input}.atroplex.summary.txt` | Match statistics |\n\n### `atroplex export` output\n\nOne GTF file per exported sample in the output directory:\n\n| File | Description |\n|------|-------------|\n| `{sample_id}.gtf` | Reconstructed GTF with gene/transcript/exon lines, expression as attributes |\n\n## Key Concepts\n\n### Pan-Transcriptome Index\n\nAtroplex builds a combined index from multiple annotation sources. Each input file (reference annotation or sample assembly) is registered with metadata, and every feature (exon, segment) tracks which samples and sources it came from. 
Features at identical coordinates are deduplicated:\n\n- **Exons**: Deduplicated by genomic coordinates (chromosome + strand + start + end)\n- **Segments**: Deduplicated by exon structure (ordered list of exon coordinates within a transcript)\n\n### Two-Level Feature System\n\n1. **Segments** are spatially indexed transcript paths used for coarse queries. A segment represents one or more transcripts that share the same exon structure. Segments participate in interval tree queries.\n\n2. **Exons** are graph-only external keys used for fine-grained verification. They are linked into chains via edges and are not spatially indexed.\n\nThe graph connects segments to their first exon (SEGMENT_TO_EXON edge), and exons to subsequent exons (EXON_TO_EXON edges). Each edge carries a numeric segment index for ID-based traversal at branching exons.\n\n### Structural Categories (SQANTI-like)\n\nThe `query` and `discover` subcommands classify transcripts into structural categories:\n\n| Category | Description |\n|----------|-------------|\n| FSM | Full Splice Match — all junctions match a reference transcript |\n| ISM | Incomplete Splice Match — subset of reference junctions |\n| NIC | Novel In Catalog — novel combination of known splice sites |\n| NNC | Novel Not in Catalog — at least one novel splice site |\n| GENIC_INTRON | Mono-exon entirely within an intron |\n| GENIC_GENOMIC | Overlaps both intron and exon regions |\n| ANTISENSE | Overlaps gene on opposite strand |\n| INTERGENIC | No gene overlap |\n\n### Sample Types\n\nEntries are classified as `\"annotation\"` (reference catalogs like GENCODE) or `\"sample\"` (experimental assemblies). 
This distinction affects:\n\n- **Conserved/sharing statistics**: Only sample-type entries count toward \"all samples\" thresholds\n- **Expression output**: Expression columns only appear for sample-type entries in TSV output files\n- **Absorption rules**: Rules 3 and 4 treat annotation parents differently from sample parents\n\n## Scripts\n\n### inject_expression.py\n\nInjects expression counts from TALON quantification TSV into GTF files:\n\n```bash\npython scripts/inject_expression.py \\\n  input.gtf counts.tsv output.expr.gtf \\\n  --header id=SAMPLE_001 \\\n  --header species=\"Homo sapiens\"\n```\n\nMatches on `talon_transcript` attribute in GTF to `transcript_ID` in TSV. Adds `counts \"N\"` attribute to transcript lines.\n\n### visualize_stats.R\n\nGenerates multi-panel visualization from analysis output:\n\n```bash\nRscript scripts/visualize_stats.R \u003cstats_dir\u003e [output_prefix]\n```\n\nRequires: ggplot2, tidyr, dplyr, scales, patchwork. Produces PDF and PNG.\n\n## License\n\nGPLv3. See [LICENSE](LICENSE) for details.","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fylab-hi%2Fatroplex","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fylab-hi%2Fatroplex","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fylab-hi%2Fatroplex/lists"}