{"id":26880172,"url":"https://github.com/sstadick/perbase","last_synced_at":"2025-05-16T06:07:44.699Z","repository":{"id":38296538,"uuid":"296654732","full_name":"sstadick/perbase","owner":"sstadick","description":"Per-base per-nucleotide depth analysis","archived":false,"fork":false,"pushed_at":"2025-01-16T19:49:06.000Z","size":288,"stargazers_count":127,"open_issues_count":20,"forks_count":16,"subscribers_count":5,"default_branch":"master","last_synced_at":"2025-05-10T10:48:22.111Z","etag":null,"topics":["bioinformatics","cli-app","rust"],"latest_commit_sha":null,"homepage":"","language":"Rust","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/sstadick.png","metadata":{"files":{"readme":"README.md","changelog":"CHANGELOG.md","contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":"CITATION.cff","codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null}},"created_at":"2020-09-18T15:07:05.000Z","updated_at":"2025-04-15T07:05:28.000Z","dependencies_parsed_at":"2025-04-11T13:47:17.826Z","dependency_job_id":null,"html_url":"https://github.com/sstadick/perbase","commit_stats":{"total_commits":150,"total_committers":5,"mean_commits":30.0,"dds":"0.033333333333333326","last_synced_commit":"abdfb7595ac8c42cc24d93b80175fb3e4585d7c6"},"previous_names":[],"tags_count":38,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/sstadick%2Fperbase","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/sstadick%2Fperbase/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/sstadick%2Fperbase/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories
/sstadick%2Fperbase/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/sstadick","download_url":"https://codeload.github.com/sstadick/perbase/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":254478193,"owners_count":22077676,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["bioinformatics","cli-app","rust"],"created_at":"2025-03-31T13:34:59.623Z","updated_at":"2025-05-16T06:07:42.641Z","avatar_url":"https://github.com/sstadick.png","language":"Rust","funding_links":[],"categories":[],"sub_categories":[],"readme":"\u003cp align=\"center\"\u003e\n\u003cimg width=\"500\" height=\"250\" src=\"./perbase.png\"\u003e\n\u003c/p\u003e\n\n![Publish](https://github.com/sstadick/perbase/workflows/Publish/badge.svg)\n![Rust](https://github.com/sstadick/perbase/workflows/Rust/badge.svg)\n[![API docs](https://img.shields.io/badge/API-documentation-blue.svg)](https://docs.rs/perbase)\n[![Crates.io](https://img.shields.io/crates/v/perbase.svg)](https://crates.io/crates/perbase)\n[![Conda](https://anaconda.org/anaconda/anaconda/badges/installer/conda.svg)](https://anaconda.org/bioconda/perbase)\n\nA highly parallelized utility for analyzing metrics at a per-base level.\n\nIf a metric is missing or performance is lacking, please file a bug/feature ticket in issues.\n\n## Why?\n\nWhy `perbase` when so many other tools are out there? `perbase` leverages Rust's concurrency system to automagically parallelize over your input regions. 
This leads to orders of magnitude faster runtimes that scale with the compute resources that you have available. Additionally, `perbase` aims to be more accurate than other tools. E.g.: `perbase` counts DELs toward depth, `bam-readcount` does not, `perbase` does not count REF_SKIPs toward depth, `sambamba` does.\n\n## Installation\n\n```bash\nconda install -c bioconda perbase\n# OR\ncargo install perbase\n```\n\nYou can also download a binary from the [releases](https://github.com/sstadick/perbase/releases) page.\n\n## Tools\n\n### base-depth\n\nThe `base-depth` tool walks over every position in the BAM/CRAM file and calculates the depth, as well as the number of each nucleotide at the given position. Additionally, it counts the numbers of Ins/Dels at each position.\n\nThe output columns are as follows:\n\n| Column         | Description                                                                                        |\n| -------------- | -------------------------------------------------------------------------------------------------- |\n| REF            | The reference sequence name                                                                        |\n| POS            | The position on the reference sequence                                                             |\n| REF_BASE       | The reference base at the position, column excluded if no reference was supplied                   |\n| DEPTH          | The total depth at the position SUM(A, C, T, G, DEL)                                               |\n| A              | Total A nucleotides seen at this position                                                          |\n| C              | Total C nucleotides seen at this position                                                          |\n| G              | Total G nucleotides seen at this position                                                          |\n| T              | Total T nucleotides seen at this position                                 
                         |\n| N              | Total N nucleotides seen at this position                                                          |\n| INS            | Total insertions that start at the base to the right of this position                              |\n| DEL            | Total deletions covering this position                                                             |\n| REF_SKIP       | Total reference skip operations covering this position                                             |\n| FAIL           | Total reads failing filters that covered this position (their bases were not counted toward depth) |\n| NEAR_MAX_DEPTH | Flag to indicate if this position came within 1% of the max depth specified                        |\n\n```bash\nperbase base-depth ./test/test.bam\n```\n\nExample output\n\n```text\nREF     POS     REF_BASE        DEPTH   A       C       G       T       N       INS     DEL     REF_SKIP        FAIL    NEAR_MAX_DEPTH\nchr1    709636  T       16      0       0       0       16      0       0       0       0       0   false\nchr1    709637  T       16      0       4       0       12      0       0       0       0       0   false\nchr1    709638  A       16      16      0       0       0       0       0       0       0       0   false\nchr1    709639  G       16      0       0       16      0       0       0       0       0       0   false\nchr1    709640  A       16      16      0       0       0       0       0       0       0       0   false\nchr1    709641  A       16      16      0       0       0       0       0       0       0       0   false\nchr1    709642  G       16      0       0       16      0       0       0       0       0       0   false\nchr1    709643  G       16      0       0       16      0       0       0       0       0       0   false\nchr1    709644  T       16      0       0       0       16      0       0       0       0       0   false\nchr1    709645  G       16      0       0       16      0       0 
      0       0       0       0   false\n```\n\nIf the `--mate-fix` flag is passed, each position will first check if there are any mate overlaps and choose the mate with the highest MAPQ, breaking ties by choosing the first mate that passes filters. Mates that are discarded are not counted toward `FAIL` or `DEPTH`.\n\nIf the `--ref-fasta` option is supplied, the `REF_BASE` field will be filled in. The reference must be indexed and must match the BAM/CRAM header of the input.\n\nThe output can be compressed and indexed as follows:\n\n```bash\nperbase base-depth -Z ./test/test.bam -o output.tsv.gz\ntabix -S 1 -s 1 -b 2 -e 2 ./output.tsv.gz\n# Query all positions overlapping region\ntabix output.tsv.gz chr1:5-10\n```\n\nUsage:\n\n```text\nCalculate the depth at each base, per-nucleotide\n\nUSAGE:\n    perbase base-depth [FLAGS] [OPTIONS] \u003creads\u003e\n\nFLAGS:\n    -Z, --bgzip                     \n            Optionally bgzip the output\n\n    -h, --help                      \n            Prints help information\n\n    -k, --keep-zeros                \n            Keep positions even if they have 0 depth\n\n    -m, --mate-fix                  \n            Fix overlapping mates counts, see docs for full details\n\n    -M, --skip-merging-intervals    \n            Skip mergeing togther regions specified in the optional BED or BCF/VCF files.\n            \n            **NOTE** If this is set it could result in duplicate output entries for regions that overlap. **NOTE** This\n            may cause issues with downstream tooling.\n    -V, --version                   \n            Prints version information\n\n    -z, --zero-base                 \n            Output positions as 0-based instead of 1-based\n\n\nOPTIONS:\n    -B, --bcf-file \u003cbcf-file\u003e\n            A BCF/VCF file containing positions of interest. 
If specified, only bases from the given positions will be\n            reported on\n    -b, --bed-file \u003cbed-file\u003e\n            A BED file containing regions of interest. If specified, only bases from the given regions will be reported\n            on\n    -C, --channel-size-modifier \u003cchannel-size-modifier\u003e\n            The fraction of a gigabyte to allocate per thread for message passing, can be greater than 1.0 [default:\n            0.15]\n    -c, --chunksize \u003cchunksize\u003e\n            The ideal number of basepairs each worker receives. Total bp in memory at one time is (threads - 2) *\n            chunksize [default: 1000000]\n    -L, --compression-level \u003ccompression-level\u003e\n            The level to use for compressing output (specified by --bgzip) [default: 2]\n\n    -T, --compression-threads \u003ccompression-threads\u003e\n            The number of threads to use for compressing output (specified by --bgzip) [default: 4]\n\n    -F, --exclude-flags \u003cexclude-flags\u003e                      \n            SAM flags to exclude, recommended 3848 [default: 0]\n\n    -f, --include-flags \u003cinclude-flags\u003e                      \n            SAM flags to include [default: 0]\n\n    -D, --max-depth \u003cmax-depth\u003e\n            Set the max depth for a pileup. If a positions depth is within 1% of max-depth the `NEAR_MAX_DEPTH` output\n            field will be set to true and that position should be viewed as suspect [default: 100000]\n    -Q, --min-base-quality-score \u003cmin-base-quality-score\u003e\n            Minium base quality for a base to be counted toward [A, C, T, G]. If the base is less than the specified\n            quality score it will instead be counted as an `N`. 
If nothing is set for this no cutoff will be applied\n    -q, --min-mapq \u003cmin-mapq\u003e                                \n            Minimum MAPQ for a read to count toward depth [default: 0]\n\n    -o, --output \u003coutput\u003e                                    \n            Output path, defaults to stdout\n\n        --ref-cache-size \u003cref-cache-size\u003e\n            Number of Reference Sequences to hold in memory at one time. Smaller will decrease mem usage [default: 10]\n\n    -r, --ref-fasta \u003cref-fasta\u003e                              \n            Indexed reference fasta, set if using CRAM\n\n    -t, --threads \u003cthreads\u003e                                  \n            The number of threads to use [default: 32]\n\n\nARGS:\n    \u003creads\u003e    \n            Input indexed BAM/CRAM to analyze\n```\n\n### only-depth\n\nThe `only-depth` tool walks over the input BAM/CRAM file and calculates the depth over all positions specified by either a BED file or in the BAM/CRAM header. Adjacent positions that have the same depth will be merged together to form a non-inclusive range (see example output).\n\nThere are two distinct modes that `only-depth` can run in, gated by the `--fast-mode` flag. When running in fast mode, the depth over the area a read covers is determined only by the read's start and end positions, and no CIGAR-related information is taken into account. `--mate-fix` may still be used in this mode, and areas where mates overlap will not be counted twice.\n\nWithout the `--fast-mode` flag, the depth at each position is determined in a manner similar to `base-depth`, where `DEL` will count toward depth, but `REF_SKIP` will not. Additionally, any reads that fail the `--exclude-flags` filter will not be counted toward depth. Lastly, `--mate-fix` can be applied to avoid counting regions twice where mates may overlap.\n\nRegarding mate fixes, `perbase` will make \"fixes\" based only on the counted regions in a read. 
For example, if you have a read that goes from \"chr1:0-1000\" with a CIGAR of \"25M974N1M\", and the mate aligns nicely at \"chr1:45-70\" with CIGAR \"25M\", the mate will count toward the depth over \"chr1:45-74\". This is in contrast to other tools that will reject the mate even though it overlaps a region of R1 that is not counted toward depth.\n\nFor the fastest possible output, use `only-depth --fast-mode`.\n\n**Note** that it is possible that two adjacent positions may not merge if they fall at a `--chunksize` boundary. If this is an issue you can set the `--chunksize` to the size of the largest contig in question. At a future date this may be fixed or a post processing tool may be provided to fix it. For most use cases this should not be a problem. Additionally, you can pipe into `merge-adjacent` which will fix it as well. EX: `perbase only-depth -m file.bam | perbase merge-adjacent \u003e out.tsv`.\n\nExample output of `perbase only-depth --mate-fix --zero-base  ./test/test.bam`:\n\n```text\nREF     POS     END     DEPTH\nchr2    0       4       1\nchr2    4       9       2\nchr2    9       12      3\nchr2    12      14      2\nchr2    14      17      3\nchr2    17      19      4\nchr2    19      23      5\nchr2    23      34      4\nchr2    34      39      3\nchr2    39      49      1\nchr2    49      54      2\nchr2    54      64      3\nchr2    64      74      4\nchr2    74      79      3\nchr2    79      84      2\nchr2    84      89      1\n```\n\nIf a BED-like output is needed, `--bed-format -z` flags can be set, which will write a 0-based, no-header TSV output with an empty 4th column and the depth in the 5th column.\n\nUsage:\n\n```text\nCalculate the only the depth at each base\n\nUSAGE:\n    perbase only-depth [FLAGS] [OPTIONS] \u003creads\u003e\n\nFLAGS:\n        --bed-format                \n            Output BED-like output format with the depth in the 5th column. 
Note, `-z` can be used with this to change\n            coordinates to 0-based to be more BED-like\n    -Z, --bgzip                     \n            Optionally bgzip the output\n\n    -x, --fast-mode                 \n            Calculate depth based only on read starts/stops, see docs for full details\n\n    -h, --help                      \n            Prints help information\n\n    -k, --keep-zeros                \n            Keep positions even if they have 0 depth\n\n    -m, --mate-fix                  \n            Fix overlapping mates counts, see docs for full details\n\n    -n, --no-merge                  \n            Skip merging adjacent bases that have the same depth\n\n    -M, --skip-merging-intervals    \n            Skip mergeing togther regions specified in the optional BED or BCF/VCF files.\n            \n            **NOTE** If this is set it could result in duplicate output entries for regions that overlap. **NOTE** This\n            may cause issues with downstream tooling.\n    -V, --version                   \n            Prints version information\n\n    -z, --zero-base                 \n            Output positions as 0-based instead of 1-based\n\n\nOPTIONS:\n    -B, --bcf-file \u003cbcf-file\u003e\n            A BCF/VCF file containing positions of interest. If specified, only bases from the given positions will be\n            reported on. Note that it may be more efficient to calculate depth over regions if your positions are\n            clustered tightly together\n    -b, --bed-file \u003cbed-file\u003e\n            A BED file containing regions of interest. 
If specified, only bases from the given regions will be reported\n            on\n    -C, --channel-size-modifier \u003cchannel-size-modifier\u003e\n            The fraction of a gigabyte to allocate per thread for message passing, can be greater than 1.0 [default:\n            0.001]\n    -c, --chunksize \u003cchunksize\u003e\n            The ideal number of basepairs each worker receives. Total bp in memory at one time is (threads - 2) *\n            chunksize [default: 1000000]\n    -L, --compression-level \u003ccompression-level\u003e\n            The level to use for compressing output (specified by --bgzip) [default: 2]\n\n    -T, --compression-threads \u003ccompression-threads\u003e\n            The number of threads to use for compressing output (specified by --bgzip) [default: 4]\n\n    -F, --exclude-flags \u003cexclude-flags\u003e                    \n            SAM flags to exclude, recommended 3848 [default: 0]\n\n    -f, --include-flags \u003cinclude-flags\u003e                    \n            SAM flags to include [default: 0]\n\n    -q, --min-mapq \u003cmin-mapq\u003e                              \n            Minimum MAPQ for a read to count toward depth [default: 0]\n\n    -o, --output \u003coutput\u003e                                  \n            Output path, defaults to stdout\n\n    -r, --ref-fasta \u003cref-fasta\u003e                            \n            Indexed reference fasta, set if using CRAM\n\n    -t, --threads \u003cthreads\u003e                                \n            The number of threads to use [default: 32]\n\n\nARGS:\n    \u003creads\u003e    \n            Input indexed BAM/CRAM to analyze\n```\n\n## merge-adjacent\n\n`merge-adjacent` is a utility to merge overlapping regions in a BED-like file.\n\nIt will take a file with four columns and no header as long as the columns are like:\n\n```text\n\u003ccontig\u003e\\t\u003cstart\u003e\\t\u003cstop\u003e\\t\u003cdepth\u003e\\n\n```\n\nOr it can take files with three 
columns with headers that are like\n\n```text\n\u003cREF|chrom\u003e\\t\u003cPOS|chromStart\u003e\\t\u003cEND|chromEnd\u003e\\t\u003cDEPTH|COV\u003e\n```\n\nThe `END|chromEnd` column is optional.\n\n```text\nperbase-merge-adjacent 0.7.5-alpha.0\nSeth Stadick \u003csstadick@gmail.com\u003e\nMerge adjacent intervals that have the same depth. Input must be sorted like: `sort -k1,1 -k2,2n in.bed \u003e in.sorted.bed`\n\nGenerally accepts any file with no header tha is \u003cchrom\u003e\\t\u003cstart\u003e\\t\u003cstop\u003e\\t\u003cdepth\u003e. The \u003cstop\u003e is optional. See\ndocumentation for explaination of headers that are accepted.\n\nUSAGE:\n    perbase merge-adjacent [FLAGS] [OPTIONS] [in-file]\n\nFLAGS:\n    -Z, --bgzip        \n            Optionally bgzip the output\n\n    -h, --help         \n            Prints help information\n\n    -n, --no-header    \n            Indicate if the input file does not have a header\n\n    -V, --version      \n            Prints version information\n\n\nOPTIONS:\n    -T, --compression-level \u003ccompression-level\u003e\n            The level to use for compressing output (specified by --bgzip) [default: 2]\n\n    -T, --compression-threads \u003ccompression-threads\u003e\n            The number of threads to use for compressing output (specified by --bgzip) [default: 32]\n\n    -o, --output \u003coutput\u003e                              \n            The output location, defaults to STDOUT\n\n\nARGS:\n    \u003cin-file\u003e    \n            Input bed-like file, defaults to STDIN\n```\n\nEX:\n\n```bash\nperbase only-depth indexed.bam | perbase merge-adjacent \u003e out.tsv\n```\n\n## Similar Projects\n\n- [`sambamba depth`](https://github.com/biod/sambamba/wiki/%5Bsambamba-depth%5D-documentation)\n- [`samtools depth`](http://www.htslib.org/doc/samtools-depth.html)\n- [`mosdepth`](https://github.com/brentp/mosdepth)\n- 
[`bam-readcount`](https://github.com/genome/bam-readcount)\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fsstadick%2Fperbase","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fsstadick%2Fperbase","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fsstadick%2Fperbase/lists"}