{"id":13710149,"url":"https://github.com/epi2me-labs/modbam2bed","last_synced_at":"2026-01-29T08:44:52.489Z","repository":{"id":37531680,"uuid":"387611258","full_name":"epi2me-labs/modbam2bed","owner":"epi2me-labs","description":null,"archived":false,"fork":false,"pushed_at":"2024-08-23T12:18:01.000Z","size":5082,"stargazers_count":47,"open_issues_count":0,"forks_count":8,"subscribers_count":4,"default_branch":"master","last_synced_at":"2025-04-10T21:49:04.844Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":null,"language":"C","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"other","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/epi2me-labs.png","metadata":{"files":{"readme":"README.md","changelog":"CHANGELOG.md","contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2021-07-19T22:42:35.000Z","updated_at":"2025-03-21T13:51:09.000Z","dependencies_parsed_at":"2024-08-23T13:42:58.507Z","dependency_job_id":"35ab2bd0-0934-43b2-8ae0-13cfbc1d941c","html_url":"https://github.com/epi2me-labs/modbam2bed","commit_stats":{"total_commits":82,"total_committers":2,"mean_commits":41.0,"dds":"0.012195121951219523","last_synced_commit":"2ec4bed04fdb8f32d2e2164e5c4bd57ea6d72844"},"previous_names":[],"tags_count":32,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/epi2me-labs%2Fmodbam2bed","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/epi2me-labs%2Fmodbam2bed/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/epi2me-labs%2Fmodbam2bed/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/epi2me-labs%2Fmodb
am2bed/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/epi2me-labs","download_url":"https://codeload.github.com/epi2me-labs/modbam2bed/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":248305850,"owners_count":21081562,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2024-08-02T23:00:52.409Z","updated_at":"2026-01-29T08:44:47.450Z","avatar_url":"https://github.com/epi2me-labs.png","language":"C","funding_links":[],"categories":["Software packages"],"sub_categories":["DNA modification analysis"],"readme":"![Oxford Nanopore Technologies logo](https://github.com/epi2me-labs/modbam2bed/raw/master/images/ONT_logo_590x106.png)\n\n\nWe have a new bioinformatic resource that replaces the functionality of this project! See our new repository here: \n[modkit](https://github.com/nanoporetech/modkit/).\n\nThis repository is now unsupported and we do not recommend its use. Please contact Oxford Nanopore: support@nanoporetech.com for help with your application if it is not possible to upgrade.\n\n\n******************\n\n\nModified-base BAM to bedMethyl\n------------------------------\n\nA program to aggregate modified base counts stored in a\n[modified-base BAM](https://samtools.github.io/hts-specs/SAMtags.pdf) (Section 2.1) file to \na [bedMethyl](https://www.encodeproject.org/data-standards/wgbs/) file.\n\nA Python module is also available to obtain modified base information\nfrom BAM files in a convenient form. 
It is envisaged that this will eventually\nbe replaced by an implementation in [pysam](https://pysam.readthedocs.io/en/latest/index.html).\n\n### Installation\n\nThe program is available from our conda channel, so it can be installed with:\n\n    mamba create -n modbam2bed -c bioconda -c conda-forge -c epi2melabs modbam2bed\n\nPackages are available for both Linux and MacOS.\n\nAlternatively, to install from the source code, clone the repository and then use make:\n\n    git clone --recursive https://github.com/epi2me-labs/modbam2bed.git\n    cd modbam2bed\n    make modbam2bed\n    ./modbam2bed\n\nSee the Makefile for more information. The code has been tested on MacOS (with\ndependencies from brew) and on Ubuntu 18.04 and 20.04.\n\n### Usage\n\nThe code requires aligned reads with the `Mm` and `Ml` tags (`MM` and `ML` also supported),\nand the reference sequence used for alignment.\n\nThe following is a snapshot of the command-line interface; it may not be up to date, so please\nrefer to the program `--help` option for the most accurate guidance.\n\n```\nUsage: modbam2bed [OPTION...] \u003creference.fasta\u003e \u003creads.bam\u003e [\u003creads.bam\u003e ...]\nmodbam2bed -- summarise one or more BAM with modified base tags to bedMethyl. 
\n\n General options:\n      --aggregate            Output additional aggregated (across strand)\n                             counts, requires --cpg or --chg.\n      --combine              Create output with combined modified counts: i.e.\n                             alternative modified bases within the same family\n                             (same canonical base) are included.\n  -c, --pileup               Output (full) raw base counts rather than BED\n                             file.\n  -e, --extended             Output extended bedMethyl including counts of\n                             canonical, modified, and filtered bases (in that\n                             order).\n  -m, --mod_base=BASE        Modified base of interest, one of: 5mC, 5hmC, 5fC,\n                             5caC, 5hmU, 5fU, 5caU, 6mA, 5oxoG, Xao. (Or modA,\n                             modC, modG, modT, modU, modN for generic modified\n                             base).\n  -p, --prefix=PREFIX        Output file prefix. Only used when multiple output\n                             filters are given.\n  -r, --region=chr:start-end Genomic region to process.\n  -t, --threads=THREADS      Number of threads for BAM processing.\n\n Base filtering options:\n  -a, --canon_threshold=THRESHOLD\n                             Deprecated. The option will be removed in a future\n                             version. Please use --threshold.\n  -b, --mod_threshold=THRESHOLD   Deprecated. The option will be removed in a\n                             future version. 
Please use --threshold.\n      --chg                  Output records filtered to CHG sites.\n      --chh                  Output records filtered to CHH sites.\n      --cpg                  Output records filtered to CpG sites.\n  -f, --threshold=THRESHOLD  Bases with a call probability \u003c THRESHOLD are\n                             filtered from results (default 0.66).\n  -k, --mask                 Respect soft-masking in reference file.\n\n Read filtering options:\n  -d, --max_depth=DEPTH      Max. per-file depth; avoids excessive memory\n                             usage.\n  -g, --read_group=RG        Only process reads from given read group.\n      --haplotype=VAL        Only process reads from a given haplotype.\n                             Equivalent to --tag_name HP --tag_value VAL.\n      --tag_name=TN          Only process reads with a given tag (see\n                             --tag_value).\n      --tag_value=VAL        Only process reads with a given tag value.\n\n  -?, --help                 Give this help list\n      --usage                Give a short usage message\n  -V, --version              Print program version\n\nMandatory or optional arguments to long options are also mandatory or optional\nfor any corresponding short options.\n```\n\n### Method and output format\n\nOxford Nanopore Technologies' sequencing chemistries and basecallers can detect\nany number of modified bases. Compared to traditional methods which force a\nfalse dichotomy between, say, cytosine and 5-methylcytosine, this rich biology\nneeds to be remembered when interpreting modified base calls.\n\nThe htslib pileup API is used to create a matrix of per-strand base counts\nincluding substitutions, modified bases and deletions. Inserted bases are not\ncounted. 
Bases of an ambiguous nature (referred to as \"filtered\" below), as\ndefined by the filter threshold option `--threshold` (`-f`), are masked and used\n(along with substitutions and deletions) in the definition of the \"score\"\n(column 5) and \"coverage\" (column 10) entries of the bedMethyl file.\n\nIn the case of `?`-style `MM` subtags, where a lack of a recorded call should\nnot be taken as implying a canonical-base call, the \"no call\" count is incremented.\nThe \"no call\" count is used in the calculation of \"coverage\" and also the denominator\nof \"score\".\n\nIn summary, a base is determined as being either \"canonical\", \"modified\", \"filtered\",\nor \"no call\". The final output includes a modification frequency, together with score and\ncoverage information for assessing the reliability of the frequency.\n\n**Call filtering**\n\nTo determine the base present at a locus in a read, the query base in the\nBAM record is examined along with the modified base information. A \"canonical\"\nbase probability is calculated as `1 - sum(P_mod)`, with `P_mod` being\nthe set of probabilities associated with all the modifications enumerated\nin the BAM record. The base form with the largest probability is taken as the\nbase present, subject to the user-specified threshold. 
If the probability\nis below the threshold, the call is masked and contributes to the \"filtered\"\nbase count rather than the \"canonical\" or \"modified\" counts.\n\n**Special handling of alternative modified bases (`--combine` option)**\n\nTo interpret the case of multiple modifications being listed in\nthe BAM, `modbam2bed` can operate in two modes:\n\n* *default*: alternative modified bases in the same family as the requested\n  modification are counted separately as \"other\", in neither\n  the \"canonical\" count nor the \"modified\" count.\n* `--combine`: alternative modified bases are lumped together into the\n  \"modified\" count and ultimately into a single modification frequency.\n\n***A particular case where `--combine` is useful is when comparing to the result of bisulfite sequencing.***\n\n**Output format**\n\n\u003e The description of the [bedMethyl](https://www.encodeproject.org/data-standards/wgbs/)\n\u003e format on the ENCODE project website is rather loose. The definitions below are chosen pragmatically.\n\nThe table below describes precisely the entries in each column of the output BED\nfile. Columns seven to nine inclusive are included for compatibility with the BED\nfile specification; the values written are fixed and no meaning should be derived\nfrom them. 
Columns 5, 10, and 11 are defined in terms of counts of observed\nbases to agree with reasonable interpretations of the bedMethyl specifications:\n\n * N\u003csub\u003ecanon\u003c/sub\u003e - canonical (unmodified) base count (contingent on the use of `--combine`, see above).\n * N\u003csub\u003emod\u003c/sub\u003e - modified base count.\n * N\u003csub\u003efilt\u003c/sub\u003e - count of bases where the read does not contain a substitution or deletion\n   with respect to the reference, but the modification status is ambiguous: these bases\n   were filtered from the calculation of the modification frequency.\n * N\u003csub\u003esub\u003c/sub\u003e - count of reads with a substitution with respect to the reference.\n * N\u003csub\u003edel\u003c/sub\u003e - count of reads with a deletion with respect to the reference.\n * N\u003csub\u003eno call\u003c/sub\u003e - count of reads with an absent modification call (but not a substitution or deletion).\n * N\u003csub\u003ealt mod\u003c/sub\u003e - count of reads with an alternative modification call (but not a substitution or deletion).\n\nSince these interpretations may differ from other tools, an extended output is\navailable (enabled with the `-e` option) which includes additional columns\nwith verbatim base counts.\n\n| column | description                                                                                                                                                                                                                                                                  |\n|--------|------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|\n| 1      | reference sequence name                                                                                                              
                                                                                                                                        |\n| 2      | 0-based start position                                                                                                                                                                                                                                                       |\n| 3      | 0-based exclusive end position (invariably start + 1)                                                                                                                                                                                                                        |\n| 4      | Abbreviated name of modified-base examined                                                                                                                                                                                                                                   |\n| 5      | \"Score\" 1000 * (N\u003csub\u003emod\u003c/sub\u003e + N\u003csub\u003ecanon\u003c/sub\u003e) / (N\u003csub\u003emod\u003c/sub\u003e + N\u003csub\u003ecanon\u003c/sub\u003e + N\u003csub\u003eno call\u003c/sub\u003e + N\u003csub\u003ealt mod\u003c/sub\u003e + N\u003csub\u003efilt\u003c/sub\u003e + N\u003csub\u003esub\u003c/sub\u003e + N\u003csub\u003edel\u003c/sub\u003e). The quantity reflects the extent to which the calculated modification frequency in Column 11 is confounded by the alternative calls. The denominator here is the total read coverage as given in Column 10. |\n| 6      | Strand (of reference sequence). Forward \"+\", or reverse \"-\".                                                                                                                                                                                                                 |\n| 7-9    | Ignore, included simply for compatibility.                                                                               
                                                                                                                                                    |\n| 10     | Read coverage at reference position including all canonical, modified, undecided (no calls and filtered), substitutions from reference, and deletions.  N\u003csub\u003emod\u003c/sub\u003e + N\u003csub\u003ecanon\u003c/sub\u003e + N\u003csub\u003eno call\u003c/sub\u003e + N\u003csub\u003ealt mod\u003c/sub\u003e + N\u003csub\u003efilt\u003c/sub\u003e + N\u003csub\u003esub\u003c/sub\u003e + N\u003csub\u003edel\u003c/sub\u003e                                        |\n| 11     | Percentage of modified bases, as a proportion of canonical and modified (excluding no calls, filtered, substitutions, and deletions).  100 \\* N\u003csub\u003emod\u003c/sub\u003e / (N\u003csub\u003emod\u003c/sub\u003e  + N\u003csub\u003ealt mod\u003c/sub\u003e + N\u003csub\u003ecanon\u003c/sub\u003e)                                                                                       |\n| 12\\*    | N\u003csub\u003ecanon\u003c/sub\u003e                                                                                                                                                                                                                                                            |\n| 13\\*    | N\u003csub\u003emod\u003c/sub\u003e                                                                                                                                                                                                                                                         |\n| 14\\*    | N\u003csub\u003efilt\u003c/sub\u003e those bases with a modification probability falling between given thresholds.                                                                                                                                                                           
|\n| 15\\*    | N\u003csub\u003eno call\u003c/sub\u003e those bases for which the query base was the correct canonical base for the modified base being considered, but no call was made (see the definition of the `.` and `?` flags in the SAM tag specification).                                                                                                                                                                           |\n| 16\\*    | N\u003csub\u003ealt mod\u003c/sub\u003e those bases for which the query base was the correct canonical base for the modified base being considered, but an alternative modification was present.                                                                                                                                                                           |\n\n\\* Included in extended output only.\n\n\n### Limitations\n\nThe code has not been developed extensively and currently has some limitations:\n\n * Support for motif filtering is limited to CpG, CHG, and CHH sites. Without\n   this filtering enabled, all reference positions that are the canonical base\n   (on forward or reverse strand) equivalent to the modified base under\n   consideration are reported.\n * Insertion columns are completely ignored for simplicity (and to avoid\n   any heuristics).\n * Second strand `MM` subtags (i.e. `MM:C-m` as compared with `MM:C+m`)\n   are not supported. These are not typically used so shouldn't affect most users.\n   If such a tag is detected, a warning will be issued and the tag ignored. These tags\n   do come into play for duplex basecalls.\n\n### Python package\n\nA Python package is available on [PyPI](https://pypi.org/project/modbampy/) which\ncontains basic functionality for parsing BAM files with modified-base information.\nIt is envisaged that this will eventually be replaced by an implementation in\n[pysam](https://pysam.readthedocs.io/en/latest/index.html). 
As such, the interface\nsupplements but does not integrate with or replace pysam.\n\nThe package can be installed with:\n\n```\npip install modbampy\n```\n\nThe package offers just two modes of use. The first is an interface to iterate\nover reads in a BAM file and report modification sites:\n\n```\nfrom modbampy import ModBam\n# `args` is assumed to hold parsed command-line arguments (bam, chrom, start, end)\nwith ModBam(args.bam) as bam:\n    for read in bam.reads(args.chrom, args.start, args.end):\n        for pos_mod in read.mod_sites:\n            print(*pos_mod)\n```\n\nEach line printed by the above reports the\n\n* read_id,\n* reference position,\n* query (read) position,\n* reference strand (+ or -),\n* modification strand (0 or 1, as defined in the HTSlib tag specification. This is invariably 0),\n* canonical base associated with modification,\n* modified base,\n* modified-base score (scaled to 0-255).\n\nA second method is provided which mimics the counting procedure implemented in\n`modbam2bed`:\n\n```\nfrom modbampy import ModBam\n# `args` as in the previous example\nwith ModBam(args.bam) as bam:\n    positions, counts = bam.pileup(\n        args.chrom, args.start, args.end,\n        low_threshold=0.33, high_threshold=0.66, mod_base=\"m\")\n```\n\nThe result is two [numpy](https://numpy.org/) arrays. The first indicates the reference\npositions associated with the counts in the second array. Each row of the second array\n(`counts` above) enumerates the observed counts of bases in the order:\n\n    a c g t A C G T d D m M f F n N\n\nwhere uppercase letters refer to bases on the forward strand, lowercase letters\nrelate to the reverse strand:\n\n* A, C, G, T are the usual DNA bases,\n* D indicates deletion counts,\n* M modified base counts,\n* F filtered counts - bases in reads with a modified-base record but which were filtered\n  according to the thresholds provided.\n* N no call base counts.\n\n**Extras**\n\nThe read iterator API also contains a minimal set of functionality mirroring properties of\nalignments available from pysam. 
See the [code](https://github.com/epi2me-labs/modbam2bed/blob/master/modbampy/__init__.py)\nfor further details.\n\n### Acknowledgements\n\nWe thank [jkbonfield](https://github.com/jkbonfield) for developing the modified base\nfunctionality into the htslib pileup API, and [Jared Simpson](https://github.com/jts)\nfor testing and comparison to his independently developed code.\n\n### Help\n\n**Licence and Copyright**\n\n© 2021- Oxford Nanopore Technologies Ltd.\n\n`modbam2bed` is distributed under the terms of the Mozilla Public License 2.0.\n\n**Research Release**\n\nResearch releases are provided as technology demonstrators to provide early\naccess to features or stimulate Community development of tools. Support for\nthis software will be minimal and is only provided directly by the developers.\nFeature requests, improvements, and discussions are welcome and can be\nimplemented by forking and pull requests. However much as we would\nlike to rectify every issue and piece of feedback users may have, the\ndevelopers may have limited resource for support of this software. Research\nreleases may be unstable and subject to rapid iteration by Oxford Nanopore\nTechnologies.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fepi2me-labs%2Fmodbam2bed","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fepi2me-labs%2Fmodbam2bed","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fepi2me-labs%2Fmodbam2bed/lists"}