{"id":25596889,"url":"https://github.com/nanoporetech/remora","last_synced_at":"2025-04-06T22:09:50.958Z","repository":{"id":42528221,"uuid":"430866539","full_name":"nanoporetech/remora","owner":"nanoporetech","description":"Methylation/modified base calling separated from basecalling.","archived":false,"fork":false,"pushed_at":"2024-09-17T16:25:22.000Z","size":47155,"stargazers_count":168,"open_issues_count":12,"forks_count":22,"subscribers_count":19,"default_branch":"master","last_synced_at":"2025-04-06T08:02:15.672Z","etag":null,"topics":["basecalling","methylation","nanopore"],"latest_commit_sha":null,"homepage":"https://nanoporetech.com","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"other","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/nanoporetech.png","metadata":{"files":{"readme":"README.rst","changelog":null,"contributing":null,"funding":null,"license":"LICENSE.txt","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2021-11-22T21:13:12.000Z","updated_at":"2025-04-02T01:37:22.000Z","dependencies_parsed_at":"2024-03-13T23:43:00.508Z","dependency_job_id":null,"html_url":"https://github.com/nanoporetech/remora","commit_stats":{"total_commits":190,"total_committers":6,"mean_commits":"31.666666666666668","dds":"0.38421052631578945","last_synced_commit":"0a96bfff7419514c3b9d01173b506900c14a3350"},"previous_names":[],"tags_count":14,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/nanoporetech%2Fremora","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/nanoporetech%2Fremora/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/nanoporetech%2Fremora/releases","man
ifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/nanoporetech%2Fremora/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/nanoporetech","download_url":"https://codeload.github.com/nanoporetech/remora/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":247451643,"owners_count":20940939,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["basecalling","methylation","nanopore"],"created_at":"2025-02-21T12:35:12.629Z","updated_at":"2025-04-06T22:09:50.880Z","avatar_url":"https://github.com/nanoporetech.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":".. 
image:: /ONT_logo.png\n  :width: 800\n  :alt: [Oxford Nanopore Technologies]\n  :target: https://nanoporetech.com/\n\nRemora\n\"\"\"\"\"\"\n\nRemora models predict methylation/modified base status separated from basecalling.\nThe Remora repository is focused on the preparation of modified base training data and training modified base models.\nSome functionality for running Remora models and investigation of raw signal is also provided.\nFor production modified base calling use `Dorado \u003chttps://github.com/nanoporetech/dorado\u003e`_.\nFor recommended modified base downstream processing use `modkit \u003chttps://github.com/nanoporetech/modkit\u003e`_.\nFor more advanced modified base data preparation from \"randomers\" see the `Betta release community note \u003chttps://community.nanoporetech.com/posts/betta-tool-release\u003e`_ and reach out to customer support to inquire about access (customer.support@nanoporetech.com).\n\nInstallation\n------------\n\nInstall from pypi:\n\n::\n\n   pip install ont-remora\n\nInstall from github source for development:\n\n::\n\n   git clone git@github.com:nanoporetech/remora.git\n   pip install -e remora/[tests]\n\nIt is recommended that Remora be installed in a virtual environment.\nFor example ``python3 -m venv venv; source venv/bin/activate``.\n\nFor GPU optimization using torch, ensure that a version of torch compatible with the system GPU/CUDA drivers is installed.\nNote that Remora does not attempt to resolve the correct version of torch.\nSee the `torch installation page \u003chttps://pytorch.org/get-started/locally/\u003e`_ for compatible drivers and installation instructions.\n\nAs an example to install Remora with CUDA 11.8 drivers the following command can be used:\n\n::\n\n   pip install torch --index-url https://download.pytorch.org/whl/cu118\n   pip install ont-remora\n\nSee help for any Remora sub-command with the ``-h`` flag.\n\nGetting Started\n---------------\n\nRemora models predict modified bases anchored to 
canonical basecalls or reference sequence from a nanopore read.\n\nThe Remora training/prediction input unit (referred to as a chunk) consists of:\n\n1. Section of normalized signal\n2. Canonical bases attributed to the section of signal\n3. Mapping between these two\n\nChunks have a fixed signal length defined at data preparation/model training time.\nThese values are saved with the Remora model to extract chunks in the same manner at inference.\nA fixed position within the chunk is defined as the \"focus position\" around which the fixed signal chunk is extracted.\nBy default, this position is the center of the \"focus base\" being interrogated by the model.\n\nThe canonical bases and mapping to signal (a.k.a. \"move table\") are combined for input into the neural network in several steps.\nFirst each base is expanded to the k-mer surrounding that base (as defined by the ``--kmer-context-bases`` hyper-parameter).\nThen each k-mer is expanded according to the move table.\nFinally each k-mer is one-hot encoded for input into the neural network.\nThis procedure is depicted in the figure below.\n\n.. 
image:: images/neural_network_sequence_encoding.png\n   :width: 600\n   :alt: Neural network sequence encoding\n\nData Preparation\n----------------\n\nRemora data preparation begins from a POD5 file containing signal data and a BAM file containing basecalls from the POD5 file.\nNote that the BAM file must contain the move table (``--emit-moves`` in Dorado) and the MD tag (default in Dorado with mapping and ``--MD`` argument for minimap2).\nIf using minimap2 for alignment use ``samtools fastq -T \"*\" [in.bam] | minimap2 -y -ax lr:hq [ref.fa] - | samtools view -b -o [out.bam]`` in order to transfer the move table tags through the alignment step since minimap2 does not support SAM/BAM input.\n\nThe following example generates training data from canonical (PCR) and modified (M.SssI treatment) samples in the same fashion as the released 5mC CG-context models.\nExample reads can be found in the Remora repository (see ``test/data/`` directory).\n\nK-mer tables for applicable conditions can be found in the `kmer_models repository \u003chttps://github.com/nanoporetech/kmer_models\u003e`_.\n\n.. 
code-block:: bash\n\n  remora \\\n    dataset prepare \\\n    can_reads.pod5 \\\n    can_mappings.bam \\\n    --output-path can_chunks \\\n    --refine-kmer-level-table levels.txt \\\n    --refine-rough-rescale \\\n    --motif CG 0 \\\n    --mod-base-control\n  remora \\\n    dataset prepare \\\n    mod_reads.pod5 \\\n    mod_mappings.bam \\\n    --output-path mod_chunks \\\n    --refine-kmer-level-table levels.txt \\\n    --refine-rough-rescale \\\n    --motif CG 0 \\\n    --mod-base m 5mC\n\nThe above commands each produce a core Remora dataset stored in the directory defined by ``--output-path``.\nCore datasets contain memory mapped numpy files for each core array (chunk data) and a JSON format metadata config file.\nThese memory mapped files allow efficient access to very large datasets.\n\nBefore Remora 3.0, datasets were stored as numpy array dictionaries.\nUpdating datasets can be accomplished with the ``scripts/update_dataset.py`` script included in the repository.\n\nComposing Datasets\n******************\n\nCore datasets (or other composed datasets) can be composed to produce a new dataset.\nThe ``remora dataset make_config`` command creates these config files specifying the composition of the new dataset.\nWhen reading batches from these combined datasets, the default behavior will be to draw chunks randomly from the entire set of chunks.\nThis setting is useful for multiple flowcells of the same condition.\n\nThe ``--dataset-weights`` argument produces a config which generates batches with a fixed proportion of chunks from each input dataset.\nThis setting is useful when combining different data types, for example control and modified datasets.\n\nThe ``remora dataset merge`` command is supplied to merge datasets, copying the data into a new core Remora dataset.\nThis may increase efficiency of data access for datasets composed of many core datasets, but only supports the default behavior from the ``make_config`` command (sampling over all 
chunks).\n\nThe ``remora dataset copy`` command is provided in order to move datasets to a new location.\nThis can be useful when handling config datasets composed of many core datasets.\nCopying a dataset is especially useful to achieve higher training speeds when core datasets are stored on a network file system (NFS).\n\nComposed dataset config files can also be specified manually.\nConfig files are JSON format files containing a single list.\nEach element in the list represents one dataset.\nDatasets must specify the path to the core dataset (or another config) and the weight of this sub-dataset.\nOptionally the dataset hash and a filter file may be included for each dataset.\nThese values may be specified via a list (ordered: path, weight, hash, filter), or via a dictionary (keys: path, weight, hash, filter).\nThe ``make_config`` output config file will contain the dataset hash to ensure the contents of a dataset are unchanged when reading.\nSee the ``remora dataset make_filter`` command for more details on filters.\n\nMetadata attributes from each core dataset are checked for compatibility and merged where applicable.\nChunk raw data are loaded from each core dataset at specified proportions to construct batches at loading time.\nIn a break from Remora \u003c3.0, datasets allow \"infinite iteration\", where each core dataset is drawn from indefinitely and independently to supply training chunks.\nFor validation from a fixed set of chunks, finite iteration is also supported.\n\nTo generate a dataset config from the datasets created above one can use the following command.\n\n.. code-block:: bash\n\n  remora \\\n    dataset make_config \\\n    train_dataset.jsn \\\n    can_chunks \\\n    mod_chunks \\\n    --dataset-weights 1 1 \\\n    --log-filename train_dataset.log\n\nModel Training\n--------------\n\nModels are trained with the ``remora model train`` command.\nFor example a model can be trained with the following command.\n\n.. 
code-block:: bash\n\n  remora \\\n    model train \\\n    train_dataset.jsn \\\n    --model remora/models/ConvLSTM_w_ref.py \\\n    --device 0 \\\n    --chunk-context 50 50 \\\n    --output-path train_results\n\nThis command will produce a \"best\" model in torchscript format for use in Bonito, ``remora infer``, or ``remora validate`` commands.\nModels can be exported for use in Dorado with the ``remora model export train_results/model_best.pt train_results_dorado_model`` command.\n\nModel Inference\n---------------\n\nFor testing purposes, inference within Remora is provided.\nFor standard model architectures and inference methods, using the exported Dorado model during basecalling is recommended.\n\n.. code-block:: bash\n\n  remora \\\n    infer from_pod5_and_bam \\\n    can_signal.pod5 \\\n    can_mappings.bam \\\n    --model train_results/model_best.pt \\\n    --out-file can_infer.bam \\\n    --log-filename can_infer.log \\\n    --device 0\n  remora \\\n    infer from_pod5_and_bam \\\n    mod_signal.pod5 \\\n    mod_mappings.bam \\\n    --model train_results/model_best.pt \\\n    --out-file mod_infer.bam \\\n    --log-filename mod_infer.log \\\n    --device 0\n\nThe ``remora validate from_modbams`` command is deprecated and will be removed in a future version of Remora.\nThe ``modkit validate`` command is now recommended for this purpose.\n\nReference-anchored Inference\n****************************\n\nReference-anchored inference allows users to make per-read per-site modified base calls against the reference sequence to which a read is mapped.\nThis is in contrast to standard Remora model inference where calls are made against the basecalls.\nThis mode can be useful to explore modified bases around which the canonical basecaller does not perform well.\nThis inference mode is toggled by the ``--reference-anchored`` argument to the ``remora infer from_pod5_and_bam`` command.\n\nThe output BAM file from this command will take each mapped read and replace the 
basecalls with the mapped reference bases.\nThe move table will be transferred to the mapped reference bases and interpolated over mapping reference deletions in order to enable extraction of Remora chunks for inference.\n\nNote that this means that the canonical basecalls will show 0 errors over the entire output BAM file.\nThe intended purpose of this output is only to store the modified base status for each read at each applicable base.\nAny analysis of basecall metrics should not use the output of this command.\n\nPre-trained Models\n------------------\n\nSee the selection of current released models with ``remora model list_pretrained``.\nPre-trained models are stored remotely and can be downloaded using the ``remora model download`` command or will be downloaded on demand when needed.\n\nModels may be run from `Bonito \u003chttps://github.com/nanoporetech/bonito\u003e`_.\nSee Bonito documentation to apply Remora models.\n\nMore advanced research models may be supplied via `Rerio \u003chttps://github.com/nanoporetech/rerio\u003e`_.\nThese files require download from Rerio and then the path to this download must be provided to Remora.\nNote that older ONNX format models require Remora version \u003c 2.0.\n\nDownloaded or trained models can be inspected with the ``remora model inspect`` command to view the metadata attributes of the model.\n\nPython API and Raw Signal Analysis\n----------------------------------\n\nRaw signal plotting is available via the ``remora analyze plot ref_region`` command.\n\nThe ``plot ref_region`` command is useful for gaining intuition into signal attributes and visualizing signal shifts around modified bases.\nAs an example using the test data, the following command produces the plots below.\nNote that only a single POD5 file per sample is allowed as input and that the BAM records must contain the ``mv`` and ``MD`` tags (see the \"Data Preparation\" section above for details).\n\n.. 
code-block:: bash\n\n  remora \\\n    analyze plot ref_region \\\n    --pod5-and-bam can_reads.pod5 can_mappings.bam \\\n    --pod5-and-bam mod_reads.pod5 mod_mappings.bam \\\n    --ref-regions ref_regions.bed \\\n    --highlight-ranges mod_gt.bed \\\n    --refine-kmer-level-table levels.txt \\\n    --refine-rough-rescale \\\n    --log-filename log.txt\n\n.. image:: images/plot_ref_region_fwd.png\n   :width: 600\n   :alt: Plot reference region image (forward strand)\n\n.. image:: images/plot_ref_region_rev.png\n   :width: 600\n   :alt: Plot reference region image (reverse strand)\n\nThe Remora API to access, manipulate and visualize nanopore reads including signal, basecalls, and reference mapping is described in more detail in the ``notebooks`` section of this repository.\n\nTerms and Licence\n-----------------\n\nThis is a research release provided under the terms of the Oxford Nanopore Technologies' Public Licence.\nResearch releases are provided as technology demonstrators to provide early access to features or stimulate Community development of tools.\nSupport for this software will be minimal and is only provided directly by the developers. Feature requests, improvements, and discussions are welcome and can be implemented by forking and pull requests.\nMuch as we would like to rectify every issue, the developers may have limited resource for support of this software.\nResearch releases may be unstable and subject to rapid change by Oxford Nanopore Technologies.\n\n© 2021-2024 Oxford Nanopore Technologies Ltd.\nRemora is distributed under the terms of the Oxford Nanopore Technologies' Public Licence.\n\nResearch Release\n----------------\n\nResearch releases are provided as technology demonstrators to provide early access to features or stimulate Community development of tools. Support for this software will be minimal and is only provided directly by the developers. 
Feature requests, improvements, and discussions are welcome and can be implemented by forking and pull requests. However much as we would like to rectify every issue and piece of feedback users may have, the developers may have limited resource for support of this software. Research releases may be unstable and subject to rapid iteration by Oxford Nanopore Technologies.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fnanoporetech%2Fremora","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fnanoporetech%2Fremora","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fnanoporetech%2Fremora/lists"}