
# spacepile (WIP)

# proposed usage

```Python
import numpy as np
import spacepile

# `cigs`, `posns`, and `max_width` are placeholders here; see the full example below for how they are built.
# spacing will be filled by the `space` call.
spacing = np.zeros((len(cigs), max_width), dtype=np.uint32)
spacepile.space(spacing, cigs, posns)

# now with label (truth)
spacing = np.zeros((len(cigs) + 1, max_width), dtype=np.uint32)
spacepile.space(spacing, cigs, posns, label_cigs, label_posn)
# how to return mapping from spacing column to label? after removing label_only insertions?

```

`spacepile` efficiently creates (multi-)alignments from replicate measurements of the same piece of DNA, as
from **duplex** sequencing, single-molecule consensus, PacBio **CCS**, or nanopore [**cyclomics**](https://cyclomics.com/).
`spacepile` accepts alignments (cigars) of grouped sequences and _spaces_ them into matrices that can then be sent to any deep learner.

Note that this is done nicely in [deepconsensus](https://github.com/google/deepconsensus), which is a major source
of inspiration for this work, but it is specific to PacBio and (likely) less efficient at extraction.

`spacepile` **will** also facilitate training (agnostic to the learner) by providing tools to
sample and augment reads and to separate training data from test data.
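
Those tools do not exist yet; as a rough sketch (assuming read groups are collected per fragment and keyed by chromosome — the names here are placeholders, not part of the `spacepile` API), a train/test split by held-out chromosome might look like:

```Python
import random

# rough sketch only: hold out whole chromosomes so that training and test
# examples never come from the same region of the genome.
TEST_CHROMS = {"chr20", "chr21"}

def split_groups(groups, seed=0):
    # groups: iterable of (chrom, pos, cigs, posns) collected per UMI/fragment
    train, test = [], []
    for chrom, pos, cigs, posns in groups:
        (test if chrom in TEST_CHROMS else train).append((chrom, pos, cigs, posns))
    random.Random(seed).shuffle(train)
    return train, test
```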

Currently `spacepile` can be used by downloading the Python wheel from the releases and `pip install`ing it.

```Python
import spacepile
print(spacepile)
import numpy as np
import pysam

# make matrices of at most this width. usually this will be ~10% longer than the read length
max_width = 9
# NOTE, these would likely be drawn from a bam file that is sorted by fragment (an option in samtools sort) or grouped by UMI
# these are available as aln.cigartuples in pysam.
# here we show 3 reads for an example.
cigs = [
    [(pysam.CMATCH, 4)],                                      # read 1. 4M     ACTG
    [(pysam.CMATCH, 2), (pysam.CDEL, 2), (pysam.CMATCH, 1)],  # read 2. 2M2D1M CTC
    [(pysam.CMATCH, 2), (pysam.CINS, 3), (pysam.CMATCH, 2)],  # read 3. 2M3I2M ACGGGTG
]

posns = [0, 1, 0] # the 2nd read starts 1 base after the other reads.

# idxs will be filled by `space` call.
idxs = np.zeros((len(cigs), max_width), dtype=np.uint32)
spacepile.space(idxs, cigs, posns)

# now we have the indexes into the original sequence, base-qualities, IPDs, etc.
SPACE = np.iinfo(np.uint32).max - 1
END = np.iinfo(np.uint32).max
assert np.array_equal(idxs,
    [[  0,   1, SPACE, SPACE, SPACE,   2,     3,   END, END],
     [END,   0, SPACE, SPACE, SPACE,   1, SPACE, SPACE,   2],
     [  0,   1,     2,     3,     4,   5,     6,   END, END]])

# sequences likely retrieved from pysam's aln.query_sequence or aln.query_alignment_sequence
raw_sequences = ["ACTG", "CTC", "ACGGGTG"]
base_qs = [[ 40, 50, 60, 70], [60, 60, 60], [45, 45, 45, 55, 55, 55, 60]]

sequences = np.zeros((len(cigs), max_width), dtype=np.int16)
bqs = np.zeros_like(sequences)
for i, s in enumerate(raw_sequences):
    sequences[i, :len(s)] = np.array(list(s), dtype='U1').view(np.int32)  # Q: is there a better way to do this in numpy?
    bqs[i, :len(s)] = np.array(base_qs[i], dtype=np.int32)

# NOTE we will put sequences and bqs in the same output matrix.
mat = np.zeros((2 * len(cigs), max_width), dtype=np.int16)
spacepile.translate(idxs, sequences, mat[:len(cigs)])

# NOTE how the columns generally match
assert np.array_equal(mat[:len(cigs)],
    [[65, 67, -2, -2, -2, 84, 71, -1, -1],
     [-1, 67, -2, -2, -2, 84, -2, -2, 67],
     [65, 67, 71, 71, 71, 84, 71, -1, -1]])

spacepile.translate(idxs, bqs, mat[len(cigs):])
assert np.array_equal(mat[len(cigs):],
    [[40, 50, -2, -2, -2, 60, 70, -1, -1],
     [-1, 60, -2, -2, -2, 60, -2, -2, 60],
     [45, 45, 45, 55, 55, 55, 60, -1, -1]])

# now repeat for many sequences and send many mats to a learner.
```
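
The batching itself is left to the user; as one possible (not prescribed) way to stack many such matrices for a PyTorch learner:

```Python
import numpy as np
import torch  # assumption: the downstream learner is pytorch; spacepile does not require it

# `mats` would be a list of (2 * depth, max_width) matrices built as above;
# here we just reuse the single example matrix from this README.
mats = [mat]
batch = torch.from_numpy(np.stack(mats)).float()  # shape: (batch, 2 * depth, max_width)
# batch can now be passed to any model that expects a fixed-size input, e.g. model(batch)
```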

# Research

## steps

### training

#### read selection

We need enough unique reads from across the genome to train. Given that duplex-seq is often deep,
we want to sample wisely so we are not training on the same position 10,000 times.
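
A minimal sketch of that kind of sampling, assuming read groups are keyed by (chromosome, position) and the per-position cap is a free parameter (none of these names are part of the `spacepile` API):

```Python
import random
from collections import defaultdict

def sample_per_position(groups, max_per_position=5, seed=0):
    """Downsample so no single genomic position dominates the training set.

    groups: iterable of (chrom, pos, group) where `group` holds the reads
    (cigars, positions, sequences, ...) for one fragment/UMI family.
    """
    rng = random.Random(seed)
    by_pos = defaultdict(list)
    for chrom, pos, group in groups:
        by_pos[(chrom, pos)].append(group)
    sampled = []
    for key, gs in by_pos.items():
        rng.shuffle(gs)
        sampled.extend(gs[:max_per_position])
    return sampled
```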

#### calibration

The output from the model is not calibrated -- the error rates do not match the probabilities.
This is a known problem; see [this paper covering the topic](https://arxiv.org/abs/1706.04599).

Relevant tools and references:

- https://scikit-learn.org/stable/modules/calibration.html
- https://arxiv.org/abs/2210.00045 (from google, 9/30/2022)
- https://github.com/dirichletcal/dirichlet_python -- NOTE: this one seems best; just get the pytorch model into an sklearn-like class.
- https://github.com/classifier-calibration/PyCalib
- https://github.com/gpleiss/temperature_scaling (needs logits, not softmax)
- https://github.com/dirichletcal/experiments_dnn

But this may require further work because of the positional info: the ends of reads often have lower quality.
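
As one example of the options above, a minimal temperature-scaling sketch (in the spirit of gpleiss/temperature_scaling, which operates on logits, not softmax output; the variable names are placeholders):

```Python
import torch

def fit_temperature(logits, labels, lr=0.01, max_iter=50):
    """Fit a single temperature on held-out (logits, labels) to calibrate probabilities."""
    temperature = torch.ones(1, requires_grad=True)
    optimizer = torch.optim.LBFGS([temperature], lr=lr, max_iter=max_iter)
    nll = torch.nn.CrossEntropyLoss()

    def closure():
        optimizer.zero_grad()
        loss = nll(logits / temperature, labels)
        loss.backward()
        return loss

    optimizer.step(closure)
    return temperature.detach()

# at inference: calibrated_probs = torch.softmax(logits / temperature, dim=-1)
```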

### spacing

This implements the multi-alignment spacing which takes a set of reads and returns
something that is sent to the network. Currently this is implemented in Python;
a Rust version is in progress.

This involves clustering (by tag) and selecting up to max_depth reads per cluster so we can have a fixed
network input size.
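
A minimal sketch of that clustering and capping step, assuming a UMI-style grouping tag (the "RX" tag name is only an example, not something `spacepile` mandates):

```Python
from collections import defaultdict

def group_and_cap(alignments, max_depth=16, tag="RX"):
    """Cluster alignments by a grouping tag (e.g. UMI / fragment id) and cap each
    cluster at max_depth reads so every example has the same shape.

    alignments: iterable of pysam AlignedSegment objects carrying the grouping tag.
    """
    clusters = defaultdict(list)
    for aln in alignments:
        clusters[aln.get_tag(tag)].append(aln)
    for key, alns in clusters.items():
        yield key, alns[:max_depth]
```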

### consensus

Take a trained model and a BAM and write a new FASTQ with well-calibrated base-qualities.
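
None of this is implemented yet; a rough sketch of the intended shape, where `groups` and `model` are hypothetical placeholders (the model returning calibrated per-column base probabilities of shape (width, 4) for A, C, G, T):

```Python
import numpy as np

BASES = np.array(list("ACGT"))

def write_consensus(groups, model, out_path):
    """Call a consensus base per column and emit FASTQ with phred-scaled qualities."""
    with open(out_path, "w") as fq:
        for name, mat in groups:                    # one spaced matrix per fragment
            probs = model(mat)                      # (width, 4) calibrated probabilities
            calls = probs.argmax(axis=1)
            seq = "".join(BASES[calls])
            perr = 1.0 - probs.max(axis=1)          # probability each call is wrong
            quals = np.clip(-10 * np.log10(np.maximum(perr, 1e-6)), 2, 60).astype(int)
            qual_str = "".join(chr(q + 33) for q in quals)
            fq.write(f"@{name}\n{seq}\n+\n{qual_str}\n")
```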

## build notes

```sh
docker run --rm -v $(pwd):/io ghcr.io/pyo3/maturin build --release
pip install target/wheels/spacepile-0.1.0-cp37-abi3-linux_x86_64.whl
```