An open API service indexing awesome lists of open source software.

https://github.com/hrbigelow/depngs

Next-gen Sequencing analysis tools
https://github.com/hrbigelow/depngs

bam-files bioinformatics mutation-analysis ngs

Last synced: 2 months ago
JSON representation

Next-gen Sequencing analysis tools

Awesome Lists containing this project

README

          

# dep - Diversity Estimation Probabilistically

A tool to estimate nucleotide base composition in non-clonal samples from next-gen
sequence alignment

## Synopsis

dep dist [options] samples.rdb sample_pairings.rdb ref.fasta
dep pileup [options] samples.rdb ref.fasta out.pileup

# full usage
man -l dep.1
man -l pileup_depth_stats.1
man -l pileup_to_bindepth.1

# Introduction

Dep is a tool that estimates nucleotide base composition of non-clonal samples or
mutational distance between two non-clonal samples. For such samples, using a
traditional diploid genotype caller violates its assumption of a locus being one of
the 10 possible discrete diploid genotypes AA, CC, GG, TT, AC, AG, AT, CG, CT, GT.
Making this false assumption is worse if trying to detect changes in the genotype
between pairs of samples (such as timepoints of the same patient). This is because
one of the non-clonal samples may close to a threshold of 25\% X + 75\% Y, which is
forced to be genotyped as XY or YY, neither of which are very close to the actual
nucleotide composition. And, if a pair of samples fall on either side of this
threshold, there may be no significant change in the base composition, yet
individually genotyping them will flag a change. Even though such a situation may be
rare in absolute terms, it can be common among the pool of flagged changes.

Typically, the problem setting is as follows. The replicating virus or cancer genome
can mutate at each division, with the result that a population is non-clonal. Each
position in the genome potentially has a different selective or mutational pressure.
Taking the population as a whole, we want to measure the fraction of the population
having A, C, G, and T at each position in the genome. Knowing this allows biologists
to measure the effect of a drug on the mutational profile, or to spot locations that
correlate with survival or death.

While cancer cells divide and mutate, the immune system recognizes some of them
as bad, and is able to kill them off. Other cells acquire mutations that allow
them to evade attack for awhile until the immune system adapts. A Darwinian
natural selection dynamic takes hold. In between adaptations, a mostly
decimated population of cancer cells can regrow from a very small subpopulation
('clonal expansion').

Think of the game like this: The immune system is AlphaGo playing a billion Go
games simultaneously, each against a different amateur. At each time step, the
amateurs each make a random move. AlphaGo responds and the process repeats.
Also, at each time step, some fraction of games end, with the amateurs losing.
When these games end, there is room for any remaining amateurs to clone
themselves with the same board state until there are a billion games in play.

So, at any given time, in a very rare cases, the amateur might make a really
good move, and since he can clone his board state, the overall setup provides a
brute-force opportunity for the amateurs to win.

Breaking from the metaphor now, the goal is to spot which positions in the
genome have undergone clonal expansion, and when. This is done by comparing
the measured base composition of different cancer tissue samples at successive
timepoints. And to get an accurate picture, one must be able to detect rare
subpopulations.

So the problem is this: Now imagine I give you an urn with a billion balls,
some unknown fraction of red, blue, green, and yellow. You can only take out
5000 balls, and you actually can't observe the color of each ball directly.
You have a machine that can measure the color, but it is not 100% accurate. In
fact, it outputs a color and a confidence score (based on its own internal
metrics) telling you how likely it is to be correct. For the sake of argument,
let's say we can take this confidence score as accurate.

So you get your 5000 measurements of (color, confidence score), and now you
want to estimate the fraction of different colors in the full urn. The model I chose
is a two-stage model:

# Implementation

dep is a multi-threaded filter which consumes one or more .bam files and outputs one
or more result files. It is a unix-like filter in that it outputs an analysis for
each genomic position in the order given in the inputs. The analysis can be any of:

- base composition estimation with confidence intervals
- mutational distance estimation with confidence intervals
- indel composition
- indel mutational distance estimation
- pileup

At the top level of execution it is a parallel for-loop with fixed memory buffer
specified by the user. writing the output in the same order as the input, under the
race condition of different worker finishing times, presents a technical challenge.
This is solved using a linked list of output result buffers. The linked-list
structure preserves the order. Each buffer can be loaded by a worker thread, or
unloaded by a single writer thread. To prevent worker thread starvation, there are
more buffers than threads.