https://github.com/mrc-ide/tapestry
- Host: GitHub
- URL: https://github.com/mrc-ide/tapestry
- Owner: mrc-ide
- License: mit
- Created: 2022-08-24T17:43:24.000Z (over 2 years ago)
- Default Branch: master
- Last Pushed: 2023-12-15T17:20:14.000Z (about 1 year ago)
- Last Synced: 2024-12-21T03:52:00.249Z (14 days ago)
- Language: C++
- Size: 5.91 MB
- Stars: 1
- Watchers: 6
- Forks: 2
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
![build](https://github.com/mrc-ide/Tapestry/actions/workflows/build.yml/badge.svg)
Malaria infections often contain multiple genotypes, and when sequenced together
these produce a complex signal that is a mixture of the individual genotypes.
Building on the framework of earlier programs like DEploid and DEploidIBD,
Tapestry attempts to pull these individual genotypes apart by exploiting allele
frequency imbalances within a sample, while simultaneously estimating segments
of identity by descent (IBD) between sequences. Unlike previous programs,
Tapestry uses advanced MCMC methods to ensure that results are robust even for
high complexity of infection (COI).

## Table of Contents
- [Install](#install)
- [Quick usage](#quick-usage)
- [Helper scripts](#helper-scripts)
- [To do](#to-do)
- [Model](#model)

## Install
### Build from source

**Step 1:** Install [HTSlib](https://github.com/samtools/htslib) dependency.
e.g.
```
git clone https://github.com/samtools/htslib
cd htslib
autoreconf -i # Build the configure script and install files it uses
./configure # Optional but recommended, for choosing extra functionality
make
make install
```

**Step 2:** Clone Tapestry.
```
git clone https://github.com/mrc-ide/Tapestry.git
cd Tapestry
```
**Step 3:** Compile Tapestry using [CMake](https://cmake.org/).

For a (slower) debugging version:
```
cd build
cmake ..
make
```

For a (faster) release version:
```
mkdir release
cd release
cmake .. -DCMAKE_BUILD_TYPE="Release"
make
```

**Optional**: By default the tests are not compiled. To compile them, open [CMakeLists.txt](https://github.com/mrc-ide/Tapestry/blob/feature/likelihood/CMakeLists.txt) in a text editor, and change line 16:
```
set(COMPILE_TESTS OFF)
```

...to...
```
set(COMPILE_TESTS ON)
```

Repeat Step 3 and then run tests with the executable `./test_tapestry`. The testing framework is [GoogleTest](https://github.com/google/googletest).
## Quick usage
The executable `tapestry` will be in your `/build` or `/release` directory.
```
$ ./release/tapestry infer --help
Run inference from an filtered VCF.
Usage: ./release/tapestry infer [OPTIONS]

Options:
  -h,--help                   Print this help message and exit

Input and output:
  -i,--input_vcf TEXT:FILE REQUIRED
                              Path to input VCF file.
  -s,--target_sample TEXT REQUIRED
                              Target sample in VCF.
  -o,--output_dir TEXT        Output directory.

Model Hyperparameters:
  -K,--COI INT:INT in [1 - 6] Complexity of infection.
  -e,--error_ref FLOAT:INT in [0 - 1]
                              Probability of REF->ALT error.
  -E,--error_alt FLOAT:INT in [0 - 1]
                              Probability of ALT->REF error.
  -v,--var_wsaf FLOAT:POSITIVE
                              Controls dispersion in WSAF. Larger is less dispersed.
  -r,--recomb_rate FLOAT:POSITIVE
                              Recombination rate in kbp/cM.
  -b,--n_wsaf_bins INT:INT in [100 - 10000]
                              Number of WSAF bins in Betabin lookup table.

MCMC Parameters:
  -w,--w_proposal FLOAT:POSITIVE
                              Controls variance in proportion proposals.
```

## unravel: A Python package for plotting Tapestry outputs
This repository includes a small Python package under `/python` for plotting the outputs from Tapestry. Install it like so:
```
cd python
conda update conda
conda env create -f environment.yml
conda activate unravel
pip install -e .
```

You should now have access to `unravel` on the command line:
```
$ unravel sample --help
Usage: unravel sample [OPTIONS]

  Plot Tapestry outputs for an individual sample

Options:
  -i, --input_dir PATH  Directory containing Tapestry outputs, for an
                        individual sample.  [required]
  --help                Show this message and exit.
```

## To do
### Major
- Copying of Particle inside ProposalEngine could become slow when there are a large number of parameters. Can we use pointers?
- Some objects (e.g. Model) could get very large. We should consider allocating them on the heap to avoid stack overflow.
- ProposalEngine does not make any allowance for asymmetric proposal distributions.

### Minor
You can see all minor TODOs with:
```
grep -nC5 "TODO" src/*
```
## Model
We imagine a scenario where our sequencing data is generated by a fixed number of genetically distinct strains, which may or may not share regions of identity by descent. Each strain is imagined to comprise a fixed proportion of the infection. The fraction of reads that are derived from each strain is influenced by this proportion.
### Likelihood
The likelihood is formulated as a Hidden Markov Model:
$$
P(\vec{X}, \vec{S} | \Theta) = P(S_1) P(X_1 | S_1) \prod_{i=2}^{L} P(S_i|S_{i-1}) P(X_i | S_i)
$$
As such, we can compute it by defining initiation, transition, and emission probabilities.
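As a minimal sketch of how this factorisation is evaluated for one path of IBD states, the function below sums the log of each term. The initiation and transition models are not yet specified in this document, so `log_init`, `log_trans` and `log_emit` are hypothetical inputs filled with toy numbers; none of these names come from the Tapestry source.

```python
import math


def joint_log_likelihood(log_init, log_trans, log_emit, path):
    """Return log P(X, S | Theta) for a single state path S_1..S_L.

    log_init[s]        : log P(S_1 = s)
    log_trans[i][r][s] : log P(S_{i+2} = s | S_{i+1} = r), one matrix per SNP interval
    log_emit[i][s]     : log P(X_{i+1} | S_{i+1} = s)
    path               : list of 0-based state indices, one per SNP
    """
    # First site: initiation term times emission term.
    total = log_init[path[0]] + log_emit[0][path[0]]
    # Remaining sites: one transition term and one emission term each.
    for i in range(1, len(path)):
        prev_state, state = path[i - 1], path[i]
        total += log_trans[i - 1][prev_state][state] + log_emit[i][state]
    return total


# Toy example with 2 states and 3 SNPs (all numbers are made up).
init = [0.5, 0.5]
trans = [[0.9, 0.1], [0.2, 0.8]]
emit = [[0.3, 0.6], [0.5, 0.1], [0.4, 0.7]]

log_init = [math.log(x) for x in init]
log_trans = [[[math.log(x) for x in row] for row in trans] for _ in range(2)]
log_emit = [[math.log(x) for x in row] for row in emit]

ll = joint_log_likelihood(log_init, log_trans, log_emit, path=[0, 0, 1])
print(math.exp(ll))  # 0.5 * 0.3 * (0.9 * 0.5) * (0.1 * 0.7)
```

Working in log space avoids underflow when the number of SNPs is large.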
#### Emission probabilities, $P(X_i|S_i)$
Given a set of proportions and haplotype states, we can compute the proportion of the sample comprised of the alternative allele (or rather the expected WSAF without accounting for error), as:
$$
q_i = \sum_{j=1}^{K} w_jh_{ij}
$$
We then adjust for sequencing error by assuming two fixed error rates, $e_0$ (REF read as ALT) and $e_1$ (ALT read as REF):
$$ \pi_i = q_i(1-e_1) + (1 - q_i)e_0$$
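The two formulas above can be checked with a small worked example. This is an illustrative sketch, not code from Tapestry: the function names and the proportions are hypothetical, and the error rates are the defaults listed under Hyper-parameters.

```python
def expected_wsaf(w, h):
    """q_i = sum_j w_j * h_ij, the expected WSAF before error adjustment."""
    return sum(w_j * h_ij for w_j, h_ij in zip(w, h))


def error_adjusted_wsaf(q, e0, e1):
    """pi_i = q_i * (1 - e1) + (1 - q_i) * e0."""
    return q * (1.0 - e1) + (1.0 - q) * e0


# Two strains at 70% and 30%; only the first carries the ALT allele.
w = [0.7, 0.3]
h = [1, 0]

q = expected_wsaf(w, h)                        # 0.7
pi = error_adjusted_wsaf(q, e0=0.01, e1=0.05)  # 0.7 * 0.95 + 0.3 * 0.01

# Even a site with no ALT-carrying strain has a non-zero expected WSAF.
pi_all_ref = error_adjusted_wsaf(0.0, e0=0.01, e1=0.05)  # equals e0
```

Because both error rates are positive, $\pi_i$ always lies strictly between 0 and 1, which keeps the beta-binomial parameters below well defined.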
If we sequenced every parasite genome in the host, our observed WSAF, $x_i$, would be exactly $\pi_i$. In practice, however, we generate a finite number of reads sampled from the infection through a random process. This sampling variation is modelled using a [Beta-binomial distribution](https://en.wikipedia.org/wiki/Beta-binomial_distribution):
$$ X_i \sim BetaBin(a_i + r_i, \alpha, \beta) $$
We re-parameterise the distribution to have better control over its mean and variance:
$$\alpha=v\pi_i$$
$$\beta=v(1-\pi_i)$$
With this parameterisation, the expected error-adjusted WSAF is:
$$ E[X_i|v, \pi_i] = \frac{\alpha}{\alpha+\beta} = \pi_i $$
as we sought. Additionally we have:
$$Var(X_i|v, \pi_i) \propto \frac{1}{v}$$
which gives us good control over the variance.
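For reference, the standard moments of the beta-binomial make the role of $v$ explicit. The lines below read $X_i$ as the ALT read count out of $n_i = a_i + r_i$ reads (an interpretation; $n_i$ is shorthand introduced here) and substitute $\alpha = v\pi_i$ and $\beta = v(1-\pi_i)$:
$$ E[X_i] = n_i \frac{\alpha}{\alpha+\beta} = n_i\pi_i $$
$$ Var(X_i) = \frac{n_i\alpha\beta(\alpha+\beta+n_i)}{(\alpha+\beta)^2(\alpha+\beta+1)} = n_i\pi_i(1-\pi_i)\frac{v+n_i}{v+1} $$
Dividing the count by $n_i$ gives a WSAF with mean $\pi_i$ and variance
$$ \frac{\pi_i(1-\pi_i)}{n_i}\left(1 + \frac{n_i-1}{v+1}\right) $$
so the excess over binomial variance shrinks roughly as $1/v$, and the binomial is recovered as $v \to \infty$.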
If reads were sampled independently, randomly, and with replacement from the underlying strain proportions, we would expect the observed WSAF to be binomially distributed. However, we would like to allow additional dispersion across SNPs. For example, genomic context can influence sequencing performance, and that context will be different for each SNP; SNPs with identical haplotype configurations should have more than just binomial variance in their observed WSAF. The $v$ term in the beta-binomial allows us to capture this additional dispersion.
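The extra dispersion can be seen numerically. The sketch below evaluates the beta-binomial under the $(v, \pi_i)$ parameterisation using only the Python standard library; the function name, the read depth and the value of $\pi_i$ are illustrative choices, and this is not the lookup-table implementation used by Tapestry.

```python
import math


def log_betabin_pmf(a, n, pi, v):
    """log P(a ALT reads out of n) under BetaBin(n, alpha=v*pi, beta=v*(1-pi)).

    Requires 0 < pi < 1 and v > 0.
    """
    alpha, beta = v * pi, v * (1.0 - pi)
    log_choose = math.lgamma(n + 1) - math.lgamma(a + 1) - math.lgamma(n - a + 1)
    log_beta_post = (math.lgamma(a + alpha) + math.lgamma(n - a + beta)
                     - math.lgamma(n + alpha + beta))
    log_beta_prior = math.lgamma(alpha) + math.lgamma(beta) - math.lgamma(alpha + beta)
    return log_choose + log_beta_post - log_beta_prior


n, pi, v = 50, 0.668, 500.0  # read depth and pi are made-up values
pmf = [math.exp(log_betabin_pmf(a, n, pi, v)) for a in range(n + 1)]

total = sum(pmf)                                    # should be 1
mean = sum(a * p for a, p in enumerate(pmf))        # should be n * pi
var = sum((a - mean) ** 2 * p for a, p in enumerate(pmf))
binom_var = n * pi * (1.0 - pi)                     # variance without extra dispersion

print(f"mean={mean:.3f}  var={var:.3f}  binomial var={binom_var:.3f}")
```

Lowering `v` widens the gap between `var` and `binom_var`, while a very large `v` brings the two together.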
Rather than explicitly inferring haplotypes at each site, we marginalise over all possible haplotype configurations by making our final emission probability a finite mixture of beta-binomial distributions. Each haplotype configuration, $\vec{h}$, corresponds to a particular subset of the $K$ strains carrying the alternative allele. Note that $|\vec{h}| = 2^K$, and we index these configurations with a superscript $b$. For example, if $K=2$, the $\vec{h}$ would be $\Set{0,0}, \Set{0, 1}, \Set{1, 0}, \Set{1, 1}$. In essence, each $b$ generates a mode in the multi-modal WSAF distribution.
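The enumeration of configurations is easy to sketch. The snippet below lists all $2^K$ configurations for a given $K$ and the expected WSAF each one produces; the function name and the proportions are hypothetical, chosen only to illustrate the modes.

```python
from itertools import product


def haplotype_configurations(K):
    """Return all 2^K assignments of REF (0) / ALT (1) to K strains."""
    return list(product((0, 1), repeat=K))


configs = haplotype_configurations(2)
print(configs)  # [(0, 0), (0, 1), (1, 0), (1, 1)]

# Each configuration implies one expected WSAF, i.e. one mode of the mixture.
w = [0.7, 0.3]  # made-up strain proportions
modes = [sum(w_j * h_j for w_j, h_j in zip(w, h)) for h in configs]
# Approximately 0, 0.3, 0.7 and 1 for these proportions.
```

The number of modes doubles with each additional strain, which is one reason the problem becomes harder at high COI.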
Our emission probability at each site becomes:
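$$ P(X_i|S_i, w_j, p_i, v, e_0, e_1) = \sum_{b=1}^{2^{K}} P(a_i, r_i | \vec{h^{b}}, S_i, w_j, v, e_0, e_1) P(\vec{h^{b}}|S_i, p_i)$$
The IBD configuration $S_i$ limits which haplotype configurations are possible, by restricting strains in IBD to have the same haplotype state. In addition, we assume the allele of each group of strains in IBD is sampled independently and randomly, with probability $p_i$ of being the alternative allele. Let $G_{S_i}$ be the number of IBD groups under $S_i$ (so $G_{S_i} = K$ when no strains are in IBD), and let $C_{h=1|S_i}$ be the number of those groups carrying the alternative allele. Then, for configurations consistent with $S_i$,
$$ P(\vec{h^{b}}|S_i, p_i)=p_i^{C_{h=1|S_i}}(1-p_i)^{G_{S_i}-C_{h=1|S_i}}$$
and configurations inconsistent with $S_i$ have probability zero.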
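The mixture can be sketched end to end for a single site. The code below is illustrative rather than Tapestry's implementation. It represents the IBD state as a list of groups of strain indices (a hypothetical encoding) and sums only over configurations consistent with that state, since the others have zero prior probability. It also assumes the exponent on $(1-p)$ counts the IBD groups that do not carry the alternative allele, so that the prior sums to one over the allowed configurations. Default hyper-parameter values are taken from the Hyper-parameters table.

```python
import math
from itertools import product


def log_betabin_pmf(a, n, pi, v):
    """log P(a ALT reads out of n) for BetaBin(n, v*pi, v*(1-pi)); needs 0 < pi < 1."""
    alpha, beta = v * pi, v * (1.0 - pi)
    return (math.lgamma(n + 1) - math.lgamma(a + 1) - math.lgamma(n - a + 1)
            + math.lgamma(a + alpha) + math.lgamma(n - a + beta)
            - math.lgamma(n + alpha + beta)
            - math.lgamma(alpha) - math.lgamma(beta) + math.lgamma(alpha + beta))


def emission_probability(a, r, ibd_groups, w, p, v=500.0, e0=0.01, e1=0.05):
    """P(a, r | IBD state) as a mixture over allowed haplotype configurations.

    ibd_groups : list of lists of strain indices; strains in one group are IBD
                 and therefore share an allele (hypothetical encoding).
    w          : strain proportions, summing to 1.
    p          : population-level ALT allele frequency at this SNP.
    """
    n = a + r
    n_groups = len(ibd_groups)
    # Total proportion of the sample contributed by each IBD group.
    group_weight = [sum(w[j] for j in group) for group in ibd_groups]

    total = 0.0
    for alleles in product((0, 1), repeat=n_groups):
        q = sum(g * wt for g, wt in zip(alleles, group_weight))
        pi = q * (1.0 - e1) + (1.0 - q) * e0
        n_alt = sum(alleles)  # groups carrying ALT
        prior = p ** n_alt * (1.0 - p) ** (n_groups - n_alt)
        total += prior * math.exp(log_betabin_pmf(a, n, pi, v))
    return total


# Three strains; the first two are IBD at this site (made-up values).
w = [0.5, 0.3, 0.2]
ibd_groups = [[0, 1], [2]]
p = 0.4

prob = emission_probability(a=12, r=8, ibd_groups=ibd_groups, w=w, p=p)
# Summing over every possible ALT count at depth 20 should give 1.
normalisation = sum(emission_probability(a, 20 - a, ibd_groups, w, p)
                    for a in range(21))
```

The normalisation check only passes when the prior over configurations itself sums to one, which makes it a useful sanity test for the prior.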
#### Initiation probabilities, $P(S_1)$
TODO
#### Transition probabilities, $P(S_i|S_{i-1})$
TODO
### Parameters
#### Data
| Parameter | Description |
| --------- | ----------- |
| $L$ | Number of SNPs. |
| $i$ | SNP index, $i \in \Set{1, ..., L}$. |
| $r_{i}$ | Reference (REF) allele count. |
| $a_{i}$ | Alternative (ALT) allele count. |
| $x_{i} := \frac{a_{i}}{r_{i}+a_{i}} $ | Observed within-sample alternative allele frequency (WSAF). |
| $p_{i}$ | Estimated population-level alternative allele frequency. |
| $d_{i,i+1}$ | Physical distance between SNPs, in basepairs. |

#### Model
| Parameter | Description |
| --------- | ----------- |
| $K$ | Number of strains in sample, i.e. complexity of infection (COI). |
| $j$ | Strain index, $j \in \Set{1, ..., K}$. |
| $w_j$ | Abundance of strain $j$ as a fraction; proportion of sample comprised of strain $j$. |
| $h_{ij}$ | Haplotype state with $h_{ij} \in \Set{0, 1}$ for REF and ALT, respectively.|
| $q_i$ | The expected WSAF at SNP $i$, without error adjustment. |
| $\pi_i$ | The error-adjusted expected WSAF at SNP $i$. |
| $S_i$ | Index for the IBD state, $S_i \in \Set{1, ..., B_K}$, where $B_K$ is the $K$-th [Bell number](https://en.wikipedia.org/wiki/Bell_number). |

#### Hyper-parameters
| Parameter | Description | Value | Reference |
| --------- | ----------- | ----- | --------- |
| $\rho$ | Recombination rate. | 13.5 kbp per centiMorgan | [Miles et al. (2016)](https://genome.cshlp.org/content/26/9/1288.full) |
| $e_0$ | REF to ALT read count error rate. | 0.01 | Calibrated from Pf3k |
| $e_1$ | ALT to REF read count error rate. | 0.05 | Calibrated from Pf3k |
| $v$ | Term setting dispersion in WSAF. | 500 | Calibrated from Pf3k |
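
The IBD state in the Model table ranges over $B_K$ values because each state corresponds to one way of partitioning the $K$ strains into IBD groups. As a rough sketch of how quickly this state space grows over the COI range accepted by `--COI`, the generator below enumerates set partitions recursively. The function name is hypothetical and the approach is a textbook recursion, not necessarily how Tapestry builds its states.

```python
def set_partitions(items):
    """Yield every partition of `items` into non-empty groups."""
    if not items:
        yield []
        return
    first, rest = items[0], items[1:]
    for partition in set_partitions(rest):
        # Put `first` into each existing group in turn...
        for k in range(len(partition)):
            yield partition[:k] + [[first] + partition[k]] + partition[k + 1:]
        # ...or give it a new group of its own.
        yield [[first]] + partition


# Number of IBD states for each COI from 1 to 6.
bell = {K: sum(1 for _ in set_partitions(list(range(K)))) for K in range(1, 7)}
print(bell)  # {1: 1, 2: 2, 3: 5, 4: 15, 5: 52, 6: 203}

# The two IBD states for K = 2: both strains IBD, or not.
print(list(set_partitions([0, 1])))  # [[[0, 1]], [[0], [1]]]
```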