Ecosyste.ms: Awesome

An open API service indexing awesome lists of open source software.

Awesome Lists | Featured Topics | Projects

https://github.com/at-cg/PHI

Pangenome graph-based genome inference
https://github.com/at-cg/PHI

Last synced: 30 days ago
JSON representation

Pangenome graph-based genome inference

Awesome Lists containing this project

README

        


PHI Logo

##


PHI (Pangenome-based Haplotype Inference)

## Getting Started

### Prerequisites

Before using PHI, please ensure that **Miniforge** is installed: [Miniforge Installation Guide](https://github.com/conda-forge/miniforge). This package installer is used for installing a few dependencies such as VG and samtools. To run PHI, you also need a Gurobi license. You can get a free academic license [here](https://www.gurobi.com/academia/academic-program-and-licenses/). You should download and save `gurobi.lic` file in your home directory.

## Get PHI

```bash
git clone https://github.com/at-cg/PHI
cd PHI
# Install dependencies (Miniforge is required)
./Installdeps
export PATH="$(pwd)/extra/bin:$PATH"
export LD_LIBRARY_PATH="$(pwd)/extra/lib:$LD_LIBRARY_PATH"
make

# test run
./PHI -t32 -g test/MHC_4.gfa.gz -r test/CHM13_reads.fq.gz -o CHM13.fa

# test run with VCF file as input
./vcf2gfa.py -v test/MHC_4.vcf.gz -r test/MHC-CHM13.0.fa.gz | bgzip > test/MHC_4_vcf.gfa.gz
./PHI -t32 -g test/MHC_4_vcf.gfa.gz -r test/CHM13_reads.fq.gz -o CHM13.fa
```

#### Adding Binary and Library Paths to `.bashrc`
To ensure that the `extra/bin` and `extra/lib` directories are automatically loaded for every terminal session, you can export them to your `~/.bashrc`. This will make sure the required binaries and libraries for `PHI` are available.

```bash
# Add extra/bin and extra/lib to .bashrc
echo 'export PATH="$(pwd)/extra/bin:$PATH"' >> ~/.bashrc
echo 'export LD_LIBRARY_PATH="$(pwd)/extra/lib:$LD_LIBRARY_PATH"' >> ~/.bashrc
source ~/.bashrc
```

## Table of Contents

- [Getting Started](#started)
- [Get PHI](#get_phi)
- [Introduction](#intro)
- [Results](#results)
- [Future work](#future)
- [Publications](#pub)

## Introduction
PHI is a pangenome-based genotyping method. It estimates complete haplotype sequence from low-coverage sequencing data (short-reads or long-reads of a haploid genome). Users should provide a pangenome graph reference in either:
- Graph Format ([GFA v1.1](http://gfa-spec.github.io/GFA-spec/GFA1.html#gfa-11)): A sequence graph-based representation of the pangenome graph. Graph should be acyclic.
- Variant Call Format ([VCF](https://samtools.github.io/hts-specs/VCFv4.2.pdf)): A list of multi-sample, multi-allelic phased variants along with a reference genome.

Output of PHI is the haplotype sequence (FASTA) associated with the optimal inferred path from the graph. It identifies a path in the pangenome graph that maximizes the matches between the path and read k-mers while minimizing recombination events (haplotype switches) along the path. We implemented integer programming to compute an optimal solution. The integer program is solved optimally using the [Gurobi optimizer](https://www.gurobi.com). Details of these formulations are described in our [paper](#publications).

## Results
We benchmarked PHI (v1.0) using short-read datasets sampled from MHC sequences of five haplotypes (APD, DBB, MANN, QBL, and SSTO). This data was generated by [Houwaart et al. (2022)](https://doi.org/10.1111/tan.15020). These datasets were downsampled to various coverages ranging from 0.1x to 10x. We built a pangenome graph using [Minigraph-Cactus](https://github.com/ComparativeGenomicsToolkit/cactus/tree/master), comprising 49 complete [MHC sequences](https://doi.org/10.5281/zenodo.6617246). To assess the accuracy of PHI, we evaluated the edit distance between the inferred haplotype sequences and the MHC sequences from Houwaart et al. that were determined using de novo assembly and curation.


F1-score

> Edit distance between ground-truth haplotype sequences and the sequences estimated by different tools (PHI, VG, and PanGenie). Lower edit distance implies higher accuracy. PHI provides advangate over existing methods on low-coverage inputs.

In PHI, we have implemented two integer programs (referred to as ILP and IQP respectively). They both solve the same problem, but differ in terms of their runtime and memory-usage. IQP is generally faster but it requires more memory. Users can select between the two using command line argument (see `./PHI -h`).


F1-score

> Performance comparison between ILP and IQP.

The scripts to reproduce the results are available [here](data).

## Future Work
- Add support for diploid genome estimation.
- Scale to pangenome graphs having larger number of genomes.

## Publications
- **Ghanshyam Chandra, Md Helal Hossen, Stephan Scholz, Alexander T Dilthey, Daniel Gibney and Chirag Jain**. "[Integer programming framework for pangenome-based genome inference](https://doi.org/10.1101/2024.10.27.620212)". *bioRxiv* 2024.