Ecosyste.ms: Awesome

An open API service indexing awesome lists of open source software.
https://github.com/dlcgold/muPBWT

A PBWT-based light index for UK Biobank scale genotype data.
https://github.com/dlcgold/muPBWT
1000genomes pbwt run-length-encoding ukbiobank
Last synced: 4 months ago
JSON representation
A PBWT-based light index for UK Biobank scale genotype data.
Host: GitHub
URL: https://github.com/dlcgold/muPBWT
Owner: dlcgold
License: gpl-3.0
Created: 2023-01-14T10:03:06.000Z (over 1 year ago)
Default Branch: main
Last Pushed: 2024-03-16T23:14:22.000Z (4 months ago)
Last Synced: 2024-03-17T05:22:51.665Z (4 months ago)
Topics: 1000genomes, pbwt, run-length-encoding, ukbiobank
Language: C++
Homepage:
Size: 30.9 MB
Stars: 6
Watchers: 3
Forks: 0
Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
Lists

awesome-uk-biobank - μ-PBWT - based light index enabling storage and use of UK Biobank scale genotype data on a commodity laptop | (Genetics / Accelerometer)
README

        [![install with bioconda](https://img.shields.io/badge/install%20with-bioconda-brightgreen.svg?style=flat)](http://bioconda.github.io/recipes/mupbwt/README.html)

[![Conda](https://img.shields.io/conda/v/bioconda/mupbwt?color=green)](https://anaconda.org/bioconda/mupbwt)

[![Conda](https://img.shields.io/conda/dn/bioconda/mupbwt?color=green&label=conda%20%7C%20downloads)](https://anaconda.org/bioconda/mupbwt)

[![GitHub stars](https://img.shields.io/github/stars/dlcgold/muPBWT.svg)](https://github.com/dlcgold/muPBWT/stargazers)

# μ-PBWT

A PBWT-based light index  for UK Biobank scale genotype data.

## Conda install

μ-PBWT is available for Gnu/Linux on [conda](https://docs.conda.io/en/latest/) ([bioconda](https://bioconda.github.io/) channel):

```shell

conda install -c bioconda mupbwt

```

## Build from source

Prepare the cmake for building the current project in ‘.’ into the ‘build’ folder

```shell

cmake -S . -B build 

```

Build μ-PBWT:

```shell

cmake --build build

```

## Install from source (optional)

Install μ-PBWT (default in `/usr/local/bin/`, `sudo` required):

```shell

cmake --install build

```

Use `--prefix ` for custom path.

## Usage

File format supported:

- BCF/VCF

- [MaCS](https://github.com/gchen98/macs)

```shell

cd build

```

```shell

Usage: ./mupbwt [options]

Options:

  -i, --input_file 	 vcf/bcf file for panel

  -s, --save 	  path to save index

  -l, --load 	 path to load index

  -o, --output 	 path to query output

  -q, --query 	 path to query file (vcf/bcf)

  -m, --macs	use macs as file format for both input and query file

  -v, --verbose	 extra prints

  -d, --details	 print memory usage details

  -h, --help	 show this help message and exit

```

Build the index:

```shell

./mupbwt -i  -s 

```

Query the index:

```shell

./mupbwt -l  -q  -o  

```

Query without save the index:

```shell

./mupbwt -i  -q  -o 

```

Query and  save the index:

```shell

./mupbwt -i  -s  -q  -o 

```

Using examples in `sample_data`:

```shell

./mupbwt -i sample_data/panel.bcf -s sample_data/index.ser

./mupbwt -l sample_data/index.ser -q sample_data/query.bcf -o sample_data/sample_data_results 

./mupbwt -i sample_data/panel.bcf -q sample_data/query.bcf -o sample_data/sample_data_results

./mupbwt -i sample_data/panel.bcf -s sample_data/index.ser -q sample_data/query.bcf -o sample_data/sample_data_results

```

Load the index and print details to stdout:

```shell

./mupbwt -l  -d

```

An output example is:

```shell

> ./mupbwt -l sample_data/index.ser -d

built/loaded in: 0.015628 s

----

Total haplotypes: 900

Total sites: 499

----

Total runs: 27512

Average runs: 55

----

run: 0.0386925 megabytes

thr: 0.0387306 megabytes

uv: 0.0380135 megabytes

samples: 0.0833178 megabytes

rlpbwt (mapping): 0.201148 megabytes

phi panels: 0.414757 megabytes

phi support: 0.126385 megabytes

phi data structure (panels + support): 0.541142 megabytes

rlpbwt: 0.74229 megabytes

----

estimated dense size: 36.4132 megabytes

----

```

### Input

Only bialleic case is supported. In case of vcf/bcf [bcftools](https://github.com/samtools/bcftools) can be used to filter the input:

```shell

bcftools view -m2 -M2  -v snps  > 

```

### Output

Output file follow the standard proposed in [Durbin's PBWT](https://github.com/richarddurbin/pbwt). 

Each row contain a SMEM:

```

MATCH            

```

For example:

```

MATCH	99	150	414	430	17

```

Row index and query index are incrementally so the name of the sample and the precise haplotype can be calculated using the output of [bcftools](https://github.com/samtools/bcftools). 

The command:

```shell

bcftools query -l  > samples.txt

```

store in `samples.txt` all the samples name, in order. So, for example, row indices 0 and 1 corresponds to the two haplotypes of the first sample, row indices 2 and 3 to the second one etc...

Eventually you can use `script/mem_sample.py`:

```

> python mem_sample.py -h

usage: mem_sample.py [-h] [-i INPUT] [-p PANEL] [-q QUERIES] [-o OUTPUT]

options:

  -h, --help            show this help message and exit

  -i INPUT, --input INPUT

                        SMEM file in Durbin's format

  -p PANEL, --panel PANEL

                        panel as VCF/BCF (optional)

  -q QUERIES, --queries QUERIES

                        queries as VCF/BCF (optional)

  -o OUTPUT, --output OUTPUT

                        output file

```

Esample:

```

python mem_sample.py -i sample_data/sample_data_results -p sample_data/panel.bcf -q sample_data/query.bcf -o sample_data/sample_data_results_new

```

Only one between `PANEL` and `QUERIES` can be specified.

New SMEM file will contain in each row:

```

MATCH            

```

For example, assuming both `PANEL` and `QUERIES`:

```

MATCH	1318026_1	4919834_0	414	430	17

```

## Results

Results on high-coverage whole genome sequencing data from UK Biobank (chromosome 20):

| **Region**              | **#Samples** | **#Sites**   | **Size BCF (GB)** | **μ-PBWT (GB)** | **Construction time (hh:mm)** | **Construction memory peak (GB)** |

|-------------------------|--------------|--------------|-------------------|-----------------|-------------------------------|-----------------------------------|

| chr20:60061-4060065     | 150119       | 865267       | 1.9               | 0.88            | 06:25                         | 2.27                              |

| chr20:4060066-8060066   | 150119       | 880899       | 2                 | 0.85            | 06:28                         | 2.22                              |

| chr20:8060067-12515479  | 150119       | 961591       | 2.1               | 0.77            | 07:04                         | 2.05                              |

| chr20:12515480-16768988 | 150119       | 917468       | 2                 | 0.73            | 06:47                         | 1.97                              |

| chr20:16768989-21050967 | 150119       | 931010       | 2                 | 0.71            | 06:53                         | 1.92                              |

| chr20:21050968-31549151 | 150119       | 1919134      | 4.2               | 1.20            | 13:54                         | 3.06                              |

| chr20:31549152-38282825 | 150119       | 1436549      | 2.8               | 0.99            | 10:25                         | 2.63                              |

| chr20:38282826-43181963 | 150119       | 1056144      | 2.2               | 0.76            | 07:42                         | 2.06                              |

| chr20:43181964-47619489 | 150119       | 955970       | 2                 | 0.79            | 06:56                         | 2.09                              |

| chr20:47619490-51789198 | 150119       | 923178       | 2                 | 0.80            | 06:44                         | 2.12                              |

| chr20:51789199-55789212 | 150119       | 911452       | 2                 | 0.81            | 06:45                         | 2.13                              |

| chr20:55789213-59874964 | 150119       | 925442       | 2                 | 0.84            | 06:49                         | 2.20                              |

| chr20:59874965-64334101 | 150119       | 1096089      | 2.4               | 0.93            | 08:00                         | 2.42                              |

| **Total**               | **150119**   | **13780193** | **29.6**          | **11.08**       | **-**                         | **29.15**                         |

Results on 1000 Genome Project phase 3 data including the average number of runs per site:

| **Chr**   | **#Samples** 
|-----------|----- 
| 1         | 2454 
| 2         | 2454 
| 3         | 2454 
| 4         | 2454 
| 5         | 2454 
| 6         | 2454 
| 7         | 2454 
| 8         | 2454 
| 9         | 2454 
| 10        | 2454 
| 11        | 2454 
| 12        | 2454 
| 13        | 2454 
| 14        | 2454 
| 15        | 2454 
| 16        | 2454 
| 17        | 2454 
| 18        | 2454 
| 19        | 2454 
| 20        | 2454 
| 21        | 2454 
| 22        | 2454 
| **Total** | **2454**

| **#Sites**   | **#Runs/site** | **Size BCF (GB)** | **μ-PBWT (GB)** | **Construction time (hh:mm)** | **Construction memory peak (GB)** | ---------|--------------|----------------|-------------------|-----------------|-------------------------------|-----------------------------------| | 6196151      | 11             | 0.78              | 1.44            | 00:19                         | 4.59                              | | 6786300      | 10             | 0.84              | 1.47            | 00:21                         | 4.76                              | | 5584397      | 10             | 0.71              | 1.20            | 00:18                         | 4.24                              | | 5480936      | 10             | 0.71              | 1.19            | 00:17                         | 4.28                              | | 5037955      | 9              | 0.63              | 1.08            | 00:16                         | 4.22                              | | 4800101      | 10             | 0.64              | 1.06            | 00:15                         | 4.28                              | | 4517734      | 10             | 0.58              | 1.03            | 00:14                         | 4.34                              | | 4417368      | 10             | 0.56              | 0.97            | 00:14                         | 4.30                              | | 3414848      | 11             | 0.43              | 0.81            | 00:11                         | 2.54                              | | 3823786      | 10             | 0.50              | 0.87            | 00:12                         | 2.77                              | | 3877543      | 10             | 0.49              | 0.84            | 00:12                         | 2.71                              | | 3698099      | 10             | 0.47              | 0.82            | 00:12                         | 2.63                              | | 2727881      | 10             | 0.35              | 0.60            | 00:9                          | 2.14                              | | 2539149      | 11             | 0.32              | 0.58            | 00:8                          | 2.18                              | | 2320474      | 12             | 0.29              | 0.57            | 00:7                          | 2.30                              | | 2596072      | 12             | 0.32              | 0.63            | 00:8                          | 2.28                              | | 2227080      | 12             | 0.28              | 0.55            | 00:7                          | 2.32                              | | 2171378      | 11             | 0.28              | 0.51            | 00:7                          | 2.23                              | | 1751878      | 13             | 0.23              | 0.45            | 00:6                          | 1.43                              | | 1739315      | 11             | 0.22              | 0.41            | 00:5                          | 1.30                              | | 1054447      | 14             | 0.14              | 0.30            | 00:3                          | 1.26                              | | 1055454      | 14             | 0.14              | 0.29            | 00:3                          | 1.24                              | | **77818346** | **11**         | **9.91**          | **17.67**       | **-**                         | **64.34**                         |

Note that total building times are not printed due to the fact that all the computations have been done in parallel.

The pipeline for 1000 Genome Project phase 3 data is available at [dlcgold/muPBWT-1KGP-workflow](https://github.com/dlcgold/muPBWT-1KGP-workflow).

## Reference

μ-PBWT results are currently available on [Bioinformatics](https://academic.oup.com/bioinformatics/article/39/9/btad552/7265394).

Bibtex:

```

@article{10.1093/bioinformatics/btad552,

    author = {Cozzi, Davide and Rossi, Massimiliano and Rubinacci, Simone and Gagie, Travis and Köppl, Dominik and Boucher, Christina and Bonizzoni, Paola},

    title = "{μ- PBWT: a lightweight r-indexing of the PBWT for storing and querying UK Biobank data}",

    journal = {Bioinformatics},

    volume = {39},

    number = {9},

    pages = {btad552},

    year = {2023},

    month = {09},

    abstract = "{The Positional Burrows–Wheeler Transform (PBWT) is a data structure that indexes haplotype sequences in a manner that enables finding maximal haplotype matches in h sequences containing w variation sites in O(hw) time. This represents a significant improvement over classical quadratic-time approaches. However, the original PBWT data structure does not allow for queries over Biobank panels that consist of several millions of haplotypes, if an index of the haplotypes must be kept entirely in memory.In this article, we leverage the notion of r-index proposed for the BWT to present a memory-efficient method for constructing and storing the run-length encoded PBWT, and computing set maximal matches (SMEMs) queries in haplotype sequences. We implement our method, which we refer to as μ-PBWT, and evaluate it on datasets of 1000 Genome Project and UK Biobank data. Our experiments demonstrate that the μ-PBWT reduces the memory usage up to a factor of 20\\% compared to the best current PBWT-based indexing. In particular, μ-PBWT produces an index that stores high-coverage whole genome sequencing data of chromosome 20 in about a third of the space of its BCF file. μ-PBWT is an adaptation of techniques for the run-length compressed BWT for the PBWT (RLPBWT) and it is based on keeping in memory only a succinct representation of the RLPBWT that still allows the efficient computation of set maximal matches (SMEMs) over the original panel.Our implementation is open source and available at https://github.com/dlcgold/muPBWT. The binary is available at https://bioconda.github.io/recipes/mupbwt/README.html.}",

    issn = {1367-4811},

    doi = {10.1093/bioinformatics/btad552},

    url = {https://doi.org/10.1093/bioinformatics/btad552},

    eprint = {https://academic.oup.com/bioinformatics/article-pdf/39/9/btad552/51556136/btad552.pdf},

}

```