{"id":13592617,"url":"https://github.com/dlcgold/muPBWT","last_synced_at":"2025-04-08T23:33:36.749Z","repository":{"id":65457504,"uuid":"588871425","full_name":"dlcgold/muPBWT","owner":"dlcgold","description":"A PBWT-based light index for UK Biobank scale genotype data.","archived":false,"fork":false,"pushed_at":"2024-10-29T23:16:41.000Z","size":57378,"stargazers_count":6,"open_issues_count":0,"forks_count":0,"subscribers_count":3,"default_branch":"main","last_synced_at":"2024-10-30T01:50:48.571Z","etag":null,"topics":["1000genomes","pbwt","run-length-encoding","ukbiobank"],"latest_commit_sha":null,"homepage":"","language":"C++","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"gpl-3.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/dlcgold.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null}},"created_at":"2023-01-14T10:03:06.000Z","updated_at":"2023-11-17T21:43:24.000Z","dependencies_parsed_at":"2023-02-13T20:55:17.024Z","dependency_job_id":"c2d5f338-40a4-4b01-8bc8-b6631d534e29","html_url":"https://github.com/dlcgold/muPBWT","commit_stats":null,"previous_names":[],"tags_count":3,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/dlcgold%2FmuPBWT","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/dlcgold%2FmuPBWT/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/dlcgold%2FmuPBWT/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/dlcgold%2FmuPBWT/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/dlcgold","download_url":"https://codeload.github.com/dlcgold/mu
PBWT/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":223346883,"owners_count":17130522,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["1000genomes","pbwt","run-length-encoding","ukbiobank"],"created_at":"2024-08-01T16:01:11.364Z","updated_at":"2024-11-06T13:31:52.981Z","avatar_url":"https://github.com/dlcgold.png","language":"C++","funding_links":[],"categories":["Genetics"],"sub_categories":["Accelerometer"],"readme":"[![install with bioconda](https://img.shields.io/badge/install%20with-bioconda-brightgreen.svg?style=flat)](http://bioconda.github.io/recipes/mupbwt/README.html)\n[![Conda](https://img.shields.io/conda/v/bioconda/mupbwt?color=green)](https://anaconda.org/bioconda/mupbwt)\n[![Conda](https://img.shields.io/conda/dn/bioconda/mupbwt?color=green\u0026label=conda%20%7C%20downloads)](https://anaconda.org/bioconda/mupbwt)\n[![GitHub stars](https://img.shields.io/github/stars/dlcgold/muPBWT.svg)](https://github.com/dlcgold/muPBWT/stargazers)\n# μ-PBWT\nA PBWT-based light index for UK Biobank scale genotype data.\n\n## Conda install\nμ-PBWT is available for GNU/Linux on [conda](https://docs.conda.io/en/latest/) ([bioconda](https://bioconda.github.io/) channel):\n```shell\nconda install -c bioconda mupbwt\n```\n\n## Build from source\nConfigure the CMake build for the project in `.`, writing the build files to the `build` folder:\n```shell\ncmake -S . -B build\n```\nBuild μ-PBWT:\n```shell\ncmake --build build\n```\n## Install from source (optional)\nInstall μ-PBWT (default in `/usr/local/bin/`, `sudo` required):\n```shell\ncmake --install build\n```\nUse `--prefix \u003cpath\u003e` to install to a custom path.\n\n## Usage\nSupported file formats:\n- BCF/VCF\n- [MaCS](https://github.com/gchen98/macs)\n\nWhen built from source, the binary is in the `build` folder:\n```shell\ncd build\n```\n```shell\n\nUsage: ./mupbwt [options]\n\nOptions:\n  -i, --input_file \u003cpath\u003e\t vcf/bcf file for panel\n  -s, --save \u003cpath\u003e\t  path to save index\n  -l, --load \u003cpath\u003e\t path to load index\n  -o, --output \u003cpath\u003e\t path to query output\n  -q, --query \u003cpath\u003e\t path to query file (vcf/bcf)\n  -m, --macs\tuse macs as file format for both input and query file\n  -v, --verbose\t extra prints\n  -d, --details\t print memory usage details\n  -h, --help\t show this help message and exit\n```\n\nBuild the index:\n```shell\n./mupbwt -i \u003cinput file\u003e -s \u003cindex file\u003e\n```\nQuery the index:\n```shell\n./mupbwt -l \u003cindex file\u003e -q \u003cquery file\u003e -o \u003coutput file\u003e\n```\nQuery without saving the index:\n```shell\n./mupbwt -i \u003cinput file\u003e -q \u003cquery file\u003e -o \u003coutput file\u003e\n```\nQuery and save the index:\n```shell\n./mupbwt -i \u003cinput file\u003e -s \u003cindex file\u003e -q \u003cquery file\u003e -o \u003coutput file\u003e\n```\nUsing the examples in `sample_data`:\n```shell\n./mupbwt -i sample_data/panel.bcf -s sample_data/index.ser\n./mupbwt -l sample_data/index.ser -q sample_data/query.bcf -o sample_data/sample_data_results\n./mupbwt -i sample_data/panel.bcf -q sample_data/query.bcf -o sample_data/sample_data_results\n./mupbwt -i sample_data/panel.bcf -s sample_data/index.ser -q sample_data/query.bcf -o sample_data/sample_data_results\n```\n\nLoad the index and print details to stdout:\n```shell\n./mupbwt -l \u003cindex file\u003e -d\n```\nExample output:\n```shell\n\u003e ./mupbwt -l sample_data/index.ser -d\nbuilt/loaded in: 0.015628 s\n\n----\nTotal haplotypes: 900\nTotal sites: 499\n----\nTotal runs: 27512\nAverage runs: 55\n----\nrun: 0.0386925 megabytes\nthr: 0.0387306 megabytes\nuv: 0.0380135 megabytes\nsamples: 0.0833178 megabytes\nrlpbwt (mapping): 0.201148 megabytes\nphi panels: 0.414757 megabytes\nphi support: 0.126385 megabytes\nphi data structure (panels + support): 0.541142 megabytes\nrlpbwt: 0.74229 megabytes\n----\nestimated dense size: 36.4132 megabytes\n----\n```\n\n### Input\nOnly the biallelic case is supported. For VCF/BCF input, [bcftools](https://github.com/samtools/bcftools) can be used to filter the file:\n```shell\nbcftools view -m2 -M2 -v snps \u003cinput vcf/bcf\u003e \u003e \u003cfiltered vcf/bcf file\u003e\n```\n### Output\nThe output file follows the format proposed in [Durbin's PBWT](https://github.com/richarddurbin/pbwt).\nEach row contains one SMEM:\n```\nMATCH   \u003cquery index\u003e   \u003crow index\u003e \u003cstarting column\u003e    \u003cending column\u003e \u003cSMEM length\u003e\n```\nFor example:\n```\nMATCH\t99\t150\t414\t430\t17\n```\nRow and query indices are assigned incrementally, so the sample name and the specific haplotype can be derived from the output of [bcftools](https://github.com/samtools/bcftools).\n\nThe command:\n```shell\nbcftools query -l \u003cinput vcf/bcf\u003e \u003e samples.txt\n```\nstores all the sample names in `samples.txt`, in order. So, for example, row indices 0 and 1 correspond to the two haplotypes of the first sample, row indices 2 and 3 to those of the second sample, and so on.\n\nOptionally, you can use `script/mem_sample.py` to perform this mapping:\n```\n\u003e python mem_sample.py -h\nusage: mem_sample.py [-h] [-i INPUT] [-p PANEL] [-q QUERIES] [-o OUTPUT]\n\noptions:\n  -h, --help            show this help message and exit\n  -i INPUT, --input INPUT\n                        SMEM file in Durbin's format\n  -p PANEL, --panel PANEL\n                        panel as VCF/BCF (optional)\n  -q QUERIES, --queries QUERIES\n                        queries as VCF/BCF (optional)\n  -o OUTPUT, --output OUTPUT\n                        output file\n```\nExample:\n```\npython mem_sample.py -i sample_data/sample_data_results -p sample_data/panel.bcf -q sample_data/query.bcf -o sample_data/sample_data_results_new\n```\nIt is also possible to specify only one of `PANEL` and `QUERIES`.\nEach row of the new SMEM file will contain:\n```\nMATCH   \u003cquerySample_haplotype\u003e   \u003cpanelSample_haplotype\u003e \u003cstarting column\u003e    \u003cending column\u003e \u003cSMEM length\u003e\n```\nFor example, when both `PANEL` and `QUERIES` are given:\n```\nMATCH\t1318026_1\t4919834_0\t414\t430\t17\n```\n## Results\nResults on high-coverage whole genome sequencing data from UK Biobank (chromosome 20):\n\n| **Region**              | **#Samples** | **#Sites**   | **Size BCF (GB)** | **μ-PBWT (GB)** | **Construction time (hh:mm)** | **Construction memory peak (GB)** |\n|-------------------------|--------------|--------------|-------------------|-----------------|-------------------------------|-----------------------------------|\n| chr20:60061-4060065     | 150119       | 865267       | 1.9               | 0.88            | 06:25                         | 2.27                              |\n| chr20:4060066-8060066   | 150119       | 880899       | 2                 | 0.85            | 06:28                         | 2.22                              |\n| 
chr20:8060067-12515479  | 150119       | 961591       | 2.1               | 0.77            | 07:04                         | 2.05                              |\n| chr20:12515480-16768988 | 150119       | 917468       | 2                 | 0.73            | 06:47                         | 1.97                              |\n| chr20:16768989-21050967 | 150119       | 931010       | 2                 | 0.71            | 06:53                         | 1.92                              |\n| chr20:21050968-31549151 | 150119       | 1919134      | 4.2               | 1.20            | 13:54                         | 3.06                              |\n| chr20:31549152-38282825 | 150119       | 1436549      | 2.8               | 0.99            | 10:25                         | 2.63                              |\n| chr20:38282826-43181963 | 150119       | 1056144      | 2.2               | 0.76            | 07:42                         | 2.06                              |\n| chr20:43181964-47619489 | 150119       | 955970       | 2                 | 0.79            | 06:56                         | 2.09                              |\n| chr20:47619490-51789198 | 150119       | 923178       | 2                 | 0.80            | 06:44                         | 2.12                              |\n| chr20:51789199-55789212 | 150119       | 911452       | 2                 | 0.81            | 06:45                         | 2.13                              |\n| chr20:55789213-59874964 | 150119       | 925442       | 2                 | 0.84            | 06:49                         | 2.20                              |\n| chr20:59874965-64334101 | 150119       | 1096089      | 2.4               | 0.93            | 08:00                         | 2.42                              |\n| **Total**               | **150119**   | **13780193** | **29.6**          | **11.08**       | **-**                         | **29.15**                         |\n\nResults on 1000 
Genomes Project phase 3 data, including the average number of runs per site:\n\n| **Chr**   | **#Samples** | **#Sites**   | **#Runs/site** | **Size BCF (GB)** | **μ-PBWT (GB)** | **Construction time (hh:mm)** | **Construction memory peak (GB)** |\n|-----------|--------------|--------------|----------------|-------------------|-----------------|-------------------------------|-----------------------------------|\n| 1         | 2454         | 6196151      | 11             | 0.78              | 1.44            | 00:19                         | 4.59                              |\n| 2         | 2454         | 6786300      | 10             | 0.84              | 1.47            | 00:21                         | 4.76                              |\n| 3         | 2454         | 5584397      | 10             | 0.71              | 1.20            | 00:18                         | 4.24                              |\n| 4         | 2454         | 5480936      | 10             | 0.71              | 1.19            | 00:17                         | 4.28                              |\n| 5         | 2454         | 5037955      | 9              | 0.63              | 1.08            | 00:16                         | 4.22                              |\n| 6         | 2454         | 4800101      | 10             | 0.64              | 1.06            | 00:15                         | 4.28                              |\n| 7         | 2454         | 4517734      | 10             | 0.58              | 1.03            | 00:14                         | 4.34                              |\n| 8         | 2454         | 4417368      | 10             | 0.56              | 0.97            | 00:14                         | 4.30                              |\n| 9         | 2454         | 3414848      | 11             | 0.43              | 0.81            | 00:11                         | 2.54                              |\n| 10        | 2454         | 3823786      | 10             | 0.50              | 0.87            | 00:12                         | 2.77                              |\n| 11        | 2454         | 3877543      | 10             | 0.49              | 0.84            | 00:12                         | 2.71                              |\n| 12        | 2454         | 3698099      | 10             | 0.47              | 0.82            | 00:12                         | 2.63                              |\n| 13        | 2454         | 2727881      | 10             | 0.35              | 0.60            | 00:09                         | 2.14                              |\n| 14        | 2454         | 2539149      | 11             | 0.32              | 0.58            | 00:08                         | 2.18                              |\n| 15        | 2454         | 2320474      | 12             | 0.29              | 0.57            | 00:07                         | 2.30                              |\n| 16        | 2454         | 2596072      | 12             | 0.32              | 0.63            | 00:08                         | 2.28                              |\n| 17        | 2454         | 2227080      | 12             | 0.28              | 0.55            | 00:07                         | 2.32                              |\n| 18        | 2454         | 2171378      | 11             | 0.28              | 0.51            | 00:07                         | 2.23                              |\n| 19        | 2454         | 1751878      | 13             | 0.23              | 0.45            | 00:06                         | 1.43                              |\n| 20        | 2454         | 1739315      | 11             | 0.22              | 0.41            | 00:05                         | 1.30                              |\n| 21        | 2454         | 1054447      | 14             | 0.14              | 0.30            | 00:03                         | 1.26                              |\n| 22        | 2454         | 1055454      | 14             | 0.14              | 0.29            | 00:03                         | 1.24                              |\n| **Total** | **2454**     | **77818346** | **11**         | **9.91**          | **17.67**       | **-**                         | **64.34**                         |\n\nTotal construction times are not reported because all computations were run in parallel.\n\nThe pipeline for 1000 Genomes Project phase 3 data is available at [dlcgold/muPBWT-1KGP-workflow](https://github.com/dlcgold/muPBWT-1KGP-workflow).\n\n## Reference\nμ-PBWT results are published in [Bioinformatics](https://academic.oup.com/bioinformatics/article/39/9/btad552/7265394).\n\nBibTeX:\n```\n@article{10.1093/bioinformatics/btad552,\n    author = {Cozzi, Davide and Rossi, Massimiliano and Rubinacci, Simone and Gagie, Travis and Köppl, Dominik and Boucher, Christina and Bonizzoni, Paola},\n    title = \"{μ- PBWT: a lightweight r-indexing of the PBWT for storing and querying UK Biobank data}\",\n    journal = {Bioinformatics},\n    volume = {39},\n    number = {9},\n    pages = {btad552},\n    year = {2023},\n    month = {09},\n    abstract = \"{The Positional Burrows–Wheeler Transform (PBWT) is a data structure that indexes haplotype sequences in a manner that enables finding maximal haplotype matches in h sequences containing w variation sites in O(hw) time. This represents a significant improvement over classical quadratic-time approaches. However, the original PBWT data structure does not allow for queries over Biobank panels that consist of several millions of haplotypes, if an index of the haplotypes must be kept entirely in memory. In this article, we leverage the notion of r-index proposed for the BWT to present a memory-efficient method for constructing and storing the run-length encoded PBWT, and computing set maximal matches (SMEMs) queries in haplotype sequences. 
We implement our method, which we refer to as μ-PBWT, and evaluate it on datasets of 1000 Genome Project and UK Biobank data. Our experiments demonstrate that the μ-PBWT reduces the memory usage up to a factor of 20\\\\% compared to the best current PBWT-based indexing. In particular, μ-PBWT produces an index that stores high-coverage whole genome sequencing data of chromosome 20 in about a third of the space of its BCF file. μ-PBWT is an adaptation of techniques for the run-length compressed BWT for the PBWT (RLPBWT) and it is based on keeping in memory only a succinct representation of the RLPBWT that still allows the efficient computation of set maximal matches (SMEMs) over the original panel.Our implementation is open source and available at https://github.com/dlcgold/muPBWT. The binary is available at https://bioconda.github.io/recipes/mupbwt/README.html.}\",\n    issn = {1367-4811},\n    doi = {10.1093/bioinformatics/btad552},\n    url = {https://doi.org/10.1093/bioinformatics/btad552},\n    eprint = {https://academic.oup.com/bioinformatics/article-pdf/39/9/btad552/51556136/btad552.pdf},\n}\n```\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fdlcgold%2FmuPBWT","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fdlcgold%2FmuPBWT","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fdlcgold%2FmuPBWT/lists"}