{"id":13752573,"url":"https://github.com/lh3/bgt","last_synced_at":"2025-05-07T08:12:18.589Z","repository":{"id":31372800,"uuid":"34935806","full_name":"lh3/bgt","owner":"lh3","description":"Flexible genotype query among 30,000+ samples whole-genome","archived":false,"fork":false,"pushed_at":"2019-09-04T19:43:27.000Z","size":310,"stargazers_count":96,"open_issues_count":9,"forks_count":10,"subscribers_count":15,"default_branch":"master","last_synced_at":"2025-05-07T08:12:13.385Z","etag":null,"topics":["bioinformatics","genomics"],"latest_commit_sha":null,"homepage":"","language":"C","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":"withoutboats/notty","license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/lh3.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE.txt","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2015-05-02T04:38:40.000Z","updated_at":"2024-05-28T01:08:24.000Z","dependencies_parsed_at":"2022-08-09T20:30:18.934Z","dependency_job_id":null,"html_url":"https://github.com/lh3/bgt","commit_stats":null,"previous_names":[],"tags_count":1,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/lh3%2Fbgt","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/lh3%2Fbgt/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/lh3%2Fbgt/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/lh3%2Fbgt/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/lh3","download_url":"https://codeload.github.com/lh3/bgt/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":252839295,"owners_count":21812090,"icon_url":"https://g
ithub.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["bioinformatics","genomics"],"created_at":"2024-08-03T09:01:07.699Z","updated_at":"2025-05-07T08:12:18.570Z","avatar_url":"https://github.com/lh3.png","language":"C","funding_links":[],"categories":["Ranked by starred repositories"],"sub_categories":[],"readme":"## \u003ca name=\"started\"\u003e\u003c/a\u003eGetting Started\n\n#### Connect to a public BGT server\n```sh\ncurl -s 'http://bgtdemo.herokuapp.com/'\ncurl -s 'http://bgtdemo.herokuapp.com/?a=(impact==\"HIGH\")\u0026s=(population==\"FIN\")\u0026f=(AC\u003e0)'\ncurl -s 'http://bgtdemo.herokuapp.com/?t=CHROM,POS,END,REF,ALT,AC/AN\u0026f=(AC\u003e1)\u0026r=20'\n```\nFor the last query, the last line is \"*\", indicating the result is incomplete.\nNote that this web app is using Heroku's free tier. It is restricted to one CPU\nonly and put to sleep when the app is idle. There is an overhead of wakeup.\nHeroku also forces free apps to sleep for \"6 hours in a 24 hour period\". 
I\ndon't know how exactly this works.\n\n#### Run BGT locally\n```sh\n# Installation\ngit clone https://github.com/lh3/bgt.git\ncd bgt; make\n# Download demo BCF (1st 1Mbp of chr11 from 1000g), and convert to BGT\nwget -O- http://bit.ly/BGTdemo | tar xf -\n./bgt import 1kg11-1M.bgt 1kg11-1M.raw.bcf\ngzip -dc 1kg11-1M.raw.samples.gz \u003e 1kg11-1M.bgt.spl  # sample meta data\n# Get all sample genotypes\n./bgt view -C 1kg11-1M.bgt | less -S\n# Get genotypes of HG00171 and HG00173 in region 11:100,000-200,000\n./bgt view -s,HG00171,HG00173 -f'AC\u003e0' -r 11:100000-200000 1kg11-1M.bgt\n# Get alleles high-frequency in CEU but absent from YRI\n./bgt view -s'population==\"CEU\"' -s'population==\"YRI\"' -f'AC1/AN1\u003e=0.1\u0026\u0026AC2==0' -G 1kg11-1M.bgt\n# Select high-impact sites (var annotation provided with -d)\n./bgt view -d anno11-1M.fmf.gz -a'impact==\"HIGH\"' -CG 1kg11-1M.bgt\n```\n\n#### Set up your web server\n```sh\n# Compile the server; Go compiler required\nmake bgt-server\nGOMAXPROCS=4 ./bgt-server -d anno11-1M.fmf.gz 1kg11-1M.bgt 2\u003e server.log \u0026\ncurl -s '0.0.0.0:8000' | less -S  # help\ncurl -s '0.0.0.0:8000/?a=(impact==\"HIGH\")\u0026s=(population==\"FIN\")\u0026f=(AC\u003e0)'\n```\n\n## Table of Contents\n\n- [Getting Started](#started)\n- [Users' Guide](#guide)\n  - [Data model overview](#model)\n  - [Import](#import)\n    - [Import genotypes](#igenotype)\n    - [Import sample phenotypes](#iphenotype)\n    - [Import site annotations](#isite)\n  - [Query](#query)\n    - [Genotype-independent site selection](#givs)\n    - [Genotype-independent sample selection](#giss)\n    - [Genotype-dependent site selection](#gdvs)\n    - [Tabular output](#tabout)\n    - [Miscellaneous output](#miscout)\n  - [BGT server](#server)\n    - [Privacy](#privacy)\n- [Further Notes](#notes)\n  - [Other genotype formats](#others)\n  - [Performance evaluation](#perf)\n\n## \u003ca name=\"guide\"\u003e\u003c/a\u003eUsers' Guide\n\nBGT is a compact file format for 
efficiently storing and querying whole-genome\ngenotypes of tens to hundreds of thousands of samples. It can be considered as\nan alternative to genotype-only BCFv2. BGT is more compact in size, more\nefficient to process, and more flexible on query.\n\nBGT comes with a command line tool and a web application which largely mirrors\nthe command line uses. The tool supports expressive and powerful query syntax.\nThe \"Getting Started\" section shows a few examples.\n\n### \u003ca name=\"model\"\u003e\u003c/a\u003e1. Data model overview\n\nBGT models a genotype data set as a matrix of genotypes with rows indexed by\nsite and columns by sample. Each BGT database keeps a genotype matrix and a\nsample annotation file. Site annotations are kept in a separate file which is\nintended to be shared across multiple BGT databases. This model is different\nfrom VCF in that VCF 1) keeps sample information in the header and 2) stores\nsite annotations in INFO along with genotypes which are not meant to be shared\nacross VCFs.\n\n### \u003ca name=\"import\"\u003e\u003c/a\u003e2. Import\n\nA BGT database always has a genotype matrix and sample names, which are\nacquired from VCF/BCF. Site annotations and sample phenotypes are optional but \nare recommended. Flexible meta data query is a distinguishing feature of BGT.\n\n#### \u003ca name=\"igenotype\"\u003e\u003c/a\u003e2.1 Import genotypes\n\n```sh\n# Import BCFv2\nbgt import prefix.bgt in.bcf\n# Import VCF with \"##contig\" header lines\nbgt import -S prefix.bgt in.vcf.gz\n# Import VCF without \"##contig\" header lines\nbgt import -St ref.fa.fai prefix.bgt in.vcf.gz\n```\nDuring import, BGT separates multiple alleles on one VCF line. It discards all\nINFO fields and FORMAT fields except GT. 
See section 2.3 for how to use\nvariant annotations with BGT.\n\n#### \u003ca name=\"iphenotype\"\u003e\u003c/a\u003e2.2 Import sample phenotypes\n\nAfter importing VCF/BCF, BGT generates a `prefix.bgt.spl` text file, which for\nnow only has one column of sample names. You can add phenotype data to this file\nin a format like (fields TAB-delimited):\n```\nsample1   gender:Z:M    height:f:1.73     region:Z:WestEurasia     foo:i:10\nsample2   gender:Z:F    height:f:1.64     region:Z:WestEurasia     bar:i:20\n```\nwhere each meta annotation takes a format `key:type:value` with `type` being\n`Z` for a string, `f` for a real number and `i` for an integer. We call this\nformat Flat Metadata Format or FMF in brief. You can get samples matching\ncertain conditions with:\n```sh\nbgt fmf prefix.bgt.spl 'height\u003e1.7\u0026\u0026region==\"WestEurasia\"'\nbgt fmf prefix.bgt.spl 'mass/height**2\u003e25\u0026\u0026region==\"WestEurasia\"'\n```\nYou can use most common arithmetic and logical operators in the condition.\n\n#### \u003ca name=\"isite\"\u003e\u003c/a\u003e2.3 Import site annotations\n\nSite annotations are also kept in a FMF file like:\n```\n11:209621:1:T  effect:Z:missense_variant   gene:Z:RIC8A  CCDS:Z:CCDS7690.1  CDSpos:i:347\n11:209706:1:T  effect:Z:synonymous_variant gene:Z:RIC8A  CCDS:Z:CCDS7690.1  CDSpos:i:432\n```\nWe provide a script `misc/vep2fmf.pl` to convert the VEP output with the\n`--pick` option to FMF.\n\nNote that due to an implementation limitation, we recommend using a subset of\n\"important\" variants with BGT, for example:\n```sh\ngzip -dc vep-all.fmf.gz | grep -v \"effect:Z:MODIFIER\" | gzip \u003e vep-important.fmf.gz\n```\nUsing the full set of variants is fine, but is much slower with the current\nimplementation.\n\n### \u003ca name=\"query\"\u003e\u003c/a\u003e3. Query\n\nA BGT query is composed of output and conditions. The output is VCF by default\nor can be a TAB-delimited table if requested. 
Conditions include\ngenotype-independent site selection with option `-r` and `-a` (e.g. variants in\na region), genotype-independent sample selection with option `-s` (e.g. a list\nof samples), and genotype-dependent site selection with option `-f` (e.g.\nallele frequency among selected samples above a threshold). BGT has limited\nsupport of genotype-dependent sample selection (e.g. samples having an allele).\n\nBGT has an important concept \"sample group\". On the command line, each option\n`-s` creates a sample group. The #-th option `-s` populates a pair of `AC#` and\n`AN#` aggregate variables. These variables can be used in output or\ngenotype-dependent site selection.\n\n#### \u003ca name=\"givs\"\u003e\u003c/a\u003e3.1 Genotype-independent site selection\n\n```sh\n# Select by a region\nbgt view -r 11:100,000-200,000 1kg11-1M.bgt \u003e out.vcf\n# Select by regions in a BED (BGT will read through the entire BGT)\nbgt view -B regions.bed 1kg11-1M.bgt \u003e out.vcf\n# Select a list of alleles (if on same chr, use random access)\nbgt view -a,11:151344:1:G,11:110992:AACTT:A,11:160513::G 1kg11-1M.bgt\n# Select by annotations (-d specifies the site annotation database)\nbgt view -d anno11-1M.fmf.gz -a'impact==\"HIGH\"' -CG 1kg11-1M.bgt\n```\nIt should be noted that in the last command line, BGT will read through the\nentire annotation file to find the list of matching alleles. 
It may take\nseveral minutes if the site annotation file contains 100 million lines.\nThat is why we recommend using a subset of important alleles (section 2.3).\n\n#### \u003ca name=\"giss\"\u003e\u003c/a\u003e3.2 Genotype-independent sample selection\n\n```sh\n# Select a list of samples\nbgt view -s,HG00171,HG00173 1kg11-1M.bgt\n# Select by phenotypes (see also section 2.2)\nbgt view -s'population==\"CEU\"' 1kg11-1M.bgt\n# Create sample groups (there will be AC1/AN1 and AC2/AN2 in VCF INFO)\nbgt view -s'population==\"CEU\"' -s'population==\"YRI\"' -G 1kg11-1M.bgt\n```\n\n#### \u003ca name=\"gdvs\"\u003e\u003c/a\u003e3.3 Genotype-dependent site selection\n\n```sh\n# Select by allele frequency\nbgt view -f'AN\u003e0\u0026\u0026AC/AN\u003e.05' 1kg11-1M.bgt\n# Select by group frequency\nbgt view -s'population==\"CEU\"' -s'population==\"YRI\"' -f'AC1\u003e10\u0026\u0026AC2==0' -G 1kg11-1M.bgt\n```\nOf course, we can mix all the three types of conditions in one command line:\n```sh\nbgt view -G -s'population==\"CEU\"' -s'population==\"YRI\"' -f'AC1/AN1\u003e.1\u0026\u0026AC2==0' \\\n         -r 11:100,000-500,000 -d anno11-1M.fmf.gz -a'CDSpos\u003e0' 1kg11-1M.bgt\n```\n\n#### \u003ca name=\"tabout\"\u003e\u003c/a\u003e3.4 Tabular output\n\n```sh\n# Output position, sequence and allele counts\nbgt view -t CHROM,POS,REF,ALT,AC1,AC2 -s'population==\"CEU\"' -s'population==\"YRI\"' 1kg11-1M.bgt\n```\n\n#### \u003ca name=\"miscout\"\u003e\u003c/a\u003e3.5 Miscellaneous output\n\n```sh\n# Get samples having a set of alleles (option -S)\nbgt view -S -a,11:151344:1:G,11:110992:AACTT:A,11:160513::G -s'population==\"CEU\"' 1kg11-1M.bgt\n# Count haplotypes\nbgt view -Hd anno11-1M.fmf.gz -a'gene==\"SIRT3\"' -f 'AC/AN\u003e.01' 1kg11-1M.bgt\n# Count haplotypes in multiple populations\nbgt view -Hd anno11-1M.fmf.gz -a'gene==\"SIRT3\"' -f 'AC/AN\u003e.01' \\\n         -s'region==\"Africa\"' -s'region==\"EastAsia\"' 1kg11-1M.bgt\n```\n\n### \u003ca 
name=\"server\"\u003e\u003c/a\u003e4. BGT server\n\nIn addition to a command line tool, we also provide a prototype web application\nfor genotype query. The query syntax is similar to `bgt view` as is shown in\n\"Getting Started\", but with some notable differences:\n\n1. The server uses `.and.` for the logical AND operator `\u0026\u0026` (as `\u0026` is a special character to HTML).\n2. The server can't load a list of samples from a local file (for security).\n3. The server doesn't support BCF output for now (can be implemented on request).\n4. The server doesn't output genotypes by default (option `g` required for server).\n5. The server loads site annotations into RAM (for real-time response but requiring more memory).\n6. By default (tunable), the server processes up to 10 million genotypes and then truncates the result.\n7. The server may forbid the output of genotypes of some samples (see below).\n\n#### \u003ca name=\"privacy\"\u003e\u003c/a\u003e4.1 Privacy\n\nThe BGT server implements a simple mechanism to keep the privacy of samples or\na subset of samples. It is controlled by a single parameter: minimal sample\ngroup size or MGS.  The server refuses to create a sample group if the size of\nthe group is smaller than the MGS of one of the samples in the group. In\nparticular, if MGS is above one, the server doesn't report sample name or\nsample genotypes.  Each sample may have a different MGS as is marked by the\n`_mgs` integer tag in `prefix.bgt.spl`. For samples without this tag, a\ndefault MGS is applied.\n\n## \u003ca name=\"notes\"\u003e\u003c/a\u003eFurther Notes\n\n### \u003ca name=\"others\"\u003e\u003c/a\u003eOther genotype formats\n\n* BGT vs [PBWT][pbwt]. BGT uses the same data structure as PBWT and is inspired\n  by PBWT. PBWT supports advanced query such as haplotype matching, phasing\n  and imputation, while BGT puts more emphasis on fast random access and data\n  retrieval.\n\n* BGT vs [BCF2][vcf]. BCF is more versatile. 
It is able to keep per-genotype\n  meta information (e.g. per-genotype read depth and genotype likelihood). BGT\n  is generally more efficient and several times smaller. It scales better to many\n  samples. BGT also supports more flexible queries, although technically,\n  nothing prevents us from implementing similar functionalities on top of BCF.\n\n* BGT vs [GQT][gqt]. GQT should be much faster on traversing sites across whole\n  chromosomes without considering LD. It is however inefficient to retrieve\n  data in small regions or to get haplotype information due to its design.\n  For this reason, GQT is regarded as a complement to BCF or BGT, not a\n  replacement. On file size, GQT is usually larger than genotype-only BCF and\n  is thus larger than BGT.\n\n### \u003ca name=\"perf\"\u003e\u003c/a\u003ePerformance evaluation\n\nThe test is run on the first release of [Haplotype Reference Consortium][hrc]\n(HRC) data. There are ~39 million phased SNPs in 32,488 samples. We have\ngenerated the BGT for the entire dataset, but we are only running tools in\nregion chr11:10,000,001-20,000,000. The following table shows the time and\ncommand line. Note that the table omits option `-r 11:10,000,001-20,000,000`\nwhich has been applied to all command lines below.\n\n|Time   |Command line|\n|------:|:------------|\n|11s    |bgt view -G HRC-r1.bgt|\n|13s    |bcftools view -Gu HRC-r1.bcf|\n|30s    |bgt view -GC HRC-r1.bgt|\n|4s     |bgt view -GC -s'source==\"1000G\"'|\n|19s    |bcftools view -Gu -S 1000G.txt HRC-r1.bcf|\n|8s     |bgt view -G -s 'source==\"UK10K\"' -s 'source==\"1000G\"\u0026\u0026population!=\"GBK\"'|\n\nOn file sizes, the BGT database for HRC-r1 is 7.4GB (1GB=1024\\*1024\\*1024 bytes). In comparison,\nBCFv2 for the same data takes 65GB, GQT 93GB and PBWT 4.4GB. BGT and PBWT,\nwhich are based on the same data structure, are much more compact. 
BGT is\nlarger than PBWT primarily because BGT keeps an extra bit per haplotype to\ndistinguish reference and multi allele, and stores markers to enable fast\nrandom access.\n\n[hrc]: http://www.haplotype-reference-consortium.org\n[gqt]: https://github.com/ryanlayer/gqt\n[pbwt]: https://github.com/richarddurbin/pbwt\n[vcf]: https://samtools.github.io/hts-specs/\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Flh3%2Fbgt","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Flh3%2Fbgt","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Flh3%2Fbgt/lists"}