{"id":19459526,"url":"https://github.com/aprilweilab/picovcf","last_synced_at":"2025-04-25T07:31:57.122Z","repository":{"id":221023356,"uuid":"753177281","full_name":"aprilweilab/picovcf","owner":"aprilweilab","description":"Single-header C++ library for fast/low-memory VCF (Variant Call Format) parsing.","archived":false,"fork":false,"pushed_at":"2024-04-08T13:27:57.000Z","size":155,"stargazers_count":6,"open_issues_count":0,"forks_count":0,"subscribers_count":1,"default_branch":"main","last_synced_at":"2024-04-08T14:34:18.494Z","etag":null,"topics":["c-plus-plus","comp-bio","header-only","header-only-library","variant-calling","vcf"],"latest_commit_sha":null,"homepage":"https://picovcf.readthedocs.io/","language":"C++","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/aprilweilab.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null}},"created_at":"2024-02-05T16:09:17.000Z","updated_at":"2024-04-08T14:34:20.396Z","dependencies_parsed_at":"2024-02-09T14:30:01.919Z","dependency_job_id":"d95073b0-2fb9-4936-abda-16099e7103e8","html_url":"https://github.com/aprilweilab/picovcf","commit_stats":null,"previous_names":["aprilweilab/picovcf"],"tags_count":1,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/aprilweilab%2Fpicovcf","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/aprilweilab%2Fpicovcf/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/aprilweilab%2Fpicovcf/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/aprilweilab%2Fpicovcf/
manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/aprilweilab","download_url":"https://codeload.github.com/aprilweilab/picovcf/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":223988521,"owners_count":17236921,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["c-plus-plus","comp-bio","header-only","header-only-library","variant-calling","vcf"],"created_at":"2024-11-10T17:32:56.754Z","updated_at":"2025-04-25T07:31:57.115Z","avatar_url":"https://github.com/aprilweilab.png","language":"C++","funding_links":[],"categories":[],"sub_categories":[],"readme":"![](https://github.com/aprilweilab/picovcf/actions/workflows/cmake-multi-platform.yml/badge.svg)\n![](https://readthedocs.org/projects/picovcf/badge/?version=latest)\n\n# picovcf\n\nSingle-header C++ library for fast/low-memory VCF (Variant Call Format) parsing. Gzipped VCF (.vcf.gz) is optionally supported.\n\nThere are a lot of great tools for processing VCF files out there, but not many C++ libraries that are small (only parsing, no extra functionality) and easy to use. 
`picovcf` attempts to fill this niche by providing a header-only library using modern C++ (C++11) that allows clients to be selective about which parts of the VCF file get parsed.\n\nFeatures:\n* Fast and easy to use VCF(.GZ) parsing.\n* Convert VCF(.GZ) to Indexable Genotype Data (IGD) format, which is a very simple format that is **more than 3x smaller than VCF.GZ at Biobank scale** and **more than 15x faster to read**\n* Fast and easy to use IGD parsing.\n\nMore details can be found in the supplement of our [preprint \"Genotype Representation Graph\" paper](https://www.biorxiv.org/content/10.1101/2024.04.23.590800v1).\n\nSee also [pyigd](https://github.com/aprilweilab/pyigd/) if you want Python access to IGD files.\n\n## Using the library\n\nEither copy the latest header file (`picovcf.hpp`) into your project directly, or make use of something like git submodules to include https://github.com/aprilweilab/picovcf.\n\nSee the [vcfpp.cpp](https://github.com/aprilweilab/picovcf/blob/main/examples/vcfpp.cpp) for an example of how to use the APIs. Read [the docs](https://picovcf.readthedocs.io/en/latest/) for an overview of the API.\n\nWhen building code that uses `picovcf.hpp`, define `VCF_GZ_SUPPORT=1` (`-DVCF_GZ_SUPPORT=1` on most compiler command lines) to enable zlib support for compressed VCF files.\n\n## Build and run the tests/tools\n\npicovcf does not need to be built to be used, since it is a single header that gets built as part of your project. However, if you want to build the tests and tools:\n\n```\ncd picovcf\nmkdir build \u0026\u0026 cd build\ncmake .. -DENABLE_VCF_GZ=ON\nmake\n```\n\n**NOTE**: `-DENABLE_VCF_GZ=ON` is optional, and links against `libz` in case you want to support `.vcf.gz` (compressed) files in the tools.\n\nTo convert from a `.vcf` or `.vcf.gz` file to `.igd`, run:\n```\n./igdtools \u003cvcf filename\u003e -o \u003coutput IGD filename\u003e\n```\n\nRun `./igdtools --help` to see the full list of options. 
Here are some common tasks you might want to perform, besides VCF conversion:\n* Pipe allele frequencies to a file: `./igdtools \u003cinput IGD\u003e -a \u003e allele.freq.tsv`\n* View variant/sample statistics and header info: `./igdtools \u003cinput IGD\u003e --stats --info`\n* Restrict to variants in a base-pair range (e.g., 10000 to 20000): `--range 10000-20000`\n* Restrict to variants with frequencies \u003e=0.01: `--frange 0.01-1.0`\n* Copy from one IGD to another: `./igdtools \u003cinput IGD\u003e -o \u003coutput IGD\u003e`\n  * Only include variants in a certain base-pair range and frequency range: `./igdtools \u003cinput IGD\u003e -o \u003coutput IGD\u003e --range 100000-500000 --frange 0.01-0.5`\n\nFinally, to run the unit tests:\n```\nEXAMPLE_VCFS=../test/example_vcfs/ ./picovcf_test\n```\n\nThere is a Dockerfile that encodes all the build steps and dependencies, including the documentation build.\n\n## Build the documentation\n\nRequires Python packages `sphinx`, `sphinx-rtd-theme`, `breathe`. Requires Doxygen.\n\nFrom the same `build/` directory as above:\n```\nDOC_BUILD_DIR=$PWD sphinx-build -c ../doc/ -b html -Dbreathe_projects.picovcf=$PWD/doc/xml ../doc/ $PWD/doc/sphinx/\n```\n\n## Indexable Genotype Data (IGD)\n\n`picovcf` also defines an extremely simple binary file format that can be used for fast access to genotype data. Most other genotype data formats are not indexable directly: that is, you cannot jump directly to the one-millionth variant without first scanning all of the (almost one million) variants before it. IGD has the following properties:\n* Indexable. You can use math to figure out where the `i`th variant will be in the file.\n* Uncompressed. No need to link in compression libraries.\n* Simple format: all variants are expanded into binary variants. So if a variant has `N` alternate alleles, then IGD will store that as `N` rows containing `0` (reference allele) or `1` (alternate allele). 
Each of these binary variants is stored as either a bitvector (non-sparse) or a list of sample indexes (sparse). A flag in the index indicates which way each variant is stored.\n* Very small. Oftentimes smaller than compressed formats like `.vcf.gz` or `.bgen`. The more low-frequency mutations (such as for really large sample sizes) the smaller the file, assuming you are using the default implementation of dynamically choosing between sparse/non-sparse representation.\n\nFor example, the following are from chromosome 22 of a real dataset:\n* `.vcf`: 11GB\n* `.vcf.gz`: 203MB\n* `.bgen`: 256MB\n* `.igd`: 183MB\n\nConverting the `.vcf.gz` to `.bgen` (via qctool) took 23 minutes, but converting to `.igd` only took 3 minutes. Furthermore, iteratively accessing all the variants (and genotype data) in the `.igd` file was approximately `15x` faster than accessing the same data in the `.vcf.gz` file (using `picovcf`). On Biobank-scale real datasets, IGD is on average 3.5x smaller than `.vcf.gz`.\n\n### How do I use IGD in my project?\n\n* Clone [picovcf](https://github.com/aprilweilab/picovcf) and follow the instructions in this README to build the example tools for that library.\n  * If you want to be able to convert `.vcf.gz` (compressed VCF) to IGD, make sure you build with `-DENABLE_VCF_GZ=ON`\n* Use `igdtools` to convert and process files\n* Do one of the following:\n  * If your project is C++, copy [picovcf.hpp](https://github.com/aprilweilab/picovcf/blob/main/picovcf.hpp) into your project, `#include` it somewhere and then use according to the [documentation](https://picovcf.readthedocs.io/en/latest/)\n  * If your project is Python, you can install [pyigd](https://github.com/aprilweilab/pyigd/) via `pip install pyigd` (see [the 
docs](https://pyigd.readthedocs.io/en/latest/))\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Faprilweilab%2Fpicovcf","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Faprilweilab%2Fpicovcf","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Faprilweilab%2Fpicovcf/lists"}