{"id":19434495,"url":"https://github.com/mlin/spvcf","last_synced_at":"2025-09-03T21:42:03.622Z","repository":{"id":138431555,"uuid":"138694727","full_name":"mlin/spVCF","owner":"mlin","description":"Sparse Project VCF: evolution of VCF to encode population genotype matrices efficiently","archived":false,"fork":false,"pushed_at":"2023-10-29T00:59:59.000Z","size":8281,"stargazers_count":58,"open_issues_count":7,"forks_count":1,"subscribers_count":5,"default_branch":"main","last_synced_at":"2025-04-24T20:43:00.571Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":"","language":"C++","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/mlin.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2018-06-26T06:34:14.000Z","updated_at":"2024-09-29T00:26:18.000Z","dependencies_parsed_at":"2023-10-02T09:21:50.942Z","dependency_job_id":"ce32067d-6d1e-488e-8bc4-fab8d827b9a1","html_url":"https://github.com/mlin/spVCF","commit_stats":null,"previous_names":[],"tags_count":23,"template":false,"template_full_name":null,"purl":"pkg:github/mlin/spVCF","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/mlin%2FspVCF","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/mlin%2FspVCF/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/mlin%2FspVCF/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/mlin%2FspVCF/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/mlin","download_url":"https://codeload.github.com/mlin/spVCF/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/mlin%2FspVCF/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":273516477,"owners_count":25119763,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","status":"online","status_checked_at":"2025-09-03T02:00:09.631Z","response_time":76,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2024-11-10T14:46:37.059Z","updated_at":"2025-09-03T21:42:03.560Z","avatar_url":"https://github.com/mlin.png","language":"C++","funding_links":[],"categories":[],"sub_categories":[],"readme":"# Sparse Project VCF (spVCF)\n\n**Maintainer: Mike Lin [@DNAmlin](https://twitter.com/DNAmlin)**\n\nProject VCF (pVCF; aka multi-sample VCF or population VCF) is the prevailing file format for small genetic variants discovered by cohort sequencing. It encodes a two-dimensional matrix with variant sites down the rows and study participants across the columns, filled in with all the genotypes and associated QC measures (read depths, genotype likelihoods, etc.). Large cohorts harbor many rare variants, implying a sparse genotype matrix composed largely of reference-identical or non-called cells. But the dense pVCF format encodes this inefficiently, growing super-linearly with the cohort size.\n\nspVCF is an evolution of VCF that keeps most aspects of its tab-delimited text format, but presents the genotype matrix sparsely, by selectively reducing QC measure entropy and run-length encoding repetitive information about reference coverage. This is less sophisticated than some other efforts to address VCF's density and other shortcomings, but perhaps more palatable to existing VCF consumers by virtue of simplicity.\n\nFurther resources:\n\n* Our [*Bioinformatics* Applications Note](https://doi.org/10.1093/bioinformatics/btaa1004) describing the approach, tool, and example results\n* [doc/SPEC.md](https://github.com/mlin/spVCF/blob/master/doc/SPEC.md) has format details and a worked example\n* [doc/compression_results.md](https://github.com/mlin/spVCF/blob/master/doc/compression_results.md) tests spVCF with *N*=50K exomes, observing up to 15X size reduction for bgzip-compressed pVCF, and scaling much more gently with *N*.\n* [slide deck](https://docs.google.com/presentation/d/13lzEkdWAVwcsKofhsiYEdl92xMQgx5_dSOSIyZDggfM/edit?usp=sharing) presented at the GA4GH \u0026 MPEG-G Genome Compression Workshop, October 2018.\n* [spVCF files for the resequenced 1000 Genomes Project cohort](https://github.com/mlin/spVCF/blob/master/doc/1000G_NYGC_GATK.md) (*N*=2,504 WGS)\n\n## `spvcf` utility\n\n[![build](https://github.com/mlin/spVCF/actions/workflows/build.yml/badge.svg?branch=main)](https://github.com/mlin/spVCF/actions/workflows/build.yml)\n\nThis repository has a command-line utility for encoding pVCF to spVCF and vice versa. The [Releases](https://github.com/mlin/spVCF/releases) page has pre-built executables compatible with most Linux x86-64 hosts, which you can download and `chmod +x spvcf`.\n\nTo build and test it locally, begin with a C++14 Linux development environment with CMake and [libdeflate](https://github.com/ebiggers/libdeflate). Clone this repository and:\n\n```\ncmake . \u0026\u0026 make\nctest -V\n```\n\nThe subcommands `spvcf encode` and `spvcf decode` encode existing pVCF to spVCF and vice versa. The input and output streams are uncompressed VCF text, so you usually arrange a pipe with `bgzip`. Examples:\n\n```\n$ ./spvcf encode cohort.vcf \u003e cohort.spvcf\n$ bgzip -dc cohort.vcf.gz | ./spvcf encode | bgzip -c -@ $(nproc)  \u003e cohort.spvcf.gz\n$ bgzip -dc cohort.spvcf.gz | ./spvcf decode \u003e cohort.decoded.vcf\n```\n\nDetails:\n\n```\nspvcf encode [options] [in.vcf|-]\nReads VCF text from standard input if filename is empty or -\n\nOptions:\n  -o,--output out.spvcf  Write to out.spvcf instead of standard output\n  -n,--no-squeeze        Disable lossy QC squeezing transformation (lossless run-encoding only)\n  -p,--period P          Ensure checkpoints (full dense rows) at this period or less (default: 1000)\n  -t,--threads N         Use multithreaded encoder with this number of worker threads\n  -q,--quiet             Suppress statistics printed to standard error\n  -h,--help              Show this help message\n```\n\n```\nspvcf decode [options] [in.spvcf|-]\nReads spVCF text from standard input if filename is empty or -\n\nOptions:\n  --with-missing-fields  Include trailing FORMAT fields with missing values\n  -o,--output out.vcf    Write to out.vcf instead of standard output\n  -q,--quiet             Suppress statistics printed to standard error\n  -h,--help              Show this help message\n```\n\nThere's also `spvcf squeeze` to apply the QC squeezing transformation to a pVCF, without the sparse quote-encoding. This produces valid pVCF that's typically much smaller, although not as small as spVCF.\n\nThe multithreaded encoder should be used only if the single-threaded version is a proven bottleneck. It's capable of higher throughput in favorable circumstances, but trades off memory usage and copying. The memory usage scales with threads, period, and *N*.\n\n### Tabix slicing\n\nIf the familiar `bgzip` and `tabix -p vcf` utilities are used to block-compress and index a spVCF file, then `spvcf tabix` can take a genomic range slice from it, extracting spVCF which decodes standalone. (The regular `tabix` utility generates the index, but using it to take the slice would yield a broken fragment.) Example:\n\n```\n$ bgzip -dc cohort.vcf.gz | ./spvcf encode | bgzip -c -@ $(nproc)  \u003e cohort.spvcf.gz\n$ tabix -p vcf cohort.spvcf.gz\n$ ./spvcf tabix cohort.spvcf.gz chr21:5143000-5219900 \u003e slice.spvcf\n$ ./spvcf decode slice.spvcf \u003e slice.vcf\n```\n\n## Compatibility\n\nspVCF is frequently used with project VCF files generated by [GATK GenotypeGVCFs](https://gatk.broadinstitute.org/hc/en-us/articles/360037057852-GenotypeGVCFs) and [GLnexus](https://github.com/dnanexus-rnd/GLnexus). Other joint-callers' products should work too, but aren't as routinely tested.\n\nGLnexus now has a `--squeeze` command-line option to generate squeezed project VCF directly, which also speeds it up significantly. For large cohorts this should still be piped into `spvcf encode` to clean it up a little and add the run-encoding.\n\nSqueezed project VCF (decoded from spVCF, or generated by `spvcf squeeze`) keeps the declarations of all FORMAT fields, but in most cells omits all except `GT` and `DP`. The rest aren't just marked missing, but omitted completely. Some downstream tools may be confused by this, even though the [VCF specification](https://samtools.github.io/hts-specs/VCFv4.3.pdf) expressly allows it (*\"Trailing fields can be dropped...\"*). We're advocating for more recognition of this useful, existing feature.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fmlin%2Fspvcf","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fmlin%2Fspvcf","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fmlin%2Fspvcf/lists"}