{"id":19434496,"url":"https://github.com/mlin/vcf_line_splitter","last_synced_at":"2025-04-14T18:11:08.312Z","repository":{"id":138431764,"uuid":"219890766","full_name":"mlin/vcf_line_splitter","owner":"mlin","description":"Split a huge VCF file into parts, quickly","archived":false,"fork":false,"pushed_at":"2021-12-30T09:43:07.000Z","size":29,"stargazers_count":5,"open_issues_count":0,"forks_count":0,"subscribers_count":2,"default_branch":"main","last_synced_at":"2025-03-28T06:33:41.335Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":"","language":"C++","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/mlin.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2019-11-06T02:03:00.000Z","updated_at":"2023-04-26T02:00:53.000Z","dependencies_parsed_at":"2024-01-13T01:44:36.021Z","dependency_job_id":null,"html_url":"https://github.com/mlin/vcf_line_splitter","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/mlin%2Fvcf_line_splitter","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/mlin%2Fvcf_line_splitter/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/mlin%2Fvcf_line_splitter/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/mlin%2Fvcf_line_splitter/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/mlin","download_url":"https://codeload.github.com/mlin/vcf_line_splitter/tar.gz/refs/
heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":248933340,"owners_count":21185460,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2024-11-10T14:46:37.314Z","updated_at":"2025-04-14T18:11:08.286Z","avatar_url":"https://github.com/mlin.png","language":"C++","funding_links":[],"categories":[],"sub_categories":[],"readme":"# vcf_line_splitter\n\n### Split a huge VCF file into multiple parts, quickly\n\nStarting with `big.vcf.gz`,\n\n```\nbgzip -dc@ 4 big.vcf.gz | vcf_line_splitter -MB 1024 -threads $(nproc) small-\n```\n\nWrites `small-000000.vcf.gz`, `small-000001.vcf.gz`, `small-000002.vcf.gz`, ..., each including the header and roughly one gigabyte worth (before compression) of the variant lines from the original VCF (contiguous and in-order).\n\nIt's multithreaded C++ code to do this at high speed. Memory usage is liable to scale as the specified part size times the number of threads (times safety factor).\n\nCompile with `make`, dependencies listed in the [Dockerfile](https://github.com/mlin/vcf_line_splitter/blob/master/Dockerfile).\n\n### Motivation\n\nModern cohort sequencing projects now produce joint .vcf.gz files that are individually hundreds of gigabytes or more, compressed. Parallel analytics environments like Apache Spark are appropriate for such datasets; but, importing individual files of that size can still be very slow, because they're initially read using just one thread. 
Splitting the file beforehand lets us parallelize the import.\n\nSo we wrote this utility with custom multithreading, and other low-level speed tuning, to split up the lines of a big VCF file into recompressed partitions as quickly as possible, to prepare for import into Spark or similar.\n\nAlternatives and their problems:\n\n* coreutils `split`: materializes uncompressed data on disk (too big), or blocks main thread on recompression (too slow); doesn't copy the header into each part.\n* `tabix`: complications when variants span the edges of the extracted genome regions; requires first running single-threaded tabix indexing.\n\n### Limitations\n\nSplitting occurs between VCF lines, but otherwise without regard to their genome positions or content. Therefore, a part may contain variants on many contigs, or the variants on one contig may be found in many parts. Similarly, groups of related variants (e.g. by phase set, or spanning deletion) may be split across parts.\n\n### Maximizing throughput\n\nBesides adding more threads, make sure you have modern versions of bgzip and htslib which support multithreaded BGZF and use [libdeflate](https://github.com/ebiggers/libdeflate). 
To verify, `ldd $(which bgzip) vcf_line_splitter` and check that both use `libdeflate.so`.\n\nIf the big input VCF file initially resides on a remote server, then pipe the data directly into decompression and `vcf_line_splitter`, instead of first downloading it completely.\n\n### Docker \u0026 WDL\n\nThe published Docker image has all the dependencies on board, and might be used like so:\n\n```\ndocker run --rm -it -v $(pwd):/work ghcr.io/mlin/vcf_line_splitter bash -euo pipefail -c \\\n    \"bgzip -dc@ 4 /work/big.vcf.gz | vcf_line_splitter -MB 1024 -threads $(nproc) /work/small-\"\n```\n\nWe've also included a [WDL task](https://github.com/mlin/vcf_line_splitter/blob/master/vcf_line_splitter.wdl) definition.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fmlin%2Fvcf_line_splitter","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fmlin%2Fvcf_line_splitter","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fmlin%2Fvcf_line_splitter/lists"}