{"id":19434498,"url":"https://github.com/mlin/gvcf_norm","last_synced_at":"2025-04-24T20:32:12.689Z","repository":{"id":43799355,"uuid":"387141964","full_name":"mlin/gvcf_norm","owner":"mlin","description":"gVCF allele normalizer","archived":false,"fork":false,"pushed_at":"2024-01-14T15:31:11.000Z","size":10777,"stargazers_count":8,"open_issues_count":1,"forks_count":1,"subscribers_count":2,"default_branch":"main","last_synced_at":"2025-04-03T10:38:02.766Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":"","language":"Rust","has_issues":false,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/mlin.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2021-07-18T09:52:27.000Z","updated_at":"2022-07-07T19:55:23.000Z","dependencies_parsed_at":"2024-11-10T14:46:59.103Z","dependency_job_id":"00fd8c90-0fee-44a1-8ec6-a6ab582431ec","html_url":"https://github.com/mlin/gvcf_norm","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/mlin%2Fgvcf_norm","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/mlin%2Fgvcf_norm/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/mlin%2Fgvcf_norm/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/mlin%2Fgvcf_norm/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/mlin","download_url":"https://codeload.github.com/mlin/gvcf_norm/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":250704843,"owners_count":21473771,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2024-11-10T14:46:37.528Z","updated_at":"2025-04-24T20:32:09.872Z","avatar_url":"https://github.com/mlin.png","language":"Rust","funding_links":[],"categories":[],"sub_categories":[],"readme":"# gvcf_norm\n\n**Command-line tool for [left-aligning and normalizing](https://genome.sph.umich.edu/wiki/Variant_Normalization#Algorithm_for_Normalization) gVCF variants**\n\n**NOTICE: [DeepVariant v1.3.0+](https://github.com/google/deepvariant/releases/tag/v1.3.0) has a built-in option `--normalize-reads` to ensure indel left-alignment, replacing this utility**\n\nSame algorithm as `vt normalize` and `bcftools norm -f`, but tolerates gVCF's idioms: (i) ignores any symbolic alleles in variant records (`\u003cNON_REF\u003e`, `\u003c*\u003e`, `*`), and (ii) passes through reference bands unchanged.\n\nBecause variant records can be repositioned, *but not* reference bands, a repositioned variant record may end up in the middle of an overlapping reference band, with a small coverage gap between the reference bands. But at least the variant records are normalized.\n\n**Build** \n\n[![CI](https://github.com/mlin/gvcf_norm/actions/workflows/build.yml/badge.svg?branch=main)](https://github.com/mlin/gvcf_norm/actions/workflows/build.yml)\n\n```cargo build --release```\n\nto build `target/release/gvcf_norm`\n\n**Usage**\n\n```bgzip -dc my.g.vcf.gz | ./gvcf_norm -r /ref/genome/dir/ - | bgzip -c \u003e my.norm.g.vcf.gz```\n\nwhere `/ref/genome/dir` is a directory with the reference genome sequences, one file per chromosome named as such, containing no whitespace (suitable for memory-mapped offset access into each sequence). Generate this directory from a reference genome FASTA using the [`unpack_fasta_dir.sh`](unpack_fasta_dir.sh) script, which uses [seqkit](https://bioinf.shenwei.me/seqkit/).\n\nMemory usage grows with the uncompressed text of all gVCF records for the largest chromosome.\n\n### Example 1\n\n*(Some fields omitted for brevity)*\n\n**Before**\n\n```\nchr21  29848774  T   \u003c*\u003e             END=29848778  GT:MIN_DP  0/0:30\nchr21  29848779  AT  ATATATTT,T,\u003c*\u003e  .             GT:DP      1/2:32\nchr21  29848781  T   \u003c*\u003e             END=29848791  GT:MIN_DP  0/0:19\n```\n\n**After**\n\n```\nchr21  29848774  T   \u003c*\u003e             END=29848778                    GT:MIN_DP  0/0:30\nchr21  29848778  TA  TATATATT,T,\u003c*\u003e  gvcf_norm_originalPOS=29848779  GT:DP      1/2:32\nchr21  29848781  T   \u003c*\u003e             END=29848791                    GT:MIN_DP  0/0:19\n```\n\nThe single-nucleotide deletion chr21:29848778 TAT\u003eTT was written as 29848779 AT\u003eT but normalized to 29848778 TA\u003eT, and the insertion padded to match. The pre-normalized position is recorded in a new INFO field. The new position overlaps with the preceding reference band, and there's a gap in reference band coverage.\n\n### Example 2\n\n**Before**\n\n```\nchr21  26193733  T       \u003c*\u003e      END=26193733  GT:MIN_DP  0/0:33\nchr21  26193734  G       T,\u003c*\u003e    .             GT:DP      0/1:33\nchr21  26193735  T       \u003c*\u003e      END=26193740  GT:MIN_DP  0/0:32\nchr21  26193741  TTTTTT  T,\u003c*\u003e    .             GT:DP      0/1:32\nchr21  26193747  T       \u003c*\u003e      END=26193751  GT:MIN_DP  0/0:22\n```\n\n**After**\n\n```\nchr21  26193733  T       \u003c*\u003e    END=26193733                    GT:MIN_DP  0/0:33\nchr21  26193734  G       T,\u003c*\u003e  .                               GT:DP      0/1:33\nchr21  26193734  GTTTTT  G,\u003c*\u003e  gvcf_norm_originalPOS=26193741  GT:DP      0/1:32\nchr21  26193735  T       \u003c*\u003e    END=26193740                    GT:MIN_DP  0/0:32\nchr21  26193747  T       \u003c*\u003e    END=26193751                    GT:MIN_DP  0/0:22\n```\n\nThe deletion chr21:26193741 TTTTTT\u003eT moved some distance upstream to 26193734 GTTTTT\u003eG. The record order changed to remain sorted by position. The new position coincides with another variant record, which wasn't merged, and also hangs over a passed-through reference band. There's a gap in reference band coverage at the old position.\n\nThe tool doesn't have enough information to fix up the reference bands. We hope the tool will not be needed for long.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fmlin%2Fgvcf_norm","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fmlin%2Fgvcf_norm","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fmlin%2Fgvcf_norm/lists"}