https://github.com/cmdcolin/vcfverifier
Check that a VCF matches a given reference genome
- Host: GitHub
- URL: https://github.com/cmdcolin/vcfverifier
- Owner: cmdcolin
- License: mit
- Created: 2022-07-29T23:50:35.000Z (almost 4 years ago)
- Default Branch: master
- Last Pushed: 2024-12-04T16:52:36.000Z (over 1 year ago)
- Last Synced: 2025-04-12T09:54:18.105Z (about 1 year ago)
- Language: Rust
- Size: 44.9 KB
- Stars: 4
- Watchers: 2
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE.md
# vcfverifier
Checks that a VCF file matches an assembly in FASTA format by verifying that
the REF column of each VCF record matches the corresponding bases in the FASTA
file (case-insensitive).
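The comparison described above can be sketched roughly as follows. This is a minimal illustration, not the code vcfverifier actually uses: the function name `ref_matches` and the choice to treat an out-of-range position as a mismatch are assumptions. It relies on the VCF convention that POS is 1-based.

```rust
/// Returns true when `ref_allele` equals the contig bases that start at the
/// 1-based position `pos`, ignoring ASCII case.
/// Hypothetical helper for illustration only.
fn ref_matches(contig: &[u8], pos: usize, ref_allele: &str) -> bool {
    // VCF POS is 1-based; convert it to a 0-based slice offset.
    let Some(start) = pos.checked_sub(1) else {
        return false; // POS 0 has no corresponding base
    };
    let Some(end) = start.checked_add(ref_allele.len()) else {
        return false;
    };
    match contig.get(start..end) {
        // Soft-masked (lowercase) FASTA bases still count as a match.
        Some(bases) => bases.eq_ignore_ascii_case(ref_allele.as_bytes()),
        // Assumption: a REF that runs past the contig end is a mismatch.
        None => false,
    }
}
```

For a contig `acgtACGT`, a record with POS 3 and REF `GTA` would pass, because the lowercase `gt` in the FASTA is compared without regard to case.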
## Install
First install Rust, for example with rustup (https://rustup.rs/).
Then run:
```
cargo install vcfverifier
```
## Usage
```
## Generate the FASTA index (fai)
samtools faidx myfile.fa
## Run the verifier
vcfverifier --fasta myfile.fa --vcf myfile.vcf.gz
```
The `--vcf` flag accepts plaintext, gzip, or bgzip VCF files.
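One common way to tell compressed input from plaintext is to look at the first two bytes of the file. The sketch below shows that idea; whether vcfverifier detects the format this way is an assumption, and `is_gzip_magic` is a hypothetical name. bgzip output is a series of valid gzip members, so it starts with the same magic bytes as ordinary gzip.

```rust
/// Returns true when `header` starts with the gzip magic bytes 0x1f 0x8b.
/// bgzip (BGZF) files begin with the same two bytes.
/// Hypothetical helper for illustration only.
fn is_gzip_magic(header: &[u8]) -> bool {
    header.starts_with(&[0x1f, 0x8b])
}
```

A plaintext VCF begins with `##fileformat=`, so it fails this check and can be read directly.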
## Approx speed
Processing chr1 (6.5M rows) of the 1000 Genomes dataset takes ~24 seconds
```
$ time vcfverifier --fasta hs37d5.fa --vcf ALL.chr1.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf.gz
Lines processed: 6468347
No mismatching lines found
vcfverifier --fasta ~/Downloads/hs37d5.fa --vcf 24.07s user 0.26s system 99% cpu 24.330 total
```
## Note
My first Rust project!
Uses faimm (https://github.com/veldsla/faimm) to memory-map the indexed FASTA
file, which keeps memory usage low: the entire FASTA does not have to be loaded
into memory, and the VCF is read line by line.
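
The line-by-line reading can be sketched with the standard library alone. This is a rough illustration under stated assumptions, not the project's actual code: `parse_record` and `check_vcf` are hypothetical names, the `HashMap` of contigs is an in-memory stand-in for the memory-mapped FASTA that faimm provides, and counting records on unknown contigs as mismatches is a choice made here for simplicity.

```rust
use std::collections::HashMap;
use std::io::{self, BufRead};

/// Pulls CHROM, POS and REF (columns 1, 2 and 4) out of one VCF data line.
fn parse_record(line: &str) -> Option<(&str, usize, &str)> {
    let mut cols = line.split('\t');
    let chrom = cols.next()?;
    let pos = cols.next()?.parse::<usize>().ok()?;
    let ref_allele = cols.nth(1)?; // skip ID, take REF
    Some((chrom, pos, ref_allele))
}

/// Streams a VCF and returns (data lines processed, mismatching lines).
/// Only one line is held in memory at a time.
fn check_vcf<R: BufRead>(
    vcf: R,
    contigs: &HashMap<String, Vec<u8>>, // stand-in for the indexed FASTA
) -> io::Result<(u64, u64)> {
    let (mut processed, mut mismatched) = (0u64, 0u64);
    for line in vcf.lines() {
        let line = line?;
        // Header and meta lines start with '#'; skip them and blank lines.
        if line.is_empty() || line.starts_with('#') {
            continue;
        }
        processed += 1;
        let ok = parse_record(&line)
            .and_then(|(chrom, pos, ref_allele)| {
                let seq = contigs.get(chrom)?;
                let start = pos.checked_sub(1)?; // VCF POS is 1-based
                let end = start.checked_add(ref_allele.len())?;
                let bases = seq.get(start..end)?;
                Some(bases.eq_ignore_ascii_case(ref_allele.as_bytes()))
            })
            // Assumption: unparseable or out-of-range records are mismatches.
            .unwrap_or(false);
        if !ok {
            mismatched += 1;
        }
    }
    Ok((processed, mismatched))
}
```

Because `check_vcf` accepts any `BufRead`, the same loop works over a plain file wrapped in a `BufReader` or over a decompressing reader.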