https://github.com/cmdcolin/vcfverifier
Check that a VCF matches a given reference genome
- Host: GitHub
- URL: https://github.com/cmdcolin/vcfverifier
- Owner: cmdcolin
- License: mit
- Created: 2022-07-29T23:50:35.000Z (over 2 years ago)
- Default Branch: master
- Last Pushed: 2024-07-29T16:29:19.000Z (4 months ago)
- Last Synced: 2024-10-31T11:57:12.180Z (19 days ago)
- Language: Rust
- Size: 41 KB
- Stars: 3
- Watchers: 2
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE.md
README
# vcfverifier
Checks that a given VCF file matches a given assembly in FASTA format by
checking that the REF column of each VCF record matches the FASTA sequence at
that position (case-insensitive).

## Install
First install Rust, for example with rustup (https://rustup.rs/).
Then run:
```
cargo install vcfverifier
```

## Usage
```
## Generate the FASTA index (fai)
samtools faidx myfile.fa

## Run the verifier
vcfverifier --fasta myfile.fa --vcf myfile.vcf.gz
```

The `--vcf` flag accepts plaintext, gzip, or bgzip VCF files.
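
One way a tool can tell these three input kinds apart is by looking at the first bytes of the file. The sketch below uses only the standard library and is not taken from the vcfverifier source; the `VcfInput` enum and the `sniff` and `sniff_reader` functions are hypothetical names for illustration. It relies on the gzip magic bytes (`0x1f 0x8b`) and on the fact that bgzip (BGZF) blocks are gzip members whose extra field begins with the subfield identifier `BC`.

```rust
use std::io::{self, Read};

/// The three input kinds named above (hypothetical type, for illustration).
#[derive(Debug, PartialEq)]
enum VcfInput {
    Plain,
    Gzip,
    Bgzip,
}

/// Classify an input from its leading bytes.
fn sniff(header: &[u8]) -> VcfInput {
    // Anything that does not start with the gzip magic is treated as plaintext.
    if header.len() < 2 || header[0] != 0x1f || header[1] != 0x8b {
        return VcfInput::Plain;
    }
    // Byte 3 is the gzip FLG byte; bit 0x04 means an extra field follows.
    // The extra field starts at byte 12, after the 2-byte XLEN at bytes 10..12.
    let has_extra = header.len() >= 14 && header[3] & 0x04 != 0;
    if has_extra && header[12] == b'B' && header[13] == b'C' {
        VcfInput::Bgzip
    } else {
        VcfInput::Gzip
    }
}

/// Read up to 14 bytes from any reader and classify them.
fn sniff_reader<R: Read>(mut reader: R) -> io::Result<VcfInput> {
    let mut buf = [0u8; 14];
    let mut filled = 0;
    while filled < buf.len() {
        let n = reader.read(&mut buf[filled..])?;
        if n == 0 {
            break; // end of input before 14 bytes
        }
        filled += n;
    }
    Ok(sniff(&buf[..filled]))
}
```

Since a bgzip file is also a valid gzip stream, a reader that handles multi-member gzip can decompress both; the distinction mainly matters when block-level random access is wanted.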
## Approximate speed
Processing chr1 (6.5M rows) of the 1000 Genomes dataset takes ~24 seconds:
```
$ time vcfverifier --fasta hs37d5.fa --vcf ALL.chr1.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf.gz
Lines processed: 6468347
No mismatching lines found
vcfverifier --fasta ~/Downloads/hs37d5.fa --vcf 24.07s user 0.26s system 99% cpu 24.330 total
```
## Note
My first Rust project!
Uses faimm (https://github.com/veldsla/faimm) to memory-map the indexed FASTA
file, which keeps memory usage low: the entire FASTA does not have to be loaded
into memory, and the VCF is read line by line.
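
To illustrate why an indexed FASTA allows this kind of low-memory lookup, here is a minimal standard-library sketch of the REF check. It does not use faimm and is not the code in this repository: a plain byte slice stands in for the memory-mapped file, and `FaiEntry`, `byte_offset`, `ref_matches`, and `parse_record` are hypothetical names. The four index fields mirror the length, offset, bases-per-line, and bytes-per-line columns that `samtools faidx` writes to the `.fai` file.

```rust
/// One sequence's entry from a `.fai` index (the name column is omitted here).
struct FaiEntry {
    length: usize,     // number of bases in the sequence
    offset: usize,     // byte offset of the first base in the FASTA file
    line_bases: usize, // bases per line
    line_width: usize, // bytes per line, including the newline
}

/// Byte offset in the FASTA file of a 0-based base position.
fn byte_offset(e: &FaiEntry, pos0: usize) -> usize {
    e.offset + (pos0 / e.line_bases) * e.line_width + pos0 % e.line_bases
}

/// Check that `ref_allele` at 1-based `pos` matches the FASTA, ignoring case.
/// `fasta` stands in for the memory-mapped file contents.
fn ref_matches(fasta: &[u8], e: &FaiEntry, pos: usize, ref_allele: &str) -> bool {
    // Reject positions outside the sequence.
    if pos == 0 || pos - 1 + ref_allele.len() > e.length {
        return false;
    }
    // Look each base up individually so alleles spanning a line break work.
    ref_allele.bytes().enumerate().all(|(i, b)| {
        fasta
            .get(byte_offset(e, pos - 1 + i))
            .map_or(false, |f| f.eq_ignore_ascii_case(&b))
    })
}

/// Extract CHROM, POS, and REF from one tab-separated VCF data line.
/// Header lines (starting with '#') and malformed lines yield `None`.
fn parse_record(line: &str) -> Option<(&str, usize, &str)> {
    if line.starts_with('#') {
        return None;
    }
    let mut cols = line.split('\t');
    let chrom = cols.next()?;
    let pos = cols.next()?.parse().ok()?;
    let _id = cols.next()?;
    let reference = cols.next()?;
    Some((chrom, pos, reference))
}
```

Because each lookup is a small arithmetic computation followed by a few byte reads, only the pages of the FASTA that are actually touched need to be resident, and the VCF can be streamed one line at a time through `parse_record`.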