https://github.com/brendancsmith/vcf-isec
A simple Python implementation of Variant Call Format intersection and complements for identifying genetic mutations
bioinformatics genome variant-call-format variant-calling
- Host: GitHub
- URL: https://github.com/brendancsmith/vcf-isec
- Owner: brendancsmith
- License: mit
- Created: 2024-03-06T06:56:51.000Z (about 1 year ago)
- Default Branch: main
- Last Pushed: 2024-03-07T15:53:56.000Z (about 1 year ago)
- Last Synced: 2025-01-17T03:15:13.186Z (4 months ago)
- Topics: bioinformatics, genome, variant-call-format, variant-calling
- Language: Python
- Homepage:
- Size: 93.8 KB
- Stars: 0
- Watchers: 1
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
README
# vcf-isec
A simple Python implementation of Variant Call Format intersection and complements.
## Background
Bioinformaticians store variants identified by next-generation sequencing in Variant Call Format (VCF) files. The VCF specification was originally maintained by the 1000 Genomes Project; responsibility has since passed to the file format team of the Global Alliance for Genomics and Health Data Working Group.
Specifications for VCF v4.1 can be found [here](http://samtools.github.io/hts-specs/VCFv4.1.pdf).
Essentially, each variant is represented as a separate line in the VCF, where the chromosome, position, reference base(s), and alternate base(s) identified at that position are found in columns 1, 2, 4, and 5, respectively. Additional information pertaining to the variant is listed in the remaining fields of the VCF.
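As a rough illustration of the column layout described above, the following sketch pulls the four identifying fields out of one tab-separated VCF data line. The names `VariantKey` and `parse_variant_key` are hypothetical and chosen for this example only; they are not taken from this repository's code.

```python
from typing import NamedTuple, Optional


class VariantKey(NamedTuple):
    """The four fields that identify a variant (hypothetical helper type)."""
    chrom: str
    pos: int
    ref: str
    alt: str


def parse_variant_key(line: str) -> Optional[VariantKey]:
    """Return the identifying fields of a VCF data line, or None for blank/header lines."""
    if not line.strip() or line.startswith("#"):
        return None
    fields = line.rstrip("\n").split("\t")
    # Columns 1, 2, 4, and 5 (1-based) hold CHROM, POS, REF, and ALT.
    return VariantKey(fields[0], int(fields[1]), fields[3], fields[4])


example = "20\t14370\trs6054257\tG\tA\t29\tPASS\tNS=3;DP=14"
print(parse_variant_key(example))
```

Column 3 (the variant ID) and the columns after ALT are ignored here, since only the four fields named above are needed to identify a variant in this sketch.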
## Task
A common task for bioinformaticians is to compare variants, whether to compare VCF files generated by different analytical pipelines or to simply compare variants between related individuals.
This script takes two VCFs as input and compares the variants found in each file. It outputs three VCFs: one containing the variants shared by both inputs, and one each containing the variants unique to the first and to the second input.
**NOTE**: An example VCF is provided at `tests/resources/sample.vcf`. VCFs can grow to as many as 4 million variants, as in the case of whole genome sequencing.
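The comparison described in this section can be sketched with plain dictionaries. This is a minimal illustration under stated assumptions, not necessarily how this package implements it: two variants are treated as the same only when CHROM, POS, REF, and ALT match exactly, no normalization is applied, a multi-allelic ALT is compared as a single string, and shared records are taken from the first input. The function names `index_variants` and `compare` are hypothetical.

```python
from typing import Dict, Iterable, List, Tuple

Key = Tuple[str, str, str, str]


def index_variants(lines: Iterable[str]) -> Dict[Key, str]:
    """Map (CHROM, POS, REF, ALT) to the full record line, skipping header lines."""
    records: Dict[Key, str] = {}
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue
        record = line.rstrip("\n")
        f = record.split("\t")
        records[(f[0], f[1], f[3], f[4])] = record
    return records


def compare(
    a_lines: Iterable[str], b_lines: Iterable[str]
) -> Tuple[List[str], List[str], List[str]]:
    """Return (shared, only_in_a, only_in_b) record lines, in input order."""
    a = index_variants(a_lines)
    b = index_variants(b_lines)
    shared = [rec for key, rec in a.items() if key in b]      # intersection
    only_a = [rec for key, rec in a.items() if key not in b]  # complement of b in a
    only_b = [rec for key, rec in b.items() if key not in a]  # complement of a in b
    return shared, only_a, only_b


vcf_a = ["##fileformat=VCFv4.1", "1\t100\t.\tA\tG", "1\t200\t.\tC\tT"]
vcf_b = ["##fileformat=VCFv4.1", "1\t100\t.\tA\tG", "2\t300\t.\tG\tA"]
shared, only_a, only_b = compare(vcf_a, vcf_b)
```

Because both inputs are indexed in memory, this approach suits files of the size mentioned in the note above only if a few million keys fit comfortably in RAM; writing the three result lists out as VCFs would also require copying the header lines, which this sketch omits.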