https://github.com/alexcoppe/vcf_to_tsv
Transforms a VCF (variant call format) file to a tab-separated values (.tsv) one
bioinformatics cpp genomics mutations variant-calling vcf
- Host: GitHub
- URL: https://github.com/alexcoppe/vcf_to_tsv
- Owner: alexcoppe
- License: mit
- Created: 2024-08-10T14:35:30.000Z (almost 2 years ago)
- Default Branch: main
- Last Pushed: 2024-09-09T08:33:44.000Z (almost 2 years ago)
- Last Synced: 2025-05-15T07:11:54.591Z (about 1 year ago)
- Topics: bioinformatics, cpp, genomics, mutations, variant-calling, vcf
- Language: C++
- Homepage:
- Size: 12.7 KB
- Stars: 0
- Watchers: 1
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
# vcf_to_tsv :dna:
Transforms a VCF (variant call format) file into a tab-separated values (.tsv) file.
Its compilation and functionality have been verified on the following operating systems:
- macOS :green_apple:
- Linux :penguin:
# Download and Compilation :floppy_disk:
```console
>>> git clone https://github.com/alexcoppe/vcf_to_tsv
>>> cd vcf_to_tsv
>>> make
```
After compilation, move the generated executable ```vcf_to_tsv``` to a directory listed in the $PATH variable. You can identify these directories by using the ```echo $PATH``` command.
# Run the software :running_man:
This software transforms an uncompressed VCF file to a tab-separated values (tsv) file. It also works with VCFs generated by [SnpEff](https://pcingola.github.io/SnpEff/) and [ANNOVAR](https://annovar.openbioinformatics.org/en/latest/).
To run it, you need **two arguments**: the **VCF file** and a **text file specifying the desired fields**. Refer to the table below for guidance on creating this file.
When given a [SnpEff](https://pcingola.github.io/SnpEff/)-annotated VCF, the tool currently writes each transcript reported by SnpEff on a separate row.
Starting character | What you get
------------ | -------------
None | get the fields from the VCF
: | get a subfield from the INFO field added by SnpEff
; | get a specific subfield from the INFO field
\| | get a specific subfield from the Genotype fields
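
One way to read the table above is as a dispatch on the first character of each line in the fields file. The C++ sketch below illustrates that reading only; `FieldKind`, `FieldSpec` and `classify_field` are hypothetical names chosen for this example and are not taken from the vcf_to_tsv source.

```cpp
#include <string>

// Hypothetical categories mirroring the four rows of the table above.
enum class FieldKind { VcfColumn, SnpEffSubfield, InfoSubfield, GenotypeSubfield };

// One parsed line of the fields file (illustrative type, not from the tool).
struct FieldSpec {
    FieldKind kind;
    std::string name;  // the specifier without its starting character
};

// Classify one line of the fields file by its starting character.
FieldSpec classify_field(const std::string& line) {
    if (line.empty()) return {FieldKind::VcfColumn, ""};
    switch (line[0]) {
        case ':': return {FieldKind::SnpEffSubfield, line.substr(1)};
        case ';': return {FieldKind::InfoSubfield, line.substr(1)};
        case '|': return {FieldKind::GenotypeSubfield, line.substr(1)};
        default:  return {FieldKind::VcfColumn, line};
    }
}
```

Under this sketch, `classify_field(":hgvs_c")` is a SnpEff subfield named `hgvs_c`, while `classify_field("position")` is a plain VCF field.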
Example of a text file specifying the desired fields and subfields:
```console
:hgvs_c
position
;gnomAD_genome_AMR
|AD
```
Launching the program with the above text file:
```console
vcf_to_tsv a_vcf_file_path.vcf wanted_fields.txt
```
Output:
```console
n.-3702C>T 157370625 0.0020 14,1 31,5
n.*1931C>T 157370625 0.0020 14,1 31,5
n.-3707C>T 157370630 0 15,1 33,4
...
```
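The third output column above comes from the `;gnomAD_genome_AMR` line of the fields file. In a VCF, the INFO column is a semicolon-separated list of `key=value` pairs (plus bare flags), so a `;` lookup can be pictured roughly as follows. This is a minimal sketch, not the tool's actual code; `info_subfield` is a hypothetical helper, and returning an empty string for flags and missing keys is a choice made for this example.

```cpp
#include <cstddef>
#include <sstream>
#include <string>

// Look up `key` in a semicolon-separated INFO column.
// Flags (items without '=') and missing keys yield an empty string.
std::string info_subfield(const std::string& info, const std::string& key) {
    std::stringstream stream(info);
    std::string item;
    while (std::getline(stream, item, ';')) {
        const std::size_t eq = item.find('=');
        // Match only when the text before '=' is exactly the requested key.
        if (eq != std::string::npos && item.compare(0, eq, key) == 0) {
            return item.substr(eq + 1);
        }
    }
    return "";
}
```

For example, with the made-up INFO string `DP=15;gnomAD_genome_AMR=0.0020`, calling `info_subfield(info, "gnomAD_genome_AMR")` returns `0.0020`.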
Currently, the software only supports VCF files with 1 or 2 genotype fields.
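A `|` line such as `|AD` appears to yield one output column per genotype field, which would explain why each row of the example output ends with two values. In the VCF format, the FORMAT column lists colon-separated keys and each genotype column stores its values in the same order. The following sketch shows that lookup under those assumptions; `split` and `genotype_subfield` are hypothetical helpers, not functions from vcf_to_tsv, and returning `.` for a missing key is a choice made for this example.

```cpp
#include <cstddef>
#include <sstream>
#include <string>
#include <vector>

// Split a string on a single-character delimiter.
std::vector<std::string> split(const std::string& text, char delim) {
    std::vector<std::string> parts;
    std::stringstream stream(text);
    std::string part;
    while (std::getline(stream, part, delim)) parts.push_back(part);
    return parts;
}

// Return the value stored under `key` in one genotype column,
// or "." when the FORMAT column does not list that key.
std::string genotype_subfield(const std::string& format,
                              const std::string& sample,
                              const std::string& key) {
    const std::vector<std::string> keys = split(format, ':');
    const std::vector<std::string> values = split(sample, ':');
    for (std::size_t i = 0; i < keys.size() && i < values.size(); ++i) {
        if (keys[i] == key) return values[i];
    }
    return ".";
}
```

With an illustrative FORMAT of `GT:AD:DP` and a genotype column of `0/1:14,1:15`, `genotype_subfield(format, sample, "AD")` returns `14,1`.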
The table below displays all the sub-fields added by [SnpEff](https://pcingola.github.io/SnpEff/) along with the corresponding sub-field names used in vcf_to_tsv (listed in the first column).
Subfield by vcf_to_tsv | Subfield by SnpEff | Explanation
------------ | ------------- | -------------
:allele | Allele (or ALT) | The alternative allele
:annotation | Annotation (a.k.a. effect) | Annotated using Sequence Ontology terms
:putative_impact | Putative_impact | A simple estimation of putative impact / deleteriousness: {HIGH, MODERATE, LOW, MODIFIER}
:gene_name | Gene Name | Common gene name (HGNC)
:gene_id | Gene ID | Gene ID
:feature_type | Feature type | Which type of feature is in the next field
:feature_id | Feature ID | Depends on the annotation
:transcript_biotype | Transcript biotype | The bare minimum is at least a description on whether the transcript is {"Coding", "Noncoding"}. Whenever possible, use ENSEMBL biotypes
:rank | Rank / total | Exon or Intron rank / total number of exons or introns
:hgvs_c | HGVS.c | Variant using HGVS notation (DNA level)
:hgvs_p | HGVS.p | If variant is coding, this field describes the variant using HGVS notation (Protein level)
:cdna_position | cDNA_position / cDNA_len | Position in cDNA and transcript's cDNA length (one-based)
:cds_position | CDS_position / CDS_len | Position and number of coding bases (one-based, including START and STOP codons)
:protein_position | Protein_position / Protein_len | Position and number of AA (one based, including START, but not STOP)
:distance_to_feature | Distance to feature | All items in this field are optional; see the SnpEff page for details
:errors | Errors, Warnings or Information messages | Errors, warnings or informative messages that can affect annotation accuracy
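
SnpEff stores these sub-fields in the `ANN` entry of the INFO column: annotations for different transcripts are separated by commas, and the sub-fields of each annotation are separated by `|`, in the order of the table above. The sketch below shows how a name from the first column could be mapped to one value per transcript. It is an illustration under those assumptions, not the vcf_to_tsv implementation; `kAnnNames`, `split` and `ann_subfield` are hypothetical names, and the gene and transcript identifiers in the usage note are placeholders.

```cpp
#include <array>
#include <cstddef>
#include <sstream>
#include <string>
#include <vector>

// Sub-field names from the first column of the table above (without the
// leading ':'), in the order SnpEff writes them inside each ANN annotation.
const std::array<const char*, 16> kAnnNames = {{
    "allele", "annotation", "putative_impact", "gene_name", "gene_id",
    "feature_type", "feature_id", "transcript_biotype", "rank", "hgvs_c",
    "hgvs_p", "cdna_position", "cds_position", "protein_position",
    "distance_to_feature", "errors"}};

// Split a string on a single-character delimiter (interior empty parts are kept).
std::vector<std::string> split(const std::string& text, char delim) {
    std::vector<std::string> parts;
    std::stringstream stream(text);
    std::string part;
    while (std::getline(stream, part, delim)) parts.push_back(part);
    return parts;
}

// For every transcript in an ANN value (the text after "ANN="), return the
// requested sub-field; empty string when it is absent or the name is unknown.
std::vector<std::string> ann_subfield(const std::string& ann_value,
                                      const std::string& name) {
    std::size_t index = kAnnNames.size();  // size() means "unknown name"
    for (std::size_t i = 0; i < kAnnNames.size(); ++i) {
        if (name == kAnnNames[i]) index = i;
    }
    std::vector<std::string> per_transcript;
    for (const std::string& entry : split(ann_value, ',')) {
        const std::vector<std::string> fields = split(entry, '|');
        const bool present = index < kAnnNames.size() && index < fields.size();
        per_transcript.push_back(present ? fields[index] : std::string());
    }
    return per_transcript;
}
```

Given an ANN value with two annotations, such as `T|upstream_gene_variant|MODIFIER|GENE1|GENE1|transcript|TX1|lncRNA||n.-3702C>T|||||3702|,T|downstream_gene_variant|MODIFIER|GENE2|GENE2|transcript|TX2|lncRNA||n.*1931C>T|||||1931|`, calling `ann_subfield(ann, "hgvs_c")` returns two values, `n.-3702C>T` and `n.*1931C>T`, one for each output row.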