https://github.com/alexcoppe/vcf_to_tsv

Transforms a VCF (variant call format) file to a tab-separated values (.tsv) one

Host: GitHub
URL: https://github.com/alexcoppe/vcf_to_tsv
Owner: alexcoppe
License: mit
Created: 2024-08-10T14:35:30.000Z (almost 2 years ago)
Default Branch: main
Last Pushed: 2024-09-09T08:33:44.000Z (almost 2 years ago)
Last Synced: 2025-05-15T07:11:54.591Z (about 1 year ago)
Topics: bioinformatics, cpp, genomics, mutations, variant-calling, vcf
Language: C++
Homepage:
Size: 12.7 KB
Stars: 0
Watchers: 1
Forks: 0
Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE


# vcf_to_tsv :dna:

Transforms a VCF (variant call format) file to a tab-separated values (.tsv) one.

Its compilation and functionality have been verified on the following operating systems:

- macOS :green_apple:

- Linux :penguin:

# Download and Compilation :floppy_disk:

```console

>>> git clone https://github.com/alexcoppe/vcf_to_tsv

>>> cd vcf_to_tsv

>>> make

```

After compilation, move the generated executable ```vcf_to_tsv``` to a directory listed in the $PATH variable. You can identify these directories by using the ```echo $PATH``` command.

# Run the software :running_man:

This software transforms an uncompressed VCF file to a tab-separated values (tsv) file. It also works with VCFs generated by [SnpEff](https://pcingola.github.io/SnpEff/) and [ANNOVAR](https://annovar.openbioinformatics.org/en/latest/). 

To run it, you need **two arguments**: the **VCF file** and a **text file specifying the desired fields**. Refer to the table below for guidance on creating this file.

When the input is a VCF annotated by [SnpEff](https://pcingola.github.io/SnpEff/), the tool currently writes each transcript reported by SnpEff on a separate row.

Starting character | What you get
------------ | -------------
None | get a field from the VCF
: | get a subfield from the INFO field added by SnpEff
; | get a specific subfield from the INFO field
\| | get a specific subfield from the Genotype fields
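The table above can be read as a small dispatch rule on the first character of each line in the fields file. The sketch below illustrates that rule only; `FieldKind`, `FieldSpec` and `parse_field_spec` are hypothetical names and are not taken from the vcf_to_tsv source. Note that the backslash in `\|` is only Markdown escaping: in the fields file the line starts with a plain `|`.

```cpp
#include <string>

// Hypothetical types for illustration; not the data structures used by vcf_to_tsv.
enum class FieldKind { VcfColumn, SnpEffSubfield, InfoSubfield, GenotypeSubfield };

struct FieldSpec {
    FieldKind kind;
    std::string name;  // the requested name with the leading marker removed
};

// Classify one line of the fields file by its first character.
FieldSpec parse_field_spec(const std::string& line) {
    if (line.empty()) return {FieldKind::VcfColumn, ""};
    switch (line[0]) {
        case ':': return {FieldKind::SnpEffSubfield, line.substr(1)};
        case ';': return {FieldKind::InfoSubfield, line.substr(1)};
        case '|': return {FieldKind::GenotypeSubfield, line.substr(1)};
        default:  return {FieldKind::VcfColumn, line};
    }
}
```

For instance, `parse_field_spec(":hgvs_c")` yields a SnpEff subfield named `hgvs_c`, while `parse_field_spec("position")` is treated as a plain VCF field.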

Example of a text file specifying the desired fields and subfields:

```console

:hgvs_c

position

;gnomAD_genome_AMR

|AD

```

Launching the program with the above text file:

```console

vcf_to_tsv a_vcf_file_path.vcf wanted_fields.txt

```

Output:

```console

n.-3702C>T      157370625       0.0020  14,1    31,5

n.*1931C>T      157370625       0.0020  14,1    31,5

n.-3707C>T      157370630       0       15,1    33,4

...

```
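The third column of the output above appears to come from the `;gnomAD_genome_AMR` line of the fields file, i.e. from one key of the INFO column. As a rough sketch of such a lookup (C++17, with a hypothetical function name `info_value` that is not taken from the vcf_to_tsv source), the INFO column can be treated as `KEY=VALUE` entries separated by `;`:

```cpp
#include <optional>
#include <sstream>
#include <string>

// Look up one key in a VCF INFO column ("KEY=VALUE;KEY=VALUE;FLAG").
// Returns std::nullopt when the key is absent; a flag without '=' yields an empty string.
std::optional<std::string> info_value(const std::string& info, const std::string& key) {
    std::istringstream entries(info);
    std::string entry;
    while (std::getline(entries, entry, ';')) {
        const auto eq = entry.find('=');
        // substr(0, npos) returns the whole entry, which covers flags without '='.
        if (entry.substr(0, eq) == key) {
            return eq == std::string::npos ? std::string() : entry.substr(eq + 1);
        }
    }
    return std::nullopt;
}
```

With a made-up INFO string such as `DP=46;gnomAD_genome_AMR=0.0020;DB`, calling `info_value(info, "gnomAD_genome_AMR")` returns `0.0020`.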

Currently, the software supports only VCFs with 1 or 2 genotype fields.
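In the example output, the single `|AD` request produces two columns (`14,1` and `31,5`), presumably one per genotype field. A minimal sketch of that lookup follows, assuming the standard VCF layout in which the FORMAT column lists the keys (for example `GT:AD:DP`) and each genotype field lists the values in the same order. The helper names `split` and `genotype_values` are hypothetical and not taken from the vcf_to_tsv source.

```cpp
#include <cstddef>
#include <sstream>
#include <string>
#include <vector>

// Split a string on a single delimiter character.
std::vector<std::string> split(const std::string& text, char delim) {
    std::vector<std::string> parts;
    std::istringstream in(text);
    std::string part;
    while (std::getline(in, part, delim)) parts.push_back(part);
    return parts;
}

// For each genotype field, return the value stored under `key` in FORMAT.
// Fields that lack the key get ".", the VCF missing-value marker.
std::vector<std::string> genotype_values(const std::string& format,
                                         const std::vector<std::string>& samples,
                                         const std::string& key) {
    const std::vector<std::string> keys = split(format, ':');
    std::size_t index = keys.size();  // keys.size() means "key not found"
    for (std::size_t i = 0; i < keys.size(); ++i) {
        if (keys[i] == key) { index = i; break; }
    }
    std::vector<std::string> values;
    for (const std::string& sample : samples) {
        const std::vector<std::string> fields = split(sample, ':');
        const bool present = index < keys.size() && index < fields.size();
        values.push_back(present ? fields[index] : std::string("."));
    }
    return values;
}
```

Given the made-up FORMAT string `GT:AD:DP` and the genotype fields `0/1:14,1:15` and `0/1:31,5:36`, asking for `AD` returns `14,1` and `31,5`.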

The table below lists all the subfields added by [SnpEff](https://pcingola.github.io/SnpEff/), along with the corresponding subfield names used by vcf_to_tsv (first column).

Subfield by vcf_to_tsv | Subfield by SnpEff | Explanation
------------ | ------------- | -------------
:allele | Allele (or ALT) | The alternative allele
:annotation | Annotation (a.k.a. effect) | Annotated using Sequence Ontology terms
:putative_impact | Putative_impact | A simple estimation of putative impact / deleteriousness: {HIGH, MODERATE, LOW, MODIFIER}
:gene_name | Gene Name | Common gene name (HGNC)
:gene_id | Gene ID | Gene ID
:feature_type | Feature type | Which type of feature is in the next field
:feature_id | Feature ID | Depends on the annotation
:transcript_biotype | Transcript biotype | The bare minimum is at least a description of whether the transcript is {"Coding", "Noncoding"}. Whenever possible, use ENSEMBL biotypes
:rank | Rank / total | Exon or intron rank / total number of exons or introns
:hgvs_c | HGVS.c | Variant using HGVS notation (DNA level)
:hgvs_p | HGVS.p | If the variant is coding, this field describes the variant using HGVS notation (protein level)
:cdna_position | cDNA_position / cDNA_len | Position in cDNA and transcript's cDNA length (one based)
:cds_position | CDS_position / CDS_len | Position and number of coding bases (one based, includes START and STOP codons)
:protein_position | Protein_position / Protein_len | Position and number of AA (one based, including START, but not STOP)
:distance_to_feature | Distance to feature | All items in this field are optional; see the SnpEff page for details
:errors | Errors, Warnings or Information messages | Errors, warnings or informative messages that can affect annotation accuracy
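The rows of this table follow the order in which SnpEff writes its subfields inside the `ANN` entry of the INFO column: annotations are separated by `,` and the subfields of each annotation by `|`. The sketch below shows how a name such as `hgvs_c` could be mapped to a position and extracted once per annotation, which is also why one variant can produce several output rows. `kAnnNames`, `split` and `ann_values` are hypothetical names, not taken from the vcf_to_tsv source.

```cpp
#include <array>
#include <cstddef>
#include <sstream>
#include <string>
#include <vector>

// Subfield names from the table, in the order SnpEff writes them inside ANN.
const std::array<const char*, 16> kAnnNames = {
    "allele", "annotation", "putative_impact", "gene_name", "gene_id", "feature_type",
    "feature_id", "transcript_biotype", "rank", "hgvs_c", "hgvs_p", "cdna_position",
    "cds_position", "protein_position", "distance_to_feature", "errors"};

// Split a string on a single delimiter character (a trailing empty part is dropped).
std::vector<std::string> split(const std::string& text, char delim) {
    std::vector<std::string> parts;
    std::istringstream in(text);
    std::string part;
    while (std::getline(in, part, delim)) parts.push_back(part);
    return parts;
}

// Return the value of subfield `name` for every annotation in an ANN value.
// An unknown name gives an empty result; a missing subfield gives an empty string.
std::vector<std::string> ann_values(const std::string& ann, const std::string& name) {
    std::size_t index = kAnnNames.size();
    for (std::size_t i = 0; i < kAnnNames.size(); ++i) {
        if (name == kAnnNames[i]) { index = i; break; }
    }
    std::vector<std::string> values;
    if (index == kAnnNames.size()) return values;
    for (const std::string& annotation : split(ann, ',')) {
        const std::vector<std::string> fields = split(annotation, '|');
        values.push_back(index < fields.size() ? fields[index] : std::string());
    }
    return values;
}
```

Each element of the returned vector corresponds to one annotation (typically one transcript), so a caller could emit one output row per element.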
