https://github.com/alexcoppe/vcf_to_tsv

Transforms a VCF (variant call format) file to a tab-separated values (.tsv) one

bioinformatics cpp genomics mutations variant-calling vcf

# vcf_to_tsv :dna:
Transforms a VCF (variant call format) file to a tab-separated values (.tsv) one.

Its compilation and functionality have been verified on the following operating systems:

- macOS :green_apple:
- Linux :penguin:

# Download and Compilation :floppy_disk:

```console
>>> git clone https://github.com/alexcoppe/vcf_to_tsv
>>> cd vcf_to_tsv
>>> make
```

After compilation, move the generated executable `vcf_to_tsv` to a directory listed in your `$PATH` variable. You can list these directories with the `echo $PATH` command.

# Run the software :running_man:

This software transforms an uncompressed VCF file to a tab-separated values (tsv) file. It also works with VCFs generated by [SnpEff](https://pcingola.github.io/SnpEff/) and [ANNOVAR](https://annovar.openbioinformatics.org/en/latest/).
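
Because the tool expects an uncompressed VCF, a gzip-compressed file (for example `input.vcf.gz`) has to be decompressed before it is passed in. A minimal Python sketch using only the standard library; the function name `decompress_vcf` and the file names are hypothetical and not part of vcf_to_tsv:

```python
import gzip
import shutil
from pathlib import Path


def decompress_vcf(src: str, dst: str) -> Path:
    """Write an uncompressed copy of a gzip-compressed VCF and return its path."""
    with gzip.open(src, "rb") as compressed, open(dst, "wb") as plain:
        # Stream in chunks so large VCFs are not loaded into memory at once.
        shutil.copyfileobj(compressed, plain)
    return Path(dst)
```

For example, `decompress_vcf("input.vcf.gz", "input.vcf")` would leave the compressed file in place and create `input.vcf` next to it.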

To run it, you need **two arguments**: the **VCF file** and a **text file specifying the desired fields**. Refer to the table below for guidance on creating this file.

When run on a [SnpEff](https://pcingola.github.io/SnpEff/)-annotated VCF, the tool currently writes each transcript reported by SnpEff on a separate row.

Starting character | What you get
------------ | -------------
None | get the fields from the VCF
: | get a subfield from the INFO field added by SnpEff
; | get a specific subfield from the INFO field
\| | get a specific subfield from the Genotype fields

Example of a text file specifying the desired fields and subfields:

```console
:hgvs_c
position
;gnomAD_genome_AMR
|AD
```

Launching the program with the above text file:

```console
vcf_to_tsv a_vcf_file_path.vcf wanted_fields.txt
```

Output:

```console
n.-3702C>T 157370625 0.0020 14,1 31,5
n.*1931C>T 157370625 0.0020 14,1 31,5
n.-3707C>T 157370630 0 15,1 33,4
...
```
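
The starting-character rules from the table above can be sketched in a few lines of Python. This illustrates how a line of the fields file maps to a source in the VCF; it is not the tool's actual implementation, and the names `PREFIX_SOURCES` and `classify_field` are hypothetical.

```python
# Starting character -> where the value comes from, following the table above.
PREFIX_SOURCES = {
    ":": "SnpEff INFO subfield",
    ";": "INFO subfield",
    "|": "genotype subfield",
}


def classify_field(spec: str) -> tuple[str, str]:
    """Split one line of the fields file into (source, name)."""
    spec = spec.strip()
    source = PREFIX_SOURCES.get(spec[:1])
    if source is None:
        # No recognized starting character: a regular VCF field.
        return ("VCF field", spec)
    return (source, spec[1:])


wanted = [":hgvs_c", "position", ";gnomAD_genome_AMR", "|AD"]
for line in wanted:
    print(classify_field(line))
```

In the example output, the single `|AD` line appears to yield one column per genotype field (`14,1` and `31,5`), which would explain why four requested fields produce five columns.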

Currently, the software only works with VCF files that contain 1 or 2 genotype fields.
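
To check a file against this limit before running the tool, the sample columns can be counted from the `#CHROM` header line. The sketch below assumes that "genotype fields" refers to the per-sample columns that follow `FORMAT` in a standard VCF; the function names are hypothetical.

```python
import io

FIXED_COLUMNS = 9  # CHROM POS ID REF ALT QUAL FILTER INFO FORMAT


def sample_names(vcf_lines) -> list[str]:
    """Return the sample column names from the #CHROM header line."""
    for line in vcf_lines:
        if line.startswith("#CHROM"):
            return line.rstrip("\r\n").split("\t")[FIXED_COLUMNS:]
    raise ValueError("no #CHROM header line found")


def has_supported_sample_count(vcf_lines) -> bool:
    """True when the VCF has 1 or 2 sample (genotype) columns."""
    return len(sample_names(vcf_lines)) in (1, 2)


# Illustrative header with two samples; any iterable of lines works,
# including an open file handle.
header = (
    "##fileformat=VCFv4.2\n"
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tnormal\ttumor\n"
)
print(sample_names(io.StringIO(header)))
```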

The table below lists all the subfields added by [SnpEff](https://pcingola.github.io/SnpEff/) along with the corresponding subfield names used in vcf_to_tsv (first column).

Subfield by vcf_to_tsv | Subfield by SnpEff | Explanation
------------ | ------------- | -------------
:allele | Allele (or ALT) | The alternative allele
:annotation | Annotation (a.k.a. effect) | Annotated using Sequence Ontology terms
:putative_impact | Putative_impact | A simple estimation of putative impact / deleteriousness : {HIGH, MODERATE, LOW, MODIFIER}
:gene_name | Gene Name | Common gene name (HGNC)
:gene_id | Gene ID | Gene ID
:feature_type | Feature type | Which type of feature is in the next field
:feature_id | Feature ID | Depends on the annotation
:transcript_biotype | Transcript biotype | The bare minimum is at least a description on whether the transcript is {"Coding", "Noncoding"}. Whenever possible, use ENSEMBL biotypes
:rank | Rank / total | Exon or Intron rank / total number of exons or introns
:hgvs_c | HGVS.c | Variant using HGVS notation (DNA level)
:hgvs_p | HGVS.p | If variant is coding, this field describes the variant using HGVS notation (Protein level)
:cdna_position | cDNA_position / cDNA_len | Position in cDNA and transcript's cDNA length (one based)
:cds_position | CDS_position / CDS_len | Position and number of coding bases (one based, includes START and STOP codons)
:protein_position | Protein_position / Protein_len | Position and number of AA (one based, including START, but not STOP)
:distance_to_feature | Distance to feature | All items in this field are optional; see the SnpEff page for details
:errors | Errors, Warnings or Information messages | Errors, warnings or informative messages that can affect annotation accuracy
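
To make the mapping concrete, here is a Python sketch that splits a SnpEff `ANN` value into the subfield names from the first column of the table. It assumes the table follows the order in which SnpEff writes the pipe-separated subfields, with multiple annotations separated by commas. The `ANN` string is made up for illustration, and the code is not taken from vcf_to_tsv.

```python
# vcf_to_tsv names from the table above, in the order SnpEff writes them in ANN.
ANN_SUBFIELDS = [
    "allele", "annotation", "putative_impact", "gene_name", "gene_id",
    "feature_type", "feature_id", "transcript_biotype", "rank", "hgvs_c",
    "hgvs_p", "cdna_position", "cds_position", "protein_position",
    "distance_to_feature", "errors",
]


def parse_ann(ann_value: str) -> list[dict[str, str]]:
    """Return one dict per annotation (typically one per transcript)."""
    rows = []
    for annotation in ann_value.split(","):
        values = annotation.split("|")
        rows.append(dict(zip(ANN_SUBFIELDS, values)))
    return rows


# Made-up ANN value with two annotations for the same variant.
example = (
    "T|upstream_gene_variant|MODIFIER|GENE1|ID1|transcript|TX1|lncRNA||n.-3702C>T|||||3702|,"
    "T|downstream_gene_variant|MODIFIER|GENE1|ID1|transcript|TX2|lncRNA||n.*1931C>T|||||1931|"
)
for row in parse_ann(example):
    print(row["feature_id"], row["hgvs_c"])
```

Each dict corresponds to one output row, which mirrors how the tool writes one row per transcript for SnpEff-annotated files.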