https://github.com/lipingfangs/VAP

a programme for visualization of short sequence alignment and path navigation in graphical pan-genome

# VAP (Visualization of Read Alignments with Pangenomes)
VAP is a program for visualizing read alignments and navigating paths in a graphical pan-genome. Reads can be mapped to the graph with vg giraffe, vg map, or similar software. The .bam file generated by the alignment is the input to VAP.
![main](https://user-images.githubusercontent.com/46209789/213981873-bc18ff74-93ff-4001-8ecd-3dfc1ed5992c.jpg)

A Windows version is available for download at https://ricegenomichjx.xiaomy.net/VAP/VAPallgui.exe.

Detailed descriptions and usage instructions are available in the documentation at https://lipingfangs.github.io/VAPreadme.html.

***Introduction***

VAP is a comprehensive bioinformatics tool for analyzing graphical pangenome data and the corresponding read alignments generated by widely used mappers such as vg map and vg giraffe. It consists of the following modules:
1. Index: This module takes a GFA file as input and uses gfatools to convert it into a non-redundant variation information file for subsequent visualization and analysis.
2. Graphsamtools: This module extracts alignment information for specific intervals and their associated branch nodes in the graph-based pangenome from files in gam and bam formats, using vg, BEDtools and SAMtools.
3. ShowTrack: This module processes the node information that Graphsamtools extracts from specific intervals. It recognizes the relationships and directions between nodes, and can filter nodes by length or extract particular nodes from a list of node names.
4. ShowRead: This module processes the read information that Graphsamtools extracts from specified intervals. It handles both short-read and long-read mapping information, recognizes read pairs and split reads, and identifies and displays SNPs and small insertions/deletions through the sub-module ShowSmall. The Python package pysam is used to process bam files.
5. Coverages: This module uses Mosdepth to calculate read coverage and depth from the alignment information, so that users can visualize coverage and depth within the extracted interval of the graph-based pangenome.
6. MutSamples: This module integrates and visualizes read distribution and coverage from multiple samples within the same interval.
7. GainGene: Given an annotation file in gff format, this module extracts and visualizes gene, transcript and TE information within specified intervals.
8. PopulationFrequency: This module integrates Plink v1.9 to calculate and visualize PAV frequency across populations. Users can apply the chi-squared test, provided by scikit-learn, and set a P-value threshold (default: 0.01) to identify intervals with significant PAV differences between populations. Combined with the GainGene module and a gene annotation, this lets users pinpoint functional genes located within those intervals.
9. SupportSQ: This module restores each path in the variation graph to the segment form (Nx … Nx+i) generated by Minigraph, based on the branch information. It divides the nodes into two categories (in the reference path / not in the reference path) and sorts them by their location in the genome; nodes outside the reference genome are placed according to the start point of their variation relative to the reference genome. The first node in the reference genome covered by reliable reads is taken as the starting node (Nx). From this node, the module first determines whether adjacent nodes are connected in the mapping data, as supported by paired-end and split-read information. Read coverage is used to judge the mapped interval in the connected node Nx+1, and the paired-end and split-read information of Nx+1 is then used to search for the next connected node (Nx+2 … Nx+i). During this extension, the module attempts to identify all nodes in the extracted interval that are connected by mapping data. If a breakpoint is detected and no further connected node can be reached, VAP attempts to reset the starting node in the reference genome using read coverage. Default thresholds on paired-end and split-read support (connected nodes > 5) and sequencing read depth (> 5X) control when the extension ends.
10. Convey: This module converts GFA files stored in a non-coordinate manner into a non-redundant GFA format (rGFA) with reference genome coordinates, reducing their dimension according to the specified reference genome by adding the SN/SO/SR tags.
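
The extension step described for SupportSQ can be pictured as a walk over nodes that follows only well-supported links. The sketch below is an illustration under stated assumptions, not VAP's implementation: the function name `extend_path`, the `links` and `depth` dictionaries, and the node names are hypothetical. It follows only the single best-supported link at each step, whereas the description above says VAP tries to identify all connected nodes, and it does not model resetting the starting node after a breakpoint.

```python
from typing import Dict, List, Tuple


def extend_path(
    start: str,
    links: Dict[Tuple[str, str], int],
    depth: Dict[str, float],
    min_links: int = 5,
    min_depth: float = 5.0,
) -> List[str]:
    """Greedily extend a path from `start` through well-supported nodes.

    links[(a, b)] is the number of paired-end/split reads joining node a to
    node b; depth[n] is the mean read depth of node n. Both are hypothetical
    inputs chosen for this sketch.
    """
    path = [start]
    visited = {start}
    current = start
    while True:
        # Successors need enough linking reads and enough read depth.
        candidates = [
            (support, nxt)
            for (src, nxt), support in links.items()
            if src == current
            and nxt not in visited
            and support > min_links
            and depth.get(nxt, 0.0) > min_depth
        ]
        if not candidates:
            break  # breakpoint: no supported successor, stop extending
        _, current = max(candidates)  # simplification: follow the best link only
        path.append(current)
        visited.add(current)
    return path


# Toy data: s3 is a weakly supported branch, s2 is the supported one.
links = {("s1", "s2"): 12, ("s1", "s3"): 3, ("s2", "s4"): 9, ("s3", "s4"): 2}
depth = {"s1": 20.0, "s2": 18.0, "s3": 1.5, "s4": 22.0}
print(extend_path("s1", links, depth))
```

With the toy data, the walk passes through `s2` and skips `s3`, because `s3` fails both the link-support and the depth threshold.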

***Install***

**Dependencies**

Pysam, matplotlib, mpld3

vg, Samtools, Bedtools, mosdepth (to identify PAV and read coverage), seqkit (to display SNPs)

PHP environment (for web), Apache (for web)

**Command-line tools**

```
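cd VAP
chmod 700 graphsamtools        # make the graphsamtools script executable
python setup.py install
cd script
export PATH=$PATH:$(pwd)       # add the script directory to PATH for this shell session
chmod 700 *py                  # make the Python scripts executable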
```

**Web server**
All pages are packaged in the webserver directory. To deploy, copy the pages directly into the login directory of the server so that they are accessible. If you have any difficulty, please feel free to contact the author ([email protected]). An example platform is accessible at https://ricegenomichjx.xiaomy.net/VAP/sequenceextraction.php

***Usage***

**Generate the info file containing graph track information**

```
#For the graph generated from minigraph or Cactus-minigraph
gfatools gfa2fa -s .gfa > .fa
python VAP/runVAP.py --mode index --rfa .fa >
# The bug affecting complex regions was fixed in the latest version, so no errors should be reported at this stage.
# In earlier versions, errors caused by complicated branches tended to be reported at this step; they do not affect the next step as long as the info file was generated.

# For graphs with per-individual node information, i.e. graphs that lack the SN/SO/SR tags and instead record each individual used in graph construction in P lines
python VAP/runVAP.py --mode convey --gfa .gfa --ref > .r.fa
python VAP/runVAP.py --mode index --rfa .r.fa >
```
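
For readers unfamiliar with the SN/SO/SR tags mentioned above: in rGFA, each segment (`S`) line carries `SN` (the stable sequence name), `SO` (the offset on that sequence) and `SR` (the rank, 0 for segments on the reference). The following is a minimal sketch of reading those tags from one segment line. The function name `parse_rgfa_segment` and the example line are hypothetical, and this is not how VAP's index or convey modes are implemented.

```python
from typing import Dict, Union


def parse_rgfa_segment(line: str) -> Dict[str, Union[str, int]]:
    """Parse one rGFA segment line into its name, length and SN/SO/SR tags."""
    fields = line.rstrip("\n").split("\t")
    if fields[0] != "S":
        raise ValueError("not a segment (S) line")
    record: Dict[str, Union[str, int]] = {
        "name": fields[1],
        "length": len(fields[2]),
    }
    # Optional fields follow the sequence as TAG:TYPE:VALUE triples.
    for tag in fields[3:]:
        key, value_type, value = tag.split(":", 2)
        if key in ("SN", "SO", "SR"):
            record[key] = int(value) if value_type == "i" else value
    return record


# Hypothetical segment: 6 bp located at offset 1500 on chr1, on the reference (rank 0).
example = "S\ts2\tACGTAC\tSN:Z:chr1\tSO:i:1500\tSR:i:0"
segment = parse_rgfa_segment(example)
print(segment)
```

A graph whose segment lines carry no such tags has no reference coordinates, which is the case the convey mode is described as handling.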

**Extract the main and branch paths related to a reference genome interval**

The bam file can be generated by vg giraffe or similar software. It should be sorted with samtools sort and then indexed.

```
#From .bam file
graphsamtools

#From .gam file (running vg may take longer; the file of was generated by vg autoindex with the -w map parameter)
graphsamtools gam
```
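
The sort-and-index preparation of the bam file mentioned above can be scripted. This is a small sketch that only builds the two samtools command lines; the file names and the helper name `sort_and_index_commands` are placeholders, and running the commands requires samtools on the PATH.

```python
import shlex
from typing import List


def sort_and_index_commands(bam_in: str, bam_out: str, threads: int = 1) -> List[List[str]]:
    """Return the samtools commands that coordinate-sort and index a bam file."""
    return [
        ["samtools", "sort", "-@", str(threads), "-o", bam_out, bam_in],
        ["samtools", "index", bam_out],
    ]


# Placeholder file names; replace them with your own alignment files.
commands = sort_and_index_commands("sample.bam", "sample.sorted.bam", threads=4)
for cmd in commands:
    print(shlex.join(cmd))
    # To execute instead of printing: subprocess.run(cmd, check=True)
```

Building the commands as lists rather than strings avoids shell-quoting problems when paths contain spaces.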

In SNP mode (intervals within 2000 bp):
```
graphsamtools .fa

```
**Display only the graph**

Large intervals are supported (e.g. a whole chromosome).

```
graphsamtools onlytrack
```

**Population mode**

Keep the script freqacq.py in the same directory as graphsamtools.

```
graphsamtools population , ,

```
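
To make the population comparison concrete, here is a sketch of a chi-squared test on presence/absence counts of one PAV in two populations, using the default P-value threshold of 0.01 mentioned in the module description. It is written in plain Python rather than with scikit-learn, which is what VAP is described as using, so it should be read as an illustration only. The function name `chi2_2x2` and the counts are hypothetical; no continuity correction is applied, and every row and column total is assumed to be non-zero.

```python
import math
from typing import Tuple


def chi2_2x2(present_a: int, absent_a: int, present_b: int, absent_b: int) -> Tuple[float, float]:
    """Pearson chi-squared test on a 2x2 presence/absence table (1 degree of freedom)."""
    observed = [[present_a, absent_a], [present_b, absent_b]]
    row_totals = [sum(row) for row in observed]
    col_totals = [observed[0][j] + observed[1][j] for j in range(2)]
    total = sum(row_totals)
    chi2 = 0.0
    for i in range(2):
        for j in range(2):
            expected = row_totals[i] * col_totals[j] / total
            chi2 += (observed[i][j] - expected) ** 2 / expected
    # For 1 degree of freedom, the upper-tail probability is erfc(sqrt(chi2 / 2)).
    p_value = math.erfc(math.sqrt(chi2 / 2.0))
    return chi2, p_value


# Hypothetical counts: the PAV is present in 45/50 samples of population A
# and in 10/50 samples of population B.
chi2, p = chi2_2x2(45, 5, 10, 40)
print(f"chi2={chi2:.2f}, p={p:.3g}, significant={p < 0.01}")
```

An interval would be reported as differential when `p` falls below the chosen threshold.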
**Upload and visualize on the web server**

```
tar zcvf .tar.gz
```
Upload the resulting .tar.gz file to the web server.
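
If you prefer to create the archive from a Python script rather than with `tar`, the standard-library `tarfile` module can produce an equivalent gzip-compressed archive. This is a sketch only: the helper name `pack_directory` and the directory name in the demo are hypothetical stand-ins for a directory produced by graphsamtools.

```python
import tarfile
import tempfile
from pathlib import Path


def pack_directory(src_dir: str, archive_path: str) -> str:
    """Create a .tar.gz of src_dir, keeping the directory name as the top-level entry."""
    src = Path(src_dir)
    with tarfile.open(archive_path, "w:gz") as tar:
        tar.add(str(src), arcname=src.name)
    return archive_path


# Demo on a throwaway directory so the sketch runs anywhere.
with tempfile.TemporaryDirectory() as tmp:
    out_dir = Path(tmp) / "chr1_1000_2000"  # hypothetical output directory name
    out_dir.mkdir()
    (out_dir / "info.txt").write_text("placeholder\n")
    archive = pack_directory(str(out_dir), str(Path(tmp) / "chr1_1000_2000.tar.gz"))
    with tarfile.open(archive) as tar:
        print(tar.getnames())
```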

**Drawing and graph navigation from the command line**

```

usage: python runVAP.py -h (for more help)

VSAG is a software for the visualization of short read alignment of graphical pan-genome

optional arguments:
  -h, --help            show this help message and exit
  --inindex ININDEX     The directory of your data after processing with graphsamtools
  --out OUT             The output file name of the image
  --geneinfo GENEINFO   BED file containing gene information
  --gff GFF             Annotation file of the graph genome
  --fa FA               Phase the sequence of reliable tracks
  --drawtype DRAWTYPE   Type of data to visualize (onlytrack/read/coverage/mutiplesamples/Popultaion)
  --anntracks ANNTRACKS
                        Highlight the paired-end supported tracks (pathways), default: no (0)
  --pairend PAIREND     Display the paired-end information or not, default: no (0)
  --pairendrange PAIRENDRANGE
                        The search range in the main track, default: 200
  --pairendtheraold PAIRENDTHERAOLD
                        The selected threshold in the main track, default: 1
  --gaingene GAINGENE   Gain the genes from the gff file, default: 0
  --fl FL               Filter out tracks with lengths below this value
  --td TD               Draw the track direction or not
  --rd RD               Draw the read direction or not
  --rn RN               Draw the read name or not
  --legend LEGEND       Draw the legend or not
  --legendheight LEGENDHEIGHT
                        The height of the legend
  --snp SNP             Draw the SNP information or not (interval < 2000 bp)
  --onlysv ONLYSV       Only display SVs larger than the threshold (default: > 5 bp) or not
  --onlysvthersold ONLYSVTHERSOLD
                        Length threshold; only SVs larger than this are displayed
  --coveragesteplength COVERAGESTEPLENGTH
                        Step length for coverage (default: 100 bp)
  --middle MIDDLE       Center the tracks and reads, default: yes (1)
  --trackcolor TRACKCOLOR
                        Track colors for the main track and the branches, default: #CDCD00,#00BFFF
  --readcolor READCOLOR
                        Read colors, default: #FFC1C1
  --pairendcolor PAIRENDCOLOR
                        Colors of paired-end links, default: yellow
  --coveragecolor COVERAGECOLOR
                        Coverage colors, default: #FF7F50
  --anncolor ANNCOLOR   Read colors, default: #D02090
  --genecolor GENECOLOR
                        Gene colors, default: black
  --mutilplesamplecolor MUTILPLESAMPLECOLOR
                        Multiple-sample colors, default: #FFDEAD,#FFA54F
  --sx SX               Size of X
  --sy SY               Size of Y
  --xlabel XLABEL       Label of your X axis
  --ylabel YLABEL       Label of your Y axis
  --ppi PPI             The dpi of your image
  --imtype IMTYPE       The image type of the output; default: png; supported: pdf, svg, jpg
  --dw DW               Web output (.html) or not

```
***Command examples***

**Display only the graph**
```
python runVAP.py --inindex --drawtype onlytrack
```
**Draw the read alignments**

```
python runVAP.py --inindex
```

**Draw the coverage of the read alignments**

```
python runVAP.py --inindex --drawtype coverage
```
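
The `--coveragesteplength` option (default 100 bp in the help text above) sets the window size used when coverage is drawn. The sketch below shows one plausible reading of that idea, averaging per-base depth within fixed-length windows. VAP itself obtains coverage from mosdepth, so the function name `window_means`, the per-base input list, and the use of the mean are assumptions made only for illustration.

```python
from typing import List


def window_means(depths: List[int], step: int = 100) -> List[float]:
    """Average per-base depth over consecutive windows of `step` bases.

    The final window may be shorter than `step`; it is averaged over its own length.
    """
    if step <= 0:
        raise ValueError("step must be positive")
    means = []
    for start in range(0, len(depths), step):
        window = depths[start:start + step]
        means.append(sum(window) / len(window))
    return means


# Hypothetical 230 bp interval: uniform depth, then a gap next to a deeper stretch, then a short tail.
depths = [10] * 100 + [0] * 50 + [30] * 50 + [4] * 30
print(window_means(depths))
```

A smaller step shows finer detail, such as the zero-depth gap inside the second window, at the cost of a noisier plot.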

**Draw the alignment with split-read information and predict the reliable segments (nodes)**

```
# The directory should be generated by the gam mode of graphsamtools
python VAP/runVAP.py --inindex --mode gam --readspilt 1 --anntrack 1

```

**Draw the alignment with read-pair information and predict the reliable segments (nodes)**

```

python runVAP.py --inindex --pairend 1 --anntrack 1

```

**Draw the distribution of genes from a .gff file (interval-based extraction is implemented)**

```
python runVAP.py --inindex --gaingene 1 --gff
```

**Draw the distribution of coverage between populations**

```
python runVAP.py --inindex --drawtype populationfreq
```

***Contact and citation***

If you have any problems or comments about usage, please feel free to contact the author ([email protected]), who will reply promptly!

If the software contributes to your research, please cite the paper:

Fangping Li, Haifei Hu, Zitong Xiao, Jingming Wang, Jieying Liu, Deshu Zhao, Yu Fu, Yijun Wang, Xue Yuan, Suhong Bu, Xiaofan Zhou, Junliang Zhao, Shaokui Wang. Visualization and review of reads alignment on the graphical pan-genome with VAP. bioRxiv 2023.01.20.524849; doi: https://doi.org/10.1101/2023.01.20.524849

Thank you so much!