https://github.com/starsareintherose/rgbepp
Reference Genome based Exon Phylogeny Pipeline
- Host: GitHub
- URL: https://github.com/starsareintherose/rgbepp
- Owner: starsareintherose
- License: gpl-2.0
- Created: 2024-07-05T08:34:03.000Z (6 months ago)
- Default Branch: master
- Last Pushed: 2024-12-08T07:08:06.000Z (about 1 month ago)
- Last Synced: 2024-12-08T07:19:06.270Z (about 1 month ago)
- Topics: exome, exome-sequencing, exome-sequencing-analysis, next-generation-sequencing, phylogeny
- Language: D
- Homepage: https://git.malacology.net/malacology/RGBEPP
- Size: 107 KB
- Stars: 0
- Watchers: 1
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE.md
# RGB EPP
Reference Genome based Exon Phylogeny Pipeline
License: GPL-2.0-only
Author: Guoyi Zhang
## Requirements
### External software
- fastp
- spades.py (provided by spades)
- diamond
- bowtie2
- samtools
- bcftools
- exonerate (optional, only for --codon)
- java
- macse (default recognized path: /usr/share/java/macse.jar)
- trimal
### Internal software
- sortdiamond (default recognized path: /usr/bin/sortdiamond)
- delstop (default recognized path: /usr/bin/delstop)
## Arguments
### Details
```
-c --config config file for software path (optional)
-g --genes gene file path (optional, if -r is specified)
-f --functions functions type (optional): all clean assembly
map postmap varcall consen codon align trim
-h --help show this information
-l --list list file path
-m --memory memory settings (optional, default 16 GB)
-r --reference reference genome path
-t --threads threads setting (optional, default 8 threads)
--codon Only use the codon region (optional)
--fastp Fastp path (optional)
--spades Spades python path (optional)
--diamond Diamond python path (optional)
--sortdiamond SortDiamond python path (optional)
--bowtie2 Bowtie2 path (optional)
--samtools Samtools path (optional)
--bcftools Bcftools path (optional)
--exonerate Exonerate path (optional)
--macse Macse jarfile path (optional)
--delstop Delstop path (optional)
--trimal Trimal path (optional)
for example: ./RGBEPP -f all -l list -t 8 -r reference.fasta
```
### Directories Design
```
.
├── 00_raw
├── 01_fastp
├── 02_spades
├── 03_bowtie2
├── 04_bam
├── 05_vcf
├── 06_consen
├── 07_macse
├── 08_trimal
├── list
├── gene
├── reference.aa.fasta
└── RGBEPP
```
Each directory corresponds to one function.
`00_raw` should contain all raw fastq.gz data.
### Text Files
`list` is the text file containing all sample names, one per line. If your raw data files are named ${list_name}\_R1.fastq.gz and ${list_name}\_R2.fastq.gz, then ${list_name} is what you should put in the `list` file. On Linux/Unix, an easy way to generate it is:
```
cd 00_raw
ls | sed 's@_R[12]\.fastq\.gz$@@' | sort -u > ../list
cd ..
```
`genes` is the text file containing all gene names from the reference FASTA file. On Linux/Unix, an easy way to generate it is:
```
grep '^>' reference.fasta | sed 's@^>@@' > genes
```
`reference.aa.fasta` can have any other name, but it must contain the reference amino acid sequences in FASTA format.
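Before starting a run, it can help to confirm that every sample named in `list` has both read files in `00_raw`. The following is a minimal sketch under that naming convention; the function name `check_pairs` is hypothetical and not part of RGBEPP.

```shell
# check_pairs LIST_FILE RAW_DIR   (hypothetical helper, not part of RGBEPP)
# Prints every expected read file that is missing and returns non-zero
# if at least one file is missing.
check_pairs() {
    list_file=$1
    raw_dir=$2
    missing=0
    while IFS= read -r sample || [ -n "$sample" ]; do
        # Skip blank lines in the list file.
        [ -z "$sample" ] && continue
        for mate in R1 R2; do
            f="$raw_dir/${sample}_${mate}.fastq.gz"
            if [ ! -f "$f" ]; then
                printf 'missing: %s\n' "$f"
                missing=1
            fi
        done
    done < "$list_file"
    return "$missing"
}
```

Running `check_pairs list 00_raw` from the working directory prints nothing and returns 0 when every sample has both files.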
## Process
### RGBEPP functions
- Function clean: Quality control + trimming (fastp)
- Function assembly: de novo assembly (spades)
- Function map: local alignment search of nucleotide sequences against amino acid subject sequences (diamond, sortdiamond), then mapping of each sample's raw reads to its scaffold sequences (bowtie2)
- Function postmap: sorting and marking the read alignments (samtools)
- Function varcall: variant calling and filtering (bcftools)
- Function consen: get consensus fasta file from vcf files (bcftools), then sort sequences based on gene name and taxa name (RGBEPP)
- Function codon (optional): only extract the exon sequence (exonerate)
- Function align: codon-based multiple sequence alignment (macse)
- Function trim: trimming based on codon (trimal, delstop)
### Argument requirements for functions
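When functions are run one at a time rather than with `-f all`, the requirements table in this section can be encoded as a small lookup. The sketch below only prints the command line for one function and does not run anything; `required_args` and `build_cmd` are hypothetical helper names, and the file names passed in are placeholders.

```shell
# required_args FUNCTION   (hypothetical helper)
# Prints the option letters that the requirements table marks for FUNCTION.
required_args() {
    case $1 in
        clean|assembly|postmap|varcall) echo "l" ;;
        map)        echo "l r" ;;
        consen)     echo "g l" ;;
        codon)      echo "g r" ;;
        align|trim) echo "g" ;;
        *) echo "unknown function: $1" >&2; return 1 ;;
    esac
}

# build_cmd FUNCTION GENES LIST REFERENCE   (hypothetical helper)
# Prints an RGBEPP command line that passes only the required arguments.
build_cmd() {
    fn=$1 genes=$2 list=$3 ref=$4
    args=$(required_args "$fn") || return 1
    cmd="./RGBEPP -f $fn"
    for a in $args; do
        case $a in
            g) cmd="$cmd -g $genes" ;;
            l) cmd="$cmd -l $list" ;;
            r) cmd="$cmd -r $ref" ;;
        esac
    done
    printf '%s\n' "$cmd"
}
```

For example, `build_cmd map genes list reference.aa.fasta` prints `./RGBEPP -f map -l list -r reference.aa.fasta`.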
| Functions | -g/--genes | -l/--list | -r/--reference |
| --------- | --------- | --------- | -------------- |
| clean | | ✔ | |
| assembly | | ✔ | |
| map | | ✔ | ✔ |
| postmap | | ✔ | |
| varcall | | ✔ | |
| consen | ✔ | ✔ | |
| codon | ✔ | | ✔ |
| align | ✔ | | |
| trim | ✔ | | |
### Downstream process
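Concatenation is normally done with one of the dedicated tools named in this section. For a quick look at a combined matrix without them, the idea can be sketched with awk. This is not how those tools are implemented; it assumes that each input file is one gene alignment and that each header line is just `>` followed by the taxon name. The function name `concat_alignments` is hypothetical.

```shell
# concat_alignments FILE...   (hypothetical helper)
# Joins per-gene FASTA alignments into one matrix on stdout. Taxa that are
# absent from a gene are filled with "-" for the length of that gene.
concat_alignments() {
    awk '
        # Append the gene just read to every taxon, padding absent taxa.
        function close_gene(   i, t, s, glen) {
            glen = 0
            for (t in cur) if (length(cur[t]) > glen) glen = length(cur[t])
            for (i = 1; i <= n; i++) {
                t = order[i]
                s = (t in cur) ? cur[t] : ""
                while (length(s) < glen) s = s "-"
                cat[t] = cat[t] s
            }
            total += glen
            split("", cur)
        }
        # A new input file starts: close the previous gene first.
        FNR == 1 && NR > 1 { close_gene() }
        /^>/ {
            name = substr($0, 2)
            if (!(name in seen)) {
                # First sighting of this taxon: pad all earlier genes.
                seen[name] = 1
                order[++n] = name
                pad = ""
                while (length(pad) < total) pad = pad "-"
                cat[name] = pad
            }
            cur[name] = ""
            next
        }
        { cur[name] = cur[name] $0 }
        END {
            close_gene()
            for (i = 1; i <= n; i++) print ">" order[i] "\n" cat[order[i]]
        }
    ' "$@"
}
```

Taxa are written in the order in which they first appear across the input files.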
- concatenate sequences via SeqCombGo or catsequences or sequencematrix
- coalescent / concatenated phylogeny
## Inner software
### sortdiamond
Usage: `sortdiamond diamond_output.m8 generated.fasta sseq,qstart,qend,bitscore/evalue,qseq(optional, default 1,6,7,11,17, start from 0) bitscore/evalue(optional, default bitscore)`
The indices are counted from 0, so with the defaults sseq is the 2nd column, qstart is the 7th column, and so on.
Diamond's default tabular output (--outfmt 6) does not contain qseq, so you must customize the field list of output format 6 to include it.
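Because the column argument depends on the field list given to Diamond, it can be derived from that list rather than counted by hand. The sketch below is an illustration only. The 13-field list in the usage note is an assumption, not the format RGBEPP uses internally, and it also assumes that "sseq" in the usage line refers to the subject ID column (`sseqid`). The function name `field_indices` is hypothetical.

```shell
# field_indices "FIELDS" WANTED...   (hypothetical helper)
# Prints the 0-based positions of the WANTED field names within the
# space-separated FIELDS string, joined by commas.
field_indices() {
    fields=$1
    shift
    result=""
    for want in "$@"; do
        idx=0
        found=""
        # Unquoted on purpose: split the field list into words.
        for f in $fields; do
            if [ "$f" = "$want" ]; then
                found=$idx
                break
            fi
            idx=$((idx + 1))
        done
        if [ -z "$found" ]; then
            echo "field not in format: $want" >&2
            return 1
        fi
        result="${result:+$result,}$found"
    done
    printf '%s\n' "$result"
}
```

With `fields="qseqid sseqid pident length mismatch gapopen qstart qend sstart send evalue bitscore qseq"`, the call `field_indices "$fields" sseqid qstart qend bitscore qseq` prints `1,6,7,11,12`, which could then be passed as the third argument of `sortdiamond`.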
### delstop
`delstop --delete`
Deletes the stop codons generated by macse. fasta_aa and fasta_nt should be macse output files; use `--delete` when the downstream software is trimal.
### splitfasta
Usage: `splitfasta sample.fasta`
It always creates its output directories in the directory where you run splitfasta and places the split FASTA files in them.