# ELIGOS2
![eligos](images/eligos_logo_web.png)
## **E**pitranscriptional/(Epigenomical) **L**andscape **I**nferring from **G**litches of **O**NT **S**ignals (**version 2**)

## **SUMMARY**

Oxford Nanopore Technologies (ONT) offers a sequencing platform that enables us to sequence DNA and RNA in native form without amplification. Any modifications present on the native sequences are therefore preserved and are reflected in the recorded ionic signals. Signal alterations that deviate from the canonical base-calling model are misinterpreted by the base caller as sequencing errors. We use these errors to identify the positions of modifications on RNA transcripts or DNA sequences.

**ELIGOS** was developed to identify the positions of modifications on native RNA sequences from the difference in error at specific base (ESB) between the native RNA sequences and a reference. We use a standard statistical analysis, Fisher's exact test, to evaluate this difference in error. The reference can be unmodified RNA sequences derived from in vitro transcription, cDNA sequences, or our background error model (rBEM), which mimics the systematic errors of unmodified RNA sequences.
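
To make the statistical idea concrete, here is a minimal Python sketch of Fisher's exact test on a 2x2 table of error and correct base counts at one position. It is an illustration only, not the ELIGOS implementation: the function names are hypothetical, the one-sided alternative (more errors in the test sample) is an assumption, and the counts are made up.

```python
from math import comb

def odds_ratio(test_err, test_cor, ctrl_err, ctrl_cor):
    """Sample odds ratio of the 2x2 table (error vs. correct, test vs. control)."""
    if test_cor == 0 or ctrl_err == 0:
        return float("inf")
    return (test_err * ctrl_cor) / (test_cor * ctrl_err)

def fisher_exact_greater(test_err, test_cor, ctrl_err, ctrl_cor):
    """One-sided Fisher's exact p-value (assumed alternative: more errors in test).

    Sums hypergeometric probabilities over all tables with the same margins
    that have at least `test_err` errors in the test sample.
    """
    n_test = test_err + test_cor
    n_ctrl = ctrl_err + ctrl_cor
    n_err = test_err + ctrl_err
    total_tables = comb(n_test + n_ctrl, n_err)
    upper = min(n_test, n_err)
    favourable = sum(
        comb(n_test, k) * comb(n_ctrl, n_err - k)
        for k in range(test_err, upper + 1)
    )
    return favourable / total_tables

# Made-up counts: 3 errors / 1 correct in test, 1 error / 3 correct in control.
print(odds_ratio(3, 1, 1, 3))
print(round(fisher_exact_greater(3, 1, 1, 3), 4))
```

A position would be reported when the p-value and odds ratio pass the chosen cutoffs, which is the role played by the `--pval` and `--oddR` options in the examples further down.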

**ELIGOS can currently be applied to perform:**

**1.** Differential epitranscriptome analysis between two different conditions (DNA, RNA) (see example 1)

**2.** Epitranscriptome profiling (RNA) (see example 2)

**3.** Identification of DNA modifications such as DNA adducts (see example 3)

**Please cite**:
1. Piroon Jenjaroenpun, Thidathip Wongsurawat, Taylor D Wadley, Trudy M Wassenaar, Jun Liu, Qing Dai, Visanu Wanchai, Nisreen S Akel, Azemat Jamshidi-Parsian, Aime T Franco, Gunnar Boysen, Michael L Jennings, David W Ussery, Chuan He, Intawat Nookaew, Decoding the epitranscriptional landscape from native RNA sequences. Nucleic Acids Research, 2020, gkaa620, https://doi.org/10.1093/nar/gkaa620
2. Intawat Nookaew, Piroon Jenjaroenpun, Hua Du, Pengcheng Wang, Jun Wu, Thidathip Wongsurawat, Sun Hee Moon, En Huang, Yinsheng Wang, Gunnar Boysen, Detection and discrimination of DNA adducts differing in size, regiochemistry and functional group by nanopore sequencing. Chemical Research in Toxicology, 2020, https://doi.org/10.1021/acs.chemrestox.0c00202

**SRA Fast5**:
https://trace.ncbi.nlm.nih.gov/Traces/sra/?study=SRP166020

**Journal cover**:

![cover](images/two_cover.png)

---
## **INSTALLATION**
### **ELIGOS installation on Linux/Mac/Windows(WSL)**
1. **Miniconda3 installation**
[Click link](https://docs.conda.io/en/latest/miniconda.html "Miniconda3 Installation")

2. **Download ELIGOS via GitLab**
```bash
## Download ELIGOS from GitLab
git clone https://gitlab.com/piroonj/eligos2.git

## Go to ELIGOS folder
cd eligos2
```
3. **Creating conda environment for ELIGOS**
* Creating an environment with commands:
```bash
## Install environment
# This version-pinned conda create may run into package conflicts:
# conda create -n eligos2 -c bioconda -c conda-forge -c anaconda python=3.6 pysam=0.13 pandas=0.23.4 pybedtools=0.8.0 bedtools=2.25 rpy2=2.8.5 r-base=3.4.1 tqdm=4.40.2 numpy=1.11.3

# Leaving out the version pins has worked for the author more recently:
conda install mamba
conda update mamba
mamba create -n eligos2 -c conda-forge -c bioconda -c r python=3.10 pysam pandas pybedtools bedtools rpy2 r-base tqdm numpy

## Activate ELIGOS environment
conda activate eligos2

## Install samplesizeCMH module
Rscript -e 'install.packages("samplesizeCMH", repos="https://cloud.r-project.org")'

## Export ELIGOS to system environment
export PATH=$PWD:$PWD/Scripts:$PATH

## Run ELIGOS
eligos2 -h
```
* Creating an environment from an environment.yml file:
```bash
## Install environment
conda env create -f eligos2.linux.yml

## Activate ELIGOS environment
conda activate eligos2

## Install samplesizeCMH module
Rscript -e 'install.packages("samplesizeCMH", repos="https://cloud.r-project.org")'

## Export ELIGOS to system environment
export PATH=$PWD:$PWD/Scripts:$PATH

## Run ELIGOS
eligos2 -h
```
### ELIGOS installation on Docker or Singularity containers
1. Install Docker or Singularity on your computer, following the installation guide for [Docker](https://docs.docker.com/install/ "Docker Installation") or [Singularity](https://sylabs.io/singularity/ "Singularity Installation").
2. Install ELIGOS
* Install ELIGOS from DockerHub
```bash
## Install ELIGOS from DockerHub
docker pull piroonj/eligos2

## Run ELIGOS
docker run --rm -v $PWD:$PWD -w $PWD piroonj/eligos2 eligos2 -h
```
* Install ELIGOS from Dockerfile
```bash
## Download ELIGOS from GitLab
git clone https://gitlab.com/piroonj/eligos2.git

## Go to DockerFiles folder
cd eligos2/DockerFiles

## Build ELIGOS on docker
docker build -t piroonj/eligos2:latest .

## Run ELIGOS
docker run --rm -v $PWD:$PWD -w $PWD piroonj/eligos2 eligos2 -h
```

* Install ELIGOS via Singularity image
```bash
## Create Singularity images of ELIGOS from DockerHub
singularity build eligos2.sif docker://piroonj/eligos2:latest

## Run ELIGOS
singularity exec eligos2.sif eligos2 -h
```

---

## **Usage**
**Main functions**
```
eligos2 -h

usage: eligos2 [-h] [-v]
               {map_preprocess,build_genedb,rna_mod,pair_diff_mod,bedgraph,filter}
               ...

ELIGOS is a package of tools for the identification of modified nucleotides
based on nanopore sequencing data.

ELIGOS command groups:
    map_preprocess    Preprocess mapped reads
    build_genedb      Build bam files from gene database
    rna_mod           Identify RNA modification against rBEM5+2 model
    pair_diff_mod     Identify RNA modification against control condition
    bedgraph          Filtering and creating BedGraph for IGV plot
    filter            Filtering for Eligos result

optional arguments:
  -h, --help          show this help message and exit
  -v, --version       show ELIGOS version and exit.
```

**Index BAM and Reference sequence**

Before running eligos2, index the BAM files and the reference sequence.
```bash
## index BAM
samtools index file.bam

## index reference sequence
samtools faidx file.fasta
```

---
## **Example**
### 1. **Differential ESB analysis of the yeast meiosis transcriptome to identify m6A and the enriched RRACH motif**

This example uses the native RNA sequences of the yeast transcriptome from an m6A methyltransferase knockout (∆ime4) strain and from wild type, both grown under meiosis conditions, from Liu et al. (https://doi.org/10.1038/s41467-019-11713-9).
A subset of 531 transcripts, pre-identified as containing differential %ESB sites at adenine, is used.

#### Run ELIGOS to compare native RNA of wild type with native RNA of the knockout
**1.1 Download data** : The example file contains two sets of mapped reads (BAM files) for ∆ime4 and wild type, gene locations (BED file), and the reference (FASTA file).

1.dRNA_m6A_yeast_knockout.tar.gz [Download](https://drive.google.com/file/d/1J4QEbHeenT5OQkpboOaFRg2WsaFM_Fx6/view?usp=sharing "Data 1")
```
1.dRNA_m6A_yeast_knockout/
├── sc_KO.selected.bam
├── sc_KO.selected.bam.bai
├── sc_WT.selected.bam
├── sc_WT.selected.bam.bai
├── yeast.S288C.genes.selected_set.bed
├── yeast.S288C.genome.fa.gz
├── yeast.S288C.genome.fa.gz.fai
└── yeast.S288C.genome.fa.gz.gzi
```

**1.2 Run ELIGOS command**
```bash
## Index reference sequence
samtools faidx yeast.S288C.genome.fa.gz

## Compare the wild-type sample (-tbam) against the knockout sample (-cbam)
eligos2 pair_diff_mod -tbam sc_WT.selected.bam -cbam sc_KO.selected.bam -reg yeast.S288C.genes.selected_set.bed -ref yeast.S288C.genome.fa.gz -t 34 --pval 0.001 --oddR 1.2 --esb 0 -o results

## Extract potentially modified A bases with eligos2 filter, excluding homopolymer sequences
eligos2 filter -i results/sc_WT.selected_vs_sc_KO.selected_on_yeast.S288C.genes.selected_set_baseExt0.txt -sb A --homopolymer --esb 0 --oddR 1.2 --pval 0.001

## Show output file
head results/sc_WT.selected_vs_sc_KO.selected_on_yeast.S288C.genes.selected_set_baseExt0.A.filtered.txt
```
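
For readers who want to post-process a result table themselves, the following Python sketch applies the same kind of thresholds to a tab-separated table. It is a hedged illustration, not the code behind `eligos2 filter`: the column names (`ref`, `oddR`, `pval`, `ESB_test`) are taken from the result description in this README, while the function name, the inclusive comparisons, and the demo rows are assumptions. Homopolymer filtering is left out because the format of the `homo_seq` column is not described here.

```python
import csv
import io

def filter_sites(handle, base="A", max_pval=0.001, min_odd_r=1.2, min_esb=0.0):
    """Keep rows whose reference base matches and whose statistics pass the cutoffs."""
    kept = []
    for row in csv.DictReader(handle, delimiter="\t"):
        if base is not None and row["ref"] != base:
            continue
        if float(row["pval"]) > max_pval:
            continue
        if float(row["oddR"]) < min_odd_r:
            continue
        if float(row["ESB_test"]) < min_esb:
            continue
        kept.append(row)
    return kept

# Tiny made-up table for illustration only (not real ELIGOS output).
demo = (
    "chrom\tstart_loc\tend_loc\tstrand\tname\tref\toddR\tpval\tESB_test\n"
    "chrI\t100\t101\t+\tgeneA\tA\t3.5\t0.0001\t0.30\n"
    "chrI\t150\t151\t+\tgeneA\tT\t4.0\t0.0001\t0.40\n"
    "chrI\t200\t201\t+\tgeneA\tA\t1.1\t0.0001\t0.10\n"
    "chrI\t250\t251\t+\tgeneA\tA\t2.0\t0.5\t0.20\n"
)
sites = filter_sites(io.StringIO(demo))
print([(s["chrom"], s["start_loc"]) for s in sites])
```

To use it on a real file, pass an open file handle instead of the `io.StringIO` object.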
**1.3 Extract sequences around potentially modified A sites with a 6 bp upstream/downstream extension**
```bash
## Extract fasta from compressed fasta file
gunzip -c yeast.S288C.genome.fa.gz > yeast.S288C.genome.fa
samtools faidx yeast.S288C.genome.fa

## Convert ELIGOS output to a FASTA file with a 6 bp upstream/downstream extension
table2fa_eligos.mergeextend.sh results/sc_WT.selected_vs_sc_KO.selected_on_yeast.S288C.genes.selected_set_baseExt0.A.filtered.txt yeast.S288C.genome.fa

## Check sequence output
head sc_WT.selected_vs_sc_KO.selected_on_yeast.S288C.genes.selected_set_baseExt0.A.filtered.fa
```
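
The extension step can be pictured with a short Python sketch that pads a site by 6 bases on each side and reverse-complements minus-strand sites. This is not the logic of `table2fa_eligos.mergeextend.sh` (which, judging by its name, also merges sites); the function names are hypothetical, and the 0-based half-open coordinates and the toy sequence are assumptions.

```python
_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")

def reverse_complement(seq):
    """Reverse-complement a DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]

def flank_sequence(chrom_seq, start, end, strand, flank=6):
    """Return the site plus `flank` bases on both sides, clipped to the sequence ends.

    Coordinates are assumed to be 0-based, half-open (BED style).
    """
    left = max(0, start - flank)
    right = min(len(chrom_seq), end + flank)
    window = chrom_seq[left:right]
    return reverse_complement(window) if strand == "-" else window

# Toy reference with an A at index 8 inside a GGACT context.
toy = "TTTTTTGGACTAAAAAACC"
print(flank_sequence(toy, 8, 9, "+"))
print(flank_sequence(toy, 8, 9, "-"))
```

Each window is 13 bases long (6 + 1 + 6) unless the site lies within 6 bases of a sequence end, in which case it is clipped.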

**1.4 De-novo motif discovery with** [**BaMM motif**](https://bammmotif.soedinglab.org/)

The sequences around the differential adenines extracted above can be used to identify a consensus motif with BaMM motif discovery (https://doi.org/10.1093/nar/gky431) through its web service (https://bammmotif.soedinglab.org/job/denovo/):

1. Open the BaMM motif website.
2. Choose de novo motif discovery.
3. Upload the FASTA file produced above and click "BaMM!". The consensus RRACH motif should be recovered.

---
### 2. **Detection of RNA modifications in MYC and JUNB transcripts using the rBEM_k5+2 model**

This example uses the native RNA sequences of the human transcriptome from Workman et al. (https://doi.org/10.1038/s41592-019-0617-2).
Transcripts of two oncogenes, MYC and JUNB, are analysed.

**2.1 Download data** : The example file contains mapped reads (BAM file), gene locations (BED file), cDNA (BCF file), and chromosomes (FASTA file).

2.dRNA_m6A_MYC_JUNB.tar.gz [Download](https://drive.google.com/file/d/1WBHmUfIlRTF1MwDLRp4vEHZWquzLtDIP/view?usp=sharing "Data 2")

```
2.dRNA_m6A_MYC_JUNB
├── chr8_chr19.fa.gz
├── chr8_chr19.fa.gz.fai
├── chr8_chr19.fa.gz.gzi
├── myc_junb.bed
├── rna_consortium.myc_junb.bam
├── rna_consortium.myc_junb.bam.bai
├── rna_consortium.myc_junb.bcf
└── rna_consortium.myc_junb.bcf.csi
```

**2.2 Run ELIGOS to compare native RNA against the rBEM5+2 model**
```bash
## Run ELIGOS
eligos2 rna_mod -i rna_consortium.myc_junb.bam -bcf rna_consortium.myc_junb.bcf -reg myc_junb.bed -ref chr8_chr19.fa.gz -o results_myc_junb --pval 1e-5 --oddR 5 --esb 0.2
```
**2.3 Create BedGraph file format with filtering options**
```bash
## Create BedGraph files of the ESB signal, selecting A bases and filtering out homopolymer sequences
eligos2 bedgraph -i results_myc_junb/rna_consortium.myc_junb_vs_model_on_myc_junb_baseExt0.txt --select_base A --signal ESB --homopolymer

## Check output of ESB frequency in native RNA reads (test)
head results_myc_junb/rna_consortium.myc_junb_vs_model_on_myc_junb_baseExt0.A.ESB_test.bdg

## Check output of ESB frequency in the rBEM5+2 model (ctrl)
head results_myc_junb/rna_consortium.myc_junb_vs_model_on_myc_junb_baseExt0.A.ESB_ctrl.bdg
```
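
A BedGraph data line has four tab-separated fields: chromosome, start, end, and value. As a rough illustration of what such a conversion involves, the Python sketch below turns result rows into BedGraph lines. It is not the `eligos2 bedgraph` implementation: the function name is hypothetical, the rows are made up, and it assumes that `start_loc` and `end_loc` already follow the 0-based, half-open BedGraph convention.

```python
def to_bedgraph(rows, signal="ESB_test", select_base=None):
    """Format result rows as BedGraph lines (chrom, start, end, value), sorted by position."""
    ordered = sorted(rows, key=lambda r: (r["chrom"], int(r["start_loc"])))
    lines = []
    for row in ordered:
        if select_base is not None and row["ref"] != select_base:
            continue
        fields = [
            row["chrom"],
            str(int(row["start_loc"])),
            str(int(row["end_loc"])),
            format(float(row[signal]), "g"),
        ]
        lines.append("\t".join(fields))
    return lines

# Made-up rows using column names from the result description.
rows = [
    {"chrom": "chr8", "start_loc": "20", "end_loc": "21", "ref": "A", "ESB_test": "0.25"},
    {"chrom": "chr8", "start_loc": "10", "end_loc": "11", "ref": "A", "ESB_test": "0.05"},
    {"chrom": "chr8", "start_loc": "15", "end_loc": "16", "ref": "G", "ESB_test": "0.5"},
]
for line in to_bedgraph(rows, select_base="A"):
    print(line)
```

Sorting by chromosome and start position keeps the output in the order that genome browsers generally expect.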

The generated BedGraph files can be used to visualize the positions of differential ESB at adenine in IGV (http://software.broadinstitute.org/software/igv/).

![oddR](images/junb_myc.png)

Note: dRNA ESB frequency is shown in cyan; rBEM5+2 ESB frequency is shown in magenta.

---
### 3. **Differential ESB analysis of a synthetic plasmid to identify a DNA adduct**

This example uses the native DNA sequences of a synthetic plasmid containing one N2-Ethyl Deoxyguanine.

**3.1 Download data** : The example file contains two sets of mapped reads (BAM files) for N2Et and WT, gene locations (BED file), and the plasmid (FASTA file).

3.DNA_N2Et_adduct.tar.gz [Download](https://drive.google.com/file/d/14HltQGe2_JuEk9kkeCRjHXZ-m1g4AX4E/view?usp=sharing "Data 3")

```
3.DNA_N2Et_adduct
├── N2Et.DNA.bam
├── N2Et.DNA.bam.bai
├── WT.DNA.bam
├── WT.DNA.bam.bai
├── WT.plasmid.bed
├── WT.plasmid.fa.gz
├── WT.plasmid.fa.gz.fai
└── WT.plasmid.fa.gz.gzi
```

**3.2 Run ELIGOS to compare N2Et-modified DNA with WT DNA**
```bash
## Filter out mapped reads shorter than 200 bases
eligos2 map_preprocess -aln 200 -i N2Et.DNA.bam
eligos2 map_preprocess -aln 200 -i WT.DNA.bam

## Run ELIGOS
eligos2 pair_diff_mod -tbam N2Et.DNA.preprocess.bam -cbam WT.DNA.preprocess.bam -reg WT.plasmid.bed -ref WT.plasmid.fa.gz -o results --oddR 0 --esb 0 --pval 1 --adjPval 1
```
**3.3 Create BedGraph file format with filtering options**
```bash
## Create BedGraph file of the odds ratio signal, filtering out homopolymer sequences
eligos2 bedgraph -i results/N2Et.DNA.preprocess_vs_WT.DNA.preprocess_on_WT.plasmid_baseExt0.txt --signal oddR --homopolymer

## Check output
head results/N2Et.DNA.preprocess_vs_WT.DNA.preprocess_on_WT.plasmid_baseExt0.oddR.bdg
```

The odds ratio of each nucleotide from the generated BedGraph file is plotted below and shows the correct identification of the N2-Ethyl Deoxyguanine position on the synthetic plasmid.

![oddR](images/result_odd.png)

---
### 4. **Identification of RNA modifications from replicates using the Cochran-Mantel-Haenszel (CMH) Test**

This example compares two conditions with biological replicates using the CMH test.

**4.1 Download data** : The example file contains three rna_mod results (TSV files) of RNA-seq under the ethanol condition and three rna_mod results under the glucose condition, produced in a previous step (eligos2 rna_mod).

4.example_YPL061W_baseExt0.tar.gz [Download](https://drive.google.com/file/d/1Pgvi6qFTE0tQOP0eCn17-A36C6LXdhUd/view?usp=sharing "Data 4")

```
tar -xvzf 4.example_YPL061W_baseExt0.tar.gz

4.example_YPL061W_baseExt0
├── drna_ethanol_11.mn200_vs_model_on_YPL061W_baseExt0.txt
├── drna_ethanol_21.mn200_vs_model_on_YPL061W_baseExt0.txt
├── drna_ethanol_31.mn200_vs_model_on_YPL061W_baseExt0.txt
├── drna_glucose_12.mn200_vs_model_on_YPL061W_baseExt0.txt
├── drna_glucose_22.mn200_vs_model_on_YPL061W_baseExt0.txt
└── drna_glucose_32.mn200_vs_model_on_YPL061W_baseExt0.txt
```

**4.2 Run multiple samples testing**
```bash
## Run multi_samples_test
eligos2 multi_samples_test --test_mods 4.example_YPL061W_baseExt0/drna_ethanol_*.txt --ctrl_mods 4.example_YPL061W_baseExt0/drna_glucose_*.txt --prefix ethanol_vs_glucose

wc -l ethanol_vs_glucose.CMH_testing.txt
```
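
For intuition about what a CMH test does with replicates, the Python sketch below pools one 2x2 table (error/correct counts in test vs. control) per replicate into a single statistic. It is a textbook formulation under stated assumptions, not the code run by `eligos2 multi_samples_test`: the function name and counts are made up, the continuity correction mirrors the default of R's `mantelhaen.test`, and how ELIGOS pairs replicates into strata is not described in this README.

```python
from math import erfc, sqrt

def cmh_test(tables, correct=True):
    """Cochran-Mantel-Haenszel test over a list of 2x2 tables.

    Each table is (a, b, c, d) = (test_err, test_cor, ctrl_err, ctrl_cor).
    Returns (chi-square statistic, p-value, Mantel-Haenszel common odds ratio).
    """
    delta = 0.0      # sum of observed minus expected test-error counts
    variance = 0.0   # sum of hypergeometric variances
    or_num = 0.0
    or_den = 0.0
    for a, b, c, d in tables:
        n = a + b + c + d
        if n < 2:
            continue  # a stratum this small carries no information
        delta += a - (a + b) * (a + c) / n
        variance += (a + b) * (c + d) * (a + c) * (b + d) / (n * n * (n - 1))
        or_num += a * d / n
        or_den += b * c / n
    if variance == 0:
        raise ValueError("no informative strata")
    yates = 0.5 if correct and abs(delta) >= 0.5 else 0.0
    statistic = (abs(delta) - yates) ** 2 / variance
    # Survival function of a chi-square distribution with 1 degree of freedom.
    p_value = erfc(sqrt(statistic / 2))
    common_or = or_num / or_den if or_den else float("inf")
    return statistic, p_value, common_or

# Two made-up replicate strata.
stat, pval, common_or = cmh_test([(10, 90, 5, 95), (20, 80, 10, 90)])
print(round(stat, 3), round(pval, 4), round(common_or, 2))
```

Stratifying by replicate tests for a consistent association across replicates, rather than pooling all reads into one table where a single deep replicate could dominate.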

---
## Result description

| Column | Description |
| ------ | ------ |
| chrom | Reference name |
| start_loc | Mapped start position |
| end_loc | Mapped end position |
| strand | Mapped direction |
| name | Region/gene name from BED file |
| ref | Reference base |
| homo_seq | Overlap with a homopolymer sequence |
| kmer5 | 5-mer with the reference base in the third position |
| majorAllel | The base with the highest frequency (major allele) |
| majorAllelFreq | Frequency of the major allele |
| kmer7 | 7-mer with the reference base in the third position |
| test_err_1 | Count of error bases in the test sample |
| model_err_1 | Count of error bases in the error profile from the rBEM model |
| test_cor_1 | Count of correct bases in the test sample |
| model_cor_1 | Count of correct bases in the error profile from the rBEM model |
| oddR | Odds ratio |
| pval | P-value |
| adjPval | Adjusted P-value |
| baseExt | Base expansion (0, 1, or 2 bases) |
| total_reads | Total read count |
| ESB_test | Percent error at specific base in the test sample |
| ESB_ctrl | Percent error at specific base in the control (rBEM model) |

---

## License & copyright

Licensed under the [MIT License](LICENSE)