https://github.com/bcgsc/terminitor
Deep Neural Network model that predicts polyadenylation sites
deep-neural-networks deeplearning polyadenylation
- Host: GitHub
- URL: https://github.com/bcgsc/terminitor
- Owner: bcgsc
- License: gpl-3.0
- Created: 2019-06-30T22:43:40.000Z (over 5 years ago)
- Default Branch: master
- Last Pushed: 2021-04-22T04:11:30.000Z (over 3 years ago)
- Last Synced: 2023-03-23T00:52:07.609Z (almost 2 years ago)
- Topics: deep-neural-networks, deeplearning, polyadenylation
- Language: Python
- Homepage:
- Size: 377 KB
- Stars: 5
- Watchers: 5
- Forks: 1
- Open Issues: 2
Metadata Files:
- Readme: README.md
- License: LICENSE
README
# Terminitor
Terminitor is a deep neural network that predicts whether a sequence contains a polyadenylated (poly(A)) cleavage site (CS) at a given position.
For more information, please refer to the preprint: https://www.biorxiv.org/content/10.1101/710699v2
### Datasets for download
www.bcgsc.ca/downloads/supplementary/Terminitor

This FTP site contains two datasets (human and mouse) and the two corresponding pre-trained models for testing.
### Dependencies
* Python3
* Numpy
* Keras
* Scikit-learn
* Pybedtools
* Pysam
* HTSeq

A Python environment for these packages can be created with `conda`, e.g.
```
conda create --name terminitor pysam pybedtools numpy keras scikit-learn htseq
```
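Once the environment is active, a quick import check can confirm that every dependency resolved. The sketch below is not part of Terminitor: the helper name `find_missing` is hypothetical, and the import names (for example `sklearn` for Scikit-learn and `HTSeq` for HTSeq) are assumed to be the modules these packages normally expose.

```python
import importlib.util

# Import names for the packages listed under Dependencies.
# Assumption: these are the usual module names; adjust if your installation differs.
REQUIRED_MODULES = ["numpy", "keras", "sklearn", "pybedtools", "pysam", "HTSeq"]


def find_missing(modules):
    """Return the modules from `modules` that cannot be located in this environment."""
    return [name for name in modules if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    missing = find_missing(REQUIRED_MODULES)
    if missing:
        print("Missing modules: " + ", ".join(missing))
    else:
        print("All listed dependencies were found.")
```

Using `find_spec` only locates each module without importing it, so the check stays fast and does not initialise Keras.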
For more information, consult the [user guide for conda](https://docs.conda.io/projects/conda/en/latest/user-guide/tasks/manage-environments.html).

### Train
Usage: `train.py [-h] [-v] -polya POLYA -cs CS -non NON -model MODEL -l L`
```
optional arguments:
-h, --help show this help message and exit
-v, --version show program's version number and exit
-polya POLYA Poly(A) CS, fasta file
-cs CS Non-poly(A) CS, fasta file
-non NON Non-CS, fasta file
-model MODEL File name of trained model
-l L Length of input sequences
```

### Extract candidate sequence
Usage: `extract_from_sequences.py [-h] [-v] -t ANNOT_TRANS -a ANNOT_ALL -m ALN -g GENOME -o O [-u UP_LEN] [-d DOWN_LEN]`
```
optional arguments:
-h, --help show this help message and exit
-v, --version show program's version number and exit
-t ANNOT_TRANS, --annot_trans ANNOT_TRANS
Transcript annotation file, GTF format. This file
contains only transcript level annotation, can be
downloaded from the ftp site provided on our Github
page
-a ANNOT_ALL, --annot_all ANNOT_ALL
Ensembl annotation file, GTF format. Can be downloaded
from Ensembl ftp site
-m ALN, --aln ALN The alignment file from assembled transcript contigs
to reference genome in BAM format.
-g GENOME, --genome GENOME
Indexed reference genome assembly in FASTA format, which
can be downloaded from Ensembl
-o O Output file, fasta format containing candidate
sequences to be tested
-u UP_LEN, --up_len UP_LEN
Upstream sequence length
-d DOWN_LEN, --down_len DOWN_LEN
Downstream sequence length
```

### Test
Usage: `test.py [-h] [-v] -t TEST_FILE -m MODEL -l L -o OUTPUT`
```
optional arguments:
-h, --help show this help message and exit
-v, --version show program's version number and exit
-t TEST_FILE, --test_file TEST_FILE
Fasta file to be tested
-m MODEL, --model MODEL
Pre-trained model file
-l L Length of input sequences
-o OUTPUT, --output OUTPUT
Output probabilities
```

### Pipeline
1. For Illumina RNA-seq short reads, run assembly with [RNA-Bloom](https://github.com/bcgsc/RNA-Bloom)
```
java -jar RNA-Bloom.jar -left read2.fq -right read1.fq -revcomp-right -outdir assembly -a 4 -e 1 -stratum 01 -ss -ntcard -fpr 0.005
```
For PacBio CCS reads, skip this step.

2. Genome alignment with [minimap2](https://github.com/lh3/minimap2)
```
minimap2 -ax splice hg38.mmi rnabloom.transcripts.fa | samtools view -u - | samtools sort -T tmp_prefix -O BAM -o aln.bam
samtools index aln.bam
```

3. Extract candidate sequence
```
python extract_from_sequences.py -t Homo_sapiens.GRCh38.99.transcripts.gtf -a Homo_sapiens.GRCh38.99.gtf -g Homo_sapiens.GRCh38.dna.primary_assembly.fa -m aln.bam -o extracted_sequences.fa
```

The GTF file for the `-a` option can be downloaded from Ensembl.
The GTF file for the `-t` option can be generated from the Ensembl annotation, e.g.
```
awk '$3=="transcript" {print}' Homo_sapiens.GRCh38.99.gtf > Homo_sapiens.GRCh38.99.transcripts.gtf
```

The reference genome for the `-g` option can be downloaded from Ensembl and must be indexed, e.g.
```
samtools faidx Homo_sapiens.GRCh38.dna.primary_assembly.fa
```

4. Test
```
python test.py -t extracted_sequences.fa -m pre_trained_model -l 200 -o probabilities.txt
```
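
Because `test.py` takes the input length through `-l`, it may be worth confirming that the extracted sequences match that length before running step 4. The following is a minimal sketch under stated assumptions, not part of Terminitor: it assumes the model expects every sequence to be exactly `-l` nucleotides long, and the helper names `read_fasta_lengths`, `find_wrong_length`, `build_test_command` and `run_step4` are hypothetical. Only the `test.py` flags documented above are used.

```python
import subprocess
import sys


def read_fasta_lengths(path):
    """Map each FASTA record name (first word of the header) to its sequence length."""
    lengths = {}
    name = None
    with open(path) as handle:
        for raw in handle:
            line = raw.strip()
            if not line:
                continue
            if line.startswith(">"):
                parts = line[1:].split()
                name = parts[0] if parts else ""
                lengths[name] = 0
            elif name is not None:
                # Sequences may be wrapped over several lines, so accumulate.
                lengths[name] += len(line)
    return lengths


def find_wrong_length(path, expected):
    """Return the names of records whose length differs from `expected`."""
    return [n for n, size in read_fasta_lengths(path).items() if size != expected]


def build_test_command(fasta, model, length, output):
    """Assemble the step 4 command from the flags documented for test.py."""
    return [sys.executable, "test.py", "-t", fasta, "-m", model,
            "-l", str(length), "-o", output]


def run_step4(fasta, model, length, output):
    """Check sequence lengths, then run test.py (assumed to be in the working directory)."""
    wrong = find_wrong_length(fasta, length)
    if wrong:
        raise ValueError(f"{len(wrong)} sequence(s) are not {length} nt long, e.g. {wrong[0]}")
    subprocess.run(build_test_command(fasta, model, length, output), check=True)
```

For example, `run_step4("extracted_sequences.fa", "pre_trained_model", 200, "probabilities.txt")` would raise an error before invoking `test.py` if any record has a different length.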