
# Bioinfor DeepATT

DeepATT is a model for identifying functional effects of DNA sequences, implemented with TensorFlow 2.0.
The model consists of four built-in neural network modules: a convolution layer that captures regulatory motifs,
a recurrent layer that captures regulatory grammar, a category attention layer (improved from the self-attention layer)
that selects the valid features for each function, and a category dense layer (improved from the locally connected
dense layer) that classifies the labels using the feature vectors selected by the query vectors of the regulatory
functions. We compare DeepATT with DeepSEA and DanQ, all implemented or replicated on our own platform. The comparison
shows that DeepATT achieves state-of-the-art performance of **0.94519** average AUROC and **0.39522** average AUPR,
clearly surpassing other non-coding DNA regulatory function prediction methods. **The performance of all the
models described in the original paper is shown in the table below.**

Model|DeepSEA|DanQ|DanQ_JASPAR|DeepATT|DeepATT_Plus|
:-:|:-:|:-:|:-:|:-:|:-:|
AV-AUPR|0.34163|0.37089|0.37936|0.39522|0.39324
AV-AUROC|0.93260|0.93837|0.94174|0.94519|0.94432
Parameter number|61,723,119|46,926,479|67,892,175|7,808,057|7,900,775

**Key Points:**
- We propose DeepATT, a hybrid deep neural network method with four built-in
neural network layers, for identifying 919 regulatory
functions on nearly 5 million DNA sequences. We are the first to design a
category attention layer and a category dense layer in order to learn
distinct representations for different DNA functions.
- We replicate two state-of-the-art models, DeepSEA and DanQ,
in order to compare different model architectures with our novel
construction. DeepATT performs significantly better than other
prediction tools for identifying DNA functions.
- Our model mines important correlations among different DNA
functions through the category attention module. The attention
mechanism scores feature vectors to estimate all
functional targets for the different DNA regulatory functions.
- Our model reduces the number of parameters through the attention
mechanism and local connections, while maintaining
prediction accuracy. The attention mechanism determines the relevant
characteristics for each binary target, and the local connection
eliminates unnecessary features for specific connections.

## Citation

```
@article{Li2020DeepATT,
title={DeepATT: a hybrid category attention neural network for identifying functional effects of DNA sequences},
author={Li, Jiawei and Pu, Yuqian and Tang, Jijun and Zou, Quan and Guo, Fei},
journal={Briefings in Bioinformatics},
year={2020},
}
```
My Manuscript: [[PDF]](https://www.ljwstruggle.com/mynote/docs/DeepATT.pdf)

My Homepage: https://www.ljwstruggle.com/

## DeepATT
![deepatt](process/deepatt.png)

### Model Architecture
- **DeepSEA**: CNN + Pool + CNN + Pool + CNN + Pool + Dense + Dense
- **DanQ**: CNN (320 kernels) + Pool + BidLSTM + Dense + Dense
- **DanQ_JASPAR**: CNN (1024 kernels) + Pool + BidLSTM + Dense + Dense
- **DeepATT**: CNN + Pool + BidLSTM + Category Multi-Head Attention + Category Dense (ReLU, weight-shared) + Category Dense (sigmoid, weight-shared) (see the sketch below)
- **DeepATT-Plus**: CNN + Pool + BidLSTM + Category Multi-Head Attention + Category Dense (ReLU, weight-shared) + Category Dense (sigmoid, not weight-shared)
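For concreteness, here is a minimal Keras sketch of the DeepATT stack. The kernel count, kernel width, pooling size, LSTM width, head count, and hidden size are illustrative assumptions rather than the exact values used in this repo, and `tf.keras.layers.MultiHeadAttention` requires a newer TensorFlow than the 2.0.0 pinned below (the repo implements its own attention layer).

```
import tensorflow as tf

class CategoryAttention(tf.keras.layers.Layer):
    """One learned query vector per regulatory function attends over
    the BidLSTM feature sequence (a sketch of the category attention idea)."""
    def __init__(self, num_categories, dim, num_heads=4):
        super().__init__()
        self.queries = self.add_weight(
            name='category_queries', shape=(num_categories, dim),
            initializer='glorot_uniform', trainable=True)
        self.mha = tf.keras.layers.MultiHeadAttention(
            num_heads=num_heads, key_dim=dim // num_heads)

    def call(self, features):
        batch = tf.shape(features)[0]
        q = tf.tile(self.queries[None, :, :], [batch, 1, 1])  # (batch, 919, dim)
        return self.mha(query=q, value=features, key=features)

def build_deepatt(seq_len=1000, num_categories=919):
    inputs = tf.keras.Input(shape=(seq_len, 4))                      # one-hot DNA
    x = tf.keras.layers.Conv1D(1024, 30, activation='relu')(inputs)  # motif scanner
    x = tf.keras.layers.MaxPool1D(pool_size=15)(x)
    x = tf.keras.layers.Bidirectional(
        tf.keras.layers.LSTM(512, return_sequences=True))(x)         # regulatory grammar
    x = CategoryAttention(num_categories, dim=1024)(x)               # (batch, 919, 1024)
    # "Weight-shared" category dense: the same small head scores every category.
    x = tf.keras.layers.Dense(100, activation='relu')(x)
    x = tf.keras.layers.Dense(1, activation='sigmoid')(x)            # (batch, 919, 1)
    x = tf.keras.layers.Reshape((num_categories,))(x)                # (batch, 919)
    return tf.keras.Model(inputs, x)
```

DeepATT-Plus would differ only in the final layer, giving each category its own output weights instead of sharing one head.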

### Loss Function
We use NLLLoss or FocalLoss.
(You can switch between these loss functions in the config file.)
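As an illustration, a minimal multi-label focal loss in TensorFlow might look like the following; the `gamma`/`alpha` defaults are common choices, not necessarily the values used in this repo.

```
import tensorflow as tf

def focal_loss(y_true, y_pred, gamma=2.0, alpha=0.25, eps=1e-7):
    """Binary focal loss averaged over the 919 targets (a sketch;
    gamma/alpha are common defaults, not necessarily this repo's)."""
    y_pred = tf.clip_by_value(y_pred, eps, 1.0 - eps)
    # Probability assigned to the true class for each binary target.
    p_t = y_true * y_pred + (1.0 - y_true) * (1.0 - y_pred)
    alpha_t = y_true * alpha + (1.0 - y_true) * (1.0 - alpha)
    # Down-weight easy examples by (1 - p_t)^gamma.
    loss = -alpha_t * tf.pow(1.0 - p_t, gamma) * tf.math.log(p_t)
    return tf.reduce_mean(loss)
```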

### Optimization Method
We have implemented five optimization methods: SGD, Adadelta, Adagrad, Adam, and RMSprop.
(You can select the method in the config file.)
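A config-driven optimizer lookup could be as simple as the following sketch; the config key names here are hypothetical, not necessarily those used in `./config/config_0.json`.

```
import tensorflow as tf

# Hypothetical mapping from a config string to a tf.keras optimizer.
OPTIMIZERS = {
    'sgd': tf.keras.optimizers.SGD,
    'adadelta': tf.keras.optimizers.Adadelta,
    'adagrad': tf.keras.optimizers.Adagrad,
    'adam': tf.keras.optimizers.Adam,
    'rmsprop': tf.keras.optimizers.RMSprop,
}

def build_optimizer(config):
    # e.g. config = {"optimizer": "adam", "learning_rate": 0.0005}
    return OPTIMIZERS[config['optimizer']](learning_rate=config['learning_rate'])
```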

## USAGE

We run the code on Ubuntu 18.04 LTS with a GTX 1080 Ti GPU. Training one model for one epoch takes **1 to 2 hours**,
and producing one fully trained model takes **1 to 2 days**. We trained 28 models for the comparison.

### Requirement
Python (3.7.3) | TensorFlow (2.0.0) | CUDA (10.0) | cuDNN (7.6.0)

### Data
You first need to download the training, validation, and testing sets from the DeepSEA website. After extracting the
contents of the tar.gz file, move the three .mat files into the **`./data/`** folder.
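For orientation, the bundle ships `train.mat`, `valid.mat`, and `test.mat`. In the original DeepSEA/DanQ releases the large training file is a MATLAB v7.3 (HDF5) file while the smaller splits are ordinary .mat files, so inspecting them looks roughly like this; the key names are those of the original bundle, so verify them against your download.

```
import h5py
import scipy.io

# train.mat is a MATLAB v7.3 file in the original DeepSEA bundle,
# so it is read with h5py rather than scipy.io.loadmat.
with h5py.File('./data/train.mat', 'r') as f:
    train_x = f['trainxdata']   # one-hot sequences
    train_y = f['traindata']    # 919 binary labels
    print(train_x.shape, train_y.shape)

# valid.mat and test.mat are ordinary .mat files.
valid = scipy.io.loadmat('./data/valid.mat')
print(valid['validxdata'].shape, valid['validdata'].shape)
```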

### Model File

No pre-trained model file is provided.

### Preprocess
Because of limited RAM, we first convert the `train.mat` file into .tfrecord files.
```
python process/preprocess.py
```
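A minimal sketch of such a conversion is shown below; the file layout and key names are assumptions, and the repo's `process/preprocess.py` is the authoritative version.

```
import h5py
import numpy as np
import tensorflow as tf

def serialize_example(x, y):
    """Pack one (sequence, labels) pair into a tf.train.Example."""
    feature = {
        'x': tf.train.Feature(bytes_list=tf.train.BytesList(
            value=[x.astype(np.float32).tobytes()])),
        'y': tf.train.Feature(bytes_list=tf.train.BytesList(
            value=[y.astype(np.float32).tobytes()])),
    }
    return tf.train.Example(
        features=tf.train.Features(feature=feature)).SerializeToString()

# Stream the big HDF5 training file into a TFRecord without
# loading everything into RAM at once.
with h5py.File('./data/train.mat', 'r') as f, \
        tf.io.TFRecordWriter('./data/train.tfrecord') as writer:
    n = f['traindata'].shape[-1]      # sample count (axis per bundle layout)
    for i in range(n):
        x = np.asarray(f['trainxdata'][..., i])   # one sequence
        y = np.asarray(f['traindata'][..., i])    # its 919 labels
        writer.write(serialize_example(x, y))
```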

### Train
Then you can train the model.
```
CUDA_VISIBLE_DEVICES=0 python main.py -e train -c ./config/config_0.json
```

### Test
Once training has finished successfully, you can evaluate the model.
```
CUDA_VISIBLE_DEVICES=0 python main.py -e test -c ./config/config_0.json
```

## RESULT
Our [results](./result/DeepATT.xlsx) are available in the **`./result/`** directory.

### Performance
We evaluate the models with two metrics, AUROC and AUPR, averaged over the 919 targets. Models marked with * are our replications of the published architectures.
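Average AUROC and AUPR over the 919 binary targets can be computed with scikit-learn, for example as sketched below (scikit-learn is not listed in the requirements above, so this is illustrative only).

```
import numpy as np
from sklearn.metrics import roc_auc_score, average_precision_score

def average_metrics(y_true, y_score):
    """Mean AUROC and AUPR over all targets that contain both classes."""
    aurocs, auprs = [], []
    for j in range(y_true.shape[1]):
        if 0 < y_true[:, j].sum() < len(y_true):   # skip degenerate columns
            aurocs.append(roc_auc_score(y_true[:, j], y_score[:, j]))
            auprs.append(average_precision_score(y_true[:, j], y_score[:, j]))
    return np.mean(aurocs), np.mean(auprs)
```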

Model|Optimizer|Loss|Learning Rate|Scheduler|Batch Size|AVG AUPR|AVG AUROC|
:-:|:-:|:-:|:-:|:-:|:-:|:-:|:-:|
DeepSEA*|Adam|NLL|0.001|None|64|0.26140|0.89225
DeepSEA*|Adam|NLL|0.0005|None|64|0.29214|0.90847
DeepSEA*|Adam|Focal|0.001|None|64|0.24434|0.87009
DeepSEA*|Adam|Focal|0.0005|None|64|0.25994|0.88411
DanQ*|Adam|NLL|0.001|None|64|0.33254|0.92363
DanQ*|Adam|NLL|0.0005|None|64|0.35921|0.93399
DanQ*|Adam|Focal|0.001|None|64|0.34454|0.92875
DanQ*|Adam|Focal|0.0005|None|64|0.34962|0.93160
DanQ_JASPAR*|Adam|NLL|0.001|None|64|0.37443|0.93827
DanQ_JASPAR*|Adam|NLL|0.0005|None|64|0.37872|0.94001
DanQ_JASPAR*|Adam|Focal|0.001|None|64|0.37692|0.93954
DanQ_JASPAR*|Adam|Focal|0.0005|None|64|0.38441|0.94171
DeepATT|Adam|NLL|0.001|StepLR|64|0.39304|0.94422
DeepATT|Adam|NLL|0.001|None|64|0.38519|0.94232
DeepATT|Adam|NLL|0.0005|StepLR|64|0.39619|0.94486
DeepATT|Adam|NLL|0.0005|None|64|0.39267|0.94436
DeepATT|Adam|Focal|0.001|StepLR|64|0.39246|0.94432
DeepATT|Adam|Focal|0.001|None|64|0.39303|0.94332
DeepATT|Adam|Focal|0.0005|StepLR|64|**`0.39522`**|**`0.94519`**
DeepATT|Adam|Focal|0.0005|None|64|0.39488|0.94491
DeepATT_Plus|Adam|NLL|0.001|StepLR|64|0.38595|0.94271
DeepATT_Plus|Adam|NLL|0.001|None|64|0.37768|0.93932
DeepATT_Plus|Adam|NLL|0.0005|StepLR|64|0.38125|0.94196
DeepATT_Plus|Adam|NLL|0.0005|None|64|0.38406|0.94293
DeepATT_Plus|Adam|Focal|0.001|StepLR|64|0.38772|0.94266
DeepATT_Plus|Adam|Focal|0.001|None|64|0.38711|0.94274
DeepATT_Plus|Adam|Focal|0.0005|StepLR|64|**`0.39324`**|**`0.94432`**
DeepATT_Plus|Adam|Focal|0.0005|None|64|0.38797|0.94308

### Attention Analysis
We analyze all trained query vectors in the category attention layer in order to mine the correlation among the **919**
DNA non-coding regulatory functions. In the category attention module, we feed a **919 x 919** diagonal matrix
as input to the attention layer. First, we randomly generate **919** independent query vectors of length **100**.
We calculate the cosine similarity matrix of these randomly generated query vectors, but obtain no valid correlation
information. Then, we train the query vectors in the category attention layer and calculate the cosine
similarity matrix for the **919** chromatin features (**125** DNase features, **690** TF features, **104** histone features).
Here we find subtle correlations within the same function category. Moreover, we enhance the cosine
similarity matrix with the sigmoid function. Clear small blocks indicate substantial learned correlation
among the **919** DNA non-coding regulatory functions. Note that the three major categories of
non-coding functions are DNase I sensitivity (items **0-124**), transcription factor (TF) binding
(items **125-814**), and histone-mark profiles (items **815-918**). It is worth noting that the cosine similarity matrix
also reveals sub-categories within the TF binding functions. We visualize the cosine similarity matrix as a heatmap,
shown in the figure below.

![similarity matrix](process/similarity.png)
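The analysis above amounts to something like the following sketch; how you extract the trained query matrix depends on how the attention layer names its weights, and the sigmoid scale factor here is a guess.

```
import numpy as np
import matplotlib.pyplot as plt

def cosine_similarity_heatmap(queries, out_png='similarity.png'):
    """Cosine similarity between the 919 trained query vectors,
    sharpened with a sigmoid as described above."""
    q = queries / np.linalg.norm(queries, axis=1, keepdims=True)
    sim = q @ q.T                                  # (919, 919) cosine similarities
    enhanced = 1.0 / (1.0 + np.exp(-10.0 * sim))   # sigmoid sharpening (scale is a guess)
    plt.imshow(enhanced, cmap='viridis')
    plt.colorbar()
    plt.savefig(out_png, dpi=200)
```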

## ISSUE
If you encounter any issue or have feedback, please don't hesitate to [raise an issue](https://github.com/ljw-struggle/Bioinfor-DeepATT/issues).

## REFERENCE
> Zhou, J. and Troyanskaya, O. G. Predicting effects of noncoding variants with deep learning-based sequence model. *Nature Methods*, 2015. (DeepSEA)

> Quang, D. and Xie, X. DanQ: a hybrid convolutional and recurrent deep neural network for quantifying the function of DNA sequences. *Nucleic Acids Research*, 2016. (DanQ)