Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://github.com/jundaf2/eigenmha
Forward and backward Attention DNN operators implemented by LibTorch, cuDNN, and Eigen.
backpropagation cuda cudnn cudnn-v8 dnn inference pytorch
Last synced: 3 months ago
- Host: GitHub
- URL: https://github.com/jundaf2/eigenmha
- Owner: jundaf2
- Created: 2023-02-26T04:29:24.000Z (almost 2 years ago)
- Default Branch: cudnn
- Last Pushed: 2023-06-06T09:48:12.000Z (over 1 year ago)
- Last Synced: 2023-12-13T02:50:43.657Z (about 1 year ago)
- Topics: backpropagation, cuda, cudnn, cudnn-v8, dnn, inference, pytorch
- Language: C++
- Homepage:
- Size: 75.2 MB
- Stars: 17
- Watchers: 2
- Forks: 2
- Open Issues: 0
Metadata Files:
- Readme: README.md
Awesome Lists containing this project
README
# eigenMHA (eigenDNN vs cuDNN) -- Multi-head Attention Inference and Training implemented by Eigen

(Figure: which part of the Transformer model we implement.)
To clone this repo,
```
git clone --recursive https://github.com/jundaf2/eigenMHA
cd eigenMHA
git clone https://gitlab.com/libeigen/eigen # clone eigen if necessary
```

## Introduction
In this repo, we use Eigen3 to implement the forward and backward passes of Multi-head Attention (MHA) in Transformer models. The repo has two branches -- `torch` and `cudnn`.

## The MHAs in this repo
1. a pytorch MHA in `mha.py` that illustrates the MHA module we implement
2. an eigen MHA in `mha.cc` in both branches (with sources in `./src/eigenDNN.cpp` and headers in `./include/eigenDNN.h`)
3. a libtorch MHA in the `torch` branch as a comparison to the eigenMHA
4. a cudnn MHA in the `cudnn` branch as a comparison to the eigenMHA

### branch `torch`
```
git checkout torch
```

In this branch, eigenDNN is compared against the CPU LibTorch implementation. To build and run the project, first install LibTorch for verification (see https://github.com/jundaf2/dnn-test-framework; nnTest mainly provides a testing framework for training and running inference on deep neural networks with your own library). Then:
```
mkdir build && cd build
cmake ..
make -j4
./mha
```

### branch `cudnn`
```
git checkout cudnn
```
In this branch, eigenDNN is compared with the Multi-head Attention APIs provided by cuDNN v8 (`cudnn_samples_v8/multiHeadAttention`). To install cuDNN, see https://developer.nvidia.com/rdp/cudnn-download and https://docs.nvidia.com/deeplearning/cudnn/install-guide/index.html#installlinux-tar. After copying the corresponding libraries and headers to the correct locations:
```
mkdir build && cd build
cmake ..
make -j4
./mha
```

To be more specific, eigenDNN does what cuDNN does in the following MHA-related APIs:
* [cudnnCreateAttnDescriptor()](https://docs.nvidia.com/deeplearning/cudnn/api/index.html#cudnnCreateAttnDescriptor)
* [cudnnSetAttnDescriptor()](https://docs.nvidia.com/deeplearning/cudnn/api/index.html#cudnnSetAttnDescriptor)
* [cudnnGetAttnDescriptor()](https://docs.nvidia.com/deeplearning/cudnn/api/index.html#cudnnGetAttnDescriptor)
* [cudnnDestroyAttnDescriptor()](https://docs.nvidia.com/deeplearning/cudnn/api/index.html#cudnnDestroyAttnDescriptor)
* [cudnnGetMultiHeadAttnBuffers()](https://docs.nvidia.com/deeplearning/cudnn/api/index.html#cudnnGetMultiHeadAttnBuffers)
* [cudnnGetMultiHeadAttnWeights()](https://docs.nvidia.com/deeplearning/cudnn/api/index.html#cudnnGetMultiHeadAttnWeights)
* [cudnnMultiHeadAttnForward()](https://docs.nvidia.com/deeplearning/cudnn/api/index.html#cudnnMultiHeadAttnForward)
* [cudnnMultiHeadAttnBackwardData()](https://docs.nvidia.com/deeplearning/cudnn/api/index.html#cudnnMultiHeadAttnBackwardData)
* [cudnnMultiHeadAttnBackwardWeights()](https://docs.nvidia.com/deeplearning/cudnn/api/index.html#cudnnMultiHeadAttnBackwardWeights)

For more details on the Attention APIs in cuDNN v8, see this [CSDN article (in Chinese)](http://t.csdn.cn/Hw0Qi).
## What are the variables of MHA in a Training Library?
### Forward Pass of MHA
1. Q, K, V input embeddings

$$
\mathbf{Q}_{in} \quad \mathbf{K}_{in} \quad \mathbf{V}_{in}
$$

2. Weights and biases for the linear layers of Q, K, V, and O

$$
\mathbf{W}_{Q} \quad \mathbf{b}_{Q}
$$

$$
\mathbf{W}_{K} \quad \mathbf{b}_{K}
$$

$$
\mathbf{W}_{V} \quad \mathbf{b}_{V}
$$

$$
\mathbf{W}_{O} \quad \mathbf{b}_{O}
$$

3. Intermediate variables

4. Output and target

$$
\mathbf{O}_{out}\quad\mathbf{O}_{target}
$$

The equations of the MHA forward pass are as follows:
$$
\mathbf{Q} = \mathbf{Q}_{in}*\mathbf{W}_{Q}+\mathbf{b}_{Q}
$$

$$
\mathbf{K} = \mathbf{K}_{in}*\mathbf{W}_{K}+\mathbf{b}_{K}
$$

$$
\mathbf{V} = \mathbf{V}_{in}*\mathbf{W}_{V}+\mathbf{b}_{V}
$$

$$
\mathbf{S} = \mathbf{Q}*\mathbf{K}^T
$$

$$
\mathbf{P} = SoftmaxFWD(Mask(\mathbf{S}*\frac{1}{\sqrt{d}}))
$$

$$
\mathbf{P} = DropoutFWD(\mathbf{P})
$$

$$
\mathbf{O}=\mathbf{P}*\mathbf{V}
$$

$$
\mathbf{O}_{out} = \mathbf{O}*\mathbf{W}_{O}+\mathbf{b}_{O}
$$
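The following is a minimal single-head Eigen sketch of these forward equations; the mask and dropout steps are omitted for brevity, and the function and variable names (`mha_forward_single_head`, `softmax_rows`, the `(seq_len x d_model)` shapes) are illustrative assumptions rather than the repo's actual API.

```
#include <Eigen/Dense>
#include <cmath>

using Mat = Eigen::MatrixXf;

// Row-wise softmax: each row of S is normalized independently.
Mat softmax_rows(const Mat& S) {
  Eigen::ArrayXf  rowmax = S.array().rowwise().maxCoeff();      // per-row max, for stability
  Eigen::ArrayXXf e      = (S.array().colwise() - rowmax).exp();
  Eigen::ArrayXf  rowsum = e.rowwise().sum();
  return (e.colwise() / rowsum).matrix();                       // each row sums to 1
}

// Q_in/K_in/V_in: (seq_len x d_model); W_Q/W_K/W_V: (d_model x d); W_O: (d x d_model).
Mat mha_forward_single_head(const Mat& Q_in, const Mat& K_in, const Mat& V_in,
                            const Mat& W_Q, const Mat& W_K, const Mat& W_V, const Mat& W_O,
                            const Eigen::RowVectorXf& b_Q, const Eigen::RowVectorXf& b_K,
                            const Eigen::RowVectorXf& b_V, const Eigen::RowVectorXf& b_O) {
  const float d = static_cast<float>(W_Q.cols());   // per-head hidden size
  Mat Q = (Q_in * W_Q).rowwise() + b_Q;             // Q = Q_in*W_Q + b_Q
  Mat K = (K_in * W_K).rowwise() + b_K;             // K = K_in*W_K + b_K
  Mat V = (V_in * W_V).rowwise() + b_V;             // V = V_in*W_V + b_V
  Mat S = Q * K.transpose();                        // S = Q*K^T
  Mat P = softmax_rows(S / std::sqrt(d));           // P = softmax(S/sqrt(d)); mask/dropout omitted
  Mat O = P * V;                                    // O = P*V
  return (O * W_O).rowwise() + b_O;                 // O_out = O*W_O + b_O
}
```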
### MSE Loss
$$
loss = MSELoss(\mathbf{O}_{out},\mathbf{O}_{target})
$$

MSELoss also gives $\mathbf{grad\\_O}_{out}$, the gradient of $\mathbf{O}_{out}$.
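As a quick sanity check on the gradient the loss hands back, here is a hedged Eigen sketch of MSE with mean reduction over all elements (the repo's actual reduction and scaling may differ):

```
#include <Eigen/Dense>

// Returns the scalar loss and writes grad_O_out, assuming mean reduction.
float mse_loss(const Eigen::MatrixXf& O_out, const Eigen::MatrixXf& O_target,
               Eigen::MatrixXf& grad_O_out) {
  const float n = static_cast<float>(O_out.size());
  Eigen::MatrixXf diff = O_out - O_target;
  grad_O_out = (2.0f / n) * diff;           // d(loss)/d(O_out)
  return diff.array().square().sum() / n;   // mean of squared errors
}
```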
### Backward Pass of MHA
1. Gradient of the output (from the following layer, e.g. LayerNorm; here it comes from the MSE loss)

$$
\mathbf{grad\\_O}_{out}
$$

2. Gradients for the intermediate variables

3. Gradients for the forward inputs

$$
\mathbf{grad\\_Q}_{in} \quad \mathbf{grad\\_K}_{in} \quad \mathbf{grad\\_V}_{in}
$$

4. Gradients of the weights and biases

$$
\mathbf{grad\\_W}_{Q} \quad \mathbf{grad\\_b}_{Q}
$$

$$
\mathbf{grad\\_W}_{K} \quad \mathbf{grad\\_b}_{K}
$$

$$
\mathbf{grad\\_W}_{V} \quad \mathbf{grad\\_b}_{V}
$$

$$
\mathbf{grad\\_W}_{O} \quad \mathbf{grad\\_b}_{O}
$$

The equations of the MHA backward pass are as follows:
$$
\mathbf{grad\\_O} = \mathbf{grad\\_O}_{out}*\mathbf{W}_{O}^T
$$

$$
\mathbf{grad\\_W}_{O} = \mathbf{O}^T*\mathbf{grad\\_O}_{out}
$$

$$
\mathbf{grad\\_b}_{O} = colsum(\mathbf{grad\\_O}_{out})
$$

$$
\mathbf{grad\\_P} = \mathbf{grad\\_O}*\mathbf{V}^T
$$

$$
\mathbf{grad\\_V} = \mathbf{P}^T*\mathbf{grad\\_O}
$$

$$
\mathbf{grad\\_P} = DropoutBWD(\mathbf{grad\\_P})
$$

$$
\mathbf{grad\\_S} = SoftmaxBWD(\mathbf{P},\mathbf{grad\\_P})*\frac{1}{\sqrt{d}}
$$

$$
\mathbf{grad\\_Q} = \mathbf{grad\\_S}*\mathbf{K}
$$

$$
\mathbf{grad\\_K} = \mathbf{grad\\_S}^T*\mathbf{Q}
$$

$$
\mathbf{grad\\_Q}_{in} = \mathbf{grad\\_Q}*\mathbf{W}_{Q}^T
$$

$$
\mathbf{grad\\_W}_{Q} = \mathbf{Q}_{in}^T*\mathbf{grad\\_Q}
$$

$$
\mathbf{grad\\_b}_{Q} = colsum(\mathbf{grad\\_Q})
$$

$$
\mathbf{grad\\_K}_{in} = \mathbf{grad\\_K}*\mathbf{W}_{K}^T
$$

$$
\mathbf{grad\\_W}_{K} = \mathbf{K}_{in}^T*\mathbf{grad\\_K}
$$

$$
\mathbf{grad\\_b}_{K} = colsum(\mathbf{grad\\_K})
$$

$$
\mathbf{grad\\_V}_{in} = \mathbf{grad\\_V}*\mathbf{W}_{V}^T
$$

$$
\mathbf{grad\\_W}_{V} = \mathbf{V}_{in}^T*\mathbf{grad\\_V}
$$

$$
\mathbf{grad\\_b}_{V} = colsum(\mathbf{grad\\_V})
$$
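Mirroring the forward sketch earlier, the snippet below walks through these backward equations for a single head in Eigen, again omitting mask and dropout; the `MhaGrads` struct and all names are illustrative assumptions, not the repo's API.

```
#include <Eigen/Dense>
#include <cmath>

using Mat = Eigen::MatrixXf;

struct MhaGrads {
  Mat dQ_in, dK_in, dV_in, dW_Q, dW_K, dW_V, dW_O;
  Eigen::RowVectorXf db_Q, db_K, db_V, db_O;
};

// Row-wise softmax backward: dS_ij = P_ij * (dP_ij - sum_k dP_ik * P_ik)
Mat softmax_backward_rows(const Mat& P, const Mat& dP) {
  Eigen::ArrayXf dot = (dP.array() * P.array()).rowwise().sum();  // per-row sum of dP .* P
  return ((dP.array().colwise() - dot) * P.array()).matrix();
}

MhaGrads mha_backward_single_head(
    const Mat& dO_out,                                  // gradient coming from the loss
    const Mat& Q_in, const Mat& K_in, const Mat& V_in,  // forward inputs
    const Mat& Q, const Mat& K, const Mat& V,           // projected Q, K, V
    const Mat& P, const Mat& O,                         // softmax output and P*V
    const Mat& W_Q, const Mat& W_K, const Mat& W_V, const Mat& W_O) {
  const float scale = 1.0f / std::sqrt(static_cast<float>(W_Q.cols()));
  MhaGrads g;
  Mat dO = dO_out * W_O.transpose();                    // grad_O   = grad_O_out * W_O^T
  g.dW_O = O.transpose() * dO_out;                      // grad_W_O = O^T * grad_O_out
  g.db_O = dO_out.colwise().sum();                      // grad_b_O = colsum(grad_O_out)
  Mat dP = dO * V.transpose();                          // grad_P   = grad_O * V^T
  Mat dV = P.transpose() * dO;                          // grad_V   = P^T * grad_O
  Mat dS = softmax_backward_rows(P, dP) * scale;        // grad_S   = SoftmaxBWD(P, grad_P) / sqrt(d)
  Mat dQ = dS * K;                                      // grad_Q   = grad_S * K
  Mat dK = dS.transpose() * Q;                          // grad_K   = grad_S^T * Q
  g.dQ_in = dQ * W_Q.transpose();  g.dW_Q = Q_in.transpose() * dQ;  g.db_Q = dQ.colwise().sum();
  g.dK_in = dK * W_K.transpose();  g.dW_K = K_in.transpose() * dK;  g.db_K = dK.colwise().sum();
  g.dV_in = dV * W_V.transpose();  g.dW_V = V_in.transpose() * dV;  g.db_V = dV.colwise().sum();
  return g;
}
```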
## The components of the MHA Training Library
### MSE Loss Function

The loss function, as the origin of the gradients in a DL system, is a basic component of a training library. In eigenDNN, the MSE loss is:

```
eidnnStatus_t eidnnMSELoss(
    eidnnHandle_t handle,
    const Tensor &output,  // O_out
    const Tensor &target,  // O_target
    Tensor &loss,          // loss value
    Tensor &d_loss);       // grad_O_out, the gradient of the output
```

### Linear
cuDNN has no specific APIs for a linear layer. In eigenDNN, we have:
```
eidnnStatus_t eidnnLinearForward(eidnnHandle_t handle,
const Tensor& x, // data
const Tensor& w, // weight
const Tensor& bias, // bias
Tensor& y);
```

```
eidnnStatus_t eidnnLinearBackward(eidnnHandle_t handle,
const Tensor& dy,
const Tensor& x,
const Tensor& w,
Tensor& dx, // gradient of input data
Tensor& dw, // accumulated gradient of weight
Tensor& dbias // accumulated gradient of bias
);
```
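For reference, here is a hedged Eigen sketch of what such a forward/backward pair computes, following the $y = x*W + b$ convention of the MHA equations above; note that the weight and bias gradients accumulate (`+=`), matching the comments in the declaration, so they must be zero-initialized before the first call. Names and layouts are assumptions, not the repo's Tensor-based API.

```
#include <Eigen/Dense>

using Mat = Eigen::MatrixXf;

// y = x*W + b, broadcasting the bias over the rows of x.
void linear_forward(const Mat& x, const Mat& W, const Eigen::RowVectorXf& b, Mat& y) {
  y = (x * W).rowwise() + b;
}

// Given dy = d(loss)/dy, produce dx and accumulate dW, db.
void linear_backward(const Mat& dy, const Mat& x, const Mat& W,
                     Mat& dx, Mat& dW, Eigen::RowVectorXf& db) {
  dx  = dy * W.transpose();     // gradient of the input data
  dW += x.transpose() * dy;     // accumulated gradient of the weight
  db += dy.colwise().sum();     // accumulated gradient of the bias
}
```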
### MatMul

$$
C = \beta * C + \alpha*Op_c(MatMul(Op_a(A),Op_b(B)))
$$

where $Op_m(M)$ denotes whether matrix $M$ is transposed or not in the forward pass.

cuDNN has no specific APIs for the matrix-multiply operation. In eigenDNN, we have:
```
eidnnStatus_t eidnnStridedBatchedGemmForward(
eidnnHandle_t handle,
float alpha,
float beta,
bool trans_A, // Op_a
bool trans_B, // Op_b
bool trans_C, // Op_c
const Tensor &A,
const Tensor &B,
Tensor &C);
```

```
eidnnStatus_t eidnnStridedBatchedGemmBackward(
eidnnHandle_t handle,
float alpha,
float beta,
bool trans_A, // Op_a
bool trans_B, // Op_b
bool trans_C, // Op_c
const Tensor &A, // A
const Tensor &B, // B
const Tensor &d_C, // gradient of C
Tensor &d_A, // gradient of A
Tensor &d_B // gradient of B
);
```
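For reference, a single-matrix (non-batched) Eigen sketch of the GEMM formula above; the real routine applies this per attention head over a batch, and the `gemm` helper here is an illustrative assumption.

```
#include <Eigen/Dense>

using Mat = Eigen::MatrixXf;

// C = beta*C + alpha*Op_c(Op_a(A) * Op_b(B)), where each Op_x optionally transposes.
void gemm(float alpha, float beta, bool trans_A, bool trans_B, bool trans_C,
          const Mat& A, const Mat& B, Mat& C) {
  Mat Ao = A, Bo = B;
  if (trans_A) Ao = A.transpose();       // Op_a
  if (trans_B) Bo = B.transpose();       // Op_b
  Mat prod = Ao * Bo;
  if (trans_C) prod.transposeInPlace();  // Op_c
  C = beta * C + alpha * prod;
}
```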
### Softmax
cuDNN has the following APIs for the softmax operation.
* [cudnnSoftmaxForward()](https://docs.nvidia.com/deeplearning/cudnn/api/index.html#cudnnSoftmaxForward)
* [cudnnSoftmaxBackward()](https://docs.nvidia.com/deeplearning/cudnn/api/index.html#cudnnSoftmaxBackward)

In eigenDNN, we have:
```
eidnnStatus_t eidnnSoftmaxForward(eidnnHandle_t handle,
eidnnSoftmaxAlgorithm_t algo,
eidnnSoftmaxMode_t mode,
const Tensor& x,
Tensor& y);
```

```
eidnnStatus_t eidnnSoftmaxBackward(eidnnHandle_t handle,
eidnnSoftmaxAlgorithm_t algo,
eidnnSoftmaxMode_t mode,
const Tensor& y,
const Tensor& dy,
Tensor& dx);
```
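For completeness, the softmax backward over the last dimension (as used for the attention scores) reduces to the standard identity below, where $\mathbf{y}$ is the softmax output and $\mathbf{grad\\_y}$ the incoming gradient:

$$
\mathbf{grad\\_x}_{ij} = \mathbf{y}_{ij}\left(\mathbf{grad\\_y}_{ij} - \sum_{k}\mathbf{grad\\_y}_{ik} \mathbf{y}_{ik}\right)
$$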
### Dropout
cuDNN has the following APIs for the dropout operation.
* [cudnnCreateDropoutDescriptor()](https://docs.nvidia.com/deeplearning/cudnn/api/index.html#cudnnCreateDropoutDescriptor)
* [cudnnDestroyDropoutDescriptor()](https://docs.nvidia.com/deeplearning/cudnn/api/index.html#cudnnDestroyDropoutDescriptor)
* [cudnnDropoutGetStatesSize()](https://docs.nvidia.com/deeplearning/cudnn/api/index.html#cudnnDropoutGetStatesSize)
* [cudnnDropoutGetReserveSpaceSize()](https://docs.nvidia.com/deeplearning/cudnn/api/index.html#cudnnDropoutGetReserveSpaceSize)
* [cudnnDropoutForward()](https://docs.nvidia.com/deeplearning/cudnn/api/index.html#cudnnDropoutForward)
* [cudnnGetDropoutDescriptor()](https://docs.nvidia.com/deeplearning/cudnn/api/index.html#cudnnGetDropoutDescriptor)
* [cudnnRestoreDropoutDescriptor()](https://docs.nvidia.com/deeplearning/cudnn/api/index.html#cudnnRestoreDropoutDescriptor)
* [cudnnSetDropoutDescriptor()](https://docs.nvidia.com/deeplearning/cudnn/api/index.html#cudnnSetDropoutDescriptor)
* [cudnnDropoutBackward()](https://docs.nvidia.com/deeplearning/cudnn/api/index.html#cudnnDropoutBackward)

In eigenDNN, we have:
```
// dropout rate,
// pointer to memory space of states (allocated by forward pass),
// size of memory space in bytes (calculated by forward pass),
// random seed
using eidnnDropoutDescriptor_t = std::tuple<float, void*, size_t, unsigned long long>;
```
```
eidnnStatus_t eidnnDropoutForward(
eidnnHandle_t handle,
eidnnDropoutDescriptor_t &dropoutDesc,
const Tensor &x, // input data
Tensor &y // input data after dropout
);
```

```
eidnnStatus_t eidnnDropoutBackward(
eidnnHandle_t handle,
const eidnnDropoutDescriptor_t dropoutDesc,
const Tensor &dy, // gradient of dropout output data
Tensor &dx // gradient of dropout input data
);
```
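To make the descriptor's role concrete, here is a hedged Eigen sketch of inverted dropout in which an explicit scaled 0/1 mask plays the part of the state that the forward pass creates and the backward pass reuses; the function names and the `mask` argument are illustrative assumptions, not the repo's API.

```
#include <Eigen/Dense>
#include <random>

using Mat = Eigen::MatrixXf;

// Inverted dropout: kept elements are scaled by 1/(1-rate) so that E[y] == x.
void dropout_forward(float rate, unsigned long long seed,
                     const Mat& x, Mat& y, Mat& mask) {
  std::mt19937_64 gen(seed);
  std::bernoulli_distribution keep(1.0 - rate);
  mask.resize(x.rows(), x.cols());
  for (Eigen::Index i = 0; i < mask.rows(); ++i)
    for (Eigen::Index j = 0; j < mask.cols(); ++j)
      mask(i, j) = keep(gen) ? 1.0f / (1.0f - rate) : 0.0f;
  y = x.cwiseProduct(mask);
}

// Backward reuses the exact mask generated in the forward pass.
void dropout_backward(const Mat& mask, const Mat& dy, Mat& dx) {
  dx = dy.cwiseProduct(mask);
}
```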