# eigenMHA (eigenDNN vs cuDNN) -- Multi-head Attention Inference and Training implemented in Eigen

*(Figure: which part of the transformer model this repo implements.)*
To clone this repo,
```
git clone --recursive https://github.com/jundaf2/eigenMHA
cd eigenMHA
git clone https://gitlab.com/libeigen/eigen # clone eigen if necessary
```

## Introduction
In this repo, we use Eigen3 to implement the forward and backward passes of multi-head attention (MHA) in Transformer models. The repo has two branches -- `torch` and `cudnn`.

## The MHAs in this repo
1. a PyTorch MHA in `mha.py` that illustrates the MHA module we implement
2. an Eigen MHA in `mha.cc` in both branches (with sources in `./src/eigenDNN.cpp` and headers in `./include/eigenDNN.h`)
3. a LibTorch MHA in the `torch` branch as a comparison to the eigenMHA
4. a cuDNN MHA in the `cudnn` branch as a comparison to the eigenMHA

### branch `torch`
```
git checkout torch
```

In this branch, eigenDNN is compared against the CPU LibTorch implementation. To build and run the project, first install LibTorch for the verification step; see https://github.com/jundaf2/dnn-test-framework (nnTest mainly focuses on providing a testing framework to train and run inference on deep neural networks using YOUR OWN LIBRARY). Then,
```
mkdir build && cd build
cmake ..
make -j4
./mha
```

### branch `cudnn`
```
git checkout cudnn
```
In this branch, eigenDNN is compared with the multi-head attention APIs provided by cuDNN v8 (`cudnn_samples_v8/multiHeadAttention`).

To install cuDNN, see https://developer.nvidia.com/rdp/cudnn-download and https://docs.nvidia.com/deeplearning/cudnn/install-guide/index.html#installlinux-tar . After copying the corresponding libraries and headers to the correct location,
```
mkdir build && cd build
cmake ..
make -j4
./mha
```

More specifically, eigenDNN covers the functionality that cuDNN exposes through the following MHA APIs.
* [cudnnCreateAttnDescriptor()](https://docs.nvidia.com/deeplearning/cudnn/api/index.html#cudnnCreateAttnDescriptor)
* [cudnnSetAttnDescriptor()](https://docs.nvidia.com/deeplearning/cudnn/api/index.html#cudnnSetAttnDescriptor)
* [cudnnGetAttnDescriptor()](https://docs.nvidia.com/deeplearning/cudnn/api/index.html#cudnnGetAttnDescriptor)
* [cudnnDestroyAttnDescriptor()](https://docs.nvidia.com/deeplearning/cudnn/api/index.html#cudnnDestroyAttnDescriptor)
* [cudnnGetMultiHeadAttnBuffers()](https://docs.nvidia.com/deeplearning/cudnn/api/index.html#cudnnGetMultiHeadAttnBuffers)
* [cudnnGetMultiHeadAttnWeights()](https://docs.nvidia.com/deeplearning/cudnn/api/index.html#cudnnGetMultiHeadAttnWeights)
* [cudnnMultiHeadAttnForward()](https://docs.nvidia.com/deeplearning/cudnn/api/index.html#cudnnMultiHeadAttnForward)
* [cudnnMultiHeadAttnBackwardData()](https://docs.nvidia.com/deeplearning/cudnn/api/index.html#cudnnMultiHeadAttnBackwardData)
* [cudnnMultiHeadAttnBackwardWeights()](https://docs.nvidia.com/deeplearning/cudnn/api/index.html#cudnnMultiHeadAttnBackwardWeights)

For more details on the attention APIs in cuDNN v8, see this [CSDN article (in Chinese)](http://t.csdn.cn/Hw0Qi).

## What are the variables of MHA in a Training Library?

### Forward Pass of MHA

1. Q, K, V input embeddings

$$
\mathbf{Q}_{in} \quad \mathbf{K}_{in} \quad \mathbf{V}_{in}
$$

2. Weights and biases for the linear layers of Q, K, V, and O

$$
\mathbf{W}_{Q} \quad \mathbf{b}_{Q}
$$

$$
\mathbf{W}_{K} \quad \mathbf{b}_{K}
$$

$$
\mathbf{W}_{V} \quad \mathbf{b}_{V}
$$

$$
\mathbf{W}_{O} \quad \mathbf{b}_{O}
$$

3. Intermediate variables
4. Output and target

$$
\mathbf{O}_{out}\quad\mathbf{O}_{target}
$$

The equations of the MHA forward pass are as follows:

$$
\mathbf{Q} = \mathbf{Q}_{in}*\mathbf{W}_{Q}+\mathbf{b}_{Q}
$$

$$
\mathbf{K} = \mathbf{K}_{in}*\mathbf{W}_{K}+\mathbf{b}_{K}
$$

$$
\mathbf{V} = \mathbf{V}_{in}*\mathbf{W}_{V}+\mathbf{b}_{V}
$$

$$
\mathbf{S} = \mathbf{Q}*\mathbf{K}^T
$$

$$
\mathbf{P} = SoftmaxFWD(Mask(\mathbf{S}*\frac{1}{\sqrt{d}}))
$$

$$
\mathbf{P} = DropoutFWD(\mathbf{P})
$$

$$
\mathbf{O}=\mathbf{P}*\mathbf{V}
$$

$$
\mathbf{O}_{out} = \mathbf{O}*\mathbf{W}_{O}+\mathbf{b}_{O}
$$
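
For reference, here is a minimal single-head version of this forward pass written directly with Eigen. It is only a sketch of the math above: biases, masking, dropout, and the batch/head dimensions are omitted, and the names are illustrative rather than part of the eigenDNN API.

```
// Minimal single-head attention forward pass with Eigen (illustrative sketch).
#include <Eigen/Dense>
#include <cmath>
#include <iostream>

using Mat = Eigen::MatrixXf;

// Row-wise softmax with the usual max-subtraction for numerical stability.
Mat softmaxRows(const Mat& S) {
  Mat shifted = S.colwise() - S.rowwise().maxCoeff();
  Mat e = shifted.array().exp().matrix();
  Eigen::VectorXf rowSum = e.rowwise().sum();
  return rowSum.cwiseInverse().asDiagonal() * e;   // divide each row by its sum
}

int main() {
  const int seq_len = 4, d = 8;                    // sequence length, head size
  Mat Q_in = Mat::Random(seq_len, d);              // Q/K/V input embeddings
  Mat K_in = Mat::Random(seq_len, d);
  Mat V_in = Mat::Random(seq_len, d);
  Mat W_Q = Mat::Random(d, d), W_K = Mat::Random(d, d);
  Mat W_V = Mat::Random(d, d), W_O = Mat::Random(d, d);

  Mat Q = Q_in * W_Q, K = K_in * W_K, V = V_in * W_V;   // linear projections
  Mat S = Q * K.transpose();                            // S = Q * K^T
  Mat P = softmaxRows(S / std::sqrt(float(d)));         // P = Softmax(S / sqrt(d))
  Mat O = P * V;                                        // O = P * V
  Mat O_out = O * W_O;                                  // output projection

  std::cout << "O_out:\n" << O_out << std::endl;
  return 0;
}
```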

### MSE Loss
$$
loss = MSELoss(\mathbf{O}_{out},\mathbf{O}_{target})
$$

The MSE loss also gives $\mathbf{grad\\_O}_{out}$, the gradient of $\mathbf{O}_{out}$, which is the starting point of the backward pass.

### Backward Pass of MHA

1. Gradient of the output (from the layer that follows MHA, e.g. LayerNorm; here, from the MSE loss)

$$
\mathbf{grad\\_O}_{out}
$$

2. Gradients for the intermediate variables
3. Gradients for the forward input

$$
\mathbf{grad\\_Q}_{in} \quad \mathbf{grad\\_K}_{in} \quad \mathbf{grad\\_V}_{in}
$$

4. Gradients of the weights and biases

$$
\mathbf{grad\\_W}_{Q} \quad \mathbf{grad\\_b}_{Q}
$$

$$
\mathbf{grad\\_W}_{K} \quad \mathbf{grad\\_b}_{K}
$$

$$
\mathbf{grad\\_W}_{V} \quad \mathbf{grad\\_b}_{V}
$$

$$
\mathbf{grad\\_W}_{O} \quad \mathbf{grad\\_b}_{O}
$$

The equations of the MHA backward pass are as follows:

$$
\mathbf{grad\\_O} = \mathbf{grad\\_O}_{out}*\mathbf{W}_{O}
$$

$$
\mathbf{grad\\_W}_{O} = \mathbf{grad\\_O}_{out}^T*\mathbf{O}
$$

$$
\mathbf{grad\\_b}_{O} = colsum(\mathbf{grad\\_O}_{out})
$$

$$
\mathbf{grad\\_P} = \mathbf{grad\\_O}*\mathbf{V}^T
$$

$$
\mathbf{grad\\_V} = \mathbf{P}^T*\mathbf{grad\\_O}
$$

$$
\mathbf{grad\\_P} = DropoutBWD(\mathbf{grad\\_P})
$$

$$
\mathbf{grad\\_S} = SoftmaxBWD(\mathbf{P},\mathbf{grad\\_P})*\frac{1}{\sqrt{d}}
$$

$$
\mathbf{grad\\_Q} = \mathbf{grad\\_S}*\mathbf{K}
$$

$$
\mathbf{grad\\_K} = \mathbf{grad\\_S}^T*\mathbf{Q}
$$

$$
\mathbf{grad\\_Q}_{in} = \mathbf{grad\\_Q}*\mathbf{W}_{Q}^T
$$

$$
\mathbf{grad\\_W}_{Q} = \mathbf{Q}_{in}^T*\mathbf{grad\\_Q}
$$

$$
\mathbf{grad\\_b}_{Q} = colsum(\mathbf{grad\\_Q})
$$

$$
\mathbf{grad\\_K}_{in} = \mathbf{grad\\_K}*\mathbf{W}_{K}^T
$$

$$
\mathbf{grad\\_W}_{K} = \mathbf{K}_{in}^T*\mathbf{grad\\_K}
$$

$$
\mathbf{grad\\_b}_{K} = colsum(\mathbf{grad\\_K})
$$

$$
\mathbf{grad\\_V}_{in} = \mathbf{grad\\_V}*\mathbf{W}_{V}^T
$$

$$
\mathbf{grad\\_W}_{V} = \mathbf{V}_{in}^T*\mathbf{grad\\_V}
$$

$$
\mathbf{grad\\_b}_{V} = colsum(\mathbf{grad\\_V})
$$
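
The core of this backward pass (from $\mathbf{grad\\_O}$ down to $\mathbf{grad\\_Q}$, $\mathbf{grad\\_K}$, and $\mathbf{grad\\_V}$) can be sketched in Eigen as below. Again this is a single-head illustration with dropout and masking omitted; the function names are not part of eigenDNN.

```
// Sketch of the attention-core backward computations (illustrative).
#include <Eigen/Dense>
#include <cmath>

using Mat = Eigen::MatrixXf;

// Backward of a row-wise softmax: dS_ij = P_ij * (dP_ij - sum_k dP_ik * P_ik).
Mat softmaxRowsBackward(const Mat& P, const Mat& dP) {
  Eigen::VectorXf rowDot = dP.cwiseProduct(P).rowwise().sum();
  Mat centered = dP.colwise() - rowDot;
  return P.cwiseProduct(centered);
}

// Given forward tensors Q, K, V, P and grad_O, produce grad_Q, grad_K, grad_V.
void attentionCoreBackward(const Mat& Q, const Mat& K, const Mat& V,
                           const Mat& P, const Mat& grad_O,
                           Mat& grad_Q, Mat& grad_K, Mat& grad_V) {
  const float scale = 1.0f / std::sqrt(static_cast<float>(Q.cols()));
  Mat grad_P = grad_O * V.transpose();                  // grad_P = grad_O * V^T
  grad_V = P.transpose() * grad_O;                      // grad_V = P^T * grad_O
  Mat grad_S = softmaxRowsBackward(P, grad_P) * scale;  // grad_S = SoftmaxBWD(P, grad_P) / sqrt(d)
  grad_Q = grad_S * K;                                  // grad_Q = grad_S * K
  grad_K = grad_S.transpose() * Q;                      // grad_K = grad_S^T * Q
}
```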


## The components of the MHA Training Library
### MSE Loss Function

The loss function, as the starting point of backpropagation, is a basic component of a deep learning system. eigenDNN provides an MSE loss:

```
eidnnStatus_t eidnnMSELoss(
    eidnnHandle_t handle,
    const Tensor &output,
    const Tensor &target,
    Tensor &loss,
    Tensor &d_loss);
```
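
As a rough illustration of what this routine computes (assuming the loss is the mean over all elements, which gives the 2/N factor in the gradient), a standalone Eigen version could look like the sketch below; the `Tensor` type and the exact reduction used by eigenDNN may differ.

```
// Mean-squared-error loss and its gradient with Eigen (illustrative sketch).
#include <Eigen/Dense>
#include <iostream>

using Mat = Eigen::MatrixXf;

// loss = mean((output - target)^2),  d_loss = 2/N * (output - target)
float mseLossRef(const Mat& output, const Mat& target, Mat& d_loss) {
  const float N = static_cast<float>(output.size());
  Mat diff = output - target;
  d_loss = (2.0f / N) * diff;
  return diff.squaredNorm() / N;
}

int main() {
  Mat out = Mat::Random(2, 3), tgt = Mat::Random(2, 3), d_loss;
  std::cout << "loss = " << mseLossRef(out, tgt, d_loss) << "\n"
            << "d_loss:\n" << d_loss << std::endl;
  return 0;
}
```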

### Linear
cuDNN has no specific API for the linear layer.

In eigenDNN, we have

```
eidnnStatus_t eidnnLinearForward(eidnnHandle_t handle,
    const Tensor& x,    // data
    const Tensor& w,    // weight
    const Tensor& bias, // bias
    Tensor& y);
```

```
eidnnStatus_t eidnnLinearBackward(eidnnHandle_t handle,
    const Tensor& dy,
    const Tensor& x,
    const Tensor& w,
    Tensor& dx,    // gradient of input data
    Tensor& dw,    // accumulated gradient of weight
    Tensor& dbias  // accumulated gradient of bias
    );
```
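
A plain-Eigen sketch of the same math for 2-D data, using the y = x*W + b convention from the forward equations above (the repo's `Tensor` layout and batching are not modeled here, and the names are illustrative):

```
// Linear layer forward/backward with Eigen (illustrative sketch).
#include <Eigen/Dense>

using Mat = Eigen::MatrixXf;
using Vec = Eigen::VectorXf;

// y = x * W + b  (b broadcast over rows)
void linearForwardRef(const Mat& x, const Mat& W, const Vec& b, Mat& y) {
  y = (x * W).rowwise() + b.transpose();
}

// dx = dy * W^T,  dW += x^T * dy,  db += colsum(dy)
// dW and db are accumulated (+=) to match the declaration above;
// the caller pre-sizes and zeroes them before the first call.
void linearBackwardRef(const Mat& dy, const Mat& x, const Mat& W,
                       Mat& dx, Mat& dW, Vec& db) {
  dx = dy * W.transpose();
  dW += x.transpose() * dy;
  db += dy.colwise().sum().transpose();
}
```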

### MatMul

$$ C = \beta * C + \alpha*Op_c(MatMul(Op_a(A),Op_b(B))) $$

where $Op_m(M)$ denotes an optional transpose of matrix $M$ in the forward pass.

cuDNN has no specific API for the matrix-multiply operation.

In eigenDNN, we have

```
eidnnStatus_t eidnnStridedBatchedGemmForward(
    eidnnHandle_t handle,
    float alpha,
    float beta,
    bool trans_A, // Op_a
    bool trans_B, // Op_b
    bool trans_C, // Op_c
    const Tensor &A,
    const Tensor &B,
    Tensor &C);
```

```
eidnnStatus_t eidnnStridedBatchedGemmBackward(
    eidnnHandle_t handle,
    float alpha,
    float beta,
    bool trans_A, // Op_a
    bool trans_B, // Op_b
    bool trans_C, // Op_c
    const Tensor &A,   // A
    const Tensor &B,   // B
    const Tensor &d_C, // gradient of C
    Tensor &d_A,       // gradient of A
    Tensor &d_B        // gradient of B
    );
```
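
A plain-Eigen sketch of the batched GEMM for the non-transposed case (`trans_A = trans_B = trans_C = false`); the transpose flags, strides, the repo's `Tensor` layout, and the gradient contribution through the `beta * C` term are left out, and the names are illustrative:

```
// Batched GEMM forward/backward with Eigen, non-transposed case (illustrative).
#include <Eigen/Dense>
#include <vector>

using Mat = Eigen::MatrixXf;
using Batch = std::vector<Mat>;

// C[i] = beta * C[i] + alpha * A[i] * B[i]
void gemmForwardRef(float alpha, float beta,
                    const Batch& A, const Batch& B, Batch& C) {
  for (size_t i = 0; i < A.size(); ++i)
    C[i] = beta * C[i] + alpha * A[i] * B[i];
}

// d_A[i] = alpha * d_C[i] * B[i]^T,  d_B[i] = alpha * A[i]^T * d_C[i]
// d_A and d_B must already hold one matrix per batch entry.
void gemmBackwardRef(float alpha,
                     const Batch& A, const Batch& B, const Batch& d_C,
                     Batch& d_A, Batch& d_B) {
  for (size_t i = 0; i < A.size(); ++i) {
    d_A[i] = alpha * d_C[i] * B[i].transpose();
    d_B[i] = alpha * A[i].transpose() * d_C[i];
  }
}
```
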
### Softmax
cuDNN has the following APIs for the softmax operation.
* [cudnnSoftmaxForward()](https://docs.nvidia.com/deeplearning/cudnn/api/index.html#cudnnSoftmaxForward)
* [cudnnSoftmaxBackward()](https://docs.nvidia.com/deeplearning/cudnn/api/index.html#cudnnSoftmaxBackward)

In eigenDNN, we have

```
eidnnStatus_t eidnnSoftmaxForward(eidnnHandle_t handle,
    eidnnSoftmaxAlgorithm_t algo,
    eidnnSoftmaxMode_t mode,
    const Tensor& x,
    Tensor& y);
```

```
eidnnStatus_t eidnnSoftmaxBackward(eidnnHandle_t handle,
    eidnnSoftmaxAlgorithm_t algo,
    eidnnSoftmaxMode_t mode,
    const Tensor& y,
    const Tensor& dy,
    Tensor& dx);
```
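
For a single row with softmax output y, the Jacobian is diag(y) - y*y^T, which gives the elementwise backward form dx_j = y_j * (dy_j - dot(dy, y)) used in the MHA backward equations above. The small Eigen check below compares the two forms on one row; it is an illustration, not part of the repo:

```
// Check the elementwise softmax backward against the explicit Jacobian.
#include <Eigen/Dense>
#include <iostream>

using Vec = Eigen::VectorXf;
using Mat = Eigen::MatrixXf;

int main() {
  Vec x = Vec::Random(5);
  Vec y = (x.array() - x.maxCoeff()).exp().matrix();
  y /= y.sum();                                       // softmax of one row

  Vec dy = Vec::Random(5);                            // incoming gradient
  Mat J = Mat(y.asDiagonal()) - y * y.transpose();    // Jacobian of softmax (symmetric)
  Vec dx_jacobian = J * dy;

  // Elementwise form: dx_j = y_j * (dy_j - dot(dy, y)).
  Vec dx_elementwise = (y.array() * (dy.array() - dy.dot(y))).matrix();

  std::cout << (dx_jacobian - dx_elementwise).norm() << std::endl;  // ~0
  return 0;
}
```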

### Dropout
cuDNN has the following APIs for the dropout operation.
* [cudnnCreateDropoutDescriptor()](https://docs.nvidia.com/deeplearning/cudnn/api/index.html#cudnnCreateDropoutDescriptor)
* [cudnnDestroyDropoutDescriptor()](https://docs.nvidia.com/deeplearning/cudnn/api/index.html#cudnnDestroyDropoutDescriptor)
* [cudnnDropoutGetStatesSize()](https://docs.nvidia.com/deeplearning/cudnn/api/index.html#cudnnDropoutGetStatesSize)
* [cudnnDropoutGetReserveSpaceSize()](https://docs.nvidia.com/deeplearning/cudnn/api/index.html#cudnnDropoutGetReserveSpaceSize)
* [cudnnDropoutForward()](https://docs.nvidia.com/deeplearning/cudnn/api/index.html#cudnnDropoutForward)
* [cudnnGetDropoutDescriptor()](https://docs.nvidia.com/deeplearning/cudnn/api/index.html#cudnnGetDropoutDescriptor)
* [cudnnRestoreDropoutDescriptor()](https://docs.nvidia.com/deeplearning/cudnn/api/index.html#cudnnRestoreDropoutDescriptor)
* [cudnnSetDropoutDescriptor()](https://docs.nvidia.com/deeplearning/cudnn/api/index.html#cudnnSetDropoutDescriptor)
* [cudnnDropoutBackward()](https://docs.nvidia.com/deeplearning/cudnn/api/index.html#cudnnDropoutBackward)

In eigenDNN, we have

```
// dropout rate,
// pointer to memory space of states (allocated by forward pass),
// size of memory space in bytes (calculated by forward pass),
// random seed
using eidnnDropoutDescriptor_t = std::tuple<float, void*, size_t, unsigned long long>;
```
```
eidnnStatus_t eidnnDropoutForward(
    eidnnHandle_t handle,
    eidnnDropoutDescriptor_t &dropoutDesc,
    const Tensor &x, // input data
    Tensor &y        // input data after dropout
    );
```

```
eidnnStatus_t eidnnDropoutBackward(
    eidnnHandle_t handle,
    const eidnnDropoutDescriptor_t dropoutDesc,
    const Tensor &dy, // gradient of dropout output data
    Tensor &dx        // gradient of dropout input data
    );
```
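
A plain-Eigen sketch of an inverted-dropout forward and backward; here the scaled keep-mask is stored explicitly instead of the state buffer and seed kept in the descriptor above, so it only illustrates the math, and the names are not part of eigenDNN:

```
// Inverted-dropout forward/backward with Eigen (illustrative sketch).
#include <Eigen/Dense>
#include <random>

using Mat = Eigen::MatrixXf;

// Forward: keep each element with probability (1 - rate) and scale it by
// 1/(1 - rate), recording the scaled mask for the backward pass.
void dropoutForwardRef(float rate, unsigned long long seed,
                       const Mat& x, Mat& y, Mat& mask) {
  std::mt19937_64 gen(seed);                       // seeded RNG, cf. the descriptor's seed field
  std::bernoulli_distribution keep(1.0 - rate);
  mask.resize(x.rows(), x.cols());
  for (Eigen::Index j = 0; j < mask.cols(); ++j)
    for (Eigen::Index i = 0; i < mask.rows(); ++i)
      mask(i, j) = keep(gen) ? 1.0f / (1.0f - rate) : 0.0f;
  y = x.cwiseProduct(mask);
}

// Backward: dropout is elementwise, so the gradient reuses the same mask.
void dropoutBackwardRef(const Mat& mask, const Mat& dy, Mat& dx) {
  dx = dy.cwiseProduct(mask);
}
```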