{"id":20467375,"url":"https://github.com/jundaf2/eigenmha","last_synced_at":"2025-04-13T09:11:50.027Z","repository":{"id":212199122,"uuid":"606634741","full_name":"jundaf2/eigenMHA","owner":"jundaf2","description":"Forward and backward Attention DNN operators implementationed by LibTorch, cuDNN, and Eigen.","archived":false,"fork":false,"pushed_at":"2023-06-06T09:48:12.000Z","size":78842,"stargazers_count":30,"open_issues_count":2,"forks_count":5,"subscribers_count":1,"default_branch":"cudnn","last_synced_at":"2025-03-27T00:54:16.368Z","etag":null,"topics":["backpropagation","cuda","cudnn","cudnn-v8","dnn","inference","pytorch"],"latest_commit_sha":null,"homepage":"","language":"C++","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/jundaf2.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null}},"created_at":"2023-02-26T04:29:24.000Z","updated_at":"2025-02-26T14:54:46.000Z","dependencies_parsed_at":"2023-12-13T02:50:44.869Z","dependency_job_id":"4c69dfd1-558a-4030-a8cb-0a1999fbd1f3","html_url":"https://github.com/jundaf2/eigenMHA","commit_stats":null,"previous_names":["jundaf2/eigenmha"],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/jundaf2%2FeigenMHA","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/jundaf2%2FeigenMHA/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/jundaf2%2FeigenMHA/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/jundaf2%2FeigenMHA/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/
jundaf2","download_url":"https://codeload.github.com/jundaf2/eigenMHA/tar.gz/refs/heads/cudnn","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":248688566,"owners_count":21145766,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["backpropagation","cuda","cudnn","cudnn-v8","dnn","inference","pytorch"],"created_at":"2024-11-15T13:28:17.403Z","updated_at":"2025-04-13T09:11:49.983Z","avatar_url":"https://github.com/jundaf2.png","language":"C++","funding_links":[],"categories":[],"sub_categories":[],"readme":"\n\u003ccenter\u003e\u003cimg src=\"./figures/MHA.png\" ...\u003e\u003c/center\u003e\n\u003ccenter\u003eWhich part will we implement in the transformer model.\u003c/center\u003e\n\n# eigenMHA (eigenDNN vs cuDNN) -- Multi-head Attention Inference and Training implemented by Eigen.\nTo clone this repo, \n```\ngit clone --recursive https://github.com/jundaf2/eigenMHA\ncd eigenMHA\ngit clone https://gitlab.com/libeigen/eigen  # clone eigen if necessary\n```\n\n## Introduction\n In this repo, we use Eigen3 to implement the forward and backward of Multi-head Attention in Transformer models. Basically, this repo has two branches -- `torch` and `cudnn`. \n\n## The MHAs in this repo\n1. a pytorch MHA in `mha.py` that illustrates the MHA module we implement\n2. an eigen MHA in `mha.cc` in both branches (with sources in `./src/eigenDNN.cpp` and headers in `./inlcude/eigenDNN.h`)\n3. a libtorch MHA in the `torch` branch as a comparison to the eigenMHA\n4. 
a cudnn MHA in the `cudnn` branch as a comparison to the eigenMHA\n\n### branch `torch`\n```\ngit checkout torch\n```\n\nIn this branch, eigenDNN is compared with the CPU build of LibTorch. To build and run the project, first install LibTorch, which is needed for verification; see https://github.com/jundaf2/dnn-test-framework  [nnTest mainly focuses on providing a testing framework to train and run inference on Deep Neural Networks using YOUR OWN LIBRARY]. Then,\n```\nmkdir build \u0026\u0026 cd build\ncmake ..\nmake -j4\n./mha\n```\n\n\n### branch `cudnn`\n```\ngit checkout cudnn\n```\nIn this branch, eigenDNN is compared with the Multi-head Attention APIs provided by cuDNN v8 (`cudnn_samples_v8/multiHeadAttention`). \n\nTo install cuDNN, see https://developer.nvidia.com/rdp/cudnn-download and https://docs.nvidia.com/deeplearning/cudnn/install-guide/index.html#installlinux-tar . After copying the corresponding libraries and headers to the correct location, \n```\nmkdir build \u0026\u0026 cd build\ncmake ..\nmake -j4\n./mha\n```\n\nMore specifically, eigenDNN reproduces what cuDNN does in the following APIs for MHA operations.\n* [cudnnCreateAttnDescriptor()](https://docs.nvidia.com/deeplearning/cudnn/api/index.html#cudnnCreateAttnDescriptor)\n* [cudnnSetAttnDescriptor()](https://docs.nvidia.com/deeplearning/cudnn/api/index.html#cudnnSetAttnDescriptor)\n* [cudnnGetAttnDescriptor()](https://docs.nvidia.com/deeplearning/cudnn/api/index.html#cudnnGetAttnDescriptor)\n* [cudnnDestroyAttnDescriptor()](https://docs.nvidia.com/deeplearning/cudnn/api/index.html#cudnnDestroyAttnDescriptor)\n* [cudnnGetMultiHeadAttnBuffers()](https://docs.nvidia.com/deeplearning/cudnn/api/index.html#cudnnGetMultiHeadAttnBuffers)\n* [cudnnGetMultiHeadAttnWeights()](https://docs.nvidia.com/deeplearning/cudnn/api/index.html#cudnnGetMultiHeadAttnWeights)\n* 
[cudnnMultiHeadAttnForward()](https://docs.nvidia.com/deeplearning/cudnn/api/index.html#cudnnMultiHeadAttnForward)\n* [cudnnMultiHeadAttnBackwardData()](https://docs.nvidia.com/deeplearning/cudnn/api/index.html#cudnnMultiHeadAttnBackwardData)\n* [cudnnMultiHeadAttnBackwardWeights()](https://docs.nvidia.com/deeplearning/cudnn/api/index.html#cudnnMultiHeadAttnBackwardWeights)\n\nFor more details of the Attention APIs in cuDNN v8, see this [CSDN article (in Chinese)](http://t.csdn.cn/Hw0Qi).\n\n## What are the variables of MHA in a Training Library?\n\n\u003ccenter\u003e\u003cimg src=\"./figures/attention_train.png\" ...\u003e\u003c/center\u003e\n\n### Forward Pass of MHA\n\n1. Q, K, V input embeddings\n\n$$\n\\mathbf{Q}_{in} \\quad  \\mathbf{K}_{in} \\quad  \\mathbf{V}_{in}\n$$\n\n2. Weights and biases for the linear layers of Q, K, V, and O\n\n$$\n\\mathbf{W}_{Q} \\quad \\mathbf{b}_{Q}\n$$\n\n$$\n\\mathbf{W}_{K} \\quad \\mathbf{b}_{K}\n$$\n\n$$\n\\mathbf{W}_{V} \\quad \\mathbf{b}_{V}\n$$\n\n$$\n\\mathbf{W}_{O} \\quad \\mathbf{b}_{O}\n$$\n\n3. Intermediate variables\n4. Output and target\n\n$$\n\\mathbf{O}_{out}\\quad\\mathbf{O}_{target}\n$$\n\n\nThe equations of the MHA forward pass are as follows:\n\n$$\n\\mathbf{Q} = \\mathbf{Q}_{in}*\\mathbf{W}_{Q}+\\mathbf{b}_{Q}\n$$\n\n$$\n\\mathbf{K} = \\mathbf{K}_{in}*\\mathbf{W}_{K}+\\mathbf{b}_{K}\n$$\n\n$$\n\\mathbf{V} = \\mathbf{V}_{in}*\\mathbf{W}_{V}+\\mathbf{b}_{V}\n$$\n\n$$\n\\mathbf{S} = \\mathbf{Q}*\\mathbf{K}^T\n$$\n\n$$\n\\mathbf{P} = SoftmaxFWD(Mask(\\mathbf{S}*\\frac{1}{\\sqrt{d}}))\n$$\n\n$$\n\\mathbf{P} = DropoutFWD(\\mathbf{P})\n$$\n\n$$\n\\mathbf{O}=\\mathbf{P}*\\mathbf{V}\n$$\n\n$$\n\\mathbf{O}_{out} = \\mathbf{O}*\\mathbf{W}_{O}+\\mathbf{b}_{O}\n$$\n\n### MSE Loss\n$$\nloss = MSELoss(\\mathbf{O}_{out},\\mathbf{O}_{target})\n$$\n\nMSELoss also yields the gradient of $\\mathbf{O}_{out}$:\n\n$$ \\mathbf{grad\\\\_O}_{out} $$\n\n### Backward Pass of MHA\n\n1. 
Gradients for output (from LayerNorm)\n\n$$\n\\mathbf{grad\\\\_O}_{out}\n$$\n\n2. Gradients for the intermediate variables\n3. Gradients for the forward input\n\n$$ \n\\mathbf{grad\\\\_Q}_{in} \\quad \\mathbf{grad\\\\_K}_{in} \\quad \\mathbf{grad\\\\_V}_{in}\n$$\n\n4. Gradients of the weights and biases\n\n$$\n\\mathbf{grad\\\\_W}_{Q} \\quad \\mathbf{grad\\\\_b}_{Q}\n$$\n\n$$\n\\mathbf{grad\\\\_W}_{K} \\quad \\mathbf{grad\\\\_b}_{K}\n$$\n\n$$\n\\mathbf{grad\\\\_W}_{V} \\quad \\mathbf{grad\\\\_b}_{V}\n$$\n\n$$\n\\mathbf{grad\\\\_W}_{O} \\quad \\mathbf{grad\\\\_b}_{O}\n$$\n\nThe equations of the MHA backward pass are as follows:\n\n$$\n\\mathbf{grad\\\\_O} = \\mathbf{grad\\\\_O}_{out}*\\mathbf{W}_{O}^T\n$$\n\n$$\n\\mathbf{grad\\\\_W}_{O} = \\mathbf{O}^T*\\mathbf{grad\\\\_O}_{out}\n$$\n\n$$\n\\mathbf{grad\\\\_b}_{O} = colsum(\\mathbf{grad\\\\_O}_{out})\n$$\n\n$$\n\\mathbf{grad\\\\_P} = \\mathbf{grad\\\\_O}*\\mathbf{V}^T\n$$\n\n$$\n\\mathbf{grad\\\\_V} = \\mathbf{P}^T*\\mathbf{grad\\\\_O}\n$$\n\n$$\n\\mathbf{grad\\\\_P} = DropoutBWD(\\mathbf{grad\\\\_P})\n$$\n\n$$\n\\mathbf{grad\\\\_S} = SoftmaxBWD(\\mathbf{P},\\mathbf{grad\\\\_P})*\\frac{1}{\\sqrt{d}}\n$$\n\n$$\n\\mathbf{grad\\\\_Q} = \\mathbf{grad\\\\_S}*\\mathbf{K}\n$$\n\n$$\n\\mathbf{grad\\\\_K} = \\mathbf{grad\\\\_S}^T*\\mathbf{Q}\n$$\n\n$$\n\\mathbf{grad\\\\_Q}_{in} = \\mathbf{grad\\\\_Q}*\\mathbf{W}_{Q}^T\n$$\n\n$$\n\\mathbf{grad\\\\_W}_{Q} = \\mathbf{Q}_{in}^T*\\mathbf{grad\\\\_Q}\n$$\n\n$$\n\\mathbf{grad\\\\_b}_{Q} = colsum(\\mathbf{grad\\\\_Q})\n$$\n\n$$\n\\mathbf{grad\\\\_K}_{in} = \\mathbf{grad\\\\_K}*\\mathbf{W}_{K}^T\n$$\n\n$$\n\\mathbf{grad\\\\_W}_{K} = \\mathbf{K}_{in}^T*\\mathbf{grad\\\\_K}\n$$\n\n$$\n\\mathbf{grad\\\\_b}_{K} = colsum(\\mathbf{grad\\\\_K})\n$$\n\n$$\n\\mathbf{grad\\\\_V}_{in} = \\mathbf{grad\\\\_V}*\\mathbf{W}_{V}^T\n$$\n\n$$\n\\mathbf{grad\\\\_W}_{V} = \\mathbf{V}_{in}^T*\\mathbf{grad\\\\_V}\n$$\n\n$$\n\\mathbf{grad\\\\_b}_{V} = colsum(\\mathbf{grad\\\\_V})\n$$\n\n  \n## The components of 
the MHA Training Library\n### MSE Loss Function\n\nThe loss function is where backpropagation starts, so it is a basic component of a DL system.\n\n\u003ccenter\u003e\u003cimg src=\"./figures/MSE Loss.PNG\" ...\u003e\u003c/center\u003e\n\u003ccenter\u003e MSE Loss.\u003c/center\u003e\n\n\n```\neidnnStatus_t eidnnMSELoss(\n    eidnnHandle_t handle,\n    const Tensor\u003cfloat, 3\u003e \u0026output, \n    const Tensor\u003cfloat, 3\u003e \u0026target,\n    Tensor\u003cfloat, 0\u003e \u0026loss,\n    Tensor\u003cfloat, 3\u003e \u0026d_loss);\n```\n\n### Linear\ncuDNN has no specific API for the linear layer.\n\nIn eigenDNN, we have\n\n```\neidnnStatus_t eidnnLinearForward(eidnnHandle_t handle,\n                    const Tensor\u003cfloat, 3\u003e\u0026 x, // data\n                    const Tensor\u003cfloat, 2\u003e\u0026 w, // weight\n                    const Tensor\u003cfloat, 1\u003e\u0026 bias, // bias\n                    Tensor\u003cfloat, 3\u003e\u0026 y);\n```\n\n```\neidnnStatus_t eidnnLinearBackward(eidnnHandle_t handle,\n                     const Tensor\u003cfloat, 3\u003e\u0026 dy,\n                     const Tensor\u003cfloat, 3\u003e\u0026 x,\n                     const Tensor\u003cfloat, 2\u003e\u0026 w,\n                     Tensor\u003cfloat, 3\u003e\u0026 dx, // gradient of input data\n                     Tensor\u003cfloat, 2\u003e\u0026 dw, // accumulated gradient of weight\n                     Tensor\u003cfloat, 1\u003e\u0026 dbias // accumulated gradient of bias\n                     );\n```\n\n### MatMul\n\n$$ C = \\beta * C + \\alpha*Op_c(MatMul(Op_a(A),Op_b(B))) $$\n\nwhere $Op_m(M)$ denotes optionally transposing matrix $M$ in the forward pass.\n\ncuDNN has no specific API for the matrix-multiply operation.\n\nIn eigenDNN, we have\n\n```\neidnnStatus_t eidnnStridedBatchedGemmForward(\n    eidnnHandle_t handle,\n    float alpha,\n    float beta,\n    bool trans_A, // Op_a\n    bool trans_B, // Op_b\n    bool trans_C, // Op_c\n    const 
Tensor\u003cfloat, 4\u003e \u0026A, \n    const Tensor\u003cfloat, 4\u003e \u0026B, \n    Tensor\u003cfloat, 4\u003e \u0026C);\n```\n\n```\neidnnStatus_t eidnnStridedBatchedGemmBackward(\n    eidnnHandle_t handle,\n    float alpha,\n    float beta,\n    bool trans_A, // Op_a\n    bool trans_B, // Op_b\n    bool trans_C, // Op_c\n    const Tensor\u003cfloat, 4\u003e \u0026A, // A\n    const Tensor\u003cfloat, 4\u003e \u0026B, // B\n    const Tensor\u003cfloat, 4\u003e \u0026d_C, // gradient of C\n    Tensor\u003cfloat, 4\u003e \u0026d_A, // gradient of A\n    Tensor\u003cfloat, 4\u003e \u0026d_B // gradient of B\n    );\n```\n### Softmax\ncuDNN has the following APIs for softmax operation.\n* [cudnnSoftmaxForward()](https://docs.nvidia.com/deeplearning/cudnn/api/index.html#cudnnSoftmaxForward)\n* [cudnnSoftmaxBackward()](https://docs.nvidia.com/deeplearning/cudnn/api/index.html#cudnnSoftmaxBackward)\n\nIn eigenDNN, we have\n\n```\neidnnStatus_t eidnnSoftmaxForward(eidnnHandle_t handle,\n                    eidnnSoftmaxAlgorithm_t algo,\n                    eidnnSoftmaxMode_t mode,\n                    const Tensor\u003cfloat, 4\u003e\u0026 x,\n                    Tensor\u003cfloat, 4\u003e\u0026 y);\n```\n\n```\neidnnStatus_t eidnnSoftmaxBackward(eidnnHandle_t handle,\n                     eidnnSoftmaxAlgorithm_t algo,\n                     eidnnSoftmaxMode_t mode,\n                     const Tensor\u003cfloat, 4\u003e\u0026 y,\n                     const Tensor\u003cfloat, 4\u003e\u0026 dy,\n                     Tensor\u003cfloat, 4\u003e\u0026 dx);\n```\n\n### Dropout\ncuDNN has the following APIs for dropout operation.\n* [cudnnCreateDropoutDescriptor()]()\n* [cudnnDestroyDropoutDescriptor()]()\n* [cudnnDropoutGetStatesSize()]()\n* [cudnnDropoutGetReserveSpaceSize()]()\n* [cudnnDropoutForward()]()\n* [cudnnGetDropoutDescriptor()]()\n* [cudnnRestoreDropoutDescriptor()]()\n* [cudnnSetDropoutDescriptor()]()\n* [cudnnDropoutBackward()]()\n\nIn eigenDNN, we 
have\n\n```\n// dropout rate, \n// pointer to memory space of states (allocated by forward pass), \n// size of memory space in bytes (calculated by forward pass), \n// random seed\nusing eidnnDropoutDescriptor_t = std::tuple\u003cfloat, void*, size_t, unsigned long long\u003e; \n```\n```\neidnnStatus_t eidnnDropoutForward(\n    eidnnHandle_t                       handle,\n    eidnnDropoutDescriptor_t      \u0026dropoutDesc,\n    const Tensor\u003cfloat, 4\u003e         \u0026x, // input data\n    Tensor\u003cfloat, 4\u003e               \u0026y // input data after dropout\n    );\n```\n\n```\neidnnStatus_t eidnnDropoutBackward(\n    eidnnHandle_t                   handle,\n    const eidnnDropoutDescriptor_t  dropoutDesc,\n    const Tensor\u003cfloat, 4\u003e       \u0026dy, // gradient of dropout output data\n    Tensor\u003cfloat, 4\u003e             \u0026dx // gradient of dropout input data\n    );\n```\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fjundaf2%2Feigenmha","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fjundaf2%2Feigenmha","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fjundaf2%2Feigenmha/lists"}