{"id":16947851,"url":"https://github.com/hyunwoongko/transformer","last_synced_at":"2025-05-14T19:01:54.683Z","repository":{"id":38260325,"uuid":"215272388","full_name":"hyunwoongko/transformer","owner":"hyunwoongko","description":"Transformer: PyTorch Implementation of \"Attention Is All You Need\"","archived":false,"fork":false,"pushed_at":"2024-08-06T14:40:08.000Z","size":2043,"stargazers_count":3591,"open_issues_count":14,"forks_count":510,"subscribers_count":10,"default_branch":"master","last_synced_at":"2025-04-11T06:17:47.511Z","etag":null,"topics":["attention","dataset","pytorch","transformer"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/hyunwoongko.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2019-10-15T10:36:00.000Z","updated_at":"2025-04-10T12:59:12.000Z","dependencies_parsed_at":"2024-10-30T15:00:25.734Z","dependency_job_id":"9340cd91-4740-49ca-8257-0254cca910c5","html_url":"https://github.com/hyunwoongko/transformer","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/hyunwoongko%2Ftransformer","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/hyunwoongko%2Ftransformer/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/hyunwoongko%2Ftransformer/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/hyunwoongko%2Ftransformer/manifests","owner_url":"https://rep
os.ecosyste.ms/api/v1/hosts/GitHub/owners/hyunwoongko","download_url":"https://codeload.github.com/hyunwoongko/transformer/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":248351394,"owners_count":21089272,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["attention","dataset","pytorch","transformer"],"created_at":"2024-10-13T21:48:38.369Z","updated_at":"2025-04-11T06:17:51.461Z","avatar_url":"https://github.com/hyunwoongko.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"### WARNING\nThis code was written in 2019, when I was not very familiar with the Transformer model.\nSo don't trust this code too much. I am not actively maintaining this code, so please open a pull request if you find a bug and want to fix it.\n\n# Transformer\nMy own implementation of the Transformer model (Attention is All You Need - Google Brain, 2017)\n\u003cbr\u003e\u003cbr\u003e\n![model](image/model.png)\n\u003cbr\u003e\u003cbr\u003e\n\n## 1. 
Implementations\n\n### 1.1 Positional Encoding\n\n![model](image/positional_encoding.jpg)\n   \n    \n```python\nclass PositionalEncoding(nn.Module):\n    \"\"\"\n    compute sinusoid encoding.\n    \"\"\"\n    def __init__(self, d_model, max_len, device):\n        \"\"\"\n        constructor of sinusoid encoding class\n\n        :param d_model: dimension of model\n        :param max_len: max sequence length\n        :param device: hardware device setting\n        \"\"\"\n        super(PositionalEncoding, self).__init__()\n\n        # same size with input matrix (for adding with input matrix)\n        self.encoding = torch.zeros(max_len, d_model, device=device)\n        self.encoding.requires_grad = False  # we don't need to compute gradient\n\n        pos = torch.arange(0, max_len, device=device)\n        pos = pos.float().unsqueeze(dim=1)\n        # 1D =\u003e 2D unsqueeze to represent word's position\n\n        _2i = torch.arange(0, d_model, step=2, device=device).float()\n        # 'i' means index of d_model (e.g. 
embedding size = 50, 'i' = [0,50])\n        # \"step=2\" means 'i' multiplied with two (same with 2 * i)\n\n        self.encoding[:, 0::2] = torch.sin(pos / (10000 ** (_2i / d_model)))\n        self.encoding[:, 1::2] = torch.cos(pos / (10000 ** (_2i / d_model)))\n        # compute positional encoding to consider positional information of words\n\n    def forward(self, x):\n        # self.encoding\n        # [max_len = 512, d_model = 512]\n\n        batch_size, seq_len = x.size()\n        # [batch_size = 128, seq_len = 30]\n\n        return self.encoding[:seq_len, :]\n        # [seq_len = 30, d_model = 512]\n        # it will add with tok_emb : [128, 30, 512]         \n```\n\u003cbr\u003e\u003cbr\u003e\n\n### 1.2 Multi-Head Attention\n\n\n![model](image/multi_head_attention.jpg)\n\n```python\nclass MultiHeadAttention(nn.Module):\n\n    def __init__(self, d_model, n_head):\n        super(MultiHeadAttention, self).__init__()\n        self.n_head = n_head\n        self.attention = ScaleDotProductAttention()\n        self.w_q = nn.Linear(d_model, d_model)\n        self.w_k = nn.Linear(d_model, d_model)\n        self.w_v = nn.Linear(d_model, d_model)\n        self.w_concat = nn.Linear(d_model, d_model)\n\n    def forward(self, q, k, v, mask=None):\n        # 1. dot product with weight matrices\n        q, k, v = self.w_q(q), self.w_k(k), self.w_v(v)\n\n        # 2. split tensor by number of heads\n        q, k, v = self.split(q), self.split(k), self.split(v)\n\n        # 3. do scale dot product to compute similarity\n        out, attention = self.attention(q, k, v, mask=mask)\n        \n        # 4. concat and pass to linear layer\n        out = self.concat(out)\n        out = self.w_concat(out)\n\n        # 5. 
visualize attention map\n        # TODO : we should implement visualization\n\n        return out\n\n    def split(self, tensor):\n        \"\"\"\n        split tensor by number of head\n\n        :param tensor: [batch_size, length, d_model]\n        :return: [batch_size, head, length, d_tensor]\n        \"\"\"\n        batch_size, length, d_model = tensor.size()\n\n        d_tensor = d_model // self.n_head\n        tensor = tensor.view(batch_size, length, self.n_head, d_tensor).transpose(1, 2)\n        # it is similar with group convolution (split by number of heads)\n\n        return tensor\n\n    def concat(self, tensor):\n        \"\"\"\n        inverse function of self.split(tensor : torch.Tensor)\n\n        :param tensor: [batch_size, head, length, d_tensor]\n        :return: [batch_size, length, d_model]\n        \"\"\"\n        batch_size, head, length, d_tensor = tensor.size()\n        d_model = head * d_tensor\n\n        tensor = tensor.transpose(1, 2).contiguous().view(batch_size, length, d_model)\n        return tensor\n```\n\u003cbr\u003e\u003cbr\u003e\n\n### 1.3 Scale Dot Product Attention\n\n![model](image/scale_dot_product_attention.jpg)\n\n```python\nclass ScaleDotProductAttention(nn.Module):\n    \"\"\"\n    compute scale dot product attention\n\n    Query : given sentence that we focused on (decoder)\n    Key : every sentence to check relationship with Query (encoder)\n    Value : every sentence same with Key (encoder)\n    \"\"\"\n\n    def __init__(self):\n        super(ScaleDotProductAttention, self).__init__()\n        self.softmax = nn.Softmax(dim=-1)\n\n    def forward(self, q, k, v, mask=None, e=1e-12):\n        # input is 4 dimension tensor\n        # [batch_size, head, length, d_tensor]\n        batch_size, head, length, d_tensor = k.size()\n\n        # 1. dot product Query with Key^T to compute similarity\n        k_t = k.transpose(2, 3)  # transpose\n        score = (q @ k_t) / math.sqrt(d_tensor)  # scaled dot product\n\n        # 2. 
apply masking (opt)\n        if mask is not None:\n            score = score.masked_fill(mask == 0, -10000)\n\n        # 3. pass them softmax to make [0, 1] range\n        score = self.softmax(score)\n\n        # 4. multiply with Value\n        v = score @ v\n\n        return v, score\n```\n\u003cbr\u003e\u003cbr\u003e\n\n### 1.4 Layer Norm\n\n![model](image/layer_norm.jpg)\n    \n```python\nclass LayerNorm(nn.Module):\n    def __init__(self, d_model, eps=1e-12):\n        super(LayerNorm, self).__init__()\n        self.gamma = nn.Parameter(torch.ones(d_model))\n        self.beta = nn.Parameter(torch.zeros(d_model))\n        self.eps = eps\n\n    def forward(self, x):\n        mean = x.mean(-1, keepdim=True)\n        var = x.var(-1, unbiased=False, keepdim=True)\n        # '-1' means last dimension. \n\n        out = (x - mean) / torch.sqrt(var + self.eps)\n        out = self.gamma * out + self.beta\n        return out\n\n```\n\u003cbr\u003e\u003cbr\u003e\n\n### 1.5 Positionwise Feed Forward\n\n![model](image/positionwise_feed_forward.jpg)\n    \n```python\n\nclass PositionwiseFeedForward(nn.Module):\n\n    def __init__(self, d_model, hidden, drop_prob=0.1):\n        super(PositionwiseFeedForward, self).__init__()\n        self.linear1 = nn.Linear(d_model, hidden)\n        self.linear2 = nn.Linear(hidden, d_model)\n        self.relu = nn.ReLU()\n        self.dropout = nn.Dropout(p=drop_prob)\n\n    def forward(self, x):\n        x = self.linear1(x)\n        x = self.relu(x)\n        x = self.dropout(x)\n        x = self.linear2(x)\n        return x\n```\n\u003cbr\u003e\u003cbr\u003e\n\n### 1.6 Encoder \u0026 Decoder Structure\n\n![model](image/enc_dec.jpg)\n    \n```python\nclass EncoderLayer(nn.Module):\n\n    def __init__(self, d_model, ffn_hidden, n_head, drop_prob):\n        super(EncoderLayer, self).__init__()\n        self.attention = MultiHeadAttention(d_model=d_model, n_head=n_head)\n        self.norm1 = LayerNorm(d_model=d_model)\n        self.dropout1 = 
nn.Dropout(p=drop_prob)\n\n        self.ffn = PositionwiseFeedForward(d_model=d_model, hidden=ffn_hidden, drop_prob=drop_prob)\n        self.norm2 = LayerNorm(d_model=d_model)\n        self.dropout2 = nn.Dropout(p=drop_prob)\n\n    def forward(self, x, src_mask):\n        # 1. compute self attention\n        _x = x\n        x = self.attention(q=x, k=x, v=x, mask=src_mask)\n        \n        # 2. add and norm\n        x = self.dropout1(x)\n        x = self.norm1(x + _x)\n        \n        # 3. positionwise feed forward network\n        _x = x\n        x = self.ffn(x)\n      \n        # 4. add and norm\n        x = self.dropout2(x)\n        x = self.norm2(x + _x)\n        return x\n```\n\u003cbr\u003e\n\n```python\nclass Encoder(nn.Module):\n\n    def __init__(self, enc_voc_size, max_len, d_model, ffn_hidden, n_head, n_layers, drop_prob, device):\n        super().__init__()\n        self.emb = TransformerEmbedding(d_model=d_model,\n                                        max_len=max_len,\n                                        vocab_size=enc_voc_size,\n                                        drop_prob=drop_prob,\n                                        device=device)\n\n        self.layers = nn.ModuleList([EncoderLayer(d_model=d_model,\n                                                  ffn_hidden=ffn_hidden,\n                                                  n_head=n_head,\n                                                  drop_prob=drop_prob)\n                                     for _ in range(n_layers)])\n\n    def forward(self, x, src_mask):\n        x = self.emb(x)\n\n        for layer in self.layers:\n            x = layer(x, src_mask)\n\n        return x\n```\n\u003cbr\u003e\n\n```python\nclass DecoderLayer(nn.Module):\n\n    def __init__(self, d_model, ffn_hidden, n_head, drop_prob):\n        super(DecoderLayer, self).__init__()\n        self.self_attention = MultiHeadAttention(d_model=d_model, n_head=n_head)\n        self.norm1 = 
LayerNorm(d_model=d_model)\n        self.dropout1 = nn.Dropout(p=drop_prob)\n\n        self.enc_dec_attention = MultiHeadAttention(d_model=d_model, n_head=n_head)\n        self.norm2 = LayerNorm(d_model=d_model)\n        self.dropout2 = nn.Dropout(p=drop_prob)\n\n        self.ffn = PositionwiseFeedForward(d_model=d_model, hidden=ffn_hidden, drop_prob=drop_prob)\n        self.norm3 = LayerNorm(d_model=d_model)\n        self.dropout3 = nn.Dropout(p=drop_prob)\n\n    def forward(self, dec, enc, trg_mask, src_mask):    \n        # 1. compute self attention\n        _x = dec\n        x = self.self_attention(q=dec, k=dec, v=dec, mask=trg_mask)\n        \n        # 2. add and norm\n        x = self.dropout1(x)\n        x = self.norm1(x + _x)\n\n        if enc is not None:\n            # 3. compute encoder - decoder attention\n            _x = x\n            x = self.enc_dec_attention(q=x, k=enc, v=enc, mask=src_mask)\n            \n            # 4. add and norm\n            x = self.dropout2(x)\n            x = self.norm2(x + _x)\n\n        # 5. positionwise feed forward network\n        _x = x\n        x = self.ffn(x)\n        \n        # 6. 
add and norm\n        x = self.dropout3(x)\n        x = self.norm3(x + _x)\n        return x\n```\n\u003cbr\u003e\n\n```python        \nclass Decoder(nn.Module):\n    def __init__(self, dec_voc_size, max_len, d_model, ffn_hidden, n_head, n_layers, drop_prob, device):\n        super().__init__()\n        self.emb = TransformerEmbedding(d_model=d_model,\n                                        drop_prob=drop_prob,\n                                        max_len=max_len,\n                                        vocab_size=dec_voc_size,\n                                        device=device)\n\n        self.layers = nn.ModuleList([DecoderLayer(d_model=d_model,\n                                                  ffn_hidden=ffn_hidden,\n                                                  n_head=n_head,\n                                                  drop_prob=drop_prob)\n                                     for _ in range(n_layers)])\n\n        self.linear = nn.Linear(d_model, dec_voc_size)\n\n    def forward(self, trg, src, trg_mask, src_mask):\n        trg = self.emb(trg)\n\n        for layer in self.layers:\n            trg = layer(trg, src, trg_mask, src_mask)\n\n        # pass to LM head\n        output = self.linear(trg)\n        return output\n```\n\u003cbr\u003e\u003cbr\u003e\n\n## 2. Experiments\n\nI use the Multi30K dataset to train and evaluate the model. \u003cbr\u003e\nDetails of the dataset are available [here](https://arxiv.org/abs/1605.00459). \u003cbr\u003e\nI follow the original paper's parameter settings. 
(below) \u003cbr\u003e\n\n![conf](image/transformer-model-size.jpg)\n### 2.1 Model Specification\n\n* total parameters = 55,207,087\n* model size = 215.7MB\n* lr scheduling : ReduceLROnPlateau\n\n#### 2.1.1 configuration\n\n* batch_size = 128\n* max_len = 256\n* d_model = 512\n* n_layers = 6\n* n_heads = 8\n* ffn_hidden = 2048\n* drop_prob = 0.1\n* init_lr = 0.1\n* factor = 0.9\n* patience = 10\n* warmup = 100\n* adam_eps = 5e-9\n* epoch = 1000\n* clip = 1\n* weight_decay = 5e-4\n\u003cbr\u003e\u003cbr\u003e\n\n### 2.2 Training Result\n\n![image](saved/transformer-base/train_result.jpg)\n* Minimum Training Loss = 2.852672759656864\n* Minimum Validation Loss = 3.2048025131225586 \n\u003cbr\u003e\u003cbr\u003e\n\n| Model | Dataset | BLEU Score |\n|:---:|:---:|:---:|\n| Original Paper's | WMT14 EN-DE | 25.8 |\n| My Implementation | Multi30K EN-DE | 26.4 |\n\n\u003cbr\u003e\u003cbr\u003e\n\n\n## 3. Reference\n- [Attention is All You Need, 2017 - Google](https://arxiv.org/abs/1706.03762)\n- [The Illustrated Transformer - Jay Alammar](http://jalammar.github.io/illustrated-transformer/)\n- [Data \u0026 Optimization Code Reference - Bentrevett](https://github.com/bentrevett/pytorch-seq2seq/)\n\n\u003cbr\u003e\u003cbr\u003e\n\n## 4. 
Licence\n    Copyright 2019 Hyunwoong Ko.\n    \n    Licensed under the Apache License, Version 2.0 (the \"License\");\n    you may not use this file except in compliance with the License.\n    You may obtain a copy of the License at\n    \n    http://www.apache.org/licenses/LICENSE-2.0\n    \n    Unless required by applicable law or agreed to in writing, software\n    distributed under the License is distributed on an \"AS IS\" BASIS,\n    WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n    See the License for the specific language governing permissions and\n    limitations under the License.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fhyunwoongko%2Ftransformer","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fhyunwoongko%2Ftransformer","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fhyunwoongko%2Ftransformer/lists"}