{"id":21170211,"url":"https://github.com/dcarpintero/transformer101","last_synced_at":"2026-05-07T07:37:36.161Z","repository":{"id":217287608,"uuid":"743296307","full_name":"dcarpintero/transformer101","owner":"dcarpintero","description":"Annotated vanilla implementation in PyTorch of the Transformer model introduced in 'Attention Is All You Need'.","archived":false,"fork":false,"pushed_at":"2024-02-19T22:15:23.000Z","size":220,"stargazers_count":1,"open_issues_count":0,"forks_count":0,"subscribers_count":2,"default_branch":"main","last_synced_at":"2025-01-21T10:51:02.498Z","etag":null,"topics":["attention-is-all-you-need","dot-product-attention","dropout-layers","encoder-decoder-architecture","feedforward-neural-network","gelu","linear-layers","multihead-attention","normalization-layers","positional-encoding","pytorch","self-attention","softmax","transfomer"],"latest_commit_sha":null,"homepage":"","language":"Jupyter Notebook","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/dcarpintero.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2024-01-14T22:40:28.000Z","updated_at":"2024-07-25T10:44:43.000Z","dependencies_parsed_at":"2024-01-15T14:24:55.634Z","dependency_job_id":"798753ae-d01d-4d0d-a086-95abfc8c998b","html_url":"https://github.com/dcarpintero/transformer101","commit_stats":null,"previous_names":["dcarpintero/transformer101"],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/dcarpintero%2Ftransformer101","tags_url":"https://repos.ecosyste.ms/
api/v1/hosts/GitHub/repositories/dcarpintero%2Ftransformer101/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/dcarpintero%2Ftransformer101/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/dcarpintero%2Ftransformer101/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/dcarpintero","download_url":"https://codeload.github.com/dcarpintero/transformer101/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":243616707,"owners_count":20319944,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["attention-is-all-you-need","dot-product-attention","dropout-layers","encoder-decoder-architecture","feedforward-neural-network","gelu","linear-layers","multihead-attention","normalization-layers","positional-encoding","pytorch","self-attention","softmax","transfomer"],"created_at":"2024-11-20T15:57:09.085Z","updated_at":"2026-05-07T07:37:31.131Z","avatar_url":"https://github.com/dcarpintero.png","language":"Jupyter Notebook","funding_links":[],"categories":[],"sub_categories":[],"readme":"# Transformer101\n\nVanilla implementation in Pytorch of the Transformer model as introduced in the paper [Attention Is All You Need, 2017](https://arxiv.org/pdf/1706.03762.pdf) by Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Lukasz Kaiser, and Illia Polosukhin.\n\n``Scaled Dot-Product Attention`` | ``Multi-Head Attention`` | ``Absolute Positional Encodings`` | ``Learned Positional Encodings`` | ``Dropout`` | ``Layer 
Normalization`` | ``Residual Connection`` | ``Linear Layer`` | ``Position-Wise Feed-Forward Layer`` | ``GELU`` | ``Softmax`` | ``Encoder`` | ``Decoder`` | ``Transformer``\n\n*Note that this is just an in-progress learning project - if you are looking for production-grade implementations, refer to the [PyTorch Transformer Class](https://pytorch.org/docs/stable/generated/torch.nn.Transformer.html), and [OLMo](https://github.com/allenai/OLMo/), a fully open language model.*\n\n## 1. Background\n\nSequence modeling and transduction tasks, such as language modeling and machine translation, were typically addressed with RNNs and CNNs. However, these architectures are limited by: (i) ``long training times``, due to the sequential nature of RNNs, which constrains parallelization and results in increased memory and computational demands as the text sequence grows; and (ii) ``difficulty in learning dependencies between distant positions``, where CNNs, although much less sequential than RNNs, require a number of operations to relate two positions that grows with the distance between them (linearly for models like ConvS2S and logarithmically for ByteNet).\n\n## 2. Technical Approach\n\nThe paper 'Attention Is All You Need' introduced the Transformer model, ``a stacked encoder-decoder architecture that utilizes self-attention mechanisms instead of recurrence and convolution to compute input and output representations``. In this model, each of the six encoder layers is composed of two main sub-layers: a multi-head self-attention sub-layer, which allows the model to focus on different parts of the input sequence, and a position-wise fully connected feed-forward sub-layer. Each of the six decoder layers adds a third sub-layer that attends over the output of the encoder.\n\nAt its core, the ``self-attention mechanism`` enables the model to weigh the relationships between input tokens at different positions, resulting in a more effective handling of long-range dependencies. 
Additionally, by integrating ``multiple attention heads``, the model gains the ability to simultaneously attend to various aspects of the input data.\n\nIn the original paper, the input and output tokens are converted to 512-dimensional embeddings (this implementation defaults to 768), to which ``positional encodings`` are added, enabling the model to use sequence order information.\n\n### 2.1 Attention\n\nAttention was first introduced as an enhancement to a basic encoder-decoder translation architecture *\"for allowing a model to automatically (soft-)search for parts of a source sentence that are relevant to predicting a target word\"*, as outlined in the paper [Neural machine translation by jointly learning to align and translate](https://arxiv.org/pdf/1409.0473.pdf). In practice, the attention mechanism converts a set of vectors represented in an initial dimensional space into a richer new space that is better suited to solving downstream tasks. In the context of language processing, a word embedding would be the initial dimensional space. However, this space would only capture some elementary (and fixed) semantic properties among words, whereas the attention mechanism would enrich it with context.\n\nAs an example, when given the sentence *the morning sun cast a warm `light` through the window*, the attention mechanism will map the embedding of the word *`light`* to a new representation in which *`light`* is closer to the words *`sun`* and *`window`*;\n\n![Attention S1](static/attention_s1.png)\n\nwhereas in the sentence *she travels `light` for her weekend*, the word *`light`* would be represented closer to *`travels`*.\n\n![Attention S2](static/attention_s2.png)\n\n## 3. 
Transformer Model Implementation\n\n\n```python\nimport math\nimport torch\nimport torch.nn as nn\nimport torch.optim as optim\n```\n\n\n```python\nfrom dataclasses import dataclass\n\n@dataclass\nclass ModelConfig:\n    \"\"\" \n    Transformer (model) configuration\n    \"\"\"\n\n    d_model: int = 768         # dimension of the token embeddings (hidden size of the model)\n    n_layer: int = 6           # number of encoder/decoder layers\n    n_head: int = 8            # number of self-attention heads\n    d_ff: int = 2048           # dimension of the feedforward network\n    src_vocab_size: int = 32   # size of the source vocabulary (will be updated after loading dataset)\n    tgt_vocab_size: int = 32   # size of the target vocabulary (will be updated after loading dataset)\n    drop: float = 0.1          # dropout probability\n    max_seq_len: int = 128     # maximum sequence length (will be updated after loading dataset)\n    pad_token_id: int = 0      # padding token id (usually 0)\n    activation: str = \"gelu\"   # activation function\n\nconfig = ModelConfig()\n```\n\n\n```python\ndevice = torch.device('cuda' if torch.cuda.is_available() else 'cpu')\n```\n\n### 3.1 Self-Attention\n\nThe intuition behind ``self-attention`` is that replacing the fixed embedding of each token with a weighted average of all token embeddings in the sequence enables the model to capture how words relate to each other in the input. In practice, these weights (attention weights) reflect the syntactic and contextual structure of the sentence, leading to a more nuanced and rich understanding of the data.\n\nThe most common way to implement a self-attention layer relies on ``scaled dot-product attention``, and involves:\n1. ``Linear projection`` of each token embedding into three vectors: ``query (q)``, ``key (k)``, ``value (v)``.\n2. Compute ``scaled attention scores``: determine the similarity between ``q`` and ``k`` by applying the ``dot product``. 
Since the dot products can grow large in magnitude as the key dimension increases, they are divided by the square root of the dimensionality of ``k``. This scaling keeps the softmax away from regions with very small gradients and helps stabilize training.\n3. Normalize the ``attention scores`` into ``attention weights`` by applying the ``softmax`` function (this ensures that the weights assigned by each query sum to 1).\n4. ``Update the token embeddings`` by multiplying the ``attention weights`` by the ``value vectors``.\n\n\u003e In addition, the self-attention mechanism of the decoder layer introduces ``masking`` to prevent the decoder from having access to future tokens in the sequence it is generating. In practice, this is implemented with a binary mask that designates which tokens should be attended to (assigned non-zero weights) and which should be ignored (assigned zero weights). In our function, setting the scores of future tokens (the entries above the diagonal) to negative infinity guarantees that their attention weights become zero after applying the softmax function (exp(-inf) = 0). 
This design aligns with the nature of many tasks like translation, summarization, or text generation, where the output sequence needs to be generated one element at a time, and the prediction of each element should be based only on the previously generated elements.\n\n\n```python\nclass AttentionHead(nn.Module):\n    \"\"\"\n    Represents a single attention head within a multi-head attention mechanism.\n    \n    Args:\n        config (TransformerConfig): The configuration for the transformer model.\n    \"\"\"\n    def __init__(self, config):\n        super().__init__()\n        self.d_head = config.d_model // config.n_head\n\n        self.q = nn.Linear(config.d_model, self.d_head)\n        self.k = nn.Linear(config.d_model, self.d_head)\n        self.v = nn.Linear(config.d_model, self.d_head)\n\n    def scaled_dot_product_attention(self, q, k, v, mask=None):\n        dim_k = torch.tensor(k.size(-1), dtype=torch.float32)\n        attn_scores = torch.bmm(q, k.transpose(1, 2)) / torch.sqrt(dim_k)\n        if mask is not None:\n            attn_scores = attn_scores.masked_fill(mask == 0, float('-inf'))\n\n        attn_weights = torch.softmax(attn_scores, axis=-1)\n        output = torch.bmm(attn_weights, v)\n        return output\n\n    def forward(self, q, k, v, mask=None):\n        \"\"\"\n        Args:\n            q (torch.Tensor): query embeddings.\n            k (torch.Tensor): key embeddings.\n            v (torch.Tensor): value embeddings.\n            mask (torch.Tensor): attention mask.\n        \"\"\"\n        output = self.scaled_dot_product_attention(self.q(q), \n                                                   self.k(k), \n                                                   self.v(v), \n                                                   mask=mask)\n        return output\n```\n\n\n```python\nattention_head = AttentionHead(config)\n\nx = torch.randn(10, 32, config.d_model)\n\"\"\"\n10: batch size\n32: sequence length\nconfig.d_model: hidden size 
(embedding dimension)\n\"\"\"\n\noutput = attention_head(x, x, x)\n\nprint(output.shape)\n# Should be [10, 32, d_head], with d_head == d_model // n_head == 768 // 8 == 96\n# \n# Note that each head projects q, k, and v from d_model down to d_head dimensions.\n# The outputs of the n_head heads are then concatenated and projected to the final output size \n# in the MultiHeadAttention class (see below).\n```\n\n    torch.Size([10, 32, 96])\n    \n\n### 3.2 Multi-Head Attention\n\nIn a standard attention mechanism, the ``softmax`` of a single head tends to concentrate on a specific aspect of similarity, potentially overlooking other relevant features in the input. By integrating multiple attention heads, the model gains the ability to simultaneously attend to various aspects of the input data.\n\nThe basic approach to implement Multi-Head Attention comprises:\n\n1. Initialize the ``attention heads``. For example, BERT uses 12 attention heads with an embedding dimension of 768, resulting in 768 / 12 = 64 as the head dimension; with the default configuration used here (768 and 8 heads), the head dimension is 96.\n2. ``Concatenate attention heads`` to combine the outputs of the attention heads into a single vector while preserving the dimensionality of the embeddings.\n3. Apply a ``linear projection``.\n\n\u003e Note that the softmax function produces a probability distribution, which when applied within a single attention head tends to amplify certain features (those with higher scores) while diminishing others. 
This leads to a focus on specific aspects of similarity.\n\n\n```python\nclass MultiHeadAttention(nn.Module):\n    \"\"\"\n    Implements the Multi-Head Attention mechanism.\n\n    Args:\n        config (TransformerConfig): The configuration for the transformer model.\n    \"\"\"\n    def __init__(self, config):\n        super().__init__()\n        assert config.d_model % config.n_head == 0, \"d_model must be divisible by n_head\"\n\n        self.heads = nn.ModuleList([AttentionHead(config) for _ in range(config.n_head)])\n        self.linear = nn.Linear(config.d_model, config.d_model)\n\n    def forward(self, q, k, v, mask=None):\n        # Run every head, then concatenate along the feature dimension (n_head * d_head == d_model)\n        attn_outputs = torch.cat([h(q, k, v, mask) for h in self.heads], dim=-1)\n        output = self.linear(attn_outputs)\n        return output\n```\n\n\n```python\nmultihead_attn = MultiHeadAttention(config)\n\nx = torch.randn(10, 32, config.d_model)\nattn_output = multihead_attn(x, x, x)\nattn_output.size()\n# Should be [10, 32, d_model]\n# \n# Note that the output size is the same as the input size, \n# as the outputs of the heads are concatenated and projected back to the original size.\n```\n\n\n\n\n    torch.Size([10, 32, 768])\n\n\n\n### 3.3 Position-Wise Feed-Forward Layer\n\nThe Transformer, primarily built upon linear operations like ``dot products`` and ``linear projections``, relies on the ``Position-Wise Feed-Forward Layer`` to introduce non-linearity into the model. This non-linearity enables the model to capture complex data patterns and relationships. The layer typically consists of two linear transformations with a ``non-linear activation function (like ReLU or GELU)`` in between. Each layer in the ``Encoder`` and ``Decoder`` includes one of these feed-forward networks, allowing the model to build increasingly abstract representations of the input data as it passes through successive layers. 
Note that since this layer processes each embedding independently, the computations can be fully parallelized.\n\nIn summary, this layer comprises:\n- ``First linear transformation``, applied to the input tensor and expanding it from ``d_model`` to ``d_ff`` dimensions.\n- A non-linear ``activation function`` to allow the model to learn more complex patterns.\n- ``Second linear transformation``, projecting the result back from ``d_ff`` to ``d_model`` dimensions.\n- ``Dropout``, a regularization technique used to prevent overfitting. It randomly zeroes some of the elements of the input tensor with a certain probability during training.\n\n\u003e Note that the ``ReLU`` function is a faster function that activates units only when the input is positive, which can lead to sparse activations (that can be intended in some tasks); whereas ``GELU``, introduced after ``ReLU``, offers a smoother activation by weighting the input by the standard Gaussian cumulative distribution function, which acts as a probabilistic gate. In practice, ``GELU`` has been the preferred choice in the BERT and GPT models. 
More recent models such as LLaMA, PaLM, and [OLMo](https://allenai.org/olmo/olmo-paper.pdf) use the [SwiGLU](https://www.semanticscholar.org/paper/GLU-Variants-Improve-Transformer-Shazeer/bdbf780dfd6b3eb0c9e980887feae5f23af15bc4) activation function instead.\n\n\n```python\nclass PositionWiseFeedForward(nn.Module):\n    \"\"\"\n    Implements the PositionWiseFeedForward layer.\n\n    Args:\n        config (TransformerConfig): The configuration for the transformer model.\n    \"\"\"\n    def __init__(self, config):\n        super().__init__()\n        self.ff = nn.Sequential(\n            nn.Linear(config.d_model, config.d_ff),  # expand: d_model -\u003e d_ff\n            nn.GELU(),\n            nn.Linear(config.d_ff, config.d_model),  # project back: d_ff -\u003e d_model\n            nn.Dropout(config.drop)\n        )\n\n    def forward(self, x):\n        return self.ff(x)\n```\n\n\n```python\nff = PositionWiseFeedForward(config)\nx = torch.randn(10, 32, config.d_model)\nff(x).size()\n# Should be [10, 32, d_model]\n```\n\n\n\n\n    torch.Size([10, 32, 768])\n\n\n\n### 3.4 Positional Encoding\n\nSince the Transformer model contains no recurrence and no convolution, the model has no built-in notion of the order of the tokens. By adding ``positional encoding`` to the input sequence, the Transformer model can differentiate between tokens based on their position in the sequence, which is important for tasks such as language modeling and machine translation. In practice, ``positional encodings`` are added to the input embeddings at the bottom of the ``encoder`` and ``decoder`` stacks. 
 \n\n\u003e As outlined in the original [Attention Is All You Need](https://arxiv.org/pdf/1706.03762.pdf) paper, ``Sinusoidal Positional Encoding`` and ``Learned Positional Encoding`` produce nearly identical results.\n\n\n\n\n```python\nclass PositionalEncoding(nn.Module):\n    \"\"\"\n    Implements the PositionalEncoding layer.\n\n    Args:\n        config (TransformerConfig): The configuration for the transformer model.\n    \"\"\"\n    def __init__(self, config):\n        super().__init__()\n        self.d_model = config.d_model\n        self.max_seq_len = config.max_seq_len\n\n        # Positions 0..max_seq_len-1 as a column vector\n        position = torch.arange(self.max_seq_len).unsqueeze(1)\n        # 10000^(2i / d_model) for each pair of dimensions\n        div_term = torch.pow(10000, torch.arange(0, self.d_model, 2) / self.d_model)\n\n        pe = torch.zeros(1, self.max_seq_len, self.d_model)\n        pe[0, :, 0::2] = torch.sin(position / div_term)  # even dimensions\n        pe[0, :, 1::2] = torch.cos(position / div_term)  # odd dimensions\n\n        # Buffer: saved and moved with the module, but not trained\n        self.register_buffer('pe', pe)\n\n    def forward(self, x):\n        return x + self.pe[:, :x.size(1), :]\n```\n\n\n```python\npe = PositionalEncoding(config)\nx = torch.randn(10, 64, config.d_model)\npe(x).size()\n```\n\n\n\n\n    torch.Size([10, 64, 768])\n\n\n\n### 3.5 Encoder\n\nEach of the six encoder layers is composed of two main sub-layers: a ``multihead self-attention`` sub-layer, which as explained above allows the model to focus on different parts of the input sequence, and a ``position-wise fully connected feed-forward`` sub-layer. In addition, the model employs a ``residual connection`` around each of the two sub-layers, combined with ``layer normalization``. In our case, the encoder implements pre-layer (instead of post-layer) normalization to favour stability during training, together with ``Dropout`` regularization to prevent overfitting.\n\n\u003e ``layer normalization`` rescales the features of each token to zero mean and unit variance. 
This helps to stabilize the learning process, and to reduce the number of training steps.\n\n\u003e ``residual connection`` or ``skip connection`` helps alleviate the problem of vanishing gradients by adding the unprocessed input of a sub-layer to its output. In the original paper, the output of each sub-layer is LayerNorm(x + Sublayer(x)), where Sublayer(x) is the function implemented by the sub-layer itself; with pre-layer normalization, as in the encoder layer below, it becomes x + Sublayer(LayerNorm(x)).\n\n\n```python\nclass EncoderLayer(nn.Module):\n    \"\"\"\n    Implements a single Encoder layer.\n\n    Args:\n        config (TransformerConfig): The configuration for the transformer model.\n    \"\"\"\n    def __init__(self, config):\n        super().__init__()\n\n        self.norm_1 = nn.LayerNorm(config.d_model)\n        self.masked_attn = MultiHeadAttention(config)\n\n        self.norm_2 = nn.LayerNorm(config.d_model)\n        self.feed_forward = PositionWiseFeedForward(config)\n        \n        self.dropout = nn.Dropout(config.drop)\n        \n    def forward(self, x, mask=None):\n        # Pre-layer normalization: normalize before each sub-layer, then add the residual\n        x_norm = self.norm_1(x)\n        attn_outputs = self.masked_attn(x_norm, x_norm, x_norm, mask=mask)\n        x = x + self.dropout(attn_outputs)\n\n        output = x + self.dropout(self.feed_forward(self.norm_2(x)))\n        return output\n```\n\n\n```python\nencoder_layer = EncoderLayer(config)\nx = torch.randn(10, 32, config.d_model)\nencoder_layer(x).size()\n# Should be [10, 32, d_model]\n```\n\n\n\n\n    torch.Size([10, 32, 768])\n\n\n\n\n```python\nclass Encoder(nn.Module):\n    \"\"\"\n    Implements the Encoder stack.\n\n    Parameters:\n        config (TransformerConfig): The configuration for the transformer model.\n    \"\"\"\n    def __init__(self, config):\n        super().__init__()\n        self.embedding = nn.Embedding(config.src_vocab_size, config.d_model)\n        self.pe = PositionalEncoding(config)\n        self.layers = nn.ModuleList([EncoderLayer(config) for _ in range(config.n_layer)])\n\n    def forward(self, x, mask=None):\n        x = 
self.embedding(x) * math.sqrt(self.embedding.embedding_dim)\n        x = self.pe(x)\n        for layer in self.layers:\n            x = layer(x, mask)\n        return x\n```\n\n\n```python\nencoder = Encoder(config)\nx = torch.randint(0, config.src_vocab_size, (10, 32))\nencoder(x).size()\n# Should be [10, 32, d_model]\n```\n\n\n\n\n    torch.Size([10, 32, 768])\n\n\n\n### 3.6 Decoder\n\nThe Decoder has two attention sub-layers: a ``masked multi-head self-attention layer`` and an ``encoder-decoder attention layer`` (cross-attention), followed by a position-wise feed-forward sub-layer. Unlike the encoder layer, the decoder layer below applies post-layer normalization.\n\n\n```python\nclass DecoderLayer(nn.Module):\n    \"\"\"\n    Implements a single Decoder layer.\n\n    Args:\n        config (TransformerConfig): The configuration for the transformer model.\n    \"\"\"\n    def __init__(self, config):\n        super().__init__()\n        self.norm_1 = nn.LayerNorm(config.d_model)\n        self.masked_attn = MultiHeadAttention(config)\n\n        self.norm_2 = nn.LayerNorm(config.d_model)\n        self.cross_attn = MultiHeadAttention(config)\n\n        self.norm_3 = nn.LayerNorm(config.d_model)\n        self.feed_forward = PositionWiseFeedForward(config)\n        \n        self.dropout = nn.Dropout(config.drop)\n\n    def forward(self, x, enc_output, src_mask=None, tgt_mask=None):\n        # Masked self-attention over the target sequence\n        attn_output = self.masked_attn(x, x, x, tgt_mask)\n        x = self.norm_1(x + self.dropout(attn_output))\n\n        # Cross-attention: queries from the decoder, keys and values from the encoder output\n        attn_output = self.cross_attn(x, enc_output, enc_output, src_mask)\n        x = self.norm_2(x + self.dropout(attn_output))\n\n        output = self.norm_3(x + self.dropout(self.feed_forward(x)))\n        return output\n```\n\n\n```python\ndecoder_layer = DecoderLayer(config)\nx = torch.randn(10, 32, config.d_model)\nencoder_output = torch.randn(10, 32, config.d_model)\ndecoder_layer(x, encoder_output).size()\n# Should be [10, 32, d_model]\n```\n\n\n\n\n    torch.Size([10, 32, 768])\n\n\n\n\n```python\nclass Decoder(nn.Module):\n    \"\"\"\n    Implements the Decoder stack.\n\n    Args:\n        config (TransformerConfig): The 
configuration for the transformer model.\n    \"\"\"\n    def __init__(self, config):\n        super().__init__()\n        self.embedding = nn.Embedding(config.tgt_vocab_size, config.d_model)\n        self.pe = PositionalEncoding(config)\n        self.layers = nn.ModuleList([DecoderLayer(config) for _ in range(config.n_layer)])\n\n    def forward(self, x, enc_output, src_mask=None, tgt_mask=None):\n        x = self.embedding(x) * math.sqrt(self.embedding.embedding_dim)\n        x = self.pe(x)\n        for layer in self.layers:\n            x = layer(x, enc_output, src_mask, tgt_mask)\n        return x\n```\n\n\n```python\ndecoder = Decoder(config)\nx = torch.randint(0, config.tgt_vocab_size, (10, 32))\nenc_output = torch.randn(10, 32, config.d_model)\ndecoder(x, enc_output).size()\n# Should be [10, 32, d_model]\n```\n\n\n\n\n    torch.Size([10, 32, 768])\n\n\n\n### 3.7 Transformer\n\nNote that the ``Decoder`` class takes in an additional argument ``enc_output`` which is the output of the ``Encoder`` stack. This is used in the cross-attention mechanism to calculate the attention scores between the decoder input and the encoder output.\n\nThe ``source mask`` is typically used in the encoder to ignore padding tokens in the input sequence, whereas the ``target mask`` is used in the decoder to also ignore padding tokens, and to ensure that predictions for each token can only depend on previous tokens. 
This enforces causality in the decoder output.\n\n\n```python\nclass Transformer(nn.Module):\n    \"\"\"\n    Implements the Transformer architecture.\n\n    Args:\n        config (TransformerConfig): The configuration for the transformer model.\n    \"\"\"\n    def __init__(self, config):\n        super().__init__()\n\n        self.encoder = Encoder(config)\n        self.decoder = Decoder(config)\n        self.linear = nn.Linear(config.d_model, config.tgt_vocab_size)\n\n    def generate_mask(self, x):\n        # Lower-triangular (causal) mask of shape [1, seq_len, seq_len]\n        seq_len = x.size(1)\n        mask = torch.tril(torch.ones(seq_len, seq_len, device=x.device)).unsqueeze(0)\n        return mask\n\n    #def generate_mask(self, src, tgt):\n    #    src_mask = (src != 0).unsqueeze(1).unsqueeze(2)\n    #    tgt_mask = (tgt != 0).unsqueeze(1).unsqueeze(3)\n    #    seq_length = tgt.size(1)\n    #    nopeak_mask = (1 - torch.triu(torch.ones(1, seq_length, seq_length), diagonal=1)).bool()\n    #    tgt_mask = tgt_mask \u0026 nopeak_mask\n    #    return src_mask, tgt_mask\n\n    def forward(self, src, tgt):\n        # Only the causal target mask is applied here; padding masks are not used\n        tgt_mask = self.generate_mask(tgt)\n        \n        enc_output = self.encoder(src, mask=None)\n        dec_output = self.decoder(tgt, enc_output, src_mask=None, tgt_mask=tgt_mask)\n        \n        return self.linear(dec_output)\n```\n\n\n```python\nmodel = Transformer(config)\nsrc = torch.randint(0, config.src_vocab_size, (10, 32))\ntgt = torch.randint(0, config.tgt_vocab_size, (10, 32))\n\nmodel(src, tgt).size()\n# Should be [10, 32, tgt_vocab_size]\n```\n\n\n\n\n    torch.Size([10, 32, 32])\n\n\n\n## 4. Training\n\nWe will train our transformer model to translate from English to Spanish. 
The dataset has been obtained from https://tatoeba.org\n\n\n```python\n# use pandas to read txt in tabular pairs\nimport pandas as pd\ndef load_dataset():\n    df = pd.read_csv('data/en-es.txt', sep='\\t', header=None)\n    en = df[0].tolist()\n    es = df[1].tolist()\n    return en, es\n```\n\n\n```python\nsrc, tgt = load_dataset()\n```\n\n\n```python\nsrc[3500:3510]\n```\n\n\n\n\n    ['We apologize.',\n     'We are happy.',\n     'We are young.',\n     'We can do it.',\n     \"We can't sue.\",\n     \"We can't win.\",\n     'We got ready.',\n     'We got ready.',\n     'We had lunch.',\n     'We have some.']\n\n\n\n\n```python\ntgt[3500:3510]\n```\n\n\n\n\n    ['Pedimos disculpas.',\n     'Somos felices.',\n     'Somos jóvenes.',\n     'Podemos hacerlo.',\n     'No podemos demandar.',\n     'No podemos ganar.',\n     'Nos preparamos.',\n     'Estábamos listos.',\n     'Almorzamos.',\n     'Tenemos algo.']\n\n\n\n\n```python\nlen(src), len(tgt)\n```\n\n\n\n\n    (118964, 118964)\n\n\n\n\n```python\nimport re\n\ndef create_vocab(corpus):\n    vocab = set()\n    for s in corpus:\n        vocab.update(re.findall(r'\\w+|[^\\w\\s]', s))\n    w2i = {w: i+4 for i, w in enumerate(vocab)}\n    w2i['PAD'] = 0\n    w2i['SOS'] = 1\n    w2i['EOS'] = 2\n    w2i['UNK'] = 3\n    i2w = {i: w for w, i in w2i.items()}\n\n    return w2i, i2w\n```\n\n\n```python\nsrc_w2i, src_i2w = create_vocab(src)\ntgt_w2i, tgt_i2w = create_vocab(tgt)\nprint(len(src_w2i), len(tgt_w2i))\n```\n\n    14779 28993\n    \n\n\n```python\ndef encode(corpus, w2i):\n    encoding = []\n    for s in corpus:\n        s_enc = [w2i[w] for w in re.findall(r'\\w+|[^\\w\\s]', s)]\n        s_enc = [w2i['SOS']] + s_enc + [w2i['EOS']]\n        encoding.append(s_enc)\n    return encoding\n```\n\n\n```python\nsrc_enc = encode(src, src_w2i)\ntgt_enc = encode(tgt, tgt_w2i)\n```\n\n\n```python\nfrom sklearn.model_selection import train_test_split\n\nsrc_train, src_test, tgt_train, tgt_test = train_test_split(src_enc,\n       
                                                     tgt_enc,\n                                                            test_size=0.2,\n                                                            random_state=42)\n\n```\n\n\n```python\nlen(src_train), len(src_test), len(tgt_train), len(tgt_test)\n```\n\n\n\n\n    (95171, 23793, 95171, 23793)\n\n\n\n\n```python\ndef prepare_batch(X, Y):\n    max_len_X = max([len(x) for x in X])\n    max_len_Y = max([len(y) for y in Y])\n\n    # Zero-initialized tensors: index 0 is the PAD token\n    enc_input = torch.zeros(len(X), max_len_X, dtype=torch.long)\n    dec_input = torch.zeros(len(Y), max_len_Y, dtype=torch.long)\n    output = torch.zeros(len(Y), max_len_Y, dtype=torch.long)\n\n    for i, s in enumerate(X):\n        enc_input[i, :len(s)] = torch.tensor(s)\n\n    # The decoder input drops EOS; the expected output drops SOS (shifted by one position)\n    for i, s in enumerate(Y):\n        dec_input[i, :len(s)-1] = torch.tensor(s[:-1])\n        output[i, :len(s)-1] = torch.tensor(s[1:])\n\n    return enc_input, dec_input, output\n```\n\n\n```python\nfrom sklearn.utils import shuffle\n\ndef batch_generator(X, Y, batch_size):\n    idx = 0\n    while True:\n        bx = X[idx:idx+batch_size]\n        by = Y[idx:idx+batch_size]\n        \n        yield prepare_batch(bx, by)\n        \n        idx += batch_size\n        # Reshuffle and restart once the dataset is exhausted\n        if idx \u003e= len(X):\n            idx = 0\n            X, Y = shuffle(X, Y, random_state=42)\n    \n```\n\n\n```python\nbatch_size = 128\ntrain_loader = batch_generator(src_train, tgt_train, batch_size)\nbe, bd, bo = next(train_loader)\n```\n\n\n```python\nprint([src_i2w[w.item()] for w in be[0]])\nprint([tgt_i2w[w.item()] for w in bd[0]])\nprint([tgt_i2w[w.item()] for w in bo[0]])\n```\n\n    ['SOS', 'I', 'have', 'no', 'choice', 'at', 'all', '.', 'EOS', 'PAD', 'PAD', 'PAD', 'PAD', 'PAD', 'PAD', 'PAD', 'PAD', 'PAD', 'PAD', 'PAD', 'PAD']\n    ['SOS', 'No', 'tengo', 'otra', 'opción', 'en', 'absoluto', '.', 'PAD', 'PAD', 'PAD', 'PAD', 'PAD', 'PAD', 'PAD', 'PAD', 'PAD', 'PAD']\n    ['No', 'tengo', 'otra', 'opción', 'en', 'absoluto', '.', 'EOS', 'PAD', 'PAD', 'PAD', 'PAD', 
'PAD', 'PAD', 'PAD', 'PAD', 'PAD', 'PAD']\n    \n\n\n```python\nconfig = ModelConfig(src_vocab_size=len(src_w2i),\n                     tgt_vocab_size=len(tgt_w2i),\n                     max_seq_len=max([len(s) for s in src + tgt]))\n\nmodel = Transformer(config).to(device)\nconfig\n```\n\n\n\n\n    ModelConfig(d_model=768, n_layer=6, n_head=8, d_ff=2048, src_vocab_size=14779, tgt_vocab_size=28993, drop=0.1, max_seq_len=278, pad_token_id=0, activation='gelu')\n\n\n\n\n```python\ncriterion = nn.CrossEntropyLoss(ignore_index=config.pad_token_id)\noptimizer = optim.Adam(model.parameters(), lr=0.0001, betas=(0.9, 0.98), eps=1e-9)\n```\n\n\n```python\nlen(src_train) // batch_size\n```\n\n\n\n\n    743\n\n\n\nA short run (for example, 3 epochs) is enough to demonstrate that the model can be trained. For better results, the loop below uses 20 epochs with a batch size of 128, which amounts to len(src_train) // batch_size = 743 steps per epoch.\n\n\n```python\nmodel.train()\nepochs = 20\nbatch_size = 128\ntrain_loader = batch_generator(src_train, tgt_train, batch_size)\n\nfor epoch in range(epochs):\n    epoch_loss = []\n\n    for i in range(len(src_train) // batch_size):\n        be, bd, bs = next(train_loader)\n        be, bd, bs = be.to(device), bd.to(device), bs.to(device)\n\n        output = model(be, bd)\n        # CrossEntropyLoss expects [batch, classes, seq_len], hence the permute\n        loss = criterion(output.permute(0, 2, 1), bs)\n\n        optimizer.zero_grad()\n        loss.backward()\n        optimizer.step()\n\n        epoch_loss.append(loss.item())\n    print(\"----------------------------------------------------------\")\n    print(f'Epoch: {epoch}, Loss: {sum(epoch_loss) / len(epoch_loss)}')\n    print(\"----------------------------------------------------------\")\nprint('Training 
finished')\n```","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fdcarpintero%2Ftransformer101","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fdcarpintero%2Ftransformer101","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fdcarpintero%2Ftransformer101/lists"}