{"id":22602355,"url":"https://github.com/swastikgorai/transformers_from_scratch","last_synced_at":"2026-05-19T14:11:30.045Z","repository":{"id":248101337,"uuid":"826329002","full_name":"SwastikGorai/transformers_from_scratch","owner":"SwastikGorai","description":"Implementation of \"Attention is all you need\" Paper with only PyTorch","archived":false,"fork":false,"pushed_at":"2024-08-26T06:05:31.000Z","size":1133,"stargazers_count":0,"open_issues_count":0,"forks_count":0,"subscribers_count":1,"default_branch":"main","last_synced_at":"2025-02-03T05:47:53.083Z","etag":null,"topics":["attention-is-all-you-need","pytorch","pytorch-implementation","scratch-implementation","transformers"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/SwastikGorai.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2024-07-09T13:50:36.000Z","updated_at":"2024-08-26T06:05:54.000Z","dependencies_parsed_at":"2024-12-08T12:20:22.927Z","dependency_job_id":"db3f2d48-3f84-446c-af84-ac513c1e5b3a","html_url":"https://github.com/SwastikGorai/transformers_from_scratch","commit_stats":null,"previous_names":["swastikgorai/transformers_from_scratch"],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/SwastikGorai%2Ftransformers_from_scratch","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/SwastikGorai%2Ftransformers_from_scratch/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Swas
tikGorai%2Ftransformers_from_scratch/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/SwastikGorai%2Ftransformers_from_scratch/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/SwastikGorai","download_url":"https://codeload.github.com/SwastikGorai/transformers_from_scratch/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":246100461,"owners_count":20723469,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["attention-is-all-you-need","pytorch","pytorch-implementation","scratch-implementation","transformers"],"created_at":"2024-12-08T12:20:17.604Z","updated_at":"2025-10-24T17:19:40.008Z","avatar_url":"https://github.com/SwastikGorai.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# Transformer Implementation in PyTorch: \"Attention is All You Need\"\n\nThis repository contains a PyTorch implementation of the Transformer model described in the seminal paper \"Attention is All You Need\", introduced by Vaswani et al. in 2017. I have tried implementing it with only PyTorch by breaking the paper down into its component parts. 
The Transformer paper was the basis for almost all early LLMs, such as GPT (Generative Pre-trained Transformer).\n\n## Table of Contents\n\n- [Overview](#overview)\n- [Components](#components)\n  - [Input Embedding](#input-embedding)\n  - [Positional Encoding](#positional-encoding)\n  - [Multi-Head Attention](#multi-head-attention)\n  - [Feed-Forward Network](#feed-forward-network)\n  - [Residual Connection and Layer Normalization](#residual-connection-and-layer-normalization)\n  - [Encoder and Decoder](#encoder-and-decoder)\n  - [Final Projection Layer](#final-projection-layer)\n- [Building the Transformer](#building-the-transformer)\n- [Usage](#usage)\n- [References](#references)\n\n## Overview\n\nThe Transformer model has an encoder-decoder structure. The encoder processes the input sequence to create a representation, which the decoder uses to generate the output sequence. Both the encoder and decoder use multi-head self-attention and feed-forward neural networks, connected with residual connections and layer normalization.\n\n## Components\n\n### 1. Input Embedding\n\nThe `Input_Embedding` class maps token indices to dense vectors of a fixed dimension (`d_model`). The embedding is scaled by the square root of `d_model`, as in the paper:\n\n```python\nimport math\n\nimport torch\nimport torch.nn as nn\n\nclass Input_Embedding(nn.Module):\n    def __init__(self, vocab_size: int, d_model: int):\n        super().__init__()\n        self.d_model = d_model  # stored because forward() scales by sqrt(d_model)\n        self.embedding = nn.Embedding(num_embeddings=vocab_size, embedding_dim=d_model)\n\n    def forward(self, x):\n        return self.embedding(x) * math.sqrt(self.d_model)\n```\n\n### 2. Positional Encoding\n\nPositional encoding injects information about the position of words in a sentence. 
It uses sine and cosine functions, with a different frequency for each pair of dimensions:\n\n```python\n\nclass Positional_Encoding(nn.Module):\n    def __init__(self, d_model:int, sequence_length: int, dropout: float) -\u003e None:\n        super().__init__()\n        self.dropout = nn.Dropout(dropout)\n        pos_enc = torch.zeros(sequence_length, d_model)\n        positions = torch.arange(0, sequence_length, dtype=torch.float).unsqueeze(1)\n        # 1 / 10000^(2i / d_model), computed in log space\n        denominator = torch.exp(torch.arange(0, d_model, 2).float() * (-math.log(10000.0) / d_model))\n        pos_enc[:, 0::2] = torch.sin(positions * denominator)  # even dimensions\n        pos_enc[:, 1::2] = torch.cos(positions * denominator)  # odd dimensions\n        # Buffer name must match the attribute read in forward(); buffers are not trained\n        self.register_buffer(\"pos_enc\", pos_enc.unsqueeze(0))\n\n    def forward(self, x):\n        # Out-of-place add so the caller's tensor is not modified\n        x = x + self.pos_enc[:, :x.shape[1], :]\n        return self.dropout(x)\n```\n\n### 3. Multi-Head Attention\n\nMulti-head attention allows the model to jointly attend to information from different representation subspaces. 
The attention mechanism computes a weighted sum of values (V), where the weight assigned to each value is determined by a similarity score between the query (Q) and key (K):\n\n```python\n\nclass MultiHeadAttention(nn.Module):\n    def __init__(self, d_model:int, h: int, dropout: float):\n        super().__init__()\n        self.d_k = d_model // h\n        self.h = h\n        self.w_q = nn.Linear(d_model, d_model)\n        self.w_k = nn.Linear(d_model, d_model)\n        self.w_v = nn.Linear(d_model, d_model)\n        self.w_o = nn.Linear(d_model, d_model)\n        self.dropout = nn.Dropout(dropout)\n        \n    @staticmethod\n    def attention(query, key, value, mask, dropout):\n        d_k = query.shape[-1]\n        attention_scores = (query @ key.transpose(-2, -1)) / math.sqrt(d_k)\n        if mask is not None:\n            attention_scores.masked_fill_(mask==0, -10e9)\n        attention_scores = attention_scores.softmax(dim=-1)\n        if dropout is not None:\n            attention_scores = dropout(attention_scores)\n        return attention_scores @ value, attention_scores\n        \n    def forward(self, q, k, v, mask):\n        query = self.w_q(q)\n        key = self.w_k(k)\n        value = self.w_v(v)\n        query = query.view(query.shape[0], query.shape[1], self.h, self.d_k).transpose(1, 2)\n        key = key.view(key.shape[0], key.shape[1], self.h, self.d_k).transpose(1, 2)\n        value = value.view(value.shape[0], value.shape[1], self.h, self.d_k).transpose(1, 2)\n        x, self.attention_scores = MultiHeadAttention.attention(query, key, value, mask, self.dropout)\n        x = x.transpose(1, 2).contiguous().view(x.shape[0], -1, self.h * self.d_k)\n        return self.w_o(x)\n```\n\n### 4. 
Feed-Forward Network\n\nEach encoder and decoder layer contains a feed-forward neural network that is applied to each position separately and identically:\n\n```python\n\nclass Feed_Forward(nn.Module):\n    def __init__(self, d_model: int, d_ff: int, dropout: float):\n        super().__init__()\n        self.linear_layer_1 = nn.Linear(d_model, d_ff)\n        self.linear_layer_2 = nn.Linear(d_ff, d_model)\n        self.dropout = nn.Dropout(dropout)\n        \n    def forward(self, x):\n        return self.linear_layer_2(self.dropout(torch.relu(self.linear_layer_1(x))))\n```\n\n### 5. Residual Connection and Layer Normalization\n\nA residual connection wraps each sub-layer of the encoder and decoder layers. Note that this implementation applies layer normalization to the input of the sub-layer (pre-norm), rather than after the residual addition as described in the paper:\n\n```python\n\nclass Residual_Connection(nn.Module):\n    def __init__(self, features: int, dropout: float):\n        super().__init__()\n        self.dropout = nn.Dropout(dropout)\n        self.norm = Layer_Normalization(features)\n        \n    def forward(self, x, sublayer):\n        return x + self.dropout(sublayer(self.norm(x)))\n```\n\n### 6. Encoder and Decoder\n\nThe encoder and decoder are each composed of a stack of identical layers, followed by a final layer normalization. The encoder is shown here:\n\n```python\n\nclass Encoder(nn.Module):\n    def __init__(self, features: int, layers: nn.ModuleList):\n        super().__init__()\n        self.layers = layers\n        self.norm = Layer_Normalization(features)\n        \n    def forward(self, x, mask):\n        for layer in self.layers:\n            x = layer(x, mask)\n        return self.norm(x)\n```\n\n### 7. 
Final Projection Layer\n\nThe final output is passed through a linear transformation layer to project the hidden states to the vocabulary size:\n\n```python\n\nclass Projection_layer(nn.Module):\n    def __init__(self, d_model:int, vocab_size:int):\n        super().__init__()\n        self.projection = nn.Linear(d_model, vocab_size)\n        \n    def forward(self, x):\n        return self.projection(x)\n```\n\n## Building the Transformer\n\nThe build_transformer function initializes and returns a Transformer model:\n\n```python\n\ndef build_transformer(source_vocab_size, target_vocab_size, source_sequence_length, target_sequence_length, d_model=512, N_layers=6, h=8, dropout=0.1, d_ff=2048):\n    source_embeddings = Input_Embedding(vocab_size=source_vocab_size, d_model=d_model)\n    target_embeddings = Input_Embedding(vocab_size=target_vocab_size, d_model=d_model)\n    source_pos = Positional_Encoding(d_model, source_sequence_length, dropout)\n    target_pos = Positional_Encoding(d_model, target_sequence_length, dropout)\n\n    encoder_blocks = [Encoder_Block(d_model, MultiHeadAttention(d_model, h, dropout), Feed_Forward(d_model, d_ff, dropout), dropout) for _ in range(N_layers)]\n    decoder_blocks = [Decoder_Block(d_model, MultiHeadAttention(d_model, h, dropout), MultiHeadAttention(d_model, h, dropout), Feed_Forward(d_model, d_ff, dropout), dropout) for _ in range(N_layers)]\n    \n    encoder = Encoder(d_model, nn.ModuleList(encoder_blocks))\n    decoder = Decoder(d_model, nn.ModuleList(decoder_blocks))\n    \n    projection_layer = Projection_layer(d_model, target_vocab_size)\n    \n    transformer = Transformer(encoder, decoder, source_embeddings, target_embeddings, source_pos, target_pos, projection_layer)\n    \n    for p in transformer.parameters():\n        if p.dim() \u003e 1:\n            nn.init.xavier_uniform_(p)\n    \n    return transformer\n```\n## Usage\n  - Create a virtual environment\n  - If you have 
[uv](https://astral.sh/blog/uv-unified-python-packaging) installed:\n    -  ```uv venv```\n    -  Activate with (on Windows): ```.venv\\scripts\\activate```\n  - Create without uv:\n    -  ```python -m venv venv```\n\n- Install PyTorch by following the official PyTorch installation instructions.\n- Clone the repository: ```git clone https://github.com/SwastikGorai/transformers_from_scratch.git```\n- Import the `build_transformer` function and initialize the model with your desired configuration.\n- Train the model using your dataset.\n  - Example dataset: [IIT Bombay English-Hindi Translation Dataset](https://www.kaggle.com/datasets/vaibhavkumar11/hindi-english-parallel-corpus)\n\n## References\n\n- [Umar Jamil's Implementation](https://youtu.be/ISNdQcPhsts?si=hqGDqjfqUiCdF5xx)\n- [Attention Is All You Need paper](https://arxiv.org/abs/1706.03762)\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fswastikgorai%2Ftransformers_from_scratch","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fswastikgorai%2Ftransformers_from_scratch","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fswastikgorai%2Ftransformers_from_scratch/lists"}