{"id":15921842,"url":"https://github.com/eonu/transformers-from-scratch","last_synced_at":"2026-02-12T09:02:43.928Z","repository":{"id":251291617,"uuid":"829386311","full_name":"eonu/transformers-from-scratch","owner":"eonu","description":"Modular Python implementation of encoder-only, decoder-only and encoder-decoder transformer architectures from scratch, as detailed in Attention Is All You Need.","archived":false,"fork":false,"pushed_at":"2024-08-24T22:26:47.000Z","size":29387,"stargazers_count":1,"open_issues_count":8,"forks_count":1,"subscribers_count":1,"default_branch":"main","last_synced_at":"2025-09-10T23:50:33.622Z","etag":null,"topics":["attention-is-all-you-need","attention-mechanism","generation","generative-ai","gpt","llm","nlp","nlu","summarization","torch","transformer","transformer-from-scratch","translation"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/eonu.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2024-07-16T10:21:46.000Z","updated_at":"2025-05-01T15:25:50.000Z","dependencies_parsed_at":"2024-08-24T23:40:53.819Z","dependency_job_id":null,"html_url":"https://github.com/eonu/transformers-from-scratch","commit_stats":null,"previous_names":["eonu/transformers-from-scratch"],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/eonu/transformers-from-scratch","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/eonu%2Ftransformers-from-scratch","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/re
positories/eonu%2Ftransformers-from-scratch/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/eonu%2Ftransformers-from-scratch/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/eonu%2Ftransformers-from-scratch/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/eonu","download_url":"https://codeload.github.com/eonu/transformers-from-scratch/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/eonu%2Ftransformers-from-scratch/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":29322738,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-02-10T20:44:44.282Z","status":"ssl_error","status_checked_at":"2026-02-10T20:44:43.393Z","response_time":65,"last_error":"SSL_connect returned=1 errno=0 peeraddr=140.82.121.6:443 state=error: unexpected eof while reading","robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":false,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["attention-is-all-you-need","attention-mechanism","generation","generative-ai","gpt","llm","nlp","nlu","summarization","torch","transformer","transformer-from-scratch","translation"],"created_at":"2024-10-06T20:02:28.657Z","updated_at":"2026-02-12T09:02:43.913Z","avatar_url":"https://github.com/eonu.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"\u003cp align=\"center\"\u003e\n  \u003ch1 align=\"center\"\u003eTransformers From 
Scratch\u003c/h1\u003e\n\u003c/p\u003e\n\n\u003cp align=\"center\"\u003e\n  \u003csup\u003e\n    \u003cb\u003eContents\u003c/b\u003e:\u0026nbsp;\n    \u003ca href=\"#features\"\u003eFeatures\u003c/a\u003e ·\n    \u003ca href=\"#example\"\u003eExample\u003c/a\u003e ·\n    \u003ca href=\"#details\"\u003eDetails\u003c/a\u003e ·\n    \u003ca href=\"#datasets\"\u003eDatasets\u003c/a\u003e ·\n    \u003ca href=\"#models-and-notebooks\"\u003eModels and notebooks\u003c/a\u003e ·\n    \u003ca href=\"#repository-structure\"\u003eRepository structure\u003c/a\u003e ·\n    \u003ca href=\"#installation\"\u003eInstallation\u003c/a\u003e ·\n    \u003ca href=\"#running\"\u003eRunning\u003c/a\u003e ·\n    \u003ca href=\"#references\"\u003eReferences\u003c/a\u003e\n  \u003c/sup\u003e\n\u003c/p\u003e\n\nThe repository contains a modular Python implementation of transformer architectures for natural language understanding and generation tasks, according to:\n\n- The seminal paper _Attention Is All You Need_ by Vaswani et al.\u003csup\u003e\u003ca href=\"#references\"\u003e[1]\u003c/a\u003e\u003c/sup\u003e that details the novel attention-based transformer architecture and its application to sequence-to-sequence tasks, demonstrating its effectiveness by achieving state-of-the-art performance in machine translation, surpassing previous LSTM and CNN based neural machine translation architectures.\n- The chapter on _Transformers and Large Language Models_ from _Speech and Language Processing_ by Jurafsky \u0026 Martin\u003csup\u003e\u003ca href=\"#references\"\u003e[2]\u003c/a\u003e\u003c/sup\u003e which provides a more comprehensive and illustrative look into some of the high-level details discussed in _Attention Is All You Need_.\n\n## Features\n\n- Generic encoder-only, decoder-only and encoder-decoder transformer architectures.\n- Wrappers for causal language modelling, sequence-to-sequence generation and classification/regression.\n- Various decoding methods for 
causal/sequence-to-sequence generation:\n  - Search-based (greedy and beam search)\n  - Sampling-based (nucleus, temperature and top-k sampling)\n- Example applications to real-world datasets.\n\n### PyTorch restrictions\n\nThis project is implemented using [PyTorch](https://pytorch.org/) and [PyTorch Lightning](https://lightning.ai/docs/pytorch/stable/).\n\nAs PyTorch provides a number of transformer and attention-related layers in its [`torch.nn`](https://pytorch.org/docs/stable/nn.html) submodule, this project explicitly avoids the use of:\n\n- [`torch.nn.Transformer`](https://pytorch.org/docs/stable/generated/torch.nn.Transformer.html#torch.nn.Transformer)\n- [`torch.nn.TransformerEncoder`](https://pytorch.org/docs/stable/generated/torch.nn.TransformerEncoder.html#torch.nn.TransformerEncoder)/[`torch.nn.TransformerEncoderLayer`](https://pytorch.org/docs/stable/generated/torch.nn.TransformerEncoderLayer.html#torch.nn.TransformerEncoderLayer)\n- [`torch.nn.TransformerDecoder`](https://pytorch.org/docs/stable/generated/torch.nn.TransformerDecoder.html#torch.nn.TransformerDecoder)/[`torch.nn.TransformerDecoderLayer`](https://pytorch.org/docs/stable/generated/torch.nn.TransformerDecoderLayer.html#torch.nn.TransformerDecoderLayer)\n- [`torch.nn.MultiheadAttention`](https://pytorch.org/docs/stable/generated/torch.nn.MultiheadAttention.html#torch.nn.MultiheadAttention)\n- [`torch.nn.functional.scaled_dot_product_attention`](https://pytorch.org/docs/stable/generated/torch.nn.functional.scaled_dot_product_attention.html#torch.nn.functional.scaled_dot_product_attention)\n\nAll other layers provided by `torch.nn` are allowed, including:\n\n- [`nn.Embedding`](https://pytorch.org/docs/stable/generated/torch.nn.Embedding.html#torch.nn.Embedding): For token embedding look-up by vocabulary ID.\n- [`nn.LayerNorm`](https://pytorch.org/docs/stable/generated/torch.nn.LayerNorm.html#torch.nn.LayerNorm): For layer normalization as implemented in _Attention Is All You Need_.\n\n### 
Other restrictions\n\n- Transformer models implemented and made available in other libraries such as HuggingFace's [`transformers`](https://huggingface.co/docs/transformers/en/index) are not used in this project.\n- However, the tokenizers provided by `transformers` were used, as developing tokenization algorithms was not the primary objective of this project.\n- No existing _\"x from scratch\"_ resources were used, such as the famous _Let's build GPT: from scratch, in code, spelled out._ by Andrej Karpathy\u003csup\u003e\u003ca href=\"#references\"\u003e[3]\u003c/a\u003e\u003c/sup\u003e.\n- No other online resources were used, apart from official documentation for packages such as [PyTorch](https://pytorch.org/docs/stable/index.html), [PyTorch Lightning](https://lightning.ai/docs/pytorch/stable/) and [Huggingface Tokenizers](https://huggingface.co/docs/transformers/en/main_classes/tokenizer).\n\n## Example\n\nTraining a causal language model to generate \"Florida man\"-style news headlines.\n\n```python\nfrom transformers import LlamaTokenizer\n\nfrom transformer.params import TransformerParams, TemperatureSamplingParams\nfrom transformer.models import CausalLM\nfrom transformer.decoding import TemperatureSamplingDecoder\n\n# initialize HuggingFace tokenizer\ntokenizer = LlamaTokenizer.from_pretrained(\n    \"huggyllama/llama-7b\", add_eos_token=True, legacy=False\n)\ntokenizer.add_special_tokens({\"pad_token\": \"\u003cpad\u003e\"})\n\n# initialize the causal language model\nmodel = CausalLM(\n    params=TransformerParams(context_length=64),\n    tokenizer=tokenizer,\n)\n\n# train the language model\nmodel.train(...)\n\n# initialize decoder for sequence generation\ndecoder = TemperatureSamplingDecoder(\n    params=TemperatureSamplingParams(max_length=100, temperature=0.5, k=5),\n    model=model,\n)\n\n# generation without context\ndecoder.generate()\n'Florida man arrested after baby alligator, guns, drugs found inside truck'\n\n# generation with 
context\ndecoder.generate(\"Florida man shot\")\n'Florida man shot and killed while attempting to steal pizza and Pokemon cards from Target'\n```\n\n## Details\n\nWhile the original architecture described in _Attention Is All You Need_ is an encoder-decoder based architecture using transformers for neural machine translation which is a sequence-to-sequence learning task, this project was designed to be more general, allowing for a variety of natural language tasks by implementing encoder-only, decoder-only and encoder-decoder architectures.\n\n\u003ctable\u003e\n    \u003ctbody\u003e\n        \u003ctr\u003e\n            \u003ctd\u003e\u003c/td\u003e\n            \u003ctd\u003e\u003cb\u003eEncoder-only\u003c/b\u003e\u003c/td\u003e\n            \u003ctd\u003e\u003cb\u003eDecoder-only\u003c/b\u003e\u003c/td\u003e\n            \u003ctd\u003e\u003cb\u003eEncoder-decoder\u003c/b\u003e\u003c/td\u003e\n        \u003c/tr\u003e\n        \u003ctr\u003e\n            \u003ctd\u003e\u003cb\u003eDiagram\u003c/b\u003e\u003c/td\u003e\n            \u003ctd\u003e\u003cimg src=\"assets/encoder-only.svg\"/\u003e\u003c/td\u003e\n            \u003ctd\u003e\u003cimg src=\"assets/decoder-only.svg\"/\u003e\u003c/td\u003e\n            \u003ctd\u003e\u003cimg src=\"assets/encoder-decoder.svg\"/\u003e\u003c/td\u003e\n        \u003c/tr\u003e\n        \u003ctr\u003e\n            \u003ctd\u003e\u003cb\u003eTasks\u003c/b\u003e\u003c/td\u003e\n            \u003ctd\u003eContextualized embedding and supervised inference\u003c/td\u003e\n            \u003ctd\u003eAutoregressive generation\u003c/td\u003e\n            \u003ctd\u003eSequence-to-sequence generation\u003c/td\u003e\n        \u003c/tr\u003e\n        \u003ctr\u003e\n            \u003ctd\u003e\u003cb\u003eExample use-cases\u003c/b\u003e\u003c/td\u003e\n            \u003ctd\u003e\n                \u003cul\u003e\n                    \u003cli\u003eProducing contextualized token embeddings\u003c/li\u003e\n                    \u003cli\u003eSentiment 
classification\u003c/li\u003e\n                    \u003cli\u003eIntent classification\u003c/li\u003e\n                \u003c/ul\u003e\n            \u003c/td\u003e\n            \u003ctd\u003e\n                \u003cul\u003e\n                    \u003cli\u003eText generation\u003c/li\u003e\n                \u003c/ul\u003e\n            \u003c/td\u003e\n            \u003ctd\u003e\n                \u003cul\u003e\n                    \u003cli\u003eMachine translation\u003c/li\u003e\n                    \u003cli\u003eText summarization\u003c/li\u003e\n                \u003c/ul\u003e\n            \u003c/td\u003e\n        \u003c/tr\u003e\n    \u003c/tbody\u003e\n\u003c/table\u003e\n\n## Datasets\n\nThe following datasets were used to test the above transformer implementations on various tasks.\n\n- [arXiv Paper Abstracts](https://www.kaggle.com/datasets/spsayakpaul/arxiv-paper-abstracts): arXiv manuscripts and their metadata including titles, abstracts and categories.\n- [CommonLit Readability Prize](https://www.kaggle.com/competitions/commonlitreadabilityprize): Literary passages and their associated \"readability\" score for use in grade 3-12 classrooms.\n- [Reddit r/FloridaMan](https://www.kaggle.com/datasets/bcruise/reddit-rfloridaman): News headlines about various (often funny and irrational) actions performed by Florida men and women.\n- [Europarl](https://www.kaggle.com/datasets/nltkdata/europarl): Transcriptions of European Parliament proceedings between 1996-2006, collected in 11 languages.\n\n## Models and notebooks\n\n### Encoder-only models\n\n- [`ClassifierLM`](transformer/models/classifier.py): A generic transformer-based language model for assigning classes to text.\n  - [`notebooks/arxiv_categorization.ipynb`](notebooks/arxiv_categorization.ipynb) applies this model to the _arXiv Paper Abstracts_ dataset to categorize arXiv manuscripts based on their titles.\n- [`RegressorLM`](transformer/models/regressor.py): A generic transformer-based language model for 
assigning scores to text.\n  - [`notebooks/commonlit_readability.ipynb`](notebooks/commonlit_readability.ipynb) applies this model to the _CommonLit Readability Prize_ dataset to rate the complexity of literary passages for grade 3-12 students.\n\n### Decoder-only models\n\n- [`CausalLM`](transformer/models/causal.py): A generic transformer-based language model for generating text in an autoregressive manner.\n  - [`notebooks/florida_man_generation.ipynb`](notebooks/florida_man.ipynb) applies this model to the _Reddit r/FloridaMan_ dataset to generate humorous news headlines involving the (mis)adventures of Florida men and women.\n\n### Encoder-decoder models\n\n- [`Seq2SeqLM`](transformer/models/seq2seq.py): A generic transformer-based language model for generating output text given an input text.\n  - [`notebooks/arxiv_summarization.ipynb`](notebooks/arxiv_summarization.ipynb) applies this model to the _arXiv Paper Abstracts_ dataset to generate arXiv paper titles by summarizing their corresponding abstracts.\n  - [`notebooks/europarl_translation.ipynb`](notebooks/europarl_translation.ipynb) applies this model to the _Europarl_ dataset to translate transcribed parliamentary proceedings from French to English.\n\n## Repository structure\n\n- [**`notebooks/`**](notebooks/): Notebooks applying the models in [`transformer.models`](transformer/models/) to various datasets.\n- [**`transformer/`**](transformer/): Core package containing the transformer implementations.\n  - [**`dataloaders/`**](transformer/dataloaders/): [`LightningDataModule`](https://lightning.ai/docs/pytorch/stable/data/datamodule.html)s for each model in [`transformer.models`](transformer/models/).\n  - [**`decoding/`**](transformer/decoding/): Decoding method implementations for causal and sequence-to-sequence LMs.\n  - [**`models/`**](transformer/models/): Task-specific transformers implemented using [`transformer.modules.transformers`](transformer/modules/transformers/).\n  - 
[**`modules/`**](transformer/modules/): [`LightningModule`](https://lightning.ai/docs/pytorch/stable/common/lightning_module.html)s used within the transformers in [`transformer.models`](transformer/models/).\n    - [**`transformers/`**](transformer/modules/transformers/): Encoder-only, decoder-only and encoder-decoder transformer definitions.\n    - [`attention.py`](transformer/modules/attention.py): Masked/unmasked multi-head self attention definition.\n    - [`block.py`](transformer/modules/block.py): Transformer block definition.\n    - [`embedding.py`](transformer/modules/embedding.py): Positional encoding and input embedding definition.\n  - [**`params/`**](transformer/params/): Pydantic hyper-parameter classes.\n  - [**`utils/`**](transformer/utils/): Supporting custom layers, functions and constants.\n\n## Installation\n\nThe transformer implementation is installable as a local Python package, named `transformer`.\n\n```console\npip install -e .\n```\n\nTo run the notebooks, you will need additional dependencies which can be installed with the `notebooks` extra.\n\n```console\npip install -e \".[notebooks]\"\n```\n\n**This package was developed on Python 3.11.8, so it is recommended to use a virtual environment with the same version.**\n\n## Running\n\nYou should be able to simply run the Jupyter notebooks in the [`notebooks/`](notebooks/) folder.\n\n_Beware, they take time – even with a good GPU (especially the sequence-to-sequence ones)!_\n\n## References\n\n\u003ctable\u003e\n    \u003ctbody\u003e\n        \u003ctr\u003e\n            \u003ctd\u003e[1]\u003c/td\u003e\n            \u003ctd\u003e\n            \u003ca href=\"https://dl.acm.org/doi/10.5555/3295222.3295349\"\u003eVaswani et al., \u003cb\u003e\"Attention Is All You Need\"\u003c/b\u003e, \u003cem\u003eProceedings of the 31st International Conference on Neural Information Processing Systems (NIPS 2017)\u003c/em\u003e, 6000-6010.\u003c/a\u003e\n            \u003c/td\u003e\n        
\u003c/tr\u003e\n        \u003ctr\u003e\n            \u003ctd\u003e[2]\u003c/td\u003e\n            \u003ctd\u003e\n            \u003ca href=\"https://web.stanford.edu/~jurafsky/slp3/10.pdf\"\u003eDan Jurafsky \u0026 James H. Martin, \u003cb\u003e\"Transformers and Large Language Models\"\u003c/b\u003e, \u003cem\u003eSpeech and Language Processing, 3rd ed. draft (2024)\u003c/em\u003e, ch. 10.\u003c/a\u003e\n            \u003c/td\u003e\n        \u003c/tr\u003e\n        \u003ctr\u003e\n            \u003ctd\u003e[3]\u003c/td\u003e\n            \u003ctd\u003e\n            \u003ca href=\"https://www.youtube.com/watch?v=kCc8FmEb1nY\"\u003eAndrej Karpathy \u003cb\u003e\"Let's build GPT: from scratch, in code, spelled out.\"\u003c/b\u003e, \u003cem\u003eYouTube (2023)\u003c/em\u003e\u003c/a\u003e\n            \u003c/td\u003e\n        \u003c/tr\u003e\n    \u003c/tbody\u003e\n\u003c/table\u003e\n\n---\n\n\u003cp align=\"center\"\u003e\n  \u0026copy; 2024-2025, Edwin Onuonga - Published under the terms of the \u003ca href=\"https://opensource.org/licenses/MIT\"\u003eMIT\u003c/a\u003e license.\u003cbr/\u003e\n  \u003cem\u003eAuthored and maintained by Edwin Onuonga.\u003c/em\u003e\n\u003c/p\u003e\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Feonu%2Ftransformers-from-scratch","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Feonu%2Ftransformers-from-scratch","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Feonu%2Ftransformers-from-scratch/lists"}