https://github.com/eonu/transformers-from-scratch
Modular Python implementation of encoder-only, decoder-only and encoder-decoder transformer architectures from scratch, as detailed in Attention Is All You Need.
- Host: GitHub
- URL: https://github.com/eonu/transformers-from-scratch
- Owner: eonu
- License: MIT
- Created: 2024-07-16T10:21:46.000Z (about 1 year ago)
- Default Branch: main
- Last Pushed: 2024-08-24T22:26:47.000Z (about 1 year ago)
- Last Synced: 2025-02-09T01:43:00.019Z (8 months ago)
- Topics: attention-is-all-you-need, attention-mechanism, generation, generative-ai, gpt, llm, nlp, nlu, summarization, torch, transformer, transformer-from-scratch, translation
- Language: Python
- Size: 28 MB
- Stars: 0
- Watchers: 1
- Forks: 0
- Open Issues: 8
Metadata Files:
- Readme: README.md
- License: LICENSE
README
# Transformers From Scratch
Contents: Features · Example · Details · Datasets · Models and notebooks · Repository structure · Installation · Running · References
The repository contains a modular Python implementation of transformer architectures for natural language understanding and generation tasks, according to:
- The seminal paper _Attention Is All You Need_ by Vaswani et al.[1], which introduces the attention-based transformer architecture and its application to sequence-to-sequence tasks, demonstrating its effectiveness by achieving state-of-the-art machine translation performance and surpassing earlier LSTM- and CNN-based neural machine translation architectures.
- The chapter on _Transformers and Large Language Models_ from _Speech and Language Processing_ by Jurafsky & Martin[2], which provides a more comprehensive and illustrative look at some of the high-level details discussed in _Attention Is All You Need_.

## Features
- Generic encoder-only, decoder-only and encoder-decoder transformer architectures.
- Wrappers for causal language modelling, sequence-to-sequence generation and classification/regression.
- Various decoding methods for causal/sequence-to-sequence generation (sketched briefly after this list):
- Search-based (greedy and beam search)
- Sampling-based (nucleus, temperature and top-k sampling)
- Example applications to real-world datasets.
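For illustration, one decoding step of temperature sampling combined with top-k filtering might look like the following minimal sketch. This is generic PyTorch, not the repository's `TemperatureSamplingDecoder`; the function name and the assumption that `logits` holds the model's next-token scores are hypothetical.

```python
import torch

def sample_next_token(logits: torch.Tensor, temperature: float = 0.5, k: int = 5) -> int:
    """One decoding step: temperature-scaled top-k sampling over a (vocab_size,) logits vector."""
    # sharpen (temperature < 1) or flatten (temperature > 1) the distribution
    scaled = logits / temperature
    # keep only the k highest-scoring tokens
    top_values, top_indices = torch.topk(scaled, k)
    # renormalize over the surviving tokens and sample one of them
    probs = torch.softmax(top_values, dim=-1)
    choice = torch.multinomial(probs, num_samples=1)
    return int(top_indices[choice])
```

Greedy search corresponds to always taking the arg-max of `logits`, beam search keeps several candidate prefixes alive at once, and nucleus (top-p) sampling replaces the fixed `k` with the smallest set of tokens whose cumulative probability exceeds a threshold `p`.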
### PyTorch restrictions

This project is implemented using [PyTorch](https://pytorch.org/) and [PyTorch Lightning](https://lightning.ai/docs/pytorch/stable/).
As PyTorch provides a number of transformer- and attention-related layers in its [`torch.nn`](https://pytorch.org/docs/stable/nn.html) submodule, this project explicitly avoids the use of the following (a minimal from-scratch sketch of the attention computation is given after these lists):
- [`torch.nn.Transformer`](https://pytorch.org/docs/stable/generated/torch.nn.Transformer.html#torch.nn.Transformer)
- [`torch.nn.TransformerEncoder`](https://pytorch.org/docs/stable/generated/torch.nn.TransformerEncoder.html#torch.nn.TransformerEncoder)/[`torch.nn.TransformerEncoderLayer`](https://pytorch.org/docs/stable/generated/torch.nn.TransformerEncoderLayer.html#torch.nn.TransformerEncoderLayer)
- [`torch.nn.TransformerDecoder`](https://pytorch.org/docs/stable/generated/torch.nn.TransformerDecoder.html#torch.nn.TransformerDecoder)/[`torch.nn.TransformerDecoderLayer`](https://pytorch.org/docs/stable/generated/torch.nn.TransformerDecoderLayer.html#torch.nn.TransformerDecoderLayer)
- [`torch.nn.MultiheadAttention`](https://pytorch.org/docs/stable/generated/torch.nn.MultiheadAttention.html#torch.nn.MultiheadAttention)
- [`torch.nn.functional.scaled_dot_product_attention`](https://pytorch.org/docs/stable/generated/torch.nn.functional.scaled_dot_product_attention.html#torch.nn.functional.scaled_dot_product_attention)

All other layers provided by `torch.nn` are allowed, including:
- [`nn.Embedding`](https://pytorch.org/docs/stable/generated/torch.nn.Embedding.html#torch.nn.Embedding): For token embedding look-up by vocabulary ID.
- [`nn.LayerNorm`](https://pytorch.org/docs/stable/generated/torch.nn.LayerNorm.html#torch.nn.LayerNorm): For layer normalization as implemented in _Attention Is All You Need_.
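As a rough illustration of what avoiding those layers entails, scaled dot-product attention can be written directly with basic tensor operations. The sketch below is generic and not the repository's `attention.py`; the function name, tensor shapes and optional boolean `mask` argument are assumptions.

```python
import math
import torch

def scaled_dot_product_attention(
    q: torch.Tensor,                   # queries, shape (batch, seq_len, d_k)
    k: torch.Tensor,                   # keys,    shape (batch, seq_len, d_k)
    v: torch.Tensor,                   # values,  shape (batch, seq_len, d_v)
    mask: torch.Tensor | None = None,  # True where attention should be blocked
) -> torch.Tensor:
    # similarity of every query with every key, scaled by sqrt(d_k)
    scores = q @ k.transpose(-2, -1) / math.sqrt(q.size(-1))
    if mask is not None:
        # e.g. a causal mask for masked (decoder) self-attention
        scores = scores.masked_fill(mask, float("-inf"))
    # normalize scores into attention weights and take a weighted sum of the values
    weights = torch.softmax(scores, dim=-1)
    return weights @ v
```

Multi-head attention then projects the queries, keys and values into several lower-dimensional heads, applies this computation per head, and concatenates and re-projects the results.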
### Other restrictions

- Transformer models implemented and made available in other libraries such as HuggingFace's [`transformers`](https://huggingface.co/docs/transformers/en/index) are not used in this project.
- However, the tokenizers provided by `transformers` were used, as developing tokenization algorithms was not the primary objective of this project.
- No existing _"x from scratch"_ resources were used, such as the famous _Let's build GPT: from scratch, in code, spelled out._ by Andrej Karpathy[3].
- No other online resources were used, apart from official documentation for packages such as [PyTorch](https://pytorch.org/docs/stable/index.html), [PyTorch Lightning](https://lightning.ai/docs/pytorch/stable/) and [HuggingFace Tokenizers](https://huggingface.co/docs/transformers/en/main_classes/tokenizer).

## Example
Training a causal language model to generate "Florida man"-style news headlines.
```python
from transformers import LlamaTokenizer

from transformer.params import TransformerParams, TemperatureSamplingParams
from transformer.models import CausalLM
from transformer.decoding import TemperatureSamplingDecoder

# initialize HuggingFace tokenizer
tokenizer = LlamaTokenizer.from_pretrained(
"huggyllama/llama-7b", add_eos_token=True, legacy=False
)
tokenizer.add_special_tokens({"pad_token": "<pad>"})  # "<pad>" assumed as the padding token

# initialize the causal language model
model = CausalLM(
params=TransformerParams(context_length=64),
tokenizer=tokenizer,
)

# train the language model
model.train(...)

# initialize decoder for sequence generation
decoder = TemperatureSamplingDecoder(
params=TemperatureSamplingParams(max_length=100, temperature=0.5, k=5),
model=model,
)

# generation without context
decoder.generate()
'Florida man arrested after baby alligator, guns, drugs found inside truck'

# generation with context
decoder.generate("Florida man shot")
'Florida man shot and killed while attempting to steal pizza and Pokemon cards from Target'
```

## Details
While the original architecture described in _Attention Is All You Need_ is an encoder-decoder transformer for neural machine translation (a sequence-to-sequence task), this project was designed to be more general, supporting a variety of natural language tasks by implementing encoder-only, decoder-only and encoder-decoder architectures, summarized in the table below.
|  | Encoder-only | Decoder-only | Encoder-decoder |
| --- | --- | --- | --- |
| **Tasks** | Contextualized embedding and supervised inference | Autoregressive generation | Sequence-to-sequence generation |
| **Example use-cases** | Producing contextualized token embeddings, sentiment classification, intent classification | Text generation | Machine translation, text summarization |
## Datasets
The following datasets were used to test the above transformer implementations on various tasks.
- [arXiv Paper Abstracts](https://www.kaggle.com/datasets/spsayakpaul/arxiv-paper-abstracts): arXiv manuscripts and their metadata including titles, abstracts and categories.
- [CommonLit Readability Prize](https://www.kaggle.com/competitions/commonlitreadabilityprize): Literary passages and their associated "readability" score for use in grade 3-12 classrooms.
- [Reddit r/FloridaMan](https://www.kaggle.com/datasets/bcruise/reddit-rfloridaman): News headlines about various (often funny and irrational) actions performed by Florida men and women.
- [Europarl](https://www.kaggle.com/datasets/nltkdata/europarl): Transcriptions of European Parliament proceedings between 1996 and 2006, collected in 11 languages.
## Models and notebooks
### Encoder-only models
- [`ClassifierLM`](transformer/models/classifier.py): A generic transformer-based language model for assigning classes to text.
- [`notebooks/arxiv_categorization.ipynb`](notebooks/arxiv_categorization.ipynb) applies this model to the _arXiv Paper Abstracts_ dataset to categorize arXiv manuscripts based on their titles.
- [`RegressorLM`](transformer/models/regressor.py): A generic transformer-based language model for assigning scores to text.
- [`notebooks/commonlit_readability.ipynb`](notebooks/commonlit_readability.ipynb) applies this model to the _CommonLit Readability Prize_ dataset to rate the complexity of literary passages for grade 3-12 students.
### Decoder-only models
- [`CausalLM`](transformer/models/causal.py): A generic transformer-based language model for generating text in an autoregressive manner.
- [`notebooks/florida_man.ipynb`](notebooks/florida_man.ipynb) applies this model to the _Reddit r/FloridaMan_ dataset to generate humorous news headlines involving the (mis)adventures of Florida men and women.
### Encoder-decoder models
- [`Seq2SeqLM`](transformer/models/seq2seq.py): A generic transformer-based language model for generating output text given an input text.
- [`notebooks/arxiv_summarization.ipynb`](notebooks/arxiv_summarization.ipynb) applies this model to the _arXiv Paper Abstracts_ dataset to generate arXiv paper titles by summarizing their corresponding abstracts.
- [`notebooks/europarl_translation.ipynb`](notebooks/europarl_translation.ipynb) applies this model to the _Europarl_ dataset to translate transcribed parliamentary proceedings from French to English.
## Repository structure
- [**`notebooks/`**](notebooks/): Notebooks applying the models in [`transformer.models`](transformer/models/) to various datasets.
- [**`transformer/`**](transformer/): Core package containing the transformer implementations.
- [**`dataloaders/`**](transformer/dataloaders/): [`LightningDataModule`](https://lightning.ai/docs/pytorch/stable/data/datamodule.html)s for each model in [`transformer.models`](transformer/models/).
- [**`decoding/`**](transformer/decoding/): Decoding method implementations for causal and sequence-to-sequence LMs.
- [**`models/`**](transformer/models/): Task-specific transformers implemented using [`transformer.modules.transformers`](transformer/modules/transformers/).
- [**`modules/`**](transformer/modules/): [`LightningModule`](https://lightning.ai/docs/pytorch/stable/common/lightning_module.html)s used within the transformers in [`transformer.models`](transformer/models/).
- [**`transformers/`**](transformer/modules/transformers/): Encoder-only, decoder-only and encoder-decoder transformer definitions.
- [`attention.py`](transformer/modules/attention.py): Masked/unmasked multi-head self attention definition.
- [`block.py`](transformer/modules/block.py): Transformer block definition.
- [`embedding.py`](transformer/modules/embedding.py): Positional encoding and input embedding definition (a sketch of the sinusoidal positional encoding is shown after this list).
- [**`params/`**](transformer/params/): Pydantic hyper-parameter classes.
- [**`utils/`**](transformer/utils/): Supporting custom layers, functions and constants.
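As a point of reference for the positional encoding mentioned above, the sinusoidal scheme from _Attention Is All You Need_ can be sketched as follows; this is an illustrative implementation and not necessarily identical to [`embedding.py`](transformer/modules/embedding.py).

```python
import math
import torch

def sinusoidal_positional_encoding(context_length: int, d_model: int) -> torch.Tensor:
    """Return a (context_length, d_model) matrix with
    PE(pos, 2i)   = sin(pos / 10000^(2i / d_model))
    PE(pos, 2i+1) = cos(pos / 10000^(2i / d_model))."""
    positions = torch.arange(context_length).unsqueeze(1)  # (context_length, 1)
    div_terms = torch.exp(                                 # (d_model / 2,)
        torch.arange(0, d_model, 2) * (-math.log(10000.0) / d_model)
    )
    encoding = torch.zeros(context_length, d_model)
    encoding[:, 0::2] = torch.sin(positions * div_terms)   # even dimensions
    encoding[:, 1::2] = torch.cos(positions * div_terms)   # odd dimensions
    return encoding
```

These encodings are added to the token embeddings before the first transformer block, giving the model information about token positions without recurrence.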
## Installation
The transformer implementation is installable as a local Python package, named `transformer`.
```console
pip install -e .
```
To run the notebooks, you will need additional dependencies which can be installed with the `notebooks` extra.
```console
pip install -e ".[notebooks]"
```
**This package was developed on Python 3.11.8, so it is recommended to use a virtual environment with the same version.**
## Running
You should be able to simply run the Jupyter notebooks in the [`notebooks/`](notebooks/) folder.
_Beware, they take time – even with a good GPU (especially the sequence-to-sequence ones)!_
## References
[1]
Vaswani et al., "Attention Is All You Need", Proceedings of the 31st International Conference on Neural Information Processing Systems (NIPS 2017), 6000-6010.
[2]
Dan Jurafsky & James H. Martin, "Transformers and Large Language Models", Speech and Language Processing, 3rd ed. draft (2024), ch. 10.
[3]
Andrej Karpathy, "Let's build GPT: from scratch, in code, spelled out.", YouTube (2023).
---
© 2024-2025, Edwin Onuonga - Published under the terms of the MIT license.
Authored and maintained by Edwin Onuonga.