# Transformers From Scratch



**Contents**
[Features](#features) ·
[Example](#example) ·
[Details](#details) ·
[Datasets](#datasets) ·
[Models and notebooks](#models-and-notebooks) ·
[Repository structure](#repository-structure) ·
[Installation](#installation) ·
[Running](#running) ·
[References](#references)

The repository contains a modular Python implementation of transformer architectures for natural language understanding and generation tasks, according to:

- The seminal paper _Attention Is All You Need_ by Vaswani et al.[1], which introduces the attention-based transformer architecture and applies it to sequence-to-sequence tasks, achieving state-of-the-art machine translation performance and surpassing earlier LSTM- and CNN-based neural machine translation architectures.
- The chapter on _Transformers and Large Language Models_ from _Speech and Language Processing_ by Jurafsky & Martin[2], which provides a more comprehensive and illustrative look at some of the high-level details discussed in _Attention Is All You Need_.

## Features

- Generic encoder-only, decoder-only and encoder-decoder transformer architectures.
- Wrappers for causal language modelling, sequence-to-sequence generation and classification/regression.
- Various decoding methods for causal/sequence-to-sequence generation (a minimal sampling sketch follows after this list):
  - Search-based (greedy and beam search)
  - Sampling-based (nucleus, temperature and top-k sampling)
- Example applications to real-world datasets.
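To make the sampling-based decoding concrete, here is a minimal, self-contained sketch of temperature and top-k sampling over next-token logits. It is a conceptual illustration only, not the repository's actual `TemperatureSamplingDecoder` implementation.

```python
import torch
import torch.nn.functional as F

def sample_next_token(logits: torch.Tensor, temperature: float = 1.0, k: int = 50) -> int:
    """Sample a token ID from next-token logits using temperature and top-k filtering."""
    # lower temperature sharpens the distribution, higher temperature flattens it
    scaled = logits / temperature
    # keep only the k highest-scoring tokens and renormalize over them
    top_values, top_indices = torch.topk(scaled, k)
    probs = F.softmax(top_values, dim=-1)
    # draw one token from the filtered distribution
    choice = torch.multinomial(probs, num_samples=1)
    return int(top_indices[choice])
```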

### PyTorch restrictions

This project is implemented using [PyTorch](https://pytorch.org/) and [PyTorch Lightning](https://lightning.ai/docs/pytorch/stable/).

As PyTorch provides a number of transformer and attention related layers in its [`torch.nn`](https://pytorch.org/docs/stable/nn.html) submodule, this project explicitly avoids the use of:

- [`torch.nn.Transformer`](https://pytorch.org/docs/stable/generated/torch.nn.Transformer.html#torch.nn.Transformer)
- [`torch.nn.TransformerEncoder`](https://pytorch.org/docs/stable/generated/torch.nn.TransformerEncoder.html#torch.nn.TransformerEncoder)/[`torch.nn.TransformerEncoderLayer`](https://pytorch.org/docs/stable/generated/torch.nn.TransformerEncoderLayer.html#torch.nn.TransformerEncoderLayer)
- [`torch.nn.TransformerDecoder`](https://pytorch.org/docs/stable/generated/torch.nn.TransformerDecoder.html#torch.nn.TransformerDecoder)/[`torch.nn.TransformerDecoderLayer`](https://pytorch.org/docs/stable/generated/torch.nn.TransformerDecoderLayer.html#torch.nn.TransformerDecoderLayer)
- [`torch.nn.MultiheadAttention`](https://pytorch.org/docs/stable/generated/torch.nn.MultiheadAttention.html#torch.nn.MultiheadAttention)
- [`torch.nn.functional.scaled_dot_product_attention`](https://pytorch.org/docs/stable/generated/torch.nn.functional.scaled_dot_product_attention.html#torch.nn.functional.scaled_dot_product_attention)

All other layers provided by `torch.nn` are allowed, including:

- [`nn.Embedding`](https://pytorch.org/docs/stable/generated/torch.nn.Embedding.html#torch.nn.Embedding): For token embedding look-up by vocabulary ID.
- [`nn.LayerNorm`](https://pytorch.org/docs/stable/generated/torch.nn.LayerNorm.html#torch.nn.LayerNorm): For layer normalization as implemented in _Attention Is All You Need_.
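
As a rough illustration of what re-implementing the avoided layers entails, the following is a minimal sketch of scaled dot-product attention written with only basic tensor operations; it is not necessarily identical to the implementation in `transformer/modules/attention.py`.

```python
import math
import torch

def scaled_dot_product_attention(q, k, v, mask=None):
    """softmax(QK^T / sqrt(d_k)) V, built from basic tensor operations only."""
    d_k = q.size(-1)
    # similarity between each query and each key, scaled by sqrt(d_k)
    scores = q @ k.transpose(-2, -1) / math.sqrt(d_k)
    if mask is not None:
        # e.g. a causal mask for decoder self-attention
        scores = scores.masked_fill(mask == 0, float("-inf"))
    weights = torch.softmax(scores, dim=-1)
    return weights @ v
```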

### Other restrictions

- Transformer models implemented and made available in other libraries such as HuggingFace's [`transformers`](https://huggingface.co/docs/transformers/en/index) are not used in this project.
  - However, the tokenizers provided by `transformers` were used, as developing tokenization algorithms was not the primary objective of this project.
- No existing _"x from scratch"_ resources were used, such as the famous _Let's build GPT: from scratch, in code, spelled out._ by Andrej Karpathy[3].
- No other online resources were used, apart from official documentation for packages such as [PyTorch](https://pytorch.org/docs/stable/index.html), [PyTorch Lightning](https://lightning.ai/docs/pytorch/stable/) and [HuggingFace Tokenizers](https://huggingface.co/docs/transformers/en/main_classes/tokenizer).

## Example

Training a causal language model to generate "Florida man"-style news headlines.

```python
from transformers import LlamaTokenizer

from transformer.params import TransformerParams, TemperatureSamplingParams
from transformer.models import CausalLM
from transformer.decoding import TemperatureSamplingDecoder

# initialize HuggingFace tokenizer
tokenizer = LlamaTokenizer.from_pretrained(
    "huggyllama/llama-7b", add_eos_token=True, legacy=False
)
tokenizer.add_special_tokens({"pad_token": "<pad>"})

# initialize the causal language model
model = CausalLM(
    params=TransformerParams(context_length=64),
    tokenizer=tokenizer,
)

# train the language model
model.train(...)

# initialize decoder for sequence generation
decoder = TemperatureSamplingDecoder(
    params=TemperatureSamplingParams(max_length=100, temperature=0.5, k=5),
    model=model,
)

# generation without context
decoder.generate()
# 'Florida man arrested after baby alligator, guns, drugs found inside truck'

# generation with context
decoder.generate("Florida man shot")
# 'Florida man shot and killed while attempting to steal pizza and Pokemon cards from Target'
```

## Details

While the original architecture described in _Attention Is All You Need_ is an encoder-decoder transformer for neural machine translation, a sequence-to-sequence task, this project is designed to be more general, supporting a variety of natural language tasks through encoder-only, decoder-only and encoder-decoder architectures.




| | Encoder-only | Decoder-only | Encoder-decoder |
|---|---|---|---|
| **Tasks** | Contextualized embedding and supervised inference | Autoregressive generation | Sequence-to-sequence generation |
| **Example use-cases** | Producing contextualized token embeddings<br>Sentiment classification<br>Intent classification | Text generation | Machine translation<br>Text summarization |


## Datasets

The following datasets were used to test the above transformer implementations on various tasks.

- [arXiv Paper Abstracts](https://www.kaggle.com/datasets/spsayakpaul/arxiv-paper-abstracts): arXiv manuscripts and their metadata including titles, abstracts and categories.
- [CommonLit Readability Prize](https://www.kaggle.com/competitions/commonlitreadabilityprize): Literary passages and their associated "readability" score for use in grade 3-12 classrooms.
- [Reddit r/FloridaMan](https://www.kaggle.com/datasets/bcruise/reddit-rfloridaman): News headlines about various (often funny and irrational) actions performed by Florida men and women.
- [Europarl](https://www.kaggle.com/datasets/nltkdata/europarl): Transcriptions of European Parliament proceedings collected between 1996 and 2006 in 11 languages.

## Models and notebooks

### Encoder-only models

- [`ClassifierLM`](transformer/models/classifier.py): A generic transformer-based language model for assigning classes to text.
  - [`notebooks/arxiv_categorization.ipynb`](notebooks/arxiv_categorization.ipynb) applies this model to the _arXiv Paper Abstracts_ dataset to categorize arXiv manuscripts based on their titles.
- [`RegressorLM`](transformer/models/regressor.py): A generic transformer-based language model for assigning scores to text.
  - [`notebooks/commonlit_readability.ipynb`](notebooks/commonlit_readability.ipynb) applies this model to the _CommonLit Readability Prize_ dataset to rate the complexity of literary passages for grade 3-12 students.

### Decoder-only models

- [`CausalLM`](transformer/models/causal.py): A generic transformer-based language model for generating text in an autoregressive manner.
  - [`notebooks/florida_man_generation.ipynb`](notebooks/florida_man.ipynb) applies this model to the _Reddit r/FloridaMan_ dataset to generate humorous news headlines involving the (mis)adventures of Florida men and women.

### Encoder-decoder models

- [`Seq2SeqLM`](transformer/models/seq2seq.py): A generic transformer-based language model for generating output text given an input text.
  - [`notebooks/arxiv_summarization.ipynb`](notebooks/arxiv_summarization.ipynb) applies this model to the _arXiv Paper Abstracts_ dataset to generate arXiv paper titles by summarizing their corresponding abstracts.
  - [`notebooks/europarl_translation.ipynb`](notebooks/europarl_translation.ipynb) applies this model to the _Europarl_ dataset to translate transcribed parliamentary proceedings from French to English.
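
Below is a hypothetical sketch of how a sequence-to-sequence model might be set up for the Europarl translation task, assuming `Seq2SeqLM` follows the same constructor pattern as the `CausalLM` example above; the notebooks are the authoritative reference.

```python
# hypothetical sketch: assumes Seq2SeqLM mirrors the CausalLM constructor shown earlier
from transformers import LlamaTokenizer

from transformer.params import TransformerParams
from transformer.models import Seq2SeqLM

tokenizer = LlamaTokenizer.from_pretrained("huggyllama/llama-7b", legacy=False)
tokenizer.add_special_tokens({"pad_token": "<pad>"})

model = Seq2SeqLM(
    params=TransformerParams(context_length=128),  # hypothetical hyper-parameters
    tokenizer=tokenizer,
)

# train on French-English sentence pairs (e.g. from the Europarl dataset)
model.train(...)
```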

## Repository structure

- [**`notebooks/`**](notebooks/): Notebooks applying the models in [`transformer.models`](transformer/models/) to various datasets.
- [**`transformer/`**](transformer/): Core package containing the transformer implementations.
  - [**`dataloaders/`**](transformer/dataloaders/): [`LightningDataModule`](https://lightning.ai/docs/pytorch/stable/data/datamodule.html)s for each model in [`transformer.models`](transformer/models/).
  - [**`decoding/`**](transformer/decoding/): Decoding method implementations for causal and sequence-to-sequence LMs.
  - [**`models/`**](transformer/models/): Task-specific transformers implemented using [`transformer.modules.transformers`](transformer/modules/transformers/).
  - [**`modules/`**](transformer/modules/): [`LightningModule`](https://lightning.ai/docs/pytorch/stable/common/lightning_module.html)s used within the transformers in [`transformer.models`](transformer/models/).
    - [**`transformers/`**](transformer/modules/transformers/): Encoder-only, decoder-only and encoder-decoder transformer definitions.
    - [`attention.py`](transformer/modules/attention.py): Masked/unmasked multi-head self-attention definition.
    - [`block.py`](transformer/modules/block.py): Transformer block definition.
    - [`embedding.py`](transformer/modules/embedding.py): Positional encoding and input embedding definition (a sketch of sinusoidal positional encoding follows after this list).
  - [**`params/`**](transformer/params/): Pydantic hyper-parameter classes.
  - [**`utils/`**](transformer/utils/): Supporting custom layers, functions and constants.
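
For reference, a minimal sketch of the sinusoidal positional encoding defined in _Attention Is All You Need_ (assuming an even `d_model`; the actual implementation in `embedding.py` may differ in detail):

```python
import math
import torch

def sinusoidal_positional_encoding(context_length: int, d_model: int) -> torch.Tensor:
    """PE(pos, 2i) = sin(pos / 10000^(2i/d_model)), PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model))."""
    positions = torch.arange(context_length, dtype=torch.float32).unsqueeze(1)
    # 1 / 10000^(2i/d_model) for each even dimension index 2i
    div_terms = torch.exp(
        torch.arange(0, d_model, 2, dtype=torch.float32) * (-math.log(10000.0) / d_model)
    )
    encoding = torch.zeros(context_length, d_model)
    encoding[:, 0::2] = torch.sin(positions * div_terms)  # even dimensions
    encoding[:, 1::2] = torch.cos(positions * div_terms)  # odd dimensions
    return encoding
```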

## Installation

The transformer implementation is installable as a local Python package, named `transformer`.

```console
pip install -e .
```

To run the notebooks, you will need additional dependencies which can be installed with the `notebooks` extra.

```console
pip install -e ".[notebooks]"
```

**This package was developed on Python 3.11.8, so it is recommended to use a virtual environment with the same version.**
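
For example, to create and use a matching virtual environment (assuming a `python3.11` interpreter is available on your system):

```console
python3.11 -m venv .venv
source .venv/bin/activate
pip install -e ".[notebooks]"
```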

## Running

You should be able to simply run the Jupyter notebooks in the [`notebooks/`](notebooks/) folder.

_Beware, they take time – even with a good GPU (especially the sequence-to-sequence ones)!_

## References



[1] Vaswani et al., "Attention Is All You Need", Proceedings of the 31st International Conference on Neural Information Processing Systems (NIPS 2017), 6000-6010.

[2] Dan Jurafsky & James H. Martin, "Transformers and Large Language Models", Speech and Language Processing, 3rd ed. draft (2024), ch. 10.

[3] Andrej Karpathy, "Let's build GPT: from scratch, in code, spelled out.", YouTube (2023).


---


© 2024-2025, Edwin Onuonga - Published under the terms of the MIT license.

Authored and maintained by Edwin Onuonga.