https://github.com/gs-101/nanogpt-from-scratch
Model created by following the "Let's build GPT: from scratch, in code, spelled out." lecture by Andrej Karpathy, in Org Mode.
- Host: GitHub
- URL: https://github.com/gs-101/nanogpt-from-scratch
- Owner: gs-101
- License: mit
- Created: 2024-11-16T21:31:22.000Z (11 months ago)
- Default Branch: main
- Last Pushed: 2025-01-28T21:01:05.000Z (8 months ago)
- Last Synced: 2025-02-06T09:13:46.014Z (8 months ago)
- Topics: llm, nanogpt, org-mode, python
- Language: Python
- Homepage: https://codeberg.org/gs-101/nanoGPT-from-scratch
- Size: 439 KB
- Stars: 0
- Watchers: 1
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.org
- License: LICENSE
README
:PROPERTIES:
:ID: fdda529f-14ed-4fe5-b898-5a2161d5d6b5
:ROAM_REFS: @karpathyLetsBuildGPT2023
:END:
#+title: Let’s Build GPT: From Scratch, in Code, Spelled Out.
#+filetags: :artificial_intelligence:computer_science:machine_learning:neural_and_evolutionary_computing:
#+OPTIONS: f:t
* Summary
Large Language Model lecture by Andrej Karpathy, going through the process of building a Generatively Pretrained Transformer (GPT), following the papers "Attention is All You Need"[fn:1] and "Language Models are Few-Shot Learners"[fn:2].
* [[https://www.youtube.com/watch?v=kCc8FmEb1nY&t=0s][Intro]]
Andrej introduces GPT. It's an auto-complete system that takes an input and outputs the most likely continuation. The Transformer is the part that does the heavy lifting, and it comes from the "Attention is All You Need" paper. This paper was extremely influential for the field of artificial intelligence, though the authors didn't fully anticipate in the paper the impact it would have. The goal of this lecture is to develop a transformer-based language model, not to replicate ChatGPT, which is trained on a massive amount of data. The dataset used in this lecture is [[https://huggingface.co/datasets/karpathy/tiny_shakespeare][tiny_shakespeare]], the complete works of Shakespeare in about 1 megabyte. In this local model, text is generated character by character, while modern LLMs use tokens, which are closer to word chunks. Another dataset that could be used is [[https://huggingface.co/datasets/Skylion007/openwebtext][OpenWebText]].
NOTE: This lecture uses Python and builds a GPT, so the approach is probably valid for other LLMs too. I hope the reason for these choices is explained here.
ANSWER: After watching the video, the reason Python is used is [[https://pytorch.org/][PyTorch]], a machine learning library that implements many mathematical functions for tensors, which are multidimensional arrays. Arrays are crucial in machine learning, as they are how characters are represented and arranged.
* [[https://www.youtube.com/watch?v=kCc8FmEb1nY&t=472s][Reading and Exploring Data]]
First things first, start with downloading a dataset:
#+begin_src bash
wget https://raw.githubusercontent.com/karpathy/char-rnn/master/data/tinyshakespeare/input.txt
#+end_src

Then we let Python interact with the file:
#+name: dataset
#+begin_src python :session train
with open("input.txt", "r", encoding="utf-8") as f:
    text = f.read()
#+end_src

Then, we get all the unique occurring characters in the text:
#+name: chars-train
#+begin_src python :session train
chars = sorted(list(set(text)))
vocab_size = len(chars)
print("".join(chars))
print(vocab_size)
#+end_src

* [[https://www.youtube.com/watch?v=kCc8FmEb1nY&t=568s][Tokenization]]
This encodes a text, returning the value of each of its characters as integers in an array. The encoded text can also be decoded to return the exact same string.
We then create a mapping to translate the characters:
#+name: decoding-train
#+begin_src python :session train
# create a mapping from characters to integers
stoi = {ch: i for i, ch in enumerate(chars)}
itos = {i: ch for i, ch in enumerate(chars)}
encode = lambda s: [
    stoi[c] for c in s
]  # encoder: take a string, output a list of integers
decode = lambda l: "".join(
    [itos[i] for i in l]
)  # decoder: take a list of integers, output a string
print(encode("hii there"))
print(decode(encode("hii there")))
#+end_src

There are other ways to do this, such as using sentencepiece[fn:3] (which encodes "sub-words"), or using tiktoken[fn:4]. This is supposed to be a simple transformer model, so these tokenisers aren't used here.
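As a point of comparison (an aside, not used in the rest of this document), a minimal sub-word tokenization sketch with tiktoken, assuming the ~tiktoken~ package is installed, could look like this:
#+begin_src python
import tiktoken  # assumption: the tiktoken package is installed

enc = tiktoken.get_encoding("gpt2")  # GPT-2's BPE vocabulary (~50k tokens)
ids = enc.encode("hii there")
print(ids)              # far fewer integers than the character-level encoding
print(enc.decode(ids))  # round-trips back to "hii there"
#+end_src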
Now that we have both an encoder and a decoder, we can translate the Shakespeare dataset.
#+name: import
#+begin_src python :session train
import torch
#+end_src

#+name: data_translate-train
#+begin_src python :session train :results output
data = torch.tensor(encode(text), dtype=torch.long)
print(data.shape, data.dtype)
print(data[:1000])
#+end_src

#+RESULTS: data_translate-train
#+begin_example
torch.Size([1115394]) torch.int64
tensor([18, 47, 56, 57, 58, 1, 15, 47, 58, 47, 64, 43, 52, 10, 0, 14, 43, 44,
53, 56, 43, 1, 61, 43, 1, 54, 56, 53, 41, 43, 43, 42, 1, 39, 52, 63,
1, 44, 59, 56, 58, 46, 43, 56, 6, 1, 46, 43, 39, 56, 1, 51, 43, 1,
57, 54, 43, 39, 49, 8, 0, 0, 13, 50, 50, 10, 0, 31, 54, 43, 39, 49,
6, 1, 57, 54, 43, 39, 49, 8, 0, 0, 18, 47, 56, 57, 58, 1, 15, 47,
58, 47, 64, 43, 52, 10, 0, 37, 53, 59, 1, 39, 56, 43, 1, 39, 50, 50,
1, 56, 43, 57, 53, 50, 60, 43, 42, 1, 56, 39, 58, 46, 43, 56, 1, 58,
53, 1, 42, 47, 43, 1, 58, 46, 39, 52, 1, 58, 53, 1, 44, 39, 51, 47,
57, 46, 12, 0, 0, 13, 50, 50, 10, 0, 30, 43, 57, 53, 50, 60, 43, 42,
8, 1, 56, 43, 57, 53, 50, 60, 43, 42, 8, 0, 0, 18, 47, 56, 57, 58,
1, 15, 47, 58, 47, 64, 43, 52, 10, 0, 18, 47, 56, 57, 58, 6, 1, 63,
53, 59, 1, 49, 52, 53, 61, 1, 15, 39, 47, 59, 57, 1, 25, 39, 56, 41,
47, 59, 57, 1, 47, 57, 1, 41, 46, 47, 43, 44, 1, 43, 52, 43, 51, 63,
1, 58, 53, 1, 58, 46, 43, 1, 54, 43, 53, 54, 50, 43, 8, 0, 0, 13,
50, 50, 10, 0, 35, 43, 1, 49, 52, 53, 61, 5, 58, 6, 1, 61, 43, 1,
49, 52, 53, 61, 5, 58, 8, 0, 0, 18, 47, 56, 57, 58, 1, 15, 47, 58,
47, 64, 43, 52, 10, 0, 24, 43, 58, 1, 59, 57, 1, 49, 47, 50, 50, 1,
46, 47, 51, 6, 1, 39, 52, 42, 1, 61, 43, 5, 50, 50, 1, 46, 39, 60,
43, 1, 41, 53, 56, 52, 1, 39, 58, 1, 53, 59, 56, 1, 53, 61, 52, 1,
54, 56, 47, 41, 43, 8, 0, 21, 57, 5, 58, 1, 39, 1, 60, 43, 56, 42,
47, 41, 58, 12, 0, 0, 13, 50, 50, 10, 0, 26, 53, 1, 51, 53, 56, 43,
1, 58, 39, 50, 49, 47, 52, 45, 1, 53, 52, 5, 58, 11, 1, 50, 43, 58,
1, 47, 58, 1, 40, 43, 1, 42, 53, 52, 43, 10, 1, 39, 61, 39, 63, 6,
1, 39, 61, 39, 63, 2, 0, 0, 31, 43, 41, 53, 52, 42, 1, 15, 47, 58,
47, 64, 43, 52, 10, 0, 27, 52, 43, 1, 61, 53, 56, 42, 6, 1, 45, 53,
53, 42, 1, 41, 47, 58, 47, 64, 43, 52, 57, 8, 0, 0, 18, 47, 56, 57,
58, 1, 15, 47, 58, 47, 64, 43, 52, 10, 0, 35, 43, 1, 39, 56, 43, 1,
39, 41, 41, 53, 59, 52, 58, 43, 42, 1, 54, 53, 53, 56, 1, 41, 47, 58,
47, 64, 43, 52, 57, 6, 1, 58, 46, 43, 1, 54, 39, 58, 56, 47, 41, 47,
39, 52, 57, 1, 45, 53, 53, 42, 8, 0, 35, 46, 39, 58, 1, 39, 59, 58,
46, 53, 56, 47, 58, 63, 1, 57, 59, 56, 44, 43, 47, 58, 57, 1, 53, 52,
1, 61, 53, 59, 50, 42, 1, 56, 43, 50, 47, 43, 60, 43, 1, 59, 57, 10,
1, 47, 44, 1, 58, 46, 43, 63, 0, 61, 53, 59, 50, 42, 1, 63, 47, 43,
50, 42, 1, 59, 57, 1, 40, 59, 58, 1, 58, 46, 43, 1, 57, 59, 54, 43,
56, 44, 50, 59, 47, 58, 63, 6, 1, 61, 46, 47, 50, 43, 1, 47, 58, 1,
61, 43, 56, 43, 0, 61, 46, 53, 50, 43, 57, 53, 51, 43, 6, 1, 61, 43,
1, 51, 47, 45, 46, 58, 1, 45, 59, 43, 57, 57, 1, 58, 46, 43, 63, 1,
56, 43, 50, 47, 43, 60, 43, 42, 1, 59, 57, 1, 46, 59, 51, 39, 52, 43,
50, 63, 11, 0, 40, 59, 58, 1, 58, 46, 43, 63, 1, 58, 46, 47, 52, 49,
1, 61, 43, 1, 39, 56, 43, 1, 58, 53, 53, 1, 42, 43, 39, 56, 10, 1,
58, 46, 43, 1, 50, 43, 39, 52, 52, 43, 57, 57, 1, 58, 46, 39, 58, 0,
39, 44, 44, 50, 47, 41, 58, 57, 1, 59, 57, 6, 1, 58, 46, 43, 1, 53,
40, 48, 43, 41, 58, 1, 53, 44, 1, 53, 59, 56, 1, 51, 47, 57, 43, 56,
63, 6, 1, 47, 57, 1, 39, 57, 1, 39, 52, 0, 47, 52, 60, 43, 52, 58,
53, 56, 63, 1, 58, 53, 1, 54, 39, 56, 58, 47, 41, 59, 50, 39, 56, 47,
57, 43, 1, 58, 46, 43, 47, 56, 1, 39, 40, 59, 52, 42, 39, 52, 41, 43,
11, 1, 53, 59, 56, 0, 57, 59, 44, 44, 43, 56, 39, 52, 41, 43, 1, 47,
57, 1, 39, 1, 45, 39, 47, 52, 1, 58, 53, 1, 58, 46, 43, 51, 1, 24,
43, 58, 1, 59, 57, 1, 56, 43, 60, 43, 52, 45, 43, 1, 58, 46, 47, 57,
1, 61, 47, 58, 46, 0, 53, 59, 56, 1, 54, 47, 49, 43, 57, 6, 1, 43,
56, 43, 1, 61, 43, 1, 40, 43, 41, 53, 51, 43, 1, 56, 39, 49, 43, 57,
10, 1, 44, 53, 56, 1, 58, 46, 43, 1, 45, 53, 42, 57, 1, 49, 52, 53,
61, 1, 21, 0, 57, 54, 43, 39, 49, 1, 58, 46, 47, 57, 1, 47, 52, 1,
46, 59, 52, 45, 43, 56, 1, 44, 53, 56, 1, 40, 56, 43, 39, 42, 6, 1,
52, 53, 58, 1, 47, 52, 1, 58, 46, 47, 56, 57, 58, 1, 44, 53, 56, 1,
56, 43, 60, 43, 52, 45, 43, 8, 0, 0])
#+end_example

Here, we used [[https://pytorch.org/][PyTorch]], a Python-based machine learning framework, to store the encoded data. We then print the first 1000 characters of the saved data to visualize it.
Andrej likes to split his dataset in a 90% - 10% basis. That is, 90% will be used for training data, while the other 10% will be used for validation.
#+name: data_split-train
#+begin_src python :session train
# Let's now split up the data into train and validation sets
n = int(0.9 * len(data)) # first 90% will be train, rest val
train_data = data[:n]
val_data = data[n:]
#+end_src

If the model were trained and evaluated on 100% of the data, we couldn't tell whether it is learning patterns or simply memorizing the dataset; the validation split is what lets us measure this overfitting.
* [[https://www.youtube.com/watch?v=kCc8FmEb1nY&t=867s][Data Loader]]
Now, we'll feed text sequences into the transformer. This is done by sampling chunks of the dataset. The transformer trains on every position within a chunk. The ~+1~ is there because the targets are the same chunk shifted one character to the right. Training on every prefix like this also gets the model used to predicting from contexts of any size, from a single character up to ~block_size~.
#+name: optimizer-train
#+begin_src python :session train
block_size = 8
train_data[: block_size + 1]
x = train_data[:block_size]
y = train_data[1: block_size + 1]
for t in range(block_size):
    context = x[: t + 1]
    target = y[t]
    print(f"when input is {context} the target: {target}")
#+end_src

Every time we feed text to the transformer, it is sent as a batch of multiple chunks, to keep the GPU busy, since GPUs are very efficient at parallel processing. The seed is set to a specific value below so the results are reproducible.
#+name: reproducibility-train
#+begin_src python :session train
torch.manual_seed(1337)
batch_size = 4 # how many independent sequences will we process in parallel?
block_size = 8 # what is the maximum content length for predictions?
#+end_src

#+name: get_batch-train
#+begin_src python :session train
def get_batch(split):
    # generate a small batch of data of inputs x and targets y
    data = train_data if split == "train" else val_data
#+end_src

#+name: get_batch_randint-train
#+begin_src python :session train
ix = torch.randint(len(data) - block_size, (batch_size,))
#+end_src

This line generates ~batch_size~ random integers, so a ~batch_size~ of 4 generates four random starting offsets into the data.
#+name: get_batch_offset-train
#+begin_src python :session train
x = torch.stack([data[i: i + block_size] for i in ix])
#+end_src

Each ~i~ is a random offset; the slice from ~i~ to ~i + block_size~ is one chunk. ~torch.stack~ takes all these one-dimensional tensors and stacks them into rows, like a table.
#+name: get_batch_targets-train
#+begin_src python :session train
y = torch.stack([data[i + 1: i + block_size + 1] for i in ix])
return x, y
#+end_src

This builds the targets (for each position, the character that follows the corresponding sequence in ~x~).
#+name: get_batch_results-train
#+begin_src python :session train :results output
xb, yb = get_batch("train")
print("inputs")
print(xb.shape)
print(xb)
print("targets:")
print(yb.shape)
print(yb)
print("----")

for b in range(batch_size): # batch dimension
    for t in range(block_size): # time dimension
        context = xb[b, : t + 1]
        target = yb[b, t]
        print(f"when input is {context.tolist()} the target: {target}")
#+end_src

* [[https://www.youtube.com/watch?v=kCc8FmEb1nY&t=1331s][Simplest Baseline]]
Now we're going to use the simplest language model available, a bigram model.
#+name: import_nn
#+begin_src python :session train
import torch.nn as nn
from torch.nn import functional as F
#+end_src

#+name: model-train
#+begin_src python :session train
class BigramLanguageModel(nn.Module):
    def __init__(self, vocab_size):
        super().__init__()
        # each token directly reads off the logits for the next token
        # from a lookup table
        self.token_embedding_table = nn.Embedding(vocab_size, vocab_size)
#+end_src

Here, we are creating a lookup table for the model, sized by the vocabulary, where each integer indexes its own row (so 25 goes to row 25, 17 to row 17, and so forth).
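A minimal illustration of that lookup behaviour (an aside, not part of the lecture code): indexing the embedding with a token id simply returns the corresponding row of the table.
#+begin_src python
import torch
import torch.nn as nn

# illustrative: nn.Embedding is a trainable lookup table; token id 25 selects row 25
vocab_size = 65  # assumption: the 65-character Shakespeare vocabulary from above
emb = nn.Embedding(vocab_size, vocab_size)
print(emb(torch.tensor([25])).shape)  # torch.Size([1, 65])
#+end_src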
#+name: forward_v1-train
#+begin_src python :session train
def forward(self, idx, targets):
    # idx and targets are both (B, T) tensor of integers
    logits = self.token_embedding_table(idx)  # (B, T, C)
    return logits
#+end_src

The predictions are produced in this bit. The index ~idx~ is passed to the embedding table, and the output is arranged along a batch ~B~, time ~T~ and channel ~C~ dimension. The logits are the predictions (scores for the next character).
#+name: model_vocab_size-train
#+begin_src python :session train
m = BigramLanguageModel(vocab_size)
out = m(xb, yb)
#+end_src

A good way to measure prediction quality is to calculate the negative log-likelihood of the correct next character. This operation is implemented in PyTorch under the name ~cross_entropy~.
#+name: loss-train
#+begin_src python :session train
loss = F.cross_entropy(logits, targets)
#+end_src

This takes in the predictions (the logits) and the targets (the identity of the next character). As written, this code won't run, because for multidimensional inputs PyTorch expects the channel dimension second (+B, T, C+ \rightarrow B, C, T).
#+name: logits_shape-train
#+begin_src python :session train
B, T, C = logits.shape
logits = logits.view(B * T, C)
targets = targets.view(B * T)
#+end_src

This is how to correct the shape of the logits. We take B and T and couple them together into one dimension, forming a two-dimensional array that fulfils PyTorch's requirements. The same also has to be done to our targets (but, in their case, it's 2D \rightarrow 1D).
And then we print out our results:
#+name: loss_results-train
#+begin_src python :session train :results output
print(logits.shape)
print(loss)
#+end_src

#+RESULTS: loss_results-train
: torch.Size([256, 65])
: tensor(2.4860, grad_fn=)

If the model assigned uniform probability to each of the 65 possible characters, we would expect a loss of \(-\log{\frac{1}{65}} \approx 4.17\).
The loss in the video at this point was higher, ~4.8786~; I got a different result here.
Now, our next step is the generation:
#+name: generate-train
#+begin_src python :session train
def generate(self, idx, max_new_tokens):
    # idx is (B, T) array of indices in the current context
    for _ in range(max_new_tokens):
#+end_src

#+name: generate_loss-train
#+begin_src python :session train
# get the predictions
logits, loss = self(idx)
#+end_src

The loss is also returned here, but it's not actually used during generation.
#+name: generate_idx-train
#+begin_src python :session train
# focus only on the last time step
logits = logits[:, -1, :] # becomes (B, C)
# apply softmax to get probabilities
probs = F.softmax(logits, dim=-1) # (B, C)
# sample from the distribution
idx_next = torch.multinomial(probs, num_samples=1) # (B , 1)
#+end_src

~idx~ is the current context of each sequence in the batch. The job of ~generate~ is to extend it by one sampled token at a time.
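A small illustration of the sampling step (an aside, using made-up probabilities): ~torch.multinomial~ draws an index with probability proportional to each entry.
#+begin_src python
import torch

# illustrative: with these (made-up) probabilities, index 1 is drawn about 70% of the time
probs = torch.tensor([[0.1, 0.7, 0.2]])
print(torch.multinomial(probs, num_samples=1))  # most often tensor([[1]])
#+end_src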
#+name: prediction_results-train
#+begin_src python :session train
# append sampled index to the running sequence
idx = torch.cat((idx, idx_next), dim=1) # (B, T+1)
return idx
#+end_src

In this part, the result of the prediction, ~idx_next~, is concatenated to the previous context, ~idx~, keeping all the text together (NOTE: so ~idx~ FO + ~idx_next~ O = ~idx~ FOO). This is a general approach that is somewhat wasteful here: we feed the model the whole character history, even though a bigram model only ever looks at the previous character. This function will be expanded on later.
This layout would give us an error, so we have to return to the ~forward~ function and set its targets to ~None~, since they aren't defined here.
#+name: forward_v2-train
#+begin_src python :session train
def forward(self, idx, targets=None):
    # idx and targets are both (B, T) tensor of integers
    logits = self.token_embedding_table(idx)  # (B, T, C)

    if targets is None:
        loss = None
    else:
        B, T, C = logits.shape
        logits = logits.view(B * T, C)
        targets = targets.view(B * T)
        loss = F.cross_entropy(logits, targets)

    return logits, loss
#+end_src

This makes it so that, if we actually have targets, the loss is also returned; during generation this won't happen, as ~targets~ defaults to ~None~.
#+name: generate_print-train
#+begin_src python :session train :results output
print(
decode(
m.generate(idx=torch.zeros((1, 1), dtype=torch.long), max_new_tokens=100)[
0
].tolist()
)
)
#+end_src

#+RESULTS: generate_print-train
:
: I ler totel me otche, PERCofonjothir pe et s men:
: ORKIIV: m, y rit I m ay s tathearer ate win pit po

This is the code used for generation. ~tolist()~ converts the PyTorch tensor into a conventional Python list, which ~decode~ turns back into text. This first result is completely nonsensical, because the model hasn't been trained yet.
* [[https://www.youtube.com/watch?v=kCc8FmEb1nY&t=2093s][Training the Bigram Model]]
Now, we're going to actually train our model, to give the output some sense.
We start creating a PyTorch optimizer:
#+name: training-train
#+begin_src python :session train :results output
# create a PyTorch optimizer
optimizer = torch.optim.AdamW(m.parameters(), lr=1e-3)
batch_size = 32
for steps in range(10000):

    # sample a batch of data
    xb, yb = get_batch('train')

    # evaluate the loss
    logits, loss = m(xb, yb)
    optimizer.zero_grad(set_to_none=True)
    loss.backward()
    optimizer.step()

print(loss.item())
#+end_src

#+RESULTS: training-train
: 2.4859519004821777

There are other optimizer options, but [[https://pytorch.org/docs/stable/generated/torch.optim.AdamW.html][AdamW]] was the most popular at the time of the video (<2023-01-17 Tue>). The ~range~ determines the number of iterations. By doing this, we brought our loss down to about 2.5! Now our +decode+ ~generate~ function should produce better output.
* Final Result
#+begin_src python :noweb yes :mkdirp yes :tangle ~/Projects/Code/study/nanoGPT-from-scratch/train.py
<>
<><>
<>
<>
<>
<>
<>
<>
<>
<>
<>
<><>
<>
<>
<>
<>
<>
<><>
<>
<>
<>
#+end_src
* [[https://www.youtube.com/watch?v=kCc8FmEb1nY&t=2280s][Port our Code to a Script]]
#+name: hyperparameters-bigram
#+begin_src python
batch_size = 32 # how many independent sequences will we process in parallel?
block_size = 8 # what is the maximum content length for predictions?
max_itters = 3000
eval_interval = 300
learning_rate = 1e-2
device = 'cuda' if torch.cuda.is_available() else 'cpu'
eval_iters = 200
#+end_src

~device~ here allows the code to run on a GPU, through [[https://developer.nvidia.com/blog/even-easier-introduction-cuda/][CUDA]], when one is available.
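A quick hedged check of what ~device~ does (an aside, not part of the script): tensors and modules have to be moved onto it explicitly.
#+begin_src python
import torch

device = 'cuda' if torch.cuda.is_available() else 'cpu'
x = torch.randn(2, 3).to(device)  # moves the tensor to the GPU when one is available
print(x.device)                   # cuda:0 or cpu
#+end_src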
#+name: reproducibility
#+begin_src python
torch.manual_seed(1337)
#+end_src

#+name: decoding
#+begin_src python
# here are all the unique characters that occur in this text
chars = sorted(list(set(text)))
vocab_size = len(chars)
# create a mapping from characters to integers
stoi = {ch: i for i, ch in enumerate(chars)}
itos = {i: ch for i, ch in enumerate(chars)}
encode = lambda s: [
    stoi[c] for c in s
]  # encoder: take a string, output a list of integers
decode = lambda l: "".join(
    [itos[i] for i in l]
)  # decoder: take a list of integers, output a string
#+end_src

#+name: data_split
#+begin_src python
data = torch.tensor(encode(text), dtype=torch.long)
n = int(0.9 * len(data)) # first 90% will be train, rest val
train_data = data[:n]
val_data = data[n:]
#+end_src

#+name: get_batch
#+begin_src python
def get_batch(split):
    # generate a small batch of data of inputs x and targets y
    data = train_data if split == "train" else val_data
    ix = torch.randint(len(data) - block_size, (batch_size,))
    x = torch.stack([data[i: i + block_size] for i in ix])
    y = torch.stack([data[i + 1: i + block_size + 1] for i in ix])
    return x, y
#+end_src

#+name: estimate_loss
#+begin_src python
@torch.no_grad()
def estimate_loss():
    out = {}
    model.eval()
    for split in ['train', 'val']:
        losses = torch.zeros(eval_iters)
        for k in range(eval_iters):
            X, Y = get_batch(split)
            logits, loss = model(X, Y)
            losses[k] = loss.item()
        out[split] = losses.mean()
    model.train()
    return out
#+end_src

This chunk averages the loss over many batches, giving a much less noisy estimate than a single batch. The model is switched to evaluation mode and then back to training mode here, which currently doesn't change anything, because no mode-specific layers have been defined in the model yet. It is still good practice to keep track of which mode the model is in, because some layers behave differently per mode and could give unintended results. ~@torch.no_grad()~ is a decorator (also usable as a context manager) that tells PyTorch *not* to track gradients for anything inside this function, so ~.backward()~ will never be called on these results and memory is saved.
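A small illustration of what ~@torch.no_grad()~ changes (an aside, not part of the script):
#+begin_src python
import torch

w = torch.randn(3, requires_grad=True)
with torch.no_grad():
    y = (w * 2).sum()   # no computation graph is recorded inside this block
print(y.requires_grad)  # False, so y.backward() would raise an error
#+end_src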
#+name: model-bigram
#+begin_src python
# super simple bigram model
class BigramLanguageModel(nn.Module):

    def __init__(self, vocab_size):
        super().__init__()
        # each token directly reads off the logits for the next token
        # from a lookup table
        self.token_embedding_table = nn.Embedding(vocab_size, vocab_size)

    def forward(self, idx, targets=None):
        # idx and targets are both (B, T) tensor of integers
        logits = self.token_embedding_table(idx)  # (B, T, C)

        if targets is None:
            loss = None
        else:
            B, T, C = logits.shape
            logits = logits.view(B * T, C)
            targets = targets.view(B * T)
            loss = F.cross_entropy(logits, targets)

        return logits, loss

    def generate(self, idx, max_new_tokens):
        # idx is (B, T) array of indices in the current context
        for _ in range(max_new_tokens):
            # get the predictions
            logits, loss = self(idx)
            # focus only on the last time step
            logits = logits[:, -1, :] # becomes (B, C)
            # apply softmax to get probabilities
            probs = F.softmax(logits, dim=-1) # (B, C)
            # sample from the distribution
            idx_next = torch.multinomial(probs, num_samples=1) # (B , 1)
            # append sampled index to the running sequence
            idx = torch.cat((idx, idx_next), dim=1) # (B, T+1)
        return idx

model = BigramLanguageModel(vocab_size)
m = model.to(device)
#+end_src

#+name: optimizer-bigram
#+begin_src python
optimizer = torch.optim.AdamW(m.parameters(), lr=1e-3)
#+end_src

#+name: training-bigram
#+begin_src python
for iter in range(max_itters):

    # every once in a while evaluate the loss on train and val sets
    if iter % eval_interval == 0:
        losses = estimate_loss()
        print(f"step {iter}: train loss {losses['train']:.4f}, val loss {losses['val']:.4f}")

    # sample a batch of data
    xb, yb = get_batch('train')

    # evaluate the loss
    logits, loss = model(xb, yb)
    optimizer.zero_grad(set_to_none=True)
    loss.backward()
    optimizer.step()
#+end_src

#+name: context-bigram
#+begin_src python
context = torch.zeros((1, 1), dtype=torch.long, device=device)
print(decode(m.generate(context, max_new_tokens=500)[0].tolist()))
#+end_src

~device~ here allows us to pass the context to the device (GPU) if it is available.
** Final Result
#+begin_src python :session bigram :noweb yes :results output
<>
<><>
<>
<>
<>
<>
<>
<>
<>
<>
<>
<>
#+end_src

* Building Self-Attention
Now, we'll get to the main point of the video and the paper[fn:1]: self-attention.
** [[https://www.youtube.com/watch?v=kCc8FmEb1nY&t=2533s][Averaging Past Context with For Loops]]
Before just jumping into the topic of self-attention, Andrej shows us a simple but essential part of an efficient self-attention implementation, which is token-to-token communication, /excluding/ any connection with following tokens (NOTE: so in ~FOOBAR~, ~B~ would have ~FOO~ as context, not ~BAR~). Information should only flow *from the past to the present*.
The simplest way to do this, introduced here, is to average the current token with all the previous ones. This throws away a lot of information, but it will be improved on later.
#+begin_src python :session bigram :results output
torch.manual_seed(1337)
B, T, C = 4, 8, 2 # batch, time, channels
x = torch.randn(B, T, C)
x.shape  # torch.Size([4, 8, 2])

# We want x[b, t] = mean_{i<=t} x[b, i]
xbow = torch.zeros((B, T, C))
for b in range(B):
    for t in range(T):
        xprev = x[b, :t+1] # (t, C)
        xbow[b, t] = torch.mean(xprev, 0)
#+end_src

~bow~ here means /Bag of Words/ (just bundling the tokens together). This is a simple for loop, so it's not that efficient. It iterates over the batch, then the time (tokens), and collects the current token and the previous ones. These are then passed to ~torch.mean~, which does the averaging.
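Written out, the quantity the loop computes for every batch ~b~ and time step ~t~ is:
\[\mathrm{xbow}[b, t] = \frac{1}{t + 1} \sum_{i = 0}^{t} x[b, i]\]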
This process can be optimized using matrix multiplication:
#+begin_src python :session bigram :results output
torch.manual_seed(42)
a = torch.ones(3, 3)
b = torch.randint(0, 10, (3, 2)).float()
c = a @ b
print('a=')
print(a)
print('--')
print('b=')
print(b)
print('--')
print('c=')
print(c)
#+end_src

#+RESULTS:
#+begin_example
a=
tensor([[1., 1., 1.],
[1., 1., 1.],
[1., 1., 1.]])
--
b=
tensor([[2., 7.],
[6., 4.],
[6., 5.]])
--
c=
tensor([[14., 16.],
[14., 16.],
[14., 16.]])
#+end_example

~c~ is the result of matrix ~a~ multiplied by matrix ~b~. Since ~a~ is just a matrix of 1s, each row of ~c~ is simply the sum of the columns of ~b~ (2 + 6 + 6 = 14).
PyTorch has a function called ~tril~ (~torch.tril~), which takes a matrix and keeps only its lower triangular part, zeroing out everything above the diagonal. This can be used to average out the values of the rows.
#+begin_src python :session bigram :results output
torch.manual_seed(42)
a = torch.tril(torch.ones(3, 3))
a = a / torch.sum(a, 1, keepdim=True)
b = torch.randint(0, 10, (3, 2)).float()
c = a @ b
print('a=')
print(a)
print('--')
print('b=')
print(b)
print('--')
print('c=')
print(c)
#+end_src

#+RESULTS:
#+begin_example
a=
tensor([[1.0000, 0.0000, 0.0000],
[0.5000, 0.5000, 0.0000],
[0.3333, 0.3333, 0.3333]])
--
b=
tensor([[2., 7.],
[6., 4.],
[6., 5.]])
--
c=
tensor([[2.0000, 7.0000],
[4.0000, 5.5000],
[4.6667, 5.3333]])
#+end_example

Now, the second row of ~c~ is the average of the first and second rows of ~b~, and its third row is the average of all the rows of ~b~.
** [[https://www.youtube.com/watch?v=kCc8FmEb1nY&t=3114s][Matrix Multiplication]]
After learning this, we can return to our for loop and optimize it:
#+begin_src python :session bigram :results output
# We want x[b, t] = mean_{i<=t} x[b, i]
xbow = torch.zeros((B, T, C))
for b in range(B):
    for t in range(T):
        xprev = x[b, :t+1] # (t, C)
        xbow[b, t] = torch.mean(xprev, 0)
wei = torch.tril(torch.ones(T, T))
wei = wei / wei.sum(1, keepdim=True)
print(wei)
#+end_src

#+RESULTS:
: tensor([[1.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000],
: [0.5000, 0.5000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000],
: [0.3333, 0.3333, 0.3333, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000],
: [0.2500, 0.2500, 0.2500, 0.2500, 0.0000, 0.0000, 0.0000, 0.0000],
: [0.2000, 0.2000, 0.2000, 0.2000, 0.2000, 0.0000, 0.0000, 0.0000],
: [0.1667, 0.1667, 0.1667, 0.1667, 0.1667, 0.1667, 0.0000, 0.0000],
: [0.1429, 0.1429, 0.1429, 0.1429, 0.1429, 0.1429, 0.1429, 0.0000],
: [0.1250, 0.1250, 0.1250, 0.1250, 0.1250, 0.1250, 0.1250, 0.1250]])

Within each row the values are equal and sum to 1, so multiplying by ~wei~ averages exactly that many preceding positions. This lets us produce an identical tensor while replacing the nested loop. (NOTE: You may notice that I removed ~torch.allclose~ here. That's because it was printing ~False~, but I verified the results matched by inspecting ~xbow[0]~ and ~xbow2[0]~, just like in the video.)
#+begin_src python :session bigram :results output
# We want x[b, t] = mean_{i<=t} x[b, i]
xbow = torch.zeros((B, T, C))
for b in range(B):
    for t in range(T):
        xprev = x[b, :t+1] # (t, C)
        xbow[b, t] = torch.mean(xprev, 0)
wei = torch.tril(torch.ones(T, T))
wei = wei / wei.sum(1, keepdim=True)
xbow2 = wei @ x # (T, T) @ (B, T, C) -> (B, T, T) @ (B, T, C) = (B, T, C)
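
# illustrative check (not in the video): both versions should differ only by
# floating-point rounding, so the largest absolute difference is tiny and a
# loose tolerance reports them as equal
print((xbow - xbow2).abs().max())
print(torch.allclose(xbow, xbow2, atol=1e-6))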
#+end_src

** [[https://www.youtube.com/watch?v=kCc8FmEb1nY&t=3282s][softmax]]
:PROPERTIES:
:CUSTOM_ID: softmax
:END:

Now we'll see a different way to build the same weight matrix, using ~softmax~ instead.
#+begin_src python :session bigram :results output
tril = torch.tril(torch.ones(T, T))
wei = torch.zeros((T, T))
wei = wei.masked_fill(tril == 0, float('-inf'))
wei = F.softmax(wei, dim=-1)
xbow3 = wei @ x
#+end_src

~masked_fill~ replaces every position where ~tril~ is 0 (the future positions) with negative infinity. ~softmax~ then normalizes each row, turning those positions into probability 0 and the rest into the same averaging weights as before, so ~xbow3~ is identical to the previous results.
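A tiny hedged illustration of why the ~-inf~ trick works: ~softmax~ maps ~-inf~ entries to probability 0 and shares the rest of the row among the visible positions.
#+begin_src python
import torch
from torch.nn import functional as F

# illustrative: the masked (future) position gets probability 0,
# the visible positions share the rest of the row equally
row = torch.tensor([0.0, 0.0, float('-inf')])
print(F.softmax(row, dim=-1))  # tensor([0.5000, 0.5000, 0.0000])
#+end_src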
** [[https://www.youtube.com/watch?v=kCc8FmEb1nY&t=3506s][Code Cleanup]]
Now we can go back and clean up some of the code. First, there's no need to pass the ~vocab_size~ to the constructor, because it is set as a global variable.
#+name: model_constructor_global_vocab_size-bigram
#+begin_src python :session bigram
model = BigramLanguageModel()
m = model.to(device)
#+end_src

We'll also introduce a new variable, ~n_embd~, for the number of embedding dimensions.
#+begin_src python
...
device = 'cuda' if torch.cuda.is_available() else 'cpu'
eval_iters = 200
n_embd = 32
#+end_src

#+name: init_n_embd-bigram
#+begin_src python :session bigram
# super simple bigram model
class BigramLanguageModel(nn.Module):

    def __init__(self):
        super().__init__()
        # each token directly reads off the logits for the next token
        # from a lookup table
        self.token_embedding_table = nn.Embedding(vocab_size, n_embd)
#+end_src

The embedding table no longer produces logits directly, but token embeddings, so we'll call the result ~tok_emb~ instead.
#+name: forward_tok_emb-bigram
#+begin_src python :session bigram
# idx and targets are both (B, T) tensor of integers
tok_emb = self.token_embedding_table(idx) # (B, T ,C)
#+end_src

And then, to actually get the logits, we'll need a linear layer:
#+name: init_linear_layer-bigram
#+begin_src python
self.lm_head = nn.Linear(n_embd, vocab_size)
#+end_src

Back to the forward function:
#+name: init_logits-bigram
#+begin_src python
logits = self.lm_head(tok_emb) # (B, T, vocab_size)
#+end_src

#+name: model_new_body-bigram
#+begin_src python :session bigram
if targets is None:
    loss = None
else:
    B, T, C = logits.shape
    logits = logits.view(B * T, C)
    targets = targets.view(B * T)
    loss = F.cross_entropy(logits, targets)

return logits, loss

def generate(self, idx, max_new_tokens):
    # idx is (B, T) array of indices in the current context
    for _ in range(max_new_tokens):
        # get the predictions
        logits, loss = self(idx)
        # focus only on the last time step
        logits = logits[:, -1, :] # becomes (B, C)
        # apply softmax to get probabilities
        probs = F.softmax(logits, dim=-1) # (B, C)
        # sample from the distribution
        idx_next = torch.multinomial(probs, num_samples=1) # (B , 1)
        # append sampled index to the running sequence
        idx = torch.cat((idx, idx_next), dim=1) # (B, T+1)
    return idx
#+end_src

We'll also add a position embedding table:
#+name: init_position_embedding_table-bigram
#+begin_src python :session bigram
self.position_embedding_table = nn.Embedding(block_size, n_embd)
#+end_src

Then we unpack ~B, T~ from ~idx.shape~ so that ~x~ can hold the token embeddings *and* their positions:
#+name: forward_decode-bigram
#+begin_src python :session bigram
def forward(self, idx, targets=None):
    B, T = idx.shape

    # idx and targets are both (B, T) tensor of integers
    tok_emb = self.token_embedding_table(idx)  # (B, T, C)
    pos_emb = self.position_embedding_table(torch.arange(T, device=device)) # (T, C)
    x = tok_emb + pos_emb # (B, T, C)
    logits = self.lm_head(x) # (B, T, vocab_size)
#+end_src

#+begin_src python :session bigram :noweb yes
<>
<><>
n_embd = 32<>
<>
<>
<>
<>
<>
<>
<>
<><>
<>
<><>
<>
<>
#+end_src

** [[https://www.youtube.com/watch?v=kCc8FmEb1nY&t=3720s][Self-Attention]]
We'll now start our implementation of self-attention.
#+begin_src python :session bigram
torch.manual_seed(1337)
B, T, C = 4, 8, 32
x = torch.randn(B, T, C)

tril = torch.tril(torch.ones(T, T))
wei = torch.zeros((T, T))
wei = wei.masked_fill(tril == 0, float('-inf'))
wei = F.softmax(wei, dim=-1)
out = wei @ x

out.shape
#+end_src

#+RESULTS:
: torch.Size([4, 8, 32])

We increased the number of channels and combined the previous concepts (like [[#softmax][softmax]]) to average out the information from past tokens.
Self-attention is used to gather context from past tokens, and it does this by having every token emit two vectors, a *query* ("what am I looking for?") and a *key* ("what do I contain?"). The affinity between tokens is obtained by taking the *dot product* between queries and keys. (NOTE: higher = more affinity)
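For reference, the scaled dot-product attention formula from the paper[fn:1], which the head implementation below builds up to (the \(\sqrt{d_k}\) scaling appears there as ~C**-0.5~):
\[\operatorname{Attention}(Q, K, V) = \operatorname{softmax}\!\left(\frac{Q K^{\top}}{\sqrt{d_k}}\right) V\]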
Now, we'll have a single head perform self-attention:
#+begin_src python
head_size = 16
key = nn.Linear(C, head_size, bias=False)
query = nn.Linear(C, head_size, bias=False)
k = key(x) # (B, T, 16)
q = query(x) # (B, T, 16)
wei = q @ k.transpose(-2, -1) # (B, T, 16) @ (B, 16, T) -> (B, T, T)

tril = torch.tril(torch.ones(T, T))
wei = wei.masked_fill(tril == 0, float('-inf'))
wei = F.softmax(wei, dim=-1)
out = wei @ x

out.shape
#+end_src

*** Notes
- [[https://www.youtube.com/watch?v=kCc8FmEb1nY&t=4298s][Communication]] ::
Attention is a communication mechanism for information. It aggregates information from the weighted sum of the data.
- [[https://www.youtube.com/watch?v=kCc8FmEb1nY&t=4366s][No Notion of Space]] ::
Attention, at least in this implementation, has no notion of space. It simply acts over a set of vectors. That's why it's necessary to positionally encode the tokens.
- [[https://www.youtube.com/watch?v=kCc8FmEb1nY&t=4420s][No Batch Communication]] ::
The elements on the batch dimension never communicate with each other.
- [[https://www.youtube.com/watch?v=kCc8FmEb1nY&t=4454s][Encoder vs. Decoder]] ::
In some cases where the full context is important, all nodes are allowed to communicate with each other; that is an *encoder* block. If following this implementation, just delete the masked fill and the nodes will be allowed to communicate freely. What we built here is a *decoder* block: the triangular mask hides future tokens so the model has to predict them itself.
- [[https://www.youtube.com/watch?v=kCc8FmEb1nY&t=4539s][Attention vs. Self-Attention vs. Cross-Attention]] ::
- Attention ::
Variables can get their value from *different* sources.
- Self-Attention ::
Variables get their value from the *same* source.
- Cross-Attention ::
Variables get their value from a *separate* source of nodes.
* Building the Transformer
** [[https://www.youtube.com/watch?v=kCc8FmEb1nY&t=4751s][Introducing Self-Attention to our Network]]
#+name: head-bigram
#+begin_src python :session bigram
class Head(nn.Module):
    """ one head of self-attention """

    def __init__(self, head_size):
        super().__init__()
        self.key = nn.Linear(n_embd, head_size, bias=False)
        self.query = nn.Linear(n_embd, head_size, bias=False)
        self.value = nn.Linear(n_embd, head_size, bias=False)
        self.register_buffer('tril', torch.tril(torch.ones(block_size, block_size)))

    def forward(self, x):
        B, T, C = x.shape
        k = self.key(x) # (B, T, C)
        q = self.query(x) # (B, T, C)
        # compute attention scores ("affinities")
        wei = q @ k.transpose(-2, -1) * C**-0.5 # (B, T, C) @ (B, C, T) -> (B, T, T)
        wei = wei.masked_fill(self.tril[:T, :T] == 0, float('-inf')) # (B, T, T)
        wei = F.softmax(wei, dim=-1) # (B, T, T)
        # perform the weighted aggregation of the values
        v = self.value(x) # (B, T, C)
        out = wei @ v # (B, T, T) @ (B, T, C) -> (B, T, C)
        return out
#+end_src

#+name: model_self_attention-bigram
#+begin_src python :session bigram
class BigramLanguageModel(nn.Module):

    def __init__(self, vocab_size):
        super().__init__()
        # each token directly reads off the logits for the next token
        # from a lookup table
        self.token_embedding_table = nn.Embedding(vocab_size, n_embd)
        self.position_embedding_table = nn.Embedding(block_size, n_embd)
        self.sa_head = Head(head_size=n_embd)
        self.lm_head = nn.Linear(n_embd, vocab_size)

    def forward(self, idx, targets=None):
        B, T = idx.shape

        # idx and targets are both (B, T) tensor of integers
        tok_emb = self.token_embedding_table(idx) # (B, T, C)
        pos_emb = self.position_embedding_table(torch.arange(T, device=device)) # (T, C)
        x = tok_emb + pos_emb # (B, T, C)
        x = self.sa_head(x) # apply one head of self-attention (B, T, C)
        logits = self.lm_head(x) # (B, T, vocab_size)

        if targets is None:
            loss = None
        else:
            B, T, C = logits.shape
            logits = logits.view(B * T, C)
            targets = targets.view(B * T)
            loss = F.cross_entropy(logits, targets)

        return logits, loss

    def generate(self, idx, max_new_tokens):
        # idx is (B, T) array of indices in the current context
        for _ in range(max_new_tokens):
            # crop idx to the last block_size tokens
            idx_cond = idx[:, -block_size:]
            # get the predictions
            logits, loss = self(idx_cond)
            # focus only on the last time step
            logits = logits[:, -1, :] # becomes (B, C)
            # apply softmax to get probabilities
            probs = F.softmax(logits, dim=-1) # (B, C)
            # sample from the distribution
            idx_next = torch.multinomial(probs, num_samples=1) # (B , 1)
            # append sampled index to the running sequence
            idx = torch.cat((idx, idx_next), dim=1) # (B, T+1)
        return idx

model = BigramLanguageModel(vocab_size)
m = model.to(device)
#+end_src

#+begin_src python :session bigram :noweb yes
<>
<><>
n_embd = 32
<>
<>
<>
<>
<>
<>
<>
<>
<>
<>
<>
#+end_src

#+RESULTS:
#+begin_example
step 0: train loss 4.2000, val loss 4.2047
step 300: train loss 2.9418, val loss 2.9574
step 600: train loss 2.6309, val loss 2.6456
step 900: train loss 2.5385, val loss 2.5451
step 1200: train loss 2.5040, val loss 2.5033
step 1500: train loss 2.4695, val loss 2.4808
step 1800: train loss 2.4520, val loss 2.4641
step 2100: train loss 2.4385, val loss 2.4422
step 2400: train loss 2.4241, val loss 2.4419
step 2700: train loss 2.4122, val loss 2.4340
step 3000: train loss 2.4227, val loss 2.4241
step 3300: train loss 2.4159, val loss 2.4231
step 3600: train loss 2.4055, val loss 2.4177
step 3900: train loss 2.4023, val loss 2.4179
step 4200: train loss 2.3958, val loss 2.4302
step 4500: train loss 2.3888, val loss 2.4024
step 4800: train loss 2.3838, val loss 2.4043
ARI
HAThan, thivet'ly meirar ysuch, fo thest
K
CAV:
HEDIERESTAD Pfr:
BENENENLING CENNUCINA IO: yond houstoruloubleles, be sordweloe yot bed thed Ond Doue bou
His, to I lllat;
Kind yo Icad imoralceng, wpatsl my peer furs bust hasthe sor merals of ss whacer gaers, ver: biolllay seirost fat:
Thones som, wer ran
Toupls om?
Sll: hy athithos benciet bs fof.
Amathalcbenou liles caly, detave shpreearg
Whe wad wolf sd it totosthe, wank th in thit shin
S ht ande conit iul cert gpapam oun llkeavilint hmy r
#+end_example

Still no coherent output, but our loss is down.
** [[https://www.youtube.com/watch?v=kCc8FmEb1nY&t=4919s][Multi-headed Self-Attention]]
Basically consists of applying multiple heads of self-attention, and concatenating their results.
#+begin_src python
class MultiHeadAttention(nn.Module):
    """multiple heads of self-attention in parallel"""

    def __init__(self, num_heads, head_size):
        super().__init__()
        self.heads = nn.ModuleList([Head(head_size) for _ in range(num_heads)])

    def forward(self, x):
        return torch.cat([h(x) for h in self.heads], dim=-1)
#+end_src

This can be done by simply creating multiple heads, keeping them in an ~nn.ModuleList~, running the input through each one and concatenating the results along the channel dimension.
Now, we go back to our model and add our new attention method:
#+begin_src python
self.sa_heads = MultiHeadAttention(4, n_embd//4) # 4 heads of 8-dimensional self-attention
#+end_src

** [[https://www.youtube.com/watch?v=kCc8FmEb1nY&t=5065s][Feed Forward Layers]]
In a similar fashion to what is presented in the paper (a multi-layer perceptron), we'll add per-token computation to the network.
#+begin_src python
class FeedForward(nn.Module):
    """a simple linear layer followed by a non-linearity"""

    def __init__(self, n_embd):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_embd, n_embd),
            nn.ReLU(),
        )

    def forward(self, x):
        return self.net(x)
#+end_src

** [[https://www.youtube.com/watch?v=kCc8FmEb1nY&t=5208s][Residual Connections]]
Our next step is to connect the communication with the computation.
#+begin_src python
class Block(nn.Module):
    """Transformer block: communication followed by computation"""

    def __init__(self, n_embd, n_head):
        # n_embd: embedding dimension, n_head: the number of heads we'd like
        super().__init__()
        head_size = n_embd // n_head
        self.sa = MultiHeadAttention(n_head, head_size)
        self.ffwd = FeedForward(n_embd)

    def forward(self, x):
        x = self.sa(x)
        x = self.ffwd(x)
        return x
#+end_src

With our new class, we can introduce this communication-followed-by-computation process to our model:
#+begin_src python
        self.position_embedding_table = nn.Embedding(block_size, n_embd)
        self.blocks = nn.Sequential(
            Block(n_embd, n_head=4),
            Block(n_embd, n_head=4),
            Block(n_embd, n_head=4),
        )
        self.lm_head = nn.Linear(n_embd, vocab_size)

    def forward(self, idx, targets=None):
        ...
        x = tok_emb + pos_emb # (B, T, C)
        x = self.blocks(x) # (B, T, C)
#+end_src

This is starting to get into deep neural network territory, so, following the paper, we have to optimize further to keep getting good results. From the paper "Deep Residual Learning for Image Recognition"[fn:5], we introduce residual connections, which give the data two pathways: one through the computation and one that skips around it. At first the skip path dominates and gradients flow straight through it back to the input; the residual blocks gradually contribute more over the course of training, which helps optimization.
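In equations, each block's forward pass with the skip connections becomes:
\[x \leftarrow x + \operatorname{SelfAttention}(x), \qquad x \leftarrow x + \operatorname{FeedForward}(x)\]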
#+begin_src python
class MultiHeadAttention(nn.Module):
    ...

    def __init__(self, num_heads, head_size):
        ...
        self.proj = nn.Linear(n_embd, n_embd)

    def forward(self, x):
        out = torch.cat([h(x) for h in self.heads], dim=-1) # (B, T, C)
        out = self.proj(out)
        return out
#+end_src

#+name: FeedForward-bigram
#+begin_src python
class FeedForward(nn.Module):
    """a simple linear layer followed by a non-linearity"""

    def __init__(self, n_embd):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_embd, 4 * n_embd),
            nn.ReLU(),
            nn.Linear(4 * n_embd, n_embd),
        )

    def forward(self, x):
        return self.net(x)
#+end_src

#+begin_src python
class Block(nn.Module):
    ...

    def forward(self, x):
        x = x + self.sa(x)
        x = x + self.ffwd(x)
        return x
#+end_src

** [[https://www.youtube.com/watch?v=kCc8FmEb1nY&t=5571s][LayerNorm]]
An additional optimization strategy is layer normalization[fn:6].
#+begin_src python
class BatchNorm1d:

    def __init__(self, dim, eps=1e-5, momentum=0.1):
        self.eps = eps
        # parameters (trained with backprop)
        self.gamma = torch.ones(dim)
        self.beta = torch.zeros(dim)

    def __call__(self, x):
        # calculate the forward pass
        xmean = x.mean(1, keepdim=True) # mean over the last dimension (per row)
        xvar = x.var(1, keepdim=True) # variance over the last dimension (per row)
        xhat = (x - xmean) / torch.sqrt(xvar + self.eps) # normalize to unit variance
        self.out = self.gamma * xhat + self.beta
        return self.out

    def parameters(self):
        return [self.gamma, self.beta]

torch.manual_seed(1337)
module = BatchNorm1d(100)
x = torch.randn(32, 100) # batch_size 32 of 100-dimensional vectors
x = module(x)
x.shape
#+end_src

Because we normalize over dimension 1 (each row, i.e. each example) instead of over the batch, this behaves as layer normalization. We also deviate from the paper by applying the normalization *before* the transformation (~FeedForward~), the "pre-norm" formulation, rather than after it.
* Final Result
This is the final result, after following the video. For an updated implementation, see nanoGPT[fn:8].
#+begin_src python :session final :noweb yes :mkdirp yes :tangle ~/Projects/Code/study/nanoGPT-from-scratch/bigram.py
<>
<>
# hyperparameters
batch_size = 64 # how many independent sequences will we process in parallel?
block_size = 256 # what is the maximum content length for predictions?
max_itters = 5000
eval_interval = 500
learning_rate = 3e-4
device = 'cuda' if torch.cuda.is_available() else 'cpu'
eval_iters = 200
n_embd = 384 # 384/6 = 64
n_head = 6 # Every head has 64 dimensions
n_layer = 6
dropout = 0.2
# ----------------
#+end_src

Dropout[fn:7] randomly shuts off a fraction of the neurons on every training pass, training the model without them. This simulates training an ensemble of sub-networks; at test time every neuron is enabled again. It was added here because the model is being scaled up considerably, and dropout helps regularize it.
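A small hedged illustration of the two modes (an aside, not part of the script): dropout zeroes a random subset of activations in training mode and is a no-op in evaluation mode.
#+begin_src python
import torch
import torch.nn as nn

drop = nn.Dropout(0.2)
x = torch.ones(8)
drop.train()
print(drop(x))  # roughly 20% of entries zeroed, the rest scaled by 1 / (1 - 0.2) = 1.25
drop.eval()
print(drop(x))  # unchanged: dropout does nothing at evaluation/test time
#+end_src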
#+begin_src python :session final :noweb yes :mkdirp yes :tangle ~/Projects/Code/study/nanoGPT-from-scratch/bigram.py :results output file :file output.org
<>
<>
<>
<>
<>
<>
class Head(nn.Module):
    """ one head of self-attention """

    def __init__(self, head_size):
        super().__init__()
        self.key = nn.Linear(n_embd, head_size, bias=False)
        self.query = nn.Linear(n_embd, head_size, bias=False)
        self.value = nn.Linear(n_embd, head_size, bias=False)
        self.register_buffer('tril', torch.tril(torch.ones(block_size, block_size)))
        self.dropout = nn.Dropout(dropout)

    def forward(self, x):
        B, T, C = x.shape
        k = self.key(x) # (B, T, C)
        q = self.query(x) # (B, T, C)
        # compute attention scores ("affinities")
        wei = q @ k.transpose(-2, -1) * C**-0.5 # (B, T, C) @ (B, C, T) -> (B, T, T)
        wei = wei.masked_fill(self.tril[:T, :T] == 0, float('-inf')) # (B, T, T)
        wei = F.softmax(wei, dim=-1) # (B, T, T)
        wei = self.dropout(wei)
        # perform the weighted aggregation of the values
        v = self.value(x) # (B, T, C)
        out = wei @ v # (B, T, T) @ (B, T, C) -> (B, T, C)
        return out

class MultiHeadAttention(nn.Module):
    """ multiple heads of self-attention in parallel """

    def __init__(self, num_heads, head_size):
        super().__init__()
        self.heads = nn.ModuleList([Head(head_size) for _ in range(num_heads)])
        self.proj = nn.Linear(n_embd, n_embd)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x):
        out = torch.cat([h(x) for h in self.heads], dim=-1)
        out = self.dropout(self.proj(out))
        return out

class FeedForward(nn.Module):
    """ a simple linear layer followed by a non-linearity """

    def __init__(self, n_embd):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_embd, 4 * n_embd),
            nn.ReLU(),
            nn.Linear(4 * n_embd, n_embd),
            nn.Dropout(dropout),
        )

    def forward(self, x):
        return self.net(x)

class Block(nn.Module):
    """Transformer block: communication followed by computation"""

    def __init__(self, n_embd, n_head):
        # n_embd: embedding dimension, n_head: the number of heads we'd like
        super().__init__()
        head_size = n_embd // n_head
        self.sa = MultiHeadAttention(n_head, head_size)
        self.ffwd = FeedForward(n_embd)
        self.ln1 = nn.LayerNorm(n_embd)
        self.ln2 = nn.LayerNorm(n_embd)

    def forward(self, x):
        # residual connections around both sub-layers, with pre-norm
        x = x + self.sa(self.ln1(x))
        x = x + self.ffwd(self.ln2(x))
        return x

class BigramLanguageModel(nn.Module):

    def __init__(self, vocab_size):
        super().__init__()
        # each token directly reads off the logits for the next token
        # from a lookup table
        self.token_embedding_table = nn.Embedding(vocab_size, n_embd)
        self.position_embedding_table = nn.Embedding(block_size, n_embd)
        self.blocks = nn.Sequential(*[Block(n_embd, n_head=n_head) for _ in range(n_layer)])
        self.ln_f = nn.LayerNorm(n_embd) # final layer normalization
        self.lm_head = nn.Linear(n_embd, vocab_size)

    def forward(self, idx, targets=None):
        B, T = idx.shape

        # idx and targets are both (B, T) tensor of integers
        tok_emb = self.token_embedding_table(idx) # (B, T, C)
        pos_emb = self.position_embedding_table(torch.arange(T, device=device)) # (T, C)
        x = tok_emb + pos_emb # (B, T, C)
        x = self.blocks(x) # (B, T, C)
        x = self.ln_f(x) # (B, T, C)
        logits = self.lm_head(x) # (B, T, vocab_size)

        if targets is None:
            loss = None
        else:
            B, T, C = logits.shape
            logits = logits.view(B * T, C)
            targets = targets.view(B * T)
            loss = F.cross_entropy(logits, targets)

        return logits, loss

    def generate(self, idx, max_new_tokens):
        # idx is (B, T) array of indices in the current context
        for _ in range(max_new_tokens):
            # crop idx to the last block_size tokens
            idx_cond = idx[:, -block_size:]
            # get the predictions
            logits, loss = self(idx_cond)
            # focus only on the last time step
            logits = logits[:, -1, :] # becomes (B, C)
            # apply softmax to get probabilities
            probs = F.softmax(logits, dim=-1) # (B, C)
            # sample from the distribution
            idx_next = torch.multinomial(probs, num_samples=1) # (B , 1)
            # append sampled index to the running sequence
            idx = torch.cat((idx, idx_next), dim=1) # (B, T+1)
        return idx

model = BigramLanguageModel(vocab_size)
m = model.to(device)

optimizer = torch.optim.AdamW(m.parameters(), lr=learning_rate)

for iter in range(max_itters):

    # every once in a while evaluate the loss on train and val sets
    if iter % eval_interval == 0 or iter == max_itters - 1:
        losses = estimate_loss()
        print(f"step {iter}: train loss {losses['train']:.4f}, val loss {losses['val']:.4f}")

    # sample a batch of data
    xb, yb = get_batch('train')

    # evaluate the loss
    logits, loss = model(xb, yb)
    optimizer.zero_grad(set_to_none=True)
    loss.backward()
    optimizer.step()

context = torch.zeros((1, 1), dtype=torch.long, device=device)
print(decode(m.generate(context, max_new_tokens=500)[0].tolist()))
#+end_src

* Notes
** [[https://www.youtube.com/watch?v=kCc8FmEb1nY&t=6159s][Encoder vs. Decoder vs. Transformer]]
What was actually implemented in this lecture is a decoder-only transformer. This is because we are only *generating* text unconditionally (continuously generating in the style of the dataset). What makes it a decoder is the triangular mask we pass in the transformer, while encoders pass no mask, so that all tokens can communicate freely (remember, decoding uses masking to hide future tokens). Transformers with both block types condition the decoder on the encoder's output and start and end decoding with special tokens.
** [[https://www.youtube.com/watch?v=kCc8FmEb1nY&t=6533s][OpenAI GPT]]
Our dataset has about 1 million *tokens*, as that's roughly its number of characters and we tokenize per character. This is different from how OpenAI measures things, since their tokens are based on sub-words instead. Measured with OpenAI's vocabulary, the same dataset would be roughly 300 thousand tokens.
* References
[fn:1] Vaswani, A. et al. (2023) Attention is all you need. Available at: https://doi.org/10.48550/arXiv.1706.03762 (Accessed: November 11, 2024).
[fn:2] Brown, T.B. et al. (2020) Language models are few-shot learners. Available at: https://doi.org/10.48550/arXiv.2005.14165 (Accessed: November 11, 2024).
[fn:3] “Google/sentencepiece” (2024). Google. Available at: https://github.com/google/sentencepiece (Accessed: November 12, 2024).
[fn:4] “Openai/tiktoken” (2024). OpenAI. Available at: https://github.com/openai/tiktoken (Accessed: November 12, 2024).
[fn:5] He, K. et al. (2015) Deep residual learning for image recognition. Available at: https://doi.org/10.48550/arXiv.1512.03385 (Accessed: November 11, 2024).
[fn:6] Ba, J.L., Kiros, J.R. and Hinton, G.E. (2016) Layer normalization. Available at: https://doi.org/10.48550/arXiv.1607.06450 (Accessed: November 11, 2024).
[fn:7] Srivastava, N. et al. (2014) “Dropout: A simple way to prevent neural networks from overfitting,” The journal of machine learning research, Volume 15(1), pp. 1929–1958. Available at: https://www.cs.toronto.edu/~hinton/absps/JMLRdropout.pdf (Accessed: November 16, 2024).
[fn:8] Karpathy, A. (2024) “Karpathy/nanogpt.” Available at: https://github.com/karpathy/nanoGPT (Accessed: October 28, 2024).
# Local Variables:
# eval: (add-hook 'after-save-hook #'org-babel-tangle t t)
# End: