https://github.com/gs-101/nanogpt-from-scratch
Model created by following the "Let's build GPT: from scratch, in code, spelled out." lecture by Andrej Karpathy, in Org Mode.
- Host: GitHub
- URL: https://github.com/gs-101/nanogpt-from-scratch
- Owner: gs-101
- License: mit
- Created: 2024-11-16T21:31:22.000Z (11 months ago)
- Default Branch: main
- Last Pushed: 2025-01-28T21:01:05.000Z (8 months ago)
- Last Synced: 2025-02-06T09:13:46.014Z (8 months ago)
- Topics: llm, nanogpt, org-mode, python
- Language: Python
- Homepage: https://codeberg.org/gs-101/nanoGPT-from-scratch
- Size: 439 KB
- Stars: 0
- Watchers: 1
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.org
- License: LICENSE
README
:PROPERTIES:
:ID: fdda529f-14ed-4fe5-b898-5a2161d5d6b5
:ROAM_REFS: @karpathyLetsBuildGPT2023
:END:
#+title: Let’s Build GPT: From Scratch, in Code, Spelled Out.
#+filetags: :artificial_intelligence:computer_science:machine_learning:neural_and_evolutionary_computing:
#+OPTIONS: f:t
* Summary
Large Language Model lecture by Andrej Karpathy, going through the process of building a Generatively Pretrained Transformer (GPT), following the papers "Attention is All You Need"[fn:1] and "Language Models are Few-Shot Learners"[fn:2].
* [[https://www.youtube.com/watch?v=kCc8FmEb1nY&t=0s][Intro]]
Andrej introduces GPT. It's an auto-complete system that takes an input and outputs the most likely continuation. The Transformer is the part that does the heavy lifting, and it comes from the "Attention is All You Need" paper. This paper was extremely influential for the field of artificial intelligence, though the authors didn't fully anticipate in the paper the impact it would have. The goal of this lecture is to develop a transformer-based language model, not to replicate ChatGPT, which is trained on a massive amount of data. The dataset used in this lecture is [[https://huggingface.co/datasets/karpathy/tiny_shakespeare][tiny_shakespeare]], the complete works of Shakespeare in about 1 megabyte. In this local model, text is generated character by character, while modern LLMs use tokens, which are closer to word chunks. Another dataset that could be used is [[https://huggingface.co/datasets/Skylion007/openwebtext][OpenWebText]].
NOTE: This lecture uses Python and builds a GPT, so the approach is probably valid for other LLMs too. I hope the reason for these choices is explained here.
ANSWER: After watching the video, the reason Python is used is [[https://pytorch.org/][PyTorch]], a machine learning library that implements many mathematical functions for tensors, which are multidimensional arrays. Arrays are crucial in machine learning, as they are how characters are represented and arranged.
* [[https://www.youtube.com/watch?v=kCc8FmEb1nY&t=472s][Reading and Exploring Data]]
First things first, start with downloading a dataset:
#+begin_src bash
wget https://raw.githubusercontent.com/karpathy/char-rnn/master/data/tinyshakespeare/input.txt
#+end_src

Then we let Python interact with the file:
#+name: dataset
#+begin_src python :session train
with open("input.txt", "r", encoding="utf-8") as f:
    text = f.read()
#+end_src

Then, we get all the unique occurring characters in the text:
#+name: chars-train
#+begin_src python :session train
chars = sorted(list(set(text)))
vocab_size = len(chars)
print("".join(chars))
print(vocab_size)
#+end_src

* [[https://www.youtube.com/watch?v=kCc8FmEb1nY&t=568s][Tokenization]]
This encodes a text, returning the value of each of its characters as integers in an array. The encoded text can also be decoded to return the exact same string.
We then create a mapping to translate the characters:
#+name: decoding-train
#+begin_src python :session train
# create a mapping from characters to integers
stoi = {ch: i for i, ch in enumerate(chars)}
itos = {i: ch for i, ch in enumerate(chars)}
encode = lambda s: [
    stoi[c] for c in s
]  # encoder: take a string, output a list of integers
decode = lambda l: "".join(
    [itos[i] for i in l]
)  # decoder: take a list of integers, output a string
print(encode("hii there"))
print(decode(encode("hii there")))
#+end_src

There are other ways to do this, such as using sentencepiece[fn:3] (which encodes "sub-words"), or using tiktoken[fn:4]. This is supposed to be a simple transformer model, so these tokenisers aren't used here.
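As a point of comparison (an aside, not used in the rest of this document), a minimal sub-word tokenization sketch with tiktoken, assuming the ~tiktoken~ package is installed, could look like this:
#+begin_src python
import tiktoken  # assumption: the tiktoken package is installed

enc = tiktoken.get_encoding("gpt2")  # GPT-2's BPE vocabulary (~50k tokens)
ids = enc.encode("hii there")
print(ids)              # far fewer integers than the character-level encoding
print(enc.decode(ids))  # round-trips back to "hii there"
#+end_src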
Now that we have both an encoder and a decoder, we can translate the Shakespeare dataset.
#+name: import
#+begin_src python :session train
import torch
#+end_src

#+name: data_translate-train
#+begin_src python :session train :results output
data = torch.tensor(encode(text), dtype=torch.long)
print(data.shape, data.dtype)
print(data[:1000])
#+end_src

#+RESULTS: data_translate-train
#+begin_example
torch.Size([1115394]) torch.int64
tensor([18, 47, 56, 57, 58, 1, 15, 47, 58, 47, 64, 43, 52, 10, 0, 14, 43, 44,
53, 56, 43, 1, 61, 43, 1, 54, 56, 53, 41, 43, 43, 42, 1, 39, 52, 63,
1, 44, 59, 56, 58, 46, 43, 56, 6, 1, 46, 43, 39, 56, 1, 51, 43, 1,
57, 54, 43, 39, 49, 8, 0, 0, 13, 50, 50, 10, 0, 31, 54, 43, 39, 49,
6, 1, 57, 54, 43, 39, 49, 8, 0, 0, 18, 47, 56, 57, 58, 1, 15, 47,
58, 47, 64, 43, 52, 10, 0, 37, 53, 59, 1, 39, 56, 43, 1, 39, 50, 50,
1, 56, 43, 57, 53, 50, 60, 43, 42, 1, 56, 39, 58, 46, 43, 56, 1, 58,
53, 1, 42, 47, 43, 1, 58, 46, 39, 52, 1, 58, 53, 1, 44, 39, 51, 47,
57, 46, 12, 0, 0, 13, 50, 50, 10, 0, 30, 43, 57, 53, 50, 60, 43, 42,
8, 1, 56, 43, 57, 53, 50, 60, 43, 42, 8, 0, 0, 18, 47, 56, 57, 58,
1, 15, 47, 58, 47, 64, 43, 52, 10, 0, 18, 47, 56, 57, 58, 6, 1, 63,
53, 59, 1, 49, 52, 53, 61, 1, 15, 39, 47, 59, 57, 1, 25, 39, 56, 41,
47, 59, 57, 1, 47, 57, 1, 41, 46, 47, 43, 44, 1, 43, 52, 43, 51, 63,
1, 58, 53, 1, 58, 46, 43, 1, 54, 43, 53, 54, 50, 43, 8, 0, 0, 13,
50, 50, 10, 0, 35, 43, 1, 49, 52, 53, 61, 5, 58, 6, 1, 61, 43, 1,
49, 52, 53, 61, 5, 58, 8, 0, 0, 18, 47, 56, 57, 58, 1, 15, 47, 58,
47, 64, 43, 52, 10, 0, 24, 43, 58, 1, 59, 57, 1, 49, 47, 50, 50, 1,
46, 47, 51, 6, 1, 39, 52, 42, 1, 61, 43, 5, 50, 50, 1, 46, 39, 60,
43, 1, 41, 53, 56, 52, 1, 39, 58, 1, 53, 59, 56, 1, 53, 61, 52, 1,
54, 56, 47, 41, 43, 8, 0, 21, 57, 5, 58, 1, 39, 1, 60, 43, 56, 42,
47, 41, 58, 12, 0, 0, 13, 50, 50, 10, 0, 26, 53, 1, 51, 53, 56, 43,
1, 58, 39, 50, 49, 47, 52, 45, 1, 53, 52, 5, 58, 11, 1, 50, 43, 58,
1, 47, 58, 1, 40, 43, 1, 42, 53, 52, 43, 10, 1, 39, 61, 39, 63, 6,
1, 39, 61, 39, 63, 2, 0, 0, 31, 43, 41, 53, 52, 42, 1, 15, 47, 58,
47, 64, 43, 52, 10, 0, 27, 52, 43, 1, 61, 53, 56, 42, 6, 1, 45, 53,
53, 42, 1, 41, 47, 58, 47, 64, 43, 52, 57, 8, 0, 0, 18, 47, 56, 57,
58, 1, 15, 47, 58, 47, 64, 43, 52, 10, 0, 35, 43, 1, 39, 56, 43, 1,
39, 41, 41, 53, 59, 52, 58, 43, 42, 1, 54, 53, 53, 56, 1, 41, 47, 58,
47, 64, 43, 52, 57, 6, 1, 58, 46, 43, 1, 54, 39, 58, 56, 47, 41, 47,
39, 52, 57, 1, 45, 53, 53, 42, 8, 0, 35, 46, 39, 58, 1, 39, 59, 58,
46, 53, 56, 47, 58, 63, 1, 57, 59, 56, 44, 43, 47, 58, 57, 1, 53, 52,
1, 61, 53, 59, 50, 42, 1, 56, 43, 50, 47, 43, 60, 43, 1, 59, 57, 10,
1, 47, 44, 1, 58, 46, 43, 63, 0, 61, 53, 59, 50, 42, 1, 63, 47, 43,
50, 42, 1, 59, 57, 1, 40, 59, 58, 1, 58, 46, 43, 1, 57, 59, 54, 43,
56, 44, 50, 59, 47, 58, 63, 6, 1, 61, 46, 47, 50, 43, 1, 47, 58, 1,
61, 43, 56, 43, 0, 61, 46, 53, 50, 43, 57, 53, 51, 43, 6, 1, 61, 43,
1, 51, 47, 45, 46, 58, 1, 45, 59, 43, 57, 57, 1, 58, 46, 43, 63, 1,
56, 43, 50, 47, 43, 60, 43, 42, 1, 59, 57, 1, 46, 59, 51, 39, 52, 43,
50, 63, 11, 0, 40, 59, 58, 1, 58, 46, 43, 63, 1, 58, 46, 47, 52, 49,
1, 61, 43, 1, 39, 56, 43, 1, 58, 53, 53, 1, 42, 43, 39, 56, 10, 1,
58, 46, 43, 1, 50, 43, 39, 52, 52, 43, 57, 57, 1, 58, 46, 39, 58, 0,
39, 44, 44, 50, 47, 41, 58, 57, 1, 59, 57, 6, 1, 58, 46, 43, 1, 53,
40, 48, 43, 41, 58, 1, 53, 44, 1, 53, 59, 56, 1, 51, 47, 57, 43, 56,
63, 6, 1, 47, 57, 1, 39, 57, 1, 39, 52, 0, 47, 52, 60, 43, 52, 58,
53, 56, 63, 1, 58, 53, 1, 54, 39, 56, 58, 47, 41, 59, 50, 39, 56, 47,
57, 43, 1, 58, 46, 43, 47, 56, 1, 39, 40, 59, 52, 42, 39, 52, 41, 43,
11, 1, 53, 59, 56, 0, 57, 59, 44, 44, 43, 56, 39, 52, 41, 43, 1, 47,
57, 1, 39, 1, 45, 39, 47, 52, 1, 58, 53, 1, 58, 46, 43, 51, 1, 24,
43, 58, 1, 59, 57, 1, 56, 43, 60, 43, 52, 45, 43, 1, 58, 46, 47, 57,
1, 61, 47, 58, 46, 0, 53, 59, 56, 1, 54, 47, 49, 43, 57, 6, 1, 43,
56, 43, 1, 61, 43, 1, 40, 43, 41, 53, 51, 43, 1, 56, 39, 49, 43, 57,
10, 1, 44, 53, 56, 1, 58, 46, 43, 1, 45, 53, 42, 57, 1, 49, 52, 53,
61, 1, 21, 0, 57, 54, 43, 39, 49, 1, 58, 46, 47, 57, 1, 47, 52, 1,
46, 59, 52, 45, 43, 56, 1, 44, 53, 56, 1, 40, 56, 43, 39, 42, 6, 1,
52, 53, 58, 1, 47, 52, 1, 58, 46, 47, 56, 57, 58, 1, 44, 53, 56, 1,
56, 43, 60, 43, 52, 45, 43, 8, 0, 0])
#+end_example

Here, we used [[https://pytorch.org/][PyTorch]], a Python-based machine learning framework, to store the encoded data. We then print the first 1000 characters of the saved data to visualize it.
Andrej likes to split his dataset in a 90% - 10% basis. That is, 90% will be used for training data, while the other 10% will be used for validation.
#+name: data_split-train
#+begin_src python :session train
# Let's now split up the data into train and validation sets
n = int(0.9 * len(data)) # first 90% will be train, rest val
train_data = data[:n]
val_data = data[n:]
#+end_src

If the model were trained and evaluated on 100% of the data, we couldn't tell whether it is learning patterns or simply memorizing the dataset; the validation split is what lets us measure this overfitting.
* [[https://www.youtube.com/watch?v=kCc8FmEb1nY&t=867s][Data Loader]]
Now, we'll feed text sequences into the transformer. This is done by sampling chunks of the dataset. The transformer trains on every position within a chunk. The ~+1~ is there because the targets are the same chunk shifted one character to the right. Training on every prefix like this also gets the model used to predicting from contexts of any size, from a single character up to ~block_size~.
#+name: optimizer-train
#+begin_src python :session train
block_size = 8
train_data[: block_size + 1]
x = train_data[:block_size]
y = train_data[1: block_size + 1]
for t in range(block_size):
    context = x[: t + 1]
    target = y[t]
    print(f"when input is {context} the target: {target}")
#+end_src

Every time we feed text to the transformer, it is sent as a batch of multiple chunks, to keep the GPU busy, since GPUs are very efficient at parallel processing. The seed is set to a specific value below so the results are reproducible.
#+name: reproducibility-train
#+begin_src python :session train
torch.manual_seed(1337)
batch_size = 4 # how many independent sequences will we process in parallel?
block_size = 8 # what is the maximum content length for predictions?
#+end_src

#+name: get_batch-train
#+begin_src python :session train
def get_batch(split):
    # generate a small batch of data of inputs x and targets y
    data = train_data if split == "train" else val_data
#+end_src

#+name: get_batch_randint-train
#+begin_src python :session train
ix = torch.randint(len(data) - block_size, (batch_size,))
#+end_src

This line generates ~batch_size~ random integers, so a ~batch_size~ of 4 generates four random starting offsets into the data.
#+name: get_batch_offset-train
#+begin_src python :session train
x = torch.stack([data[i: i + block_size] for i in ix])
#+end_src

Each ~i~ is a random offset; the slice from ~i~ to ~i + block_size~ is one chunk. ~torch.stack~ takes all these one-dimensional tensors and stacks them into rows, like a table.
#+name: get_batch_targets-train
#+begin_src python :session train
y = torch.stack([data[i + 1: i + block_size + 1] for i in ix])
return x, y
#+end_src

This builds the targets (for each position, the character that follows the corresponding sequence in ~x~).
#+name: get_batch_results-train
#+begin_src python :session train :results output
xb, yb = get_batch("train")
print("inputs")
print(xb.shape)
print(xb)
print("targets:")
print(yb.shape)
print(yb)
print("----")

for b in range(batch_size): # batch dimension
    for t in range(block_size): # time dimension
        context = xb[b, : t + 1]
        target = yb[b, t]
        print(f"when input is {context.tolist()} the target: {target}")
#+end_src

* [[https://www.youtube.com/watch?v=kCc8FmEb1nY&t=1331s][Simplest Baseline]]
Now we're going to use the simplest language model available, a bigram model.
#+name: import_nn
#+begin_src python :session train
import torch.nn as nn
from torch.nn import functional as F
#+end_src

#+name: model-train
#+begin_src python :session train
class BigramLanguageModel(nn.Module):
    def __init__(self, vocab_size):
        super().__init__()
        # each token directly reads off the logits for the next token
        # from a lookup table
        self.token_embedding_table = nn.Embedding(vocab_size, vocab_size)
#+end_src

Here, we are creating a lookup table for the model, sized by the vocabulary, where each integer indexes its own row (so 25 goes to row 25, 17 to row 17, and so forth).
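A minimal illustration of that lookup behaviour (an aside, not part of the lecture code): indexing the embedding with a token id simply returns the corresponding row of the table.
#+begin_src python
import torch
import torch.nn as nn

# illustrative: nn.Embedding is a trainable lookup table; token id 25 selects row 25
vocab_size = 65  # assumption: the 65-character Shakespeare vocabulary from above
emb = nn.Embedding(vocab_size, vocab_size)
print(emb(torch.tensor([25])).shape)  # torch.Size([1, 65])
#+end_src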
#+name: forward_v1-train
#+begin_src python :session train
def forward(self, idx, targets):
    # idx and targets are both (B, T) tensor of integers
    logits = self.token_embedding_table(idx)  # (B, T, C)
    return logits
#+end_src

The predictions are produced in this bit. The index ~idx~ is passed to the embedding table, and the output is arranged along a batch ~B~, time ~T~ and channel ~C~ dimension. The logits are the predictions (scores for the next character).
#+name: model_vocab_size-train
#+begin_src python :session train
m = BigramLanguageModel(vocab_size)
out = m(xb, yb)
#+end_src

A good way to measure prediction quality is to calculate the negative log-likelihood of the correct next character. This operation is implemented in PyTorch under the name ~cross_entropy~.
#+name: loss-train
#+begin_src python :session train
loss = F.cross_entropy(logits, targets)
#+end_src

This takes in the predictions (the logits) and the targets (the identity of the next character). As written, this code won't run, because for multidimensional inputs PyTorch expects the channel dimension second (+B, T, C+ \rightarrow B, C, T).
#+name: logits_shape-train
#+begin_src python :session train
B, T, C = logits.shape
logits = logits.view(B * T, C)
targets = targets.view(B * T)
#+end_src

This is how to correct the shape of the logits. We take B and T and couple them together into one dimension, forming a two-dimensional array that fulfils PyTorch's requirements. The same also has to be done to our targets (but, in their case, it's 2D \rightarrow 1D).
And then we print out our results:
#+name: loss_results-train
#+begin_src python :session train :results output
print(logits.shape)
print(loss)
#+end_src

#+RESULTS: loss_results-train
: torch.Size([256, 65])
: tensor(2.4860, grad_fn=)

If the model assigned uniform probability to each of the 65 possible characters, we would expect a loss of \(-\log{\frac{1}{65}} \approx 4.17\).
The loss in the video at this point was higher, ~4.8786~; I got a different result here.
Now, our next step is the generation:
#+name: generate-train
#+begin_src python :session train
def generate(self, idx, max_new_tokens):
    # idx is (B, T) array of indices in the current context
    for _ in range(max_new_tokens):
#+end_src

#+name: generate_loss-train
#+begin_src python :session train
# get the predictions
logits, loss = self(idx)
#+end_src

The loss is also returned here, but it's not actually used during generation.
#+name: generate_idx-train
#+begin_src python :session train
# focus only on the last time step
logits = logits[:, -1, :] # becomes (B, C)
# apply softmax to get probabilities
probs = F.softmax(logits, dim=-1) # (B, C)
# sample from the distribution
idx_next = torch.multinomial(probs, num_samples=1) # (B , 1)
#+end_src

~idx~ is the current context of each sequence in the batch. The job of ~generate~ is to extend it by one sampled token at a time.
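A small illustration of the sampling step (an aside, using made-up probabilities): ~torch.multinomial~ draws an index with probability proportional to each entry.
#+begin_src python
import torch

# illustrative: with these (made-up) probabilities, index 1 is drawn about 70% of the time
probs = torch.tensor([[0.1, 0.7, 0.2]])
print(torch.multinomial(probs, num_samples=1))  # most often tensor([[1]])
#+end_src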
#+name: prediction_results-train
#+begin_src python :session train
# append sampled index to the running sequence
idx = torch.cat((idx, idx_next), dim=1) # (B, T+1)
return idx
#+end_src

In this part, the result of the prediction, ~idx_next~, is concatenated to the previous context, ~idx~, keeping all the text together (NOTE: so ~idx~ FO + ~idx_next~ O = ~idx~ FOO). This is a general approach that is somewhat wasteful here: we feed the model the whole character history, even though a bigram model only ever looks at the previous character. This function will be expanded on later.
This layout would give us an error, so we have to return to the ~forward~ function and set its targets to ~None~, since they aren't defined here.
#+name: forward_v2-train
#+begin_src python :session train
def forward(self, idx, targets=None):
    # idx and targets are both (B, T) tensor of integers
    logits = self.token_embedding_table(idx)  # (B, T, C)

    if targets is None:
        loss = None
    else:
        B, T, C = logits.shape
        logits = logits.view(B * T, C)
        targets = targets.view(B * T)
        loss = F.cross_entropy(logits, targets)

    return logits, loss
#+end_src

This makes it so that, if we actually have targets, the loss is also returned; during generation this won't happen, as ~targets~ defaults to ~None~.
#+name: generate_print-train
#+begin_src python :session train :results output
print(
decode(
m.generate(idx=torch.zeros((1, 1), dtype=torch.long), max_new_tokens=100)[
0
].tolist()
)
)
#+end_src

#+RESULTS: generate_print-train
:
: I ler totel me otche, PERCofonjothir pe et s men:
: ORKIIV: m, y rit I m ay s tathearer ate win pit po

This is the code used for generation. ~tolist()~ converts the PyTorch tensor into a conventional Python list, which ~decode~ turns back into text. This first result is completely nonsensical, because the model hasn't been trained yet.
* [[https://www.youtube.com/watch?v=kCc8FmEb1nY&t=2093s][Training the Bigram Model]]
Now, we're going to actually train our model, to give the output some sense.
We start creating a PyTorch optimizer:
#+name: training-train
#+begin_src python :session train :results output
# create a PyTorch optimizer
optimizer = torch.optim.AdamW(m.parameters(), lr=1e-3)
batch_size = 32
for steps in range(10000):

    # sample a batch of data
    xb, yb = get_batch('train')

    # evaluate the loss
    logits, loss = m(xb, yb)
    optimizer.zero_grad(set_to_none=True)
    loss.backward()
    optimizer.step()

print(loss.item())
#+end_src

#+RESULTS: training-train
: 2.4859519004821777

There are other optimizer options, but [[https://pytorch.org/docs/stable/generated/torch.optim.AdamW.html][AdamW]] was the most popular at the time of the video (<2023-01-17 Tue>). The ~range~ determines the number of iterations. By doing this, we brought our loss down to about 2.5! Now our +decode+ ~generate~ function should produce better output.
* Final Result
#+begin_src python :noweb yes :mkdirp yes :tangle ~/Projects/Code/study/nanoGPT-from-scratch/train.py
<>
<><>
<>
<>
<>
<>
<>
<>
<>
<>
<>
<><>
<>
<>
<>
<>
<>
<><>
<>
<>
<>
#+end_src
* [[https://www.youtube.com/watch?v=kCc8FmEb1nY&t=2280s][Port our Code to a Script]]
#+name: hyperparameters-bigram
#+begin_src python
batch_size = 32 # how many independent sequences will we process in parallel?
block_size = 8 # what is the maximum content length for predictions?
max_itters = 3000
eval_interval = 300
learning_rate = 1e-2
device = 'cuda' if torch.cuda.is_available() else 'cpu'
eval_iters = 200
#+end_src

~device~ here allows the code to run on a GPU, through [[https://developer.nvidia.com/blog/even-easier-introduction-cuda/][CUDA]], when one is available.
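A quick hedged check of what ~device~ does (an aside, not part of the script): tensors and modules have to be moved onto it explicitly.
#+begin_src python
import torch

device = 'cuda' if torch.cuda.is_available() else 'cpu'
x = torch.randn(2, 3).to(device)  # moves the tensor to the GPU when one is available
print(x.device)                   # cuda:0 or cpu
#+end_src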
#+name: reproducibility
#+begin_src python
torch.manual_seed(1337)
#+end_src

#+name: decoding
#+begin_src python
# here are all the unique characters that occur in this text
chars = sorted(list(set(text)))
vocab_size = len(chars)
# create a mapping from characters to integers
stoi = {ch: i for i, ch in enumerate(chars)}
itos = {i: ch for i, ch in enumerate(chars)}
encode = lambda s: [
    stoi[c] for c in s
]  # encoder: take a string, output a list of integers
decode = lambda l: "".join(
    [itos[i] for i in l]
)  # decoder: take a list of integers, output a string
#+end_src

#+name: data_split
#+begin_src python
data = torch.tensor(encode(text), dtype=torch.long)
n = int(0.9 * len(data)) # first 90% will be train, rest val
train_data = data[:n]
val_data = data[n:]
#+end_src

#+name: get_batch
#+begin_src python
def get_batch(split):
    # generate a small batch of data of inputs x and targets y
    data = train_data if split == "train" else val_data
    ix = torch.randint(len(data) - block_size, (batch_size,))
    x = torch.stack([data[i: i + block_size] for i in ix])
    y = torch.stack([data[i + 1: i + block_size + 1] for i in ix])
    return x, y
#+end_src

#+name: estimate_loss
#+begin_src python
@torch.no_grad()
def estimate_loss():
    out = {}
    model.eval()
    for split in ['train', 'val']:
        losses = torch.zeros(eval_iters)
        for k in range(eval_iters):
            X, Y = get_batch(split)
            logits, loss = model(X, Y)
            losses[k] = loss.item()
        out[split] = losses.mean()
    model.train()
    return out
#+end_src

This chunk averages the loss over many batches, giving a much less noisy estimate than a single batch. The model is switched to evaluation mode and then back to training mode here, which currently doesn't change anything, because no mode-specific layers have been defined in the model yet. It is still good practice to keep track of which mode the model is in, because some layers behave differently per mode and could give unintended results. ~@torch.no_grad()~ is a decorator (also usable as a context manager) that tells PyTorch *not* to track gradients for anything inside this function, so ~.backward()~ will never be called on these results and memory is saved.
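A small illustration of what ~@torch.no_grad()~ changes (an aside, not part of the script):
#+begin_src python
import torch

w = torch.randn(3, requires_grad=True)
with torch.no_grad():
    y = (w * 2).sum()   # no computation graph is recorded inside this block
print(y.requires_grad)  # False, so y.backward() would raise an error
#+end_src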
#+name: model-bigram
#+begin_src python
# super simple bigram model
class BigramLanguageModel(nn.Module):

    def __init__(self, vocab_size):
        super().__init__()
        # each token directly reads off the logits for the next token
        # from a lookup table
        self.token_embedding_table = nn.Embedding(vocab_size, vocab_size)

    def forward(self, idx, targets=None):
        # idx and targets are both (B, T) tensor of integers
        logits = self.token_embedding_table(idx)  # (B, T, C)

        if targets is None:
            loss = None
        else:
            B, T, C = logits.shape
            logits = logits.view(B * T, C)
            targets = targets.view(B * T)
            loss = F.cross_entropy(logits, targets)

        return logits, loss

    def generate(self, idx, max_new_tokens):
        # idx is (B, T) array of indices in the current context
        for _ in range(max_new_tokens):
            # get the predictions
            logits, loss = self(idx)
            # focus only on the last time step
            logits = logits[:, -1, :] # becomes (B, C)
            # apply softmax to get probabilities
            probs = F.softmax(logits, dim=-1) # (B, C)
            # sample from the distribution
            idx_next = torch.multinomial(probs, num_samples=1) # (B , 1)
            # append sampled index to the running sequence
            idx = torch.cat((idx, idx_next), dim=1) # (B, T+1)
        return idx

model = BigramLanguageModel(vocab_size)
m = model.to(device)
#+end_src

#+name: optimizer-bigram
#+begin_src python
optimizer = torch.optim.AdamW(m.parameters(), lr=1e-3)
#+end_src

#+name: training-bigram
#+begin_src python
for iter in range(max_itters):

    # every once in a while evaluate the loss on train and val sets
    if iter % eval_interval == 0:
        losses = estimate_loss()
        print(f"step {iter}: train loss {losses['train']:.4f}, val loss {losses['val']:.4f}")

    # sample a batch of data
    xb, yb = get_batch('train')

    # evaluate the loss
    logits, loss = model(xb, yb)
    optimizer.zero_grad(set_to_none=True)
    loss.backward()
    optimizer.step()
#+end_src

#+name: context-bigram
#+begin_src python
context = torch.zeros((1, 1), dtype=torch.long, device=device)
print(decode(m.generate(context, max_new_tokens=500)[0].tolist()))
#+end_src

~device~ here allows us to pass the context to the device (GPU) if it is available.
** Final Result
#+begin_src python :session bigram :noweb yes :results output
<>
<><>
<>
<>
<>
<>
<>
<>
<>
<>
<>
<>
#+end_src

* Building Self-Attention
Now, we'll get to the main point of the video and the paper[fn:1]: self-attention.
** [[https://www.youtube.com/watch?v=kCc8FmEb1nY&t=2533s][Averaging Past Context with For Loops]]
Before just jumping into the topic of self-attention, Andrej shows us a simple but essential part of an efficient self-attention implementation, which is token-to-token communication, /excluding/ any connection with following tokens (NOTE: so in ~FOOBAR~, ~B~ would have ~FOO~ as context, not ~BAR~). Information should only flow *from the past to the present*.
The simplest way to do this, introduced here, is to average the current token with all the previous ones. This throws away a lot of information, but it will be improved on later.
#+begin_src python :session bigram :results output
torch.manual_seed(1337)
B, T, C = 4, 8, 2 # batch, time, channels
x = torch.randn(B, T, C)
x.shape  # torch.Size([4, 8, 2])

# We want x[b, t] = mean_{i<=t} x[b, i]
xbow = torch.zeros((B, T, C))
for b in range(B):
    for t in range(T):
        xprev = x[b, :t+1] # (t, C)
        xbow[b, t] = torch.mean(xprev, 0)
#+end_src

~bow~ here means /Bag of Words/ (just bundling the tokens together). This is a simple for loop, so it's not that efficient. It iterates over the batch, then the time (tokens), and collects the current token and the previous ones. These are then passed to ~torch.mean~, which does the averaging.
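Written out, the quantity the loop computes for every batch ~b~ and time step ~t~ is:
\[\mathrm{xbow}[b, t] = \frac{1}{t + 1} \sum_{i = 0}^{t} x[b, i]\]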
This process can be optimized using matrix multiplication:
#+begin_src python :session bigram :results output
torch.manual_seed(42)
a = torch.ones(3, 3)
b = torch.randint(0, 10, (3, 2)).float()
c = a @ b
print('a=')
print(a)
print('--')
print('b=')
print(b)
print('--')
print('c=')
print(c)
#+end_src

#+RESULTS:
#+begin_example
a=
tensor([[1., 1., 1.],
[1., 1., 1.],
[1., 1., 1.]])
--
b=
tensor([[2., 7.],
[6., 4.],
[6., 5.]])
--
c=
tensor([[14., 16.],
[14., 16.],
[14., 16.]])
#+end_example

~c~ is the result of matrix ~a~ multiplied by matrix ~b~. Since ~a~ is just a matrix of 1s, each row of ~c~ is simply the sum of the columns of ~b~ (2 + 6 + 6 = 14).
PyTorch has a function called ~tril~ (~torch.tril~), which takes a matrix and keeps only its lower triangular part, zeroing out everything above the diagonal. This can be used to average out the values of the rows.
#+begin_src python :session bigram :results output
torch.manual_seed(42)
a = torch.tril(torch.ones(3, 3))
a = a / torch.sum(a, 1, keepdim=True)
b = torch.randint(0, 10, (3, 2)).float()
c = a @ b
print('a=')
print(a)
print('--')
print('b=')
print(b)
print('--')
print('c=')
print(c)
#+end_src

#+RESULTS:
#+begin_example
a=
tensor([[1.0000, 0.0000, 0.0000],
[0.5000, 0.5000, 0.0000],
[0.3333, 0.3333, 0.3333]])
--
b=
tensor([[2., 7.],
[6., 4.],
[6., 5.]])
--
c=
tensor([[2.0000, 7.0000],
[4.0000, 5.5000],
[4.6667, 5.3333]])
#+end_example

Now, the second row of ~c~ is the average of the first and second rows of ~b~, and its third row is the average of all the rows of ~b~.
** [[https://www.youtube.com/watch?v=kCc8FmEb1nY&t=3114s][Matrix Multiplication]]
After learning this, we can return to our for loop and optimize it:
#+begin_src python :session bigram :results output
# We want x[b, t] = mean_{i<=t} x[b, i]
xbow = torch.zeros((B, T, C))
for b in range(B):
    for t in range(T):
        xprev = x[b, :t+1] # (t, C)
        xbow[b, t] = torch.mean(xprev, 0)
wei = torch.tril(torch.ones(T, T))
wei = wei / wei.sum(1, keepdim=True)
print(wei)
#+end_src

#+RESULTS:
: tensor([[1.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000],
: [0.5000, 0.5000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000],
: [0.3333, 0.3333, 0.3333, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000],
: [0.2500, 0.2500, 0.2500, 0.2500, 0.0000, 0.0000, 0.0000, 0.0000],
: [0.2000, 0.2000, 0.2000, 0.2000, 0.2000, 0.0000, 0.0000, 0.0000],
: [0.1667, 0.1667, 0.1667, 0.1667, 0.1667, 0.1667, 0.0000, 0.0000],
: [0.1429, 0.1429, 0.1429, 0.1429, 0.1429, 0.1429, 0.1429, 0.0000],
: [0.1250, 0.1250, 0.1250, 0.1250, 0.1250, 0.1250, 0.1250, 0.1250]])

Within each row the values are equal and sum to 1, so multiplying by ~wei~ averages exactly that many preceding positions. This lets us produce an identical tensor while replacing the nested loop. (NOTE: You may notice that I removed ~torch.allclose~ here. That's because it was printing ~False~, but I verified the results matched by inspecting ~xbow[0]~ and ~xbow2[0]~, just like in the video.)
#+begin_src python :session bigram :results output
# We want x[b, t] = mean_{i<=t} x[b, i]
xbow = torch.zeros((B, T, C))
for b in range(B):
    for t in range(T):
        xprev = x[b, :t+1] # (t, C)
        xbow[b, t] = torch.mean(xprev, 0)
wei = torch.tril(torch.ones(T, T))
wei = wei / wei.sum(1, keepdim=True)
xbow2 = wei @ x # (T, T) @ (B, T, C) -> (B, T, T) @ (B, T, C) = (B, T, C)
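
# illustrative check (not in the video): both versions should differ only by
# floating-point rounding, so the largest absolute difference is tiny and a
# loose tolerance reports them as equal
print((xbow - xbow2).abs().max())
print(torch.allclose(xbow, xbow2, atol=1e-6))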
#+end_src

** [[https://www.youtube.com/watch?v=kCc8FmEb1nY&t=3282s][softmax]]
:PROPERTIES:
:CUSTOM_ID: softmax
:END:

Now we'll see a different way to build the same weight matrix, using ~softmax~ instead.
#+begin_src python :session bigram :results output
tril = torch.tril(torch.ones(T, T))
wei = torch.zeros((T, T))
wei = wei.masked_fill(tril == 0, float('-inf'))
wei = F.softmax(wei, dim=-1)
xbow3 = wei @ x
#+end_src

~masked_fill~ replaces every position where ~tril~ is 0 (the future positions) with negative infinity. ~softmax~ then normalizes each row, turning those positions into probability 0 and the rest into the same averaging weights as before, so ~xbow3~ is identical to the previous results.
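A tiny hedged illustration of why the ~-inf~ trick works: ~softmax~ maps ~-inf~ entries to probability 0 and shares the rest of the row among the visible positions.
#+begin_src python
import torch
from torch.nn import functional as F

# illustrative: the masked (future) position gets probability 0,
# the visible positions share the rest of the row equally
row = torch.tensor([0.0, 0.0, float('-inf')])
print(F.softmax(row, dim=-1))  # tensor([0.5000, 0.5000, 0.0000])
#+end_src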
** [[https://www.youtube.com/watch?v=kCc8FmEb1nY&t=3506s][Code Cleanup]]
Now we can go back and clean up some of the code. First, there's no need to pass the ~vocab_size~ to the constructor, because it is set as a global variable.
#+name: model_constructor_global_vocab_size-bigram
#+begin_src python :session bigram
model = BigramLanguageModel()
m = model.to(device)
#+end_src

We'll also introduce a new variable, ~n_embd~, for the number of embedding dimensions.
#+begin_src python
...
device = 'cuda' if torch.cuda.is_available() else 'cpu'
eval_iters = 200
n_embd = 32
#+end_src

#+name: init_n_embd-bigram
#+begin_src python :session bigram
# super simple bigram model
class BigramLanguageModel(nn.Module):

    def __init__(self):
        super().__init__()
        # each token directly reads off the logits for the next token
        # from a lookup table
        self.token_embedding_table = nn.Embedding(vocab_size, n_embd)
#+end_src

The embedding table no longer produces logits directly, but token embeddings, so we'll call the result ~tok_emb~ instead.
#+name: forward_tok_emb-bigram
#+begin_src python :session bigram
# idx and targets are both (B, T) tensor of integers
tok_emb = self.token_embedding_table(idx) # (B, T ,C)
#+end_src

And then, to actually get the logits, we'll need a linear layer:
#+name: init_linear_layer-bigram
#+begin_src python
self.lm_head = nn.Linear(n_embd, vocab_size)
#+end_src

Back to the forward function:
#+name: init_logits-bigram
#+begin_src python
logits = self.lm_head(tok_emb) # (B, T, vocab_size)
#+end_src

#+name: model_new_body-bigram
#+begin_src python :session bigram
if targets is None:
    loss = None
else:
    B, T, C = logits.shape
    logits = logits.view(B * T, C)
    targets = targets.view(B * T)
    loss = F.cross_entropy(logits, targets)

return logits, loss

def generate(self, idx, max_new_tokens):
    # idx is (B, T) array of indices in the current context
    for _ in range(max_new_tokens):
        # get the predictions
        logits, loss = self(idx)
        # focus only on the last time step
        logits = logits[:, -1, :] # becomes (B, C)
        # apply softmax to get probabilities
        probs = F.softmax(logits, dim=-1) # (B, C)
        # sample from the distribution
        idx_next = torch.multinomial(probs, num_samples=1) # (B , 1)
        # append sampled index to the running sequence
        idx = torch.cat((idx, idx_next), dim=1) # (B, T+1)
    return idx
#+end_src

We'll also add a position embedding table:
#+name: init_position_embedding_table-bigram
#+begin_src python :session bigram
self.position_embedding_table = nn.Embedding(block_size, n_embd)
#+end_src

Then we unpack ~B, T~ from ~idx.shape~ so that ~x~ can hold the token embeddings *and* their positions:
#+name: forward_decode-bigram
#+begin_src python :session bigram
def forward(self, idx, targets=None):
    B, T = idx.shape

    # idx and targets are both (B, T) tensor of integers
    tok_emb = self.token_embedding_table(idx)  # (B, T, C)
    pos_emb = self.position_embedding_table(torch.arange(T, device=device)) # (T, C)
    x = tok_emb + pos_emb # (B, T, C)
    logits = self.lm_head(x) # (B, T, vocab_size)
#+end_src

#+begin_src python :session bigram :noweb yes
<>
<><>
n_embd = 32<>
<>
<>
<>
<>
<>
<>
<>
<><>
<>
<><>
<>
<>
#+end_src

** [[https://www.youtube.com/watch?v=kCc8FmEb1nY&t=3720s][Self-Attention]]
We'll now start our implementation of self-attention.
#+begin_src python :session bigram
torch.manual_seed(1337)
B, T, C = 4, 8, 32
x = torch.randn(B, T, C)

tril = torch.tril(torch.ones(T, T))
wei = torch.zeros((T, T))
wei = wei.masked_fill(tril == 0, float('-inf'))
wei = F.softmax(wei, dim=-1)
out = wei @ x

out.shape
#+end_src

#+RESULTS:
: torch.Size([4, 8, 32])

We increased the number of channels and combined the previous concepts (like [[#softmax][softmax]]) to average out the information from past tokens.
Self-attention is used to gather context from past tokens, and it does this by having every token emit two vectors, a *query* ("what am I looking for?") and a *key* ("what do I contain?"). The affinity between tokens is obtained by taking the *dot product* between queries and keys. (NOTE: higher = more affinity)
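For reference, the scaled dot-product attention formula from the paper[fn:1], which the head implementation below builds up to (the \(\sqrt{d_k}\) scaling appears there as ~C**-0.5~):
\[\operatorname{Attention}(Q, K, V) = \operatorname{softmax}\!\left(\frac{Q K^{\top}}{\sqrt{d_k}}\right) V\]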
Now, we'll have a single head perform self-attention:
#+begin_src python
head_size = 16
key = nn.Linear(C, head_size, bias=False)
query = nn.Linear(C, head_size, bias=False)
k = key(x) # (B, T, 16)
q = query(x) # (B, T, 16)
wei = q @ k.transpose(-2, -1) # (B, T, 16) @ (B, 16, T) -> (B, T, T)

tril = torch.tril(torch.ones(T, T))
wei = wei.masked_fill(tril == 0, float('-inf'))
wei = F.softmax(wei, dim=-1)
out = wei @ x

out.shape
#+end_src

*** Notes
- [[https://www.youtube.com/watch?v=kCc8FmEb1nY&t=4298s][Communication]] ::
Attention is a communication mechanism for information. It aggregates information from the weighted sum of the data.
- [[https://www.youtube.com/watch?v=kCc8FmEb1nY&t=4366s][No Notion of Space]] ::
Attention, at least in this implementation, has no notion of space. It simply acts over a set of vectors. That's why it's necessary to positionally encode the tokens.
- [[https://www.youtube.com/watch?v=kCc8FmEb1nY&t=4420s][No Batch Communication]] ::
The elements on the batch dimension never communicate with each other.
- [[https://www.youtube.com/watch?v=kCc8FmEb1nY&t=4454s][Encoder vs. Decoder]] ::
In some cases where the full context is important, all nodes are allowed to communicate with each other; that is an *encoder* block. If following this implementation, just delete the masked fill and the nodes will be allowed to communicate freely. What we built here is a *decoder* block: the triangular mask hides future tokens so the model has to predict them itself.
- [[https://www.youtube.com/watch?v=kCc8FmEb1nY&t=4539s][Attention vs. Self-Attention vs. Cross-Attention]] ::
- Attention ::
Variables can get their value from *different* sources.
- Self-Attention ::
Variables get their value from the *same* source.
- Cross-Attention ::
Variables get their value from a *separate* source of nodes.
* Building the Transformer
** [[https://www.youtube.com/watch?v=kCc8FmEb1nY&t=4751s][Introducing Self-Attention to our Network]]
#+name: head-bigram
#+begin_src python :session bigram
class Head(nn.Module):
    """ one head of self-attention """

    def __init__(self, head_size):
        super().__init__()
        self.key = nn.Linear(n_embd, head_size, bias=False)
        self.query = nn.Linear(n_embd, head_size, bias=False)
        self.value = nn.Linear(n_embd, head_size, bias=False)
        self.register_buffer('tril', torch.tril(torch.ones(block_size, block_size)))

    def forward(self, x):
        B, T, C = x.shape
        k = self.key(x) # (B, T, C)
        q = self.query(x) # (B, T, C)
        # compute attention scores ("affinities")
        wei = q @ k.transpose(-2, -1) * C**-0.5 # (B, T, C) @ (B, C, T) -> (B, T, T)
        wei = wei.masked_fill(self.tril[:T, :T] == 0, float('-inf')) # (B, T, T)
        wei = F.softmax(wei, dim=-1) # (B, T, T)
        # perform the weighted aggregation of the values
        v = self.value(x) # (B, T, C)
        out = wei @ v # (B, T, T) @ (B, T, C) -> (B, T, C)
        return out
#+end_src

#+name: model_self_attention-bigram
#+begin_src python :session bigram
class BigramLanguageModel(nn.Module):

    def __init__(self, vocab_size):
        super().__init__()
        # each token directly reads off the logits for the next token
        # from a lookup table
        self.token_embedding_table = nn.Embedding(vocab_size, n_embd)
        self.position_embedding_table = nn.Embedding(block_size, n_embd)
        self.sa_head = Head(head_size=n_embd)
        self.lm_head = nn.Linear(n_embd, vocab_size)

    def forward(self, idx, targets=None):
        B, T = idx.shape

        # idx and targets are both (B, T) tensor of integers
        tok_emb = self.token_embedding_table(idx) # (B, T, C)
        pos_emb = self.position_embedding_table(torch.arange(T, device=device)) # (T, C)
        x = tok_emb + pos_emb # (B, T, C)
        x = self.sa_head(x) # apply one head of self-attention (B, T, C)
        logits = self.lm_head(x) # (B, T, vocab_size)

        if targets is None:
            loss = None
        else:
            B, T, C = logits.shape
            logits = logits.view(B * T, C)
            targets = targets.view(B * T)
            loss = F.cross_entropy(logits, targets)

        return logits, loss

    def generate(self, idx, max_new_tokens):
        # idx is (B, T) array of indices in the current context
        for _ in range(max_new_tokens):
            # crop idx to the last block_size tokens
            idx_cond = idx[:, -block_size:]
            # get the predictions
            logits, loss = self(idx_cond)
            # focus only on the last time step
            logits = logits[:, -1, :] # becomes (B, C)
            # apply softmax to get probabilities
            probs = F.softmax(logits, dim=-1) # (B, C)
            # sample from the distribution
            idx_next = torch.multinomial(probs, num_samples=1) # (B , 1)
            # append sampled index to the running sequence
            idx = torch.cat((idx, idx_next), dim=1) # (B, T+1)
        return idx

model = BigramLanguageModel(vocab_size)
m = model.to(device)
#+end_src

#+begin_src python :session bigram :noweb yes
<>
<><>
n_embd = 32
<>
<>
<>
<>
<>
<>
<>
<>
<>
<>
<>
#+end_src

#+RESULTS:
#+begin_example
step 0: train loss 4.2000, val loss 4.2047
step 300: train loss 2.9418, val loss 2.9574
step 600: train loss 2.6309, val loss 2.6456
step 900: train loss 2.5385, val loss 2.5451
step 1200: train loss 2.5040, val loss 2.5033
step 1500: train loss 2.4695, val loss 2.4808
step 1800: train loss 2.4520, val loss 2.4641
step 2100: train loss 2.4385, val loss 2.4422
step 2400: train loss 2.4241, val loss 2.4419
step 2700: train loss 2.4122, val loss 2.4340
step 3000: train loss 2.4227, val loss 2.4241
step 3300: train loss 2.4159, val loss 2.4231
step 3600: train loss 2.4055, val loss 2.4177
step 3900: train loss 2.4023, val loss 2.4179
step 4200: train loss 2.3958, val loss 2.4302
step 4500: train loss 2.3888, val loss 2.4024
step 4800: train loss 2.3838, val loss 2.4043
ARI
HAThan, thivet'ly meirar ysuch, fo thest
K
CAV:
HEDIERESTAD Pfr:
BENENENLING CENNUCINA IO: yond houstoruloubleles, be sordweloe yot bed thed Ond Doue bou
His, to I lllat;
Kind yo Icad imoralceng, wpatsl my peer furs bust hasthe sor merals of ss whacer gaers, ver: biolllay seirost fat:
Thones som, wer ran
Toupls om?
Sll: hy athithos benciet bs fof.
Amathalcbenou liles caly, detave shpreearg
Whe wad wolf sd it totosthe, wank th in thit shin
S ht ande conit iul cert gpapam oun llkeavilint hmy r
#+end_example

Still no coherent output, but our loss is down.
** [[https://www.youtube.com/watch?v=kCc8FmEb1nY&t=4919s][Multi-headed Self-Attention]]
Basically consists of applying multiple heads of self-attention, and concatenating their results.
#+begin_src python
class MultiHeadAttention(nn.Module):
    """multiple heads of self-attention in parallel"""

    def __init__(self, num_heads, head_size):
        super().__init__()
        self.heads = nn.ModuleList([Head(head_size) for _ in range(num_heads)])

    def forward(self, x):
        return torch.cat([h(x) for h in self.heads], dim=-1)
#+end_src

This can be done by simply creating multiple heads, keeping them in an ~nn.ModuleList~, running the input through each one and concatenating the results along the channel dimension.
Now, we go back to our model and add our new attention method:
#+begin_src python
self.sa_heads = MultiHeadAttention(4, n_embd//4) # 4 heads of 8-dimensional self-attention
#+end_src

** [[https://www.youtube.com/watch?v=kCc8FmEb1nY&t=5065s][Feed Forward Layers]]
In a similar fashion to what is presented in the paper (a multi-layer perceptron), we'll add per-token computation to the network.
#+begin_src python
class FeedForward(nn.Module):
    """a simple linear layer followed by a non-linearity"""

    def __init__(self, n_embd):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_embd, n_embd),
            nn.ReLU(),
        )

    def forward(self, x):
        return self.net(x)
#+end_src

** [[https://www.youtube.com/watch?v=kCc8FmEb1nY&t=5208s][Residual Connections]]
Our next step is to connect the communication with the computation.
#+begin_src python
class Block(nn.Module):
    """Transformer block: communication followed by computation"""

    def __init__(self, n_embd, n_head):
        # n_embd: embedding dimension, n_head: the number of heads we'd like
        super().__init__()
        head_size = n_embd // n_head
        self.sa = MultiHeadAttention(n_head, head_size)
        self.ffwd = FeedForward(n_embd)

    def forward(self, x):
        x = self.sa(x)
        x = self.ffwd(x)
        return x
#+end_src

With our new class, we can introduce this communication-followed-by-computation process to our model:
#+begin_src python
        self.position_embedding_table = nn.Embedding(block_size, n_embd)
        self.blocks = nn.Sequential(
            Block(n_embd, n_head=4),
            Block(n_embd, n_head=4),
            Block(n_embd, n_head=4),
        )
        self.lm_head = nn.Linear(n_embd, vocab_size)

    def forward(self, idx, targets=None):
        ...
        x = tok_emb + pos_emb # (B, T, C)
        x = self.blocks(x) # (B, T, C)
#+end_src

This is starting to get into deep neural network territory, so, following the paper, we have to optimize further to keep getting good results. From the paper "Deep Residual Learning for Image Recognition"[fn:5], we introduce residual connections, which give the data two pathways: one through the computation and one that skips around it. At first the skip path dominates and gradients flow straight through it back to the input; the residual blocks gradually contribute more over the course of training, which helps optimization.
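In equations, each block's forward pass with the skip connections becomes:
\[x \leftarrow x + \operatorname{SelfAttention}(x), \qquad x \leftarrow x + \operatorname{FeedForward}(x)\]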
#+begin_src python
class MultiHeadAttention(nn.Module):
    ...

    def __init__(self, num_heads, head_size):
        ...
        self.proj = nn.Linear(n_embd, n_embd)

    def forward(self, x):
        out = torch.cat([h(x) for h in self.heads], dim=-1) # (B, T, C)
        out = self.proj(out)
        return out
#+end_src

#+name: FeedForward-bigram
#+begin_src python
class FeedForward(nn.Module):
    """a simple linear layer followed by a non-linearity"""

    def __init__(self, n_embd):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_embd, 4 * n_embd),
            nn.ReLU(),
            nn.Linear(4 * n_embd, n_embd),
        )

    def forward(self, x):
        return self.net(x)
#+end_src

#+begin_src python
class Block(nn.Module):
    ...

    def forward(self, x):
        x = x + self.sa(x)
        x = x + self.ffwd(x)
        return x
#+end_src

** [[https://www.youtube.com/watch?v=kCc8FmEb1nY&t=5571s][LayerNorm]]
An additional optimization strategy is layer normalization[fn:6].
#+begin_src python
class BatchNorm1d:

    def __init__(self, dim, eps=1e-5, momentum=0.1):
        self.eps = eps
        # parameters (trained with backprop)
        self.gamma = torch.ones(dim)
        self.beta = torch.zeros(dim)

    def __call__(self, x):
        # calculate the forward pass
        xmean = x.mean(1, keepdim=True) # mean over the last dimension (per row)
        xvar = x.var(1, keepdim=True) # variance over the last dimension (per row)
        xhat = (x - xmean) / torch.sqrt(xvar + self.eps) # normalize to unit variance
        self.out = self.gamma * xhat + self.beta
        return self.out

    def parameters(self):
        return [self.gamma, self.beta]

torch.manual_seed(1337)
module = BatchNorm1d(100)
x = torch.randn(32, 100) # batch_size 32 of 100-dimensional vectors
x = module(x)
x.shape
#+end_src

Because we normalize over dimension 1 (each row, i.e. each example) instead of over the batch, this behaves as layer normalization. We also deviate from the paper by applying the normalization *before* the transformation (~FeedForward~), the "pre-norm" formulation, rather than after it.
* Final Result
This is the final result, after following the video. For an updated implementation, see nanoGPT[fn:8].
#+begin_src python :session final :noweb yes :mkdirp yes :tangle ~/Projects/Code/study/nanoGPT-from-scratch/bigram.py
<>
<>
# hyperparameters
batch_size = 64 # how many independent sequences will we process in parallel?
block_size = 256 # what is the maximum content length for predictions?
max_itters = 5000
eval_interval = 500
learning_rate = 3e-4
device = 'cuda' if torch.cuda.is_available() else 'cpu'
eval_iters = 200
n_embd = 384 # 384/6 = 64
n_head = 6 # Every head has 64 dimensions
n_layer = 6
dropout = 0.2
# ----------------
#+end_src

Dropout[fn:7] randomly shuts off a fraction of the neurons on every training pass, training the model without them. This simulates training an ensemble of sub-networks; at test time every neuron is enabled again. It was added here because the model is being scaled up considerably, and dropout helps regularize it.
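A small hedged illustration of the two modes (an aside, not part of the script): dropout zeroes a random subset of activations in training mode and is a no-op in evaluation mode.
#+begin_src python
import torch
import torch.nn as nn

drop = nn.Dropout(0.2)
x = torch.ones(8)
drop.train()
print(drop(x))  # roughly 20% of entries zeroed, the rest scaled by 1 / (1 - 0.2) = 1.25
drop.eval()
print(drop(x))  # unchanged: dropout does nothing at evaluation/test time
#+end_src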
#+begin_src python :session final :noweb yes :mkdirp yes :tangle ~/Projects/Code/study/nanoGPT-from-scratch/bigram.py :results output file :file output.org
<>
<>
<>
<>
<>
<>
class Head(nn.Module):
    """ one head of self-attention """

    def __init__(self, head_size):
        super().__init__()
        self.key = nn.Linear(n_embd, head_size, bias=False)
        self.query = nn.Linear(n_embd, head_size, bias=False)
        self.value = nn.Linear(n_embd, head_size, bias=False)
        self.register_buffer('tril', torch.tril(torch.ones(block_size, block_size)))
        self.dropout = nn.Dropout(dropout)

    def forward(self, x):
        B, T, C = x.shape
        k = self.key(x) # (B, T, C)
        q = self.query(x) # (B, T, C)
        # compute attention scores ("affinities")
        wei = q @ k.transpose(-2, -1) * C**-0.5 # (B, T, C) @ (B, C, T) -> (B, T, T)
        wei = wei.masked_fill(self.tril[:T, :T] == 0, float('-inf')) # (B, T, T)
        wei = F.softmax(wei, dim=-1) # (B, T, T)
        wei = self.dropout(wei)
        # perform the weighted aggregation of the values
        v = self.value(x) # (B, T, C)
        out = wei @ v # (B, T, T) @ (B, T, C) -> (B, T, C)
        return out

class MultiHeadAttention(nn.Module):
    """ multiple heads of self-attention in parallel """

    def __init__(self, num_heads, head_size):
        super().__init__()
        self.heads = nn.ModuleList([Head(head_size) for _ in range(num_heads)])
        self.proj = nn.Linear(n_embd, n_embd)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x):
        out = torch.cat([h(x) for h in self.heads], dim=-1)
        out = self.dropout(self.proj(out))
        return out

class FeedForward(nn.Module):
    """ a simple linear layer followed by a non-linearity """

    def __init__(self, n_embd):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_embd, 4 * n_embd),
            nn.ReLU(),
            nn.Linear(4 * n_embd, n_embd),
            nn.Dropout(dropout),
        )

    def forward(self, x):
        return self.net(x)

class Block(nn.Module):
    """Transformer block: communication followed by computation"""

    def __init__(self, n_embd, n_head):
        # n_embd: embedding dimension, n_head: the number of heads we'd like
        super().__init__()
        head_size = n_embd // n_head
        self.sa = MultiHeadAttention(n_head, head_size)
        self.ffwd = FeedForward(n_embd)
        self.ln1 = nn.LayerNorm(n_embd)
        self.ln2 = nn.LayerNorm(n_embd)

    def forward(self, x):
        # residual connections around both sub-layers, with pre-norm
        x = x + self.sa(self.ln1(x))
        x = x + self.ffwd(self.ln2(x))
        return x

class BigramLanguageModel(nn.Module):

    def __init__(self, vocab_size):
        super().__init__()
        # each token directly reads off the logits for the next token
        # from a lookup table
        self.token_embedding_table = nn.Embedding(vocab_size, n_embd)
        self.position_embedding_table = nn.Embedding(block_size, n_embd)
        self.blocks = nn.Sequential(*[Block(n_embd, n_head=n_head) for _ in range(n_layer)])
        self.ln_f = nn.LayerNorm(n_embd) # final layer normalization
        self.lm_head = nn.Linear(n_embd, vocab_size)

    def forward(self, idx, targets=None):
        B, T = idx.shape

        # idx and targets are both (B, T) tensor of integers
        tok_emb = self.token_embedding_table(idx) # (B, T, C)
        pos_emb = self.position_embedding_table(torch.arange(T, device=device)) # (T, C)
        x = tok_emb + pos_emb # (B, T, C)
        x = self.blocks(x) # (B, T, C)
        x = self.ln_f(x) # (B, T, C)
        logits = self.lm_head(x) # (B, T, vocab_size)

        if targets is None:
            loss = None
        else:
            B, T, C = logits.shape
            logits = logits.view(B * T, C)
            targets = targets.view(B * T)
            loss = F.cross_entropy(logits, targets)

        return logits, loss

    def generate(self, idx, max_new_tokens):
        # idx is (B, T) array of indices in the current context
        for _ in range(max_new_tokens):
            # crop idx to the last block_size tokens
            idx_cond = idx[:, -block_size:]
            # get the predictions
            logits, loss = self(idx_cond)
            # focus only on the last time step
            logits = logits[:, -1, :] # becomes (B, C)
            # apply softmax to get probabilities
            probs = F.softmax(logits, dim=-1) # (B, C)
            # sample from the distribution
            idx_next = torch.multinomial(probs, num_samples=1) # (B , 1)
            # append sampled index to the running sequence
            idx = torch.cat((idx, idx_next), dim=1) # (B, T+1)
        return idx

model = BigramLanguageModel(vocab_size)
m = model.to(device)

optimizer = torch.optim.AdamW(m.parameters(), lr=learning_rate)

for iter in range(max_itters):

    # every once in a while evaluate the loss on train and val sets
    if iter % eval_interval == 0 or iter == max_itters - 1:
        losses = estimate_loss()
        print(f"step {iter}: train loss {losses['train']:.4f}, val loss {losses['val']:.4f}")

    # sample a batch of data
    xb, yb = get_batch('train')

    # evaluate the loss
    logits, loss = model(xb, yb)
    optimizer.zero_grad(set_to_none=True)
    loss.backward()
    optimizer.step()

context = torch.zeros((1, 1), dtype=torch.long, device=device)
print(decode(m.generate(context, max_new_tokens=500)[0].tolist()))
#+end_src

* Notes
** [[https://www.youtube.com/watch?v=kCc8FmEb1nY&t=6159s][Encoder vs. Decoder vs. Transformer]]
What was actually implemented in this lecture is a decoder-only transformer. This is because we are only *generating* text unconditionally (continuously generating in the style of the dataset). What makes it a decoder is the triangular mask we pass in the transformer, while encoders pass no mask, so that all tokens can communicate freely (remember, decoding uses masking to hide future tokens). Transformers with both block types condition the decoder on the encoder's output and start and end decoding with special tokens.
** [[https://www.youtube.com/watch?v=kCc8FmEb1nY&t=6533s][OpenAI GPT]]
Our dataset has about 1 million *tokens*, as that's roughly its number of characters and we tokenize per character. This is different from how OpenAI measures things, since their tokens are based on sub-words instead. Measured with OpenAI's vocabulary, the same dataset would be roughly 300 thousand tokens.
* References
[fn:1] Vaswani, A. et al. (2023) Attention is all you need. Available at: https://doi.org/10.48550/arXiv.1706.03762 (Accessed: November 11, 2024).
[fn:2] Brown, T.B. et al. (2020) Language models are few-shot learners. Available at: https://doi.org/10.48550/arXiv.2005.14165 (Accessed: November 11, 2024).
[fn:3] “Google/sentencepiece” (2024). Google. Available at: https://github.com/google/sentencepiece (Accessed: November 12, 2024).
[fn:4] “Openai/tiktoken” (2024). OpenAI. Available at: https://github.com/openai/tiktoken (Accessed: November 12, 2024).
[fn:5] He, K. et al. (2015) Deep residual learning for image recognition. Available at: https://doi.org/10.48550/arXiv.1512.03385 (Accessed: November 11, 2024).
[fn:6] Ba, J.L., Kiros, J.R. and Hinton, G.E. (2016) Layer normalization. Available at: https://doi.org/10.48550/arXiv.1607.06450 (Accessed: November 11, 2024).
[fn:7] Srivastava, N. et al. (2014) “Dropout: A simple way to prevent neural networks from overfitting,” The journal of machine learning research, Volume 15(1), pp. 1929–1958. Available at: https://www.cs.toronto.edu/~hinton/absps/JMLRdropout.pdf (Accessed: November 16, 2024).
[fn:8] Karpathy, A. (2024) “Karpathy/nanogpt.” Available at: https://github.com/karpathy/nanoGPT (Accessed: October 28, 2024).
# Local Variables:
# eval: (add-hook 'after-save-hook #'org-babel-tangle t t)
# End: