
# Language Model Project: Bigram and GPT-style Implementation

This project explores two distinct language modeling techniques: a simple Bigram Language Model and a more sophisticated GPT-style Transformer model. Both models are implemented from scratch in PyTorch, and the GPT model is trained on the **OpenWebText** dataset, following the principles of GPT-2.

## Models

### 1. Bigram Language Model

The **Bigram Language Model** is a simple and foundational approach to language modeling. It predicts the next token in a sequence using only the immediately preceding token, effectively learning bigram (token-pair) statistics from the corpus.

#### Features:
- **Token Embedding**: Maps each token to a vector of size equal to the vocabulary size, so each embedding row doubles as the next-token logits for that token.
- **Next-token Prediction**: Predicts the next token based only on the preceding one, using logits read directly from the embedding table.
- **Cross-Entropy Loss**: During training, the model computes the cross-entropy loss between the predicted and actual tokens.

#### Bigram Model Code Overview:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class BigramLanguageModel(nn.Module):
    def __init__(self, vocab_size):
        super().__init__()
        # Each row of the embedding table holds the next-token logits for that token.
        self.token_embedding = nn.Embedding(vocab_size, vocab_size)

    def forward(self, index, targets=None):
        # index: (B, T) tensor of token ids -> logits: (B, T, vocab_size)
        logits = self.token_embedding(index)

        if targets is None:
            return logits, None

        # Flatten the batch and time dimensions for cross-entropy.
        B, T, C = logits.shape
        logits = logits.view(B * T, C)
        targets = targets.view(B * T)
        loss = F.cross_entropy(logits, targets)
        return logits, loss

    def generate(self, index, max_new_tokens):
        for _ in range(max_new_tokens):
            logits, _ = self.forward(index)
            # Keep only the logits for the last time step, then sample the next token.
            logits = logits[:, -1, :]
            probs = F.softmax(logits, dim=-1)
            index_next = torch.multinomial(probs, num_samples=1)
            index = torch.cat([index, index_next], dim=-1)
        return index
```
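The README does not show the training loop that produces this loss. A minimal sketch of how training might proceed, assuming a hypothetical `get_batch` helper that returns `(index, targets)` pairs of shape `(batch_size, block_size)` from the encoded corpus (a concrete sketch of such a helper appears in the Dataset section below; the names and values here are illustrative, not taken from the repository):

```python
# Illustrative training loop; vocab_size, max_iters, and get_batch are assumed
# to be defined elsewhere.
model = BigramLanguageModel(vocab_size)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)

for step in range(max_iters):
    xb, yb = get_batch('train')   # (B, T) token ids and their next-token targets
    logits, loss = model(xb, yb)
    optimizer.zero_grad(set_to_none=True)
    loss.backward()
    optimizer.step()
```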

### 2. GPT-style Language Model

The **GPT-style Language Model** is a more advanced model based on the Transformer architecture. It uses **multi-head self-attention** to capture long-range dependencies in the text, allowing for more accurate next-word prediction.

#### Features:
- **Self-Attention Mechanism**: Each token in the sequence attends to all previous tokens using scaled dot-product attention.
- **Multi-Head Attention**: Captures different attention patterns by using multiple attention heads.
- **Feed-Forward Layers**: Non-linear transformations are applied after attention layers to learn complex patterns.
- **Layer Normalization**: Stabilizes the learning process by normalizing each layer's output.
- **Text Generation**: The model can generate new sequences by sampling from the probability distribution over the vocabulary.

#### GPT Model Code Overview:

```python
# Assumes n_embd, n_head, n_layer, block_size, and device are defined as
# hyperparameters elsewhere, and that Block is a Transformer block
# (multi-head self-attention + feed-forward; see the sketch below).
class GPTLanguageModel(nn.Module):
    def __init__(self, vocab_size):
        super().__init__()
        self.token_embedding = nn.Embedding(vocab_size, n_embd)
        self.position_embedding = nn.Embedding(block_size, n_embd)
        self.blocks = nn.Sequential(*[Block(n_embd, n_head=n_head) for _ in range(n_layer)])
        self.ln_f = nn.LayerNorm(n_embd)              # final layer norm
        self.ln_head = nn.Linear(n_embd, vocab_size)  # projection to vocabulary logits
        self.apply(self._init_weights)

    def _init_weights(self, module):
        if isinstance(module, nn.Linear):
            torch.nn.init.normal_(module.weight, mean=0.0, std=0.02)
            if module.bias is not None:
                torch.nn.init.zeros_(module.bias)
        elif isinstance(module, nn.Embedding):
            torch.nn.init.normal_(module.weight, mean=0.0, std=0.02)

    def forward(self, index, targets=None):
        B, T = index.shape
        tok_emb = self.token_embedding(index)                              # (B, T, n_embd)
        pos_emb = self.position_embedding(torch.arange(T, device=device))  # (T, n_embd)
        x = tok_emb + pos_emb
        x = self.blocks(x)
        x = self.ln_f(x)
        logits = self.ln_head(x)                                           # (B, T, vocab_size)

        if targets is None:
            return logits, None

        B, T, C = logits.shape
        logits = logits.view(B * T, C)
        targets = targets.view(B * T)
        loss = F.cross_entropy(logits, targets)
        return logits, loss

    def generate(self, index, max_new_tokens):
        for _ in range(max_new_tokens):
            # Crop the context to the last block_size tokens.
            index_cond = index[:, -block_size:]
            logits, _ = self.forward(index_cond)
            logits = logits[:, -1, :]
            probs = F.softmax(logits, dim=-1)
            index_next = torch.multinomial(probs, num_samples=1)
            index = torch.cat([index, index_next], dim=-1)
        return index
```
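
The `GPTLanguageModel` above composes `Block` modules that are not shown in this README. Below is a minimal sketch of what such a block might look like, assuming the standard decoder-only design described in the feature list (causal scaled dot-product attention, multiple heads, a feed-forward layer, layer normalization in the pre-norm arrangement); dropout is omitted for brevity and the repository's actual implementation may differ:

```python
class Head(nn.Module):
    """One head of causal (masked) self-attention."""
    def __init__(self, head_size):
        super().__init__()
        self.key = nn.Linear(n_embd, head_size, bias=False)
        self.query = nn.Linear(n_embd, head_size, bias=False)
        self.value = nn.Linear(n_embd, head_size, bias=False)
        self.register_buffer('tril', torch.tril(torch.ones(block_size, block_size)))

    def forward(self, x):
        B, T, C = x.shape
        k = self.key(x)                                      # (B, T, head_size)
        q = self.query(x)                                    # (B, T, head_size)
        # Scaled dot-product attention scores.
        wei = q @ k.transpose(-2, -1) * k.shape[-1] ** -0.5  # (B, T, T)
        # Mask out future positions so each token attends only to the past.
        wei = wei.masked_fill(self.tril[:T, :T] == 0, float('-inf'))
        wei = F.softmax(wei, dim=-1)
        return wei @ self.value(x)                           # (B, T, head_size)


class MultiHeadAttention(nn.Module):
    """Several attention heads run in parallel, then projected back to n_embd."""
    def __init__(self, num_heads, head_size):
        super().__init__()
        self.heads = nn.ModuleList([Head(head_size) for _ in range(num_heads)])
        self.proj = nn.Linear(num_heads * head_size, n_embd)

    def forward(self, x):
        return self.proj(torch.cat([h(x) for h in self.heads], dim=-1))


class FeedForward(nn.Module):
    """Position-wise feed-forward network."""
    def __init__(self, n_embd):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_embd, 4 * n_embd),
            nn.ReLU(),
            nn.Linear(4 * n_embd, n_embd),
        )

    def forward(self, x):
        return self.net(x)


class Block(nn.Module):
    """Transformer block: multi-head self-attention followed by a feed-forward
    network, each wrapped in layer normalization and a residual connection."""
    def __init__(self, n_embd, n_head):
        super().__init__()
        head_size = n_embd // n_head
        self.sa = MultiHeadAttention(n_head, head_size)
        self.ffwd = FeedForward(n_embd)
        self.ln1 = nn.LayerNorm(n_embd)
        self.ln2 = nn.LayerNorm(n_embd)

    def forward(self, x):
        x = x + self.sa(self.ln1(x))    # pre-norm residual around attention
        x = x + self.ffwd(self.ln2(x))  # pre-norm residual around feed-forward
        return x
```

Stacking `n_layer` of these blocks, as the model's `nn.Sequential` does, gives the full decoder.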

## Dataset

The **OpenWebText** dataset is used for training the GPT-style model. OpenWebText is a publicly available dataset that closely mimics the original web scrape used to train GPT-2, making it ideal for large-scale language modeling tasks.
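
The README does not show how OpenWebText is prepared for training. One common approach (an assumption here, not necessarily what this repository does) is to tokenize the corpus once, write it to disk as a flat array of token ids, and sample random fixed-length windows as batches; a helper like the hypothetical `get_batch` used in the training sketch above might then look like this:

```python
import numpy as np

# Assumes the corpus has already been tokenized and written to train.bin / val.bin
# as flat arrays of uint16 token ids; the file names, batch_size, block_size, and
# device are illustrative assumptions.
train_data = np.memmap('train.bin', dtype=np.uint16, mode='r')
val_data = np.memmap('val.bin', dtype=np.uint16, mode='r')

def get_batch(split):
    data = train_data if split == 'train' else val_data
    # Pick random starting positions and slice out (input, target) windows.
    ix = torch.randint(len(data) - block_size - 1, (batch_size,))
    x = torch.stack([torch.from_numpy(data[i:i + block_size].astype(np.int64)) for i in ix])
    y = torch.stack([torch.from_numpy(data[i + 1:i + 1 + block_size].astype(np.int64)) for i in ix])
    return x.to(device), y.to(device)
```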

## Getting Started

### Prerequisites

- Python 3.x
- PyTorch
- Jupyter Notebook (for running the code interactively)
- Required Python libraries: `torch`, `numpy`

### Installation

1. Clone the repository:
```bash
git clone https://github.com/johnnixon6972/llm.git
cd llm
```

2. Install required libraries:
```bash
pip install torch numpy
```

3. Run the Jupyter notebook:
```bash
jupyter notebook
```

### Usage

You can run both the Bigram Language Model and the GPT-style model by loading the respective classes in your environment.

#### Bigram Language Model:

```python
# index and targets are (batch, time) LongTensors of token ids
bigram_model = BigramLanguageModel(vocab_size)
logits, loss = bigram_model(index, targets)
generated_tokens = bigram_model.generate(index, max_new_tokens=10)
```

#### GPT Language Model:

```python
gpt_model = GPTLanguageModel(vocab_size)
logits, loss = gpt_model(index, targets)
# generate returns token ids; decode them with your tokenizer to get text
generated_ids = gpt_model.generate(index, max_new_tokens=50)
```
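
Both `generate` methods operate on token ids, so text must be encoded before generation and decoded afterwards. A minimal end-to-end sketch, assuming a character-level encode/decode pair built from the raw training text (`text` and `device` are assumed here; the repository's actual tokenizer may differ):

```python
# Hypothetical character-level tokenizer.
chars = sorted(set(text))
stoi = {ch: i for i, ch in enumerate(chars)}
itos = {i: ch for ch, i in stoi.items()}
encode = lambda s: [stoi[c] for c in s]
decode = lambda ids: ''.join(itos[i] for i in ids)

prompt = torch.tensor([encode("The meaning of life is")], dtype=torch.long, device=device)
generated = gpt_model.generate(prompt, max_new_tokens=50)
print(decode(generated[0].tolist()))
```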

## Future Improvements

- Fine-tune both models on domain-specific datasets.
- Add support for trigrams and higher-order n-grams for the Bigram model.
- Experiment with larger models and deeper Transformer blocks.
- Optimize training with techniques like mixed precision and distributed training (see the mixed-precision sketch below).
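
The last item mentions mixed precision; a minimal sketch of how automatic mixed precision could be wired into the training loop shown earlier, using `torch.cuda.amp` (illustrative only, not part of the current code):

```python
scaler = torch.cuda.amp.GradScaler()

for step in range(max_iters):
    xb, yb = get_batch('train')
    with torch.cuda.amp.autocast():
        logits, loss = model(xb, yb)   # forward pass runs in float16 where safe
    optimizer.zero_grad(set_to_none=True)
    scaler.scale(loss).backward()      # scale the loss to avoid gradient underflow
    scaler.step(optimizer)
    scaler.update()
```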