https://github.com/henilp105/tinystoriesgpt-5m

A PyTorch implementation of a Bigram Language Model using Transformer architecture for character-level text generation.

artificial-intelligence gpt machine-learning natural-language-processing tinystories transformers


# Transformer from Scratch (on TinyStories dataset)

## Overview

This project implements a character-level language model in PyTorch, built on a Transformer architecture and trained on the TinyStories dataset (https://arxiv.org/abs/2305.07759). The model class is named `BigramLanguageModel`, but it predicts the next character from the full context of preceding characters (up to `block_size`), not just the previous one. It does this with multi-head self-attention and feedforward layers.

## Features

- Token and position embeddings
- Multi-head self-attention mechanism
- Feedforward neural network layers
- Layer normalization
- Dropout for regularization
- Character-level text generation

## Requirements

- Python 3.x
- PyTorch 1.7 or later
- CUDA (optional, for GPU acceleration)

## Files

- `input.txt`: The input text used to train the language model, taken from TinyStories (https://arxiv.org/abs/2305.07759).
- `train.py`: The main script, containing the model implementation and the training loop.

## Hyperparameters

The following hyperparameters can be adjusted to tune the model:

- `batch_size`: Number of sequences processed in parallel (default: 2048)
- `block_size`: Maximum context length for predictions (default: 128)
- `max_iters`: Number of training iterations (default: 1000)
- `eval_interval`: Interval for evaluating the model on validation data (default: 100)
- `learning_rate`: Learning rate for the optimizer (default: 1e-3)
- `eval_iters`: Number of iterations for evaluation (default: 200)
- `n_embd`: Dimensionality of the embeddings (default: 128)
- `n_head`: Number of attention heads (default: 4)
- `n_layer`: Number of Transformer blocks (default: 4)
- `dropout`: Dropout rate (default: 0.0)
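
As a rough illustration, the defaults listed above could sit at the top of the script as plain module-level variables. The variable names mirror the list; the two derived quantities (`head_size` and `tokens_per_step`) are not part of the README and are added here only to show how the settings interact.

```python
# Defaults as listed in this README; the layout is an illustrative sketch.
batch_size = 2048      # sequences processed in parallel
block_size = 128       # maximum context length
max_iters = 1000       # training iterations
eval_interval = 100    # evaluate every N iterations
learning_rate = 1e-3
eval_iters = 200       # iterations used for evaluation
n_embd = 128           # embedding dimensionality
n_head = 4             # attention heads
n_layer = 4            # Transformer blocks
dropout = 0.0

# Derived values (assumptions for illustration, not names from the script).
# The embedding width must divide evenly across the heads.
assert n_embd % n_head == 0
head_size = n_embd // n_head
tokens_per_step = batch_size * block_size

print(head_size)        # 32
print(tokens_per_step)  # 262144
```

With these defaults each training step sees 262,144 characters, so reducing `batch_size` is the first thing to try if memory runs out.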

## Usage

### Preparing the Data

1. Place your training text in a file named `input.txt`.
2. Ensure that `input.txt` is in the same directory as `train.py`.

### Running the Script

To train the model and generate text, simply run:

```bash
python train.py
```

The script will:

1. Read the input text from `input.txt`.
2. Encode the text into integer sequences.
3. Split the data into training and validation sets.
4. Train the model for the specified number of iterations.
5. Periodically evaluate the model on the validation set and print the losses.
6. Generate text samples at regular intervals during training.
7. Print a final text sample after training is complete.

### Code Explanation

#### Data Preparation

- The input text is read from `input.txt`.
- Unique characters are extracted to create a vocabulary.
- Characters are mapped to integers for model processing.
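
The three steps above can be sketched in a few lines of plain Python. The helper names (`stoi`, `itos`, `encode`, `decode`) are assumptions chosen for illustration, and a short string stands in for the contents of `input.txt`.

```python
# Stand-in for the text read from input.txt (assumption: any plain text works).
text = "Once upon a time, there was a tiny cat."

# Vocabulary: the sorted set of unique characters in the corpus.
chars = sorted(set(text))
vocab_size = len(chars)

# Character <-> integer lookup tables (hypothetical names).
stoi = {ch: i for i, ch in enumerate(chars)}
itos = {i: ch for ch, i in stoi.items()}

def encode(s):
    """Map a string to a list of integer ids."""
    return [stoi[c] for c in s]

def decode(ids):
    """Map a list of integer ids back to a string."""
    return "".join(itos[i] for i in ids)

ids = encode("tiny cat")
print(ids)
print(decode(ids))
```

Encoding only works for characters that appear in the training text; anything else raises a `KeyError` in this sketch.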

#### Batch Generation

- `get_batch(split)`: Generates batches of input and target sequences for training and validation.
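
A minimal sketch of what such a function might look like is shown below. It assumes the encoded corpus is a 1-D tensor of token ids and uses a 90/10 train/validation split; the README does not state the actual split ratio, and the small `block_size` and `batch_size` values are for demonstration only.

```python
import torch

torch.manual_seed(0)

block_size = 8   # demo value; the README default is 128
batch_size = 4   # demo value; the README default is 2048

# Stand-in for the encoded corpus (assumption: a 1-D tensor of token ids).
data = torch.arange(100, dtype=torch.long)
n = int(0.9 * len(data))  # assumption: 90/10 split
train_data, val_data = data[:n], data[n:]

def get_batch(split):
    """Return inputs x and targets y, where y is x shifted one position right."""
    source = train_data if split == "train" else val_data
    # Random start offsets that leave room for block_size + 1 tokens.
    ix = torch.randint(len(source) - block_size, (batch_size,))
    x = torch.stack([source[i:i + block_size] for i in ix])
    y = torch.stack([source[i + 1:i + block_size + 1] for i in ix])
    return x, y

xb, yb = get_batch("train")
print(xb.shape, yb.shape)
```

Each target row is the input row shifted by one character, so every position in the block contributes a next-character prediction to the loss.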

#### Model Components

- `Head`: Implements a single head of self-attention.
- `MultiHeadAttention`: Combines multiple heads of self-attention.
- `FeedFoward`: A feedforward neural network layer.
- `Block`: A single Transformer block, combining self-attention and feedforward layers.
- `BigramLanguageModel`: The main language model combining embedding layers, Transformer blocks, and an output layer.
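
To make the attention components more concrete, here is a hedged sketch of how `Head` and `MultiHeadAttention` are commonly written for a decoder-only model of this kind. It is not copied from the script: the layer names, the bias-free projections, and the output projection are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
n_embd, n_head, block_size, dropout = 128, 4, 128, 0.0  # README defaults

class Head(nn.Module):
    """One head of causal self-attention (illustrative sketch)."""

    def __init__(self, head_size):
        super().__init__()
        self.key = nn.Linear(n_embd, head_size, bias=False)
        self.query = nn.Linear(n_embd, head_size, bias=False)
        self.value = nn.Linear(n_embd, head_size, bias=False)
        # Lower-triangular mask: position t may only attend to positions <= t.
        self.register_buffer("tril", torch.tril(torch.ones(block_size, block_size)))
        self.dropout = nn.Dropout(dropout)

    def forward(self, x):
        B, T, C = x.shape
        k, q, v = self.key(x), self.query(x), self.value(x)
        # Scaled dot-product scores, shape (B, T, T).
        wei = q @ k.transpose(-2, -1) * k.shape[-1] ** -0.5
        wei = wei.masked_fill(self.tril[:T, :T] == 0, float("-inf"))
        wei = self.dropout(F.softmax(wei, dim=-1))
        return wei @ v  # (B, T, head_size)

class MultiHeadAttention(nn.Module):
    """Several heads run in parallel, concatenated, then projected."""

    def __init__(self, num_heads, head_size):
        super().__init__()
        self.heads = nn.ModuleList([Head(head_size) for _ in range(num_heads)])
        self.proj = nn.Linear(num_heads * head_size, n_embd)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x):
        out = torch.cat([h(x) for h in self.heads], dim=-1)
        return self.dropout(self.proj(out))

x = torch.randn(2, 16, n_embd)  # (batch, time, channels)
mha = MultiHeadAttention(n_head, n_embd // n_head)
out = mha(x)
print(out.shape)  # torch.Size([2, 16, 128])
```

The triangular mask is what makes the model autoregressive: changing a later character cannot alter the output at an earlier position.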

#### Training and Evaluation

- The model is trained using the AdamW optimizer.
- Loss is computed using cross-entropy.
- The `estimate_loss()` function evaluates the model on both training and validation sets.
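
The loop below sketches how these three pieces could fit together. `TinyModel` is a hypothetical stand-in for the real model (it only needs to return `(logits, loss)`), the data is random, and the small settings are chosen so the example runs in a moment; none of this is taken verbatim from the script.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
vocab_size, block_size, batch_size = 16, 8, 4                    # demo values
eval_iters, max_iters, eval_interval, learning_rate = 5, 20, 10, 1e-3

data = torch.randint(vocab_size, (200,))  # random stand-in corpus
train_data, val_data = data[:180], data[180:]

def get_batch(split):
    src = train_data if split == "train" else val_data
    ix = torch.randint(len(src) - block_size, (batch_size,))
    x = torch.stack([src[i:i + block_size] for i in ix])
    y = torch.stack([src[i + 1:i + block_size + 1] for i in ix])
    return x, y

class TinyModel(nn.Module):
    """Hypothetical stand-in for the real model: returns (logits, loss)."""

    def __init__(self):
        super().__init__()
        self.table = nn.Embedding(vocab_size, vocab_size)

    def forward(self, idx, targets=None):
        logits = self.table(idx)  # (B, T, vocab_size)
        loss = None
        if targets is not None:
            # Cross-entropy expects (N, classes) logits and (N,) targets.
            loss = F.cross_entropy(logits.view(-1, vocab_size), targets.view(-1))
        return logits, loss

model = TinyModel()

@torch.no_grad()
def estimate_loss():
    """Average the loss over eval_iters batches for each split."""
    out = {}
    model.eval()
    for split in ("train", "val"):
        losses = torch.zeros(eval_iters)
        for k in range(eval_iters):
            xb, yb = get_batch(split)
            _, loss = model(xb, yb)
            losses[k] = loss.item()
        out[split] = losses.mean().item()
    model.train()
    return out

optimizer = torch.optim.AdamW(model.parameters(), lr=learning_rate)
for it in range(max_iters):
    if it % eval_interval == 0:
        losses = estimate_loss()
        print(f"step {it}: train loss {losses['train']:.4f}, val loss {losses['val']:.4f}")
    xb, yb = get_batch("train")
    _, loss = model(xb, yb)
    optimizer.zero_grad(set_to_none=True)
    loss.backward()
    optimizer.step()
```

Averaging over several batches gives a much less noisy loss estimate than reporting the loss of the single most recent training batch.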

#### Text Generation

- The `generate()` method of `BigramLanguageModel` generates text by sampling from the learned distribution of next characters.
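
A sampling loop of this kind can be sketched as follows. To keep the example self-contained, `toy_logits` is a hypothetical stand-in that returns random logits in place of a trained model, and `generate` is written as a free function here instead of a method.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
vocab_size, block_size = 16, 8  # demo values

def toy_logits(idx):
    """Hypothetical stand-in for model(idx): random logits, shape (B, T, vocab_size)."""
    B, T = idx.shape
    return torch.randn(B, T, vocab_size)

@torch.no_grad()
def generate(idx, max_new_tokens):
    """Append max_new_tokens sampled ids to idx, one at a time."""
    for _ in range(max_new_tokens):
        idx_cond = idx[:, -block_size:]            # crop context to block_size
        logits = toy_logits(idx_cond)[:, -1, :]    # keep only the last time step
        probs = F.softmax(logits, dim=-1)          # logits -> probabilities
        idx_next = torch.multinomial(probs, num_samples=1)  # sample, shape (B, 1)
        idx = torch.cat([idx, idx_next], dim=1)    # extend the running sequence
    return idx

start = torch.zeros((1, 1), dtype=torch.long)  # a single starting token id
out = generate(start, max_new_tokens=20)
print(out.shape)  # torch.Size([1, 21])
```

Sampling with `torch.multinomial` keeps the output varied between runs; taking the argmax instead would make generation deterministic but repetitive.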

## Example Output

During training, the script will periodically print text samples generated by the model. Here is an example of what you might see:

```
...
step 0: train loss 4.1234, val loss 4.5678
Generated text: "Sample text generated by the model..."
...
```