https://github.com/shubham0204/language-modelling-with-pytorch
Using the Transformer model to generate English sentences, given a prompt - in PyTorch
- Host: GitHub
- URL: https://github.com/shubham0204/language-modelling-with-pytorch
- Owner: shubham0204
- License: gpl-3.0
- Created: 2023-06-02T10:39:03.000Z (over 2 years ago)
- Default Branch: main
- Last Pushed: 2024-04-04T15:24:03.000Z (over 1 year ago)
- Last Synced: 2025-01-23T13:43:54.430Z (9 months ago)
- Language: Python
- Homepage:
- Size: 24.7 MB
- Stars: 2
- Watchers: 2
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
# Implementing Transformers with PyTorch - GPT (Decoder)
> Language modelling with the WikiHow dataset by implementing a transformer model (decoder) in PyTorch. Text is read
> from a static source (a text file), processed, and tokenized for training. After training, the model can be used for
> inference from a CLI tool, a web app, and an HTTP API.

## Features
- Data, training and model configuration managed with a single TOML file
- HTTP API ([FastAPI](https://fastapi.tiangolo.com/)) and web application ([Streamlit](https://streamlit.io/)) for testing the model
- Metrics logging with [Weights & Biases](https://wandb.ai/site)

## Setup
1. Clone the repository and create a new Python virtual environment,
```commandline
$> git clone --depth=1 https://github.com/shubham0204/language-modelling-with-pytorch
$> cd language-modelling-with-pytorch
$> python -m venv project_env
$> source project_env/bin/activate
```
After activating the virtual environment, install the Python dependencies,
```commandline
$> (project_env) pip install -r requirements.txt
```

2. The dataset is included in the `dataset/` directory as a text file. The next step is to execute the `process_data.py` script,
which reads the articles from the text file and transforms them into a tokenized corpus of sentences.
The script requires an input directory (where the text file is located) and an output directory in which to store
the tokenized corpus along with the `index2word` and `word2index` mappings, serialized with Python's `pickle`
(a loading sketch follows the command below). The input and output directories are specified with `data_path` and
`data_tensors_path` respectively in the project's configuration file `project_config.toml`; `data_path` defaults to
the `dataset/` directory in the project's root.
```commandline
$> (project_env) python process_data.py
```
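For reference, the serialized mappings can later be loaded back with `pickle`. A minimal sketch, assuming the mappings are stored as one pickle file each under `data_tensors_path` (the actual file names used by `process_data.py` may differ):

```python
import pickle
from pathlib import Path

# Hypothetical file names - process_data.py may serialize the
# mappings under different names inside data_tensors_path.
data_tensors_path = Path("data_tensors")
with open(data_tensors_path / "index2word.pkl", "rb") as f:
    index2word = pickle.load(f)
with open(data_tensors_path / "word2index.pkl", "rb") as f:
    word2index = pickle.load(f)
```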
3. After the `process_data.py` script has run, the `vocab_size` parameter in `project_config.toml` is updated automatically.
We're now ready to train the model. The parameters within the `train` group in `project_config.toml` are used to
control the training process. See [Understanding `project_config.toml`](#understanding-projectconfigtoml) to get more
details about the `train` parameters.

## Inference and pretrained model
## Understanding `project_config.toml`
The `project_config.toml` file contains all settings required by the project which are used by nearly all scripts in the
project. Having a global configuration in a TOML file enhances control over the project and provides a *sweet spot*
where all settings can be viewed and modified at once.

> While developing the project, I wrote a blog post - [Managing Deep Learning Models Easily With TOML Configurations](https://towardsdatascience.com/managing-deep-learning-models-easily-with-toml-configurations-fb680b9deabe). Do check it out.
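For instance, the whole configuration can be read with Python's built-in `tomllib` (Python 3.11+). A minimal sketch, not necessarily how the project's own scripts load it:

```python
import tomllib

# Read the global configuration once; every setting described below
# is then available as config["<group>"]["<key>"].
with open("project_config.toml", "rb") as f:
    config = tomllib.load(f)

batch_size = config["train"]["batch_size"]
seq_length = config["data"]["seq_length"]
```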
The following sections describe each setting that can be changed through `project_config.toml`.
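For orientation, an illustrative skeleton of the file is shown below. The keys come from the sections that follow; the values are placeholders, not recommended settings:

```toml
[train]
num_train_iter = 10000
num_test_iter = 100
test_interval = 500
batch_size = 32
learning_rate = 0.001
checkpoint_path = "checkpoints/"
wandb_logging_enabled = false
wandb_project_name = "language-modelling"
resume_training = false
resume_training_checkpoint_path = ""
compile_model = false

[data]
vocab_size = 0            # set by process_data.py
test_split = 0.1
seq_length = 128
data_path = "dataset/"
data_tensors_path = "data_tensors/"

[model]
embedding_dim = 256
num_blocks = 4
num_heads_in_block = 4
dropout = 0.1

[deploy]
host = "127.0.0.1"
port = 8000
```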
### `train` configuration
- `num_train_iter` (`int`): The number of iterations to be performed on the training dataset. Note that an iteration refers to the forward
pass of a single batch of data, followed by a backward pass to update the parameters.
- `num_test_iter` (`int`): The number of iterations to be performed on the test dataset.
- `test_interval` (`int`): The number of iterations after which testing should be performed.
- `batch_size` (`int`): Number of samples present in a batch.
- `learning_rate` (`float`): The learning rate used by the `optimizer` in `train.py`.
- `checkpoint_path` (`str`): Path where checkpoints will be saved during training.
- `wandb_logging_enabled` (`bool`): Enable/disable logging to the Weights & Biases console in `train.py`.
- `wandb_project_name` (`str`): If logging is enabled, the name of the W&B `project` to log to (see the sketch after this list).
- `resume_training` (`bool`): If `True`, a checkpoint will be loaded and training will be resumed.
- `resume_training_checkpoint_path` (`str`): If `resume_training = True`, the checkpoint will be loaded from this path.
- `compile_model` (`bool`): Whether to use `torch.compile` to speed up training in `train.py`.
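As a point of reference, the two W&B settings above typically translate into calls like the following; this is a sketch, not the exact code in `train.py`:

```python
import wandb

def setup_logging(train_cfg: dict) -> None:
    # train_cfg is the [train] table from project_config.toml
    if train_cfg["wandb_logging_enabled"]:
        wandb.init(project=train_cfg["wandb_project_name"])

def log_step(step: int, train_loss: float) -> None:
    # Log one scalar per training iteration to the W&B console
    wandb.log({"train_loss": train_loss}, step=step)
```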
### `data` configuration
- `vocab_size` (`int`): Number of tokens in the vocabulary. This value is set automatically when the `process_data.py` script is executed.
- `test_split` (`float`): Fraction of data used for testing the model.
- `seq_length` (`int`): Context length of the model. Input sequences of length `seq_length` are produced in `train.py` to train the model (see the sketch after this list).
- `data_path` (`str`): Path of the text file containing the articles.
- `data_tensors_path` (`str`): Path of the directory where the tensors of the processed data will be stored.
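For `seq_length`, the standard next-token windowing looks roughly like this; a sketch of the general technique, which `train.py` presumably implements in some form over the tokenized corpus:

```python
import torch

def make_example(tokens: torch.Tensor, i: int, seq_length: int):
    # The input is a window of seq_length tokens; the target is the
    # same window shifted right by one (next-token prediction).
    x = tokens[i : i + seq_length]
    y = tokens[i + 1 : i + seq_length + 1]
    return x, y
```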
### `model` configuration
- `embedding_dim` (`int`): Dimensions of the output embedding for `torch.nn.Embedding`.
- `num_blocks` (`int`): Number of blocks used in the transformer model. A single block contains `MultiHeadAttention`,
`LayerNorm` and `Linear` layers; see `layers.py` and the sketch after this list.
- `num_heads_in_block` (`int`): Number of heads used in `MultiHeadAttention`.
- `dropout` (`float`): Dropout rate for the transformer.
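A single block of the kind described above could look as follows. This sketch uses PyTorch's built-in `torch.nn.MultiheadAttention` as a stand-in for the repository's own `MultiHeadAttention`; the actual `layers.py` may be structured differently:

```python
import torch

class Block(torch.nn.Module):
    def __init__(self, embedding_dim: int, num_heads: int, dropout: float):
        super().__init__()
        self.attn = torch.nn.MultiheadAttention(
            embedding_dim, num_heads, dropout=dropout, batch_first=True
        )
        self.norm1 = torch.nn.LayerNorm(embedding_dim)
        self.norm2 = torch.nn.LayerNorm(embedding_dim)
        self.ffn = torch.nn.Sequential(
            torch.nn.Linear(embedding_dim, 4 * embedding_dim),
            torch.nn.GELU(),
            torch.nn.Linear(4 * embedding_dim, embedding_dim),
            torch.nn.Dropout(dropout),
        )

    def forward(self, x: torch.Tensor, attn_mask=None) -> torch.Tensor:
        # Self-attention with a residual connection
        attn_out, _ = self.attn(x, x, x, attn_mask=attn_mask)
        x = self.norm1(x + attn_out)
        # Position-wise feed-forward with a residual connection
        return self.norm2(x + self.ffn(x))
```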
### `deploy` configuration
- `host` (`str`): Host IP used to deploy the API endpoints.
- `port` (`int`): Port through which the API endpoints will be exposed.

## Deployment
The trained model can be deployed in two ways,
* As a [Streamlit](https://streamlit.io/) app
* As an API endpoint with [FastAPI](https://fastapi.tiangolo.com/)

### Web app with Streamlit
The model can be used from a [Streamlit](https://streamlit.io/) app easily,
```commandline
$> (project_env) streamlit run app.py
```

In the app, select the model to use for inference, the `data_tensors_path` (required for tokenization), and other
parameters such as the number of words to generate and the temperature (see the sketch below).
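These generation knobs also appear as query parameters of the HTTP API below. Conceptually, the temperature rescales the logits before a token is sampled; a minimal sketch of this standard technique, not the repository's exact code:

```python
import torch

def sample_next_token(logits: torch.Tensor, temperature: float) -> int:
    # Higher temperature flattens the distribution (more varied output);
    # lower temperature sharpens it (more deterministic output).
    probs = torch.softmax(logits / temperature, dim=-1)
    return int(torch.multinomial(probs, num_samples=1))
```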
### API Endpoints with FastAPI
The model can also be accessed through a REST API built with [FastAPI](https://fastapi.tiangolo.com/),
```commandline
$> (project_env) uvicorn api:server
```

A `GET` request to the `/predict` endpoint with the query parameters `prompt`, `num_tokens` and `temperature`
generates a response from the model,

```commandline
curl --location 'http://127.0.0.1:8000/predict?prompt=prompt_text_here&temperature=1.0&num_tokens=100'
```
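For reference, an endpoint with this shape could be defined as follows. The `server` name matches the `uvicorn api:server` command and the query parameters match the request above, but `generate_text` is a hypothetical stand-in for the model's actual generation routine:

```python
# A sketch of the /predict endpoint - not the repository's api.py
from fastapi import FastAPI

server = FastAPI()

def generate_text(prompt: str, num_tokens: int, temperature: float) -> str:
    # Hypothetical placeholder for the trained model's generation loop
    raise NotImplementedError

@server.get("/predict")
def predict(prompt: str, num_tokens: int = 100, temperature: float = 1.0) -> dict:
    return {"response": generate_text(prompt, num_tokens, temperature)}
```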