https://github.com/zphang/minimal-opt

Host: GitHub
URL: https://github.com/zphang/minimal-opt
Owner: zphang
Created: 2022-07-08T19:55:57.000Z (about 4 years ago)
Default Branch: main
Last Pushed: 2022-08-24T04:58:10.000Z (almost 4 years ago)
Last Synced: 2025-03-27T01:46:33.359Z (over 1 year ago)
Language: Python
Size: 19.5 KB
Stars: 67
Watchers: 3
Forks: 5
Open Issues: 2
Metadata Files:
- Readme: README.md

# Minimal OPT

This is a minimal PyTorch implementation of [OPT models](https://arxiv.org/abs/2205.01068).

It is based heavily on the [Hugging Face implementation of OPT models](https://github.com/huggingface/transformers/blob/main/src/transformers/models/opt/modeling_opt.py) as well as [Minimal GPT-NeoX-20B](https://github.com/zphang/minimal-gpt-neox-20b).

The code currently includes both a single-GPU as well as a simple pipeline-parallel implementation.

This means that in theory you should be able to run up to the 175B models on something like an 8xA100.

So far, I have only tested up to the 66B model on a 4xA100 machine.
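
As a rough sanity check on those hardware figures, here is a back-of-the-envelope estimate of whether the weights alone fit in the combined GPU memory. This is only a sketch: it assumes fp16 weights (2 bytes per parameter), counts 1 GB as 1e9 bytes, and ignores activations and the key/value cache. The helper names `weight_memory_gb` and `fits` are made up for illustration and are not part of this repo.

```python
# Back-of-the-envelope check: do the weights alone fit across the GPUs?
# Assumptions (not from the repo): fp16 weights (2 bytes per parameter),
# 1 GB = 1e9 bytes, no allowance for activations or the key/value cache.

BYTES_PER_PARAM = 2  # fp16


def weight_memory_gb(num_params: float) -> float:
    """Memory needed to hold the weights, in GB."""
    return num_params * BYTES_PER_PARAM / 1e9


def fits(num_params: float, num_gpus: int, gb_per_gpu: float) -> bool:
    """True if the weights fit in the combined memory of all GPUs."""
    return weight_memory_gb(num_params) <= num_gpus * gb_per_gpu


if __name__ == "__main__":
    setups = [
        ("66B on 4x 80GB", 66e9, 4, 80),
        ("175B on 8x 80GB", 175e9, 8, 80),
        ("175B on 8x 40GB", 175e9, 8, 40),
    ]
    for name, params, gpus, gb in setups:
        needed = weight_memory_gb(params)
        print(f"{name}: {needed:.0f} GB of weights, fits={fits(params, gpus, gb)}")
```

Under these assumptions, the 175B weights need the 80GB A100 variant on an 8-GPU machine, and real usage will need extra headroom beyond the weights.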

*This was a very quick implementation and may potentially have bugs. Contributions and additional features are welcome!*

## Setup

### Installation

Install PyTorch with your appropriate CUDA version, and then install from the `requirements.txt` (basically just `tokenizers`).

### Model Weights

This repo loads weights directly from the [model weights](https://github.com/facebookresearch/metaseq/tree/main/projects/OPT#pretrained-model-weights) available from Metaseq.

No further processing is required, except for the 175B weights, which require shard merging.

### Generate text

Here is some sample code to generate text. Note that since we are greedily decoding with no fancy tricks, repetition frequently occurs in generations.

```python
import minimal_opt
import torch
import transformers  # Just for the tokenizer!

model = minimal_opt.OPTModel(minimal_opt.OPT_2_7B_CONFIG, device="cuda:0", use_cache=True)
tokenizer = transformers.GPT2Tokenizer.from_pretrained(
    "facebook/opt-125m"
)

# Takes a while? I should add a status bar
minimal_opt.load_sharded_weights(model, [
    "/path/to/2.7b/reshard-model_part-0.pt",
    "/path/to/2.7b/reshard-model_part-1.pt",
    "/path/to/2.7b/reshard-model_part-2.pt",
    "/path/to/2.7b/reshard-model_part-3.pt",
])

with torch.inference_mode():
    text = minimal_opt.greedy_generate_text(
        model, tokenizer,
        "Large language models, which are often trained for hundreds of thousands"
        " of compute days, have shown remarkable capabilities for zero- and"
        " few-shot learning. Given their computational cost, these models are"
        " difficult to replicate without significant capital. For the few that"
        " are available through APIs, no access is granted to the full model"
        " weights, making them difficult to study. We present Open Pre-trained"
        " Transformers (OPT)",
        max_seq_len=128,
    )
    print(text)
```

Generation only supports greedy decoding for now, but the nice thing about a minimal implementation is that it is easy to modify!

The generation code is available [here](minimal_opt/generate.py) and should be easily modifiable to other decoding schemes.
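
To give a feel for what such a modification could look like, here is a minimal sketch of swapping the greedy argmax step for top-k sampling with temperature. It is not the code in `generate.py`: it works on a plain Python list of logits so it runs without a GPU, and the names `greedy_pick` and `sample_top_k` are hypothetical. In the real generation loop the logits would be a tensor and the same logic would be written with PyTorch ops.

```python
import math
import random
from typing import List, Optional


def greedy_pick(logits: List[float]) -> int:
    """Greedy decoding step: return the index of the largest logit."""
    return max(range(len(logits)), key=lambda i: logits[i])


def sample_top_k(
    logits: List[float],
    k: int,
    temperature: float = 1.0,
    rng: Optional[random.Random] = None,
) -> int:
    """Sample a token id from the k highest-scoring logits."""
    if k < 1 or temperature <= 0:
        raise ValueError("k must be >= 1 and temperature must be > 0")
    rng = rng or random.Random()
    # Keep the ids of the k largest logits.
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    # Softmax weights over the kept logits (max subtracted for stability).
    scaled = [logits[i] / temperature for i in top]
    peak = max(scaled)
    weights = [math.exp(s - peak) for s in scaled]
    return rng.choices(top, weights=weights, k=1)[0]


if __name__ == "__main__":
    fake_logits = [0.1, 2.5, -1.0, 2.0, 0.3]  # made-up scores for 5 tokens
    print("greedy:", greedy_pick(fake_logits))
    print("top-k sample:", sample_top_k(fake_logits, k=2, temperature=0.8))
```

With `k=1` this reduces to greedy decoding, which makes it easy to check a modified loop against the original behavior.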

### Pipeline Parallel

Pipeline parallelism distributes the layers of the model across different devices.

It's not the most efficient form of parallelism, but hey, it works.

By default, the `PPOPTModel` distributes layers equally across all visible devices, but you can provide an alternative layer-device allocation.

```python
import minimal_opt
import torch
import transformers  # Just for the tokenizer!

model = minimal_opt.PPOPTModel(minimal_opt.OPT_66B_CONFIG, use_cache=True)
tokenizer = transformers.GPT2Tokenizer.from_pretrained(
    "facebook/opt-125m"
)

# Takes a while? I should add a status bar. Also although it is loading shard by
# shard (not all at once), it still takes a good amount of RAM.
minimal_opt.load_sharded_weights(model, [
    "/path/to/66b/reshard-model_part-0-shard0.pt",
    "/path/to/66b/reshard-model_part-1-shard0.pt",
    "/path/to/66b/reshard-model_part-2-shard0.pt",
    "/path/to/66b/reshard-model_part-3-shard0.pt",
    "/path/to/66b/reshard-model_part-4-shard0.pt",
    "/path/to/66b/reshard-model_part-5-shard0.pt",
    "/path/to/66b/reshard-model_part-6-shard0.pt",
    "/path/to/66b/reshard-model_part-7-shard0.pt",
])

with torch.inference_mode():
    text = minimal_opt.greedy_generate_text(
        model, tokenizer,
        "Large language models, which are often trained for hundreds of thousands"
        " of compute days, have shown remarkable capabilities for zero- and"
        " few-shot learning. Given their computational cost, these models are"
        " difficult to replicate without significant capital. For the few that"
        " are available through APIs, no access is granted to the full model"
        " weights, making them difficult to study. We present Open Pre-trained"
        " Transformers (OPT)",
        max_seq_len=128,
    )
    print(text)
```
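
To illustrate the idea of an equal layer-to-device split, here is a small sketch that assigns each layer a device index in contiguous, near-equal blocks. It is an assumption about what such an allocation could look like, not the logic inside `PPOPTModel`: the function name `even_layer_allocation` is hypothetical, the handling of a remainder (earlier devices take one extra layer) is my own choice, and the embeddings and output head are left out.

```python
from typing import List


def even_layer_allocation(num_layers: int, num_devices: int) -> List[int]:
    """Return one device index per layer, in contiguous near-equal blocks."""
    if num_layers < 1 or num_devices < 1:
        raise ValueError("num_layers and num_devices must be positive")
    base, extra = divmod(num_layers, num_devices)
    allocation: List[int] = []
    for device in range(num_devices):
        # The first `extra` devices each take one additional layer.
        count = base + (1 if device < extra else 0)
        allocation.extend([device] * count)
    return allocation


if __name__ == "__main__":
    # 64 layers over 4 devices, i.e. 16 consecutive layers per device.
    allocation = even_layer_allocation(64, 4)
    for device in range(4):
        layers = [i for i, d in enumerate(allocation) if d == device]
        print(f"cuda:{device}: layers {layers[0]}-{layers[-1]}")
```

Keeping each device's layers contiguous means activations cross a device boundary only once per device, which is the usual reason to split this way.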

## Why another implementation?

- Writing a minimal implementation is good exercise for understanding the internals of a model.

- A minimal implementation is easy to hack around and modify.

- A minimal implementation is also easy to inspect and use as a reference for downstream ports.

## Other notes

- The 350M model is not currently supported because it apparently [has some architectural differences](https://github.com/huggingface/transformers/blob/8b332a6a160c6df82e4267aaf118d87377d78a67/src/transformers/models/opt/modeling_opt.py#L328-L330).

- I need to clean up the way masking works.
