https://github.com/lopuhin/transformer-lm
Transformer language model (GPT-2) with sentencepiece tokenizer
- Host: GitHub
- URL: https://github.com/lopuhin/transformer-lm
- Owner: lopuhin
- Created: 2019-03-17T07:30:07.000Z (over 7 years ago)
- Default Branch: master
- Last Pushed: 2021-02-27T15:23:53.000Z (over 5 years ago)
- Last Synced: 2025-06-15T14:09:10.605Z (about 1 year ago)
- Topics: gpt-2, language-model, tensorflow
- Language: Python
- Homepage:
- Size: 359 KB
- Stars: 164
- Watchers: 9
- Forks: 46
- Open Issues: 10
Training GPT-2 transformer language model with sentencepiece tokenizer
======================================================================
.. image:: https://img.shields.io/travis/lopuhin/transformer-lm/master.svg
   :target: https://travis-ci.org/lopuhin/transformer-lm
   :alt: Build Status
Training GPT-2 transformer language model on your own corpora
with sentencepiece tokenization.

This repo contains a PyTorch implementation of GPT-2, which supports multi-GPU
training.
It also contains a TensorFlow implementation in ``lm/gpt_2_tf``,
but it is no longer developed. The two implementations share the same data
preparation scripts.
The TF training command is ``gpt-2-tf-train``, and it needs TensorFlow 1.13.
The documentation below is for the PyTorch version.
.. contents::
Installation
------------
Python 3.6+ is required, with PyTorch nightly or 1.6.0+.
Working in a virtualenv is assumed below.
Install the appropriate version of PyTorch first, and then::

    pip install -r requirements.txt
    python setup.py develop
Usage
-----
Instructions are below. See also ``tests/test_shakespeare.sh``
for a complete pipeline demo on a small corpus (takes a minute on a CPU).
Prepare data for training
+++++++++++++++++++++++++
Corpus format: a directory with top-level ``train``, ``valid`` and ``test``
folders. Each top-level folder may contain sub-folders. Inside them,
there must be UTF-8 encoded text files with the ``.txt`` extension.
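
The layout described above can be checked before running any of the commands. The following is a minimal sketch, not part of this repo: the function name ``list_corpus_files`` is hypothetical, and it assumes only what is stated above (three top-level folders, optional sub-folders, UTF-8 ``.txt`` files).

```python
from pathlib import Path
from typing import Dict, List

# Top-level folders that the corpus format requires.
SPLITS = ("train", "valid", "test")


def list_corpus_files(corpus_root: str) -> Dict[str, List[Path]]:
    """Return the .txt files found under each split folder, searched recursively."""
    root = Path(corpus_root)
    files: Dict[str, List[Path]] = {}
    for split in SPLITS:
        split_dir = root / split
        if not split_dir.is_dir():
            raise FileNotFoundError(f"missing split folder: {split_dir}")
        paths = sorted(split_dir.rglob("*.txt"))
        for path in paths:
            # Raises UnicodeDecodeError if a file is not valid UTF-8.
            path.read_text(encoding="utf-8")
        files[split] = paths
    return files
```

Calling ``list_corpus_files("data/corpora-1")`` (an example path) returns a dict keyed by split name, which makes it easy to spot an empty or missing split.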
The commands to train the sentencepiece model and encode the corpus support
multiple corpora; the examples below assume they can be listed as
``data/corpora-*``.
1. Train the sentencepiece model (``sp-text.txt`` can be removed after running).
   This can consume a large amount of memory; adjust sentencepiece arguments
   as advised if needed
   (this is not supported in the ``sp-train`` command directly)::

       sp-train data/corpora-* sp-text.txt sp-model

2. Encode corpora, producing numpy files::

       sp-encode data/corpora-* sp-model.model data/encoded
Training
++++++++
Example command::

    gpt-2 run-root data/encoded sp-model.model
``run-root`` will contain model checkpoints and json-lines logs,
which can be plotted in a Jupyter notebook with
``json_log_plots.plot("run-root")``, with the number of tokens seen on the X axis.
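
For inspecting the logs without a notebook, json-lines files can be read with the standard library alone. This is a minimal sketch under stated assumptions: the helper ``read_json_lines`` is hypothetical, and the names of the log files and of the fields inside each record are not specified here, so check the contents of ``run-root`` before relying on any particular key.

```python
import json
from pathlib import Path
from typing import Dict, List


def read_json_lines(path: str) -> List[Dict]:
    """Parse a json-lines file: one JSON object per non-empty line."""
    records = []
    with Path(path).open(encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:  # skip blank lines
                records.append(json.loads(line))
    return records
```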
Default hyperparameters correspond to the released "small" GPT-2 model.
When multiple GPUs are available, they are used for training with the
help of ``torch.distributed``.
If the ``run-root`` path exists and the ``--clean`` flag is NOT passed,
training is resumed.
Note that all parameters still need to be specified, and
model parameters need to match.
Notes on training parameters:
- ``--batch-size`` is per-GPU, so you don't need to re-tune it when changing
  the number of GPUs; just use the maximum that fits into memory.
- ``--g-accum-gradients`` is the global number of gradient accumulations;
  it must be divisible by the number of GPUs. Effective global batch size is
  always ``batch_size * g_accum_gradients``.
- ``--lr`` does not need to be changed when changing
  ``--batch-size`` or ``--g-accum-gradients`` or the number of GPUs
  or ``--n-ctx``: loss is already scaled appropriately.
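
The batch-size arithmetic in the notes above can be sketched as follows. The function names are hypothetical and this is not code from the repo. The per-GPU accumulation count is an assumption inferred from the divisibility requirement, not something the notes state directly.

```python
def effective_batch_size(batch_size: int, g_accum_gradients: int, n_gpus: int) -> int:
    """Effective global batch size: batch_size * g_accum_gradients."""
    if g_accum_gradients % n_gpus != 0:
        raise ValueError("g_accum_gradients must be divisible by the number of GPUs")
    return batch_size * g_accum_gradients


def accum_steps_per_gpu(g_accum_gradients: int, n_gpus: int) -> int:
    """Assumed number of accumulation steps each GPU performs per update."""
    if g_accum_gradients % n_gpus != 0:
        raise ValueError("g_accum_gradients must be divisible by the number of GPUs")
    return g_accum_gradients // n_gpus
```

With ``--batch-size 4`` and ``--g-accum-gradients 8``, the effective global batch size is 32 on 1, 2 or 4 GPUs. Under the assumption above, each GPU would accumulate 8, 4 or 2 batches respectively, which is consistent with the batch size staying fixed as GPUs are added.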
Inference
+++++++++
Example command::

    gpt-2-gen run-root "Artificial intelligence"

``run-root`` is the directory containing model checkpoints, and
``"Artificial intelligence"`` is the text prefix used as the starting point
for generating tokens.
Notes on inference parameters:
- ``--tokens-to-generate``: number of tokens to generate, default is 42.
- ``--top-k``: number of token candidates to generate for each position (beam width), default is 8.
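
To illustrate what restricting generation to ``k`` candidates per position can look like, here is a generic top-k sampling sketch in pure Python. It is not taken from ``gpt-2-gen``, whose decoding procedure may differ (the note above describes ``--top-k`` as a beam width). The function name ``sample_top_k`` and the toy scores are hypothetical.

```python
import math
import random
from typing import Sequence


def sample_top_k(logits: Sequence[float], k: int, rng: random.Random) -> int:
    """Sample a token id from the k highest-scoring entries of logits."""
    if not 1 <= k <= len(logits):
        raise ValueError("k must be between 1 and the vocabulary size")
    # Ids of the k largest logits, best first.
    top_ids = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    # Softmax weights over the kept candidates only; subtracting the
    # maximum logit keeps the exponentials numerically stable.
    max_logit = logits[top_ids[0]]
    weights = [math.exp(logits[i] - max_logit) for i in top_ids]
    return rng.choices(top_ids, weights=weights, k=1)[0]
```

With ``k=1`` this reduces to greedy decoding. Larger values of ``k`` let lower-ranked tokens through in proportion to their renormalized probability.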
License & credits
-----------------
License is MIT.
The TensorFlow GPT-2 model is taken from
https://github.com/openai/gpt-2/blob/master/src/model.py
and the TensorFlow GPT-2 training code is based on
https://github.com/nshepperd/gpt-2/blob/finetuning/train.py

The PyTorch port is based on the original OpenAI code.

The test Shakespeare corpus under ``tests/shakespeare``
is from http://shakespeare.mit.edu and is in the public domain.
See also the OpenAI GPT-2 paper and blog.