# ByT5: Towards a token-free future with pre-trained byte-to-byte models

ByT5 is a tokenizer-free extension of the [mT5
model](https://arxiv.org/abs/2010.11934). Instead of using a subword vocabulary
like most other pretrained language models (BERT, XLM-R, T5, GPT-3), our ByT5
model operates directly on UTF-8 bytes, removing the need for any text
preprocessing. Beyond the reduction in system complexity, we find that
parameter-matched ByT5 models are competitive with mT5 across a range of tasks,
and outperform mT5 on tasks that involve noisy text or are sensitive to
spelling and pronunciation. This repo can be used to reproduce the experiments
in the [ByT5 paper][paper].
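
For intuition, byte-level "tokenization" amounts to reading the UTF-8 bytes of the
input text, with a handful of IDs reserved for special tokens. The sketch below is
purely illustrative (the offset of 3 reserved IDs mirrors the byte vocabulary used
by the t5 library, but treat the exact values as an assumption rather than its API):

```python
# Illustrative byte-level encoding in the spirit of ByT5 (not the t5 library's API).
# Assumption: IDs 0-2 are reserved (e.g. PAD/EOS/UNK), so raw byte values are shifted by 3.
RESERVED_IDS = 3
EOS_ID = 1  # assumed

def encode(text: str) -> list[int]:
    """One ID per UTF-8 byte, plus an end-of-sequence marker."""
    return [b + RESERVED_IDS for b in text.encode("utf-8")] + [EOS_ID]

def decode(ids: list[int]) -> str:
    """Invert encode(), dropping reserved IDs."""
    return bytes(i - RESERVED_IDS for i in ids if i >= RESERVED_IDS).decode("utf-8", errors="ignore")

print(encode("héllo"))           # 'é' is two bytes, so sequences are longer than character counts
print(decode(encode("héllo")))   # héllo
```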

## Usage

### Training

To run this code, you need to install the [t5
library](https://pypi.org/project/t5/). General instructions for training,
fine-tuning, evaluation, and exporting models for inference can be found in the
[t5
repo](https://github.com/google-research/text-to-text-transfer-transformer). To use
the additional ByT5 tasks provided in this library with the `t5_mesh_transformer`
command, run it from this directory and add the flag `--module_import="byt5.tasks"`.
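
The `--module_import` flag works because importing `byt5.tasks` registers the ByT5
tasks and mixtures (such as `byt5_mc4` used below) as a side effect of the import. A
quick way to sanity-check this from Python is sketched here, assuming the tasks
register with the standard seqio registries used by the t5 library:

```python
# Sketch: confirm that the ByT5 mixtures are visible after importing byt5.tasks.
# Assumes the t5 library is installed and that tasks register via the seqio registries.
import byt5.tasks  # noqa: F401  (imported for its registration side effects)
import seqio

print("byt5_mc4" in seqio.MixtureRegistry.names())  # expect True after a successful import
```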

To train a `ByT5-Large` model on the
[mc4](https://www.tensorflow.org/datasets/catalog/c4#c4multilingual_nights_stay)
task from scratch as described in the paper:

```
export PROJECT=yourproject
export ZONE=yourzone
export BUCKET=yourbucket
export TPU=yourtpu

ctpu up --name=$TPU --project=$PROJECT --zone=$ZONE --tpu-size=v3-256 --tpu-only --noconf

TASK=byt5_mc4
MODEL_DIR="${BUCKET}${TASK}"

python -m t5.models.mesh_transformer_main \
--tpu="${TPU}" \
--gcp_project="${PROJECT}" \
--tpu_zone="${ZONE}" \
--model_dir="${MODEL_DIR}" \
--gin_file="models/byt5.large.gin" \
--gin_param="MIXTURE_NAME = '${TASK}'" \
--gin_param="utils.run.sequence_length = {'inputs': 1024, 'targets': 189}" \
--gin_param="utils.run.batch_size = ('tokens_per_batch', 1048576)" \
--gin_param="utils.run.learning_rate_schedule=@learning_rate_schedules.rsqrt_no_ramp_down" \
--gin_param="run.train_steps = 1000000" \
--gin_param="utils.tpu_mesh_shape.model_parallelism = 1" \
--gin_param="utils.tpu_mesh_shape.tpu_topology = 'v3-256'" \
--eval_mode="perplexity_eval" \
--eval_gin_param="mesh_eval_dataset_fn.num_eval_examples = 10000" \
--t5_tfds_data_dir="${BUCKET}/t5-tfds" \
--module_import="byt5.tasks"
```
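
As a rough sanity check on the settings above, a `tokens_per_batch` of 1048576 with
1024-byte inputs works out to roughly a thousand sequences per batch; the exact
number varies with example packing, so treat this as a back-of-the-envelope figure:

```python
# Back-of-the-envelope batch sizing for the pre-training command above.
tokens_per_batch = 1_048_576   # utils.run.batch_size
input_length = 1_024           # utils.run.sequence_length['inputs']

print(tokens_per_batch // input_length)  # 1024 sequences per batch, before packing effects
```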

### Fine-Tuning

The example below shows how to fine-tune the `ByT5-Large` model on the XNLI
zero-shot task.

```
export PROJECT=yourproject
export ZONE=yourzone
export BUCKET=yourbucket
export TPU=yourtpu

ctpu up --name=$TPU --project=$PROJECT --zone=$ZONE --tpu-size=v3-256 --tpu-only --noconf

TASK=byt5_xnli_zeroshot
PRETRAINED_DIR=gs://t5-data/pretrained_models/byt5/large
PRETRAINED_STEPS=1000000
FINETUNE_STEPS=262144
MODEL_DIR="${BUCKET}${TASK}"

# Run fine-tuning
python -m t5.models.mesh_transformer_main \
--tpu="${TPU}" \
--gcp_project="${PROJECT}" \
--tpu_zone="${ZONE}" \
--model_dir="${MODEL_DIR}" \
--gin_file="${PRETRAINED_DIR}/operative_config.gin" \
--gin_param="utils.tpu_mesh_shape.tpu_topology = 'v3-256'" \
--gin_param="MIXTURE_NAME = '${TASK}'" \
--gin_param="utils.run.train_steps=$((PRETRAINED_STEPS+FINETUNE_STEPS))" \
--gin_param="utils.run.init_checkpoint='${PRETRAINED_DIR}/model.ckpt-${PRETRAINED_STEPS}'" \
--t5_tfds_data_dir="${BUCKET}/t5-tfds" \
--module_import="byt5.tasks" \
--gin_param="utils.run.batch_size = ('tokens_per_batch', 1048576)" \
--gin_param="utils.run.sequence_length = {'inputs': 2048, 'targets': 56}" \
--eval_gin_param="Bitransformer.decode.max_decode_length = 56"
```
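
Note that `utils.run.train_steps` is a cumulative step count: fine-tuning resumes
from the pre-trained checkpoint at step 1000000 and continues to the total computed
below.

```python
# The fine-tuning command trains to a cumulative step count (pre-training + fine-tuning).
PRETRAINED_STEPS = 1_000_000
FINETUNE_STEPS = 262_144

print(PRETRAINED_STEPS + FINETUNE_STEPS)  # 1262144, the value passed to utils.run.train_steps
```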

The remaining experiments are shown in the [tasks.py](byt5/tasks.py) file.

## Released Model Checkpoints

We have released the following checkpoints for pre-trained models described in
our [paper][paper]:

* **ByT5-Small** (300 million parameters): [gs://t5-data/pretrained_models/byt5/small](https://console.cloud.google.com/storage/browser/t5-data/pretrained_models/byt5/small/)
* **ByT5-Base** (580 million parameters): [gs://t5-data/pretrained_models/byt5/base](https://console.cloud.google.com/storage/browser/t5-data/pretrained_models/byt5/base/)
* **ByT5-Large** (1.2 billion parameters): [gs://t5-data/pretrained_models/byt5/large](https://console.cloud.google.com/storage/browser/t5-data/pretrained_models/byt5/large/)
* **ByT5-XL** (3.7 billion parameters): [gs://t5-data/pretrained_models/byt5/xl](https://console.cloud.google.com/storage/browser/t5-data/pretrained_models/byt5/xl/)
* **ByT5-XXL** (13 billion parameters): [gs://t5-data/pretrained_models/byt5/xxl](https://console.cloud.google.com/storage/browser/t5-data/pretrained_models/byt5/xxl/)
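
These are Mesh TensorFlow checkpoints hosted in a public GCS bucket. To confirm
which files are present before pointing `PRETRAINED_DIR` at one of them, something
like the following should work (a sketch assuming TensorFlow with GCS support is
installed and that the public bucket is readable from your environment):

```python
# List the ByT5-Large checkpoint files referenced in the fine-tuning example above.
import tensorflow as tf

ckpt_prefix = "gs://t5-data/pretrained_models/byt5/large/model.ckpt-1000000"
for path in tf.io.gfile.glob(ckpt_prefix + "*"):
    print(path)

# The same directory holds operative_config.gin, passed as --gin_file when fine-tuning.
print(tf.io.gfile.exists("gs://t5-data/pretrained_models/byt5/large/operative_config.gin"))
```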

## How to Cite

If you extend or use this work, please cite the [paper][paper] where it was
introduced:

```
@misc{xue2021byt5,
  title={ByT5: Towards a token-free future with pre-trained byte-to-byte models},
  author={Linting Xue and Aditya Barua and Noah Constant and Rami Al-Rfou and Sharan Narang and Mihir Kale and Adam Roberts and Colin Raffel},
  year={2021},
  eprint={2105.13626},
  archivePrefix={arXiv},
  primaryClass={cs.CL}
}
```

[paper]: https://arxiv.org/abs/2105.13626

This is not an officially supported Google product.