Code for Zero-Shot Tokenizer Transfer
- Host: GitHub
- URL: https://github.com/bminixhofer/zett
- Owner: bminixhofer
- Created: 2024-05-06T09:58:30.000Z (over 1 year ago)
- Default Branch: main
- Last Pushed: 2025-01-14T09:53:55.000Z (10 months ago)
- Last Synced: 2025-03-29T20:03:26.692Z (7 months ago)
- Topics: language-model, llm, llms, multilingual, tokenization, transfer-learning
- Language: Python
- Homepage: https://arxiv.org/abs/2405.07883
- Size: 1.04 MB
- Stars: 125
- Watchers: 2
- Forks: 10
- Open Issues: 8
Metadata Files:
- Readme: README.md
# Zero-Shot Tokenizer Transfer
This repository contains the code for the paper [Zero-Shot Tokenizer Transfer](https://arxiv.org/abs/2405.07883). ZeTT frees language models from their tokenizer, allowing you to use any model with any tokenizer, with little or no extra training⚡
## Available pretrained hypernetworks
| Hypernetwork | ..for Model | Comments |
|---------------------------------------------------------------------------------------------------------------------------------------------------------------------|-----------------------------------------------------------------------------------------------------------------------------------|----------------------------|
| [benjamin/zett-hypernetwork-xlm-roberta-base](https://huggingface.co/benjamin/zett-hypernetwork-xlm-roberta-base) | [xlm-roberta-base](https://huggingface.co/FacebookAI/xlm-roberta-base) | multilingual, 26 languages |
| [benjamin/zett-hypernetwork-Mistral-7B-v0.1](https://huggingface.co/benjamin/zett-hypernetwork-Mistral-7B-v0.1) | [mistralai/Mistral-7B-v0.1](https://huggingface.co/mistralai/Mistral-7B-v0.1) | English + Code |
| [benjamin/zett-hypernetwork-multilingual-Mistral-7B-v0.1](https://huggingface.co/benjamin/zett-hypernetwork-multilingual-Mistral-7B-v0.1) | [mistralai/Mistral-7B-v0.1](https://huggingface.co/mistralai/Mistral-7B-v0.1) | multilingual, 26 languages |
| [benjamin/zett-hypernetwork-TinyLlama-1.1B-intermediate-step-1431k-3T](https://huggingface.co/benjamin/zett-hypernetwork-TinyLlama-1.1B-intermediate-step-1431k-3T) | [TinyLlama/TinyLlama-1.1B-intermediate-step-1431k-3T](https://huggingface.co/TinyLlama/TinyLlama-1.1B-intermediate-step-1431k-3T) | English + Code |
| [benjamin/zett-hypernetwork-Meta-Llama-3-8B-experimental](https://huggingface.co/benjamin/zett-hypernetwork-Meta-Llama-3-8B-experimental) | [meta-llama/Meta-Llama-3-8B](https://huggingface.co/meta-llama/Meta-Llama-3-8B) | __experimental__ English + Code, seems to underperform on Code |
## Using a pretrained hypernetwork
### Environment Setup
Requirements are in `requirements.txt`. The following, for example, creates a working environment:
```bash
conda create -n zett python=3.11
conda activate zett
pip install -r requirements.txt
pip install -U "jax[cuda12_pip]==0.4.23" -f https://storage.googleapis.com/jax-releases/jax_cuda_releases.html # adjust based on your CUDA version
pip install -e .
```
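As an optional sanity check (not part of the repository's instructions), you can verify that JAX picked up the CUDA wheels before moving on:
```python
# Optional sanity check: JAX should report a GPU/CUDA device, not only the CPU.
import jax

print(jax.__version__)  # expect 0.4.23 with the pin above
print(jax.devices())    # expect a CUDA device entry rather than only CpuDevice(id=0)
```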
### Transferring to a new tokenizer
Let's transfer XLM-RoBERTa to the GPT2 tokenizer.
```bash
git clone https://huggingface.co/benjamin/zett-hypernetwork-xlm-roberta-base
python3 scripts/transfer.py \
--target_model=FacebookAI/xlm-roberta-base \
--tokenizer_name=gpt2 \
--output=my-new-fancy-xlm-r \
--model_class=AutoModelForMaskedLM \
--lang_code=en \
--checkpoint_path=zett-hypernetwork-xlm-roberta-base \
--save_pt # otherwise saves only Flax weights
```
Tada!
```python
from transformers import AutoModelForMaskedLM, AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("my-new-fancy-xlm-r")
model = AutoModelForMaskedLM.from_pretrained("my-new-fancy-xlm-r")
out = model(**tokenizer("Hello world!", return_tensors="pt"))
```
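Continuing from the snippet above, a quick optional check (not from the repository) that the transferred model really consumes GPT-2 token ids and produces logits over the GPT-2 vocabulary:
```python
# The saved tokenizer is GPT-2's byte-level BPE, so the pieces look like GPT-2's.
print(tokenizer.tokenize("Hello world!"))  # e.g. ['Hello', 'Ġworld', '!']
print(out.logits.shape)                    # (1, sequence_length, vocab size of the GPT-2 tokenizer)
```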
..or Mistral-7B to the GPT-NeoX tokenizer:
```bash
git clone https://huggingface.co/benjamin/zett-hypernetwork-Mistral-7B-v0.1
# because Flax weights are not merged in the main branch, we need to specify the revision of a PR containing Flax weights
python3 scripts/transfer.py \
--target_model=mistralai/Mistral-7B-v0.1 \
--revision=refs/pr/95 \
--tokenizer_name=EleutherAI/gpt-neox-20b \
--output=my-new-fancy-mistral \
--model_class=AutoModelForCausalLM \
--checkpoint_path=zett-hypernetwork-Mistral-7B-v0.1 \
--save_pt # otherwise saves only Flax weights
```
```python
from transformers import AutoModelForCausalLM, AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("my-new-fancy-mistral")
model = AutoModelForCausalLM.from_pretrained("my-new-fancy-mistral")
out = model(**tokenizer("Hello world!", return_tensors="pt"))
```
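Continuing from the snippet above, an optional generation check (not from the repository; raw Mistral-7B completions will vary in quality):
```python
# Greedy-decode a few tokens with the transferred tokenizer to make sure generation runs end to end.
inputs = tokenizer("The capital of France is", return_tensors="pt")
generated = model.generate(**inputs, max_new_tokens=10, do_sample=False)
print(tokenizer.decode(generated[0], skip_special_tokens=True))
```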
Although the codebase is in JAX/Flax, there are PyTorch bindings for the hypernetwork in `./hf_hypernet`. You can use them as follows:
```python
import torch
from transformers import AutoModel, AutoModelForCausalLM, AutoTokenizer
from zett.utils import get_surface_form_matrix
base_model = AutoModelForCausalLM.from_pretrained("mistralai/Mistral-7B-v0.1")
hypernet = AutoModel.from_pretrained("benjamin/zett-hypernetwork-Mistral-7B-v0.1", trust_remote_code=True)
# the hypernetwork is conditioned on the base model's input and output embeddings,
# concatenated along the hidden dimension
source_embeddings = torch.concatenate([
base_model.get_input_embeddings().weight.data,
base_model.get_output_embeddings().weight.data,
], axis=1)
hn_tokenizer = AutoTokenizer.from_pretrained("benjamin/zett-hypernetwork-Mistral-7B-v0.1")
target_surface_forms = get_surface_form_matrix(
["Ġhello", "Ġworld"], # byte representation of the tokens to predict
maxlen=hypernet.config.hn_surface_maxlen,
tokenizer_to_use=hn_tokenizer,
)[0]
# the last output is the predicted bias in case the model uses a bias (e.g. XLM-R)
predicted_input_embeddings, predicted_output_embeddings, _ = hypernet(
torch.from_numpy(target_surface_forms),
source_embeddings=source_embeddings
)
```
Note that `transfer.py` itself is currently not ported to PyTorch (PRs welcome!).
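If you want to wire predicted embeddings into a PyTorch model yourself, a minimal sketch could look like the following. This is an assumption-laden illustration, not repository code: it presumes you ran the hypernetwork over the surface forms of *every* token of the target tokenizer, in token-id order, so that `predicted_input_embeddings` and `predicted_output_embeddings` from the snippet above have one row per target token.
```python
# Hypothetical sketch: install hypernetwork-predicted embeddings into a PyTorch copy of the base model.
# Assumes predicted_input_embeddings / predicted_output_embeddings (from the snippet above) cover the
# full target vocabulary in token-id order; the snippet only predicted two tokens for illustration.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

target_tokenizer = AutoTokenizer.from_pretrained("EleutherAI/gpt-neox-20b")
model = AutoModelForCausalLM.from_pretrained("mistralai/Mistral-7B-v0.1")

# resize both the input embedding matrix and the (untied) lm_head to the new vocabulary size
model.resize_token_embeddings(len(target_tokenizer))
with torch.no_grad():
    model.get_input_embeddings().weight.copy_(predicted_input_embeddings)
    model.get_output_embeddings().weight.copy_(predicted_output_embeddings)

model.save_pretrained("my-sketched-mistral")
target_tokenizer.save_pretrained("my-sketched-mistral")
```
For anything beyond a quick experiment, prefer `scripts/transfer.py` above, which handles details this sketch ignores.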
## Advanced usage
### Training a Hypernetwork
The script used to train the hypernetwork is `train.py`.
But first, you'll need to download and prepare the data via `data/prepare.py` and `data/prepare_code.py`.
You'll also need to install the Rust module in `rust_utils` (used to quickly sample tokenizers) via e.g. `cd rust_utils && maturin develop --release`.
Once finished, you can run training using the configs in `configs/`. For example:
```bash
python3 train.py configs/zeroshot/v7:tinyllama_en+code:lw=0.5_long.json
```
to train a hypernetwork for TinyLlama on English and Code.
### Transferring fine-tuned models to a new tokenizer using a base model hypernetwork
Use `scripts/apply_to_ft.py` to transfer the tokenizer of a fine-tuned model, given a base model whose tokenizer has already been transferred. For example:
```bash
python3 scripts/apply_to_ft.py \
--output=transferred-chat-mistral \
--base_model_path=mistralai/Mistral-7B-v0.1 \
--ft_model_path=mistralai/Mistral-7B-Instruct-v0.1 \
--tokenizer_swapped_base_model_path=path-to-base-model-with-new-tokenizer \
--lambdas 0.5
```
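Assuming the script writes a standard Hugging Face checkpoint to the `--output` path (hedged; check what the script actually saves), the result can then be loaded as usual:
```python
# Hypothetical follow-up: load the transferred chat model like any other checkpoint.
# Paths mirror the flags above; whether a tokenizer is saved alongside the model depends on the script,
# so the tokenizer is loaded here from the base model that already carries the new tokenizer.
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("transferred-chat-mistral")
tokenizer = AutoTokenizer.from_pretrained("path-to-base-model-with-new-tokenizer")
```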
### Reproducing the experiments from the paper
There are bash scripts in `experiments/` for reproducing the main results from the paper.
Evaluation on code is still missing because we are using a fork of `bigcode-evaluation-harness` to fix some issues we encountered; the corresponding scripts will be added soon.
### Unigramifying, using n-shot transferred models, reproducing the tokenizers from the paper, etc.
Guide coming soon... (but feel free to dig into `scripts/` in the meantime)
## Disclaimer
I prioritized releasing the code quickly instead of making it perfectly clean. There may still be remnants of my personal environment used to train the models and other non-niceties. I am in the process of cleaning this up. If you run into any problems or have any questions, please open an issue.