https://github.com/swiss-ai/parity-aware-bpe
Parity-Aware Byte-Pair Encoding: Improving Cross-lingual Fairness in Tokenization [arXiv 2025]
https://github.com/swiss-ai/parity-aware-bpe
bpe llms multilingual-nlp multilingual-tokenization tokenization
Last synced: about 1 month ago
JSON representation
Parity-Aware Byte-Pair Encoding: Improving Cross-lingual Fairness in Tokenization [arXiv 2025]
- Host: GitHub
- URL: https://github.com/swiss-ai/parity-aware-bpe
- Owner: swiss-ai
- License: mit
- Created: 2025-07-29T11:50:30.000Z (10 months ago)
- Default Branch: main
- Last Pushed: 2025-12-01T14:59:35.000Z (6 months ago)
- Last Synced: 2025-12-04T03:50:42.222Z (6 months ago)
- Topics: bpe, llms, multilingual-nlp, multilingual-tokenization, tokenization
- Language: Python
- Homepage: https://arxiv.org/abs/2508.04796
- Size: 21.5 KB
- Stars: 15
- Watchers: 0
- Forks: 3
- Open Issues: 0
-
Metadata Files:
- Readme: README.md
- License: LICENSE
Awesome Lists containing this project
README
Parity-Aware Byte-Pair Encoding: Improving Cross-lingual Fairness in Tokenization
==================================
This repository provides an implementation of the **Parity-Aware BPE** algorithm.
Paper: ["Parity-Aware Byte-Pair Encoding: Improving Cross-lingual Fairness in Tokenization"](https://arxiv.org/abs/2508.04796) [ACL 2026]
Overview
------------
**Parity-aware BPE** learns a tokenization that ensures parity in token lengths across languages on a multi-parallel development set.
Unlike standard BPE, which optimizes merges based on a single corpus, this approach explicitly considers cross-lingual fairness during the tokenization process.
Installation
------------
You can install this package directly from GitHub:
```bash
pip install git+https://github.com/swiss-ai/parity-aware-bpe.git
```
For development installation:
```bash
git clone https://github.com/swiss-ai/parity-aware-bpe.git
cd parity-aware-bpe
pip install -e .
```
Usage Instructions
------------------
The arguments of `parity-aware-bpe` are as follows:
- `--variant`: Parity-aware BPE variant. Options:
- base – standard parity-aware BPE (default)
- window – moving-window balancing version
- `--input`: Space-separated list of training corpora (one per language).
- `--dev`: Space-separated list of development texts used for parity computation (multi-parallel). The tool assumes that the language of the nth input corpus corresponds to the nth dev corpus (same order as `--input`).
- `--ratio`: Space-separated list of desired compression ratios (floats), relative to pre-tokenized training set length, per input language. Can be used for parity computation (on training data) in lieu of development set.
- `--global-merges`: Optionally, one can perform the first M merge operations based on global frequency statistics (equivalent to standard BPE), and only switch to a parity-optimizing mode after (Hybrid parity-aware BPE). This argument controls how many merge operations are performed based on global statistics.
- `--symbols`: Total number of BPE merges to perform.
- `--output`: Path to the output file where BPE merge rules will be saved (one per line).
- `--total-symbols`: Adjusts the number of merges by subtracting character counts (so `--symbols` approximates total symbols needed).
- `--min-frequency`: Minimum pair frequency to continue merging (default: `2`).
- `--window-size`: Context window size for the window-balancing variant (default: `100`).
- `--alpha`: Parameter controlling the moving-window balancing behavior (default: `2`).
### Example Usage
```bash
Python3 parity_aware_bpe/parity_aware_learn_bpe.py \
--symbols {num_operations} \
--variant {"base" or "window"} \
--input {train_files} \
--dev {development_files} \
--output {output_file}
```
Classical BPE
------------------
To run the classical BPE algorithm you can use `learn_bpe.py`:
```bash
Python3 parity_aware_bpe/learn_bpe.py \
--symbols {num_operations} \
--input {train_files} \
--dev {development_files} \
--output {output_file}
```
Generating a Vocabulary
------------------
After learning the merges, you can build a vocabulary file using the `build_vocab_from_merges` function in `HF_tokenizer.py`.
To create a Hugging Face-compatible tokenizer:
```bash
Python3 parity_aware_bpe/HF_tokenizer.py \
--merges_file_path {merge_file_path} \
--tokenizer_path {tokenizer_save_folder}
```
Loading the tokenizer
------------------
```python
import os
from transformers import PreTrainedTokenizerFast
from tokenizers.pre_tokenizers import Whitespace, ByteLevel
from tokenizers.models import BPE
from tokenizers import Tokenizer, pre_tokenizers
merge_file = os.path.join(tokenizer_path, "merges.txt")
vocab_file = os.path.join(tokenizer_path, "vocab.json")
tokenizer = Tokenizer(BPE(vocab=vocab_file, merges=merge_file))
pre_tokenizer = pre_tokenizers.Sequence([Whitespace(), ByteLevel(use_regex=False)]) # You need to use the same pre_tokenizer as the one used in BPE training
tokenizer.pre_tokenizer = pre_tokenizer
wrapped_tokenizer = PreTrainedTokenizerFast(tokenizer_object=tokenizer)
```
Intrinsic Evaluation
------------------
For our intrinsic evaluation, we use `Tok##Suite` to analyze and compare tokenizers across multiple languages and metrics. You can find the evaluation suite [here](https://github.com/cimeister/tokenizer-analysis-suite).
Citation
------------------
If you use this code for your research, please cite our paper:
``` bib
@article{foroutan-meister-et-al-2025-parity-aware-bpe,
title={Parity-Aware Byte-Pair Encoding: Improving Cross-lingual Fairness in Tokenization},
author={Foroutan, Negar and Meister, Clara and Paul, Debjit and Niklaus, Joel and Ahmadi, Sina, and Bosselut, Antoine and Sennrich, Rico},
url={https://arxiv.org/abs/2508.04796},
booktitle={arXiv},
year={2025}
}
```