https://github.com/freedomintelligence/x-tokenization
- Host: GitHub
- URL: https://github.com/freedomintelligence/x-tokenization
- Owner: FreedomIntelligence
- Created: 2023-10-30T08:48:19.000Z (over 2 years ago)
- Default Branch: main
- Last Pushed: 2023-11-03T07:56:17.000Z (over 2 years ago)
- Last Synced: 2025-01-18T08:38:32.227Z (over 1 year ago)
- Size: 1.95 KB
- Stars: 0
- Watchers: 10
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: readme.md
# Testing tokenization in multilingual context
This is also inspired by [tokenmonster](https://github.com/alasdairforsythe/tokenmonster).
## Tokenizer speed test
| Model (vocab size) | Arabic (203673): tokens | Arabic: time (s) | English (403630): tokens | English: time (s) | Chinese (121106): tokens | Chinese: time (s) |
| --- | --- | --- | --- | --- | --- | --- |
| Bloom (250k) | 50825 | 0.2077 | 89558 | 0.3469 | 77893 | 0.2185 |
| Llama (32k) | 183904 | 0.4261 | 106614 | 0.6880 | 169253 | 0.3682 |
| Baichuan2 (125k) | 172505 | 0.4330 | 99267 | 0.7666 | 80966 | 0.3087 |
| mt5 | 73172 | 4.2103 | 104994 | 9.0237 | 90329 | 0.4704 |
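
A minimal sketch of how a token count and a timing like those in the table could be collected. The helper name `benchmark` is hypothetical, and the README does not state which tokenizer checkpoints or timing method were actually used, so the commented-out Hugging Face lines are an assumption rather than the original setup.

```python
import time
from typing import Callable, Sequence, Tuple


def benchmark(tokenize: Callable[[str], Sequence], text: str) -> Tuple[int, float]:
    """Return (number of tokens, wall-clock seconds) for one pass over `text`."""
    start = time.perf_counter()
    tokens = tokenize(text)
    elapsed = time.perf_counter() - start
    return len(tokens), elapsed


if __name__ == "__main__":
    # Stand-in tokenizer (whitespace split) so the sketch runs without downloads.
    sample = "the quick brown fox jumps over the lazy dog " * 1000
    n_tokens, seconds = benchmark(str.split, sample)
    print(f"whitespace: {n_tokens} tokens in {seconds:.4f}s")

    # Assumption: with the `transformers` package installed, a real tokenizer
    # could be plugged in the same way, for example:
    #   from transformers import AutoTokenizer
    #   tok = AutoTokenizer.from_pretrained("bigscience/bloom")
    #   print(benchmark(tok.tokenize, sample))
```

A single pass is noisy; repeating the call and taking the minimum or median would give steadier numbers.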
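Experiment: pick a vocabulary size (1k, 2k, 4k, 8k, ...), train BPE at each size, and compute the compression ratio on a given validation set. No suitable validation set has been found yet, so the training dataset is used instead. The goal is to find the saturation point, where enlarging the vocabulary no longer improves the ratio significantly. Dataset: Wikipedia.
The frequency of the last token added to the vocabulary is also reported.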
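
The procedure can be sketched with a toy, pure-Python BPE trainer. This is not the repository's training code. It assumes that "compression ratio" means tokens per non-whitespace character and that "last token frequency" means the count of the last pair merged. Neither definition is stated in the README, and the function names are hypothetical.

```python
from collections import Counter
from typing import List, Tuple

Pair = Tuple[str, str]


def merge_pair(symbols: Tuple[str, ...], pair: Pair) -> Tuple[str, ...]:
    """Replace every non-overlapping occurrence of `pair` with the merged symbol."""
    out, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return tuple(out)


def train_bpe(text: str, num_merges: int) -> Tuple[List[Pair], int]:
    """Learn up to `num_merges` merges; return them and the last pair's frequency."""
    words = Counter(tuple(w) for w in text.split())
    merges: List[Pair] = []
    last_freq = 0
    for _ in range(num_merges):
        pairs: Counter = Counter()
        for symbols, count in words.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += count
        if not pairs:
            break  # every word is already a single symbol
        best, last_freq = pairs.most_common(1)[0]
        merges.append(best)
        merged: Counter = Counter()
        for symbols, count in words.items():
            merged[merge_pair(symbols, best)] += count
        words = merged
    return merges, last_freq


def compression_ratio(text: str, merges: List[Pair]) -> float:
    """Assumed definition: tokens produced / non-whitespace characters."""
    n_tokens = n_chars = 0
    for word in text.split():
        symbols = tuple(word)
        for pair in merges:  # apply merges in the order they were learned
            symbols = merge_pair(symbols, pair)
        n_tokens += len(symbols)
        n_chars += len(word)
    return n_tokens / n_chars if n_chars else 0.0
```

In this sketch the vocabulary size is the number of distinct base characters plus the number of merges. Text without spaces, such as Chinese, is treated as one long word per whitespace-separated chunk, which works but is slow. A real run would use an optimized BPE library.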
###### Chinese vocabulary
| Vocabulary size | Compression ratio | Last token frequency |
| --- | --- | --- |
| 8000 | 0.7813 | 48 |
| 16000 | 0.5964 | 2 |
| 24000 | 0.5485 | 2 |
| 32000 | 0.5212 | 1 |
On the training corpus, more than 7,000 tokens are needed to reach 99.95% coverage, so the Chinese vocabulary sizes start at 8k.
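
A coverage figure of this kind can be estimated by counting characters. The sketch below assumes that "coverage" is the share of non-whitespace characters in the corpus accounted for by the most frequent distinct characters. The README does not define the term, and `symbols_for_coverage` is a hypothetical helper.

```python
from collections import Counter


def symbols_for_coverage(text: str, target: float) -> int:
    """Smallest number of most-frequent characters that covers `target` of `text`."""
    counts = Counter(ch for ch in text if not ch.isspace())
    total = sum(counts.values())
    covered = 0
    for rank, (_, freq) in enumerate(counts.most_common(), start=1):
        covered += freq
        if covered >= target * total:
            return rank
    return len(counts)  # only reached for empty input


if __name__ == "__main__":
    corpus = "aaaaaaab bc"  # toy stand-in for the training corpus
    print(symbols_for_coverage(corpus, 0.8))
```

Calling it on the real corpus with `target=0.9995` gives the minimum number of base symbols a vocabulary needs before any merges are learned.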
###### Arabic vocabulary
| Vocabulary size | Compression ratio | Last token frequency |
| --- | --- | --- |
| 1000 | 0.4172 | 553 |
| 2000 | 0.3603 | 553 |
| 4000 | 0.3130 | 10 |
| 8000 | 0.2752 | 10 |
| 16000 | 0.2466 | 7 |
| 24000 | 0.2338 | 6 |
| 32000 | 0.2262 | 1 |
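
One way to read saturation off the Arabic table is to compute the relative drop in compression ratio between consecutive vocabulary sizes. The sketch uses only the numbers above and assumes a lower ratio is better. Where to place the cutoff is a judgment call that the README does not specify.

```python
from typing import List, Tuple

# (vocabulary size, compression ratio) copied from the Arabic table.
ARABIC: List[Tuple[int, float]] = [
    (1000, 0.4172),
    (2000, 0.3603),
    (4000, 0.3130),
    (8000, 0.2752),
    (16000, 0.2466),
    (24000, 0.2338),
    (32000, 0.2262),
]


def relative_gains(rows: List[Tuple[int, float]]) -> List[Tuple[int, int, float]]:
    """For each consecutive pair, return (from_size, to_size, relative ratio drop)."""
    gains = []
    for (prev_size, prev_ratio), (size, ratio) in zip(rows, rows[1:]):
        gains.append((prev_size, size, (prev_ratio - ratio) / prev_ratio))
    return gains


if __name__ == "__main__":
    for prev_size, size, gain in relative_gains(ARABIC):
        print(f"{prev_size:>6} -> {size:>6}: ratio drops by {gain:.1%}")
```

The steps are not uniform (doubling up to 16k, then +8k), so the gains for the last two rows are not directly comparable with the earlier ones.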
## Reference
```
@misc{X-tokenization-2023,
title={X-tokenization, towards a universal across languages.},
author={Jianqing Zhu and Benyou Wang},
year = {2023},
publisher = {GitHub},
journal = {GitHub repository},
howpublished = {\url{https://github.com/FreedomIntelligence/X-tokenization}},
}
```