https://github.com/freedomintelligence/x-tokenization

Last synced: 3 months ago
JSON representation

Host: GitHub
URL: https://github.com/freedomintelligence/x-tokenization
Owner: FreedomIntelligence
Created: 2023-10-30T08:48:19.000Z (over 2 years ago)
Default Branch: main
Last Pushed: 2023-11-03T07:56:17.000Z (over 2 years ago)
Last Synced: 2025-01-18T08:38:32.227Z (over 1 year ago)
Size: 1.95 KB
Stars: 0
Watchers: 10
Forks: 0
Open Issues: 0
Metadata Files:
- Readme: readme.md

Awesome Lists containing this project

README

          # Testing tokenization in multilingual context

This is also inspired by [tokenmonster](https://github.com/alasdairforsythe/tokenmonster).

## tokenizer 速度测试

| 模型 | Arabic (203673) |  | English (403630) |  | Chinese (121106) |  |

| --- | --- | --- | --- | --- | --- | --- |

|  | token | time(s) | token | time(s) | token | time(s) |

| Bloom (250k) | 50825 | 0.2077 | 89558 | 0.3469 | 77893 | 0.2185 |

| Llama (32k) | 183904 | 0.4261 | 106614 | 0.6880 | 169253 | 0.3682 |

| Baichuan2 (125k) | 172505 | 0.4330 | 99267 | 0.7666 | 80966 | 0.3087 |

| mt5 | 73172 | 4.2103 | 104994 | 9.0237 | 90329 | 0.4704 |

假设一个词表的大小，1k,2k,4k,8k, .. 训bpe，在给定的validation下算压缩率（暂时没找到合适的validation，以train dataset代替），饱和 (提升的比例随着词表大小不显著), 数据集wikipedia

最后一个token的frequency

###### 中文词表

| num |  compression ratio | last word frequency |

| --- | --- | --- |

| 8000 | 0.7813 | 48 |

| 16000 | 0.5964 | 2 |

| 24000 | 0.5485 | 2 |

| 32000 | 0.5212 | 1 |

根据训练预料需要7000+token能够达到99.95%的覆盖率，故从8k开始。

###### 阿拉伯语词表

| num |  compression ratio | last word frequency |

| --- | --- | --- |

| 1000 | 0.4172 | 553 |

| 2000 | 0.3603 | 553 |

| 4000 | 0.3130 | 10 |

| 8000 | 0.2752 | 10 |

| 16000 | 0.2466 | 7 |

| 24000 | 0.2338 | 6 |

| 32000 | 0.2262 | 1 |

## reference

```

@misc{X-tokenization-2023,

  title={X-tokenization, towards a universal across languages.},

  author={Jianqing Zhu and Benyou Wang},

  year = {2023},

  publisher = {GitHub},

  journal = {GitHub repository},

  howpublished = {\url{https://github.com/FreedomIntelligence/X-tokenization}},

}

```

ecosyste.ms

Data

Tools

Indexes

Applications

Experiments

Awesome

https://github.com/freedomintelligence/x-tokenization

Awesome Lists containing this project

README