https://github.com/modeltc/greedy-tokenizer
Greedily tokenize strings with the longest tokens iteratively.
- Host: GitHub
- URL: https://github.com/modeltc/greedy-tokenizer
- Owner: ModelTC
- License: apache-2.0
- Created: 2023-11-22T05:45:56.000Z (about 2 years ago)
- Default Branch: main
- Last Pushed: 2025-09-15T00:06:56.000Z (5 months ago)
- Last Synced: 2025-10-09T10:33:22.943Z (4 months ago)
- Language: Python
- Size: 46.9 KB
- Stars: 0
- Watchers: 6
- Forks: 0
- Open Issues: 3
Metadata Files:
- Readme: README.md
- License: LICENSE-APACHE
# Greedy Tokenizer
Greedily tokenize strings by iteratively matching the longest token in the vocabulary,
compatible with `transformers.PreTrainedTokenizer` and `transformers.AutoTokenizer`.
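The core idea can be sketched in a few lines. This is an illustrative toy implementation of greedy longest-match tokenization only, not the library's actual code, which uses a general suffix automaton (via `general-sam`) for efficiency:

```python
# Toy sketch of greedy longest-match tokenization (illustrative only;
# the real GreedyTokenizer uses a suffix automaton from `general-sam`).
def greedy_tokenize(text, vocab):
    """At each position, consume the longest vocab token that matches."""
    max_len = max(map(len, vocab))
    tokens, i = [], 0
    while i < len(text):
        # Try candidate lengths from longest to shortest.
        for length in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + length]
            if piece in vocab:
                tokens.append(piece)
                i += length
                break
        else:
            # No vocab match: fall back to a single character
            # (the real tokenizer emits byte-fallback tokens instead).
            tokens.append(text[i])
            i += 1
    return tokens

print(greedy_tokenize("你好你好呀", {"你", "你好", "你好呀"}))
# ['你好', '你好呀']
```

Note how the match at each position is locally longest: the first `你好` is consumed before the longer `你好呀` becomes reachable at the next position.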
## Requirements
- [transformers](https://github.com/huggingface/transformers)
- [general-sam](https://github.com/ModelTC/general-sam-py)
## Installation
```sh
git clone https://github.com/ModelTC/greedy-tokenizer.git
cd greedy-tokenizer
pip install -e .
```
Or use the [source file](./greedy_tokenizer.py) directly.
## Usage
```python
from greedy_tokenizer import GreedyTokenizer
from transformers import AutoTokenizer
# Construct a GreedyTokenizer from another pretrained tokenizer
tokenizer = GreedyTokenizer.from_other_pretrained(
    "internlm/internlm2-chat-7b",
    trust_remote_code=True,
    revision="main",
    use_fast=False,
)
# Or, you can use:
# old_tokenizer = AutoTokenizer.from_pretrained(...)
# tokenizer = GreedyTokenizer.mock_tokenizer(old_tokenizer)
seq = "Hello! 你好呀!🌠"
tokens = tokenizer.tokenize(seq)
print(tokens)
# ['Hello', '!', ' ', '你好', '呀', '!', '<0xF0>', '<0x9F>', '<0x8C>', '<0xA0>']
assert tokenizer.convert_tokens_to_string(tokens) == seq
# A GreedyTokenizer can also be saved and loaded
tokenizer.save_pretrained("/tmp/internlm2-chat-gt")
tokenizer = AutoTokenizer.from_pretrained(
    "/tmp/internlm2-chat-gt",
    trust_remote_code=True,
    use_fast=False,
)
# No subwords required!
gt = GreedyTokenizer(vocab=[f'<0x{i:02x}>' for i in range(256)] + ['你好呀'])
print(gt.tokenize('你好你好呀'))
# ['<0xe4>', '<0xbd>', '<0xa0>', '<0xe5>', '<0xa5>', '<0xbd>', '你好呀']
```
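Tokens such as `<0xF0>` above are byte-fallback tokens: characters with no vocabulary entry are emitted as their raw UTF-8 bytes, one token per byte, so every string round-trips losslessly. A hypothetical sketch of decoding such tokens back into text (the function name is illustrative, not the library's API):

```python
# Illustrative decoder for byte-fallback tokens like '<0xF0>';
# not the library's API, just a sketch of the round-trip idea.
def detokenize(tokens):
    buf = bytearray()
    for tok in tokens:
        if tok.startswith("<0x") and tok.endswith(">") and len(tok) == 6:
            # Byte-fallback token: append the raw byte it encodes.
            buf.append(int(tok[3:5], 16))
        else:
            # Ordinary token: append its UTF-8 encoding.
            buf.extend(tok.encode("utf-8"))
    # Decode the accumulated bytes back into a string.
    return buf.decode("utf-8")

print(detokenize(["你好", "<0xF0>", "<0x9F>", "<0x8C>", "<0xA0>"]))
# 你好🌠
```

The four byte tokens are exactly the UTF-8 encoding of 🌠 (U+1F320), which is why `convert_tokens_to_string` in the usage example recovers the original input.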
## Tests
```sh
pip install -e ".[test]"
pytest -s
# You can set some environment variables
# DATASET=happylkx/InstructCoder COLUMN=input pytest -s
```
## License
- © 2023 Chielo Newctle \<[ChieloNewctle@gmail.com](mailto:ChieloNewctle@gmail.com)\>
- © 2023 ModelTC Team
This project is licensed under either of
- [Apache License, Version 2.0](https://www.apache.org/licenses/LICENSE-2.0) ([`LICENSE-APACHE`](LICENSE-APACHE))
- [MIT license](https://opensource.org/licenses/MIT) ([`LICENSE-MIT`](LICENSE-MIT))
at your option.
The [SPDX](https://spdx.dev) license identifier for this project is `MIT OR Apache-2.0`.