https://github.com/hscspring/bytepiece-rs
The Bytepiece Tokenizer Implemented in Rust.
bytepiece language-model nlp tokenizer
- Host: GitHub
- URL: https://github.com/hscspring/bytepiece-rs
- Owner: hscspring
- Created: 2023-09-17T14:04:45.000Z (over 2 years ago)
- Default Branch: main
- Last Pushed: 2023-11-28T14:33:25.000Z (over 2 years ago)
- Last Synced: 2024-10-18T00:48:41.148Z (over 1 year ago)
- Topics: bytepiece, language-model, nlp, tokenizer
- Language: Rust
- Homepage:
- Size: 1.98 MB
- Stars: 14
- Watchers: 2
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
README
# bytepiece
Implementation of Su's [bytepiece](https://github.com/bojone/bytepiece).
Bytepiece is a new tokenization method that uses UTF-8 bytes as its basic units (unigrams) for processing text. It needs little preprocessing, which makes it purer and more language-independent.
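
To make the byte-level idea concrete, here is a minimal sketch of what it means to treat UTF-8 bytes as the basic units. It uses only the Python standard library. The helper names `to_byte_units` and `from_byte_units` are hypothetical and are not part of `rs_bytepiece`; the sketch does not reproduce the tokenizer's vocabulary or its segmentation, only the byte view of the input that such a tokenizer starts from.

```python
from typing import List


def to_byte_units(text: str) -> List[bytes]:
    """Split text into the single-byte units of its UTF-8 encoding.

    Hypothetical helper for illustration only; not the rs_bytepiece API.
    """
    return [bytes([b]) for b in text.encode("utf-8")]


def from_byte_units(units: List[bytes]) -> str:
    """Join byte units back together and decode them as UTF-8."""
    return b"".join(units).decode("utf-8")


text = "今天天气不错"
units = to_byte_units(text)

# Each of these six characters takes three bytes in UTF-8.
print(len(text), len(units))  # 6 18

# The byte view is lossless: joining the units restores the original text.
print(from_byte_units(units) == text)  # True
```

Because every string in any language reduces to the same 256 possible byte values, no language-specific preprocessing is needed before tokenization.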
## Bindings
- [Rust](https://github.com/hscspring/bytepiece-rs/tree/main/bytepiece_rs)
- [Python](https://github.com/hscspring/bytepiece-rs/tree/main/bindings/python)
## Quick Example using Python
```python
from rs_bytepiece import Tokenizer
tokenizer = Tokenizer()
output = tokenizer.encode("今天天气不错")
print(output)
# [40496, 45268, 39432]
```