Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://github.com/ealmloff/fast-bpe
What if tokenization was fast?
- Host: GitHub
- URL: https://github.com/ealmloff/fast-bpe
- Owner: ealmloff
- Created: 2024-08-30T00:51:58.000Z (4 months ago)
- Default Branch: main
- Last Pushed: 2024-09-13T17:53:40.000Z (4 months ago)
- Last Synced: 2024-12-22T04:41:57.327Z (16 days ago)
- Language: Rust
- Homepage:
- Size: 78.6 MB
- Stars: 1
- Watchers: 1
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
README
A fast BPE tokenizer written in Rust.
## Fast on small inputs
After pre-tokenization splitting, most inputs will be very small. FastBPE is absurdly fast on small inputs.
![Screenshot 2024-09-05 at 8 01 24 PM](https://github.com/user-attachments/assets/cb8ee307-dafb-4199-acdd-3495e7c3e8d0)
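To illustrate why pieces are small after pre-tokenization, here is a minimal sketch of splitting text on whitespace and then running naive BPE on each piece. It is not FastBPE's API or algorithm: `toy_merges`, `bpe_piece`, and `tokenize` are hypothetical names, the merge table is invented for the example, and the whitespace split stands in for whatever pre-tokenizer a real pipeline uses.

```rust
use std::collections::HashMap;

/// Hypothetical toy merge table: maps a pair of adjacent tokens to its merge
/// rank (lower rank merges first). Real tables are learned from a corpus.
fn toy_merges() -> HashMap<(String, String), usize> {
    let pairs = [("l", "o"), ("lo", "w"), ("e", "r")];
    pairs
        .iter()
        .enumerate()
        .map(|(rank, (a, b))| ((a.to_string(), b.to_string()), rank))
        .collect()
}

/// Naive BPE on one pre-tokenized piece: start from single characters and
/// repeatedly merge the leftmost adjacent pair with the lowest rank until no
/// known pair remains. Each pass rescans the piece, so this is quadratic in
/// the worst case, which is only acceptable because pieces are short.
fn bpe_piece(piece: &str, merges: &HashMap<(String, String), usize>) -> Vec<String> {
    let mut tokens: Vec<String> = piece.chars().map(|c| c.to_string()).collect();
    loop {
        // Track the best candidate as (rank, index of the left token).
        let mut best: Option<(usize, usize)> = None;
        for i in 0..tokens.len().saturating_sub(1) {
            let key = (tokens[i].clone(), tokens[i + 1].clone());
            if let Some(&rank) = merges.get(&key) {
                if best.map_or(true, |(r, _)| rank < r) {
                    best = Some((rank, i));
                }
            }
        }
        match best {
            Some((_, i)) => {
                // Fuse tokens[i] and tokens[i + 1] into a single token.
                let right = tokens.remove(i + 1);
                tokens[i].push_str(&right);
            }
            None => return tokens,
        }
    }
}

/// Pre-tokenize on whitespace (an assumption for this sketch), then apply
/// BPE to each piece independently and concatenate the results.
fn tokenize(text: &str, merges: &HashMap<(String, String), usize>) -> Vec<String> {
    text.split_whitespace()
        .flat_map(|piece| bpe_piece(piece, merges))
        .collect()
}
```

Calling `tokenize("low lower", &toy_merges())` runs the merge loop twice, once on `low` and once on `lower`, so each call only ever scans a handful of characters.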
## Fast on giant inputs
Even if you don't pre-tokenize, FastBPE takes linear time for any input size. This makes it very fast on giant inputs.
![Screenshot 2024-09-06 at 6 59 52 AM](https://github.com/user-attachments/assets/1120bce3-ad53-4037-adb6-f7c1f602ce1e)