Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://github.com/richiejp/ztran
Presently just a semi-working byte-pair encoder
- Host: GitHub
- URL: https://github.com/richiejp/ztran
- Owner: richiejp
- Created: 2023-12-04T09:50:09.000Z (12 months ago)
- Default Branch: main
- Last Pushed: 2024-02-29T08:20:45.000Z (9 months ago)
- Last Synced: 2024-02-29T09:31:18.583Z (9 months ago)
- Language: Zig
- Size: 12.7 KB
- Stars: 0
- Watchers: 2
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
README
# Zig Transformer
Weekend project to implement the encoder-decoder self-attention
transformer model from scratch (i.e. probably never :-p or maybe Mamba
at the rate things are moving).

## Progress
1. So far I have a slow and complicated byte-pair encoder for
   tokenization. It encodes bytes (limited to ASCII at the moment) into
   16-bit codes that index into a dictionary. It supports doing the
   encoding in blocks because the time complexity is not great (see the
   sketch after this list).
2. After looking at OpenAI's Tiktokenizer and watching Karpathy's video
   on BPE, I have a much more positive opinion of my own attempt. Here
   are some notes:
   - Training and encoding BPE both have poor time complexity, which has
     to be worked around. There could be some really clever solution,
     but it's not what people are presently using.
   - Generally a hand-written regex is used to split strings, both to
     keep the `m` or the `n` in `O(mn)` from getting large and to avoid
     certain unwanted tokens (e.g. tokens spanning words, whitespace and
     punctuation). I'm not sure whether a sequence of bytes can be
     engineered to cause a timeout with existing libraries and
     encodings; I certainly had this issue when training my own BPE
     with no splitting.
   - UTF-8 is only sort of handled: the regex works on UTF-8, but the
     BPE itself works on bytes. I observed that the first two bytes of a
     three-byte character can be joined into a single token while the
     remaining byte is encoded as a separate token. Rare 4-byte UTF-8
     characters can end up as 4 tokens, one per byte.
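
For illustration, here is a minimal Python sketch of the byte-level BPE idea described in point 1 above. It is not the project's Zig code, and the function names and toy input are made up. Raw bytes keep codes 0-255 and each learned merge gets the next free code, analogous to the 16-bit dictionary codes mentioned above. The repeated full-sequence replacement is what makes naive training and encoding roughly `O(n * merges)`, which is why point 1 resorts to encoding in blocks.

```python
# Illustrative byte-level BPE sketch (not the ztran implementation).
# Byte values 0-255 are used directly as codes; each merge adds a new
# code >= 256, which still fits in 16 bits while the vocabulary stays
# below 65536 entries.

def train_bpe(data: bytes, num_merges: int) -> dict:
    codes = list(data)
    merges = {}  # (left_code, right_code) -> new_code
    next_code = 256
    for _ in range(num_merges):
        # Count adjacent pairs: O(n) per merge, so O(n * num_merges)
        # overall; the "not great" time complexity mentioned above.
        counts = {}
        for pair in zip(codes, codes[1:]):
            counts[pair] = counts.get(pair, 0) + 1
        if not counts:
            break
        best = max(counts, key=counts.get)
        merges[best] = next_code
        # Replace every occurrence of the most frequent pair.
        merged, i = [], 0
        while i < len(codes):
            if i + 1 < len(codes) and (codes[i], codes[i + 1]) == best:
                merged.append(next_code)
                i += 2
            else:
                merged.append(codes[i])
                i += 1
        codes = merged
        next_code += 1
    return merges

def encode(data: bytes, merges: dict) -> list:
    codes = list(data)
    # Apply merges in the order they were learned.
    for pair, new_code in sorted(merges.items(), key=lambda kv: kv[1]):
        merged, i = [], 0
        while i < len(codes):
            if i + 1 < len(codes) and (codes[i], codes[i + 1]) == pair:
                merged.append(new_code)
                i += 2
            else:
                merged.append(codes[i])
                i += 1
        codes = merged
    return codes

if __name__ == "__main__":
    text = "low lower lowest €"  # '€' is a 3-byte UTF-8 character
    merges = train_bpe(text.encode("utf-8"), num_merges=10)
    print(encode(text.encode("utf-8"), merges))
    # Because merging sees only raw bytes, nothing stops it from fusing
    # two of the three bytes of '€' into one token while leaving the
    # third as a separate token, which is the behaviour noted above.
```

The regex pre-splitting mentioned in the notes is what keeps the numbers above small in practice: GPT-2-style tokenizers run a hand-written pattern over the text first and apply BPE to each chunk's bytes separately, so merges never cross word or whitespace boundaries. The sketch below uses the split pattern from openai/gpt-2's `encoder.py` via the third-party `regex` module (standard `re` lacks the `\p{...}` classes):

```python
import regex  # pip install regex

# The GPT-2 split pattern (also reused in Karpathy's minbpe).
GPT2_SPLIT = regex.compile(
    r"""'s|'t|'re|'ve|'m|'ll|'d| ?\p{L}+| ?\p{N}+| ?[^\s\p{L}\p{N}]+|\s+(?!\S)|\s+"""
)

chunks = GPT2_SPLIT.findall("Hello world, it's 2024!")
print(chunks)  # ['Hello', ' world', ',', ' it', "'s", ' 2024', '!']
# BPE then runs on the UTF-8 bytes of each chunk independently, so the
# cost is bounded by the longest chunk rather than the whole document.
```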

## Some things I'd like to try
- [ ] Get my own implementation to work with UTF-8
- [ ] Get it to encode and decode cl100k_base (GPT-4's tokenizer); see
      the example after this list
- [ ] Make it fast
- [ ] Create some training scheme so that it chooses byte pairs to
      split strings on instead of manually specifying a regex
- [ ] Create a CLI
- [ ] Create a Python binding
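
As a reference for the cl100k_base item above, the target encode/decode round trip can be reproduced today with OpenAI's `tiktoken` Python package. This is just a sketch of the goal; a Zig implementation would have to produce the same token ids from the published cl100k_base vocabulary and merge ranks.

```python
import tiktoken  # pip install tiktoken

# cl100k_base is the encoding used by GPT-4 and GPT-3.5-turbo.
enc = tiktoken.get_encoding("cl100k_base")

tokens = enc.encode("Hello, Zig tokenizer!")
print(tokens)              # a list of integer token ids
print(enc.decode(tokens))  # round-trips back to the original string
```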