Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://github.com/jmaczan/bpe.c
Byte-Pair Encoding tokenizer for training large language models on huge datasets. I don't know C, so most of the code comes from AI :D I hope to learn by rewriting it and making changes, fixes, etc.
bpe bpe-tokenizer c clang llm tokenizer
Last synced: 8 days ago
- Host: GitHub
- URL: https://github.com/jmaczan/bpe.c
- Owner: jmaczan
- License: gpl-3.0
- Created: 2024-06-07T06:18:19.000Z (5 months ago)
- Default Branch: main
- Last Pushed: 2024-06-23T12:09:53.000Z (5 months ago)
- Last Synced: 2024-06-23T13:28:34.810Z (5 months ago)
- Topics: bpe, bpe-tokenizer, c, clang, llm, tokenizer
- Language: C
- Homepage:
- Size: 18.6 KB
- Stars: 1
- Watchers: 1
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
README
# bpe.c
Byte-Pair Encoding tokenizer for training large language models on huge datasets. I don't know C yet, so **most of the code comes from AI :D**
I hope to learn C by rewriting it and making changes, fixes, etc.
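
For readers new to Byte-Pair Encoding, here is a minimal sketch of a single training step: count adjacent token pairs, pick the most frequent one, and replace it with a new token id. This is not the code from this repository. The function names `most_frequent_pair` and `merge_pair` are hypothetical, and the quadratic pair counting is an assumption made for brevity; a tokenizer meant for huge datasets would likely use a hash table instead.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper (not from bpe.c): find the most frequent adjacent pair
 * of token ids in tokens[0..n). Writes the pair to *left and *right and
 * returns how often it occurs (0 if n < 2). Occurrences may overlap, and on
 * a tie the pair that appears first wins. O(n^2), for illustration only. */
size_t most_frequent_pair(const int *tokens, size_t n, int *left, int *right)
{
    assert(left != NULL && right != NULL);
    size_t best = 0;
    for (size_t i = 0; i + 1 < n; i++) {
        size_t count = 0;
        for (size_t j = 0; j + 1 < n; j++) {
            if (tokens[j] == tokens[i] && tokens[j + 1] == tokens[i + 1])
                count++;
        }
        if (count > best) {
            best = count;
            *left = tokens[i];
            *right = tokens[i + 1];
        }
    }
    return best;
}

/* Hypothetical helper (not from bpe.c): replace every non-overlapping
 * occurrence of (left, right) with new_id, in place, scanning left to
 * right. Returns the new length of the sequence. */
size_t merge_pair(int *tokens, size_t n, int left, int right, int new_id)
{
    size_t out = 0;
    size_t i = 0;
    while (i < n) {
        if (i + 1 < n && tokens[i] == left && tokens[i + 1] == right) {
            tokens[out++] = new_id; /* the pair collapses into one token */
            i += 2;
        } else {
            tokens[out++] = tokens[i++];
        }
    }
    return out;
}
```

Starting from raw bytes (ids 0 to 255), training would repeat these two calls, assigning ids 256, 257, and so on to each merged pair until the vocabulary reaches the desired size.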