Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://github.com/lucidrains/nim-tokenizer
Implementation of a simple BPE tokenizer, but in Nim
- Host: GitHub
- URL: https://github.com/lucidrains/nim-tokenizer
- Owner: lucidrains
- License: mit
- Created: 2022-12-30T21:37:09.000Z (about 2 years ago)
- Default Branch: main
- Last Pushed: 2023-07-02T17:28:49.000Z (over 1 year ago)
- Last Synced: 2024-10-30T02:43:59.817Z (3 months ago)
- Topics: artificial-intelligence, deep-learning, language-models, nim, tokenizer
- Language: Nim
- Homepage:
- Size: 5.86 KB
- Stars: 21
- Watchers: 3
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
## Nim Tokenizer (wip)
An implementation of a simple byte-pair encoding (BPE) tokenizer in Nim. It may also include BPE-Dropout.
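For readers unfamiliar with the technique, here is a minimal sketch of character-level BPE with an optional dropout probability. It is written in Python for illustration only and is not the repository's Nim code; the names `merge_pair`, `train_bpe`, and `encode` are hypothetical, and the dropout step is a simplified variant of BPE-Dropout in which each candidate merge is skipped with probability `dropout` at every step.

```python
import random
from collections import Counter


def merge_pair(symbols, pair):
    """Replace every adjacent occurrence of `pair` in `symbols` with the joined symbol."""
    out = []
    i = 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return tuple(out)


def train_bpe(corpus, num_merges):
    """Learn up to `num_merges` merge rules from a list of words."""
    # Each word starts as a tuple of single characters.
    words = Counter(tuple(word) for word in corpus)
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs, weighted by word frequency.
        pairs = Counter()
        for symbols, freq in words.items():
            for pair in zip(symbols, symbols[1:]):
                pairs[pair] += freq
        if not pairs:
            break
        # Pick the most frequent pair (ties go to the pair seen first).
        best = max(pairs, key=pairs.get)
        merges.append(best)
        # Apply the new merge rule to every word.
        merged_words = Counter()
        for symbols, freq in words.items():
            merged_words[merge_pair(symbols, best)] += freq
        words = merged_words
    return merges


def encode(word, merges, dropout=0.0, rng=random):
    """Split `word` into subwords by applying learned merges in priority order.

    With dropout > 0, each candidate merge is randomly skipped at every step
    (a simplified take on BPE-Dropout), so the same word can be segmented
    differently on different calls.
    """
    ranks = {pair: i for i, pair in enumerate(merges)}
    symbols = tuple(word)
    while len(symbols) > 1:
        candidates = [
            (ranks[pair], pair)
            for pair in zip(symbols, symbols[1:])
            if pair in ranks and not (dropout > 0 and rng.random() < dropout)
        ]
        if not candidates:
            break
        # The lowest rank is the earliest-learned, highest-priority merge.
        _, best = min(candidates)
        symbols = merge_pair(symbols, best)
    return list(symbols)
```

As a usage example, training on `["low", "low", "lower", "lowest"]` with two merges learns `("l", "o")` and then `("lo", "w")`, so `encode("lowest", merges)` returns `["low", "e", "s", "t"]`. With `dropout=1.0` every merge is skipped and the word falls back to single characters.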
## Todo
- [ ] Figure out how StarCoder treats whitespace and make sure that treatment is supported
## Citations
```bibtex
@inproceedings{Wang2019NeuralMT,
title = {Neural Machine Translation with Byte-Level Subwords},
author = {Changhan Wang and Kyunghyun Cho and Jiatao Gu},
booktitle = {AAAI Conference on Artificial Intelligence},
year = {2019}
}
```

```bibtex
@inproceedings{provilkov-etal-2020-bpe,
title = "{BPE}-Dropout: Simple and Effective Subword Regularization",
author = "Provilkov, Ivan and Emelianenko, Dmitrii and Voita, Elena",
booktitle = "Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics",
month = jul,
year = "2020",
address = "Online",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/2020.acl-main.170",
doi = "10.18653/v1/2020.acl-main.170",
pages = "1882--1892",
}
```