
https://github.com/lucidrains/nim-tokenizer

Implementation of a simple BPE tokenizer, but in Nim

artificial-intelligence deep-learning language-models nim tokenizer


## Nim Tokenizer (wip)

A simple byte-pair encoding (BPE) tokenizer implemented in Nim. It may also include BPE-Dropout.
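For readers unfamiliar with the two techniques, here is a minimal sketch of BPE training and of encoding with optional BPE-Dropout. It is written in Python for illustration only and is not this repository's Nim code. The function names (`train_bpe`, `merge_pair`, `encode`) and the `dropout` and `rng` parameters are hypothetical, and the sketch works on characters rather than bytes.

```python
import random
from collections import Counter


def merge_pair(symbols, pair):
    """Replace every adjacent occurrence of `pair` in `symbols` with one merged symbol."""
    out, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return tuple(out)


def train_bpe(corpus, num_merges):
    """Learn up to `num_merges` merge rules from a list of words (character-level)."""
    # Each word is a tuple of symbols, mapped to how often it occurs.
    words = Counter(tuple(word) for word in corpus)
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs, weighted by word frequency.
        pairs = Counter()
        for symbols, freq in words.items():
            for pair in zip(symbols, symbols[1:]):
                pairs[pair] += freq
        if not pairs:
            break
        # Most frequent pair wins; ties are broken by the pair itself for determinism.
        best = max(pairs, key=lambda p: (pairs[p], p))
        merges.append(best)
        merged_words = Counter()
        for symbols, freq in words.items():
            merged_words[merge_pair(symbols, best)] += freq
        words = merged_words
    return merges


def encode(word, merges, dropout=0.0, rng=None):
    """Segment `word` with the learned merges.

    With dropout > 0, each candidate merge is skipped with probability
    `dropout` at every step, following the idea of BPE-Dropout.
    """
    rng = rng or random.Random()
    rank = {pair: i for i, pair in enumerate(merges)}
    symbols = list(word)
    while len(symbols) > 1:
        # Collect the merges that are possible now and survive dropout.
        candidates = [
            (rank[pair], i)
            for i, pair in enumerate(zip(symbols, symbols[1:]))
            if pair in rank and rng.random() >= dropout
        ]
        if not candidates:
            break
        # Apply the earliest-learned merge (leftmost occurrence first).
        _, i = min(candidates)
        symbols[i:i + 2] = [symbols[i] + symbols[i + 1]]
    return symbols
```

With `dropout=0.0` the segmentation is deterministic. With `dropout=1.0` every merge is skipped and the word falls back to single characters. Values in between give a different segmentation of the same word on each call, which is what makes the technique usable as subword regularization during training.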

## Todo

- [ ] Figure out how StarCoder treats whitespace specially and make sure that treatment is supported

## Citations

```bibtex
@inproceedings{Wang2019NeuralMT,
title = {Neural Machine Translation with Byte-Level Subwords},
author = {Changhan Wang and Kyunghyun Cho and Jiatao Gu},
booktitle = {AAAI Conference on Artificial Intelligence},
year = {2019}
}
```

```bibtex
@inproceedings{provilkov-etal-2020-bpe,
title = "{BPE}-Dropout: Simple and Effective Subword Regularization",
author = "Provilkov, Ivan and Emelianenko, Dmitrii and Voita, Elena",
booktitle = "Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics",
month = jul,
year = "2020",
address = "Online",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/2020.acl-main.170",
doi = "10.18653/v1/2020.acl-main.170",
pages = "1882--1892",
}
```