{"id":15601087,"url":"https://github.com/lucidrains/nim-tokenizer","last_synced_at":"2025-04-09T16:18:01.158Z","repository":{"id":96485861,"uuid":"583771692","full_name":"lucidrains/nim-tokenizer","owner":"lucidrains","description":"Implementation of a simple BPE tokenizer, but in Nim","archived":false,"fork":false,"pushed_at":"2023-07-02T17:28:49.000Z","size":6,"stargazers_count":22,"open_issues_count":0,"forks_count":0,"subscribers_count":3,"default_branch":"main","last_synced_at":"2025-04-09T16:17:54.654Z","etag":null,"topics":["artificial-intelligence","deep-learning","language-models","nim","tokenizer"],"latest_commit_sha":null,"homepage":"","language":"Nim","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/lucidrains.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2022-12-30T21:37:09.000Z","updated_at":"2024-11-28T10:00:08.000Z","dependencies_parsed_at":null,"dependency_job_id":"0e7c3fd1-b9bf-4cd7-ae2b-3c8f6285f704","html_url":"https://github.com/lucidrains/nim-tokenizer","commit_stats":{"total_commits":6,"total_committers":1,"mean_commits":6.0,"dds":0.0,"last_synced_commit":"60f33b2e59fb2e516530442ce033d004c4d72517"},"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/lucidrains%2Fnim-tokenizer","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/lucidrains%2Fnim-tokenizer/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/lucidrains%2Fnim-tokenizer/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/lucidrains%2Fnim-tokenizer/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/lucidrains","download_url":"https://codeload.github.com/lucidrains/nim-tokenizer/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":248065281,"owners_count":21041872,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["artificial-intelligence","deep-learning","language-models","nim","tokenizer"],"created_at":"2024-10-03T02:14:17.448Z","updated_at":"2025-04-09T16:18:01.101Z","avatar_url":"https://github.com/lucidrains.png","language":"Nim","funding_links":[],"categories":[],"sub_categories":[],"readme":"## Nim Tokenizer (wip)\n\nImplementation of a simple BPE tokenizer, but in \u003ca href=\"https://nim-lang.org/\"\u003eNim\u003c/a\u003e. May contain \u003ca href=\"https://arxiv.org/abs/1910.13267\"\u003eBPE Dropout\u003c/a\u003e too\n\n## Todo\n\n- [ ] figure out the special treatment of whitespaces as done in \u003ca href=\"https://arxiv.org/abs/2305.06161\"\u003estarcoder\u003c/a\u003e and make sure it is supported\n\n## Citations\n\n```bibtex\n@inproceedings{Wang2019NeuralMT,\n    title   = {Neural Machine Translation with Byte-Level Subwords},\n    author  = {Changhan Wang and Kyunghyun Cho and Jiatao Gu},\n    booktitle = {AAAI Conference on Artificial Intelligence},\n    year    = {2019}\n}\n```\n\n```bibtex\n@inproceedings{provilkov-etal-2020-bpe,\n    title   = \"{BPE}-Dropout: Simple and Effective Subword Regularization\",\n    author  = \"Provilkov, Ivan  and Emelianenko, Dmitrii  and Voita, Elena\",\n    booktitle = \"Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics\",\n    month   = jul,\n    year    = \"2020\",\n    address = \"Online\",\n    publisher = \"Association for Computational Linguistics\",\n    url     = \"https://aclanthology.org/2020.acl-main.170\",\n    doi     = \"10.18653/v1/2020.acl-main.170\",\n    pages   = \"1882--1892\",\n}\n```\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Flucidrains%2Fnim-tokenizer","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Flucidrains%2Fnim-tokenizer","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Flucidrains%2Fnim-tokenizer/lists"}