{"id":18301213,"url":"https://github.com/jaymody/ocaml-tokenizers","last_synced_at":"2025-04-09T09:45:13.551Z","repository":{"id":197041761,"uuid":"697640347","full_name":"jaymody/ocaml-tokenizers","owner":"jaymody","description":"Transformer tokenizers in OCaml.","archived":false,"fork":false,"pushed_at":"2024-01-10T16:48:37.000Z","size":561,"stargazers_count":0,"open_issues_count":0,"forks_count":0,"subscribers_count":2,"default_branch":"main","last_synced_at":"2025-02-15T03:46:48.290Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":null,"language":"OCaml","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/jaymody.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null}},"created_at":"2023-09-28T07:06:37.000Z","updated_at":"2023-09-28T07:06:55.000Z","dependencies_parsed_at":null,"dependency_job_id":"f3616407-2cc7-4453-9eef-cbf1fc6b7f4a","html_url":"https://github.com/jaymody/ocaml-tokenizers","commit_stats":null,"previous_names":["jaymody/ocaml-tokenizers"],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/jaymody%2Focaml-tokenizers","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/jaymody%2Focaml-tokenizers/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/jaymody%2Focaml-tokenizers/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/jaymody%2Focaml-tokenizers/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/jaymody","download_url":"https://codeload.github.com/jaymody/ocaml-tokenizers/tar.gz/refs/heads/main","host":{"name":"G
itHub","url":"https://github.com","kind":"github","repositories_count":248017871,"owners_count":21034042,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2024-11-05T15:14:50.668Z","updated_at":"2025-04-09T09:45:13.525Z","avatar_url":"https://github.com/jaymody.png","language":"OCaml","funding_links":[],"categories":[],"sub_categories":[],"readme":"# OCaml Tokenizers\nTransformer tokenizers in OCaml.\n\nCurrently, only BPE \"inference\" is implemented, but I hope to expand it with training and WordPiece tokenization.\n\n### Usage\nInstall dependencies:\n```shell\nopam switch create . -w\n```\n\nRun the CLI (converts stdin text to BPE token IDs):\n```shell\n\u003e printf \"This is some text\" | dune exec -- bin/main.exe\n1212\n318\n617\n2420\n```\n\n### Test\nTo compare the BPE implementation to [`tiktoken`](https://github.com/openai/tiktoken), run:\n\n```shell\ncat some_file.txt | python -c \"import sys;import tiktoken;print(*tiktoken.get_encoding('gpt2').encode(sys.stdin.read()),sep='\\n')\"\n```\n\nAnd compare with:\n\n```shell\ncat some_file.txt | dune exec -- bin/main.exe\n```\n\n### Todo\n- [ ] Add ability to download BPE vocab files.\n- [ ] Implement training for BPE.\n- [ ] Fix issue in BPE where the Python version doesn't merge two consecutive newlines (leaves them as [198, 198], i.e. [\\n, \\n]) while this version merges them (to [628], i.e. [\\n\\n]). 
This is due to the last part of the [original implementation's regex](https://github.com/openai/gpt-2/blob/a74da5d99abaaba920de8131d64da2862a8f213b/src/encoder.py#L53), `\"\\s+(?!\\S)\"`, which will always separate two consecutive `\\n` if the second `\\n` is followed by a non-whitespace character. Effectively, this means that BPE tokenization without the regex gives a slightly different result. This is more a bug in the OpenAI implementation (ideally, BPE should give the same result with or without the regex), but it should nonetheless be dealt with since it might affect generation negatively.\n- [ ] Implement WordPiece tokenization for BERT, reference: https://github.com/google-research/bert/blob/master/tokenization.py\n- [ ] Add example of streaming the output of tokenization to something like `llama.c`.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fjaymody%2Focaml-tokenizers","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fjaymody%2Focaml-tokenizers","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fjaymody%2Focaml-tokenizers/lists"}