{"id":17717631,"url":"https://github.com/davzim/rtiktoken","last_synced_at":"2025-05-06T18:43:13.509Z","repository":{"id":257935116,"uuid":"870701151","full_name":"DavZim/rtiktoken","owner":"DavZim","description":"BPE Tokenizer for OpenAI's models","archived":false,"fork":false,"pushed_at":"2025-04-14T19:54:28.000Z","size":14096,"stargazers_count":12,"open_issues_count":0,"forks_count":1,"subscribers_count":1,"default_branch":"master","last_synced_at":"2025-04-14T20:38:39.915Z","etag":null,"topics":["bpe","openai","r","rust","tokenization"],"latest_commit_sha":null,"homepage":"https://davzim.github.io/rtiktoken/","language":"R","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"other","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/DavZim.png","metadata":{"files":{"readme":"README.Rmd","changelog":"NEWS.md","contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2024-10-10T14:08:02.000Z","updated_at":"2025-04-14T19:51:44.000Z","dependencies_parsed_at":"2024-10-17T02:59:53.791Z","dependency_job_id":"b9f3cc52-5495-4876-8c6f-6edf13b8930b","html_url":"https://github.com/DavZim/rtiktoken","commit_stats":null,"previous_names":["davzim/rtiktoken"],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/DavZim%2Frtiktoken","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/DavZim%2Frtiktoken/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/DavZim%2Frtiktoken/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/DavZim%2Frtiktoken/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners
/DavZim","download_url":"https://codeload.github.com/DavZim/rtiktoken/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":252747136,"owners_count":21798090,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["bpe","openai","r","rust","tokenization"],"created_at":"2024-10-25T14:27:24.996Z","updated_at":"2025-05-06T18:43:13.483Z","avatar_url":"https://github.com/DavZim.png","language":"R","readme":"---\noutput: github_document\n---\n\n\u003c!-- README.md is generated from README.Rmd. Please edit that file --\u003e\n\n```{r, include = FALSE}\nknitr::opts_chunk$set(\n  collapse = TRUE,\n  comment = \"#\u003e\",\n  fig.path = \"man/figures/README-\",\n  out.width = \"100%\"\n)\n```\n\n# rtiktoken\n\n\u003c!-- badges: start --\u003e\n[![R-CMD-check](https://github.com/DavZim/rtiktoken/actions/workflows/R-CMD-check.yaml/badge.svg)](https://github.com/DavZim/rtiktoken/actions/workflows/R-CMD-check.yaml)\n[![CRAN status](https://www.r-pkg.org/badges/version/rtiktoken)](https://CRAN.R-project.org/package=rtiktoken)\n\u003c!-- badges: end --\u003e\n\n`{rtiktoken}` is a thin wrapper around [`tiktoken-rs`](https://github.com/zurawiki/tiktoken-rs) (and in turn around [OpenAI's Python library `tiktoken`](https://github.com/openai/tiktoken)).\nIt provides functions to encode text into tokens used by OpenAI's models and decode tokens back into text using [BPE](https://en.wikipedia.org/wiki/Byte_pair_encoding) tokenizers.\nIt is also useful to count the number of tokens in a text to estimate how expensive a call to OpenAI's API would 
be.\nNote that all tokenization happens offline and no internet connection is required.\n\nAnother use-case is to compute similarity scores between texts using tokens.\n\nOther use-cases can be found in OpenAI's cookbook [How to Count Tokens with `tiktoken`](https://github.com/openai/openai-cookbook/blob/main/examples/How_to_count_tokens_with_tiktoken.ipynb).\n\nTo verify the outputs of the functions, see also [OpenAI's Tokenizer Platform](https://platform.openai.com/tokenizer).\n\n\n## Installation\n\nYou can install `rtiktoken` like so:\n\n``` r\n# Dev version\n# install.packages(\"devtools\")\n# devtools::install_github(\"DavZim/rtiktoken\")\n\n# CRAN version\ninstall.packages(\"rtiktoken\")\n```\n\n\n## Example\n\n```{r example}\nlibrary(rtiktoken)\n\n# 1. Encode text into tokens\ntext \u003c- c(\n  \"Hello World, this is a text that we are going to use in rtiktoken!\",\n  \"Note that the functions are vectorized! Yay!\"\n)\ntokens \u003c- get_tokens(text, \"gpt-4o\")\ntokens\n\n# 2. Decode tokens back into text\ndecoded_text \u003c- decode_tokens(tokens, \"gpt-4o\")\ndecoded_text\n\n# Note that decoding is not guaranteed to reproduce the original text, as\n# text parts might be dropped if no token match is found.\nidentical(text, decoded_text)\n\n# 3. 
Count the number of tokens in a text\nn_tokens \u003c- get_token_count(text, \"gpt-4o\")\nn_tokens\n```\n\n\n### Models \u0026 Tokenizers\n\nThe different OpenAI models use different tokenizers (see also the [source code of `tiktoken-rs`](https://github.com/zurawiki/tiktoken-rs/blob/main/tiktoken-rs/src/tokenizer.rs) for a full list).\n\nThe models use the following tokenizers (note that all functions of this package accept both model names and tokenizer names):\n\n| Model Name | Tokenizer Name |\n|------------|----------------|\n| GPT-4o models   | `o200k_base`   |\n| ChatGPT models, e.g., `text-embedding-ada-002`, `gpt-3.5-turbo`, `gpt-4-` | `cl100k_base` |\n| Code models, e.g., `text-davinci-002`, `text-davinci-003` | `p50k_base` |\n| Edit models, e.g., `text-davinci-edit-001`, `code-davinci-edit-001` | `p50k_edit` |\n| GPT-3 models, e.g., `davinci` | `r50k_base` or `gpt2` |\n\n\n## Development\n\n`rtiktoken` is built using `extendr` and `Rust`.\nTo build the package, you need to have `Rust` installed on your machine.\n\n```r\nrextendr::document()\ndevtools::document()\n```\n","funding_links":[],"categories":[],"sub_categories":[],"project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fdavzim%2Frtiktoken","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fdavzim%2Frtiktoken","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fdavzim%2Frtiktoken/lists"}