https://github.com/mlverse/hftokenizers
Hugging face tokenizers for R using extendr
- Host: GitHub
- URL: https://github.com/mlverse/hftokenizers
- Owner: mlverse
- License: other
- Archived: true
- Created: 2021-01-15T03:54:11.000Z (over 5 years ago)
- Default Branch: main
- Last Pushed: 2023-05-17T07:31:14.000Z (about 3 years ago)
- Last Synced: 2025-07-07T13:40:40.528Z (12 months ago)
- Language: Rust
- Homepage: https://mlverse.github.io/hftokenizers
- Size: 172 KB
- Stars: 11
- Watchers: 4
- Forks: 1
- Open Issues: 4
Metadata Files:
- Readme: README.Rmd
- License: LICENSE
README
---
output: github_document
---
```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  fig.path = "man/figures/README-",
  out.width = "100%"
)
```
# HuggingFace tokenizers from R
> This is an experimental project that binds the Hugging Face [tokenizers](https://github.com/huggingface/tokenizers) Rust library to R using the [extendr](https://github.com/extendr/extendr) project. Do **not** use it for anything meaningful yet.
## Installation
This repository uses the [helloextendr template](https://github.com/extendr/helloextendr).
Before you can install this package, you need a working Rust toolchain. We recommend installing it with [rustup](https://rustup.rs/).
On Windows, you'll also have to add the `i686-pc-windows-gnu` and `x86_64-pc-windows-gnu` targets:

    rustup target add x86_64-pc-windows-gnu
    rustup target add i686-pc-windows-gnu

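Before installing the package, it can help to confirm that the toolchain is actually visible on your `PATH`. The following is a minimal sketch, assuming a POSIX-style shell (on Windows, something like Git Bash); the helper name `check_tool` is hypothetical and not part of rustup or this package:

```shell
# Hypothetical helper: report whether a command is available on the PATH.
check_tool() {
  if command -v "$1" >/dev/null 2>&1; then
    echo "$1: found"
  else
    echo "$1: missing"
    return 1
  fi
}

# rustup installs rustc and cargo together; both should be reported as found.
check_tool rustc || echo "Install Rust via https://rustup.rs/ first"
check_tool cargo || echo "Install Rust via https://rustup.rs/ first"

# On Windows, list the installed targets to confirm the two GNU targets
# were added. This step is skipped when rustup itself is not installed.
if command -v rustup >/dev/null 2>&1; then
  rustup target list --installed
fi
```

If either tool is reported as missing, restarting the shell after running rustup is often enough, since the installer modifies `PATH`.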
Once Rust is working, you can install this package via:
``` {.r}
remotes::install_github("mlverse/hftokenizers")
```
## Small example
Here's a quick demo of what you can do with `hftokenizers`:
```{r}
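library(hftokenizers)

# Download a small text corpus to train on.
download.file(
  "https://raw.githubusercontent.com/mlverse/hftokenizers/main/tests/testthat/assets/small.txt",
  "small.txt"
)

# Train a BPE tokenizer on the corpus, encode a sentence, and return its token ids.
tokenizer$
  new(models_bpe$new())$
  train(normalizePath("small.txt"))$
  encode(c("hello world"))$
  ids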
```