https://github.com/mlverse/hftokenizers

Hugging face tokenizers for R using extendr

---
output: github_document
---

```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  fig.path = "man/figures/README-",
  out.width = "100%"
)
```

# HuggingFace tokenizers from R

[![R build status](https://github.com/mlverse/hftokenizers/workflows/R-CMD-check/badge.svg)](https://github.com/mlverse/hftokenizers/actions)

> This is an experimental project binding the HuggingFace [tokenizers](https://github.com/huggingface/tokenizers) Rust library to R using the [extendr](https://github.com/extendr/extendr) project. Do **not** use it for anything meaningful yet.

## Installation

This repository uses the [helloextendr template](https://github.com/extendr/helloextendr).

Before you can install this package, you need a working Rust toolchain. We recommend installing it with [rustup](https://rustup.rs/).

On Windows, you'll also have to add the `i686-pc-windows-gnu` and `x86_64-pc-windows-gnu` targets:

    rustup target add x86_64-pc-windows-gnu
    rustup target add i686-pc-windows-gnu
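Before installing, it can help to confirm from R that the toolchain is visible. The following is a minimal sketch using only base R, not something this package provides. It assumes the tools are named `rustc`, `cargo` and `rustup` on your `PATH`; the variable names (`tools`, `paths`, `found`) are illustrative.

```r
# Look up the Rust tools on the PATH; Sys.which() returns "" for anything missing.
tools <- c("rustc", "cargo", "rustup")
paths <- Sys.which(tools)
found <- setNames(nzchar(paths), tools)

for (tool in tools) {
  if (found[[tool]]) {
    message(tool, ": ", paths[[tool]])
  } else {
    message(tool, ": not found on PATH")
  }
}

# On Windows, list the installed targets so the two GNU targets can be confirmed.
if (.Platform$OS.type == "windows" && found[["rustup"]]) {
  installed <- system2("rustup", c("target", "list", "--installed"), stdout = TRUE)
  wanted <- c("x86_64-pc-windows-gnu", "i686-pc-windows-gnu")
  print(setNames(wanted %in% installed, wanted))
}
```

If a tool is reported as missing, restarting the R session after installing Rust is often enough for the updated `PATH` to be picked up.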

Once Rust is working, you can install this package via:

``` {.r}
remotes::install_github("mlverse/hftokenizers")
```

## Small example

Here's a quick demo of what you can do with `hftokenizers`:

```{r}
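library(hftokenizers)

# Fetch a small text file from the package's test assets to train on.
download.file(
  "https://raw.githubusercontent.com/mlverse/hftokenizers/main/tests/testthat/assets/small.txt",
  "small.txt"
)

# Build a tokenizer around a BPE model, train it on the file,
# then encode a sentence and return its token ids.
tokenizer$
  new(models_bpe$new())$
  train(normalizePath("small.txt"))$
  encode(c("hello world"))$
  ids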
```
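The chained call above can also be written one step at a time, which makes it easier to inspect intermediate objects. This sketch uses only the calls shown in the example; the variable names (`model`, `tok`, `enc`, `ids`) are illustrative, and it assumes `small.txt` has already been downloaded and that the package is installed (the code is skipped otherwise).

```r
# Run only if the package is available, since it is experimental
# and needs a Rust toolchain to install.
if (requireNamespace("hftokenizers", quietly = TRUE) && file.exists("small.txt")) {
  library(hftokenizers)

  model <- models_bpe$new()                     # untrained BPE model
  tok <- tokenizer$new(model)                   # tokenizer wrapping that model
  tok <- tok$train(normalizePath("small.txt"))  # train on the downloaded file
  enc <- tok$encode(c("hello world"))           # encode a sentence
  ids <- enc$ids                                # token ids of the encoding
  print(ids)
} else {
  ids <- NULL
  message("hftokenizers or small.txt not available; skipping the example.")
}
```

Reassigning `tok` after `train()` mirrors the chain, where each step is called on the value returned by the previous one.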