# tokenizers.bpe - R package for Byte Pair Encoding

This repository contains an R package which is an Rcpp wrapper around the YouTokenToMe C++ library.

- YouTokenToMe is an unsupervised text tokenizer focused on computational efficiency
- It currently implements fast Byte Pair Encoding (BPE) [[Sennrich et al.](https://aclanthology.org/P16-1162/)]
- YouTokenToMe is available at https://github.com/VKCOM/YouTokenToMe
- Note that the flat_hash_map used in YouTokenToMe was replaced by [parallel-hashmap](https://github.com/greg7mdp/parallel-hashmap)

## Features

The R package allows you to

- build a Byte Pair Encoding (BPE) model
- apply the model to encode text
- apply the model to decode ids back to text

## Installation

- For regular users, install the package from your local CRAN mirror: `install.packages("tokenizers.bpe")`
- For the development version of this package: `remotes::install_github("bnosac/tokenizers.bpe")`

Have a look at the documentation of the functions:

```
help(package = "tokenizers.bpe")
```
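For example, the help pages of the individual functions used further in this README can be opened directly:

```{r}
?bpe
?bpe_encode
?bpe_decode
```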

## Example

- As an example, let's take some training data containing questions asked in the Belgian Parliament in 2017 and focus on the French text only.

```{r}
library(tokenizers.bpe)
data(belgium_parliament, package = "tokenizers.bpe")
x <- subset(belgium_parliament, language == "french")
writeLines(text = x$text, con = "traindata.txt")
```
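Before training, a quick look at the training data does not hurt. The sketch below only uses the `language` and `text` columns of the dataset already used above.

```{r}
# How many French documents are there and roughly how long are they?
nrow(x)
summary(nchar(x$text))
```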

- Train a model on text data and inspect the vocabulary

```{r}
model <- bpe("traindata.txt", coverage = 0.999, vocab_size = 5000)
model
```

```
Byte Pair Encoding model trained with YouTokenToMe
size of the vocabulary: 5000
model stored at: C:/Users/Jan/Dropbox/Work/RForgeBNOSAC/BNOSAC/tokenizers.bpe/youtokentome.bpe
```
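By default the model is written to a file named `youtokentome.bpe` in the current working directory (the path printed above). The sketch below assumes the `model_path` and `threads` arguments documented for `bpe()`; check `?bpe` for the exact argument names and defaults.

```{r}
# Assumed arguments (see ?bpe): store the model at a chosen location
# and train using 2 CPU threads
model <- bpe("traindata.txt", coverage = 0.999, vocab_size = 5000,
             threads = 2, model_path = file.path(getwd(), "parliament-fr.bpe"))
```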

```{r}
str(model$vocabulary)
```

```
'data.frame': 5000 obs. of 2 variables:
$ id : int 0 1 2 3 4 5 6 7 8 9 ...
 $ subword: chr "<PAD>" "<UNK>" "<BOS>" "<EOS>" ...
```
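The vocabulary is a regular data.frame with `id` and `subword` columns, so it can be inspected with base R; for instance:

```{r}
# First entries of the vocabulary and lookup of a specific subword
head(model$vocabulary, n = 10)
subset(model$vocabulary, subword == "▁femmes")
```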

- Use the model to encode text

```{r}
text <- c("L'appartement est grand & vraiment bien situe en plein centre",
"Proportion de femmes dans les situations de famille monoparentale.")
bpe_encode(model, x = text, type = "subwords")
```

```
[[1]]
[1] "▁L'" "app" "ar" "tement" "▁est" "▁grand" "▁" "&" "▁v" "r" "ai" "ment" "▁bien" "▁situe" "▁en" "▁plein" "▁centre"

[[2]]
[1] "▁Pro" "por" "tion" "▁de" "▁femmes" "▁dans" "▁les" "▁situations" "▁de" "▁famille" "▁mon" "op" "ar" "ent" "ale."
```

```{r}
bpe_encode(model, x = text, type = "ids")
```

```
[[1]]
[1] 421 327 98 554 178 1521 4 1 117 11 101 99 679 4599 113 3702 2126

[[2]]
[1] 1529 4878 92 76 2321 162 108 4099 76 3218 791 312 98 87 2546
```

- Use the model to decode byte pair encodings back to text

```{r}
x <- bpe_encode(model, x = text, type = "ids")
bpe_decode(model, x)
```

```
[[1]]
[1] "L'appartement est grand vraiment bien situe en plein centre"

[[2]]
[1] "Proportion de femmes dans les situations de famille monoparentale."
```
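A trained model is just a file on disk, so it can be reloaded in a later R session. A minimal sketch, assuming the default file name `youtokentome.bpe` shown when printing the model and the `bpe_load_model()` loader from this package:

```{r}
# Reload the model written to the working directory and reuse it
model <- bpe_load_model("youtokentome.bpe")
bpe_encode(model, x = "Le centre de la ville", type = "subwords")
```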

## Support in text mining

Need support in text mining?
Contact BNOSAC: http://www.bnosac.be