https://github.com/macmillancontentscience/wordpiece
- Host: GitHub
- URL: https://github.com/macmillancontentscience/wordpiece
- Owner: macmillancontentscience
- License: other
- Created: 2020-12-02T03:28:21.000Z (over 4 years ago)
- Default Branch: main
- Last Pushed: 2022-03-03T16:21:28.000Z (about 3 years ago)
- Last Synced: 2024-10-28T17:32:34.843Z (6 months ago)
- Language: R
- Size: 92.8 KB
- Stars: 8
- Watchers: 2
- Forks: 1
- Open Issues: 4
Metadata Files:
- Readme: README.Rmd
- Contributing: .github/CONTRIBUTING.md
- License: LICENSE.md
- Code of conduct: CODE_OF_CONDUCT.md
README
---
output: github_document
---

```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  fig.path = "man/figures/README-",
  out.width = "100%"
)
```

# wordpiece

[Lifecycle: experimental](https://lifecycle.r-lib.org/articles/stages.html#experimental)

The goal of wordpiece is to allow for easy text tokenization using a wordpiece vocabulary.
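
To illustrate what tokenizing with a wordpiece vocabulary means, here is a minimal base R sketch of the greedy longest-match-first scheme commonly used by wordpiece tokenizers. It is not this package's implementation: `toy_vocab` and `toy_wordpiece()` are hypothetical names invented for this illustration, and the `##` continuation prefix and `[UNK]` token follow the convention popularized by BERT-style vocabularies, which may differ from what a given vocabulary uses.

``` r
# Toy vocabulary (hypothetical; real vocabularies hold thousands of entries).
# "##" marks a piece that continues a word rather than starting one.
toy_vocab <- c("un", "##aff", "##able", "taco", "##s", "[UNK]")

# Greedy longest-match-first split of a single word (illustrative only).
toy_wordpiece <- function(word, vocab, unk = "[UNK]") {
  pieces <- character(0)
  start <- 1L
  n <- nchar(word)
  while (start <= n) {
    end <- n
    found <- NA_character_
    # Try the longest remaining substring first, then shrink it.
    while (end >= start) {
      candidate <- substr(word, start, end)
      if (start > 1L) candidate <- paste0("##", candidate)
      if (candidate %in% vocab) {
        found <- candidate
        break
      }
      end <- end - 1L
    }
    # If nothing matches, the whole word maps to the unknown token.
    if (is.na(found)) return(unk)
    pieces <- c(pieces, found)
    start <- end + 1L
  }
  pieces
}

toy_wordpiece("unaffable", toy_vocab) # returns c("un", "##aff", "##able")
toy_wordpiece("tacos", toy_vocab)     # returns c("taco", "##s")
toy_wordpiece("cheese", toy_vocab)    # returns "[UNK]"
```

The sketch shows why a small vocabulary can still cover rare words: a word that is not in the vocabulary as a whole is broken into known pieces, and only a word with no matching pieces falls back to the unknown token.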
## Installation
You can install the released version of wordpiece from [CRAN](https://CRAN.R-project.org) with:
``` r
install.packages("wordpiece")
```

And the development version from [GitHub](https://github.com/) with:
``` r
# install.packages("devtools")
devtools::install_github("macmillancontentscience/wordpiece")
```

## Examples

This package can be used to tokenize text for modeling.
A common use case is to tokenize all of the text in a data.frame or tibble.

```{r tokenize_df}
library(wordpiece)
library(dplyr, warn.conflicts = FALSE)

df_tokenized <- tibble(
  text = c(
    "I like tacos.",
    "I like apples with cheese.",
    "The unaffable coder wrote incorrect examples."
  )
) %>%
  mutate(
    tokens = wordpiece_tokenize(text)
  )

df_tokenized
df_tokenized$tokens[[1]]
```

## Code of Conduct

Please note that the wordpiece project is released with a [Contributor Code of Conduct](https://contributor-covenant.org/version/2/0/CODE_OF_CONDUCT.html).
By contributing to this project, you agree to abide by its terms.

## Disclaimer

This is not an officially supported Macmillan Learning product.
## Contact information
Questions or comments should be directed to Jonathan Bratt (jonathan.bratt@macmillan.com) and Jon Harmon (jonthegeek@gmail.com).