---
output: github_document
---

```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  fig.path = "man/figures/README-",
  out.width = "100%"
)
```

# wordpiece

[![Lifecycle: experimental](https://img.shields.io/badge/lifecycle-experimental-orange.svg)](https://lifecycle.r-lib.org/articles/stages.html#experimental)

The goal of wordpiece is to allow for easy text tokenization using a wordpiece vocabulary.
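For example, a single string can be tokenized against the default vocabulary (a minimal sketch; the exact tokens returned depend on the vocabulary in use):

``` r
library(wordpiece)

# Tokenize one string using the default wordpiece vocabulary.
wordpiece_tokenize("I like tacos.")
```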

## Installation

You can install the released version of wordpiece from [CRAN](https://CRAN.R-project.org) with:

``` r
install.packages("wordpiece")
```

And the development version from [GitHub](https://github.com/) with:

``` r
# install.packages("devtools")
devtools::install_github("macmillancontentscience/wordpiece")
```

## Examples

This package can be used to tokenize text for modeling.
A common use case is to tokenize all of the text in a column of a data.frame or tibble.

```{r tokenize_df}
library(wordpiece)
library(dplyr, warn.conflicts = FALSE)

df_tokenized <- tibble(
  text = c(
    "I like tacos.",
    "I like apples with cheese.",
    "The unaffable coder wrote incorrect examples."
  )
) %>%
  mutate(
    tokens = wordpiece_tokenize(text)
  )

df_tokenized

df_tokenized$tokens[[1]]
```
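You can also supply your own wordpiece vocabulary instead of the default one. The sketch below assumes the `load_vocab()` helper and the `vocab` argument of `wordpiece_tokenize()` behave as named in the installed version of the package; check the package documentation for the exact interface.

``` r
# Assumed interface: load_vocab() reads a plain-text vocabulary file and
# wordpiece_tokenize() accepts the result via its `vocab` argument.
# "path/to/vocab.txt" is a placeholder path.
vocab <- load_vocab("path/to/vocab.txt")
wordpiece_tokenize("I like tacos.", vocab = vocab)
```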

## Code of Conduct

Please note that the wordpiece project is released with a [Contributor Code of Conduct](https://contributor-covenant.org/version/2/0/CODE_OF_CONDUCT.html).
By contributing to this project, you agree to abide by its terms.

## Disclaimer

This is not an officially supported Macmillan Learning product.

## Contact information

Questions or comments should be directed to Jonathan Bratt (jonathan.bratt@macmillan.com) and Jon Harmon (jonthegeek@gmail.com).