https://github.com/vgherard/sbo
Utilities for training and evaluating text predictors based on Stupid Back-off N-gram models.
natural-language-processing ngram-models predictive-text sbo
- Host: GitHub
- URL: https://github.com/vgherard/sbo
- Owner: vgherard
- Created: 2020-08-01T22:20:34.000Z (almost 5 years ago)
- Default Branch: master
- Last Pushed: 2021-07-07T13:49:59.000Z (almost 4 years ago)
- Last Synced: 2025-04-30T09:53:53.007Z (2 months ago)
- Topics: natural-language-processing, ngram-models, predictive-text, sbo
- Language: R
- Homepage:
- Size: 24.1 MB
- Stars: 10
- Watchers: 3
- Forks: 2
- Open Issues: 6
Metadata Files:
- Readme: README.Rmd
README
---
output: github_document
---

```{r, include = FALSE}
knitr::opts_chunk$set(
collapse = TRUE,
comment = "#>",
fig.path = "man/figures/README-",
out.width = "100%"
)
```

# sbo

`sbo` provides utilities for building and evaluating text predictors based on
[Stupid Back-off](https://www.aclweb.org/anthology/D07-1090.pdf) N-gram models
in R. It includes functions such as:

- `kgram_freqs()`: Extract $k$-gram frequency tables from a text corpus.
- `sbo_predictor()`: Train a next-word predictor via Stupid Back-off.
- `eval_sbo_predictor()`: Test text predictions against an independent corpus.

## Installation

### Released version
You can install the latest release of `sbo` from CRAN:
``` r
install.packages("sbo")
```

### Development version

You can install the development version of `sbo` from GitHub:
``` r
# install.packages("devtools")
devtools::install_github("vgherard/sbo")
```

## Example

This example shows how to build a text predictor with `sbo`:
```{r example, message=FALSE, warning=FALSE}
library(sbo)
p <- sbo_predictor(sbo::twitter_train, # 50k tweets, example dataset
                   N = 3, # Train a 3-gram model
                   dict = sbo::twitter_dict, # Top 1k words appearing in corpus
                   .preprocess = sbo::preprocess, # Preprocessing transformation
                   EOS = ".?!:;" # End-Of-Sentence characters
                   )
```

The object `p` can now be used to generate predictive text as follows:

```{r}
predict(p, "i love") # a character vector
predict(p, "you love") # another character vector
predict(p,
        c("i love", "you love", "she loves", "we love", "you love", "they love")
        ) # a character matrix
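
# A hedged sketch (not part of the original example): checking the predictor
# against held-out text with eval_sbo_predictor(), which is listed above but
# not demonstrated. It assumes the `sbo::twitter_test` example dataset that
# ships with the package, and that the returned table has a logical `correct`
# column, as described in the package documentation.
set.seed(840) # the test (N-1)-grams are sampled at random from each sentence
evaluation <- eval_sbo_predictor(p, test = sbo::twitter_test)
mean(evaluation$correct) # share of test inputs where a prediction was correct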
```

## Related packages

For more general-purpose utilities for working with $n$-gram models, see also my package [`{kgrams}`](https://vgherard.github.io/kgrams/).

## Help
For help, see the `sbo` [website](https://vgherard.github.io/sbo/).