https://github.com/vgherard/sbo

Utilities for training and evaluating text predictors based on Stupid Back-off N-gram models.

Host: GitHub
URL: https://github.com/vgherard/sbo
Owner: vgherard
Created: 2020-08-01T22:20:34.000Z (almost 5 years ago)
Default Branch: master
Last Pushed: 2021-07-07T13:49:59.000Z (almost 4 years ago)
Last Synced: 2025-04-30T09:53:53.007Z (2 months ago)
Topics: natural-language-processing, ngram-models, predictive-text, sbo
Language: R
Homepage:
Size: 24.1 MB
Stars: 10
Watchers: 3
Forks: 2
Open Issues: 6
Metadata Files:
- Readme: README.Rmd

README

---

output: github_document

---

```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  fig.path = "man/figures/README-",
  out.width = "100%"
)
```

# sbo

[![AppVeyor build status](https://ci.appveyor.com/api/projects/status/github/vgherard/sbo?branch=master&svg=true)](https://ci.appveyor.com/project/vgherard/sbo)

[![CircleCI build status](https://circleci.com/gh/vgherard/sbo.svg?style=svg)](https://circleci.com/gh/vgherard/sbo)

[![GitHub Actions build status](https://github.com/vgherard/sbo/workflows/R-CMD-check/badge.svg)](https://github.com/vgherard/sbo/actions)

[![Codecov test coverage](https://codecov.io/gh/vgherard/sbo/branch/master/graph/badge.svg)](https://codecov.io/gh/vgherard/sbo?branch=master)

[![CRAN status](https://www.r-pkg.org/badges/version/sbo)](https://CRAN.R-project.org/package=sbo)

[![CRAN downloads](http://cranlogs.r-pkg.org/badges/grand-total/sbo)](https://CRAN.R-project.org/package=sbo)

[![Tweet](https://img.shields.io/twitter/url/http/shields.io.svg?style=social)](https://twitter.com/intent/tweet?text={sbo}: Stupid Back-Off N-gram Models in R&url=https://vgherard.github.io/sbo&via=ValerioGherardi&hashtags=rstats,nlp,ngrams)

`sbo` provides utilities for building and evaluating text predictors based on
[Stupid Back-off](https://www.aclweb.org/anthology/D07-1090.pdf) N-gram models
in R. It includes functions such as:

- `kgram_freqs()`: Extract $k$-gram frequency tables from a text corpus.
- `sbo_predictor()`: Train a next-word predictor via Stupid Back-off.
- `eval_sbo_predictor()`: Test text predictions against an independent corpus.
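
To make the idea behind these functions concrete, here is a minimal base-R sketch of Stupid Back-off scoring on a toy corpus. It is an illustration only, not the implementation used by `sbo`: the corpus and the helper names `count_kgrams()`, `get_count()` and `sbo_score()` are hypothetical, sentence-boundary tokens and dictionaries are ignored, and the back-off factor `lambda = 0.4` is the value suggested in the Stupid Back-off paper linked above.

``` r
# Toy corpus: each element is one already-tokenised sentence (hypothetical data).
sentences <- list(
  c("i", "love", "you"),
  c("i", "love", "pizza"),
  c("you", "love", "pizza")
)

# Count all k-grams of order k, returned as a named vector
# whose names are the space-joined words of each k-gram.
count_kgrams <- function(sentences, k) {
  grams <- unlist(lapply(sentences, function(words) {
    if (length(words) < k) return(character(0))
    starts <- seq_len(length(words) - k + 1)
    vapply(starts,
           function(i) paste(words[i:(i + k - 1)], collapse = " "),
           character(1))
  }))
  tab <- table(grams)
  setNames(as.vector(tab), names(tab))
}

# Frequency tables for orders 1, 2 and 3.
counts <- lapply(1:3, function(k) count_kgrams(sentences, k))
n_tokens <- sum(counts[[1]])

# Look up the count of a k-gram given as a character vector (0 if unseen).
get_count <- function(words) {
  key <- paste(words, collapse = " ")
  tab <- counts[[length(words)]]
  if (key %in% names(tab)) tab[[key]] else 0
}

# Stupid Back-off score of `word` given the preceding `context` words:
# use the relative frequency if the full k-gram was observed, otherwise
# drop the oldest context word and discount the result by `lambda`.
sbo_score <- function(word, context, lambda = 0.4) {
  if (length(context) == 0) return(get_count(word) / n_tokens)
  full <- get_count(c(context, word))
  if (full > 0) return(full / get_count(context))
  lambda * sbo_score(word, context[-1], lambda)
}

sbo_score("pizza", c("i", "love"))  # observed trigram: 1 / 2
sbo_score("you", c("you", "love"))  # backs off once: 0.4 * 1 / 3
```

The scores are not normalised probabilities, which is what makes the method "stupid" and cheap: ranking candidate next words only requires comparing scores.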

## Installation

### Released version

You can install the latest release of `sbo` from CRAN:

``` r
install.packages("sbo")
```

### Development version

You can install the development version of `sbo` from GitHub:

``` r
# install.packages("devtools")
devtools::install_github("vgherard/sbo")
```

## Example

This example shows how to build a text predictor with `sbo`:

```{r example, message=FALSE, warning=FALSE}
library(sbo)
p <- sbo_predictor(sbo::twitter_train, # 50k tweets, example dataset
                   N = 3, # Train a 3-gram model
                   dict = sbo::twitter_dict, # Top 1k words appearing in corpus
                   .preprocess = sbo::preprocess, # Preprocessing transformation
                   EOS = ".?!:;" # End-Of-Sentence characters
                   )
```

The object `p` can now be used to generate predictive text as follows:

```{r}
predict(p, "i love") # a character vector
predict(p, "you love") # another character vector
predict(p,
        c("i love", "you love", "she loves", "we love", "you love", "they love")
        ) # a character matrix
```
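
A predictor like `p` can then be checked against held-out text with `eval_sbo_predictor()`. The sketch below rests on a few assumptions that are not stated in this README and should be verified against the package documentation: that `sbo` ships a `twitter_test` dataset alongside `twitter_train`, that the function accepts the test corpus through a `test` argument, that it samples the test cases at random (hence the seed), and that the returned table has a logical `correct` column.

``` r
library(sbo)

# Train the same 3-gram predictor as in the example above.
p <- sbo_predictor(sbo::twitter_train,
                   N = 3,
                   dict = sbo::twitter_dict,
                   .preprocess = sbo::preprocess,
                   EOS = ".?!:;")

# Assumption: evaluation samples from the test sentences at random,
# so fix the seed to make the result reproducible.
set.seed(840)
evaluation <- eval_sbo_predictor(p, test = sbo::twitter_test)

# Assumption: `correct` flags whether the true next word
# was among the predicted completions.
accuracy <- mean(evaluation$correct)
accuracy
```

Reporting the fraction of correct predictions on an independent corpus gives a simple way to compare predictors trained with different `N` or dictionaries.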

## Related packages

For more general-purpose utilities for working with $n$-gram models, you can also check out my package [`{kgrams}`](https://vgherard.github.io/kgrams/).

## Help

For help, see the `sbo` [website](https://vgherard.github.io/sbo/).
