k-grams, Language Models, and All That
https://github.com/vgherard/kgrams


---
output: github_document
---

```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  fig.path = "man/figures/README-",
  out.width = "100%"
)
```

```{r srr-tags, eval = FALSE, echo = FALSE}
```

# kgrams

[![Project Status: Active – The project has reached a stable, usable state and is being actively developed.](https://www.repostatus.org/badges/latest/active.svg)](https://www.repostatus.org/#active)
[![R-CMD-check](https://github.com/vgherard/kgrams/workflows/R-CMD-check/badge.svg)](https://github.com/vgherard/kgrams/actions)
[![Codecov test coverage](https://codecov.io/gh/vgherard/kgrams/branch/main/graph/badge.svg)](https://app.codecov.io/gh/vgherard/kgrams?branch=main)
[![CRAN status](https://www.r-pkg.org/badges/version/kgrams)](https://CRAN.R-project.org/package=kgrams)
[![R-universe status](https://vgherard.r-universe.dev/badges/kgrams)](https://vgherard.r-universe.dev/)
[![Website](https://img.shields.io/badge/Website-here-blue)](https://vgherard.github.io/kgrams/)
[![Tweet](https://img.shields.io/twitter/url/http/shields.io.svg?style=social)](https://twitter.com/intent/tweet?text={kgrams}:%20Classical%20k-gram%20Language%20Models&url=https://github.com/vgherard/kgrams&via=ValerioGherardi&hashtags=rstats,MachineLearning,NaturalLanguageProcessing)

[`kgrams`](https://vgherard.github.io/kgrams/) provides tools for training and evaluating $k$-gram language models, including several probability smoothing methods, perplexity computations, random text generation and more. It is built on a C++ back-end, which makes `kgrams` fast, coupled with an accessible R API that aims to streamline model building. It is suitable for small- and medium-sized NLP experiments, baseline model building, and pedagogical purposes.

## For beginners
If you have no idea what $k$-gram models are *and* didn't get here by
accident, you can check out my hands-on [tutorial post on $k$-gram language models](https://datascienceplus.com/an-introduction-to-k-gram-language-models-in-r/) in R at [DataScience+](https://datascienceplus.com/).

## Installation

#### Released version

You can install the latest release of `kgrams` from [CRAN](https://CRAN.R-project.org/package=kgrams) with:

``` r
install.packages("kgrams")
```

#### Development version

You can install the development version from [my R-universe](https://vgherard.r-universe.dev/) with:

``` r
install.packages("kgrams", repos = "https://vgherard.r-universe.dev/")
```

## Example

This example shows how to train a modified Kneser-Ney 4-gram model on Shakespeare's play "Much Ado About Nothing" using `kgrams`.

```{r}
library(kgrams)
# Get k-gram frequency counts from text, for k = 1:4
freqs <- kgram_freqs(kgrams::much_ado, N = 4)
# Build modified Kneser-Ney 4-gram model, with discount parameters D1, D2, D3.
mkn <- language_model(freqs, smoother = "mkn", D1 = 0.25, D2 = 0.5, D3 = 0.75)
```
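
To make the notion of "$k$-gram frequency counts" concrete, here is a minimal base-R sketch of what counting $k$-grams for `k = 1:N` means. It is a conceptual illustration only, not how `kgram_freqs()` is implemented: `count_kgrams()` and the toy sentences are hypothetical, and the sketch omits the text preprocessing, sentence tokenization and sentence-boundary handling that a real implementation needs.

```r
# Conceptual sketch only: count k-grams (k = 1..N) in whitespace-tokenized
# sentences. `count_kgrams` is a hypothetical helper, not part of kgrams.
count_kgrams <- function(sentences, N) {
  counts <- vector("list", N)
  for (k in seq_len(N)) {
    grams <- character(0)
    for (s in sentences) {
      tokens <- strsplit(s, " +")[[1]]
      if (length(tokens) < k) next  # sentence too short to contain a k-gram
      starts <- seq_len(length(tokens) - k + 1)
      grams <- c(grams, vapply(
        starts,
        function(i) paste(tokens[i:(i + k - 1)], collapse = " "),
        character(1)
      ))
    }
    # Named integer vector: names are k-grams, values are their counts
    counts[[k]] <- lengths(split(grams, grams))
  }
  counts
}

toy <- c("a rose is a rose", "a rose is red")  # made-up corpus
freqs_toy <- count_kgrams(toy, N = 2)
freqs_toy[[1]][["rose"]]    # unigram count of "rose"
freqs_toy[[2]][["a rose"]]  # bigram count of "a rose"
```

A smoothing method such as modified Kneser-Ney then turns counts like these into probability estimates.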

We can now use this `language_model` to compute sentence and word continuation probabilities:

```{r}
# Compute sentence probabilities
probability(
  c("did he break out into tears ?",
    "we are predicting sentence probabilities ."),
  model = mkn
)
# Compute word continuation probabilities
probability(c("tears", "pieces") %|% "did he break out into", model = mkn)
```
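
Sentence and continuation probabilities are linked by the chain rule: in a $k$-gram model, the probability of a sentence is the product of the continuation probabilities of its words given their preceding context. The base-R sketch below illustrates this with made-up numbers for a toy bigram model; the `<s>` and `</s>` boundary markers and all probability values are assumptions for illustration, not output of `kgrams`.

```r
# Hypothetical continuation probabilities P(word | previous word) for the
# toy sentence "<s> did he </s>"; the numbers are made up for illustration.
cond_probs <- c(
  "did|<s>" = 0.5,
  "he|did"  = 0.25,
  "</s>|he" = 0.5
)

# Chain rule: the sentence probability is the product of the continuation
# probabilities. Summing logs avoids underflow for long sentences.
log_prob <- sum(log(cond_probs))
sentence_prob <- exp(log_prob)

# Perplexity: inverse sentence probability, normalized (geometric mean)
# by the number of predicted tokens.
perplexity_toy <- exp(-log_prob / length(cond_probs))

sentence_prob
perplexity_toy
```

Lower perplexity means the model assigns higher probability, on average, to each predicted token.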

Here are some sentences sampled from the language model's distribution at temperatures `t = c(1, 0.1, 10)`:

```{r}
# Sample sentences from the language model at different temperatures
set.seed(840)
sample_sentences(model = mkn, n = 3, max_length = 10, t = 1)
sample_sentences(model = mkn, n = 3, max_length = 10, t = 0.1)
sample_sentences(model = mkn, n = 3, max_length = 10, t = 10)
```
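
To build some intuition for the temperature parameter, the base-R sketch below rescales a toy word distribution at the same three temperatures. It assumes the common convention that sampling at temperature `t` draws from a distribution proportional to `p^(1/t)`; see the `sample_sentences()` documentation for the exact definition used by `kgrams`. The helper `apply_temperature()` and the probabilities are hypothetical.

```r
# Rescale a probability vector at temperature t: p_t is proportional to p^(1/t).
# `apply_temperature` is a hypothetical helper, not a kgrams function.
apply_temperature <- function(p, t) {
  stopifnot(t > 0, all(p >= 0))
  w <- p^(1 / t)
  w / sum(w)
}

p <- c(tears = 0.7, pieces = 0.2, laughter = 0.1)  # made-up probabilities
round(apply_temperature(p, t = 1), 3)    # unchanged distribution
round(apply_temperature(p, t = 0.1), 3)  # almost all mass on the likeliest word
round(apply_temperature(p, t = 10), 3)   # close to uniform
```

Under this convention, low temperatures favor the most probable continuations, while high temperatures flatten the distribution and yield more random text.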

## Getting Help

For further help, you can consult the reference page of the `kgrams` [website](https://vgherard.github.io/kgrams/) or [open an issue](https://github.com/vgherard/kgrams/issues) on the GitHub repository of `kgrams`. A vignette, available on the website, illustrates the process of building language models in depth.