https://github.com/vgherard/kgrams

k-grams, Language Models, and All That
language-models n-grams natural-language-processing

Host: GitHub
URL: https://github.com/vgherard/kgrams
Owner: vgherard
License: gpl-3.0
Created: 2021-01-23T20:28:26.000Z (over 4 years ago)
Default Branch: main
Last Pushed: 2024-11-14T13:34:45.000Z (7 months ago)
Last Synced: 2025-03-30T14:41:57.079Z (3 months ago)
Topics: language-models, n-grams, natural-language-processing
Language: R
Homepage: https://vgherard.github.io/kgrams/
Size: 1.63 MB
Stars: 7
Watchers: 4
Forks: 0
Open Issues: 2
Metadata Files:
- Readme: README.Rmd
- Changelog: NEWS.md
- License: LICENSE

README

---
output: github_document
---

```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  fig.path = "man/figures/README-",
  out.width = "100%"
)
```

```{r srr-tags, eval = FALSE, echo = FALSE}

```

# kgrams

[![Project Status: Active – The project has reached a stable, usable state and is being actively developed.](https://www.repostatus.org/badges/latest/active.svg)](https://www.repostatus.org/#active)

[![R-CMD-check](https://github.com/vgherard/kgrams/workflows/R-CMD-check/badge.svg)](https://github.com/vgherard/kgrams/actions)

[![Codecov test coverage](https://codecov.io/gh/vgherard/kgrams/branch/main/graph/badge.svg)](https://app.codecov.io/gh/vgherard/kgrams?branch=main)

[![CRAN status](https://www.r-pkg.org/badges/version/kgrams)](https://CRAN.R-project.org/package=kgrams)

[![R-universe status](https://vgherard.r-universe.dev/badges/kgrams)](https://vgherard.r-universe.dev/)

[![Website](https://img.shields.io/badge/Website-here-blue)](https://vgherard.github.io/kgrams/)

[![Tweet](https://img.shields.io/twitter/url/http/shields.io.svg?style=social)](https://twitter.com/intent/tweet?text={kgrams}:%20Classical%20k-gram%20Language%20Models&url=https://github.com/vgherard/kgrams&via=ValerioGherardi&hashtags=rstats,MachineLearning,NaturalLanguageProcessing)

[`kgrams`](https://vgherard.github.io/kgrams/) provides tools for training and evaluating $k$-gram language models, including several probability smoothing methods, perplexity computations, random text generation and more. It is built on a C++ back-end, which makes `kgrams` fast, coupled with an accessible R API that aims to streamline model building. It is suitable for small- and medium-sized NLP experiments, baseline model building, and pedagogical purposes.

## For beginners

If you have no idea what $k$-gram models are *and* didn't get here by accident, you can check out my hands-on [tutorial post on $k$-gram language models](https://datascienceplus.com/an-introduction-to-k-gram-language-models-in-r/) using R at [DataScience+](https://datascienceplus.com/).

## Installation

#### Released version

You can install the latest release of `kgrams` from [CRAN](https://CRAN.R-project.org/package=kgrams) with:

``` r
install.packages("kgrams")
```

#### Development version

You can install the development version from [my R-universe](https://vgherard.r-universe.dev/) with:

``` r
install.packages("kgrams", repos = "https://vgherard.r-universe.dev/")
```

## Example

This example shows how to train a modified Kneser-Ney 4-gram model on Shakespeare's play "Much Ado About Nothing" using `kgrams`.

```{r}
library(kgrams)

# Get k-gram frequency counts from text, for k = 1:4
freqs <- kgram_freqs(kgrams::much_ado, N = 4)

# Build modified Kneser-Ney 4-gram model, with discount parameters D1, D2, D3.
mkn <- language_model(freqs, smoother = "mkn", D1 = 0.25, D2 = 0.5, D3 = 0.75)
```
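To make "k-gram frequency counts" concrete, here is a minimal base R sketch of what such counts look like for a single tokenized sentence. The helper `toy_kgram_counts()` is hypothetical and is not part of `kgrams`; it ignores preprocessing, sentence tokenization and any sentence-boundary handling that a real implementation needs, so it is only an illustration of the idea and not how `kgram_freqs()` works internally.

```r
# Hypothetical helper (NOT a kgrams function): count the k-grams in a
# vector of tokens by sliding a window of width k over it.
toy_kgram_counts <- function(tokens, k) {
  n <- length(tokens)
  if (n < k) return(table(character(0)))
  starts <- seq_len(n - k + 1)
  grams <- vapply(
    starts,
    function(i) paste(tokens[i:(i + k - 1)], collapse = " "),
    character(1)
  )
  table(grams)
}

tokens <- strsplit("the cat sat on the mat", " ")[[1]]

unigrams <- toy_kgram_counts(tokens, k = 1)  # "the" occurs twice
bigrams  <- toy_kgram_counts(tokens, k = 2)  # five distinct bigrams, each seen once

unigrams["the"]
bigrams
```

A smoothed model such as the one built above turns counts of this kind into probabilities that remain non-zero for k-grams never seen in training.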

We can now use this `language_model` to compute sentence and word continuation probabilities:

```{r}
# Compute sentence probabilities
probability(c("did he break out into tears ?",
              "we are predicting sentence probabilities ."),
            model = mkn)

# Compute word continuation probabilities
probability(c("tears", "pieces") %|% "did he break out into", model = mkn)
```
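Probabilities like these are also the ingredients of perplexity, one of the evaluation tools listed in the introduction. The sketch below shows the standard definition in base R, as the exponential of the average negative log-probability per token. The helper `perplexity_from_probs()` and the input numbers are made up for illustration; `kgrams` has its own perplexity function, which is not used here, and its conventions (for example, how end-of-sentence tokens are counted) may differ from this sketch.

```r
# Hypothetical helper (NOT a kgrams function): perplexity computed from a
# vector of per-token probabilities, as exp(-mean(log(p))).
perplexity_from_probs <- function(token_probs) {
  stopifnot(
    length(token_probs) > 0,
    all(token_probs > 0),
    all(token_probs <= 1)
  )
  exp(-mean(log(token_probs)))
}

# A model that gives probability 1/4 to each of 8 tokens is as uncertain
# as a uniform choice among 4 words at every step.
uniform_ppl <- perplexity_from_probs(rep(0.25, 8))

# Made-up per-token probabilities, for illustration only.
mixed_ppl <- perplexity_from_probs(c(0.5, 0.1, 0.25, 0.05))

uniform_ppl
mixed_ppl
```

Lower perplexity means the model found the text less surprising, which is why it is commonly used to compare models on held-out text.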

Here are some sentences sampled from the language model's distribution at temperatures `t = c(1, 0.1, 10)`:

```{r}
# Sample sentences from the language model at different temperatures
set.seed(840)

sample_sentences(model = mkn, n = 3, max_length = 10, t = 1)

sample_sentences(model = mkn, n = 3, max_length = 10, t = 0.1)

sample_sentences(model = mkn, n = 3, max_length = 10, t = 10)
```
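To give some intuition for the temperature argument, the base R sketch below rescales a small probability vector under the common convention that each probability is raised to the power `1 / t` and the result is renormalized. This convention is an assumption made for illustration, so check the documentation of `sample_sentences()` for the exact definition it uses. The helper `apply_temperature()` and the toy distribution are hypothetical.

```r
# Hypothetical helper (NOT a kgrams function): rescale a probability
# vector to temperature t, assuming p_t is proportional to p^(1 / t).
apply_temperature <- function(p, t) {
  stopifnot(t > 0, all(p >= 0), sum(p) > 0)
  w <- p^(1 / t)
  w / sum(w)
}

# Toy continuation distribution over three words (made-up numbers).
p <- c(the = 0.7, a = 0.2, an = 0.1)

p_base <- apply_temperature(p, t = 1)    # unchanged
p_cold <- apply_temperature(p, t = 0.1)  # mass concentrates on "the"
p_hot  <- apply_temperature(p, t = 10)   # close to uniform

round(rbind(p_base, p_cold, p_hot), 3)
```

Under this convention, low temperatures make sampling nearly deterministic, while high temperatures flatten the distribution and produce more random text.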

## Getting Help

For further help, you can consult the reference page of the `kgrams` [website](https://vgherard.github.io/kgrams/) or [open an issue](https://github.com/vgherard/kgrams/issues) on the GitHub repository of `kgrams`. A vignette on the website illustrates the process of building language models in depth.
