https://github.com/vgherard/kgrams
k-grams, Language Models, and All That
language-models n-grams natural-language-processing
- Host: GitHub
- URL: https://github.com/vgherard/kgrams
- Owner: vgherard
- License: gpl-3.0
- Created: 2021-01-23T20:28:26.000Z (almost 4 years ago)
- Default Branch: main
- Last Pushed: 2024-11-13T08:26:31.000Z (about 1 month ago)
- Last Synced: 2024-11-13T09:20:41.121Z (about 1 month ago)
- Topics: language-models, n-grams, natural-language-processing
- Language: R
- Homepage: https://vgherard.github.io/kgrams/
- Size: 1.63 MB
- Stars: 7
- Watchers: 4
- Forks: 0
- Open Issues: 2
Metadata Files:
- Readme: README.Rmd
- Changelog: NEWS.md
- License: LICENSE
README
---
output: github_document
---

```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  fig.path = "man/figures/README-",
  out.width = "100%"
)
```

```{r srr-tags, eval = FALSE, echo = FALSE}
```

# kgrams
[![Project Status: Active – The project has reached a stable, usable state and is being actively developed.](https://www.repostatus.org/badges/latest/active.svg)](https://www.repostatus.org/#active)
[![R-CMD-check](https://github.com/vgherard/kgrams/workflows/R-CMD-check/badge.svg)](https://github.com/vgherard/kgrams/actions)
[![Codecov test coverage](https://codecov.io/gh/vgherard/kgrams/branch/main/graph/badge.svg)](https://app.codecov.io/gh/vgherard/kgrams?branch=main)
[![CRAN status](https://www.r-pkg.org/badges/version/kgrams)](https://CRAN.R-project.org/package=kgrams)
[![R-universe status](https://vgherard.r-universe.dev/badges/kgrams)](https://vgherard.r-universe.dev/)
[![Website](https://img.shields.io/badge/Website-here-blue)](https://vgherard.github.io/kgrams/)
[![Tweet](https://img.shields.io/twitter/url/http/shields.io.svg?style=social)](https://twitter.com/intent/tweet?text={kgrams}:%20Classical%20k-gram%20Language%20Models&url=https://github.com/vgherard/kgrams&via=ValerioGherardi&hashtags=rstats,MachineLearning,NaturalLanguageProcessing)

[`kgrams`](https://vgherard.github.io/kgrams/) provides tools for training and evaluating $k$-gram language models, including several probability smoothing methods, perplexity computations, random text generation and more. It is built on a C++ back-end, which makes `kgrams` fast, coupled with an accessible R API that aims to streamline model building. It is suitable for small- and medium-sized NLP experiments, baseline model building, and pedagogical purposes.
## For beginners
If you have no idea about what $k$-gram models are *and* didn't get here by
accident, you can check out my hands-on [tutorial post on $k$-gram language models](https://datascienceplus.com/an-introduction-to-k-gram-language-models-in-r/) using R at [DataScience+](https://datascienceplus.com/).

## Installation
#### Released version
You can install the latest release of `kgrams` from [CRAN](https://CRAN.R-project.org/package=kgrams) with:
``` r
install.packages("kgrams")
```

#### Development version
You can install the development version from [my R-universe](https://vgherard.r-universe.dev/) with:
``` r
install.packages("kgrams", repos = "https://vgherard.r-universe.dev/")
```

## Example
This example shows how to train a modified Kneser-Ney 4-gram model on Shakespeare's play "Much Ado About Nothing" using `kgrams`.
```{r}
library(kgrams)
# Get k-gram frequency counts from text, for k = 1:4
freqs <- kgram_freqs(kgrams::much_ado, N = 4)
# Build modified Kneser-Ney 4-gram model, with discount parameters D1, D2, D3.
mkn <- language_model(freqs, smoother = "mkn", D1 = 0.25, D2 = 0.5, D3 = 0.75)
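
# Illustrative base-R sketch, independent of kgrams (`toy_tokens`,
# `toy_bigrams` and `toy_bigram_counts` are hypothetical names): a k-gram count
# is the number of times a sequence of k consecutive tokens occurs in the text.
# It is shown here for k = 2 on a toy token sequence, not on `much_ado`.
toy_tokens <- c("to", "be", "or", "not", "to", "be")
toy_bigrams <- paste(head(toy_tokens, -1), tail(toy_tokens, -1))
toy_bigram_counts <- table(toy_bigrams)
toy_bigram_counts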
```

We can now use this `language_model` to compute sentence and word continuation probabilities:
```{r}
# Compute sentence probabilities
probability(c("did he break out into tears ?",
              "we are predicting sentence probabilities ."
              ),
            model = mkn
            )
# Compute word continuation probabilities
probability(c("tears", "pieces") %|% "did he break out into", model = mkn)
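
# Illustrative sketch with made-up numbers, not output of `mkn`
# (`toy_cond_probs` and `toy_sentence_prob` are hypothetical names): under the
# usual chain-rule factorisation, a sentence probability is the product of the
# continuation probabilities of its tokens, each conditioned on the preceding
# context. Summing logs instead of multiplying helps avoid numerical underflow.
toy_cond_probs <- c(0.20, 0.05, 0.50, 0.10)
toy_sentence_prob <- exp(sum(log(toy_cond_probs)))
toy_sentence_prob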
```

Here are some sentences sampled from the language model's distribution at temperatures `t = c(1, 0.1, 10)`:
```{r}
# Sample sentences from the language model at different temperatures
set.seed(840)
sample_sentences(model = mkn, n = 3, max_length = 10, t = 1)
sample_sentences(model = mkn, n = 3, max_length = 10, t = 0.1)
sample_sentences(model = mkn, n = 3, max_length = 10, t = 10)
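
# Illustrative base-R sketch of what a temperature does, not the kgrams
# internals (`apply_temperature` and `toy_probs` are hypothetical names). It
# assumes the common convention of raising each probability to the power 1 / t
# and renormalising: t = 1 leaves the distribution unchanged, a small t
# concentrates it on the most likely word, and a large t flattens it towards
# uniform.
apply_temperature <- function(p, t) {
  w <- p^(1 / t)
  w / sum(w)
}
toy_probs <- c(0.7, 0.2, 0.1)
apply_temperature(toy_probs, t = 1)
apply_temperature(toy_probs, t = 0.1)
apply_temperature(toy_probs, t = 10)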
```

## Getting Help
For further help, consult the reference page of the `kgrams` [website](https://vgherard.github.io/kgrams/) or [open an issue](https://github.com/vgherard/kgrams/issues) on the `kgrams` GitHub repository. A vignette on the website illustrates the process of building language models in depth.