{"id":22852940,"url":"https://github.com/vgherard/kgrams","last_synced_at":"2025-04-30T09:21:14.312Z","repository":{"id":48301706,"uuid":"332301772","full_name":"vgherard/kgrams","owner":"vgherard","description":"k-grams, Language Models, and All That","archived":false,"fork":false,"pushed_at":"2024-11-14T13:34:45.000Z","size":1706,"stargazers_count":7,"open_issues_count":2,"forks_count":0,"subscribers_count":4,"default_branch":"main","last_synced_at":"2025-03-30T14:41:57.079Z","etag":null,"topics":["language-models","n-grams","natural-language-processing"],"latest_commit_sha":null,"homepage":"https://vgherard.github.io/kgrams/","language":"R","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"gpl-3.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/vgherard.png","metadata":{"files":{"readme":"README.Rmd","changelog":"NEWS.md","contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2021-01-23T20:28:26.000Z","updated_at":"2024-12-01T02:11:31.000Z","dependencies_parsed_at":"2024-11-13T09:29:43.749Z","dependency_job_id":null,"html_url":"https://github.com/vgherard/kgrams","commit_stats":{"total_commits":318,"total_committers":3,"mean_commits":106.0,"dds":"0.14150943396226412","last_synced_commit":"7d7e59dc6005bf01400349952d39e1d12de95158"},"previous_names":[],"tags_count":6,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/vgherard%2Fkgrams","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/vgherard%2Fkgrams/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/vgherard%2Fkgrams/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/
GitHub/repositories/vgherard%2Fkgrams/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/vgherard","download_url":"https://codeload.github.com/vgherard/kgrams/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":251674981,"owners_count":21625716,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["language-models","n-grams","natural-language-processing"],"created_at":"2024-12-13T06:10:00.763Z","updated_at":"2025-04-30T09:21:14.292Z","avatar_url":"https://github.com/vgherard.png","language":"R","funding_links":[],"categories":[],"sub_categories":[],"readme":"---\noutput: github_document\n---\n\n\u003c!-- README.md is generated from README.Rmd. 
Please edit that file --\u003e\n\n```{r, include = FALSE}\nknitr::opts_chunk$set(\n  collapse = TRUE,\n  comment = \"#\u003e\",\n  fig.path = \"man/figures/README-\",\n  out.width = \"100%\"\n)\n```\n\n```{r srr-tags, eval = FALSE, echo = FALSE}\n```\n\n\n# kgrams\n\n\u003c!-- badges: start --\u003e\n[![Project Status: Active – The project has reached a stable, usable state and is being actively developed.](https://www.repostatus.org/badges/latest/active.svg)](https://www.repostatus.org/#active)\n[![R-CMD-check](https://github.com/vgherard/kgrams/workflows/R-CMD-check/badge.svg)](https://github.com/vgherard/kgrams/actions)\n[![Codecov test coverage](https://codecov.io/gh/vgherard/kgrams/branch/main/graph/badge.svg)](https://app.codecov.io/gh/vgherard/kgrams?branch=main)\n[![CRAN status](https://www.r-pkg.org/badges/version/kgrams)](https://CRAN.R-project.org/package=kgrams)\n[![R-universe status](https://vgherard.r-universe.dev/badges/kgrams)](https://vgherard.r-universe.dev/)\n[![Website](https://img.shields.io/badge/Website-here-blue)](https://vgherard.github.io/kgrams/)\n[![Tweet](https://img.shields.io/twitter/url/http/shields.io.svg?style=social)](https://twitter.com/intent/tweet?text={kgrams}:%20Classical%20k-gram%20Language%20Models\u0026url=https://github.com/vgherard/kgrams\u0026via=ValerioGherardi\u0026hashtags=rstats,MachineLearning,NaturalLanguageProcessing)\n\u003c!-- badges: end --\u003e\n\n[`kgrams`](https://vgherard.github.io/kgrams/) provides tools for training and evaluating $k$-gram language models, including several probability smoothing methods, perplexity computations, random text generation and more. 
It is based on a C++ back-end, which makes `kgrams` fast, coupled with an accessible R API that aims to streamline model building. It is suitable for small- and medium-sized NLP experiments, baseline model building, and pedagogical purposes.\n\n## For beginners\nIf you have no idea about what $k$-gram models are *and* didn't get here by \naccident, you can check out my hands-on [tutorial post on $k$-gram language models](https://datascienceplus.com/an-introduction-to-k-gram-language-models-in-r/) using R at [DataScience+](https://datascienceplus.com/).\n\n## Installation\n\n#### Released version\n\nYou can install the latest release of `kgrams` from [CRAN](https://CRAN.R-project.org/package=kgrams) with:\n\n``` r\ninstall.packages(\"kgrams\")\n```\n\n#### Development version\n\nYou can install the development version from [my R-universe](https://vgherard.r-universe.dev/) with:\n\n``` r\ninstall.packages(\"kgrams\", repos = \"https://vgherard.r-universe.dev/\")\n```\n\n## Example\n\nThis example shows how to train a modified Kneser-Ney 4-gram model on Shakespeare's play \"Much Ado About Nothing\" using `kgrams`.\n\n```{r}\nlibrary(kgrams)\n# Get k-gram frequency counts from text, for k = 1:4\nfreqs \u003c- kgram_freqs(kgrams::much_ado, N = 4)\n# Build modified Kneser-Ney 4-gram model, with discount parameters D1, D2, D3.\nmkn \u003c- language_model(freqs, smoother = \"mkn\", D1 = 0.25, D2 = 0.5, D3 = 0.75)\n```\n\nWe can now use this `language_model` to compute sentence and word continuation probabilities:\n\n```{r}\n# Compute sentence probabilities\nprobability(c(\"did he break out into tears ?\",\n              \"we are predicting sentence probabilities .\"\n              ), \n            model = mkn\n            )\n# Compute word continuation probabilities\nprobability(c(\"tears\", \"pieces\") %|% \"did he break out into\", model = mkn)\n```\n\nHere are some sentences sampled from the language model's distribution at temperatures `t = 
c(1, 0.1, 10)`:\n\n```{r}\n# Sample sentences from the language model at different temperatures\nset.seed(840)\nsample_sentences(model = mkn, n = 3, max_length = 10, t = 1)\nsample_sentences(model = mkn, n = 3, max_length = 10, t = 0.1)\nsample_sentences(model = mkn, n = 3, max_length = 10, t = 10)\n```\n\n## Getting Help\n\nFor further help, you can consult the reference page of the `kgrams` [website](https://vgherard.github.io/kgrams/) or [open an issue](https://github.com/vgherard/kgrams/issues) on the GitHub repository of `kgrams`. A vignette on the website illustrates the process of building language models in depth.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fvgherard%2Fkgrams","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fvgherard%2Fkgrams","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fvgherard%2Fkgrams/lists"}