{"id":21615988,"url":"https://github.com/pharo-ai/ngrammodel","last_synced_at":"2025-07-13T12:33:59.432Z","repository":{"id":65908349,"uuid":"165321881","full_name":"pharo-ai/NgramModel","owner":"pharo-ai","description":"Ngram language model implemented in Pharo","archived":false,"fork":false,"pushed_at":"2023-02-16T11:18:03.000Z","size":8447,"stargazers_count":4,"open_issues_count":10,"forks_count":4,"subscribers_count":3,"default_branch":"master","last_synced_at":"2024-05-18T21:52:20.581Z","etag":null,"topics":["language-model","natural-language-processing","ngram-language-model","ngrams","pharo","statistics"],"latest_commit_sha":null,"homepage":null,"language":"Smalltalk","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/pharo-ai.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null}},"created_at":"2019-01-11T22:48:48.000Z","updated_at":"2022-01-23T23:25:57.000Z","dependencies_parsed_at":"2023-06-03T06:00:29.154Z","dependency_job_id":null,"html_url":"https://github.com/pharo-ai/NgramModel","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/pharo-ai%2FNgramModel","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/pharo-ai%2FNgramModel/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/pharo-ai%2FNgramModel/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/pharo-ai%2FNgramModel/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/pharo-ai","download_url":"https://codel
oad.github.com/pharo-ai/NgramModel/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":226238804,"owners_count":17593679,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["language-model","natural-language-processing","ngram-language-model","ngrams","pharo","statistics"],"created_at":"2024-11-24T22:13:19.521Z","updated_at":"2024-11-24T22:13:20.198Z","avatar_url":"https://github.com/pharo-ai.png","language":"Smalltalk","funding_links":[],"categories":[],"sub_categories":[],"readme":"# Ngram Language Model\n\n[![Build status](https://github.com/pharo-ai/NgramModel/workflows/CI/badge.svg)](https://github.com/pharo-ai/NgramModel/actions/workflows/test.yml)\n[![Coverage Status](https://coveralls.io/repos/github/pharo-ai/NgramModel/badge.svg?branch=master)](https://coveralls.io/github/pharo-ai/NgramModel?branch=master)\n[![License](https://img.shields.io/badge/license-MIT-blue.svg)](https://raw.githubusercontent.com/pharo-ai/NgramModel/master/LICENSE)\n\n`Ngram` package provides basic [n-gram](https://en.wikipedia.org/wiki/N-gram) functionality for Pharo. This includes `Ngram` class as well as `String` and `SequenceableCollection` extension that allow you to split text into unigrams, bigrams, trigrams, etc. 
Basically, this is just a simple utility for splitting texts into sequences of words.\nThis project also provides \n\n## Installation\n\nTo install the packages of NgramModel, go to the Playground (Ctrl+OW) in your Pharo image and execute the following Metacello script (select it and press the Do-it button or Ctrl+D):\n\n```Smalltalk\nMetacello new\n  baseline: 'AINgramModel';\n  repository: 'github://pharo-ai/NgramModel/src';\n  load\n```\n\n## How to depend on it?\n\nIf you want to add a dependency on this project to your own project, include the following lines in your baseline method:\n\n```Smalltalk\nspec\n  baseline: 'NgramModel'\n  with: [ spec repository: 'github://pharo-ai/NgramModel/src' ].\n```\n\nIf you are new to baselines and Metacello, check out the [Baselines](https://github.com/pharo-open-documentation/pharo-wiki/blob/master/General/Baselines.md) tutorial on the Pharo Wiki.\n\n## What are n-grams?\n\nAn [n-gram](https://en.wikipedia.org/wiki/N-gram) is a sequence of n elements, usually words. The number n is called the order of the n-gram. The concept of n-grams is widely used in [natural language processing (NLP)](https://en.wikipedia.org/wiki/Natural_language_processing). A text can be split into n-grams - sequences of n words. Consider the following text:\n```\nI do not like green eggs and ham\n```\nWe can split it into **unigrams** (n-grams with n=1):\n```\n(I), (do), (not), (like), (green), (eggs), (and), (ham)\n```\nOr **bigrams** (n-grams with n=2):\n```\n(I do), (do not), (not like), (like green), (green eggs), (eggs and), (and ham)\n```\nOr **trigrams** (n-grams with n=3):\n```\n(I do not), (do not like), (not like green), (like green eggs), (green eggs and), (eggs and ham)\n```\nAnd so on (tetragrams, pentagrams, etc.).\n\n### Applications\n\nN-grams are widely applied in [language modeling](https://en.wikipedia.org/wiki/Language_model). 
For example, take a look at the implementation of [n-gram language model](https://github.com/olekscode/NgramModel) in Pharo.\n\n### Structure of n-gram\n\nEach n-gram can be separated into:\n\n* **last word** - the last element in a sequence;\n* **history** (context) - n-gram of order n-1 with all words except the last one.\n\nSuch separation is useful in probabilistic modeling when we want to estimate the probability of word given n-1 previous words (see [n-gram language model](https://github.com/olekscode/NgramModel)).\n\n## Ngram class\n\nThis package provides only one class - `Ngram`. It models the n-gram.\n\n### Instance creation\n\nYou can create n-gram from any `SequenceableCollection`:\n\n```Smalltalk\ntrigram := AINgram withElements: #(do not like).\ntetragram := #(green eggs and ham) asNgram.\n```\n\nOr by explicitly providing the history (n-gram of lower order) and last element:\n\n```Smalltalk\nhist := #(green eggs and) asNgram.\nw := 'ham'.\n\nngram := AINgram withHistory: hist last: w.\n```\n\nYou can also create a zerogram - n-gram of order 0. It is an empty sequence with no history and no last word:\n\n```Smalltalk\nAINgram zerogram.\n```\n\n### Accessing\n\nYou can access the order of n-gram, its history and last element:\n\n```Smalltalk\ntetragram. \"n-gram(green eggs and ham)\"\ntetragram order. \"4\"\ntetragram history. \"n-gram(green eggs and)\"\ntetragram last. \"ham\"\n```\n\n## String extensions\n\n\u003e TODO\n\n## Example of text generation\n\n#### 1. Loading Brown corpus\n```Smalltalk\nfile := 'pharo-local/iceberg/pharo-ai/NgramModel/Corpora/brown.txt' asFileReference.\nbrown := file contents.\n```\n#### 2. Training a 2-gram language model on the corpus\n```Smalltalk\nmodel := AINgramModel order: 2.\nmodel trainOn: brown.\n```\n#### 3. 
Generating text of 100 words\nAt each step the model selects the 5 words that are most likely to follow the previous words and returns a random one of those five (this randomness ensures that the generator does not get stuck in a cycle).\n```Smalltalk\ngenerator := AINgramTextGenerator new model: model.\ngenerator generateTextOfSize: 100.\n```\n## Result:\n\n#### 100 words generated by a 2-gram model trained on Brown corpus\n```\n educator cannot describe and edited a highway at private time ``\n Fallen Figure Technique tells him life pattern more flesh tremble \n with neither my God `` Hit ) landowners began this narrative and \n planted , post-war years Josephus Daniels was Virginia years \n Congress with confluent , jurisdiction involved some used which \n he''s something the Lyle Elliott Carter officiated and edited and\n portents like Paradise Road in boatloads . Shipments of Student \n Movement itself officially shifted religions of fluttering soutane .\n Coolest shade which reasonably . Coolest shade less shaky . Doubts \n thus preventing them proper bevels easily take comfort was\n```\n#### 100 words generated by a 3-gram model trained on Brown corpus\n```\n The Fulton County purchasing departments do to escape Nicolas Manas .\n But plain old bean soup , broth , hash , and cultivated in himself , \n back straight , black sheepskin hat from Texas A \u0026 I College and \n operates the institution , the antipathy to outward ceremonies hailed \n by modern plastic materials -- a judgment based on displacement of his \n arrival spread through several stitches along edge to her paper for \n further meditation . `` Hit the bum '' ! ! Fort up ! ! Fort up ! ! \n Kizzie turned to similar approaches . When Mrs. 
Coolidge for\n```\n#### 100 words generated by a 3-gram model trained on Pharo source code corpus\nThis model was trained on the corpus composed from the source code of [85,000 Pharo methods tokenized at the subtoken level](https://github.com/pharo-ai/NgramModel/blob/master/Corpora/pharo_source.txt) (composite names like `OrderedCollection` were split into subtokens: `ordered`, `collection`).\n```\n super initialize value holders . ( aggregated series := ( margins if nil\n if false ) text styler blue style table detect : [ uniform drop list input . \n export csv label : suggested file name \u003c a parametric function . | phase \n \u003cnum\u003e := bit thing basic size \u003e= desired length ) ascii . space width + \n bounds top - an event character : d bytes : stream if absent put : answers )\n | width of text . status value := dual value at last : category string := \n value cos ) abs raised to n number of\n```\n## Warning\nTraining the model on the entire Pharo corpus and generating 100 words can take over 10 minutes. So start with a smaller exercise: train a 2-gram model on the Brown corpus (it is the smallest one) and generate 10 words.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fpharo-ai%2Fngrammodel","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fpharo-ai%2Fngrammodel","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fpharo-ai%2Fngrammodel/lists"}