https://github.com/pharo-ai/NgramModel
Ngram language model implemented in Pharo
language-model natural-language-processing ngram-language-model ngrams pharo statistics
- Host: GitHub
- URL: https://github.com/pharo-ai/NgramModel
- Owner: pharo-ai
- License: mit
- Created: 2019-01-11T22:48:48.000Z (about 6 years ago)
- Default Branch: master
- Last Pushed: 2023-02-16T11:18:03.000Z (almost 2 years ago)
- Last Synced: 2024-05-18T21:52:20.581Z (8 months ago)
- Topics: language-model, natural-language-processing, ngram-language-model, ngrams, pharo, statistics
- Language: Smalltalk
- Size: 8.06 MB
- Stars: 4
- Watchers: 3
- Forks: 4
- Open Issues: 10
Metadata Files:
- Readme: README.md
- License: LICENSE
Awesome Lists containing this project
- awesome-pharo - pharo-ai / NgramModel - N-gram language model that can be trained to estimate the probability of a next word based on N-1 previous words. (Artificial Intelligence and Machine Learning)
README
# Ngram Language Model
[![Build status](https://github.com/pharo-ai/NgramModel/workflows/CI/badge.svg)](https://github.com/pharo-ai/NgramModel/actions/workflows/test.yml)
[![Coverage Status](https://coveralls.io/repos/github/pharo-ai/NgramModel/badge.svg?branch=master)](https://coveralls.io/github/pharo-ai/NgramModel?branch=master)
[![License](https://img.shields.io/badge/license-MIT-blue.svg)](https://raw.githubusercontent.com/pharo-ai/NgramModel/master/LICENSE)

The `Ngram` package provides basic [n-gram](https://en.wikipedia.org/wiki/N-gram) functionality for Pharo. This includes the `Ngram` class as well as `String` and `SequenceableCollection` extensions that allow you to split text into unigrams, bigrams, trigrams, etc. Basically, this is just a simple utility for splitting texts into sequences of words.

This project also provides

## Installation
To install the packages of NgramModel, open the Playground (Ctrl+OW) in your Pharo image and execute the following Metacello script (select it and press the Do-it button or Ctrl+D):
```Smalltalk
Metacello new
    baseline: 'AINgramModel';
    repository: 'github://pharo-ai/NgramModel/src';
    load
```

## How to depend on it?
To add this project as a dependency of your own project, include the following lines in your baseline method:
```Smalltalk
spec
    baseline: 'AINgramModel'
    with: [ spec repository: 'github://pharo-ai/NgramModel/src' ].
```

If you are new to baselines and Metacello, check out the [Baselines](https://github.com/pharo-open-documentation/pharo-wiki/blob/master/General/Baselines.md) tutorial on Pharo Wiki.
## What are n-grams?
An [n-gram](https://en.wikipedia.org/wiki/N-gram) is a sequence of n elements, usually words. The number n is called the order of the n-gram. The concept of n-grams is widely used in [natural language processing (NLP)](https://en.wikipedia.org/wiki/Natural_language_processing). A text can be split into n-grams, that is, sequences of n consecutive words. Consider the following text:
```
I do not like green eggs and ham
```
We can split it into **unigrams** (n-grams with n=1):
```
(I), (do), (not), (like), (green), (eggs), (and), (ham)
```
Or **bigrams** (n-grams with n=2):
```
(I do), (do not), (not like), (like green), (green eggs), (eggs and), (and ham)
```
Or **trigrams** (n-grams with n=3):
```
(I do not), (do not like), (not like green), (like green eggs), (green eggs and), (eggs and ham)
```
And so on (tetragrams, pentagrams, etc.).

### Applications
N-grams are widely applied in [language modeling](https://en.wikipedia.org/wiki/Language_model). For example, take a look at the implementation of an [n-gram language model](https://github.com/olekscode/NgramModel) in Pharo.
### Structure of n-gram
Each n-gram can be separated into:
* **last word** - the last element in a sequence;
* **history** (context) - n-gram of order n-1 with all words except the last one.

Such separation is useful in probabilistic modeling when we want to estimate the probability of a word given the n-1 previous words (see [n-gram language model](https://github.com/olekscode/NgramModel)).
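
As a rough illustration of the splitting and of the history / last word separation described above, here is a minimal sketch that uses only core Pharo collection messages (`substrings`, `copyFrom:to:`, `allButLast`, `last`). It is not the package's implementation, and the variable names are chosen just for this example.

```Smalltalk
"Split a text into words on whitespace."
words := 'I do not like green eggs and ham' substrings.
order := 3.

"Slide a window of `order` words over the text, one word at a time."
trigrams := (1 to: words size - order + 1) collect: [ :start |
    words copyFrom: start to: start + order - 1 ].

"Separate one n-gram into its history (all but the last word) and its last word."
firstTrigram := trigrams first.
history := firstTrigram allButLast.
lastWord := firstTrigram last.
```

For the eight-word sentence this yields six trigrams, matching the list above; the first one has the history `I do` and the last word `not`.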
## Ngram class
This package provides a single class, `AINgram`, which models an n-gram.
### Instance creation
You can create an n-gram from any `SequenceableCollection`:
```Smalltalk
trigram := AINgram withElements: #(do not like).
tetragram := #(green eggs and ham) asNgram.
```

Or by explicitly providing the history (an n-gram of lower order) and the last element:
```Smalltalk
hist := #(green eggs and) asNgram.
w := 'ham'.
ngram := AINgram withHistory: hist last: w.
```

You can also create a zerogram, an n-gram of order 0. It is an empty sequence with no history and no last word:
```Smalltalk
AINgram zerogram.
```

### Accessing
You can access the order of an n-gram, its history, and its last element:
```Smalltalk
tetragram. "n-gram(green eggs and ham)"
tetragram order. "4"
tetragram history. "n-gram(green eggs and)"
tetragram last. "ham"
```

## String extensions
> TODO
## Example of text generation
#### 1. Loading Brown corpus
```Smalltalk
file := 'pharo-local/iceberg/pharo-ai/NgramModel/Corpora/brown.txt' asFileReference.
brown := file contents.
```
#### 2. Training a 2-gram language model on the corpus
```Smalltalk
model := AINgramModel order: 2.
model trainOn: brown.
```
#### 3. Generating text of 100 words
At each step, the model selects the 5 words that are most likely to follow the previous words and returns a random word from those five (this randomness ensures that the generator does not get stuck in a cycle).
```Smalltalk
generator := AINgramTextGenerator new model: model.
generator generateTextOfSize: 100.
```
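
To make the selection step more concrete, the following is a simplified sketch of top-5 sampling for a 2-gram model, written with core Pharo collections only (`Dictionary`, `Bag`, `sorted:`, `atRandom`). It is an assumption about how such a step can work, not the code of `AINgramTextGenerator`; the toy corpus and the names `followers` and `nextWord` are made up for the example. It ranks followers by raw counts, which gives the same order as ranking by estimated probability.

```Smalltalk
"Toy corpus in which every word is followed by some other word at least once."
words := 'the cat sat on the mat and the dog sat on the mat' substrings.

"Count how often each word follows each other word (bigram counts)."
followers := Dictionary new.
1 to: words size - 1 do: [ :i |
    (followers at: (words at: i) ifAbsentPut: [ Bag new ])
        add: (words at: i + 1) ].

"Pick a random word among the (at most) five most frequent followers."
nextWord := [ :previous | | counts ranked |
    counts := followers at: previous.
    ranked := counts asSet sorted: [ :a :b |
        (counts occurrencesOf: a) >= (counts occurrencesOf: b) ].
    (ranked first: (5 min: ranked size)) atRandom ].

"Generate five more words, starting from an arbitrary seed word."
generated := OrderedCollection with: 'the'.
5 timesRepeat: [ generated add: (nextWord value: generated last) ].
```

Choosing at random among several candidates, rather than always taking the single most frequent follower, is what keeps a generator like this from looping over the same few words.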
## Result
#### 100 words generated by a 2-gram model trained on Brown corpus
```
educator cannot describe and edited a highway at private time ``
Fallen Figure Technique tells him life pattern more flesh tremble
with neither my God `` Hit ) landowners began this narrative and
planted , post-war years Josephus Daniels was Virginia years
Congress with confluent , jurisdiction involved some used which
he''s something the Lyle Elliott Carter officiated and edited and
portents like Paradise Road in boatloads . Shipments of Student
Movement itself officially shifted religions of fluttering soutane .
Coolest shade which reasonably . Coolest shade less shaky . Doubts
thus preventing them proper bevels easily take comfort was
```
#### 100 words generated by a 3-gram model trained on Brown corpus
```
The Fulton County purchasing departments do to escape Nicolas Manas .
But plain old bean soup , broth , hash , and cultivated in himself ,
back straight , black sheepskin hat from Texas A & I College and
operates the institution , the antipathy to outward ceremonies hailed
by modern plastic materials -- a judgment based on displacement of his
arrival spread through several stitches along edge to her paper for
further meditation . `` Hit the bum '' ! ! Fort up ! ! Fort up ! !
Kizzie turned to similar approaches . When Mrs. Coolidge for
```
#### 100 words generated by a 3-gram model trained on Pharo source code corpus
This model was trained on a corpus composed of the source code of [85,000 Pharo methods tokenized at the subtoken level](https://github.com/pharo-ai/NgramModel/blob/master/Corpora/pharo_source.txt) (composite names like `OrderedCollection` were split into subtokens: `ordered`, `collection`).
```
super initialize value holders . ( aggregated series := ( margins if nil
if false ) text styler blue style table detect : [ uniform drop list input .
export csv label : suggested file name < a parametric function . | phase
:= bit thing basic size >= desired length ) ascii . space width +
bounds top - an event character : d bytes : stream if absent put : answers )
| width of text . status value := dual value at last : category string :=
value cos ) abs raised to n number of
```
## Warning
Training the model on the entire Pharo corpus and generating 100 words can take over 10 minutes. Start with a smaller exercise: train a 2-gram model on the Brown corpus (the smallest one) and generate 10 words.
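
The smaller exercise could look like the sketch below. It reuses only the messages shown earlier in this README (`order:`, `trainOn:`, `model:`, `generateTextOfSize:`) and assumes the same corpus path as in step 1. The timing with `Time millisecondsToRun:` is an addition for this example, meant to help you judge how long larger runs might take on your machine.

```Smalltalk
"Load the Brown corpus (same path as in step 1; adjust it to your image)."
file := 'pharo-local/iceberg/pharo-ai/NgramModel/Corpora/brown.txt' asFileReference.
brown := file contents.

"Train a 2-gram model and measure how long training takes."
model := AINgramModel order: 2.
trainingMs := Time millisecondsToRun: [ model trainOn: brown ].

"Generate only 10 words and measure how long generation takes."
generator := AINgramTextGenerator new model: model.
generationMs := Time millisecondsToRun: [
    text := generator generateTextOfSize: 10 ].
```

Inspect `trainingMs` and `generationMs` before moving on to a higher order or a larger corpus.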