# Ngram Language Model

[![Build status](https://github.com/pharo-ai/NgramModel/workflows/CI/badge.svg)](https://github.com/pharo-ai/NgramModel/actions/workflows/test.yml)
[![Coverage Status](https://coveralls.io/repos/github/pharo-ai/NgramModel/badge.svg?branch=master)](https://coveralls.io/github/pharo-ai/NgramModel?branch=master)
[![License](https://img.shields.io/badge/license-MIT-blue.svg)](https://raw.githubusercontent.com/pharo-ai/NgramModel/master/LICENSE)

This package provides basic [n-gram](https://en.wikipedia.org/wiki/N-gram) functionality for Pharo. It includes the `AINgram` class as well as `String` and `SequenceableCollection` extensions that allow you to split text into unigrams, bigrams, trigrams, etc. Basically, it is a simple utility for splitting texts into sequences of words.
This project also provides an n-gram language model (`AINgramModel`) and a text generator (`AINgramTextGenerator`) built on top of it; see the examples below.

## Installation

To install NgramModel, open a Playground (Ctrl+OW) in your Pharo image and execute the following Metacello script (select it and press the Do-it button or Ctrl+D):

```Smalltalk
Metacello new
    baseline: 'AINgramModel';
    repository: 'github://pharo-ai/NgramModel/src';
    load
```

## How to depend on it?

If you want to add a dependency to this project to your own project, include the following lines into your baseline method:

```Smalltalk
spec
    baseline: 'AINgramModel'
    with: [ spec repository: 'github://pharo-ai/NgramModel/src' ].
```

If you are new to baselines and Metacello, check out the [Baselines](https://github.com/pharo-open-documentation/pharo-wiki/blob/master/General/Baselines.md) tutorial on Pharo Wiki.

## What are n-grams?

An [n-gram](https://en.wikipedia.org/wiki/N-gram) is a sequence of n elements, usually words. The number n is called the order of the n-gram. The concept of n-grams is widely used in [natural language processing (NLP)](https://en.wikipedia.org/wiki/Natural_language_processing). A text can be split into n-grams, i.e. sequences of n consecutive words. Consider the following text:
```
I do not like green eggs and ham
```
We can split it into **unigrams** (n-grams with n=1):
```
(I), (do), (not), (like), (green), (eggs), (and), (ham)
```
Or **bigrams** (n-grams with n=2):
```
(I do), (do not), (not like), (like green), (green eggs), (eggs and), (and ham)
```
Or **trigrams** (n-grams with n=3):
```
(I do not), (do not like), (not like green), (like green eggs), (green eggs and), (eggs and ham)
```
And so on (tetragrams, pentagrams, etc.).
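
To make this concrete, here is a minimal, package-independent sketch in Pharo that extracts the bigrams from a tokenized sentence using plain collection operations (the package's own API for this is shown in the sections below):

```Smalltalk
"Extract bigrams from a tokenized sentence with plain collection operations."
| words bigrams |
words := #(I do not like green eggs and ham).
bigrams := (1 to: words size - 1) collect: [ :i |
    { words at: i. words at: i + 1 } ].
bigrams. "an Array of 7 bigrams: (I do), (do not), ..., (and ham)"
```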

### Applications

N-grams are widely applied in [language modeling](https://en.wikipedia.org/wiki/Language_model). For example, take a look at the implementation of an [n-gram language model](https://github.com/olekscode/NgramModel) in Pharo.

### Structure of n-gram

Each n-gram can be separated into:

* **last word** - the last element of the sequence;
* **history** (context) - an n-gram of order n-1 containing all words except the last one.

Such a separation is useful in probabilistic modeling when we want to estimate the probability of a word given the n-1 previous words (see [n-gram language model](https://github.com/olekscode/NgramModel)).
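
For example, with the `AINgram` class described below, the trigram (do not like) has the bigram (do not) as its history and `like` as its last word:

```Smalltalk
trigram := #(do not like) asNgram.
trigram history. "n-gram(do not)"
trigram last. "like"
```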

## Ngram class

The central class of this package is `AINgram`, which models a single n-gram.

### Instance creation

You can create an n-gram from any `SequenceableCollection`:

```Smalltalk
trigram := AINgram withElements: #(do not like).
tetragram := #(green eggs and ham) asNgram.
```

Or by explicitly providing the history (an n-gram of lower order) and the last element:

```Smalltalk
hist := #(green eggs and) asNgram.
w := 'ham'.

ngram := AINgram withHistory: hist last: w.
```

You can also create a zerogram - an n-gram of order 0. It is an empty sequence with no history and no last word:

```Smalltalk
AINgram zerogram.
```

### Accessing

You can access the order of an n-gram, its history, and its last element:

```Smalltalk
tetragram. "n-gram(green eggs and ham)"
tetragram order. "4"
tetragram history. "n-gram(green eggs and)"
tetragram last. "ham"
```

## String extensions

> TODO
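
Until this section is written, here is a minimal sketch of how the string extension might be used. The selector `ngramsOfOrder:` is an assumption, not confirmed by this README; check the extension methods on `String` in the loaded package for the actual API.

```Smalltalk
"Assumed selector - verify against the String extensions shipped with the package."
'I do not like green eggs and ham' ngramsOfOrder: 2.
"Expected to answer the bigrams listed above: (I do), (do not), ..., (and ham)"
```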

## Example of text generation

#### 1. Loading the Brown corpus
```Smalltalk
file := 'pharo-local/iceberg/pharo-ai/NgramModel/Corpora/brown.txt' asFileReference.
brown := file contents.
```
#### 2. Training a 2-gram language model on the corpus
```Smalltalk
model := AINgramModel order: 2.
model trainOn: brown.
```
#### 3. Generating text of 100 words
At each step the model selects the top 5 words that are most likely to follow the previous words and returns a random one of those five (this randomness ensures that the generator does not get stuck in a cycle).
```Smalltalk
generator := AINgramTextGenerator new model: model.
generator generateTextOfSize: 100.
```
## Results

#### 100 words generated by a 2-gram model trained on the Brown corpus
```
educator cannot describe and edited a highway at private time ``
Fallen Figure Technique tells him life pattern more flesh tremble
with neither my God `` Hit ) landowners began this narrative and
planted , post-war years Josephus Daniels was Virginia years
Congress with confluent , jurisdiction involved some used which
he''s something the Lyle Elliott Carter officiated and edited and
portents like Paradise Road in boatloads . Shipments of Student
Movement itself officially shifted religions of fluttering soutane .
Coolest shade which reasonably . Coolest shade less shaky . Doubts
thus preventing them proper bevels easily take comfort was
```
#### 100 words generated by a 3-gram model trained on the Brown corpus
```
The Fulton County purchasing departments do to escape Nicolas Manas .
But plain old bean soup , broth , hash , and cultivated in himself ,
back straight , black sheepskin hat from Texas A & I College and
operates the institution , the antipathy to outward ceremonies hailed
by modern plastic materials -- a judgment based on displacement of his
arrival spread through several stitches along edge to her paper for
further meditation . `` Hit the bum '' ! ! Fort up ! ! Fort up ! !
Kizzie turned to similar approaches . When Mrs. Coolidge for
```
#### 100 words generated by a 3-gram model trained on a Pharo source code corpus
This model was trained on a corpus composed of the source code of [85,000 Pharo methods tokenized at the subtoken level](https://github.com/pharo-ai/NgramModel/blob/master/Corpora/pharo_source.txt) (composite names such as `OrderedCollection` were split into the subtokens `ordered` and `collection`).
```
super initialize value holders . ( aggregated series := ( margins if nil
if false ) text styler blue style table detect : [ uniform drop list input .
export csv label : suggested file name < a parametric function . | phase
:= bit thing basic size >= desired length ) ascii . space width +
bounds top - an event character : d bytes : stream if absent put : answers )
| width of text . status value := dual value at last : category string :=
value cos ) abs raised to n number of
```
## Warning
Training a model on the entire Pharo source code corpus and generating 100 words can take over 10 minutes. Start with a smaller exercise: train a 2-gram model on the Brown corpus (it is the smallest one) and generate 10 words.
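
A quick version of that smaller exercise, reusing the classes and corpus path from the example above:

```Smalltalk
"Train a 2-gram model on the Brown corpus and generate 10 words."
| file brown model generator |
file := 'pharo-local/iceberg/pharo-ai/NgramModel/Corpora/brown.txt' asFileReference.
brown := file contents.

model := AINgramModel order: 2.
model trainOn: brown.

generator := AINgramTextGenerator new model: model.
generator generateTextOfSize: 10.
```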