https://github.com/daniel-lima-lopez/n-gram-example

Implementation of a BiGram-based language system in Python
ngram ngram-language-model ngrams nlp nlp-machine-learning python


# N-Gram-Example
[This repository](https://github.com/daniel-lima-lopez/N-Gram-Example) implements a bigram language model built from the vocabulary of dialogues between characters in Shakespeare's works, taken from the [Shakespeare plays](https://www.kaggle.com/datasets/kingburrito666/shakespeare-plays) dataset.

## Installation
Clone this repository:
```bash
git clone git@github.com:daniel-lima-lopez/N-Gram-Example.git
```
Move to the installation directory:
```bash
cd N-Gram-Example
```

## Method description
The bigram is built by counting every occurrence of each word pair in the text corpus. This lets the system identify the most frequent word pairs and, therefore, model the English language as a system of word pairs.
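As a rough illustration of the counting step, the sketch below counts adjacent word pairs in a toy corpus. The function name `count_bigrams`, the lowercase whitespace tokenization, and the toy sentences are assumptions made for this example; only the `s1` and `e1` markers come from the example further down, and the notebook may process the corpus differently.

```python
from collections import Counter

def count_bigrams(sentences, start="s1", end="e1"):
    """Count adjacent word pairs, wrapping each sentence in start/end markers."""
    counts = Counter()
    for sentence in sentences:
        # Assumed tokenization: lowercase and split on whitespace.
        tokens = [start] + sentence.lower().split() + [end]
        # zip pairs every token with its successor.
        counts.update(zip(tokens, tokens[1:]))
    return counts

toy_corpus = ["to be or not to be", "to be is to do"]
pair_counts = count_bigrams(toy_corpus)
print(pair_counts.most_common(2))
```

Here the pair `("to", "be")` is counted three times and `("s1", "to")` twice, so those two pairs rank first.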

The [counts.ipynb](counts.ipynb) notebook shows the procedure used to analyze the text corpus and create the [word_id.csv](word_id.csv) and [CMatrix.csv](CMatrix.csv) files. The first file lists the unique words found in the corpus, while the second contains the counts of the word pairs found in the text. Because a full count matrix over such a large vocabulary would consist mostly of zeros (a sparse matrix), only the pairs that actually occur are written to the file.
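The idea of storing only the existing pairs can be sketched as follows. The column names, the alphabetical id assignment, and the helper `to_sparse_rows` are hypothetical; the actual layout of `word_id.csv` and `CMatrix.csv` may differ.

```python
import csv
import io

def to_sparse_rows(pair_counts):
    """Map words to integer ids and keep only pairs with a nonzero count."""
    words = sorted({w for pair in pair_counts for w in pair})
    word_id = {w: i for i, w in enumerate(words)}
    rows = [(word_id[a], word_id[b], c)
            for (a, b), c in pair_counts.items() if c > 0]
    return word_id, sorted(rows)

pair_counts = {("s1", "to"): 2, ("to", "be"): 3, ("be", "e1"): 1}
word_id, rows = to_sparse_rows(pair_counts)

buffer = io.StringIO()  # stands in for a file on disk
writer = csv.writer(buffer)
writer.writerow(["first_id", "second_id", "count"])  # hypothetical header
writer.writerows(rows)
csv_text = buffer.getvalue()

dense_cells = len(word_id) ** 2
print(f"{len(rows)} stored rows instead of {dense_cells} dense cells")
```

Even in this tiny case, 3 rows replace a 4 x 4 matrix; the saving grows quickly with the vocabulary size.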

The bigram is implemented in [BiGram.py](BiGram.py), which takes the files [word_id.csv](word_id.csv) and [CMatrix.csv](CMatrix.csv) as input. When instantiating the class, you can choose the values of the parameters `k` and `add`: `k` is a factor that multiplies every element of the counting matrix, and `add` is a constant added to the result of that multiplication. This moves some of the counting mass to word pairs that do not appear in the corpus, so the system can also produce word combinations it never observed.
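To see the effect of `k` and `add` on a single row of the counting matrix, consider the sketch below. The function `smoothed_probs`, the toy vocabulary, and the normalization into probabilities are assumptions for illustration; `BiGram.py` may organize this computation differently.

```python
def smoothed_probs(row_counts, vocab, k=5, add=1):
    """Turn one row of raw counts into probabilities using k * count + add."""
    # Unseen pairs have count 0, so their score is just `add`.
    scores = {w: k * row_counts.get(w, 0) + add for w in vocab}
    total = sum(scores.values())
    # Assumed normalization: divide each score by the row total.
    return {w: s / total for w, s in scores.items()}

vocab = ["be", "do", "or"]
row_after_to = {"be": 3, "do": 1}  # "or" never follows "to" in this toy row
probs = smoothed_probs(row_after_to, vocab, k=5, add=1)
print(probs)
```

With `k=5` and `add=1` the scores are 16, 6, and 1, so the unseen pair receives probability 1/23 instead of 0. A larger `k` relative to `add` keeps the distribution closer to the raw counts.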

## Example
The following example can be executed in the notebook [example.ipynb](example.ipynb).

We can instantiate the BiGram class as follows:
```python
from BiGram import BiGram
test = BiGram(k=5, add=1)
```
We can then use the `next_word()` method to generate the next word given the previous one. Below, the bigram generates 10 sentences of up to 5 words each. Every sentence begins from the start-of-sentence indicator `s1`, and the (i+1)-th word is generated from the i-th word:
```python
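for _ in range(10):
    ws = ['s1']  # every sentence starts from the start-of-sentence indicator
    for _ in range(5):
        nw = test.next_word(ws[-1])  # next word given the previous one
        if nw == 'e1':  # stop early at the end-of-sentence indicator
            break
        ws.append(nw)
    print(*ws[1:])  # print the words without the start indicator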
```
which results in:
```
- the grace insurrection module countenances
- and for they prescriptions deiphobus
- attend lettersdamnd eyases censureo smarting
- humbling witch lade scions dearbeloved
- incarnal cricket tellus exchequers overview
- visit stubble each heros nursery
- boarish lucentio luna godfather dire
- you begin offend glorious sundaycitizens
- come infixing dareful cuckooflowers minded
- and shrilltongued everpardon blue uttering
```

Note that `next_word()` does not always return the single most probable word. It uses the counting matrix to make a random selection among all candidate next words, giving higher probability to the combinations that are most frequent in the corpus.
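This kind of frequency-weighted selection can be sketched with the standard library. The helper `sample_next_word` and the toy scores are hypothetical, and the actual `next_word()` implementation may use a different sampling routine.

```python
import random

def sample_next_word(row_scores, rng=random):
    """Pick one candidate at random, weighted by its (smoothed) count."""
    words = list(row_scores)
    weights = [row_scores[w] for w in words]
    # random.choices draws with replacement, proportionally to the weights.
    return rng.choices(words, weights=weights, k=1)[0]

rng = random.Random(0)  # fixed seed so the run is repeatable
row_scores = {"be": 16, "do": 6, "or": 1}
draws = [sample_next_word(row_scores, rng) for _ in range(1000)]
freq = {w: draws.count(w) for w in row_scores}
print(freq)
```

Frequent pairs dominate the draws, yet rare pairs still appear occasionally, which would explain why the generated sentences above mix common words with unusual ones.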