https://github.com/paceaux/methodius
A utility for analyzing text on the web
- Host: GitHub
- URL: https://github.com/paceaux/methodius
- Owner: paceaux
- License: other
- Created: 2022-07-18T19:15:10.000Z (almost 4 years ago)
- Default Branch: develop
- Last Pushed: 2025-09-19T03:25:08.000Z (9 months ago)
- Last Synced: 2025-10-29T09:37:36.233Z (8 months ago)
- Topics: bigram, ngram, parse, split, tokenize, trigram
- Language: TypeScript
- Homepage:
- Size: 362 KB
- Stars: 5
- Watchers: 1
- Forks: 2
- Open Issues: 7
Metadata Files:
- Readme: README.md
- License: LICENSE.md
README
# Methodius (an NGram utility)
A utility for analyzing frequency of text chunks on the web.
Supply a bit o' text to the Methodius class, and let it determine your bigrams, trigrams, ngrams, letter frequencies, word frequencies, and bigram relationships, and build ngram trees.
[Hippocratic License HL3-LAW-MEDIA-MIL-SOC-SV](https://firstdonoharm.dev/version/3/0/law-media-mil-soc-sv.html)

## Example
```JavaScript
const { Methodius } = require('methodius');
// or import { Methodius } from 'methodius';
const udhr1 = `
All human beings are born free and equal in dignity and rights. They are endowed with reason and conscience and should act towards one another in a spirit of brotherhood.
`;
const nGrams = new Methodius(udhr1);
const topLetters = nGrams.getTopLetters(10);
const topWords = nGrams.getTopWords(10);
```
# API
## `Methodius`
Global Class
`new Methodius(text)`
**Parameters**
| name | type | Description |
| --- |--- | --- |
| text | string | raw text to be analyzed |
### Static Members
#### `Punctuations`
characters to ignore when analyzing text
period, comma, semicolon, colon, bang, question mark, interrobang, Spanish bang+, parens, bracket, brace, single quote, some spaces
`\\.,;:!?‽¡¿⸘()\\[\\]{}<>’'…\"\n\t\r`
#### `wordSeparators`
characters to ignore AND CONSUME when trying to find words
em-dash, period, comma, semicolon, colon, bang, question mark, interrobang, Spanish bang+, parens, bracket, brace, single quote, space
`—\\.,;:!?‽¡¿⸘()\\[\\]{}<>…"\\s`
### Static Methods
#### `hasPunctuation(string)`
determines if string contains punctuation
**Parameters**
| name | type | Description |
| --- |--- | --- |
| string | string | |
**Returns**
`boolean`
#### `hasSymbols(string)`
determines if string contains symbols
**Parameters**
| name | type | Description |
| --- |--- | --- |
| string | string | |
**Returns**
`boolean`
#### `hasSpace(string)`
determines if a string has a space
**Parameters**
| name | type | Description |
| --- |--- | --- |
| string | string | |
**Returns**
`boolean`
#### `sanitizeText(string)`
lowercases text and removes diacritics and other characters that would throw off n-gram analysis
**Parameters**
| name | type | Description |
| --- |--- | --- |
| string |string | |
**Returns**
`string`
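The lowercasing and diacritic removal described above can be sketched with standard string methods. This is a minimal illustration, not the library's implementation: `sketchSanitize` is a hypothetical name, and it assumes diacritics are stripped via Unicode decomposition. Whatever other characters `sanitizeText` removes are not covered here.

```javascript
// Hypothetical helper for illustration; NOT Methodius.sanitizeText itself.
function sketchSanitize(text) {
  return text
    .toLowerCase()
    .normalize('NFD')                 // split letters from their combining marks
    .replace(/[\u0300-\u036f]/g, ''); // drop the combining diacritical marks
}

console.log(sketchSanitize('Crème Brûlée')); // 'creme brulee'
```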
#### `getWords(text)`
extracts an array of words from a string
**Parameters**
| name | type | Description |
| --- |--- | --- |
| text | string | |
**Returns**
`Array`
#### `getNGrams(text, gramSize)`
gets ngrams from text
**Parameters**
| name | type | Description |
| --- |--- | --- |
| text | string | |
| gramSize | number | default = 2 |
**Returns**
`Array`
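For readers new to n-grams, the idea is a sliding window of `gramSize` characters. The sketch below shows that window over a single word. `sketchLetterNGrams` is a hypothetical name, and how the real `getNGrams` treats spaces and punctuation between words is not shown here.

```javascript
// Hypothetical helper for illustration; NOT Methodius.getNGrams itself.
function sketchLetterNGrams(word, gramSize = 2) {
  const grams = [];
  // Slide a window of gramSize characters across the word.
  for (let i = 0; i + gramSize <= word.length; i += 1) {
    grams.push(word.slice(i, i + gramSize));
  }
  return grams;
}

console.log(sketchLetterNGrams('nation'));    // ['na', 'at', 'ti', 'io', 'on']
console.log(sketchLetterNGrams('nation', 3)); // ['nat', 'ati', 'tio', 'ion']
```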
#### `getMeanWordSize(wordArray)`
Gets average size of a word
**Parameters**
| name | type | Description |
| --- |--- | --- |
| wordArray | string[] | |
**Returns**
`number`
#### `getMedianWordSize(wordArray)`
Gets the median (middle) size of a word
**Parameters**
| name | type | Description |
| --- |--- | --- |
| wordArray | string[] | |
**Returns**
`number`
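The mean and median word sizes can be sketched as follows. The function names are hypothetical, and two choices are assumptions rather than documented behavior: an empty array returns 0, and an even-length array takes the average of the two middle sizes.

```javascript
// Hypothetical helpers for illustration; NOT the library's implementation.
function sketchMeanWordSize(wordArray) {
  if (wordArray.length === 0) return 0; // assumption: empty input gives 0
  const total = wordArray.reduce((sum, word) => sum + word.length, 0);
  return total / wordArray.length;
}

function sketchMedianWordSize(wordArray) {
  if (wordArray.length === 0) return 0; // assumption: empty input gives 0
  const sizes = wordArray.map((word) => word.length).sort((a, b) => a - b);
  const mid = Math.floor(sizes.length / 2);
  // Odd count: the middle size. Even count: average of the two middle sizes.
  return sizes.length % 2 === 1 ? sizes[mid] : (sizes[mid - 1] + sizes[mid]) / 2;
}

const sampleWords = ['all', 'human', 'beings', 'are', 'born'];
console.log(sketchMeanWordSize(sampleWords));   // 4.2
console.log(sketchMedianWordSize(sampleWords)); // 4
```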
#### `getWordNGrams(text, gramSize)`
Gets word ngrams from text (2-word pairs by default).
Note: This doesn't use sentence punctuation as a boundary. Should it?
**Parameters**
| name | type | Description |
| --- |--- | --- |
| text | string | |
| gramSize | number | default=2 |
**Returns**
`Array`
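To illustrate the note about sentence boundaries, here is a sketch of word n-grams over an already-split word list. `sketchWordNGrams` is a hypothetical name, and returning each gram as an array of words is an assumption; the library may join them differently.

```javascript
// Hypothetical helper for illustration; NOT Methodius.getWordNGrams itself.
function sketchWordNGrams(words, gramSize = 2) {
  const grams = [];
  // Slide a window of gramSize words across the list.
  for (let i = 0; i + gramSize <= words.length; i += 1) {
    grams.push(words.slice(i, i + gramSize));
  }
  return grams;
}

// 'dignity and rights. They are' with the punctuation already consumed:
const pairs = sketchWordNGrams(['dignity', 'and', 'rights', 'they', 'are']);
console.log(pairs);
// [['dignity','and'], ['and','rights'], ['rights','they'], ['they','are']]
```

Note that `['rights', 'they']` spans two sentences, which is the behavior the note above asks about.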
#### `getFrequencyMap(ngramArray)`
converts an array of strings into a map of those strings and their number of occurrences
**Parameters**
| name | type | Description |
| --- |--- | --- |
| ngramArray | `Array` | |
**Returns**
`Map`
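Counting occurrences into a `Map` can be sketched in a few lines. `sketchFrequencyMap` is a hypothetical name used only for this illustration.

```javascript
// Hypothetical helper for illustration; NOT Methodius.getFrequencyMap itself.
function sketchFrequencyMap(ngramArray) {
  const counts = new Map();
  for (const gram of ngramArray) {
    // Start at 0 for unseen grams, then add one.
    counts.set(gram, (counts.get(gram) || 0) + 1);
  }
  return counts;
}

const counts = sketchFrequencyMap(['ti', 'io', 'on', 'ti', 'io', 'ti']);
console.log(counts.get('ti')); // 3
console.log(counts.get('on')); // 1
```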
#### `getPercentMap(frequencyMap)`
converts a frequency map into a map of percentages
**Parameters**
| name | type | Description |
| --- |--- | --- |
| frequencyMap | `Map` | |
**Returns**
`Map`
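One way to turn counts into shares of the total is sketched below. `sketchPercentMap` is a hypothetical name, and expressing each share as a fraction between 0 and 1 is an assumption; the library may scale its values differently.

```javascript
// Hypothetical helper for illustration; NOT Methodius.getPercentMap itself.
function sketchPercentMap(frequencyMap) {
  let total = 0;
  for (const count of frequencyMap.values()) total += count;

  const shares = new Map();
  for (const [gram, count] of frequencyMap) {
    // Assumption: shares are fractions of the total (0 to 1).
    shares.set(gram, total === 0 ? 0 : count / total);
  }
  return shares;
}

const shares = sketchPercentMap(new Map([['e', 3], ['t', 1]]));
console.log(shares.get('e')); // 0.75
console.log(shares.get('t')); // 0.25
```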
#### `getTopGrams(frequencyMap, limit)`
filters a frequency map into only a small subset of the most frequent ones
**Parameters**
| name | type | Description |
| --- |--- | --- |
| frequencyMap | `Map` | |
| limit | number | default=20 |
**Returns**
`Map`
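Filtering a frequency map down to its most frequent entries can be sketched by sorting the entries and keeping the first `limit`. `sketchTopGrams` is a hypothetical name, and leaving ties in their original insertion order is an assumption about tie-breaking.

```javascript
// Hypothetical helper for illustration; NOT Methodius.getTopGrams itself.
function sketchTopGrams(frequencyMap, limit = 20) {
  const sorted = [...frequencyMap.entries()]
    .sort(([, countA], [, countB]) => countB - countA); // highest count first
  // A Map preserves insertion order, so the result stays ranked.
  return new Map(sorted.slice(0, limit));
}

const frequencies = new Map([['th', 5], ['he', 7], ['in', 2], ['er', 7]]);
console.log([...sketchTopGrams(frequencies, 2)]); // [['he', 7], ['er', 7]]
```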
#### `getIntersection(iterable1, iterable2)`
Returns an array of items that occur in both iterables
**Parameters**
| name | type | Description |
| --- |--- | --- |
| iterable1 | `Map\|Array` | |
| iterable2 | `Map\|Array` | |
**Returns**
`Array`
An array of items that occur in both iterables. If given a map, it compares the keys.
#### `getUnion(iterable1, iterable2)`
Returns an array that is the union of two iterables
**Parameters**
| name | type | Description |
| --- |--- | --- |
| iterable1 | `Map\|Array` | |
| iterable2 | `Map\|Array` | |
**Returns**
`Array`
An array of every item that occurs in either iterable.
#### `getDisjunctiveUnion(iterable1, iterable2)`
Returns an array of arrays of the unique items in either iterable. Also known as the symmetric difference
**Parameters**
| name | type | Description |
| --- |--- | --- |
| iterable1 | `Map\|Array` | |
| iterable2 | `Map\|Array` | |
**Returns**
`Array`
An array of arrays of the unique items. The first inner array holds items unique to the first parameter; the second holds items unique to the second parameter.
#### `getDifference(iterable1, iterable2)`
Returns an array of items that are unique only to the first parameter.
**Parameters**
| name | type | Description |
| --- |--- | --- |
| iterable1 | `Map\|Array` | |
| iterable2 | `Map\|Array` | |
**Returns**
`Array`
An array of items unique only to the first parameter
#### `getComparison(iterable1, iterable2)`
Returns a map containing various comparisons between two iterables
**Parameters**
| name | type | Description |
| --- |--- | --- |
| iterable1 | `Map\|Array` | |
| iterable2 | `Map\|Array` | |
**Returns**
`Map<string, Array>`
A map containing various comparisons between two iterables. Those comparisons will be arrays of intersection, disjunctiveUnion, difference, and union.
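The four set operations above can be sketched together. This is an illustration under stated assumptions, not the library's code: `sketchComparison` and `toItems` are hypothetical names, duplicates are removed before comparing, and the result keys simply reuse the names listed above.

```javascript
// Hypothetical helpers for illustration; NOT Methodius.getComparison itself.
// Maps are compared by key, as described for getIntersection.
const toItems = (iterable) => (
  iterable instanceof Map ? [...iterable.keys()] : [...iterable]
);

function sketchComparison(iterable1, iterable2) {
  const first = [...new Set(toItems(iterable1))];  // assumption: de-duplicate
  const second = [...new Set(toItems(iterable2))];
  const inFirst = new Set(first);
  const inSecond = new Set(second);

  const difference = first.filter((item) => !inSecond.has(item));
  const onlySecond = second.filter((item) => !inFirst.has(item));

  return new Map([
    ['intersection', first.filter((item) => inSecond.has(item))],
    ['union', [...new Set([...first, ...second])]],
    ['difference', difference],
    // One inner array per parameter, in parameter order.
    ['disjunctiveUnion', [difference, onlySecond]],
  ]);
}

const comparison = sketchComparison(
  ['th', 'he', 'er'],
  new Map([['he', 2], ['re', 1]]),
);
console.log(comparison.get('intersection'));     // ['he']
console.log(comparison.get('union'));            // ['th', 'he', 'er', 're']
console.log(comparison.get('difference'));       // ['th', 'er']
console.log(comparison.get('disjunctiveUnion')); // [['th', 'er'], ['re']]
```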
#### `getWordPlacementForNGram(ngram, wordsArray)`
determines the placement of a single ngram in an array of words
**Parameters**
| name | type | Description |
| --- |--- | --- |
| ngram | `string` | |
| wordsArray | `Array` | |
**Returns**
`Map`
a map with the keys 'start', 'middle', and 'end' whose values correspond to how often the provided ngram occurs in this position
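A sketch of the start/middle/end tally follows. `sketchPlacement` is a hypothetical name, and two details are assumptions: an ngram that equals the whole word counts as 'start', and every occurrence within a word is counted.

```javascript
// Hypothetical helper for illustration; NOT the library's implementation.
function sketchPlacement(ngram, wordsArray) {
  const placement = new Map([['start', 0], ['middle', 0], ['end', 0]]);
  if (!ngram) return placement; // nothing to search for

  for (const word of wordsArray) {
    let index = word.indexOf(ngram);
    while (index !== -1) {
      let position = 'middle';
      if (index === 0) position = 'start'; // assumption: whole-word match is 'start'
      else if (index + ngram.length === word.length) position = 'end';
      placement.set(position, placement.get(position) + 1);
      index = word.indexOf(ngram, index + 1); // look for further occurrences
    }
  }
  return placement;
}

const placement = sketchPlacement('on', ['only', 'honor', 'nation']);
console.log([...placement]); // [['start', 1], ['middle', 1], ['end', 1]]
```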
#### `getWordPlacementForNGrams(ngrams, wordsArray)`
determines the placement of ngrams in an array of words
**Parameters**
| name | type | Description |
| --- |--- | --- |
| ngrams | `Array` | |
| wordsArray | `Array` | |
**Returns**
`Map<string, Map>`
a map with the key of the ngram, and the value that is a map containing start, middle, end
#### `getNgramCollections(wordArray, ngramSize)`
gets ngrams from an array of words
**Parameters**
| name | type | Description |
| --- |--- | --- |
| wordArray | `Array` | an array of words |
| ngramSize | `number` | default = 2. The size of the ngrams to return |
**Returns**
`Array<Array<string>>`
An array containing arrays of ngrams, each array corresponds to a word.
#### `getNgramSiblings(searchText, ngramCollections, siblingSize)`
using a collection returned from getNgramCollections, searches for a string and returns what comes before and after it
**Parameters**
| name | type | Description |
| --- |--- | --- |
| searchText | `string` | the string to search for |
| ngramCollections | `Array<string>\|Array<Array<string>>` | an array of ngrams, or an nGramCollection |
| siblingSize | `number` | default = 1. How many siblings to find in front or behind |
**Returns**
`Map<'before'|'after',Map>`
a Map with the keys 'before' and 'after' which contain maps of what comes before and after
**Example**
```JavaScript
const words = ['revolution', 'nation'];
const ngramCollections = Methodius.getNgramCollections(words, 2);
const onSiblings = Methodius.getNgramSiblings('io', ngramCollections);
/*
new Map([
  ['before', new Map([
    ['ti', 2]
  ])],
  ['after', new Map([
    ['on', 2]
  ])]
])
*/
```
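To show where those counts come from, here is a self-contained sketch of the sibling lookup. It is not the library's implementation: `sketchCollections` and `sketchSiblings` are hypothetical names, and only the default sibling size of 1 is handled.

```javascript
// Hypothetical helpers for illustration; NOT the library's implementation.
// One array of letter ngrams per word.
function sketchCollections(words, size = 2) {
  return words.map((word) => {
    const grams = [];
    for (let i = 0; i + size <= word.length; i += 1) {
      grams.push(word.slice(i, i + size));
    }
    return grams;
  });
}

// Tally the ngram directly before and directly after each match.
function sketchSiblings(searchText, ngramCollections) {
  const before = new Map();
  const after = new Map();
  const bump = (map, key) => map.set(key, (map.get(key) || 0) + 1);

  for (const grams of ngramCollections) {
    grams.forEach((gram, i) => {
      if (gram !== searchText) return;
      if (i > 0) bump(before, grams[i - 1]);
      if (i < grams.length - 1) bump(after, grams[i + 1]);
    });
  }
  return new Map([['before', before], ['after', after]]);
}

const collections = sketchCollections(['revolution', 'nation'], 2);
const ioSiblings = sketchSiblings('io', collections);
console.log([...ioSiblings.get('before')]); // [['ti', 2]]
console.log([...ioSiblings.get('after')]);  // [['on', 2]]
```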
#### `getRelatedNgrams(words, ngrams, ngramSize)`
Gets the ngrams that will occur before or after other ngrams. Useful for finding patterns of ngrams.
**Parameters**
| name | type | Description |
| --- |--- | --- |
| words | `Array` | an array of words to evaluate |
| ngrams | `Map` | a frequency map of ngrams |
| ngramSize | `number` | default = 2. the size of the ngram |
**Returns**
`Map` A frequency map of how often ngrams occurred before or after other ngrams
**Example**
This requires several steps. You'll need an array of words and a frequency map of ngrams.
```JavaScript
const { Methodius } = require('methodius');
// These are static methods, so call them on the class.
const ngrams = Methodius.getNGrams('the revolution of the nation was on television. It was about pollution and the terrible situation ', 2);
const frequencyMap = Methodius.getFrequencyMap(ngrams);
const topNgrams = Methodius.getTopGrams(frequencyMap, 5);
const words = ['the', 'revolution', 'of', 'the', 'nation', 'was', 'on', 'television', 'it', 'was', 'about', 'pollution', 'and', 'the', 'terrible', 'situation' ];
const relatedNgrams = Methodius.getRelatedNgrams(words, topNgrams, 2);
```
#### `getNgramTreeCollection(words)`
Gets a nested map of maps that breaks down unique words into their smallest ngrams
**Parameters**
| name | type | Description |
| --- |--- | --- |
| words | `Array` | an array of words to evaluate |
**Returns**
`Map` A nested map of maps that breaks down unique words into their smallest ngrams.
### Instance Members
#### `sanitizedText`
lowercased text with diacritics removed
`string`
#### `letters`
an array of letters in the text
`Array`
#### `words`
an array of words in the text
`Array`
#### `bigrams`
an array of letter bigrams in the text
`Array`
#### `trigrams`
an array of letter trigrams in the text
`Array`
#### `uniqueLetters`
an array of unique letters in the text
`Array`
#### `uniqueBigrams`
an array of unique bigrams in the text
`Array`
#### `uniqueTrigrams`
an array of unique trigrams in the text
`Array`
#### `letterPositions`
a map of placements of letters within words
`Map<string, Map>`
#### `bigramPositions`
a map of placements of bigrams within words
`Map<string, Map>`
#### `trigramPositions`
a map of placements of trigrams within words
`Map<string, Map>`
#### `uniqueWords`
an array of unique words in the text
`Array`
#### `letterFrequencies`
a map of letter frequencies in the sanitized text
`Map`
#### `bigramFrequencies`
a map of bigram frequencies in the sanitized text
`Map`
#### `trigramFrequencies`
a map of trigram frequencies in the sanitized text
`Map`
#### `wordFrequencies`
a map of word frequencies in the sanitized text
`Map`
#### `letterPercentages`
a map of letter percentages in the sanitized text
`Map`
#### `bigramPercentages`
a map of bigram percentages in the sanitized text
`Map`
#### `trigramPercentages`
a map of trigram percentages in the sanitized text
`Map`
#### `wordPercentages`
a map of word percentages in the sanitized text
`Map`
#### `meanWordSize`
The average size of a word
`number`
#### `medianWordSize`
The middle size of a word
`number`
#### `ngramTreeCollection`
A nested map of maps that breaks down unique words into their smallest ngrams.
### Instance Methods
#### `getLetterNGrams(size)`
gets an array of letter ngrams of a customizable size from the text
**Parameters**
| name | type | Description |
| --- |--- | --- |
| size | `number` | default = 2. size of the ngram to return |
**Returns**
`Array`
#### `getTopLetters(limit)`
a map of the most used letters in the text
**Parameters**
| name | type | Description |
| --- |--- | --- |
| limit | `number` | default = 20. number of top letters to return |
**Returns**
`Map`
#### `getTopBigrams(limit)`
a map of the most used bigrams in the text
**Parameters**
| name | type | Description |
| --- |--- | --- |
| limit | `number` | default = 20. number of top bigrams to return |
**Returns**
`Map`
#### `getTopTrigrams(limit)`
a map of the most used trigrams in the text
**Parameters**
| name | type | Description |
| --- |--- | --- |
| limit | `number` | default = 20. number of top trigrams to return |
**Returns**
`Map`
#### `getTopWords(limit)`
a map of the most used words in the text
**Parameters**
| name | type | Description |
| --- |--- | --- |
| limit | `number` | default = 20. number of top words to return |
**Returns**
`Map`
#### `compareTo(methodius)`
Compare this methodius instance to another
**Parameters**
| name | type | Description |
| --- |--- | --- |
| methodius | `Methodius` | another Methodius instance |
**Returns**
`Map`
A map of property names and their comparisons (intersection, disjunctiveUnion, etc.) for a set of properties
#### `getRelatedTopNgrams(ngramSize, limit)`
Gets the ngrams that will occur before or after other ngrams based on what the most frequent ngrams are. Useful for finding patterns of ngrams.
**Parameters**
| name | type | Description |
| --- |--- | --- |
| ngramSize | `number` | default = 2. the size of the ngram |
| limit | `number` | default = 20. the number of top ngrams to use |
**Returns**
`Map` A frequency map of how often the most common ngrams occurred before or after other common ngrams