https://github.com/paceaux/methodius

A utility for analyzing text on the web
bigram ngram parse split tokenize trigram
Host: GitHub
URL: https://github.com/paceaux/methodius
Owner: paceaux
License: other
Created: 2022-07-18T19:15:10.000Z (almost 4 years ago)
Default Branch: develop
Last Pushed: 2025-09-19T03:25:08.000Z (9 months ago)
Last Synced: 2025-10-29T09:37:36.233Z (8 months ago)
Topics: bigram, ngram, parse, split, tokenize, trigram
Language: TypeScript
Homepage:
Size: 362 KB
Stars: 5
Watchers: 1
Forks: 2
Open Issues: 7
Metadata Files:
- Readme: README.md
- License: LICENSE.md

# Methodius (an NGram utility)

A utility for analyzing frequency of text chunks on the web.

Supply a bit o' text to the Methodius class, and let it determine your bigrams, trigrams, ngrams, letter-frequencies, word frequencies, bigram relationships, and create ngram trees. 

[![Hippocratic License HL3-LAW-MEDIA-MIL-SOC-SV](https://img.shields.io/static/v1?label=Hippocratic%20License&message=HL3-LAW-MEDIA-MIL-SOC-SV&labelColor=5e2751&color=bc8c3d)](https://firstdonoharm.dev/version/3/0/law-media-mil-soc-sv.html)

![npm](https://img.shields.io/npm/dm/methodius)

## Example

```JavaScript
const { Methodius } = require('methodius');
// or import { Methodius } from 'methodius';

const udhr1 = `
All human beings are born free and equal in dignity and rights. They are endowed with reason and conscience and should act towards one another in a spirit of brotherhood.
`;

const nGrams = new Methodius(udhr1);
const topLetters = nGrams.getTopLetters(10);
const topWords = nGrams.getTopWords(10);
```

# API

## `Methodius`

Global Class

`new Methodius(text)`

**Parameters**

| name | type | Description |
| --- | --- | --- |
| text | string | raw text to be analyzed |

### Static Members

#### `Punctuations`

characters to ignore when analyzing text

period, comma, semicolon, colon, bang, question mark, interrobang, Spanish bang+, parens, bracket, brace, single quote, some spaces

`\\.,;:!?‽¡¿⸘()\\[\\]{}<>’'…\"\n\t\r`

#### `wordSeparators`

characters to ignore AND CONSUME when trying to find words

em-dash, period, comma, semicolon, colon, bang, question mark, interrobang, Spanish bang+, parens, bracket, brace, single quote, space

`—\\.,;:!?‽¡¿⸘()\\[\\]{}<>…"\\s`

### Static Methods

#### `hasPunctuation(string)`

 determines if string contains punctuation 

 

**Parameters**

| name | type | Description |
| --- | --- | --- |
| string | string | |

**Returns**

`boolean`

#### `hasSymbols(string)`

 determines if string contains symbols 

 

**Parameters**

| name | type | Description |
| --- | --- | --- |
| string | string | |

**Returns**

`boolean`

#### `hasSpace(string)`

 determines if a string has a space 

**Parameters**

| name | type | Description |
| --- | --- | --- |
| string | string | |

**Returns**

`boolean`

#### `sanitizeText(string)`

 lowercases text and removes diacritics and other characters that would throw off n-gram analysis 

**Parameters**

| name | type | Description |
| --- | --- | --- |
| string | string | |

**Returns**

`string`
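
As a rough illustration of the lowercasing and diacritic removal described above, here is a minimal sketch. `sketchSanitize` is a hypothetical helper, not Methodius's implementation, and it does not attempt the "other characters" part of the description.

```JavaScript
// Hypothetical helper (not part of Methodius): lowercase the text,
// then strip combining diacritical marks.
function sketchSanitize(text) {
  return text
    .toLowerCase()
    .normalize('NFD')                 // split "é" into "e" plus a combining accent
    .replace(/[\u0300-\u036f]/g, ''); // drop the combining marks
}

console.log(sketchSanitize('Crème Brûlée')); // creme brulee
```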

#### `getWords(text)`

 extracts an array of words from a string 

**Parameters**

| name | type | Description |
| --- | --- | --- |
| text | string | |

**Returns**

`Array`

#### `getNGrams(text, gramSize)`

 gets ngrams from text 

**Parameters**

| name | type | Description |
| --- | --- | --- |
| text | string | |
| gramSize | number | Default = 2 |

**Returns**

`Array`
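
To make the idea of an n-gram concrete, the sketch below slides a fixed-size window over a single word. `sketchNGrams` is a hypothetical helper; how Methodius itself treats spaces and punctuation inside `text` is not covered here.

```JavaScript
// Hypothetical helper (not part of Methodius): sliding-window n-grams over one word.
function sketchNGrams(word, gramSize = 2) {
  const grams = [];
  // Stop once the window would run past the end of the word.
  for (let i = 0; i + gramSize <= word.length; i += 1) {
    grams.push(word.slice(i, i + gramSize));
  }
  return grams;
}

console.log(sketchNGrams('nation'));    // [ 'na', 'at', 'ti', 'io', 'on' ]
console.log(sketchNGrams('nation', 3)); // [ 'nat', 'ati', 'tio', 'ion' ]
```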

#### `getMeanWordSize(wordArray)`

 Gets average size of a word

**Parameters**

| name | type | Description |
| --- | --- | --- |
| wordArray | string[] | |

**Returns**

`number`

#### `getMedianWordSize(wordArray)`

 Gets the median (middle) size of a word

**Parameters**

| name | type | Description |
| --- | --- | --- |
| wordArray | string[] | |

**Returns**

`number`
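
The difference between the mean and the median word size can be sketched as follows. Both helpers are hypothetical, and averaging the two middle values for an even count is an assumption rather than documented behavior.

```JavaScript
// Hypothetical helpers (not part of Methodius).
function sketchMeanWordSize(wordArray) {
  const total = wordArray.reduce((sum, word) => sum + word.length, 0);
  return total / wordArray.length;
}

function sketchMedianWordSize(wordArray) {
  const sizes = wordArray.map((word) => word.length).sort((a, b) => a - b);
  const mid = Math.floor(sizes.length / 2);
  // Even count: average the two middle sizes (an assumption about tie handling).
  return sizes.length % 2 === 0 ? (sizes[mid - 1] + sizes[mid]) / 2 : sizes[mid];
}

const sampleWords = ['all', 'human', 'beings', 'are', 'born']; // sizes 3, 5, 6, 3, 4
console.log(sketchMeanWordSize(sampleWords));   // 4.2
console.log(sketchMedianWordSize(sampleWords)); // 4
```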

#### `getWordNGrams(text, gramSize)`

Gets word n-grams from text (2-word pairs by default).

Note: This doesn't use sentence punctuation as a boundary. Should it?

**Parameters**

| name | type | Description |
| --- | --- | --- |
| text | string | |
| gramSize | number | default = 2 |

**Returns**

`Array`
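
A word n-gram is the same sliding window applied to words instead of letters. The sketch below assumes an already-split array of words and returns each n-gram as an array of words; both choices are assumptions, and `sketchWordNGrams` is a hypothetical helper rather than the library's method.

```JavaScript
// Hypothetical helper (not part of Methodius): sliding window over an array of words.
function sketchWordNGrams(words, gramSize = 2) {
  const grams = [];
  for (let i = 0; i + gramSize <= words.length; i += 1) {
    grams.push(words.slice(i, i + gramSize));
  }
  return grams;
}

console.log(sketchWordNGrams(['the', 'nation', 'was', 'on']));
// [ [ 'the', 'nation' ], [ 'nation', 'was' ], [ 'was', 'on' ] ]
```

Like the note above says, nothing here treats sentence punctuation as a boundary.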

#### `getFrequencyMap(ngramArray)`

 converts an array of strings into a map of those strings and their number of occurrences 

**Parameters**

| name | type | Description |
| --- | --- | --- |
| ngramArray | `Array` | |

**Returns**

`Map`
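
Counting occurrences into a `Map` can be sketched like this. `sketchFrequencyMap` is a hypothetical helper that shows the general technique, not the library's code.

```JavaScript
// Hypothetical helper (not part of Methodius): count how often each string occurs.
function sketchFrequencyMap(ngramArray) {
  const frequencies = new Map();
  for (const gram of ngramArray) {
    frequencies.set(gram, (frequencies.get(gram) || 0) + 1);
  }
  return frequencies;
}

const counts = sketchFrequencyMap(['ti', 'io', 'on', 'ti', 'io', 'on', 'na']);
console.log(counts.get('ti')); // 2
console.log(counts.get('na')); // 1
```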

#### `getPercentMap(frequencyMap)`

 converts a frequency map into a map of percentages 

**Parameters**

| name | type | Description |
| --- | --- | --- |
| frequencyMap | `Map` | |

**Returns**

`Map`
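
One way to turn counts into shares of the total is sketched below. `sketchPercentMap` is a hypothetical helper, and expressing each share as a fraction between 0 and 1 is an assumption; the library may scale its values differently.

```JavaScript
// Hypothetical helper (not part of Methodius): each count divided by the total count.
function sketchPercentMap(frequencyMap) {
  let total = 0;
  for (const count of frequencyMap.values()) total += count;

  const percents = new Map();
  for (const [gram, count] of frequencyMap) {
    percents.set(gram, count / total); // assumption: fractions in the range 0..1
  }
  return percents;
}

const shares = sketchPercentMap(new Map([['ti', 2], ['on', 1], ['na', 1]]));
console.log(shares.get('ti')); // 0.5
console.log(shares.get('on')); // 0.25
```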

#### `getTopGrams(frequencyMap, limit)`

 filters a frequency map into only a small subset of the most frequent ones 

 

**Parameters**

| name | type | Description |
| --- | --- | --- |
| frequencyMap | `Map` | |
| limit | number | default = 20 |

**Returns**

`Map`
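
Filtering a frequency map down to its most frequent entries can be sketched as a sort followed by a slice. `sketchTopGrams` is a hypothetical helper; how the library orders entries with equal counts is not specified here.

```JavaScript
// Hypothetical helper (not part of Methodius): keep the `limit` highest counts.
function sketchTopGrams(frequencyMap, limit = 20) {
  // Sort entries by count, highest first; ties keep their insertion order.
  const sorted = [...frequencyMap.entries()].sort((a, b) => b[1] - a[1]);
  return new Map(sorted.slice(0, limit));
}

const top = sketchTopGrams(new Map([['na', 1], ['ti', 3], ['on', 2]]), 2);
console.log([...top.keys()]); // [ 'ti', 'on' ]
```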

#### `getIntersection(iterable1, iterable2)`

Returns an array of items that occur in both iterables

 

**Parameters**

| name | type | Description |
| --- | --- | --- |
| iterable1 | `Map\|Array` | |
| iterable2 | `Map\|Array` | |

**Returns**

`Array` 

An array of items that occur in both iterables. It will compare the keys, if sent a map

#### `getUnion(iterable1, iterable2)`

Returns an array that is the union of two iterables

**Parameters**

| name | type | Description |
| --- | --- | --- |
| iterable1 | `Map\|Array` | |
| iterable2 | `Map\|Array` | |

**Returns**

`Array` 

A union of the items that occur in either iterable.

#### `getDisjunctiveUnion(iterable1, iterable2)`

Returns an array of arrays of the unique items in either iterable. Also known as the symmetric difference

 

**Parameters**

| name | type | Description |
| --- | --- | --- |
| iterable1 | `Map\|Array` | |
| iterable2 | `Map\|Array` | |

**Returns**

`Array` 

An array of arrays of the unique items. The first array holds the items unique to the first parameter; the second array holds the items unique to the second parameter.

#### `getDifference(iterable1, iterable2)`

Returns an array of items that are unique only to the first parameter. 

**Parameters**

| name | type | Description |
| --- | --- | --- |
| iterable1 | `Map\|Array` | |
| iterable2 | `Map\|Array` | |

**Returns**

`Array`

An array of items unique only to the first parameter

#### `getComparison(iterable1, iterable2)`

Returns a map containing various comparisons between two iterables

 

**Parameters**

| name | type | Description |
| --- | --- | --- |
| iterable1 | `Map\|Array` | |
| iterable2 | `Map\|Array` | |

**Returns**

`Map<string, Array>`

A map containing various comparisons between two iterables. Those comparisons will be arrays of intersection, disjunctiveUnion, difference, and union.
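
The four comparisons described in this group of methods can be sketched with plain arrays and sets. `sketchComparison` and `toKeys` are hypothetical helpers, and the result keys simply follow the names listed above; the library's exact output shape may differ.

```JavaScript
// Hypothetical helpers (not part of Methodius). Maps are compared by key.
const toKeys = (iterable) => (iterable instanceof Map ? [...iterable.keys()] : [...iterable]);

function sketchComparison(iterable1, iterable2) {
  const first = toKeys(iterable1);
  const second = toKeys(iterable2);
  const inFirst = new Set(first);
  const inSecond = new Set(second);

  // Items only in the first iterable, and items only in the second.
  const difference = first.filter((item) => !inSecond.has(item));
  const reverseDifference = second.filter((item) => !inFirst.has(item));

  return new Map([
    ['intersection', first.filter((item) => inSecond.has(item))],
    ['union', [...new Set([...first, ...second])]],
    ['difference', difference],
    ['disjunctiveUnion', [difference, reverseDifference]],
  ]);
}

const comparison = sketchComparison(['th', 'he', 'on'], new Map([['on', 4], ['io', 2]]));
console.log(comparison.get('intersection'));     // [ 'on' ]
console.log(comparison.get('union'));            // [ 'th', 'he', 'on', 'io' ]
console.log(comparison.get('difference'));       // [ 'th', 'he' ]
console.log(comparison.get('disjunctiveUnion')); // [ [ 'th', 'he' ], [ 'io' ] ]
```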

#### `getWordPlacementForNGram(ngram, wordsArray)`

determines the placement of a single ngram in an array of words

 

**Parameters**

| name | type | Description |
| --- | --- | --- |
| ngram | `string` | |
| wordsArray | `Array` | |

**Returns**

`Map` 

a map with the keys 'start', 'middle', and 'end' whose values correspond to how often the provided ngram occurs in this position
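
A minimal sketch of the start/middle/end count is shown below. `sketchWordPlacement` is a hypothetical helper, and counting an ngram that spans a whole word as 'start' is an assumption, not documented behavior.

```JavaScript
// Hypothetical helper (not part of Methodius): where does the ngram sit inside each word?
function sketchWordPlacement(ngram, wordsArray) {
  const placement = new Map([['start', 0], ['middle', 0], ['end', 0]]);
  if (ngram === '') return placement; // nothing to search for

  for (const word of wordsArray) {
    let index = word.indexOf(ngram);
    while (index !== -1) {
      let position = 'middle';
      if (index === 0) position = 'start'; // assumption: a whole-word match counts as 'start'
      else if (index + ngram.length === word.length) position = 'end';
      placement.set(position, placement.get(position) + 1);
      index = word.indexOf(ngram, index + 1); // look for further occurrences
    }
  }
  return placement;
}

const onPlacement = sketchWordPlacement('on', ['one', 'nation', 'money']);
console.log(onPlacement.get('start'));  // 1 ("one")
console.log(onPlacement.get('middle')); // 1 ("money")
console.log(onPlacement.get('end'));    // 1 ("nation")
```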

#### `getWordPlacementForNGrams(ngrams, wordsArray)`

determines the placement of ngrams in an array of words

 

**Parameters**

| name | type | Description |
| --- | --- | --- |
| ngrams | `Array` | |
| wordsArray | `Array` | |

**Returns**

`Map<string, Map>`

a map with the key of the ngram, and the value that is a map containing start, middle, end

#### `getNgramCollections(wordArray, ngramSize)`

gets ngrams from an array of words

 

**Parameters**

| name | type | Description |
| --- | --- | --- |
| wordArray | `Array` | an array of words |
| ngramSize | `number` | default = 2. The size of the ngrams to return |

**Returns**

`Array<Array<string>>`

An array containing arrays of ngrams, each array corresponds to a word. 

#### `getNgramSiblings(searchText, ngramCollections, siblingSize)`

using a collection returned from getNgramCollections, searches for a string and returns what comes before and after it

 

**Parameters**

| name | type | Description |
| --- | --- | --- |
| searchText | `string` | the string to search for |
| ngramCollections | `Array<string>\|Array<Array<string>>` | an array of ngrams, or an nGramCollection |
| siblingSize | `number` | default = 1. How many siblings to find in front or behind |

**Returns**

`Map<'before'|'after',Map>` 

a Map with the keys 'before' and 'after' which contain maps of what comes before and after

**Example**

```JavaScript
const words = ['revolution', 'nation'];
const ngramCollections = Methodius.getNgramCollections(words, 2);
const ioSiblings = Methodius.getNgramSiblings('io', ngramCollections);
/*
new Map([
  ['before', new Map([
    ['ti', 2]
  ])],
  ['after', new Map([
    ['on', 2]
  ])]
])
*/
```
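
For readers curious how such a lookup could work, here is a minimal sketch. `sketchSiblings` is a hypothetical helper, and reading `siblingSize` as "collect up to that many neighbors on each side" is an assumption; it is not the library's implementation.

```JavaScript
// Hypothetical helper (not part of Methodius): tally what sits before and after a match.
function sketchSiblings(searchText, ngramCollections, siblingSize = 1) {
  const before = new Map();
  const after = new Map();
  const bump = (map, key) => map.set(key, (map.get(key) || 0) + 1);

  for (const grams of ngramCollections) {
    grams.forEach((gram, i) => {
      if (gram !== searchText) return;
      // Assumption: gather up to `siblingSize` neighbors on each side.
      for (let offset = 1; offset <= siblingSize; offset += 1) {
        if (i - offset >= 0) bump(before, grams[i - offset]);
        if (i + offset < grams.length) bump(after, grams[i + offset]);
      }
    });
  }
  return new Map([['before', before], ['after', after]]);
}

// Letter bigrams of "revolution" and "nation", written out by hand.
const collections = [
  ['re', 'ev', 'vo', 'ol', 'lu', 'ut', 'ti', 'io', 'on'],
  ['na', 'at', 'ti', 'io', 'on'],
];
const siblings = sketchSiblings('io', collections);
console.log(siblings.get('before').get('ti')); // 2
console.log(siblings.get('after').get('on'));  // 2
```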

#### `getRelatedNgrams(words, ngrams, ngramSize)`

Gets the ngrams that will occur before or after other ngrams. Useful for finding patterns of ngrams.

**Parameters**

| name | type | Description |
| --- | --- | --- |
| words | `Array` | an array of words to evaluate |
| ngrams | `Map` | a frequency map of ngrams |
| ngramSize | `number` | default = 2. the size of the ngram |

**Returns**

`Map` A frequency map of how often ngrams occurred before or after other ngrams

**Example**

This requires several steps. You'll need an array of words and a frequency map of ngrams.

```JavaScript
// getNGrams, getFrequencyMap, getTopGrams, and getRelatedNgrams are static methods of Methodius
const ngrams = Methodius.getNGrams('the revolution of the nation was on television. It was about pollution and the terrible situation ', 2);
const frequencyMap = Methodius.getFrequencyMap(ngrams);
const topNgrams = Methodius.getTopGrams(frequencyMap, 5);
const words = ['the', 'revolution', 'of', 'the', 'nation', 'was', 'on', 'television', 'it', 'was', 'about', 'pollution', 'and', 'the', 'terrible', 'situation' ];
const relatedNgrams = Methodius.getRelatedNgrams(words, topNgrams, 2);
```

#### `getNgramTreeCollection(words)`

Gets a nested map of maps that breaks down unique words into their smallest ngrams

**Parameters**

| name | type | Description |
| --- | --- | --- |
| words | `Array` | an array of words to evaluate |

**Returns**

`Map` A nested map of maps that breaks down unique words into their smallest ngrams.

### Instance Members

#### `sanitizedText`

lowercased text with diacritics removed

`string`

#### `letters`

 an array of letters in the text

`Array`

#### `words`

 an array of words in the text

 `Array`

#### `bigrams`

 an array of letter bigrams in the text

  `Array`

#### `trigrams`

 an array of letter trigrams in the text

 `Array`

#### `uniqueLetters`

 an array of unique letters in the text

 `Array`

#### `uniqueBigrams`

 an array of unique bigrams in the text

 `Array`

#### `uniqueTrigrams`

 an array of unique trigrams in the text

 `Array`

#### `letterPositions`

a map of placements of letters within words

 `Map<string, Map>`

#### `bigramPositions`

a map of placements of bigrams within words

 `Map<string, Map>`

#### `trigramPositions`

a map of placements of trigrams within words

 `Map<string, Map>`

#### `uniqueWords`

 an array of unique words in the text

  `Array`

#### `letterFrequencies`

 a map of letter frequencies in the sanitized text

  `Map`

#### `bigramFrequencies`

 a map of bigram frequencies in the sanitized text

  `Map`

#### `trigramFrequencies`

 a map of trigram frequencies in the sanitized text

  `Map`

#### `wordFrequencies`

 a map of word frequencies in the sanitized text

  `Map`

#### `letterPercentages`

 a map of letter percentages in the sanitized text

  `Map`

#### `bigramPercentages`

 a map of bigram percentages in the sanitized text

  `Map`

#### `trigramPercentages`

 a map of trigram percentages in the sanitized text

  `Map`

#### `wordPercentages`

 a map of word percentages in the sanitized text

  `Map`

#### `meanWordSize`

 The average size of a word

  

  `number`

#### `medianWordSize`

 The median (middle) size of a word

 `number`

#### `ngramTreeCollection`

A nested map of maps that breaks down unique words into their smallest ngrams.

### Instance Methods

#### `getLetterNGrams(size)`

gets an array of letter ngrams of a customizable size from the text

**Parameters**

| name | type | Description |
| --- | --- | --- |
| size | `number` | default = 2. Size of the n-gram to return |

**Returns**

`Array`

#### `getTopLetters(limit)`

 a map of the most used letters in the text

**Parameters**

| name | type | Description |
| --- | --- | --- |
| limit | `number` | default = 20. Number of top letters to return |

**Returns**

`Map`

#### `getTopBigrams(limit)`

 a map of the most used bigrams in the text

**Parameters**

| name | type | Description |
| --- | --- | --- |
| limit | `number` | default = 20. Number of top bigrams to return |

**Returns**

`Map`

#### `getTopTrigrams(limit)`

 a map of the most used trigrams in the text

**Parameters**

| name | type | Description |
| --- | --- | --- |
| limit | `number` | default = 20. Number of top trigrams to return |

**Returns**

`Map`

#### `getTopWords(limit)`

 a map of the most used words in the text

**Parameters**

| name | type | Description |
| --- | --- | --- |
| limit | `number` | default = 20. Number of top words to return |

**Returns**

`Map`

#### `compareTo(methodius)`

Compare this methodius instance to another

**Parameters**

| name | type | Description |
| --- | --- | --- |
| methodius | `Methodius` | another Methodius instance |

**Returns**

`Map`

A map of property names and their comparisons (intersection, disjunctiveUnion, etc.) for a set of properties

#### `getRelatedTopNgrams(ngramSize, limit)`

Gets the ngrams that will occur before or after other ngrams based on what the most frequent ngrams are. Useful for finding patterns of ngrams.

**Parameters**

| name | type | Description |
| --- | --- | --- |
| ngramSize | `number` | default = 2. the size of the ngram |
| limit | `number` | default = 20. the number of top ngrams to use |

**Returns**

`Map` A frequency map of how often the most common ngrams occurred before or after other common ngrams