Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://github.com/phughesmcr/simplengrams
The easiest way to get n-grams from strings!
https://github.com/phughesmcr/simplengrams
ngram ngrams nlp nlp-parsing nlp-resources text-mining
Last synced: 24 days ago
JSON representation
The easiest way to get n-grams from strings!
- Host: GitHub
- URL: https://github.com/phughesmcr/simplengrams
- Owner: phughesmcr
- License: mit
- Created: 2017-08-11T17:05:48.000Z (over 7 years ago)
- Default Branch: master
- Last Pushed: 2024-02-29T21:46:32.000Z (10 months ago)
- Last Synced: 2024-11-10T07:28:08.190Z (about 2 months ago)
- Topics: ngram, ngrams, nlp, nlp-parsing, nlp-resources, text-mining
- Language: JavaScript
- Size: 340 KB
- Stars: 3
- Watchers: 2
- Forks: 1
- Open Issues: 0
-
Metadata Files:
- Readme: README.md
- License: LICENSE
Awesome Lists containing this project
README
# 🗿 SimpleNGrams
The easiest way to get an array of n-grams from a string or array of tokens!
No dependencies!
## Install
```bash
npm install --production --save simplengrams
```## Usage
SimpleNGrams exports one function: `nGram()`.
The function take the following arguments:
- `input` - a string or array of strings to be split into ngrams.
- `n` - the ngram size as a number. Defaults to 2 (i.e. bigrams).
- `pad` - optional padding parameter. Takes a boolean or an array. Defaults to
false (i.e. no padding). See [Padding](#padding) below.
- `splitPattern` - optional pattern as string or RegExp to split input string
by. Defaults to spaces. See [Pattern](#pattern) below.## Example
```javascript
import { nGram } from "simplengrams";
const text = "In the beginning God created the heavens and the earth.";
const bigrams = nGram(text);
console.log(bigrams);
```Output:
```javascript
[
["In", "the"],
["the", "beginning"],
["beginning", "God"],
["God", "created"],
["created", "the"],
["the", "heavens"],
["heavens", "and"],
["and", "the"],
["the", "earth"],
["earth", "."],
];
```## Padding
Custom padding options can be used to add right and left padding to the output
array.The padding argument is the third argument in `nGram()`. It takes a boolean
(i.e. `true` or `false`) or an array.The padding option defaults to `false` if it is not supplied.
Some examples:
- `false` = padding is not applied.
- `true` = `null` is used as padding.
- `['FOO', 'BAR']` = The string `'FOO'` is used as left padding and the string
`'BAR'` is used as right padding.- `['FOOBAR']` = The string `'FOOBAR'` is used as both left and right padding.
You can disable individual padding by using `undefined` like so:
- `[undefined, 'BAR']` will disable left padding and use `'BAR'` as right
padding.N.B. `null` will cause the padder to use the `null` element literally. Use
`undefined` instead to disable padding.N.B. Simply use `false` instead of `[undefined, undefined]` - it results in the
same output but is slightly faster.### Examples
#### pad = true
```javascript
import { nGram } from "simplengrams";
const text = "In the beginning God created the heavens and the earth.";
const bigrams = nGram(text, 2, true);
console.log(bigrams);
``````javascript
[
[null, "In"],
["In", "the"],
["the", "beginning"],
["beginning", "God"],
["God", "created"],
["created", "the"],
["the", "heavens"],
["heavens", "and"],
["and", "the"],
["the", "earth"],
["earth", "."],
[".", null],
];
```#### pad = [undefined, 'END']
```javascript
import { nGram } from "simplengrams";
const text = "In the beginning God created the heavens and the earth.";
const bigrams = nGram(text, 2, [undefined, "END"]);
console.log(bigrams);
``````javascript
[
["In", "the"],
["the", "beginning"],
["beginning", "God"],
["God", "created"],
["created", "the"],
["the", "heavens"],
["heavens", "and"],
["and", "the"],
["the", "earth"],
["earth", "."],
[".", "END"],
];
```## Pattern
The pattern argument is an optional fourth argument of `nGram()`.
It defaults to `/\s+/` and can take a string or a RegExp.
If your input is an array, the pattern argument is ignored.
For string inputs, the pattern argument is used to split the string into tokens,
exactly like `string.split()`.## Testing
Use `node index.test.js` to check any development changes against expected outputs.
## License
© 2017-24 [P. Hughes](https://www.phugh.es). All rights reserved.
Shared under the
[MIT license](http://creativecommons.org/licenses/by-nc-sa/3.0/) license.