Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://github.com/eklem/words-n-numbers
Tokenizing strings of text. Regex extracting arrays of words and optionally numbers, emojis, tags, usernames and email addresses from strings. For Node.js and the browser. When you need more than just [a-z] regular expressions.
- Host: GitHub
- URL: https://github.com/eklem/words-n-numbers
- Owner: eklem
- License: mit
- Created: 2019-07-29T06:38:32.000Z (over 5 years ago)
- Default Branch: trunk
- Last Pushed: 2024-07-22T13:51:01.000Z (4 months ago)
- Last Synced: 2024-09-22T17:47:30.435Z (about 1 month ago)
- Topics: nlp, offline-first, regex, tokenization, tokenizer
- Language: JavaScript
- Homepage:
- Size: 2.06 MB
- Stars: 11
- Watchers: 1
- Forks: 0
- Open Issues: 2
Metadata Files:
- Readme: README.md
- License: LICENSE
# Words'n'numbers
Tokenizing strings of text. Extracting arrays of words and optionally numbers, emojis, tags, usernames and email addresses from strings. For Node.js and the browser. When you need more than just [a-z] regular expressions. Part of document processing for [search-index](https://github.com/fergiemcdowall/search-index) and [nowsearch.xyz](https://github.com/eklem/nowsearch.xyz). Inspired by [extractwords](https://github.com/f-a-r-a-z/extractwords).
[![NPM version][npm-version-image]][npm-url]
[![NPM downloads][npm-downloads-image]][npm-url]
[![](https://data.jsdelivr.com/v1/package/npm/words-n-numbers/badge?style=rounded)](https://www.jsdelivr.com/package/npm/words-n-numbers)
[![Build Status][build-image]][build-url]
[![JavaScript Style Guide][standardjs-image]][standardjs-url]
[![MIT License][license-image]][license-url]
## Breaking change
From `v8.0.0`, the `emojis` regular expression extracts single emojis, so there are no more "words" formed by several emojis. This is because each emoji is, in a sense, a word. You can still grab several emojis in a row as one item by defining a custom regular expression, `const customEmojis = '\\p{Emoji_Presentation}+'`, and using it as your custom regex.
Meaning that instead of:
```javaScript
extract('A ticket to 大阪 costs ¥2000 👌😄 😢', { regex: emojis })
// ['👌😄', '😢']
```
...you will get:
```javaScript
extract('A ticket to 大阪 costs ¥2000 👌😄 😢', { regex: emojis })
// ['👌', '😄', '😢']
```
## Initiating
### CJS
```javascript
const { extract, words, numbers, emojis, tags, usernames, email } = require('words-n-numbers')
// extract, words, numbers, emojis, tags, usernames, email available
```
### ESM
```javascript
import { extract, words, numbers, emojis, tags, usernames, email } from 'words-n-numbers'
// extract, words, numbers, emojis, tags, usernames, email available
```
### Browser
```html
//wnn.extract, wnn.words, wnn.numbers, wnn.emojis, wnn.tags, wnn.usernames, wnn.email available
```
## Browser demo
A [simple browser demo of wnn](https://eklem.github.io/words-n-numbers/demo/) to show how it works.

[![Screenshot of the words-n-numbers demo](./demo/wnn-demo-screenshot.png)](https://eklem.github.io/words-n-numbers/demo/)
## Use
The default regex should catch every Unicode letter in every language. The default regex flags are `giu`. The `emojisCustom` regex won't work with the `u` (unicode) flag.
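To see why Unicode-aware matching matters, here is a minimal sketch using a native `RegExp` with the `giu` flags. The pattern `\p{L}+` and the variable names are assumptions made for illustration; the library's actual default pattern may differ.

```javascript
// Illustrative only: `\p{L}+` is an assumed stand-in, not the library's own pattern.
const text = 'A ticket to 大阪 costs ¥2000'

// ASCII-only matching misses words written in other scripts.
const asciiOnly = text.match(/[a-z]+/gi)

// With the `u` flag, `\p{L}` matches letters from any script.
const unicodeAware = text.match(/\p{L}+/giu)

console.log(asciiOnly) // [ 'A', 'ticket', 'to', 'costs' ]
console.log(unicodeAware) // [ 'A', 'ticket', 'to', '大阪', 'costs' ]
```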
### Only words
```javaScript
const stringOfWords = 'A 1000000 dollars baby!'
extract(stringOfWords)
// returns ['A', 'dollars', 'baby']
```
### Only words, converted to lowercase
```javaScript
const stringOfWords = 'A 1000000 dollars baby!'
extract(stringOfWords, { toLowercase: true })
// returns ['a', 'dollars', 'baby']
```
### Combining predefined regex for words and numbers, converted to lowercase
```javaScript
const stringOfWords = 'A 1000000 dollars baby!'
extract(stringOfWords, { regex: [words, numbers], toLowercase: true })
// returns ['a', '1000000', 'dollars', 'baby']
```
### Combining predefined regex for words and emojis, converted to lowercase
```javaScript
const stringOfWords = 'A ticket to 大阪 costs ¥2000 👌😄 😢'
extract(stringOfWords, { regex: [words, emojis], toLowercase: true })
// returns [ 'a', 'ticket', 'to', '大阪', 'costs', '👌', '😄', '😢' ]
```
### Combining predefined regex for numbers and emojis
```javaScript
const stringOfWords = 'A ticket to 大阪 costs ¥2000 👌😄 😢'
extract(stringOfWords, { regex: [numbers, emojis], toLowercase: true })
// returns [ '2000', '👌', '😄', '😢' ]
```
### Combining predefined regex for words, numbers and emojis, converted to lowercase
```javaScript
const stringOfWords = 'A ticket to 大阪 costs ¥2000 👌😄 😢'
extract(stringOfWords, { regex: [words, numbers, emojis], toLowercase: true })
// returns [ 'a', 'ticket', 'to', '大阪', 'costs', '2000', '👌', '😄', '😢' ]
```
### Predefined regex for `#tags`
```javaScript
const stringOfWords = 'A #49ticket to #大阪 or two#tickets costs ¥2000 👌😄😄 😢'
extract(stringOfWords, { regex: tags, toLowercase: true })
// returns [ '#49ticket', '#大阪' ]
```
### Predefined regex for `@usernames`
```javaScript
const stringOfWords = 'A #ticket to #大阪 costs [email protected], @alice and @美林 ¥2000 👌😄😄 😢'
extract(stringOfWords, { regex: usernames, toLowercase: true })
// returns [ '@alice', '@美林' ]
```
### Predefined regex for email addresses
```javaScript
const stringOfWords = 'A #ticket to #大阪 costs [email protected], [email protected], [email protected] and @美林 ¥2000 👌😄😄 😢'
extract(stringOfWords, { regex: email, toLowercase: true })
// returns [ '[email protected]', '[email protected]', '[email protected]' ]
```
### Predefined custom regex for all Unicode emojis
```javaScript
const stringOfWords = 'A #ticket to #大阪 costs [email protected], [email protected], [email protected] and @美林 ¥2000 👌😄😄 😢👩🏽‍🤝‍👨🏻 👩🏽‍🤝‍👨🏻'
extract(stringOfWords, { regex: emojisCustom, flags: 'g' })
// returns [ '👌', '😄', '😄', '😢', '👩🏽‍🤝‍👨🏻', '👩🏽‍🤝‍👨🏻' ]
```
### Custom regex
Some characters need to be escaped, like `\` and `'`. You escape them with a backslash, `\`.
```javaScript
const stringOfWords = 'This happens at 5 o\'clock !!!'
extract(stringOfWords, { regex: '[a-z\'0-9]+' })
// returns ['This', 'happens', 'at', '5', 'o\'clock']
```
## API
### Extract function
Returns an array of words and optionally numbers.
```javascript
extract(stringOfText, options)
```
### Options object
```javascript
{
  regex: 'custom or predefined regex', // defaults to words
  toLowercase: false, // true or false, defaults to false
  flags: 'gmixsuUAJD' // regex flags, defaults to giu - /[regexPattern]/[regexFlags]
}
```
### Order of combined regexes
You can add an array of different regexes or just a string. If you add an array, they will be joined with a `|`-separator, making it an OR-regex. Put the `email`, `usernames` and `tags` before `words` to get the extraction right.
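Why the order matters can be sketched with a native `RegExp`: in an alternation, the leftmost alternative that matches at a given position wins. The simplified `wordPattern` and `emailPattern` below, and the `joinPatterns` helper, are hypothetical stand-ins for illustration and are not the library's predefined regexes or internals.

```javascript
// Hypothetical, simplified patterns for illustration only.
const wordPattern = '\\p{L}+'
const emailPattern = '[\\p{L}.]+@[\\p{L}.]+'

// Mirror the described behaviour: join the patterns with `|`.
const joinPatterns = (patterns) => new RegExp(patterns.join('|'), 'giu')

const text = 'Mail bob@example.com now'

// Email first: the whole address is matched as one item.
const emailFirst = text.match(joinPatterns([emailPattern, wordPattern]))

// Words first: the word alternative consumes 'bob' before the
// email alternative gets a chance, so the address is split up.
const wordsFirst = text.match(joinPatterns([wordPattern, emailPattern]))

console.log(emailFirst) // [ 'Mail', 'bob@example.com', 'now' ]
console.log(wordsFirst) // [ 'Mail', 'bob', 'example', 'com', 'now' ]
```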
```javaScript
// email addresses before usernames before words can give another outcome
extract(oldString, { regex: [email, usernames, words] })
// than words before usernames before email addresses
extract(oldString, { regex: [words, usernames, email] })
```
### Predefined regexes
```javaScript
words // only words, any language <-- default
numbers // only numbers
emojis // only emojis
emojisCustom // only emojis. Works with the `g`-flag, not `giu`. Based on custom emoji extractor from https://github.com/mathiasbynens/rgi-emoji-regex-pattern
tags // #tags (any language)
usernames // @usernames (any language)
email // email addresses. Most valid addresses,
// but not to be used as a validator
```
### Flags for regexes
All but one of the regexes use the `giu` flags. The exception is `emojisCustom`, which needs only the `g` flag. `emojisCustom` was added because the standard `emojis` regex, based on `\\p{Emoji_Presentation}`, isn't able to grab all emojis. When browsers support `\p{RGI_Emoji}` under the `giu` flags, the library will be changed.
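The limitation of `\p{Emoji_Presentation}` can be seen with a native `RegExp`. This is a minimal sketch, assuming a runtime that supports Unicode property escapes; the sample emoji and variable names are chosen for illustration.

```javascript
// One visible emoji (👩🏽‍🤝‍👨🏻) built from several code points
// joined by zero-width joiners (U+200D).
const couple = '\u{1F469}\u{1F3FD}\u200D\u{1F91D}\u200D\u{1F468}\u{1F3FB}'

// `\p{Emoji_Presentation}` matches one code point at a time, so the
// sequence is split into its parts instead of being returned whole.
const parts = couple.match(/\p{Emoji_Presentation}/gu)

console.log(parts.length > 1) // true
console.log(parts.includes(couple)) // false
```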
### Languages supported
Supports most languages supported by [stopword](https://github.com/fergiemcdowall/stopword#language-code), and others too. Some languages, like Japanese and simplified Chinese, need to be tokenized first. Tokenizers may be added at a later stage.
#### PR's welcome
PRs and issues are more than welcome =)

[license-image]: http://img.shields.io/badge/license-MIT-blue.svg?style=flat
[license-url]: LICENSE
[npm-url]: https://npmjs.org/package/words-n-numbers
[npm-version-image]: http://img.shields.io/npm/v/words-n-numbers.svg?style=flat
[npm-downloads-image]: http://img.shields.io/npm/dm/words-n-numbers.svg?style=flat
[build-url]: https://github.com/eklem/words-n-numbers/actions/workflows/tests.yml
[build-image]: https://github.com/eklem/words-n-numbers/actions/workflows/tests.yml/badge.svg
[standardjs-url]: https://standardjs.com
[standardjs-image]: https://img.shields.io/badge/code_style-standard-brightgreen.svg?style=flat-square