https://github.com/mugendi/wordize


https://github.com/mugendi/wordize

consonant-blends natural-language-processing nlp tokenize words

Last synced: about 1 month ago
JSON representation

Awesome Lists containing this project

README

        

# For when tokenizers fail!
Have you ever tried to tokenize a sentence containing combined words like ```'tokenizerFail'```? That one is easy, because the words use *camel case* and the capital letter marks the boundary. But how about ```tokenizerfail```? With no capitals or separators to split on, I'm sure you see the trouble you run into tokenizing such words!
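To make the contrast concrete, here is a minimal sketch of why the camel-case form is the easy one. `splitCamelCase` is a hypothetical helper written for this illustration, not part of wordize; it simply cuts wherever a lowercase letter or digit is followed by an uppercase letter.

```javascript
// Hypothetical helper (not part of wordize): split a token at each
// lowercase-to-uppercase transition and lowercase the pieces.
function splitCamelCase(token) {
  return token
    .replace(/([a-z0-9])([A-Z])/g, '$1 $2') // insert a space at each case boundary
    .split(' ')
    .map((w) => w.toLowerCase());
}

console.log(splitCamelCase('tokenizerFail')); // [ 'tokenizer', 'fail' ]
console.log(splitCamelCase('tokenizerfail')); // [ 'tokenizerfail' ]
```

The second call returns the token unchanged: without a case change there is no signal for this approach to use.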

Unfortunately, with the advent of social media, these kinds of 'compounded' words are much more common (especially in hashtags).

This package uses known **consonant blends** to try to discover word boundaries and hence tokenize or humanize such words. It is not perfect (I'm looking for other methods to enhance it), but it gets you closer to perfect tokenization.
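One way to picture the consonant-blend idea is the rough sketch below. It is an assumption about how such a heuristic could work, not wordize's actual implementation: the `BLENDS` list is a small toy set (not the contents of ```./lang/en.json```), and `splitOnBlends` is a hypothetical helper. The rule it applies: when two different consonants sit next to each other and do not form a known blend, guess that a word boundary lies between them.

```javascript
// Toy blend list for illustration only; NOT the contents of ./lang/en.json.
const BLENDS = new Set([
  'bl', 'br', 'ch', 'cl', 'cr', 'dr', 'fl', 'fr', 'gl', 'gr',
  'nd', 'ng', 'nk', 'nt', 'pl', 'pr', 'sc', 'sh', 'sk', 'sl',
  'sm', 'sn', 'sp', 'st', 'sw', 'th', 'tr', 'tw', 'wh',
]);
const VOWELS = new Set('aeiou');

const isConsonant = (ch) => /[a-z]/.test(ch) && !VOWELS.has(ch);

// Hypothetical helper: cut between two adjacent consonants unless they
// form a known blend or are a doubled letter (as in 'yellow').
function splitOnBlends(word) {
  const parts = [];
  let start = 0;
  for (let i = 1; i < word.length; i++) {
    const a = word[i - 1];
    const b = word[i];
    if (isConsonant(a) && isConsonant(b) && a !== b && !BLENDS.has(a + b)) {
      parts.push(word.slice(start, i));
      start = i;
    }
  }
  parts.push(word.slice(start));
  return parts;
}

console.log(splitOnBlends('tokenizerfail')); // [ 'tokenizer', 'fail' ]
console.log(splitOnBlends('freakingpope'));  // [ 'freaking', 'pope' ]
console.log(splitOnBlends('rainmaker'));     // [ 'rain', 'maker' ]
```

A heuristic like this is far from perfect. With the toy list above it over-splits ordinary words (`thinks` comes out as `think` and `s`), and it cannot see a boundary where a vowel meets a consonant.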

## Adapt for your language
Don't speak English? Go to the ```./lang``` folder and create consonant blends for your language (check out ```./lang/en.json```).

```javascript
const wordize = require('wordize');

const str = 'there is this bigmanInYellowSUIT who thinks he is the freakingpope & our rainmaker';

// humanize the sentence
wordize.humanize(str, 'en'); // There is this big man in yellow suit who thinks he is the freaking pope & our rain maker

// get words from the sentence
// Note: the second parameter is the appropriate language code. Defaults to 'en'
wordize.words(str); // [ 'There', 'is', 'this', 'big', 'man', 'in', 'yellow', 'suit', 'who', 'thinks', 'he', 'is', 'the', 'freaking', 'pope', 'our', 'rain', 'maker' ]
```

*Got ideas on how we can enhance this module? Please share!*