https://github.com/mugendi/wordize
- Host: GitHub
- URL: https://github.com/mugendi/wordize
- Owner: mugendi
- Created: 2017-06-21T08:53:29.000Z (over 7 years ago)
- Default Branch: master
- Last Pushed: 2017-12-03T08:46:18.000Z (about 7 years ago)
- Last Synced: 2024-11-17T09:47:43.616Z (about 2 months ago)
- Topics: consonant-blends, natural-language-processing, nlp, tokenize, words
- Language: JavaScript
- Size: 5.86 KB
- Stars: 1
- Watchers: 2
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
# For when tokenizers fail!
Have you tried to tokenize a sentence with combined words like ```'tokenizerFail'```? Well, that is easy because the words use *camel case*. But how about ```tokenizerfail```? I'm sure you see the trouble you encounter tokenizing such words! Unfortunately, with the advent of social media, these kinds of 'compounded' words are much more common (especially in hashtags).
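To illustrate why the camel-case form is the easy one, here is a minimal sketch. `splitCamelCase` is a hypothetical helper written for this illustration and is not part of this package's API; it simply breaks a word wherever a lowercase letter is followed by an uppercase one.

```javascript
// Hypothetical helper (not part of wordize): split a word at lowercase-to-uppercase transitions.
function splitCamelCase(word) {
  // Insert a space between a lowercase letter and the uppercase letter that follows it.
  return word.replace(/([a-z])([A-Z])/g, '$1 $2').split(' ');
}

console.log(splitCamelCase('tokenizerFail')); // [ 'tokenizer', 'Fail' ]
console.log(splitCamelCase('tokenizerfail')); // [ 'tokenizerfail' ]
```

The second call shows the problem: without a case change there is nothing for the pattern to latch onto, so the word comes back unsplit.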
This package uses known **consonant blends** to try to discover word boundaries and hence tokenize/humanize such words. It is not perfect (I'm looking for other methods to enhance it), but it gets you closer to perfect tokenization.
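The general idea can be sketched roughly as follows. This is an illustration only, not the package's actual implementation: the `BLENDS` list is a small assumed sample (the real ```./lang/en.json``` may look quite different), and `splitOnBlends` is a hypothetical name. The sketch looks at each run of consonants inside a word; if the run is not a single consonant or a known blend, it assumes a word boundary falls somewhere inside that run.

```javascript
// Assumed sample of English consonant blends; NOT the contents of ./lang/en.json.
const BLENDS = new Set([
  'bl', 'br', 'ch', 'ck', 'cl', 'cr', 'dr', 'fl', 'fr', 'gl', 'gr', 'll',
  'nd', 'ng', 'nk', 'nt', 'pl', 'pr', 'sh', 'sk', 'sl', 'sm', 'sn', 'sp',
  'ss', 'st', 'sw', 'th', 'tr', 'wh'
]);
const VOWELS = new Set(['a', 'e', 'i', 'o', 'u']);

// A consonant cluster is plausible if it is one consonant or a known blend.
function isPlausible(cluster) {
  return cluster.length === 1 || BLENDS.has(cluster);
}

// Hypothetical helper: split one compounded word at implausible consonant clusters.
function splitOnBlends(word) {
  const lower = word.toLowerCase();
  const pieces = [];
  let start = 0; // index where the current piece begins
  let i = 0;
  while (i < lower.length) {
    if (VOWELS.has(lower[i])) { i++; continue; }
    // Collect the consonant run occupying indices [i, j).
    let j = i;
    while (j < lower.length && !VOWELS.has(lower[j])) j++;
    const run = lower.slice(i, j);
    const interior = i > 0 && j < lower.length; // vowels on both sides
    if (interior && !isPlausible(run)) {
      // Cut at the first point where both halves of the run are plausible.
      for (let k = 1; k < run.length; k++) {
        if (isPlausible(run.slice(0, k)) && isPlausible(run.slice(k))) {
          pieces.push(lower.slice(start, i + k));
          start = i + k;
          break;
        }
      }
    }
    i = j;
  }
  pieces.push(lower.slice(start));
  return pieces;
}

console.log(splitOnBlends('bigman'));       // [ 'big', 'man' ]
console.log(splitOnBlends('freakingpope')); // [ 'freaking', 'pope' ]
```

A heuristic like this misses compounds whose boundary falls next to a vowel or inside a cluster that happens to be a valid blend, which is consistent with the caveat above that the approach is not perfect.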
## Adapt for your language
Don't speak English? Go to the ```./lang``` folder and create consonant blends for your language (check out ```./lang/en.json```).

```javascript
const wordize = require('wordize');
const str = 'there is this bigmanInYellowSUIT who thinks he is the freakingpope & our rainmaker';

// humanize
wordize.humanize(str, 'en'); // There is this big man in yellow suit who thinks he is the freaking pope & our rain maker

// get words from the sentence
// Note: The second parameter is the appropriate language code. Defaults to 'en'
wordize.words(str); // [ 'There', 'is', 'this', 'big', 'man', 'in', 'yellow', 'suit', 'who', 'thinks', 'he', 'is', 'the', 'freaking', 'pope', 'our', 'rain', 'maker' ]
```
*Got ideas on how we can enhance this module? Please share!*