https://github.com/chearon/word-breaker
Unicode word boundary algorithm from UAX29 section 4
- Host: GitHub
- URL: https://github.com/chearon/word-breaker
- Owner: chearon
- Created: 2019-08-29T04:08:46.000Z (almost 7 years ago)
- Default Branch: master
- Last Pushed: 2019-09-08T18:37:05.000Z (almost 7 years ago)
- Last Synced: 2025-03-17T19:52:14.853Z (over 1 year ago)
- Language: JavaScript
- Homepage:
- Size: 11.7 KB
- Stars: 2
- Watchers: 2
- Forks: 1
- Open Issues: 0
Metadata Files:
- Readme: README.md
# word-breaker
An implementation of the Unicode word boundary rules (UAX #29, section 4.1). At time of writing it targets **Unicode 12**.
What are word boundaries used for?
* When you double-click a word in your web browser, UAX #29 section 4 defines where the selection should start and end
* CSS's `text-transform: capitalize`, which needs to find the first letter of each word
* Can be used for search algorithms too
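
As a rough illustration of the double-click case, the sketch below looks up the segment that contains a clicked character offset. The helper `wordRangeAt` and the hand-written `breaks` array are hypothetical and not part of this library; the array stands in for a list of boundary offsets like the ones a word breaker would report.

```javascript
// Hypothetical helper: `breaks` is an ascending array of boundary offsets.
// Returns the [start, end) range of the segment containing `index`.
function wordRangeAt(breaks, index) {
  for (let k = 0; k + 1 < breaks.length; k++) {
    const start = breaks[k];
    const end = breaks[k + 1];
    if (index >= start && index < end) return [start, end];
  }
  return null; // index lies outside the text
}

const text = 'has rules';
const breaks = [0, 3, 4, 9]; // hand-written boundaries for "has", " ", "rules"
const range = wordRangeAt(breaks, 6); // offset 6 is the "l" in "rules"
console.log(text.slice(range[0], range[1])); // rules
```

A linear scan keeps the sketch short; for long texts a binary search over `breaks` would be the more natural choice.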
It keeps grapheme clusters together, such as emojis with skin tone modifiers or letters with diacritical marks like a grave accent. It passes all 613 tests from the [Unicode auxiliary files](https://www.unicode.org/Public/UCD/latest/ucd/auxiliary/WordBreakTest.html#samples) for word breaks.
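
To see why these sequences need special care, it may help to look at how they are stored. The snippet below uses only built-in string operations (nothing from this library) and the variable names are illustrative: a single visible symbol can span several code points and even more UTF-16 code units, so a boundary must never fall between them.

```javascript
const thumbsUp = '\u{1F44D}\u{1F3FC}'; // thumbs up + medium-light skin tone modifier
const eGrave = 'e\u0300';              // "e" + combining grave accent

console.log(thumbsUp.length);             // 4 (UTF-16 code units)
console.log(Array.from(thumbsUp).length); // 2 (code points)
console.log(eGrave.length);               // 2 (code units)
console.log(eGrave.normalize('NFC') === '\u00E8'); // true: composes to a single "è"
```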
## API
```javascript
const WordBreaker = require('word-breaker');

// Three spaces after "like", then a tab and an emoji with a skin tone modifier
const string = 'UAX29 has rules like   WB4\t👍🏼';
const wb = new WordBreaker(string);

let last = null;
let i;

// nextBreak() returns the next boundary offset, or null when none are left
while ((i = wb.nextBreak()) !== null) {
  if (last !== null) console.log(string.slice(last, i));
  last = i;
}

// output (spaces shown as _):
// UAX29
// _
// has
// _
// rules
// _
// like
// ___
// WB4
// \t
// 👍🏼
```
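
If you would rather work with an array of segments than a manual loop, the pattern above can be wrapped in a generator. This is only a sketch: `segments` and `fakeBreaker` are hypothetical names, not part of this package. It assumes nothing beyond what the loop above relies on, namely that `nextBreak()` returns ascending offsets and then `null`. The `fakeBreaker` stub replays hand-written offsets so the example runs without the package installed; in real use you would pass `new WordBreaker(string)` instead.

```javascript
// Yields the text between consecutive boundaries reported by `breaker`.
// Works with any object whose nextBreak() returns ascending offsets, then null.
function* segments(string, breaker) {
  let last = null;
  let i;
  while ((i = breaker.nextBreak()) !== null) {
    if (last !== null) yield string.slice(last, i);
    last = i;
  }
}

// Hypothetical stand-in for `new WordBreaker(string)`: replays fixed offsets.
function fakeBreaker(offsets) {
  let k = 0;
  return { nextBreak: () => (k < offsets.length ? offsets[k++] : null) };
}

const sample = 'has rules';
const parts = Array.from(segments(sample, fakeBreaker([0, 3, 4, 9])));
console.log(parts); // [ 'has', ' ', 'rules' ]
```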
## More info
Inspired by [foliojs/grapheme-breaker](https://github.com/foliojs/grapheme-breaker), which implements grapheme cluster boundaries from the same specification, and [foliojs/linebreak](https://github.com/foliojs/linebreak). It uses the same project structure as those libraries and relies on [unicode-trie](https://github.com/foliojs/unicode-trie) for character classification.