https://github.com/bramstein/unicode-tokenizer
Unicode Tokenizer following the Unicode Line Breaking algorithm
- Host: GitHub
- URL: https://github.com/bramstein/unicode-tokenizer
- Owner: bramstein
- Created: 2012-08-20T17:54:30.000Z (almost 14 years ago)
- Default Branch: master
- Last Pushed: 2013-08-20T17:08:12.000Z (almost 13 years ago)
- Last Synced: 2025-03-19T09:14:27.136Z (about 1 year ago)
- Language: JavaScript
- Size: 309 KB
- Stars: 20
- Watchers: 4
- Forks: 5
- Open Issues: 0
Metadata Files:
- Readme: README.md
README
## Unicode Tokenizer
This tokenizer splits text into tokens according to the line breaking classes defined by the [Unicode Line Breaking algorithm (tr14)](http://unicode.org/reports/tr14/). It also annotates each token with its line breaking action. This is useful for Natural Language Processing and for manual line breaking.
Usage:

    var ut = require('unicode-tokenizer'),
        tokenizer = ut.createTokenizerStream();

    tokenizer.on('token', function(token, type, action) {
      // ...
    });

    tokenizer.write('Hello World!');
    tokenizer.end();

Note that to receive the token type and break action, you need to listen to the `token` event. The `token` parameter is a string containing the token, `type` is a number representing the token type, and `action` is a number representing the line break action. Both the token types and the line breaking actions are available as enumerations on the object returned by `require('unicode-tokenizer')`. If, for example, you would like to do something special for tokens with class `AL` that are also an explicit break, you can implement the above callback as shown below:

    tokenizer.on('token', function(token, type, action) {
      if (type === ut.Token.AL && action === ut.Break.EXPLICIT) {
        // Do something special
      }
    });

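As a further illustration of the enumerations, the sketch below tallies tokens by class name. It assumes that `ut.Token` is a plain object mapping class names to numbers, which the description above implies but does not spell out. `nameOf` and `tallyTokens` are hypothetical helpers, and the `EventEmitter` with its invented numeric values and token boundaries is only a stand-in so the snippet runs without the package installed.

```javascript
var EventEmitter = require('events').EventEmitter;

// Hypothetical helper: reverse lookup of an enumeration value,
// assuming the enumeration is a plain { NAME: number } object.
function nameOf(enumeration, value) {
  var names = Object.keys(enumeration);
  for (var i = 0; i < names.length; i += 1) {
    if (enumeration[names[i]] === value) {
      return names[i];
    }
  }
  return 'UNKNOWN';
}

// Hypothetical helper: count the tokens seen per class name.
// The returned object is updated as 'token' events arrive.
function tallyTokens(tokenizer, tokenEnum) {
  var counts = {};
  tokenizer.on('token', function(token, type, action) {
    var name = nameOf(tokenEnum, type);
    counts[name] = (counts[name] || 0) + 1;
  });
  return counts;
}

// Stand-ins for ut.Token and ut.createTokenizerStream(). The numeric
// values and token boundaries are invented for this demo and are not
// the library's real ones.
var fakeTokenEnum = { AL: 1, SP: 2, EX: 3 };
var fakeTokenizer = new EventEmitter();
var counts = tallyTokens(fakeTokenizer, fakeTokenEnum);

fakeTokenizer.emit('token', 'Hello', fakeTokenEnum.AL, 0);
fakeTokenizer.emit('token', ' ', fakeTokenEnum.SP, 0);
fakeTokenizer.emit('token', 'World', fakeTokenEnum.AL, 0);
fakeTokenizer.emit('token', '!', fakeTokenEnum.EX, 0);

console.log(counts);
```

With the real package, the equivalent call would presumably be `tallyTokens(ut.createTokenizerStream(), ut.Token)`, followed by `write` and `end` on the returned tokenizer.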
The `Tokenizer` returned by `createTokenizerStream` is also a valid Node.js `Stream`, so it can be used with other streams:

    process.stdin.pipe(tokenizer);
    tokenizer.pipe(process.stdout);
    process.stdin.resume();

## Unicode support
The tokenizer supports the full range of Unicode code points. However, if you only want to tokenize selected portions of the Unicode standard, such as the [Basic Multilingual Plane](http://en.wikipedia.org/wiki/Basic_Multilingual_Plane#Basic_Multilingual_Plane), you can subset the supported Unicode range. To generate a subsetted tokenizer, modify the `included-ranges.txt` and `excluded-classes.txt` files, and pass the `--include-ranges` and `--exclude-classes` command line options to the `generate-tokens` script.
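Whether a Basic Multilingual Plane subset is sufficient depends on the text being tokenized. The sketch below is one way to check that in plain JavaScript. It is independent of this package, and `isWithinBmp` is a hypothetical helper name. It relies on the fact that JavaScript strings encode code points above U+FFFF as surrogate pairs.

```javascript
// Hypothetical helper: returns true when `text` contains no surrogate
// code units, meaning it has no supplementary-plane characters (and no
// unpaired surrogates), so every character lies in the BMP.
function isWithinBmp(text) {
  for (var i = 0; i < text.length; i += 1) {
    var unit = text.charCodeAt(i);
    // Surrogate code units occupy U+D800 to U+DFFF.
    if (unit >= 0xD800 && unit <= 0xDFFF) {
      return false;
    }
  }
  return true;
}

console.log(isWithinBmp('Hello World!'));     // true
console.log(isWithinBmp('caf\u00E9 \u4E2D')); // true
console.log(isWithinBmp('\uD83D\uDE00'));     // false (U+1F600)
```

If a check like this returns `false` for representative input, a BMP-only subset would likely leave some characters outside the generated ranges.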
## Copyright and License
This project is licensed under the three-clause BSD license. Copyright 2012-2013 Bram Stein. All rights reserved.