https://github.com/bramstein/unicode-tokenizer

Unicode Tokenizer following the Unicode Line Breaking algorithm
https://github.com/bramstein/unicode-tokenizer

Last synced: about 1 year ago
JSON representation

Unicode Tokenizer following the Unicode Line Breaking algorithm

Host: GitHub
URL: https://github.com/bramstein/unicode-tokenizer
Owner: bramstein
Created: 2012-08-20T17:54:30.000Z (almost 14 years ago)
Default Branch: master
Last Pushed: 2013-08-20T17:08:12.000Z (almost 13 years ago)
Last Synced: 2025-03-19T09:14:27.136Z (about 1 year ago)
Language: JavaScript
Size: 309 KB
Stars: 20
Watchers: 4
Forks: 5
Open Issues: 0
Metadata Files:
- Readme: README.md

Awesome Lists containing this project

README

          ## Unicode Tokenizer

This is a tokenizer that tokenizes text according to the line breaking classes defined by the [Unicode Line Breaking algorithm (tr14)](http://unicode.org/reports/tr14/). It also annotates each token with its line breaking action. This is useful when performing Natural Language Processing or doing manual line breaking.

Usage:

    var ut = require('unicode-tokenizer'),

        tokenizer = ut.createTokenizerStream();

    tokenizer.on('token', function(token, type, action) {

        ...

    });

    tokenizer.write('Hello World!');

    tokenizer.end();

Note that in order to receive the token type and break action, you'll need to listen to the `token` event. The `token` parameter is a string containing the token, the `type` is a number representing the token type, and the action is also a number representing the line break action. Both the token types and line breaking actions are available as enumerations on the object returned by `require('unicode-tokenizer')`.  If, for example, you would like to do something special for tokens with class `AL` that are also an explicit break you can implement the above callback as shown below:

    tokenizer.on('token', function(token, type, action) {

        if (type === ut.Token.AL && action = ut.Break.EXPLICIT) {

            // Do something special

        }

    });

The `Tokenizer` returned by `createTokenizerStream` is also a valid Node.js `Stream` so it can be used with other streams:

    process.stdin.pipe(tokenizer);

    tokenizer.pipe(process.stdout);

    process.stdin.resume();

## Unicode support

The full range of Unicode code points are supported by this tokenizer. If you however only want to tokenize selected portions of the Unicode standard, such as the [Basic Multilingual Plane](http://en.wikipedia.org/wiki/Basic_Multilingual_Plane#Basic_Multilingual_Plane), you can subset the supported Unicode range. To generate a subsetted tokenizer, modify the `included-ranges.txt` and `excluded-classes.txt` files, and use the `--include-ranges` and `--exclude-classes` command line options on the `generate-tokens` script.

## Copyright and License

This project is licensed under the three-clause BSD license. Copyright 2012-2013 Bram Stein. All rights reserved.

ecosyste.ms

Data

Tools

Indexes

Applications

Experiments

Awesome

https://github.com/bramstein/unicode-tokenizer

Awesome Lists containing this project

README