{"id":17963853,"url":"https://github.com/bramstein/unicode-tokenizer","last_synced_at":"2025-03-25T05:32:22.874Z","repository":{"id":57386140,"uuid":"5484630","full_name":"bramstein/unicode-tokenizer","owner":"bramstein","description":"Unicode Tokenizer following the Unicode Line Breaking algorithm","archived":false,"fork":false,"pushed_at":"2013-08-20T17:08:12.000Z","size":316,"stargazers_count":20,"open_issues_count":0,"forks_count":5,"subscribers_count":4,"default_branch":"master","last_synced_at":"2025-03-19T09:14:27.136Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":null,"language":"JavaScript","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/bramstein.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2012-08-20T17:54:30.000Z","updated_at":"2019-05-07T10:25:45.000Z","dependencies_parsed_at":"2022-09-14T17:02:38.519Z","dependency_job_id":null,"html_url":"https://github.com/bramstein/unicode-tokenizer","commit_stats":null,"previous_names":[],"tags_count":5,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/bramstein%2Funicode-tokenizer","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/bramstein%2Funicode-tokenizer/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/bramstein%2Funicode-tokenizer/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/bramstein%2Funicode-tokenizer/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/bramstein","download_url":"https://codeload.github.com/bramstein/unicode-tokenizer/tar.gz/refs/heads/master","host
":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":245407667,"owners_count":20610231,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2024-10-29T11:45:41.345Z","updated_at":"2025-03-25T05:32:22.565Z","avatar_url":"https://github.com/bramstein.png","language":"JavaScript","funding_links":[],"categories":[],"sub_categories":[],"readme":"## Unicode Tokenizer\n\nThis is a tokenizer that tokenizes text according to the line breaking classes defined by the [Unicode Line Breaking algorithm (tr14)](http://unicode.org/reports/tr14/). It also annotates each token with its line breaking action. This is useful when performing Natural Language Processing or doing manual line breaking.\n\nUsage:\n\n    var ut = require('unicode-tokenizer'),\n        tokenizer = ut.createTokenizerStream();\n\n    tokenizer.on('token', function(token, type, action) {\n        ...\n    });\n\n    tokenizer.write('Hello World!');\n    tokenizer.end();\n\nNote that in order to receive the token type and break action, you'll need to listen to the `token` event. The `token` parameter is a string containing the token, the `type` is a number representing the token type, and the action is also a number representing the line break action. Both the token types and line breaking actions are available as enumerations on the object returned by `require('unicode-tokenizer')`.  
If, for example, you would like to do something special for tokens with class `AL` that are also an explicit break, you can implement the above callback as shown below:\n\n    tokenizer.on('token', function(token, type, action) {\n        if (type === ut.Token.AL \u0026\u0026 action === ut.Break.EXPLICIT) {\n            // Do something special\n        }\n    });\n\nThe `Tokenizer` returned by `createTokenizerStream` is also a valid Node.js `Stream`, so it can be used with other streams:\n\n    process.stdin.pipe(tokenizer);\n    tokenizer.pipe(process.stdout);\n    process.stdin.resume();\n\n## Unicode support\n\nThe full range of Unicode code points is supported by this tokenizer. If, however, you only want to tokenize selected portions of the Unicode standard, such as the [Basic Multilingual Plane](http://en.wikipedia.org/wiki/Basic_Multilingual_Plane#Basic_Multilingual_Plane), you can subset the supported Unicode range. To generate a subsetted tokenizer, modify the `included-ranges.txt` and `excluded-classes.txt` files, and use the `--include-ranges` and `--exclude-classes` command line options on the `generate-tokens` script.\n\n## Copyright and License\n\nThis project is licensed under the three-clause BSD license. Copyright 2012-2013 Bram Stein. All rights reserved.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fbramstein%2Funicode-tokenizer","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fbramstein%2Funicode-tokenizer","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fbramstein%2Funicode-tokenizer/lists"}