{"id":22702396,"url":"https://github.com/phughesmcr/happynodetokenizer","last_synced_at":"2025-04-13T08:02:26.021Z","repository":{"id":57261298,"uuid":"88893396","full_name":"phughesmcr/happynodetokenizer","owner":"phughesmcr","description":"Javascript port of HappyFunTokenizer.py by Christopher Potts and HappierFunTokenizing.py by H. Andrew Schwartz","archived":false,"fork":false,"pushed_at":"2024-02-29T21:22:05.000Z","size":1719,"stargazers_count":5,"open_issues_count":0,"forks_count":0,"subscribers_count":1,"default_branch":"master","last_synced_at":"2024-11-06T18:51:52.777Z","etag":null,"topics":["happierfuntokenizing","happyfuntokenizer","text-mining","tokeniser","tokenising","tokenizer","tokenizing","twitter"],"latest_commit_sha":null,"homepage":"","language":"TypeScript","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"other","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/phughesmcr.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2017-04-20T17:37:48.000Z","updated_at":"2024-03-01T19:11:54.000Z","dependencies_parsed_at":"2024-06-21T13:13:39.749Z","dependency_job_id":"5e7344e3-db27-43af-8aca-b8abaa01fcdf","html_url":"https://github.com/phughesmcr/happynodetokenizer","commit_stats":null,"previous_names":["phugh/happynodetokenizer"],"tags_count":1,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/phughesmcr%2Fhappynodetokenizer","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/phughesmcr%2Fhappynodetokenizer/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ph
ughesmcr%2Fhappynodetokenizer/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/phughesmcr%2Fhappynodetokenizer/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/phughesmcr","download_url":"https://codeload.github.com/phughesmcr/happynodetokenizer/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":229019441,"owners_count":18007169,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["happierfuntokenizing","happyfuntokenizer","text-mining","tokeniser","tokenising","tokenizer","tokenizing","twitter"],"created_at":"2024-12-10T07:13:19.464Z","updated_at":"2024-12-10T07:13:20.120Z","avatar_url":"https://github.com/phughesmcr.png","language":"TypeScript","funding_links":[],"categories":[],"sub_categories":[],"readme":"# 😄 HappyNodeTokenizer\n\nA basic Twitter-aware tokenizer for JavaScript environments.\n\nA TypeScript port of [HappyFunTokenizer.py](https://github.com/stanfordnlp/python-stanford-corenlp/blob/master/tests/happyfuntokenizer.py) by Christopher Potts and [HappierFunTokenizing.py](https://github.com/dlatk/happierfuntokenizing) by H. 
Andrew Schwartz.\n\n## Features\n* Accurate port of both libraries (run `npm run test`)\n* Typescript definitions\n* Uses generators / memoize for efficiency\n* Customizable and easy to use\n\n## Install\n\n### NPM\n```bash\n  npm install --save happynodetokenizer\n```\n\n### JSR (Deno / Bun)\n```bash\nbunx jsr i @phughesmcr/happynodetokenizer\n```\n\n## Usage\nHappyNodeTokenizer exports a function called `tokenizer()` which takes an optional configuration object *(See \"The Options Object\" below)*.\n\n### Example\n```javascript\nimport { tokenizer } from 'happynodetokenizer';\n// or import * as mod from \"@phughesmcr/happynodetokenizer\"; if using JSR\n\nconst text = 'RT @ #happyfuncoding: this is a typical Twitter tweet :-)';\n\n// these are the default options\nconst opts = {\n  'mode': 'stanford',\n  'normalize': undefined,\n  'preserveCase': true,\n};\n\n// create a tokenizer instance with our options\nconst myTokenizer = tokenizer(opts);\n\n// calling myTokenizer returns a generator function\nconst tokenGenerator = myTokenizer(text);\n\n// you can turn the generator into an array of token objects like this:\nconst tokens = [...tokenGenerator()];\n\n// you can also convert token objects to array of strings like this:\nconst values = Array.from(tokens, (token) =\u003e token.value);\n```\n#### Output\n\nThe `tokens` variable in the above example will look like this:\n\n```javascript\n[\n  { end: 1, start: 0, tag: 'word', value: 'rt' },\n  { end: 3, start: 3, tag: 'punct', value: '@' },\n  { end: 19, start: 5, tag: 'hashtag', value: '#happyfuncoding' },\n  { end: 20, start: 20, tag: 'punct', value: ':' },\n  { end: 25, start: 22, tag: 'word', value: 'this' },\n  { end: 28, start: 27, tag: 'word', value: 'is' },\n  { end: 30, start: 30, tag: 'word', value: 'a' },\n  { end: 38, start: 32, tag: 'word', value: 'typical' },\n  { end: 46, start: 40, tag: 'word', value: 'twitter' },\n  { end: 52, start: 48, tag: 'word', value: 'tweet' },\n  { end: 56, start: 54, tag: 
'emoticon', value: ':-)' }\n]\n```\n\nWhen `preserveCase` in the Options Object is `false`, each token object may also contain a `variation` property, which holds the token as originally matched if it differs from the `value` property. E.g.:\n\n```javascript\n[\n  { end: 1, start: 0, tag: 'word', value: 'rt', variation: 'RT' },\n  { end: 3, start: 3, tag: 'punct', value: '@' },\n  { end: 19, start: 5, tag: 'hashtag', value: '#happyfuncoding' },\n  ...\n  { end: 46, start: 40, tag: 'word', value: 'twitter', variation: 'Twitter' },\n  ...\n]\n```\n\n## The Options Object\nThe options object and its properties are optional. The defaults are:\n\n```javascript\n{\n  'mode': 'stanford',\n  'normalize': undefined,\n  'preserveCase': true,\n};\n```\n\n### mode\n**string - valid options: `stanford` (default), or `dlatk`**\n\n`stanford` mode uses the original HappyFunTokenizer pattern. See [GitHub](https://github.com/stanfordnlp/python-stanford-corenlp).\n\n`dlatk` mode uses the modified HappierFunTokenizing pattern. See [GitHub](https://github.com/dlatk/happierfuntokenizing/).\n\n### normalize\n**string - valid options: \"NFC\" | \"NFD\" | \"NFKC\" | \"NFKD\" (default = undefined)**\n\n[Normalize](https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/String/normalize) strings (e.g., when set, mañana becomes manana).\n\nNormalization is disabled when set to null or undefined (default).\n\n### preserveCase\n**boolean - valid options: `true` (default), or `false`**\n\nPreserves the case of the input string if true; otherwise all tokens are converted to lowercase. Does not affect emoticons.\n\n## Tags\nHappyNodeTokenizer outputs token objects. Each token object has four properties: `start`, `end`, `value` and `tag`, plus the optional `variation` property described above. 
The `value` is the token itself, `start` and `end` are the token's character offsets in the input string (`end` is inclusive), and the `tag` is a descriptor based on one of the following, depending on which `mode` option you are using:\n\n| Tag            | Stanford           | DLATK              | Example  |\n| -------------  |-------------       | -----              | -------- |\n| phone          | :heavy_check_mark: | :heavy_check_mark: | +1 (800) 123-4567\n| url            | :x:                | :heavy_check_mark: | http://www.youtube.com\n| url_scheme     | :x:                | :heavy_check_mark: | http://\n| url_authority  | :x:                | :heavy_check_mark: | [0-3]\n| url_path_query | :x:                | :heavy_check_mark: | /index.html?s=search\n| htmltag        | :x:                | :heavy_check_mark: | \\\u003cem class='grumpy'\u003e\n| emoticon       | :heavy_check_mark: | :heavy_check_mark: | \u003e:(\n| username       | :heavy_check_mark: | :heavy_check_mark: | @somefaketwitterhandle\n| hashtag        | :heavy_check_mark: | :heavy_check_mark: | #tokenizing\n| punct          | :heavy_check_mark: | :heavy_check_mark: | ,\n| word           | :heavy_check_mark: | :heavy_check_mark: | hello\n| \\\u003cUNK\u003e         | :heavy_check_mark: | :heavy_check_mark: | (anything left unmatched)\n\n## Testing\nTo compare the results of HappyNodeTokenizer against HappyFunTokenizer and HappierFunTokenizing, run:\n```bash\nnpm run test\n```\nThe goal of this project is to provide an accurate port of HappyFunTokenizer and HappierFunTokenizing. Therefore, any pull requests with test failures will not be accepted.\n\n## Acknowledgements\nBased on [HappyFunTokenizer.py](https://github.com/stanfordnlp/python-stanford-corenlp/blob/master/tests/happyfuntokenizer.py) by Christopher Potts and [HappierFunTokenizing.py](https://github.com/dlatk/happierfuntokenizing) by H. 
Andrew Schwartz.\n\nUses the [\"he\" library](https://github.com/mathiasbynens/he) by Mathias Bynens under the MIT license.\n\n## License\n(C) 2017-24 [P. Hughes](https://www.phugh.es). All rights reserved.\n\nShared under the [Creative Commons Attribution-NonCommercial-ShareAlike 3.0 Unported](http://creativecommons.org/licenses/by-nc-sa/3.0/) license.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fphughesmcr%2Fhappynodetokenizer","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fphughesmcr%2Fhappynodetokenizer","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fphughesmcr%2Fhappynodetokenizer/lists"}