{"id":16529395,"url":"https://github.com/cometkim/unicode-segmenter","last_synced_at":"2025-04-05T04:08:38.926Z","repository":{"id":233092448,"uuid":"785990361","full_name":"cometkim/unicode-segmenter","owner":"cometkim","description":"A lightweight implementation of the Unicode Text Segmentation (UAX #29)","archived":false,"fork":false,"pushed_at":"2025-03-07T14:12:35.000Z","size":5216,"stargazers_count":72,"open_issues_count":3,"forks_count":1,"subscribers_count":3,"default_branch":"main","last_synced_at":"2025-04-02T09:45:18.032Z","etag":null,"topics":["emoji","grapheme","grapheme-cluster","uax29","unicode"],"latest_commit_sha":null,"homepage":"","language":"JavaScript","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/cometkim.png","metadata":{"files":{"readme":"README.md","changelog":"CHANGELOG.md","contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2024-04-13T05:09:58.000Z","updated_at":"2025-03-29T21:12:01.000Z","dependencies_parsed_at":"2024-04-13T20:09:25.980Z","dependency_job_id":"6409a6c1-c33c-41f0-93e1-eb2eb06fe852","html_url":"https://github.com/cometkim/unicode-segmenter","commit_stats":{"total_commits":167,"total_committers":2,"mean_commits":83.5,"dds":"0.13772455089820357","last_synced_commit":"8d8cd4f17907772a900befc8564ea95e2fbd7ad1"},"previous_names":["cometkim/unicode-segmenter"],"tags_count":27,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/cometkim%2Funicode-segmenter","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/cometkim%2Funicode-segmenter/tags","releases_url":"https://repos.ecos
yste.ms/api/v1/hosts/GitHub/repositories/cometkim%2Funicode-segmenter/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/cometkim%2Funicode-segmenter/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/cometkim","download_url":"https://codeload.github.com/cometkim/unicode-segmenter/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":246794132,"owners_count":20834931,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["emoji","grapheme","grapheme-cluster","uax29","unicode"],"created_at":"2024-10-11T17:44:43.847Z","updated_at":"2025-04-05T04:08:38.834Z","avatar_url":"https://github.com/cometkim.png","language":"JavaScript","readme":"# unicode-segmenter\n[![NPM Package Version](https://img.shields.io/npm/v/unicode-segmenter)](https://npmjs.com/package/unicode-segmenter)\n[![NPM Downloads](https://img.shields.io/npm/dw/unicode-segmenter)](https://npmjs.com/package/unicode-segmenter)\n[![Integration](https://github.com/cometkim/unicode-segmenter/actions/workflows/ci.yml/badge.svg)](https://github.com/cometkim/unicode-segmenter/actions/workflows/ci.yml)\n[![codecov](https://codecov.io/gh/cometkim/unicode-segmenter/graph/badge.svg?token=3rA29JEH4J)](https://codecov.io/gh/cometkim/unicode-segmenter)\n[![LICENSE - MIT](https://img.shields.io/github/license/cometkim/unicode-segmenter)](#license)\n\nA lightweight implementation of the [Unicode Text Segmentation (UAX \\#29)](https://www.unicode.org/reports/tr29)\n\n- **Spec compliant**: Up-to-date Unicode data, verified 
by the official Unicode test suites, fuzzed against the native `Intl.Segmenter`, and maintained at 100% test coverage.\n\n- **Excellent compatibility**: It works well on older browsers, edge runtimes, React Native (Hermes) and QuickJS.\n\n- **Zero-dependencies**: It doesn't bloat `node_modules` or the network bandwidth. It behaves like a small, minimal snippet.\n\n- **Small bundle size**: It effectively compresses the Unicode data and provides a bundler-friendly format.\n\n- **Extremely efficient**: It's carefully optimized for runtime performance, making it the fastest one in the ecosystem, outperforming even the built-in `Intl.Segmenter`.\n\n- **TypeScript**: It's fully type-checked, and provides type definitions and JSDoc.\n\n- **ESM-first**: It primarily supports ES modules, and still supports CommonJS.\n\n\u003e [!NOTE]\n\u003e unicode-segmenter is now an **[e18e] recommendation!**\n\n## Unicode® Version\n\nUnicode® 16.0.0\n\nUnicode® Standard Annex \\#29 - [Revision 45](https://www.unicode.org/reports/tr29/tr29-45.html) (2024-08-28)\n\n## APIs\n\nThere are several entry points for text segmentation.\n\n- [`unicode-segmenter/grapheme`](#export-unicode-segmentergrapheme): Segments and counts **extended grapheme clusters**\n- [`unicode-segmenter/intl-adapter`](#export-unicode-segmenterintl-adapter): [`Intl.Segmenter`] adapter\n- [`unicode-segmenter/intl-polyfill`](#export-unicode-segmenterintl-polyfill): [`Intl.Segmenter`] polyfill\n\nThere are also extra utilities for combined use cases.\n\n- [`unicode-segmenter/emoji`](#export-unicode-segmenteremoji): Matches single codepoint emojis\n- [`unicode-segmenter/general`](#export-unicode-segmentergeneral): Matches single codepoint alphanumerics\n- [`unicode-segmenter/utils`](#export-unicode-segmenterutils): Some utilities for handling codepoints \n\n### Export 
`unicode-segmenter/grapheme`\n[![](https://edge.bundlejs.com/badge?q=unicode-segmenter/grapheme\u0026treeshake=[*])](https://bundlejs.com/?q=unicode-segmenter%2Fgrapheme\u0026treeshake=%5B*%5D)\n\nUtilities for text segmentation by extended grapheme cluster rules.\n\n#### Example: Get grapheme segments\n\n```js\nimport { graphemeSegments } from 'unicode-segmenter/grapheme';\n\n[...graphemeSegments('a̐éö̲\\r\\n')];\n// 0: { segment: 'a̐', index: 0, input: 'a̐éö̲\\r\\n' }\n// 1: { segment: 'é', index: 2, input: 'a̐éö̲\\r\\n' }\n// 2: { segment: 'ö̲', index: 4, input: 'a̐éö̲\\r\\n' }\n// 3: { segment: '\\r\\n', index: 7, input: 'a̐éö̲\\r\\n' }\n```\n\n#### Example: Split graphemes\n\n```js\nimport { splitGraphemes } from 'unicode-segmenter/grapheme';\n\n[...splitGraphemes('#️⃣*️⃣0️⃣1️⃣2️⃣')];\n// 0: #️⃣\n// 1: *️⃣\n// 2: 0️⃣\n// 3: 1️⃣\n// 4: 2️⃣\n```\n\n#### Example: Count graphemes\n\n```js\nimport { countGraphemes } from 'unicode-segmenter/grapheme';\n\n'👋 안녕!'.length;\n// =\u003e 6\ncountGraphemes('👋 안녕!');\n// =\u003e 5\n\n'a̐éö̲'.length;\n// =\u003e 7\ncountGraphemes('a̐éö̲');\n// =\u003e 3\n```\n\n\u003e [!NOTE]\n\u003e `countGraphemes()` is a small wrapper around `graphemeSegments()`.\n\u003e \n\u003e If you need the count more than once, consider memoizing it, or call `graphemeSegments()` or `splitGraphemes()` once instead.\n\n#### Example: Build an advanced grapheme matcher\n\n`graphemeSegments()` exposes some information identified during segmentation to support advanced use cases.\n\nFor example, knowing the [Grapheme_Cluster_Break](https://www.unicode.org/reports/tr29/tr29-43.html#Default_Grapheme_Cluster_Table) category at the beginning and end of a segment can help approximately infer the applied boundary rule.\n\n```js\nimport { graphemeSegments, GraphemeCategory } from 'unicode-segmenter/grapheme';\n\nfunction* matchEmoji(input) {\n  for (const { segment, _catBegin } of graphemeSegments(input)) {\n    // `_catBegin` identified as 
Extended_Pictographic means the segment is emoji\n    if (_catBegin === GraphemeCategory.Extended_Pictographic) {\n      yield segment;\n    }\n  }\n}\n\n[...matchEmoji('1🌷2🎁3💩4😜5👍')]\n// 0: 🌷\n// 1: 🎁\n// 2: 💩\n// 3: 😜\n// 4: 👍\n```\n\n### Export `unicode-segmenter/intl-adapter`\n[![](https://edge.bundlejs.com/badge?q=unicode-segmenter/intl-adapter\u0026treeshake=[*])](https://bundlejs.com/?q=unicode-segmenter%2Fintl-adapter\u0026treeshake=%5B*%5D)\n\n[`Intl.Segmenter`] API adapter (only `granularity: \"grapheme\"` is available so far)\n\n```js\nimport { Segmenter } from 'unicode-segmenter/intl-adapter';\n\n// Same API as the `Intl.Segmenter`\nconst segmenter = new Segmenter();\n```\n\n### Export `unicode-segmenter/intl-polyfill`\n[![](https://edge.bundlejs.com/badge?q=unicode-segmenter/intl-polyfill\u0026treeshake=[*])](https://bundlejs.com/?q=unicode-segmenter%2Fintl-polyfill\u0026treeshake=%5B*%5D)\n\n[`Intl.Segmenter`] API polyfill (only `granularity: \"grapheme\"` is available so far)\n\n```js\n// Apply polyfill to the `globalThis.Intl` object.\nimport 'unicode-segmenter/intl-polyfill';\n\nconst segmenter = new Intl.Segmenter();\n```\n\n### Export `unicode-segmenter/emoji`\n[![](https://edge.bundlejs.com/badge?q=unicode-segmenter/emoji\u0026treeshake=[*])](https://bundlejs.com/?q=unicode-segmenter%2Femoji\u0026treeshake=%5B*%5D)\n\nUtilities for matching emoji-like characters.\n\n#### Example: Use Unicode emoji property matchers\n\n```js\nimport {\n  isEmojiPresentation,    // match \\p{Emoji_Presentation}\n  isExtendedPictographic, // match \\p{Extended_Pictographic}\n} from 'unicode-segmenter/emoji';\n\nisEmojiPresentation('😍'.codePointAt(0));\n// =\u003e true\nisEmojiPresentation('♡'.codePointAt(0));\n// =\u003e false\n\nisExtendedPictographic('😍'.codePointAt(0));\n// =\u003e true\nisExtendedPictographic('♡'.codePointAt(0));\n// =\u003e true\n```\n\n### Export 
`unicode-segmenter/general`\n[![](https://edge.bundlejs.com/badge?q=unicode-segmenter/general\u0026treeshake=[*])](https://bundlejs.com/?q=unicode-segmenter%2Fgeneral\u0026treeshake=%5B*%5D)\n\nUtilities for matching alphanumeric characters.\n\n#### Example: Use Unicode general property matchers\n\n```js\nimport {\n  isLetter,       // match \\p{L}\n  isNumeric,      // match \\p{N}\n  isAlphabetic,   // match \\p{Alphabetic}\n  isAlphanumeric, // match [\\p{N}\\p{Alphabetic}]\n} from 'unicode-segmenter/general';\n```\n\n### Export `unicode-segmenter/utils`\n[![](https://edge.bundlejs.com/badge?q=unicode-segmenter/utils\u0026treeshake=[*])](https://bundlejs.com/?q=unicode-segmenter%2Futils\u0026treeshake=%5B*%5D)\n\nYou can access some internal utilities to deal with JavaScript strings.\n\n#### Example: Handle UTF-16 surrogate pairs\n\n```js\nimport {\n  isHighSurrogate,\n  isLowSurrogate,\n  surrogatePairToCodePoint,\n} from 'unicode-segmenter/utils';\n\nconst u32 = '😍';\nconst hi = u32.charCodeAt(0);\nconst lo = u32.charCodeAt(1);\n\nif (isHighSurrogate(hi) \u0026\u0026 isLowSurrogate(lo)) {\n  const codePoint = surrogatePairToCodePoint(hi, lo);\n  // =\u003e equivalent to u32.codePointAt(0)\n}\n```\n\n#### Example: Determine the length of a character\n\n```js\nimport { isBMP } from 'unicode-segmenter/utils';\n\nconst char = '😍'; // .length = 2\nconst cp = char.codePointAt(0);\n\n// Parentheses are required: `===` binds tighter than `?:`\nchar.length === (isBMP(cp) ? 
1 : 2);\n// =\u003e true\n```\n\n## Runtime Compatibility\n\n`unicode-segmenter` uses only fundamental features of ES2015, making it compatible with most browsers.\n\nTo ensure compatibility, the runtime should support:\n- [`String.prototype.codePointAt()`](https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/String/codePointAt)\n- [Generators](https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/Generator)\n- [Modules](https://developer.mozilla.org/en-US/docs/Web/JavaScript/Guide/Modules)\n\nIf the runtime doesn't support these features, they can easily be provided with tools like Babel.\n\n## React Native Support\n\nSince [Hermes doesn't support the `Intl.Segmenter` API](https://github.com/facebook/hermes/blob/main/doc/IntlAPIs.md) yet, `unicode-segmenter` is a good alternative.\n\n`unicode-segmenter` compiles into smaller \u0026 more efficient Hermes bytecode than other JavaScript libraries. See the [benchmark](#hermes-bytecode-stats) for details.\n\n## Comparison\n\n`unicode-segmenter` aims to be lighter and faster than alternatives in the ecosystem while remaining fully spec compliant, so the benchmark tracks several libraries' performance, bundle size, and Unicode version compliance.\n\n### `unicode-segmenter/grapheme` vs\n\n- [graphemer]@1.4.0 (16.6M+ weekly downloads on NPM)\n- [grapheme-splitter]@1.0.4 (5.7M+ weekly downloads on NPM)\n- [@formatjs/intl-segmenter]@11.5.7 (5.4K+ weekly downloads on NPM)\n- WebAssembly build of [unicode-segmentation]@1.12.0 with minimum bindings\n- Built-in [`Intl.Segmenter`] API\n\n#### JS Bundle Stats\n\n| Name                         | Unicode® | ESM? 
|   Size    | Size (min) | Size (min+gzip) | Size (min+br) |\n|------------------------------|----------|------|----------:|-----------:|----------------:|--------------:|\n| `unicode-segmenter/grapheme` |   16.0.0 |    ✔️ |    15,929 |     12,110 |           5,050 |         3,738 |\n| `graphemer`                  |   15.0.0 |    ✖️ ️|   410,435 |     95,104 |          15,752 |        10,660 |\n| `grapheme-splitter`          |   10.0.0 |    ✖️ |   122,252 |     23,680 |           7,852 |         4,841 |\n| `@formatjs/intl-segmenter`*  |   15.0.0 |    ✖️ |   491,043 |    318,721 |          54,248 |        34,380 |\n| `unicode-segmentation`*      |   16.0.0 |    ✔️ |    56,529 |     52,443 |          24,110 |        17,343 |\n| `Intl.Segmenter`*            |        - |    - |         0 |          0 |               0 |             0 |\n\n* `@formatjs/intl-segmenter` handles grapheme, word, and sentence, but it's not tree-shakable.\n* `unicode-segmentation` size contains only minimum WASM binary and its bindings to execute benchmarking. 
It will increase if more features are exposed.\n* `Intl.Segmenter`'s Unicode data depends on the host, and may not be up-to-date.\n* `Intl.Segmenter` may not be available in [some old browsers](https://caniuse.com/mdn-javascript_builtins_intl_segmenter), edge runtimes, or embedded environments.\n\n#### Hermes Bytecode Stats\n\n| Name                         | Bytecode size | Bytecode size (gzip)* |\n|------------------------------|--------------:|----------------------:|\n| `unicode-segmenter/grapheme` |        22,019 |                11,513 |\n| `graphemer`                  |       133,974 |                31,715 |\n| `grapheme-splitter`          |        63,855 |                19,133 |\n\n* It would be compressed when included as an app asset.\n\n#### Runtime Performance\n\nHere is a brief summary; see the [archived benchmark results](benchmark/grapheme/_records) for details.\n\n**Performance in Node.js**: `unicode-segmenter/grapheme` is significantly faster than alternatives.\n- 6\\~15x faster than other JavaScript libraries\n- 1.5\\~3x faster than the WASM binding of Rust's [unicode-segmentation]\n- 1.5\\~3x faster than the built-in [`Intl.Segmenter`]\n\n**Performance in Bun**: `unicode-segmenter/grapheme` has almost the same performance as the built-in [`Intl.Segmenter`], with no performance degradation compared to other JavaScript libraries.\n\n**Performance in Browsers**: The performance in browser environments varies greatly due to differences in browser engines and versions, which makes benchmarking less consistent. Despite these variations, `unicode-segmenter/grapheme` generally outperforms other JavaScript libraries in most environments.\n\n**Performance in React Native**: `unicode-segmenter/grapheme` is significantly faster than alternatives when compiled to Hermes bytecode. 
It's 3\\~8x faster than `graphemer` and 20\\~26x faster than `grapheme-splitter`, with the performance gap increasing with input size.\n\n**Performance in QuickJS**: `unicode-segmenter/grapheme` is the only usable library in terms of performance.\n\nInstead of trusting these claims, you can try `yarn perf:grapheme` directly in your environment or build your own benchmark.\n\n## Acknowledgments\n\n- **The Rust Unicode team ([@unicode-rs](https://github.com/unicode-rs))**:\\\n   The initial implementation was ported manually from the [unicode-segmentation] library.\n\n- **Marijn Haverbeke ([@marijnh](https://github.com/marijnh))**:\\\n   [His library](https://github.com/marijnh/find-cluster-break) inspired a technique that greatly compresses the Unicode data tables.\n\n## LICENSE\n\n[MIT](LICENSE)\n\n[e18e]: https://e18e.dev/\n[Hermes]: https://hermesengine.dev/\n[QuickJS]: https://bellard.org/quickjs/\n[unicode-segmentation]: https://github.com/unicode-rs/unicode-segmentation\n[`Intl.Segmenter`]: https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/Intl/Segmenter\n[graphemer]: https://github.com/flmnt/graphemer\n[grapheme-splitter]: https://github.com/orling/grapheme-splitter\n[emoji-regex]: https://github.com/mathiasbynens/emoji-regex\n[emojibase-regex]: https://emojibase.dev/docs/regex\n[XRegExp]: https://xregexp.com/\n[@formatjs/intl-segmenter]: https://formatjs.io/docs/polyfills/intl-segmenter/\n","funding_links":[],"categories":["Utilities"],"sub_categories":["Text Processing"],"project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fcometkim%2Funicode-segmenter","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fcometkim%2Funicode-segmenter","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fcometkim%2Funicode-segmenter/lists"}