{"id":26772715,"url":"https://github.com/jonschlinkert/intl-segmenter","last_synced_at":"2025-04-15T21:19:38.731Z","repository":{"id":274382808,"uuid":"922722273","full_name":"jonschlinkert/intl-segmenter","owner":"jonschlinkert","description":"A high-performance wrapper around Intl.Segmenter for efficient text segmentation. This class resolves memory handling issues seen with large strings and \"maximum call stack exceeded\" exceptions that occur when strings exceed 40-50k characters. Enhances performance by 50-500x. Only ~70 loc (with comments) and no dependencies.","archived":false,"fork":false,"pushed_at":"2025-01-26T23:20:05.000Z","size":45,"stargazers_count":10,"open_issues_count":1,"forks_count":1,"subscribers_count":1,"default_branch":"main","last_synced_at":"2025-04-11T16:24:13.681Z","etag":null,"topics":["graphemes","intl","intl-segmenter","processing","segment","segmenter","sentences","splitter","text","words"],"latest_commit_sha":null,"homepage":"https://github.com/jonschlinkert","language":"JavaScript","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/jonschlinkert.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null},"funding":{"github":"jonschlinkert"}},"created_at":"2025-01-26T23:18:41.000Z","updated_at":"2025-03-25T02:27:59.000Z","dependencies_parsed_at":"2025-01-27T00:26:15.743Z","dependency_job_id":"a800349c-6738-4007-9233-620aa8fa3073","html_url":"https://github.com/jonschlinkert/intl-segmenter","commit_stats":null,"previous_names":["jonschlinkert/intl-segmenter"],"tags_count":1,"template":false,"template_full_name":null,"repository_url":"h
ttps://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/jonschlinkert%2Fintl-segmenter","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/jonschlinkert%2Fintl-segmenter/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/jonschlinkert%2Fintl-segmenter/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/jonschlinkert%2Fintl-segmenter/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/jonschlinkert","download_url":"https://codeload.github.com/jonschlinkert/intl-segmenter/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":249154206,"owners_count":21221370,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["graphemes","intl","intl-segmenter","processing","segment","segmenter","sentences","splitter","text","words"],"created_at":"2025-03-29T01:21:10.146Z","updated_at":"2025-04-15T21:19:38.710Z","avatar_url":"https://github.com/jonschlinkert.png","language":"JavaScript","funding_links":["https://github.com/sponsors/jonschlinkert"],"categories":[],"sub_categories":[],"readme":"# intl-segmenter [![NPM version](https://img.shields.io/npm/v/intl-segmenter.svg?style=flat)](https://www.npmjs.com/package/intl-segmenter) [![NPM monthly downloads](https://img.shields.io/npm/dm/intl-segmenter.svg?style=flat)](https://npmjs.org/package/intl-segmenter) [![NPM total downloads](https://img.shields.io/npm/dt/intl-segmenter.svg?style=flat)](https://npmjs.org/package/intl-segmenter)\n\n\u003e A high-performance wrapper around 
`Intl.Segmenter` for efficient text segmentation. This class resolves memory handling issues seen with large strings and can enhance performance by 50-500x. Only ~60 loc and no dependencies.\n\nPlease consider following this project's author, [Jon Schlinkert](https://github.com/jonschlinkert), and consider starring the project to show your :heart: and support.\n\n## Install\n\nInstall with [npm](https://www.npmjs.com/):\n\n```sh\n$ npm install --save intl-segmenter\n```\n\nInstall with [pnpm](https://pnpm.io):\n\n```sh\n$ pnpm install intl-segmenter\n```\n\n## Overview\n\nIf you do any text processing, parsing, or formatting, especially for the terminal, you know the challenges of handling special characters, emojis, and extended Unicode characters.\n\nThe `Intl.Segmenter` object was introduced to simplify text segmentation and correctly handle these special characters. However, it has notable limitations and potential risks:\n\n* Predictable \"Maximum call stack exceeded\" exceptions occur when strings exceed 40-50k characters.\n* Performance degrades geometrically as the number of non-ASCII/extended Unicode characters increases.\n\nFor context:\n\n* Blog posts average 2k-10k chars\n* Novels 450k-500k chars\n* This README ~7k chars\n\nThis library wraps `Intl.Segmenter` to address these issues:\n\n* Handles strings up to millions of characters in length (tested to ~24M chars on an M2 MacBook Pro).\n* Improves performance by 50-500x compared to direct `Intl.Segmenter` usage.\n* Prevents the \"Maximum call stack exceeded\" exceptions that predictably occur with long strings.\n\nUse this as a drop-in replacement for `Intl.Segmenter` when accurate text segmentation is needed, particularly for strings with non-ASCII/extended Unicode characters.\n\n## Usage\n\n```js\n// Use Segmenter instead of Intl.Segmenter\nimport { Segmenter } from 'intl-segmenter';\n\nconst segmenter = new Segmenter('en', { granularity: 'grapheme' });\nconst segments = [];\n\n// The 
segmenter.segment method is a generator\nfor (const segment of segmenter.segment('Your input string here.')) {\n  segments.push(segment);\n}\n\n// You can also use Array.from, but read the \"Heads up\" section first\nconsole.log(Array.from(segmenter.segment('Your input string here.')));\n```\n\n### Heads up!\n\nI recommend using manual iteration (traditional loops) if there's any chance the string will exceed a few hundred characters.\n\n**Note on Array.from's Iterator Handling**\n\nWhen `Array.from` processes an iterator/generator, it retains the entire iteration state in memory. Unlike a `for...of` loop, it can't process and discard items one by one. Instead, it:\n\n* Keeps the full generator state alive\n* Maintains the entire call stack for iteration\n* Holds all intermediate values\n* Builds up the final array\n\nThis creates a deeper call stack and increased memory usage compared to manual iteration, where each step can be completed and garbage collected.\n\n## API\n\n### Segmenter\n\n**Params**\n\n* `language` **{String}**: A [BCP 47 language tag](https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/Intl/Segmenter/Segmenter#parameters), or an [Intl.Locale](https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/Intl/Locale) instance.\n* `options` **{Object}**: Supports all [Intl.Segmenter](https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/Intl/Segmenter/Segmenter#parameters) options, with an additional `maxChunkLength` that defaults to `100`.\n\n**Example**\n\n```js\nconst segmenter = new Segmenter('en', { maxChunkLength: 100 });\n```\n\n### .segment\n\nSegments the provided string into an iterable sequence of `Intl.Segment` objects, optimized for performance and memory management.\n\n**Params**\n\n* `input` **{String}**: The string to be segmented.\n\n**Returns**\n\n* **{Generator}**: Yields `Intl.Segment` objects.\n\n**Example**\n\n```js\nconst segmenter = new 
Segmenter('en', { localeMatcher: 'lookup' });\n\nfor (const segment of segmenter.segment('This is a test')) {\n  console.log(segment);\n}\n```\n\n### .findSafeBreakPoint\n\nMostly an internal method, but documented here in case you need to use it directly, or override it in a subclass.\n\nThis method determines a safe position to break the string into chunks for efficient processing without splitting essential non-ASCII/extended Unicode character groups.\n\n**Params**\n\n* `input` **{String}**: The string to analyze.\n\n**Returns**\n\n* **{Number}**: Position index to use for safely breaking the string.\n\n**Example**\n\n```js\nconst segmenter = new Segmenter();\nconst breakPoint = segmenter.findSafeBreakPoint('This is a test');\nconsole.log(breakPoint); // e.g., 4\n```\n\n### .getSegments\n\nReturns all segments of the input string as an array, using the efficient generator from `.segment()`.\n\n**Params**\n\n* `input` **{String}**: The string to be segmented.\n\n**Returns**\n\n* **{Array}**: An array of `Intl.Segment` objects.\n\n**Example**\n\n```js\nconst segmenter = new Segmenter();\nconst segments = segmenter.getSegments('This is a test');\nconsole.log(segments);\n// Returns:\n// [\n//   { segment: 'T', index: 0, input: 'This is a test' },\n//   { segment: 'h', index: 1, input: 'This is a test' },\n//   { segment: 'i', index: 2, input: 'This is a test' },\n//   { segment: 's', index: 3, input: 'This is a test' },\n//   { segment: ' ', index: 4, input: 'This is a test' },\n//   { segment: 'i', index: 5, input: 'This is a test' },\n//   { segment: 's', index: 6, input: 'This is a test' },\n//   { segment: ' ', index: 7, input: 'This is a test' },\n//   { segment: 'a', index: 8, input: 'This is a test' },\n//   { segment: ' ', index: 9, input: 'This is a test' },\n//   { segment: 't', index: 10, input: 'This is a test' },\n//   { segment: 'e', index: 11, input: 'This is a test' },\n//   { segment: 's', index: 12, input: 'This is a test' },\n//   { segment: 't', 
index: 13, input: 'This is a test' }\n// ]\n```\n\n### Segmenter.getSegments\n\nStatic method for segmenting a string. Creates a `Segmenter` instance and returns the segments as an array.\n\n**Params**\n\n* `input` **{String}**: The string to be segmented.\n* `language` **{String}**: A [BCP 47 language tag](https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/Intl/Segmenter/Segmenter#parameters), or an [Intl.Locale](https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/Intl/Locale) instance.\n* `options` **{Object}**: _(optional)_ Intl.Segmenter options.\n\n**Returns**\n\n* **{Array}**: An array of `Intl.Segment` objects.\n\n**Example**\n\n```js\nconst segments = Segmenter.getSegments('This is a test', 'en');\nconsole.log(segments);\n```\n\n## FAQ\n\nIn a nutshell, this library prevents maximum call stack exceeded exceptions caused by memory management issues in `Intl.Segmenter`, and improves performance by 50-500x over using `Intl.Segmenter` directly.\n\n**What does this do?**\n\nThis library wraps `Intl.Segmenter` and serves as a drop-in replacement that not only improves performance by 50-500x over using `Intl.Segmenter` directly, but also prevents the maximum call stack exceeded exceptions that predictably occur with long strings.\n\nWithout this library, exceptions reliably occur with strings exceeding 20-50k characters in length, depending on the number of non-ASCII/extended Unicode characters. These characters significantly affect performance and trigger exceptions sooner.\n\nSimply import the library and use `Segmenter` instead of `Intl.Segmenter`.\n\n**Why use this?**\n\nIf you use `Intl.Segmenter`, your application is at risk of being terminated due to maximum call stack exceeded exceptions. 
To prevent the exception from happening, you need to either prevent input strings from exceeding a certain length, say 10k characters, or wrap the `segment` method to iterate over longer strings.\n\nHowever, _this is not as trivial as it sounds_. If you limit the length of the input string, in theory this would still allow users to break their input into chunks, then programmatically loop over those chunks. But now you've created the potential to split on a non-ASCII/extended Unicode character, completely negating the entire point of using `Intl.Segmenter` in the first place.\n\nAlternatively, you can use this library, since it solves those problems for you and ensures that `Intl.Segmenter` handles all characters correctly. This library not only improves performance by 50-500x over using `Intl.Segmenter` directly, but it also prevents the maximum call stack exceeded exceptions that _consistently occur when long strings are passed_.\n\n**What is Intl.Segmenter?**\n\nThe newly introduced (2024) built-in [Intl.Segmenter](https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/Intl/Segmenter) object enables locale-sensitive text segmentation, letting you get meaningful items (graphemes, words or sentences) from a string.\n\n**What causes the exception?**\n\nAs of Nov. 17, 2024, a maximum call stack exceeded exception occurs when using `Intl.Segmenter` on strings that exceed 40-60k characters. The avg. blog post is around 2,500-10,000 characters, so you'll only encounter the call stack error when working with longer strings. 
However, even on shorter strings you might notice performance issues.\n\nNotably, performance in `Intl.Segmenter` degrades geometrically as the number of non-ASCII/extended Unicode characters present in the string increases. The same applies to _when the exception occurs_.\n\n_(On a related note, stack traces from exceptions indicate that the issue is related to the way Node.js interacts with V8 and how memory management is occurring at the application level via Node.js. We're still looking into this.)_\n\n**Will the exception be fixed?**\n\nI'm not sure yet if this is a bug, or a limitation in `Intl.Segmenter`. But there has been an open [issue](https://issues.chromium.org/issues/326176949) about this for almost a year, and it doesn't seem to be a priority.\n\nPlease [create an issue](../../issues) on this library if you have information or updates related to this issue.\n\n## Comparison to Intl.Segmenter\n\nIn this example, we compare the performance of `Segmenter` to `Intl.Segmenter` when processing a string built by repeating a short, emoji-heavy phrase 1,200 times.\n\n```ts\nimport { Segmenter } from 'intl-segmenter';\n\nconst text = ' 👨‍👩‍👧‍👦 🌍✨He👨‍👩‍👧‍👦llo 👨‍👩‍👧‍👦 world! 🌍✨'.repeat(1200);\n\n// With Intl.Segmenter\nconst intlSegmenter = new Intl.Segmenter('en', { granularity: 'grapheme', localeMatcher: 'best fit' });\nconsole.time('total time');\nArray.from(intlSegmenter.segment(text));\nconsole.timeEnd('total time');\n// total time: 3.040s\n\n// With Segmenter\nconst segmenter = new Segmenter('en', { granularity: 'grapheme', localeMatcher: 'best fit' });\nconsole.time('total time');\nArray.from(segmenter.segment(text));\nconsole.timeEnd('total time');\n// total time: 18.102ms\n```\n\nSegmenter is ~167x faster than `Intl.Segmenter`. 
The performance difference would be even more pronounced with longer strings, but the call stack exceeded exception would prevent you from testing that.\n\n## About\n\n\u003cdetails\u003e\n\u003csummary\u003e\u003cstrong\u003eContributing\u003c/strong\u003e\u003c/summary\u003e\n\nPull requests and stars are always welcome. For bugs and feature requests, [please create an issue](../../issues/new).\n\n\u003c/details\u003e\n\n\u003cdetails\u003e\n\u003csummary\u003e\u003cstrong\u003eRunning Tests\u003c/strong\u003e\u003c/summary\u003e\n\nRunning and reviewing unit tests is a great way to get familiarized with a library and its API. You can install dependencies and run tests with the following command:\n\n```sh\n$ npm install \u0026\u0026 npm test\n```\n\n\u003c/details\u003e\n\n\u003cdetails\u003e\n\u003csummary\u003e\u003cstrong\u003eBuilding docs\u003c/strong\u003e\u003c/summary\u003e\n\n_(This project's readme.md is generated by [verb](https://github.com/verbose/verb-generate-readme), please don't edit the readme directly. 
Any changes to the readme must be made in the [.verb.md](.verb.md) readme template.)_\n\nTo generate the readme, run the following command:\n\n```sh\n$ npm install -g verbose/verb#dev verb-generate-readme \u0026\u0026 verb\n```\n\n\u003c/details\u003e\n\n### Author\n\n**Jon Schlinkert**\n\n* [GitHub Profile](https://github.com/jonschlinkert)\n* [Twitter Profile](https://twitter.com/jonschlinkert)\n* [LinkedIn Profile](https://linkedin.com/in/jonschlinkert)\n\n### License\n\nCopyright © 2025, [Jon Schlinkert](https://github.com/jonschlinkert).\nReleased under the MIT License.\n\n***\n\n_This file was generated by [verb-generate-readme](https://github.com/verbose/verb-generate-readme), v0.8.0, on January 26, 2025._\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fjonschlinkert%2Fintl-segmenter","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fjonschlinkert%2Fintl-segmenter","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fjonschlinkert%2Fintl-segmenter/lists"}