{"id":13671002,"url":"https://github.com/orling/grapheme-splitter","last_synced_at":"2025-04-27T13:33:26.768Z","repository":{"id":34546241,"uuid":"38490836","full_name":"orling/grapheme-splitter","owner":"orling","description":"A JavaScript library that breaks strings into their individual user-perceived characters.","archived":false,"fork":false,"pushed_at":"2021-02-12T00:21:57.000Z","size":80,"stargazers_count":953,"open_issues_count":8,"forks_count":47,"subscribers_count":18,"default_branch":"master","last_synced_at":"2025-04-18T16:02:39.017Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":"","language":"JavaScript","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/orling.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2015-07-03T12:06:49.000Z","updated_at":"2025-04-09T16:13:31.000Z","dependencies_parsed_at":"2022-07-10T02:46:03.404Z","dependency_job_id":null,"html_url":"https://github.com/orling/grapheme-splitter","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/orling%2Fgrapheme-splitter","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/orling%2Fgrapheme-splitter/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/orling%2Fgrapheme-splitter/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/orling%2Fgrapheme-splitter/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/orling","download_url":"https://codeload.github.com/orling/grapheme-splitter/tar.gz/refs/heads/master","host
":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":251145830,"owners_count":21543104,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2024-08-02T09:00:55.499Z","updated_at":"2025-04-27T13:33:21.740Z","avatar_url":"https://github.com/orling.png","language":"JavaScript","funding_links":[],"categories":["JavaScript"],"sub_categories":[],"readme":"# Background\n\nIn JavaScript there is not always a one-to-one relationship between string characters and what a user would call a separate visual \"letter\". Some symbols are represented by several characters. This can cause issues when splitting strings and inadvertently cutting a multi-char letter in half, or when you need the actual number of letters in a string.\n\nFor example, emoji characters like \"🌷\",\"🎁\",\"💩\",\"😜\" and \"👍\" are represented by two JavaScript characters each (high surrogate and low surrogate). That is, \n\n```javascript\n\"🌷\".length == 2\n```\nThe combined emoji are even longer:\n```javascript\n\"🏳️‍🌈\".length == 6\n```\n\nWhat's more, some languages often include combining marks - characters that are used to modify the letters before them. Common examples are the German letter ü and the Spanish letter ñ. Sometimes they can be represented alternatively both as a single character and as a letter + combining mark, with both forms equally valid:\n    \n```javascript\nvar two = \"ñ\"; // unnormalized two-char n+◌̃  , i.e. \"\\u006E\\u0303\";\nvar one = \"ñ\"; // normalized single-char, i.e. 
\"\\u00F1\"\nconsole.log(one!=two); // prints 'true'\n```\n\nUnicode normalization, as performed by the popular punycode.js library or ECMAScript 6's String.prototype.normalize, can **sometimes** fix those differences and turn two-char sequences into single characters. But it is **not** enough in all cases. Some languages like Hindi make extensive use of combining marks on their letters, and the resulting combinations have no dedicated single-codepoint Unicode form, due to the sheer number of possible combinations.\nFor example, the Hindi word \"अनुच्छेद\" is composed of 5 letters and 3 combining marks:\n\nअ + न + ु + च + ् + छ + े + द\n\nwhich a reader perceives as just 5 letters:\n\nअ + नु + च् + छे + द\n\nand which Unicode normalization would not combine properly.\nThere are also unusual letter+combining mark combinations which have no dedicated Unicode codepoint. The string Z͑ͫ̓ͪ̂ͫ̽͏̴̙̤̞͉͚̯̞̠͍A̴̵̜̰͔ͫ͗͢L̠ͨͧͩ͘G̴̻͈͍͔̹̑͗̎̅͛́Ǫ̵̹̻̝̳͂̌̌͘ obviously has 5 separate letters, but is in fact composed of 58 JavaScript characters, most of which are combining marks.\n\nEnter the grapheme-splitter.js library. It can be used to properly split JavaScript strings into what a human user would call separate letters (or \"extended grapheme clusters\" in Unicode terminology), no matter what their internal representation is. It is an implementation of the [Default Grapheme Cluster Boundary](http://unicode.org/reports/tr29/#Default_Grapheme_Cluster_Table) rules of [UAX #29](http://www.unicode.org/reports/tr29/).\n\n# Installation\n\nYou can use the index.js file directly as-is. 
Or you can install `grapheme-splitter` in your project using the npm command below:\n\n```\n$ npm install --save grapheme-splitter\n```\n\n# Tests\n\nTo run the tests on `grapheme-splitter`, use the command below:\n\n```\n$ npm test\n```\n\n# Usage\n\nJust initialize and use:\n\n```javascript\nvar splitter = new GraphemeSplitter();\n\n// split the string to an array of grapheme clusters (one string each)\nvar graphemes = splitter.splitGraphemes(string);\n\n// iterate the string to an iterable iterator of grapheme clusters (one string each)\nvar graphemeIterator = splitter.iterateGraphemes(string);\n\n// or do this if you just need their number\nvar graphemeCount = splitter.countGraphemes(string);\n```\n\n# Examples\n\n```javascript\nvar splitter = new GraphemeSplitter();\n\n// plain latin alphabet - nothing spectacular\nsplitter.splitGraphemes(\"abcd\"); // returns [\"a\", \"b\", \"c\", \"d\"]\n\n// two-char emojis and six-char combined emoji\nsplitter.splitGraphemes(\"🌷🎁💩😜👍🏳️‍🌈\"); // returns [\"🌷\",\"🎁\",\"💩\",\"😜\",\"👍\",\"🏳️‍🌈\"]\n\n// diacritics as combining marks, 10 JavaScript chars\nsplitter.splitGraphemes(\"Ĺo͂řȩm̅\"); // returns [\"Ĺ\",\"o͂\",\"ř\",\"ȩ\",\"m̅\"]\n\n// individual Korean characters (Jamo), 4 JavaScript chars\nsplitter.splitGraphemes(\"뎌쉐\"); // returns [\"뎌\",\"쉐\"]\n\n// Hindi text with combining marks, 8 JavaScript chars\nsplitter.splitGraphemes(\"अनुच्छेद\"); // returns [\"अ\",\"नु\",\"च्\",\"छे\",\"द\"]\n\n// demonic multiple combining marks, 75 JavaScript chars\nsplitter.splitGraphemes(\"Z͑ͫ̓ͪ̂ͫ̽͏̴̙̤̞͉͚̯̞̠͍A̴̵̜̰͔ͫ͗͢L̠ͨͧͩ͘G̴̻͈͍͔̹̑͗̎̅͛́Ǫ̵̹̻̝̳͂̌̌͘!͖̬̰̙̗̿̋ͥͥ̂ͣ̐́́͜͞\"); // returns [\"Z͑ͫ̓ͪ̂ͫ̽͏̴̙̤̞͉͚̯̞̠͍\",\"A̴̵̜̰͔ͫ͗͢\",\"L̠ͨͧͩ͘\",\"G̴̻͈͍͔̹̑͗̎̅͛́\",\"Ǫ̵̹̻̝̳͂̌̌͘\",\"!͖̬̰̙̗̿̋ͥͥ̂ͣ̐́́͜͞\"]\n```\n\n# TypeScript\n\nGrapheme splitter includes TypeScript declarations.\n\n```typescript\nimport GraphemeSplitter = require('grapheme-splitter')\n\nconst splitter = new GraphemeSplitter()\n\nconst split: string[] = 
splitter.splitGraphemes('Z͑ͫ̓ͪ̂ͫ̽͏̴̙̤̞͉͚̯̞̠͍A̴̵̜̰͔ͫ͗͢L̠ͨͧͩ͘G̴̻͈͍͔̹̑͗̎̅͛́Ǫ̵̹̻̝̳͂̌̌͘!͖̬̰̙̗̿̋ͥͥ̂ͣ̐́́͜͞')\n```\n\n# Acknowledgements\n\nThis library is heavily influenced by Devon Govett's excellent grapheme-breaker CoffeeScript library at https://github.com/devongovett/grapheme-breaker with an emphasis on ease of integration and pure JavaScript implementation.\n\n\n\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Forling%2Fgrapheme-splitter","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Forling%2Fgrapheme-splitter","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Forling%2Fgrapheme-splitter/lists"}