{"id":17196251,"url":"https://github.com/akahuku/unistring","last_synced_at":"2025-04-13T19:32:03.260Z","repository":{"id":145342284,"uuid":"43219458","full_name":"akahuku/unistring","owner":"akahuku","description":"javascript library to handle \"unicode string\" easily and correctly","archived":false,"fork":false,"pushed_at":"2024-03-08T07:16:58.000Z","size":776,"stargazers_count":26,"open_issues_count":0,"forks_count":6,"subscribers_count":4,"default_branch":"master","last_synced_at":"2025-04-09T14:50:37.448Z","etag":null,"topics":["grapheme-cluster","javascript","sentence-boundary","uax29","unicode","word-boundary"],"latest_commit_sha":null,"homepage":null,"language":"JavaScript","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/akahuku.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null}},"created_at":"2015-09-26T19:24:13.000Z","updated_at":"2024-09-24T18:13:56.000Z","dependencies_parsed_at":"2024-02-27T15:44:20.015Z","dependency_job_id":null,"html_url":"https://github.com/akahuku/unistring","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/akahuku%2Funistring","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/akahuku%2Funistring/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/akahuku%2Funistring/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/akahuku%2Funistring/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/akahuku","download_url":"https:
//codeload.github.com/akahuku/unistring/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":248768001,"owners_count":21158569,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["grapheme-cluster","javascript","sentence-boundary","uax29","unicode","word-boundary"],"created_at":"2024-10-15T01:52:46.405Z","updated_at":"2025-04-13T19:32:02.837Z","avatar_url":"https://github.com/akahuku.png","language":"JavaScript","funding_links":[],"categories":[],"sub_categories":[],"readme":"Unistring\n=========\n\n## What is this?\n\nUnistring is a JavaScript library for handling Unicode strings easily and\ncorrectly.  A native JavaScript string is also a Unicode string, but it is\nactually a plain sequence of UTF-16 code units, so you must deal with Unicode's\ncomplicated mechanisms, such as surrogate pairs and combining character\nsequences, yourself.\n\nUnistring hides this complexity.  
The currently supported Unicode version is\n14.0.0 and Unistring passes all test patterns[^test-patterns] provided by Unicode.org.\n\n[^test-patterns]: grapheme property test: 17953 patterns\n  grapheme break test: 602 patterns\n  word break test: 1823 patterns\n  sentence break test: 502 patterns\n  line break test: 7654 patterns\n\n## Example\n\n### String manipulation\n\n```javascript\nlet s = 'de\\u0301licieux\\uD83D\\uDE0B'; // délicieux😋\nlet us = Unistring(s);\n\n// retrieving number of 'user-perceived characters'...\ns.length;        // fail, returns 12\nus.length;       // ok, returns 10\n\n// retrieving e with accent aigu...\ns.charAt(1);     // fail, returns \"e\" as string\nus.clusterAt(1); // ok, returns \"e\\u0301\" as string\n\n// retrieving last character...\ns.substr(-1);    // fail, returns \"\\uDE0B\" as string\nus.substr(-1);   // ok, returns \"😋\" as Unistring instance\n\n// manipulation\nus.insert(\"C'est \", 0);\nus.delete(-1);\nus.append('!');\nus.toString();   // returns \"C'est délicieux!\" as string\n```\n\n### Break into words by UAX#29 word boundary rule\n\n```javascript\nlet words1 = Unistring.getWords('The quick (“brown”) fox can’t jump 32.3 feet, right?');\n/*\nwords1 = [\n {\n  \"text\": \"The\",\t// fragment of the target text\n  \"index\": 0,\t\t// start index, in grapheme unit\n  \"rawIndex\": 0,\t// start index, in UTF-16 unit\n  \"length\": 3,\t\t// length of graphemes\n  \"type\": 12\t\t// internal class value\n },\n {\n  \"text\": \" \",\n  \"index\": 3, \"rawIndex\": 3, \"length\": 1, \"type\": 19\n },\n {\n  \"text\": \"quick\",\n  \"index\": 4, \"rawIndex\": 4, \"length\": 5, \"type\": 12\n },\n {\n  \"text\": \" \",\n  \"index\": 9, \"rawIndex\": 9, \"length\": 1, \"type\": 19\n },\n {\n  \"text\": \"(\", \"index\": 10, \"rawIndex\": 10, \"length\": 1, \"type\": 0\n },\n {\n  \"text\": \"“\",\n  \"index\": 11, \"rawIndex\": 11, \"length\": 1, \"type\": 0\n },\n {\n  \"text\": \"brown\",\n  \"index\": 12, \"rawIndex\": 
12, \"length\": 5, \"type\": 12\n },\n {\n  \"text\": \"”\",\n  \"index\": 17, \"rawIndex\": 17, \"length\": 1, \"type\": 0\n },\n {\n  \"text\": \")\",\n  \"index\": 18, \"rawIndex\": 18, \"length\": 1, \"type\": 0\n },\n {\n  \"text\": \" \",\n  \"index\": 19, \"rawIndex\": 19, \"length\": 1, \"type\": 19\n },\n {\n  \"text\": \"fox\",\n  \"index\": 20, \"rawIndex\": 20, \"length\": 3, \"type\": 12\n },\n {\n  \"text\": \" \",\n  \"index\": 23, \"rawIndex\": 23, \"length\": 1, \"type\": 19\n },\n {\n  \"text\": \"can’t\",\n  \"index\": 24, \"rawIndex\": 24, \"length\": 5, \"type\": 12\n },\n {\n  \"text\": \" \",\n  \"index\": 29, \"rawIndex\": 29, \"length\": 1, \"type\": 19\n },\n {\n  \"text\": \"jump\",\n  \"index\": 30, \"rawIndex\": 30, \"length\": 4, \"type\": 12\n },\n {\n  \"text\": \" \",\n  \"index\": 34, \"rawIndex\": 34, \"length\": 1, \"type\": 19\n },\n {\n  \"text\": \"32.3\",\n  \"index\": 35, \"rawIndex\": 35, \"length\": 4, \"type\": 16\n },\n {\n  \"text\": \" \",\n  \"index\": 39, \"rawIndex\": 39, \"length\": 1, \"type\": 19\n },\n {\n  \"text\": \"feet\",\n  \"index\": 40, \"rawIndex\": 40, \"length\": 4, \"type\": 12\n },\n {\n  \"text\": \",\",\n  \"index\": 44, \"rawIndex\": 44, \"length\": 1, \"type\": 14\n },\n {\n  \"text\": \" \",\n  \"index\": 45, \"rawIndex\": 45, \"length\": 1, \"type\": 19\n },\n {\n  \"text\": \"right\",\n  \"index\": 46, \"rawIndex\": 46, \"length\": 5, \"type\": 12\n },\n {\n  \"text\": \"?\",\n  \"index\": 51, \"rawIndex\": 51, \"length\": 1, \"type\": 0\n }\n]\n */\n```\n\n### Break into words by UAX#29 word boundary rule, with Unistring's script extension\n\nYou can turn on Unistring's script extension (treat neighboring same script\ncharacter as part of word) by setting the second argument of getWords() to\ntrue.\n\n```javascript\nlet words2 = Unistring.getWords('// 漢字カタカナひらがな1.23', true);\n/*\nwords2 = [\n {\n  \"text\": \"//\",\n  \"index\": 0, \"rawIndex\": 0, \"length\": 2, \"type\": 0\n },\n {\n  
\"text\": \" \",\n  \"index\": 2, \"rawIndex\": 2, \"length\": 1, \"type\": 19\n },\n {\n  \"text\": \"漢字\",\n  \"index\": 3, \"rawIndex\": 3, \"length\": 2, \"type\": 0\n },\n {\n  \"text\": \"カタカナ\",\n  \"index\": 5, \"rawIndex\": 5, \"length\": 4, \"type\": 20\n },\n {\n  \"text\": \"ひらがな\",\n  \"index\": 9, \"rawIndex\": 9, \"length\": 4, \"type\": 21\n },\n {\n  \"text\": \"1.23\",\n  \"index\": 13, \"rawIndex\": 13, \"length\": 4, \"type\": 16\n }\n]\n */\n```\n\n### Break into sentences by UAX#29 sentence boundary rule\n\n```javascript\nlet sentences = Unistring.getSentences(\n\t'ある日の事でございます。御釈迦様は極楽の蓮池のふちを、独りでぶらぶら御歩きになっていらっしゃいました。' +\n\t'He said, “Are you going?”  John shook his head.'\n);\n/*\nsentences = [\n {\n  \"text\": \"ある日の事でございます。\",\n  \"index\": 0, \"rawIndex\": 0, \"length\": 12, \"type\": 11\n },\n {\n  \"text\": \"御釈迦様は極楽の蓮池のふちを、独りでぶらぶら御歩きになっていらっしゃいました。\",\n  \"index\": 12, \"rawIndex\": 12, \"length\": 39, \"type\": 11\n },\n {\n  \"text\": \"He said, “Are you going?”  \",\n  \"index\": 51, \"rawIndex\": 51, \"length\": 27, \"type\": 10\n },\n {\n  \"text\": \"John shook his head.\",\n  \"index\": 78, \"rawIndex\": 78, \"length\": 20, \"type\": 10\n }\n]\n */\n```\n\n### Fold text to fit into specified columns by UAX#14 line breaking algorithm\n\n```javascript\nlet foldedLines = Unistring.getFoldedLines(\n`On this unsatisfactory manner the penultimate message of Cavor dies out. One seems to see him away there in the blue obscurity amidst his apparatus intently signalling us to the last, all unaware of the curtain of confusion that drops between us; all unaware, too, of the final dangers that even then must have been creeping upon him. His disastrous want of vulgar common sense had utterly betrayed him. He had talked of war, he had talked of all the strength and irrational violence of men, of their insatiable aggressions, their tireless futility of conflict. 
He had filled the whole moon world with this impression of our race, and then I think it is plain that he made the most fatal admission that upon himself alone hung the possibility—at least for a long time—of any further men reaching the moon. The line the cold, inhuman reason of the moon would take seems plain enough to me, and a suspicion of it, and then perhaps some sudden sharp realisation of it, must have come to him. One imagines him about the moon with the remorse of this fatal indiscretion growing in his mind.  During a certain time I am inclined to guess the Grand Lunar was deliberating the new situation, and for all that time Cavor may have gone as free as ever he had gone. But obstacles of some sort prevented his getting to his electromagnetic apparatus again after that message I have just given. For some days we received nothing. Perhaps he was having fresh audiences, and trying to evade his previous admissions.  Who can hope to guess?\n\nAnd then suddenly, like a cry in the night, like a cry that is followed by a stillness, came the last message. It is the briefest fragment, the broken beginnings of two sentences.`,\n  {\n\tcolumns: 50,  // number of columns to fold. default is 80\n\tawidth: 1,    // columns of ambiguous characters in east asian script, 1 or 2. default is 2\n\tansi: false,  // if true, ignore ANSI escape sequences. default is false\n\tcharacterReference: false // if true, treat \\\u0026#999999; / \\\u0026#x999999; as the character they\n\t                          // represent. default is false\n  }\n);\n/*\nfoldedLines = [\n \"On this unsatisfactory manner the penultimate \",\n \"message of Cavor dies out. One seems to see him \",\n \"away there in the blue obscurity amidst his \",\n \"apparatus intently signalling us to the last, all \",\n \"unaware of the curtain of confusion that drops \",\n \"between us; all unaware, too, of the final \",\n \"dangers that even then must have been creeping \",\n \"upon him. 
His disastrous want of vulgar common \",\n \"sense had utterly betrayed him. He had talked of \",\n \"war, he had talked of all the strength and \",\n \"irrational violence of men, of their insatiable \",\n \"aggressions, their tireless futility of conflict. \",\n \"He had filled the whole moon world with this \",\n \"impression of our race, and then I think it is \",\n \"plain that he made the most fatal admission that \",\n \"upon himself alone hung the possibility—at least \",\n \"for a long time—of any further men reaching the \",\n \"moon. The line the cold, inhuman reason of the \",\n \"moon would take seems plain enough to me, and a \",\n \"suspicion of it, and then perhaps some sudden \",\n \"sharp realisation of it, must have come to him. \",\n \"One imagines him about the moon with the remorse \",\n \"of this fatal indiscretion growing in his mind.  \",\n \"During a certain time I am inclined to guess the \",\n \"Grand Lunar was deliberating the new situation, \",\n \"and for all that time Cavor may have gone as free \",\n \"as ever he had gone. But obstacles of some sort \",\n \"prevented his getting to his electromagnetic \",\n \"apparatus again after that message I have just \",\n \"given. For some days we received nothing. Perhaps \",\n \"he was having fresh audiences, and trying to \",\n \"evade his previous admissions.  Who can hope to \",\n \"guess?\\n\",\n \"\\n\",\n \"And then suddenly, like a cry in the night, like \",\n \"a cry that is followed by a stillness, came the \",\n \"last message. 
It is the briefest fragment, the \",\n \"broken beginnings of two sentences.\"\n]\n */\n```\n\n## Using Unistring in standard web pages\n\n### Download\n\n* [unistring.js](https://raw.githubusercontent.com/akahuku/unistring/master/unistring.js)\n\n### Use it\n\n```javascript\nimport Unistring from './unistring.js';\nlet us = Unistring('de\\u0301licieux\\uD83D\\uDE0B');\n```\n\n\n\n## Using Unistring as a node.js package\n\n### Install\n\n```sh\n$ npm install @akahuku/unistring\n```\n\n### Use it\n\n```javascript\nimport Unistring from '@akahuku/unistring';\nlet us = Unistring('de\\u0301licieux\\uD83D\\uDE0B');\n```\n\n\n\n## Reference\n\n### Instance properties\n\n* `length: number`\n\n### Instance methods\n\n* `clone(): Unistring`\n* `dump(): string`\n* `toString(): string`\n* `delete(start [,length]): Unistring`\n* `insert(str, start): Unistring`\n* `append(str): Unistring`\n* `codePointsAt(index): number[]`\n* `clusterAt(index): string`\n* `rawStringAt(index): string`\n* `rawIndexAt(index): number`\n* `forEach(callback [,thisObj])`\n* `getCrusterIndexFromUTF16Index(index): number`\n* `charAt(index): string`\n* `charCodeAt(index): number`\n* `substring(start [,end]): Unistring`\n* `substr(start [,length]): Unistring`\n* `slice(start [,end]): Unistring`\n* `concat(str): Unistring`\n* `indexOf(str): number`\n* `lastIndexOf(str): number`\n* `toLowerCase([useLocale]): Unistring`\n* `toUpperCase([useLocale]): Unistring`\n\n### Class methods\n\nmethods for text segmentation algorithm (UAX#29):\n\n* `getCodePointArray(str): number[]`\n* `getGraphemeBreakProp(codePoint): number`\n* `getWordBreakProp(codePoint): number`\n* `getSentenceBreakProp(codePoint): number`\n* `getScriptProp(codePoint): number`\n* `getUTF16FromCodePoint(codePoint): string`\n* `getCodePointString(codePoint, type): string`\n* `getWords(str [,useScripts]): object[]`\n* `getSentences(str): object[]`\n\nmethods for line breaking algorithm (UAX#14):\n\n* `getLineBreakableClusters(str): object[]`\n* 
`getColumnsFor(str [,options = {}]): number`\n* `divideByColumns(str, columns [,options = {}]): string[left, right]`\n* `getFoldedLines(str [,options = {}]): string[]`\n\nThese three methods take an options object with the following properties:\n\n* `columns: number` - number of columns (default: 80). For getFoldedLines(), it may be an array. In that case, each element of the array is used as the column count for the corresponding line. If the array is not long enough, the last element is used for the remaining lines\n* `awidth: number` - column width of ambiguous characters in East Asian Width (1 or 2, default: 2)\n* `ansi: boolean` - ignore ANSI escape sequences and treat their width as 0 (default: false)\n* `characterReference: boolean` - treat SGML character references (\\\u0026#999999;, \\\u0026#x999999; ...) as the characters they represent (default: false)\n\n### Class properties\n\n* `awidth: number` - default width of ambiguous characters, used when no awidth is specified in options\n\n### Class constants\n\n* `GBP: Object` - an associative array from name of GraphemeBreakProperty to corresponding integer value\n* `WBP: Object` - an associative array from name of WordBreakProperty to corresponding integer value\n* `SBP: Object` - an associative array from name of SentenceBreakProperty to corresponding integer value\n* `SCRIPT: Object` - an associative array from name of ScriptProperty to corresponding integer value\n* `LBP: Object` - an associative array from name of LineBreakProperty to corresponding integer value\n* `GBP_NAMES: string[]`\n* `WBP_NAMES: string[]`\n* `SBP_NAMES: string[]`\n* `SCRIPT_NAMES: string[]`\n* `LBP_NAMES: string[]`\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fakahuku%2Funistring","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fakahuku%2Funistring","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fakahuku%2Funistring/lists"}