{"id":21928131,"url":"https://github.com/depp/uniset","last_synced_at":"2025-04-19T17:42:06.017Z","repository":{"id":1435616,"uuid":"1651978","full_name":"depp/uniset","owner":"depp","description":"Calculate sets of Unicode characters","archived":false,"fork":false,"pushed_at":"2016-12-07T05:14:45.000Z","size":44,"stargazers_count":18,"open_issues_count":0,"forks_count":3,"subscribers_count":3,"default_branch":"master","last_synced_at":"2025-03-29T10:51:22.417Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":"","language":"C","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"other","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/depp.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE.txt","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2011-04-22T23:59:15.000Z","updated_at":"2024-04-01T11:09:12.000Z","dependencies_parsed_at":"2022-07-18T21:34:47.366Z","dependency_job_id":null,"html_url":"https://github.com/depp/uniset","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/depp%2Funiset","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/depp%2Funiset/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/depp%2Funiset/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/depp%2Funiset/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/depp","download_url":"https://codeload.github.com/depp/uniset/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":249751434,"owners_count":21320287,"icon_url":"https://github.com/github.png","version":n
ull,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2024-11-28T22:21:21.694Z","updated_at":"2025-04-19T17:42:05.997Z","avatar_url":"https://github.com/depp.png","language":"C","funding_links":[],"categories":[],"sub_categories":[],"readme":"# Uniset: Compute sets of unicode code points\n\n\nUniset is a simple command-line tool for computing sets of Unicode\ncode points.  Its main goal is to support the development of fast\nUnicode-aware parsers.\n\n## Warning!\n\nThis software is currently “early-beta”.  It may crash or generate\nincorrect results.  It has very few features.  It was written in a\nsingle day and only tested manually.\n\n## Usage\n\n    uniset [OPT..] [EXPR]\n\nPrints the set of Unicode characters specified by `EXPR`.  The\n`UNICODE_DIR` environment variable must point to a directory containing\nUnicode data tables.\n\nIn examples below, set expressions are quoted or left unquoted\nsomewhat arbitrarily.  
In most situations, the quoting is irrelevant.\nHowever, remember that `'*'` is a special character in the shell.\nSingle quote marks `\u003c'\u003e` are preferred over double quote marks `\u003c\"\u003e`,\nbecause single quote marks `\u003c'\u003e` disable shell expansion.\n\nThese three invocations are equivalent:\n\n    uniset cat:Zs + FEFF - 0..7F        # ok\n    uniset \"cat:Zs + FEFF - 0..7F\"      # ok\n    uniset 'cat:Zs + FEFF - 0..7F'      # ok\n\nHowever, these three invocations are not:\n\n    uniset cat:Ll,Lu * 0..FF        # not ok\n    uniset \"cat:Ll,Lu * 0..FF\"      # not preferred\n    uniset 'cat:Ll,Lu * 0..FF'      # ok\n\n## Set operations\n\nSets may be combined using simple set operations.  Note that the\noperators (+, -, *, !) must be separated from other tokens (besides\nparentheses) by at least one space.\n\n    SET ::= SET + SET   (union)\n          | SET - SET   (difference)\n          | SET * SET   (intersection)\n          | ! SET       (complement)\n          | ( SET )     (grouping)\n\nOperator precedence and parentheses work as in ordinary algebra, and\nthe complement (!) operator has the highest precedence.  For example,\n\n    ! a + b * (c + d) + ! e * f\n\nis the same as\n\n    (! a) + (b * (c + d)) + ((! e) * f)\n\nThe `--verbose` flag will cause uniset to print the expression to the\nstandard error stream.\n\nNote that unlike ordinary algebra, the following are not equivalent:\n\n    a + b - c    vs.    a - c + b\n\nThe first removes any element of `b` that is also in `c`; the second\nkeeps it.\n\n## Basic sets\n\nIndividual characters and ranges of characters can be specified in\nhexadecimal.  Hexadecimal was chosen because Unicode characters are\norganized naturally in hexadecimal and because the Unicode\nspecification refers to code points using hexadecimal.  
Decimal is not\nsupported.\n\nExamples:\n\n    Line feed:  a, 0a, or 000A\n    All ASCII:  0..7F\n\nCharacters can also be selected by general category:\n\n    cat:CAT1,CAT2,...\n\nor by East Asian width:\n\n    eaw:W1,W2,...\n\n## ECMAScript example\n\nThe 5th edition of ECMAScript specifies that source files are\nUnicode.  Identifiers may start with letters, `'$'`, `'_'`, or escape\nsequences.  The set of Unicode letters, according to the ECMAScript\nstandard, is given by the following command:\n\n    uniset 'cat:Lu,Ll,Lt,Lm,Lo,Nl'\n\nThe remainder of an identifier may contain the same characters, as\nwell as combining marks (categories Mn, Mc), digits (category Nd),\nconnector punctuation (category Pc), and the characters ZWNJ (U+200C)\nand ZWJ (U+200D).  The additional characters are given by the command:\n\n    uniset 'cat:Mn,Mc,Nd,Pc + 200C + 200D'\n\nWhitespace in ECMAScript consists of the Unicode category Zs, as well\nas the byte-order mark U+FEFF.\n\n    uniset 'cat:Zs + FEFF'\n\nSuppose that your ECMAScript parser uses a table or switch statement\nto handle ASCII characters.  For efficiency, you can omit all ASCII\ncharacters from a set by subtracting them at the end.  For example,\n\n    uniset 'cat:Lu,Ll,Lt,Lm,Lo,Nl - 0..7f'\n\n## Output formats\n\nBy default, uniset outputs a sorted list of non-overlapping ranges of\ncharacters in the set, in hexadecimal.  For example,\n\n    $ uniset cat:Zs\n    20\n    a0\n    1680\n    180e\n    2000..200a\n    202f\n    205f\n    3000\n\nThe `--16` option specifies a C-style array of pairs of 16-bit\nunsigned integers.  
The first 17 entries correspond to the 17 Unicode\nplanes, and each entry specifies a pair of offsets into the remainder\nof the table.\n\n    $ uniset --16 cat:Zs\n    { /* plane 0 */ 0, 8 },\n    { /* plane 1 */ 0, 0 },\n    \u003c15 repeated entries removed\u003e\n    { 32, 32 },\n    { 160, 160 },\n    { 5760, 5760 },\n    { 6158, 6158 },\n    { 8192, 8202 },\n    { 8239, 8239 },\n    { 8287, 8287 },\n    { 12288, 12288 }\n\nNote that the category Zs only contains characters in the first\nplane, so the other 16 planes have zero entries.  The entry for plane\n0, { 0, 8 }, indicates that the first entry is at 17 + 0 and the entry\none past the end is at 17 + 8.\n\nBut if you don’t like reading English, here is the C code to test if a\ncharacter is a member of a set:\n\n    /* Requires \u003cstdbool.h\u003e and \u003cstdint.h\u003e. */\n    bool uniset_test(uint16_t const set[][2], uint32_t c)\n    {\n        unsigned int p = c \u003e\u003e 16;  /* plane number */\n        if (p \u003e 16)\n            return false;\n        /* Plane offsets are relative to the end of the 17-entry index. */\n        unsigned int l = set[p][0] + 17, r = set[p][1] + 17;\n        c \u0026= 0xffff;  /* code point within the plane */\n        /* Binary search over the sorted ranges in [l, r). */\n        while (l \u003c r) {\n            unsigned int m = (l + r) / 2;\n            if (c \u003c set[m][0])\n                r = m;\n            else if (c \u003e set[m][1])\n                l = m + 1;\n            else\n                return true;\n        }\n        return false;\n    }\n\nThe `--32` option specifies a C-style array of pairs of 32-bit\nunsigned integers.  Each entry is a range of characters.  The plane\noffsets are not printed because they’re not required to search the\n32-bit table.  
A pair may cross a Unicode plane boundary.\n\n    $ uniset --32 cat:Zs\n    { 32, 32 },\n    { 160, 160 },\n    { 5760, 5760 },\n    { 6158, 6158 },\n    { 8192, 8202 },\n    { 8239, 8239 },\n    { 8287, 8287 },\n    { 12288, 12288 }\n\nHere is the C code for checking membership, where `n` is the number\nof pairs in the array:\n\n    bool uniset_test(uint32_t n, uint32_t const set[][2], uint32_t c)\n    {\n        /* Binary search over the n sorted, non-overlapping ranges. */\n        unsigned int l = 0, r = n;\n        while (l \u003c r) {\n            unsigned int m = (l + r) / 2;\n            if (c \u003c set[m][0])\n                r = m;\n            else if (c \u003e set[m][1])\n                l = m + 1;\n            else\n                return true;\n        }\n        return false;\n    }\n\nThe typical way to use the output of the `--16` or `--32` option is as\nan include file.  For example,\n\n    const uint16_t UNICODE_LETTER[][2] = {\n    #include \"unicode_letter.def\"\n    };\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fdepp%2Funiset","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fdepp%2Funiset","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fdepp%2Funiset/lists"}