{"id":17101204,"url":"https://github.com/hackerb9/ugrep","last_synced_at":"2026-06-05T10:30:14.945Z","repository":{"id":113746505,"uuid":"118385487","full_name":"hackerb9/ugrep","owner":"hackerb9","description":"find unicode characters based on their names","archived":false,"fork":false,"pushed_at":"2024-07-16T05:18:53.000Z","size":355,"stargazers_count":15,"open_issues_count":5,"forks_count":2,"subscribers_count":7,"default_branch":"master","last_synced_at":"2024-07-16T07:46:41.181Z","etag":null,"topics":["emoji","emoji-picker","emoji-searcher","emoji-unicode","grep","unicode","unicode-character-database","unicode-characters","unicode-data"],"latest_commit_sha":null,"homepage":null,"language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"gpl-3.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/hackerb9.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2018-01-22T00:09:39.000Z","updated_at":"2024-07-16T05:18:57.000Z","dependencies_parsed_at":null,"dependency_job_id":"6374d2f5-91ee-4e90-b1b6-fd2db79c39f2","html_url":"https://github.com/hackerb9/ugrep","commit_stats":null,"previous_names":[],"tags_count":1,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/hackerb9%2Fugrep","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/hackerb9%2Fugrep/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/hackerb9%2Fugrep/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/hackerb9%2Fugrep/manifests","owner_url":"https://repos.ecosyste.ms/ap
i/v1/hosts/GitHub/owners/hackerb9","download_url":"https://codeload.github.com/hackerb9/ugrep/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":219847176,"owners_count":16556407,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["emoji","emoji-picker","emoji-searcher","emoji-unicode","grep","unicode","unicode-character-database","unicode-characters","unicode-data"],"created_at":"2024-10-14T15:24:17.086Z","updated_at":"2026-06-05T10:30:14.801Z","avatar_url":"https://github.com/hackerb9.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"\u003ca href=\"https://raw.githubusercontent.com/hackerb9/ugrep/master/README.md.d/screenshot.png\"\u003e\n\u003cimg title=\"ugrep screenshot\" alt-text=\"Example of ugrep vine\" align=\"right\" src=\"README.md.d/screenshot.png\" width=\"50%\"\u003e\n\u003c/a\u003e\n\n# ☙ ugrep ❧\n\n_Find unicode characters based on their names_\n\nugrep is essentially [grep](https://www.gnu.org/software/grep/) for\nthe Unicode table. It prints out the resulting unicode characters\nliterally, so you can easily cut-and-paste. Ugrep is useful for\nlooking up Emojis 😤, finding obscure symbols ⚸⅗ℏ℞☧☭, or beautiful\nglyphs to decorate your text. 
🙶❡✯🟔❢🙷\n\nYou can also use it for the reverse operation to look up a single\ncharacter (or a string of them) you've pasted into the terminal.\n\nAs a bonus, it can list which fonts are installed that contain a\nparticular unicode character and, through the magic of sixels, will\nshow a rendering in each font.\n\n## Installation\n\nIt's just a Python 3 script. Download it to `/usr/local/bin` or `~/bin`\nand make it executable.\n\n    cd /usr/local/bin\n    wget https://github.com/hackerb9/ugrep/raw/master/ugrep\n    chmod +x ugrep\n\n## Usage\n\n* Search by name: **ugrep** [**-w**] _regex_\n\n\tLook up a character name where _regex_ is a regular\n\texpression. If you don't know [regular\n\texpressions](https://docs.python.org/3/howto/regex.html),\n\tdon't worry. Just use plain strings and you'll rarely be\n\twrong.\n\n\t    ugrep runic\n\n\tIf you find ugrep returning too many hits because the phrase you used\n\tis found in other terms, e.g., _thema_ found in _mathematical_, use\n\tthe **-w** option to limit the search to complete words.\n\n* Search by number: **ugrep** _codepoint_**[..**_codepoint_**[..**_increment_**]]**\n\n\tLook up a character (or a range of them) using Unicode code points in\n\thexadecimal. For example,\n\n\t    ugrep 03c0\n\t    ugrep 23b0..f\n\t    ugrep 0..10ffff..1000\n\n* Search by character: **ugrep** [**-c**] _character string_\n\n\tLook up each character in a string. 
Note that if the string is a\n\tsingle character, e.g., `ugrep X`, then **-c** is implied and need not\n\tbe specified.\n\n\t    ugrep -c \"(ﾟ∀ﾟ)\"\n\n* List fonts for a character: **ugrep** [**-l**] _character_\n\n\tAfter showing the usual character information, list installed\n    fonts that contain that character and show an example in each:\n\n\t    ugrep -l mho\n\n\t☝ *When `ssh`ed to another machine, `ugrep` shows the fonts\n    installed on the remote machine.*\n\n* List fonts, scaled larger: **ugrep** [**-L** _scale_] _character_\n  \n\tSame as `-l`, but scale up the example rendering in each font to\n    be easier to read:\n\n\t    ugrep -L2 -w om\n\n    Useful scale values range from 2 to 8. \n\n## Examples\n\nNote: output from all examples has been excerpted. (You'd be amazed\nhow many heart emojis Unicode has. 😜)\n\n\u003cdetails\u003e\n\n### Fun things to try:\n\nTo see some useful and lovely glyphs, try this:\n\n    ugrep face \n    ugrep alchemical \n    ugrep ornament\n    ugrep bullet\n    ugrep '(vine|bud)'\n    ugrep vai\n    ugrep heavy\n    ugrep drawing\n    ugrep combining\n\n### Plain text search is simple:\n\n\t    $ ugrep heart\n\t    ☙\tU+2619\tREVERSED ROTATED FLORAL HEART BULLET\n\t    ❣\tU+2763\tHEAVY HEART EXCLAMATION MARK ORNAMENT\n\t    ❤\tU+2764\tHEAVY BLACK HEART\n\t    ⋮\t[ ... truncated for brevity ... 
]\n\t    💞\tU+1F49E REVOLVING HEARTS\n\t    💟\tU+1F49F HEART DECORATION\n\t    😍\tU+1F60D SMILING FACE WITH HEART-SHAPED EYES\n\t    😻\tU+1F63B\tSMILING CAT FACE WITH HEART-SHAPED EYES\n\n### Paste in a single character to look up its code point:\n\n\t    $ ugrep ☺\n\t    ☺       U+263A  WHITE SMILING FACE\n\n### Arguments on the command line have an implicit wildcard between them:\n\n\t    $ ugrep right.*gle\n\t    $ ugrep right gle       # Equivalent\n\t    »\tU+00BB\tRIGHT-POINTING DOUBLE ANGLE QUOTATION MARK\n\t    ’\tU+2019\tRIGHT SINGLE QUOTATION MARK\n\t    ∟\tU+221F\tRIGHT ANGLE\n\t    ⊿\tU+22BF\tRIGHT TRIANGLE\n\n### You can use regular expressions for fancier searches: \n\n\t    $ ugrep -w '(wo|hu)?m(a|e)ns?'\n\t    ᛗ\tU+16D7\tRUNIC LETTER MANNAZ MAN M\n\t    ⛀\tU+26C0\tWHITE DRAUGHTS MAN\n\t    ⛂\tU+26C2\tBLACK DRAUGHTS MAN\n\t    ⼈\tU+2F08\tKANGXI RADICAL MAN\n\t    ⼥\tU+2F25\tKANGXI RADICAL WOMAN\n\t    𝌂\tU+1D302\tDIGRAM FOR HUMAN EARTH\n\t    𝌄\tU+1D304\tDIGRAM FOR EARTHLY HUMAN\n\t    🕴\tU+1F574\tMAN IN BUSINESS SUIT LEVITATING\n\t    🕺\tU+1F57A\tMAN DANCING\n\t    🚹\tU+1F6B9\tMENS SYMBOL\n\t    🚺\tU+1F6BA\tWOMENS SYMBOL\n\t    🤰\tU+1F930\tPREGNANT WOMAN\n\t    🤵\tU+1F935\tMAN IN TUXEDO\n\t    \n\t    $ ugrep ^x\t\t    #  Regex anchors ^ and $ work\n\t    ⊻\tU+22BB\tXOR\n\t    ⌧\tU+2327\tX IN A RECTANGLE BOX (clear key)\n\n### Use the `-w` flag to search only for complete words:\n\n\t    $ ugrep -w R\t    # The letter R used as a word\n\t    $ ugrep \"\\bR\\b\"\t    # (regex equivalent)\n\t    R\tU+0052\tLATIN CAPITAL LETTER R\n\t    Ŗ\tU+0156\tLATIN CAPITAL LETTER R WITH CEDILLA\n\t    ℛ\tU+211B\tSCRIPT CAPITAL R (Script r)\n\t    ℜ\tU+211C\tBLACK-LETTER CAPITAL R (Black-letter r)\n\t    ℝ\tU+211D\tDOUBLE-STRUCK CAPITAL R (Double-struck r)\n\n### Use -c to display info for each character in a string.\n\n        $ ugrep -c \"ᕕ( ᐛ )ᕗ\"\n        ᕕ   U+1555  CANADIAN SYLLABICS FI\n        (   U+0028  LEFT PARENTHESIS (opening parenthesis)\n            
U+0020  SPACE\n        ᐛ   U+141B  CANADIAN SYLLABICS NASKAPI WAA\n            U+0020  SPACE\n        )   U+0029  RIGHT PARENTHESIS (closing parenthesis)\n        ᕗ   U+1557  CANADIAN SYLLABICS FO\n\n### Aliases (alternate names) are also searched:\n\n\t    $ ugrep backslash\n\t    \\\tU+005C\tREVERSE SOLIDUS (backslash)\n\n### Use **..** to browse through a range of Unicode characters:\n\n\t    $ ugrep 26b3..b\n\t    ⚳\tU+26B3\tCERES\n\t    ⚴\tU+26B4\tPALLAS\n\t    ⚵\tU+26B5\tJUNO\n\t    ⚶\tU+26B6\tVESTA\n\t    ⚷\tU+26B7\tCHIRON\n\t    ⚸\tU+26B8\tBLACK MOON LILITH\n\t    ⚹\tU+26B9\tSEXTILE\n\t    ⚺\tU+26BA\tSEMISEXTILE\n\t    ⚻\tU+26BB\tQUINCUNX\n\n\t    $ ugrep 1f470..ff  |  less\n\t    👰\tU+1F470\tBRIDE WITH VEIL\n\t    👱\tU+1F471\tPERSON WITH BLOND HAIR\n\t    👲\tU+1F472\tMAN WITH GUA PI MAO\n\t    👳\tU+1F473\tMAN WITH TURBAN\n\t    👴\tU+1F474\tOLDER MAN\n\t    👵\tU+1F475\tOLDER WOMAN\n\t    👶\tU+1F476\tBABY\n\t    👷\tU+1F477\tCONSTRUCTION WORKER\n\t    👸\tU+1F478\tPRINCESS\n\t    👹\tU+1F479\tJAPANESE OGRE\n\t    👺\tU+1F47A\tJAPANESE GOBLIN\n\t    👻\tU+1F47B\tGHOST\n\t    👼\tU+1F47C\tBABY ANGEL\n\t    👽\tU+1F47D\tEXTRATERRESTRIAL ALIEN\n\t    ⋮\t[ ... truncated for brevity ... ]\n\t    📼\tU+1F4FC\tVIDEOCASSETTE\n\t    📽\tU+1F4FD\tFILM PROJECTOR\n\t    📾\tU+1F4FE\tPORTABLE STEREO\n\t    📿\tU+1F4FF\tPRAYER BEADS\n\n    Sometimes it's useful (or just fun) to page through the Unicode\n    table and see what characters are defined in a region. (`ugrep\n    2700..ff`) Ranges are convenient, but very slow. Use regular\n    expressions if you want speed. 
(`ugrep U+27..`)\n\n### Ranges can have an optional increment:\n\n```\n$ ugrep 0..ffff..1000\n   �    U+0000  \u003ccontrol\u003e (null)\n   က    U+1000  MYANMAR LETTER KA\n  [ ]   U+2000  EN QUAD\n  [　]  U+3000  IDEOGRAPHIC SPACE\n   䀀   U+4000  cups; small cups ( M: fàn, C: fan3 fan4 fan6 )\n   倀   U+5000  bewildered; rash, wildly ( M: chāng, C: caang1 caang4 coeng1 zaang1, J: KURUU TAORERU, K: CHANG, V: trành )\n   怀   U+6000  bosom, breast; carry in bosom ( M: huái, C: waai4 )\n   瀀   U+7000  [CJK Unified Ideographs] ( M: yōu, J: ATSUI )\n   耀   U+8000  shine, sparkle, dazzle; glory ( M: yào, C: jiu6, J: KAGAYAKU, K: YO )\n   退   U+9000  step back, retreat, withdraw ( M: tuì, C: teoi3, J: SHIRIZOKU SHIRIZOKERU, K: THOY, V: thoái )\n   ꀀ   U+A000  YI SYLLABLE IT\n   뀀   U+B000  Block: [Hangul Syllables]\n   쀀   U+C000  Block: [Hangul Syllables]\n   퀀   U+D000  Block: [Hangul Syllables]\n   �    U+E000  \u003cPrivate Use, First\u003e\n       U+F000  Block: [Private Use Area]\n```\n\n  * Tip: pipe long output to `less` and search for a code point by\n    pressing `/U\\+A60F`.\n\n### Use -l to list which installed fonts contain a certain glyph:\n  \u003ca href=\"https://raw.githubusercontent.com/hackerb9/ugrep/master/README.md.d/list-fonts.png\"\u003e\n  \u003cimg title=\"ugrep -l swash amp\" alt-text=\"Example of ugrep listing fonts\" align=\"right\" src=\"README.md.d/list-fonts.png\" width=\"50%\"\u003e\n  \u003c/a\u003e\n\n      ugrep -l swash amp\n\n  * Requires FontConfig. 
(Most GNU/Linux boxes should already be set).\n  \n  * The requested character may also be displayed in each of the\n    listed typefaces, but only if your terminal supports sixel\n    graphics (e.g., `xterm -ti vt340`) and you have ImageMagick\n    installed.\n\n### Use -L to scale up the font examples when listing fonts\n  \n  \u003ca href=\"https://raw.githubusercontent.com/hackerb9/ugrep/master/README.md.d/list-fonts-scale.png\"\u003e\n  \u003cimg title=\"ugrep -L4 fdfd\" \n       alt-text=\"Example of ugrep listing fonts at 4x scale\"\n       src=\"README.md.d/list-fonts-scale.png\" width=\"100%\"\u003e\n  \u003c/a\u003e\n  \n  ```\n  ugrep -L4 fdfd\n     ﷽    U+FDFD  ARABIC LIGATURE BISMILLAH AR-RAHMAN AR-RAHEEM\n                    Aldhabi\n                    Trutypewriter PolyglOTT\n                    Unifont\n  ```\n\n  * Note that increasing the glyph size also increased the text size.\n    Not all terminals are capable of \"double height\" text. If yours\n    shows two lines of the same text in the usual size, try using\n    `--never-double-text`.\n\n### Copy whitespace from the terminal\n\n        $ ugrep -w space\n\t\t  [ ]   U+0020  SPACE (SP)\n\t\t  [ ]   U+00A0  NO-BREAK SPACE (non-breaking space) (NBSP)\n\t\t  [ ]   U+1680  OGHAM SPACE MARK\n\t\t  [ ]   U+2002  EN SPACE\n\t\t  [ ]   U+2003  EM SPACE\n\t\t  [ ]   U+2004  THREE-PER-EM SPACE\n\t\t  [ ]   U+2005  FOUR-PER-EM SPACE\n\t\t  [ ]   U+2006  SIX-PER-EM SPACE\n\t\t  [ ]   U+2007  FIGURE SPACE\n\t\t  [ ]   U+2008  PUNCTUATION SPACE\n\t\t  [ ]   U+2009  THIN SPACE\n\t\t  [ ]   U+200A  HAIR SPACE\n\n  Whitespace characters are printed with square brackets around them\n  to make it easy to highlight and copy them from the terminal. They\n  will also be shown with a yellow background, if the terminal allows.\n\n### Determine if an alias is actually a correction\n\n  Ugrep shows the character name in all caps and aliases are usually\n  lowercase in parentheses. 
Some aliases are treated differently.\n  For aesthetic reasons, abbreviations are also shown in uppercase.\n  For example:\n  \n  \u003cspan style=\"font-family: monospace\"\u003e   �    U+FEFF  ZERO WIDTH NO-BREAK SPACE (byte order mark) (BOM) (ZWNBSP)\n\n  There are 31 characters in Unicode which have the wrong name in the\n  UnicodeData.txt database. Unicode includes the correct name as an\n  alias in NameAliases.txt. If that file exists on your system, then\n  ugrep will show the correction in Title Case Letters and in red\n  letters, if the terminal supports color text.\n  \n  \u003cspan style=\"font-family: monospace\"\u003e   ︘   U+FE18  PRESENTATION FORM FOR VERTICAL RIGHT WHITE LENTICULAR BRAKCET (\u003cspan style=\"color:red\"\u003ePresentation Form For Vertical Right White Lenticular Bracket\u003c/span\u003e)\u003c/span\u003e\n\n### View CJK (Chinese-Japanese-Korean) characters\n\nUnicode does not actually define most CJK characters, except\nindirectly via Unihan, which maps certain blocks of characters to\nother standards. \n\n* Ugrep allows one to specify the code point or paste in an example\ncharacter to look up.\n\n        $ ugrep 𰻞\n           𰻞   U+30EDE biangbiang noodles ( M: biáng )\n\n        $ ugrep 8000\n        耀  U+8000  shine, sparkle, dazzle; glory ( M: yào, C: jiu6, J: KAGAYAKU, K: YO )\n\n  \u003ca href=\"README.md.d/biangbiang.png\"\u003e\n  \u003cimg title=\"ugrep 30ede -L8\" \n       alt-text=\"Example of ugrep showing character U+30EDE at 8x scale\"\n       src=\"README.md.d/biangbiang.png\"\u003e\n  \u003c/a\u003e\n\n### View _all_ characters defined by Unicode:\n\n\t    $ ugrep .?  |  less\n\t    ⋮\t[ ... over 30,000 glyphs elided for brevity ... ]\n\n  * Want just Unicode glyphs without the description? Please use\n    [fonttable](https://github.com/hackerb9/fonttable). 
It shows all\n    defined Unicode characters by default.\n\n### Show all possible code points, even the ones _not_ defined in Unicode:\n\n\t\t$ ugrep 0..10FFFF | less\n\t    ⋮\t[ ... over a million lines elided for brevity ... ]\n\n  ☝ This is currently very slow due to the way `ugrep` is implemented.\n  You likely want to use\n  [fonttable -u](https://github.com/hackerb9/fonttable) instead. \n  \n\u003c/details\u003e\n\n## Prerequisite: UnicodeData.txt\n\nUgrep requires the Unicode data file\n[UnicodeData.txt](https://unicode.org/Public/UNIDATA/UnicodeData.txt)\nwhich can be installed on your system, in your home, or in the current\ndirectory.\n\n**Easiest**: On Ubuntu and Debian GNU/Linux, simply `apt install unicode-data`.\n\n**Still easy**: Or, you can download it by hand from\n[unicode.org](https://unicode.org/Public/UNIDATA/UnicodeData.txt)\nand place it in `~/.local/share/unicode/UnicodeData.txt`\n\n**Not hard**: Or, if you wish the file to be accessible to all users on\nyour machine, place it in `/usr/local/share/unicode/UnicodeData.txt`.\n\n## Unihan CJK Support\n\nIf the file `Unihan_Readings.txt` exists, then ugrep will\nautomatically use it to show an English gloss describing a character\nin the CJK (Chinese-Japanese-Korean) Ideographs region.\n\nYour OS may make it easy to install (e.g., `apt install unicode-data`). 
\nOn other systems, you can do this\n\n    mkdir -p ~/.local/share/unicode\n    cd ~/.local/share/unicode\n    wget ftp://ftp.unicode.org/Public/UNIDATA/Unihan.zip\n    unzip Unihan.zip\n\n### CJK example\n\n\u003cdetails\u003e\n\n#### Example 1: Unicode code point\n\n```\n$ ugrep 8000\n   耀   U+8000  shine, sparkle, dazzle; glory ( M: yào, C: jiu6, J: KAGAYAKU, K: YO )\n```\n\nThe parenthesized text at the end shows the romanized pronunciation of\nthe character in **M**andarin (pinyin), **C**antonese (jyutping),\n**J**apanese (Hepburn), and **K**orean (Yale).\n\n#### Example 2: Using -c to see characters in a string\n\n```\n$ ugrep -c 「⿺辶⿳穴⿰月⿰⿲⿱幺長⿱言馬⿱幺長刂心」\n   「   U+300C  LEFT CORNER BRACKET (opening corner bracket)\n   ⿺   U+2FFA  IDEOGRAPHIC DESCRIPTION CHARACTER SURROUND FROM LOWER LEFT\n   辶   U+8FB6  walk; walking; KangXi radical 162 ( M: chuò, J: SHINNYOU )\n   ⿳   U+2FF3  IDEOGRAPHIC DESCRIPTION CHARACTER ABOVE TO MIDDLE AND BELOW\n   穴   U+7A74  cave, den, hole; KangXi radical 116 ( M: xué, C: jyut6, J: ANA, K: HYEL, V: huyệt )\n   ⿰   U+2FF0  IDEOGRAPHIC DESCRIPTION CHARACTER LEFT TO RIGHT\n   月   U+6708  moon; month; KangXi radical 74 ( M: yuè, C: jyut6, J: TSUKI, K: WEL, V: nguyệt )\n   ⿰   U+2FF0  IDEOGRAPHIC DESCRIPTION CHARACTER LEFT TO RIGHT\n   ⿲   U+2FF2  IDEOGRAPHIC DESCRIPTION CHARACTER LEFT TO MIDDLE AND RIGHT\n   ⿱   U+2FF1  IDEOGRAPHIC DESCRIPTION CHARACTER ABOVE TO BELOW\n   幺   U+5E7A  one; tiny, small ( M: yāo, C: jiu1, J: CHIISAI, K: YO )\n   長   U+9577  long; length; excel in; leader ( M: zhǎng, C: coeng4 zoeng2, J: NAGAI TAKERU OSA, K: CANG, V: trường )\n   ⿱   U+2FF1  IDEOGRAPHIC DESCRIPTION CHARACTER ABOVE TO BELOW\n   言   U+8A00  words, speech; speak, say ( M: yán, C: jin4, J: KOTO IU KOTOBA, K: EN UN, V: ngôn )\n   馬   U+99AC  horse; surname; KangXi radical 187 ( M: mǎ, C: maa5, J: UMA, K: MA, V: mã )\n   ⿱   U+2FF1  IDEOGRAPHIC DESCRIPTION CHARACTER ABOVE TO BELOW\n   幺   U+5E7A  one; tiny, small ( M: yāo, C: jiu1, J: CHIISAI, K: 
YO )\n   長   U+9577  long; length; excel in; leader ( M: zhǎng, C: coeng4 zoeng2, J: NAGAI TAKERU OSA, K: CANG, V: trường )\n   刂   U+5202  knife; radical number 18 ( M: dāo, C: dou1, J: RITSUTOU, K: TO )\n   心   U+5FC3  heart; mind, intelligence; soul ( M: xīn, C: sam1, J: KOKORO, K: SIM, V: tâm )\n   」   U+300D  RIGHT CORNER BRACKET (closing corner bracket)\n```\n\n### Note 1: A \"definition\" is not a translation\n\nUnihan calls the English gloss the character's \"definition\", but that\nis meant in a very loose sense. CJK characters change meaning based\nupon the context they are used in. For example, most Chinese words are\nmade of two characters, such as \"蜂鳥\", which means \"hummingbird\", but\nugrep would show it as:\n\n```\n$ ugrep -c 蜂鳥\n   蜂   U+8702  bee, wasp, hornet ( M: fēng, C: fung1, J: HACHI, K: PONG, V: ong )\n   鳥   U+9CE5  bird; KangXi radical 196 ( M: niǎo, C: niu5, J: TORI, K: CO, V: điểu )\n```\n\n### Note 2: Not all characters have readings\n\nUnihan refers to this supplemental information (both the English\ngloss and the romanizations) as \"readings\". Readings are meant to be\nhelpful, but are not normative and are only available for some\ncharacters.\n\n|                    |  Count | Percent |\n|--------------------|-------:|--------:|\n| All CJK Characters | 93,858 |    100% |\n| Have any reading   | 47,429 |     51% |\n| Mandarin Pinyin    | 41,378 |     44% |\n| Cantonese Jyutping | 23,112 |     25% |\n| English definition | 21,076 |     23% |\n| Japanese Hepburn   | 11,293 |     12% |\n| Korean Yale        |  9,051 |     10% |\n| Vietnamese         |  8,301 |      9% |\n\n#### Example of CJK with no Mandarin\n\n```\n$ ugrep 2bac3\n   𫫃   U+2BAC3 (Cant.) 
sarcastic interrogative ( C: e1 )\n```\n\n#### Example of CJK with no pronunciation\n\n```\n$ ugrep 20015\n   𠀕   U+20015 Variant of U+4E99 亙\n```\n\n#### Example of CJK with no English definition\n\n```\n$ ugrep 20016\n   𠀖   U+20016 [CJK Unified Ideographs Extension B] ( V: khạng )\n```\n\n#### Example of CJK with no readings whatsoever\n\n```\n$ ugrep 2abcd\n   𪯍   U+2ABCD [CJK Unified Ideographs Extension C]\n```\n\nNote that ugrep currently prints just the name of the block the\ncharacter is in [within square brackets] if it has no better way to\nidentify the character. \n\n- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - \n\n\u003c/details\u003e\n\n## Boring Implementation notes\n\n\u003cdetails\u003e\n\nThis is a rewrite of b9's AWK ugrep into Python. While AWK makes more\nsense for what this program does (comparing fields based on regexps),\na rewrite was necessary because GNU awk, while plenty powerful, uses\n`\\y` for word edges instead of the standard `\\b`. Gawk does this for\nbackwards compatibility with historic AWK, but lacks a way to disable\nit for new scripts.\n\nSwitching to Python did have the benefit of allowing more powerful\nPerlesque regexes (not that anyone has requested that).\n\n### Why not use the unicodedata module?\n\nI do not use Python's `unicodedata` module because it is woefully\ninsufficient. It allows one to search by character name only by\nspecifying it fully and exactly: `unicodedata.lookup(\"ROTATED HEAVY\nBLACK HEART BULLET\")`.\n\n\u003c/details\u003e\n\n## Future Work\n\n### Rename this project\n\nAlthough I believe this `ugrep` existed first, there is now another\n[ugrep](https://github.com/Genivia/ugrep) which is quite widely known\n(with good reason, as it looks pretty nifty) and which has nothing to do\nwith looking up Unicode characters. The 'U' appears to stand for\n_Ultra-fast_ as it is a very speedy `grep` with lots of bells and\nwhistles.\n\nWhat shall this project's new name be? 
`ug` is also taken by the other\nugrep. How about `ugre`? It's an ugly, ogreish name, but it's probably\na safe bet nobody is going to use that name for something else.\n\n### Maybe use Unihan_Readings.txt for grepping\n\nCurrently, if `Unihan_Readings.txt` is installed (which is the default if\nthe user has done `apt install unicode-data`) and the user requests a\ncharacter that is not in UnicodeData.txt, then the Readings data is\nused to show information about the character. However, Unihan_Readings\ncould be used in the future for searching for characters to show.\n\nExample data from Unihan_Readings for U+9B44 (魄): \n\n\tU+9B44\tkCantonese\tbok3 paak3 tok3\n\tU+9B44\tkDefinition\tvigor; body; dark part of moon\n\tU+9B44\tkHangul\t백:0N\n\tU+9B44\tkHanyuPinlu\tpò(11)\n\tU+9B44\tkHanyuPinyin\t74431.090:pò,bó,tuò\n\tU+9B44\tkJapaneseKun\tTAMASHII\n\tU+9B44\tkJapaneseOn\tHAKU BAKU\n\tU+9B44\tkKorean\tPAYK\n\tU+9B44\tkMandarin\tpò\n\tU+9B44\tkTGHZ2013\t287.140:pò\n\tU+9B44\tkTang\t*pæk\n\tU+9B44\tkVietnamese\tphách\n\tU+9B44\tkXHC1983\t0084.110:bó 0887.020:pò 1175.020:tuò\n\nSee [UAX #38: Unicode Han Database](https://www.unicode.org/reports/tr38/tr38-31.html).\n\nTwo levels of Unihan support:\n1. Show kDefinition if block name is *CJK Ideographs*\n2. Search Unihan_Readings when searching for a word. Possible example:\n       $ ugrep mononoke\n\t   魅\tU+9B45\tMONONOKE BAKEMONO SUDAMA (kind of forest demon, elf)\n\nNumber 1 is finished and working, but number 2 may require a command\nline switch or some other way of enabling/disabling it as searching\nthrough the Readings file may be slow or cause other problems.\n\n### Maybe use NamesList.txt\n\nIt looks like\n[`NamesList.txt`](https://unicode.org/Public/UNIDATA/NamesList.txt)\nmight be useful to also parse as it allows multiple aliases for a\ncharacter. 
For example (from `grep -B1 [=%] NamesList.txt`):\n\n    0023    NUMBER SIGN\n            = pound sign, hash, crosshatch, octothorpe\n\n    002E    FULL STOP\n            = period, dot, decimal point\n    --\n    002F    SOLIDUS\n            = slash, virgule\n\n    1F70A   ALCHEMICAL SYMBOL FOR VINEGAR\n            = crucible; acid; distill; atrament; vitriol; red\n              sulfur; borax; wine; alkali salt; mercurius vivus,\n              quick silver\n\nI'm not sure how useful this will be (who is going to look up the\nnumber sign by searching on \"octothorpe\"), but it'd be nice to be able\nto at least show them as aliases.\n\nAlso, NamesList.txt has a fascinating \"cross reference\" feature:\n\n    0021    EXCLAMATION MARK\n            = factorial\n            = bang\n            x (inverted exclamation mark - 00A1)\n            x (latin letter retroflex click - 01C3)\n            x (double exclamation mark - 203C)\n            x (interrobang - 203D)\n            x (heavy exclamation mark ornament - 2762)\n\nHow would one find the interrobang (‽) without such a cross reference?\n\nNote that the NamesList.txt file actually starts with a warning *not*\nto parse it as it says it is generated mechanically from\nUnicodeData.txt plus \"manually created annotations\". However, those\nannotations are what is interesting about the file (the aliases and\ncross references) and there appears to be no other official source of\nthat data.\n\n\n## Bugs, Misfeatures, and Workarounds\n\n* ugrep 3400 shows the text defined in UnicodeData.txt, which states\n  that it is \"\u003cCJK Ideograph Extension A, First\u003e\". Now that ugrep can\n  show ideograph definitions using Unihan_Readings.txt, we should\n  (probably) replace any string in angle brackets with more useful info.\n\n* Brace expansion is confusing because of needing to be quoted from\n  the shell. 
It is supported for ranges (not sequences), but is not\n  currently documented because usage is tricky and the functionality\n  is not actually that helpful. For example, the following works:\n\n      ugrep {0..F}{0,4,8,C}00\n\n  but is easier to understand using range expansion:\n\n      ugrep 0..FFFF..400\n\n* Range expansion and a seemingly equivalent regular expression search\n  will give different results.\n  \n      ugrep 0..FFFF..400 | wc -l \n\t  64\n\t  ugrep U+[0-9A-F][048C]00 | wc -l\n\t  22\n\n  This is because regexes currently only return valid code points from\n  the UnicodeData.txt file, whereas range expansions can generate code\n  points which are in regions not directly defined by Unicode. For\n  example, the range from U+4E00 to U+9FEF is a block of CJK Ideographs.\n  Both are useful: regexes are blazingly fast, while range expansions\n  have more functionality. \n\n* [Note: The following is not a problem for people who are willing to\n  use vector fonts (truetype, opentype, postscript) that may be\n  antialiased. Xterm uses fontconfig just fine.]\n\n  \u003cdetails\u003e\n\n  For bitmap fonts, Xterm (as of version 369) seems to be able to only\n  use one font at a time, which means a single font must have all the\n  glyphs you want shown. (Yes, you can have a second bitmap font for\n  \"wide\" CJK, but that's still not enough.)\n\n  The author (hackerb9) currently prefers using the Neep bitmap font\n  like so in `~/.Xresources`:\n\n      ! Neep looks nice, has good unicode coverage. Requires xfonts-jmk.\n      xterm*vt100.font        :       *neep-medium-r-normal--20*10646*\n      ! Neep lacks Asian characters\n      xterm*vt100.wideFont    :       *fixed-medium-r-normal-ja-18*10646*\n\n  Neep has two major downsides. 1. It is a bitmap font with only one\n  size well implemented, so you can't zoom in or out. 2. 
It is limited\n  to 65536 characters, which means it cannot show characters outside\n  of Unicode's Basic Multilingual Plane, such as new emojis. Neep can\n  be installed on Debian GNU/Linux systems with `apt install\n  xfonts-jmk`.\n\n\n* Mlterm appears to have the same single font limitation as Xterm.\n  Also, it right aligns text that has even a single character in a\n  right-to-left alphabet, such as Arabic, so the output from ugrep\n  will look a little funny.\n\n  \u003c/details\u003e\n\n* Gnome-terminal uses `font-config`, so it has very nice Unicode\n  support and can easily zoom in with Ctrl-+⃣ and Ctrl--⃣. Older\n  versions had a bug where combining characters were combined with the\n  following character instead of the previous, but this is now fixed.\n\n  It does not support sixel graphics, so the -l option cannot show\n  examples of the character in different fonts.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fhackerb9%2Fugrep","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fhackerb9%2Fugrep","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fhackerb9%2Fugrep/lists"}