{"id":17101314,"url":"https://github.com/hackerb9/utf8strings","last_synced_at":"2025-03-23T19:16:10.211Z","repository":{"id":113746491,"uuid":"204140052","full_name":"hackerb9/utf8strings","owner":"hackerb9","description":"Extract strings of UTF-8 (four characters or longer) from binary blobs.","archived":false,"fork":false,"pushed_at":"2019-08-24T12:24:17.000Z","size":60,"stargazers_count":2,"open_issues_count":0,"forks_count":1,"subscribers_count":3,"default_branch":"master","last_synced_at":"2025-01-29T01:56:31.346Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":null,"language":"C","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"gpl-3.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/hackerb9.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2019-08-24T10:02:31.000Z","updated_at":"2024-04-13T08:22:18.000Z","dependencies_parsed_at":null,"dependency_job_id":"aa89a3df-8ffe-4283-9524-3a709498fb0c","html_url":"https://github.com/hackerb9/utf8strings","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/hackerb9%2Futf8strings","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/hackerb9%2Futf8strings/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/hackerb9%2Futf8strings/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/hackerb9%2Futf8strings/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/hackerb9","download_url":"https://codeload.github.
com/hackerb9/utf8strings/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":245153896,"owners_count":20569408,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2024-10-14T15:24:44.842Z","updated_at":"2025-03-23T19:16:10.188Z","avatar_url":"https://github.com/hackerb9.png","language":"C","funding_links":[],"categories":[],"sub_categories":[],"readme":"# utf8strings\nExtract strings of UTF-8 (four characters or longer) from binary blobs.\n\n# Usage\n\n    Compilation:\n        make utf8strings\n\n    Usage:\n        utf8strings [ filename ]\n\n    Examples:\n        utf8strings /usr/sbin/bomb\n        utf8strings /dev/mem | less\n        somebinaryemittingprogram | utf8strings \n\n\u003cimg align=\"center\" src=\"README.md.d/screenshot.png\"\u003e\n\n# Why?\n\n\"Binary\" files often have text strings embedded in them, but the\nstandard `strings` utility that comes with [GNU\nbinutils](https://gnu.org/software/binutils/) does not (yet)\nunderstand UTF-8. This is a serious problem because UTF-8 has become\nthe de facto standard for text in UNIX systems and on the Internet.\n\n# How\n\nUTF-8 is a beautiful design and includes the ability to _self\nsynchronize_. Each character in a UTF-8 string is made up of a\nsequence of up to four bytes. By looking at the first two bits of a\nbyte, one knows immediately if the byte represents an ASCII character\n(00, 01), an initial byte in a sequence (11), or a continuation byte\n(10). 
That means that there is never any confusion about possibly\noverlapping UTF-8 interpretations.\n\n# Initial release \n\nThis was designed to be simple and correct. It was implemented in\nbog-standard C. No thought has been put into optimization yet. It\ncorrectly identifies valid UTF-8 sequences and rejects non-UTF-8. It\nshows strings with a minimum length of four *characters* (not bytes).\nIt reads from stdin, or a single filename may be specified.\n\nIt works for my purposes and probably will be fine for you as well.\n\n# Deficiencies\n* Hardcoded to strings of minlength 4. \n* Could be a lot faster with some simple optimizations.\n* Does not handle any options.\n* Should be merged with `strings` from GNU binutils.\n\n# Future\n\nI've licensed this code under the same license as GNU binutils in the\nhope that it will be useful to the GNU folks as they improve the\nofficial version of `strings` to support UTF-8.\n\n# Implementation Notes\n\n## A. INVALID UTF-8 SEQUENCES are correctly discarded:\nFor example,\n   1. Bytes that don't begin with UTF-8's magic (10*, 110*, 1110*, or 11110*).\n   2. A byte with the correct magic bits, but all 0s for data. (E.g., 11110000).\n   3. Incorrect usage of continuation bytes (10*) \n      1. After 110*, there must be one continuation byte.\n      2. After 1110*, there must be two continuation bytes.\n      3. After 11110*, there must be three continuation bytes.\n      4. Continuation bytes (10*) not preceded by one of the above are invalid.\n   4. Bytes C0 and C1. (They would encode ASCII as two bytes).\n   5. U+D800 to U+DFFF are reserved for UTF-16's surrogate halves.\n   6. Leading byte of F4 and codepoint is beyond Unicode's limit. (\u003e0x10FFFF)\n   7. Leading byte of F5 to FD. (Codepoint is greater than 0x10FFFF).\n   8. Leading byte of FE or FF. (Undefined in UTF-8 to allow for UTF-16 BOM).\n   9. Code points U+80 to U+9F are skipped as control characters.\n  10. End of file before a complete character is read.\n\n## B. 
MAYBE IT COULD BE BETTER.\n\n   Some valid UTF-8 sequences are actually undefined code points in\n   Unicode and shouldn't be printed. Similarly, for a `strings`\n   program like this, we would want to check Unicode's syntactic\n   tables so we can ignore non-printable characters. Those features\n   have been left out intentionally as they would be much more complex\n   and require updating with every new release of the Unicode\n   standard.\n\n## C. SOME TESTS:\n   1a. Values beyond Unicode (\u003e= 0x110000) should NOT be shown:\n\n       echo -n $'XX\\xf4\\x90\\x80\\x80XX' | ./utf8strings  | hd\n\n   1b. Characters \u003c= 0x10FFFF should show something:\n\n       echo -n $'XX\\xf4\\x8f\\xbf\\xbfXX' | ./utf8strings  | hd\n\n   2a. UTF-16 surrogate halves should NOT be shown:\n\n       echo -n $'XX\\xED\\xA0\\x80XX' | ./utf8strings | hd\n\n   2b. Characters from U+D000 to U+D7FF should be shown:\n\n       echo -n $'XX\\xED\\x9F\\xBFXX' | ./utf8strings | hd\n\n   3a. UTF-8 Control characters 0x80 to 0x9F should NOT be shown:\n\n       echo $'XX\\xC2\\x80XX'  | ./utf8strings | hd\n\n   3b. Characters \u003e= 0xA0 should be shown:\n\n       echo $'XX\\xC2\\xA0XX'  | ./utf8strings | hd\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fhackerb9%2Futf8strings","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fhackerb9%2Futf8strings","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fhackerb9%2Futf8strings/lists"}