{"id":20665656,"url":"https://github.com/thinkr-open/utf8splain","last_synced_at":"2025-04-19T16:40:40.935Z","repository":{"id":69754330,"uuid":"99092294","full_name":"ThinkR-open/utf8splain","owner":"ThinkR-open","description":"Explain utf-8 encoded strings","archived":false,"fork":false,"pushed_at":"2024-02-16T09:44:40.000Z","size":227,"stargazers_count":17,"open_issues_count":8,"forks_count":1,"subscribers_count":1,"default_branch":"master","last_synced_at":"2025-03-29T10:21:57.388Z","etag":null,"topics":["binary","encoding","r","string","utf-8"],"latest_commit_sha":null,"homepage":null,"language":"R","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"other","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/ThinkR-open.png","metadata":{"files":{"readme":"README.Rmd","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2017-08-02T08:31:57.000Z","updated_at":"2024-02-15T12:28:18.000Z","dependencies_parsed_at":null,"dependency_job_id":"d15acc14-38ad-4504-b1ae-1890793c2816","html_url":"https://github.com/ThinkR-open/utf8splain","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ThinkR-open%2Futf8splain","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ThinkR-open%2Futf8splain/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ThinkR-open%2Futf8splain/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ThinkR-open%2Futf8splain/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/ThinkR-open","download_url":"http
s://codeload.github.com/ThinkR-open/utf8splain/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":249740040,"owners_count":21318674,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["binary","encoding","r","string","utf-8"],"created_at":"2024-11-16T19:33:05.097Z","updated_at":"2025-04-19T16:40:40.927Z","avatar_url":"https://github.com/ThinkR-open.png","language":"R","funding_links":[],"categories":[],"sub_categories":[],"readme":"---\noutput: github_document\n---\n\n```{r, echo = FALSE}\nknitr::opts_chunk$set(\n  collapse = TRUE,\n  comment = \"#\u003e\",\n  fig.path = \"README-\"\n)\nlibrary(utf8splain)\n```\n\n## Installation from GitHub\n\n```{r eval=FALSE}\ndevtools::install_github( \"ThinkR-open/utf8splain\")\n```\n\n## Split a string into bytes \n\n```{r}\nbytes( \"hello 🌍\" )\n```\n\n## Split a utf-8 encoded string into unicode runes \n\nIf you run it in a [crayon](https://github.com/r-lib/crayon)-compatible terminal, for example \na recent enough version of RStudio, the `print` method gives you a nicer output:\n\n![](img/runes-crayon.png)\n\n## Details about unicode and utf-8\n\nutf-8 encoded strings are divided into a series of runes (aka unicode code points) from \nthe [unicode table](https://unicode-table.com/en/), for example the rune \nfor the lowercase \"h\" is [U+0068](https://unicode-table.com/en/#0068). \n\nEach rune is encoded in a variable number of bytes, depending on how far it is in the\ntable, for example \"h\" (like all other ASCII characters) needs only one byte, but \n🌍 needs 4 bytes. 
\n\nutf-8 bytes are organised as follows: \n - the first byte of a multi-byte rune starts with as many 1s as the rune needs bytes, followed by a 0, e.g. the first byte \n   of the utf-8 encoded 🌍 starts with \"11110\"; the only byte of a single-byte rune such as \"h\" starts with \"0\"\n - the remaining bytes (if any) all start with \"10\"\n \nAll the remaining bits store the binary representation of the rune, \nfor example the 7 bits \"1101000\" follow the initial \"0\" in the encoding of \"h\". 🌍 corresponds to the rune [U+1F30D](https://unicode-table.com/en/#1F30D), i.e. the rune number `0x1F30D`. \n\n```{r}\n# code point of 🌍 as an integer\nworld_decimal \u003c- strtoi( \"0x1F30D\", base = 16)\nworld_decimal\n\n# its 32-bit binary representation, most significant bit first\nworld_binary    \u003c- paste( substr(as.character( rev(intToBits(world_decimal)) ), 2, 2 ), collapse = \"\" )\nworld_binary\n\n# drop the leading zeros\nworld_binary_signif \u003c- sub( \"^0+\", \"\", world_binary )\nworld_binary_signif\n\n# number of significant bits\nnchar(world_binary_signif)\n```\n\nSo 🌍 needs `r nchar(world_binary_signif)` bits. That is more than the 16 bits a 3-byte sequence can carry (4 + 6 + 6), so it needs 4 utf-8 bytes, which carry up to 21 bits (3 + 6 + 6 + 6). \n\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fthinkr-open%2Futf8splain","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fthinkr-open%2Futf8splain","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fthinkr-open%2Futf8splain/lists"}