https://github.com/thinkr-open/utf8splain

Explain utf-8 encoded strings
https://github.com/thinkr-open/utf8splain

binary encoding r string utf-8

Last synced: about 1 year ago
JSON representation

Explain utf-8 encoded strings

Host: GitHub
URL: https://github.com/thinkr-open/utf8splain
Owner: ThinkR-open
License: other
Created: 2017-08-02T08:31:57.000Z (almost 9 years ago)
Default Branch: master
Last Pushed: 2024-02-16T09:44:40.000Z (over 2 years ago)
Last Synced: 2025-03-29T10:21:57.388Z (about 1 year ago)
Topics: binary, encoding, r, string, utf-8
Language: R
Size: 222 KB
Stars: 17
Watchers: 1
Forks: 1
Open Issues: 8
Metadata Files:
- Readme: README.Rmd
- License: LICENSE

Awesome Lists containing this project

README

          ---

output: github_document

---

```{r, echo = FALSE}

knitr::opts_chunk$set(

  collapse = TRUE,

  comment = "#>",

  fig.path = "README-"

)

library(utf8splain)

```

## Installation from github

```{r eval=FALSE}

devtools::install_github( "ThinkR-open/utf8splain")

```

## Split a string into bytes 

```{r}

bytes( "hello 🌍" )

```

## Split a utf-8 encoded string into unicode runes 

If you run it in a [crayon](https://github.com/r-lib/crayon) compatible terminal, for example 

a recent enough version of rstudio, the `print` method gives you a nicer output:

![](img/runes-crayon.png)

## Details about unicode and utf-8

utf-8 encoded strings are divided in a series of runes (aka unicode code points) from 

the [unicode table](https://unicode-table.com/en/), for example the rune 

for the lower case "h" is [U+0068](https://unicode-table.com/en/#0068). 

Each rune is encoded in a variable number of bytes, depending on how far it is in the

table, for example "h" (and all other ascii characters) only need one byte, but 

🌍 needs 4 bytes. 

utf-8 bytes are organised as follows: 

 - the first byte of a rune starts with as many 1 as the rune needs bytes, followed by a 0, e.g. the first rune 

   for the utf-8 encoded 🌍 starts with "11110", and the only byte of the encoded "h" starts with "0"

 - the remaining bytes (if any) all start with "10"

 

All the bits that are not taken are used to store the binary representation of the rune, 

for example the 7 bits "1101000" follow the initial "0" in the encoding of "h". 🌍 correspond to the rune [U+1F30D](https://unicode-table.com/en/#1F30D), i.e. the rune number `0x1F30D`. 

```{r}

world_decimal <- strtoi( "0x1F30D", base = 16)

world_decimal

world_binary    <- paste( substr(as.character( rev(intToBits(world_decimal)) ), 2, 2 ), collapse = "" )

world_binary

world_binary_signif <- sub( "^0+", "", world_binary )

world_binary_signif

nchar(world_binary_signif)

```

So 🌍 needs `r nchar(world_binary_signif)` bits, which needs 4 utf-8 bytes.

ecosyste.ms

Data

Tools

Indexes

Applications

Experiments

Awesome

https://github.com/thinkr-open/utf8splain

Awesome Lists containing this project

README