https://github.com/krlmlr/enc
A simple class for storing UTF-8 strings
- Host: GitHub
- URL: https://github.com/krlmlr/enc
- Owner: krlmlr
- Archived: true
- Created: 2016-06-13T20:25:10.000Z (over 8 years ago)
- Default Branch: main
- Last Pushed: 2024-01-24T00:28:33.000Z (10 months ago)
- Last Synced: 2024-06-04T23:02:43.849Z (5 months ago)
- Topics: character-encoding, r, utf-8
- Language: R
- Homepage: https://krlmlr.github.io/enc
- Size: 428 KB
- Stars: 16
- Watchers: 3
- Forks: 3
- Open Issues: 4
Metadata Files:
- Readme: README.Rmd
Awesome Lists containing this project
- jimsghstars - krlmlr/enc - A simple class for storing UTF-8 strings (R)
README
---
output: downlit::readme_document
---

# enc
[![Lifecycle: experimental](https://img.shields.io/badge/lifecycle-experimental-orange.svg)](https://www.tidyverse.org/lifecycle/#experimental)
[![rcc](https://github.com/krlmlr/enc/workflows/rcc/badge.svg)](https://github.com/krlmlr/enc/actions)
[![codecov](https://codecov.io/gh/krlmlr/enc/branch/master/graph/badge.svg)](https://codecov.io/gh/krlmlr/enc)
[![CRAN_Status_Badge](https://www.r-pkg.org/badges/version/enc)](https://cran.r-project.org/package=enc)

Portable tools for UTF-8 character data
## R and character encoding
The [character encoding](https://en.wikipedia.org/wiki/Character_encoding) of a text determines how its letters, digits, and other codepoints (the atomic components of a text) are translated into a sequence of bytes. A byte sequence may translate into valid text in one character encoding, but give nonsense in other character encodings.
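To make this concrete, here is a minimal base R sketch (it does not use the enc package). It assumes the letter "ä" as the sample text and uses `iconv()` to convert between UTF-8 and latin1; the variable names are illustrative only.

```r
# "ä" is stored as the two bytes c3 a4 in UTF-8.
utf8_bytes <- as.raw(c(0xc3, 0xa4))
s <- rawToChar(utf8_bytes)
Encoding(s) <- "UTF-8"  # tell R how to interpret the bytes

# The same letter is the single byte e4 in latin1.
latin1 <- iconv(s, from = "UTF-8", to = "latin1")
latin1_bytes <- charToRaw(latin1)

# Misreading the UTF-8 bytes as latin1 yields two unrelated characters:
# converting that misreading back to UTF-8 produces four bytes instead of two.
misread <- iconv(rawToChar(utf8_bytes), from = "latin1", to = "UTF-8")
misread_bytes <- charToRaw(misread)
```

The same two bytes are one character under UTF-8 and two characters under latin1, which is the kind of nonsense the paragraph above refers to.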
For historic reasons, R can store strings in different ways:
1. in the "native" encoding, the default encoding of the operating system
1. in [UTF-8](https://en.wikipedia.org/wiki/UTF-8), the most prevalent and versatile encoding nowadays
1. in "latin1", a popular encoding in Western Europe
1. as "bytes", leaving the interpretation to the user

On OS X and Linux, the "native" encoding is often UTF-8, but on Windows it is not. To add to the confusion, the encoding is a property of individual strings in a character vector, and not of the entire vector.
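The per-string nature of the encoding can be inspected with base R's `Encoding()`. The following is a small sketch, assuming "ä" as the sample character; the variable names are illustrative only. Note that pure ASCII strings are always reported as `"unknown"`.

```r
# One non-ASCII string stored in three different ways.
u <- enc2utf8("\u00e4")                       # marked as UTF-8
l <- iconv(u, from = "UTF-8", to = "latin1")  # marked as latin1
b <- u
Encoding(b) <- "bytes"                        # same bytes as u, no interpretation

# A single character vector can mix all of these.
x <- c("a", u, l, b)
marks <- Encoding(x)

# enc2utf8() converts the elements to UTF-8 one by one.
converted <- Encoding(enc2utf8(c("a", u, l)))
```

Here `u` and `l` hold different bytes but compare as equal, because R translates strings with differing encoding marks before comparing them.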
## Why UTF-8?
When working with text, it is advisable to use UTF-8, because it can encode virtually any text, including languages whose symbols cannot be represented in your system's native encoding. The UTF-8 encoding possesses several nice technical properties, and is by far the predominant encoding on the Web. Standardization on a "universal" encoding facilitates data exchange.
Because of R's special handling of strings, some care must be taken to make sure that you're actually using the UTF-8 encoding. Many functions in R will hide encoding issues from you, and transparently convert to UTF-8 as necessary. However, some operations (such as reading and writing files) will stubbornly prefer the native encoding.
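As an illustration of the extra care needed for files, here is a base R sketch of a UTF-8 round trip. It is not the enc package's implementation, only one way to sidestep the native encoding under the assumption that the text is already UTF-8; the file path and variable names are illustrative.

```r
path <- tempfile(fileext = ".txt")
text <- enc2utf8(c("a", "\u00e4", "\u00f6"))

# Write the UTF-8 bytes unchanged: a binary connection avoids end-of-line
# translation, and useBytes = TRUE avoids re-encoding to the native encoding.
con <- file(path, open = "wb")
writeLines(text, con, sep = "\n", useBytes = TRUE)
close(con)

# Read the lines back and mark the non-ASCII ones as UTF-8.
back <- readLines(path, encoding = "UTF-8")

# Inspect the bytes on disk.
on_disk <- readBin(path, what = "raw", n = 100)
```

Each line ends with a single LF byte regardless of platform, and the non-ASCII letters occupy two bytes each.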
The enc package provides helpers for converting all textual components of an object to UTF-8, and for reading and writing files in UTF-8 (with a LF end-of-line terminator by default). It also defines an S3 class for tagging all-UTF-8 character vectors and ensuring that updates maintain the UTF-8 encoding.

Examples of other packages that use UTF-8 by default are:
- [readr](https://readr.tidyverse.org/), [readxl](https://readxl.tidyverse.org/), and [haven](https://haven.tidyverse.org/) for data input and output
- [stringi](https://cran.r-project.org/package=stringi) and [stringr](https://stringr.tidyverse.org/) for string manipulation
- [testthat](https://testthat.r-lib.org/) and [roxygen2](https://cran.r-project.org/package=roxygen2) for package development
## Example
```{r, echo = FALSE}
knitr::opts_chunk$set(
collapse = TRUE,
comment = "#>",
fig.path = "README-"
)
```

```{r}
library(enc)
utf8(c("a", "ä"))
as_utf8(1)

a <- utf8("ä")
a[2] <- "ö"
class(a)

data.frame(abc = letters[1:3], utf8 = utf8(letters[1:3]))
```

Install the package from GitHub:
```r
# install.packages("devtools")
devtools::install_github("krlmlr/enc")
```