UTF-8 Text Processing (R Package)
https://github.com/krlmlr/r-utf8

---
output:
  github_document:
    html_preview: false
---

```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  fig.path = "man/figures/README-",
  out.width = "100%"
)

pkgload::load_all()

set.seed(20230702)

clean_output <- function(x, options) {
  x <- gsub("0x[0-9a-f]+", "0xdeadbeef", x)
  x <- gsub("dataframe_[0-9]*_[0-9]*", " dataframe_42_42 ", x)
  x <- gsub("[0-9]*\\.___row_number ASC", "42.___row_number ASC", x)

  index <- x
  index <- gsub("─", "-", index)
  index <- strsplit(paste(index, collapse = "\n"), "\n---\n")[[1]][[2]]
  writeLines(index, "index.md")

  x <- fansi::strip_sgr(x)
  x
}

options(
  cli.num_colors = 256,
  cli.width = 80,
  width = 80,
  pillar.bold = TRUE
)

local({
  hook_source <- knitr::knit_hooks$get("document")
  knitr::knit_hooks$set(document = clean_output)
})
```

# utf8

[![rcc](https://github.com/patperry/r-utf8/workflows/rcc/badge.svg)](https://github.com/patperry/r-utf8/actions)
[![Coverage Status][codecov-badge]][codecov]
[![CRAN Status][cran-badge]][cran]
[![License][apache-badge]][apache]
[![CRAN RStudio Mirror Downloads][cranlogs-badge]][cran]

*utf8* is an R package for manipulating and printing UTF-8 text. It fixes multiple bugs in R's UTF-8 handling.

## Installation

### Stable version

*utf8* is [available on CRAN][cran]. To install the latest released version,
run the following command in R:

```r
install.packages("utf8")
```

### Development version

To install the latest development version, run the following:

```r
devtools::install_github("patperry/r-utf8")
```

## Usage

```{r}
library(utf8)
```

### Validate character data and convert to UTF-8

Use `as_utf8()` to validate input text and convert to UTF-8 encoding. The
function alerts you if the input text has the wrong declared encoding:

```{r, error = TRUE}
# second entry is encoded in latin-1, but declared as UTF-8
x <- c("fa\u00E7ile", "fa\xE7ile", "fa\xC3\xA7ile")
Encoding(x) <- c("UTF-8", "UTF-8", "bytes")
as_utf8(x) # fails

# mark the correct encoding
Encoding(x[2]) <- "latin1"
as_utf8(x) # succeeds
```
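
Before re-declaring encodings, it can help to see which entries are actually malformed. The following is a minimal sketch that uses only base R (`validUTF8()` and `iconv()`, which are not part of *utf8*). It assumes the offending bytes are really latin-1, as in the example above; for real data that assumption needs to be checked.

```r
# same bytes as the example above, without the declared encodings
x <- c("fa\u00E7ile", "fa\xE7ile", "fa\xC3\xA7ile")

# validUTF8() inspects the raw bytes and ignores the declared encoding
ok <- validUTF8(x)
bad <- which(!ok)

# ASSUMPTION: the invalid entries are latin-1; convert only those entries
fixed <- x
fixed[bad] <- iconv(x[bad], from = "latin1", to = "UTF-8")

# every entry should now be well-formed UTF-8
validUTF8(fixed)
```

In a UTF-8 locale, the repaired vector should then pass through `as_utf8()` without an error.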

### Normalize data

Use `utf8_normalize()` to convert to Unicode composed normal form (NFC).
Optionally, set `map_compat = TRUE` to apply compatibility maps (NFKC normal
form), or set `map_case = TRUE` to apply case folding.

```{r}
# three ways to encode an angstrom character
(angstrom <- c("\u00c5", "\u0041\u030a", "\u212b"))
utf8_normalize(angstrom) == "\u00c5"

# perform full Unicode case-folding
utf8_normalize("Größe", map_case = TRUE)

# apply compatibility maps to NFKC normal form
# (example from https://twitter.com/aprilarcus/status/367557195186970624)
utf8_normalize("𝖸𝗈 𝐔𝐧𝐢𝐜𝐨𝐝𝐞 𝗅 𝗁𝖾𝗋𝖽 𝕌 𝗅𝗂𝗄𝖾 𝑡𝑦𝑝𝑒𝑓𝑎𝑐𝑒𝑠 𝗌𝗈 𝗐𝖾 𝗉𝗎𝗍 𝗌𝗈𝗆𝖾 𝚌𝚘𝚍𝚎𝚙𝚘𝚒𝚗𝚝𝚜 𝗂𝗇 𝗒𝗈𝗎𝗋 𝔖𝔲𝔭𝔭𝔩𝔢𝔪𝔢𝔫𝔱𝔞𝔯𝔶 𝔚𝔲𝔩𝔱𝔦𝔩𝔦𝔫𝔤𝔳𝔞𝔩 𝔓𝔩𝔞𝔫𝔢 𝗌𝗈 𝗒𝗈𝗎 𝖼𝖺𝗇 𝓮𝓷𝓬𝓸𝓭𝓮 𝕗𝕠𝕟𝕥𝕤 𝗂𝗇 𝗒𝗈𝗎𝗋 𝒇𝒐𝒏𝒕𝒔.",
               map_compat = TRUE)
```
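
One practical reason to normalize is string comparison: R's `==` compares code points and does not treat canonically equivalent sequences as equal. The sketch below illustrates this with two spellings of "é" chosen for illustration (they are not from the examples above), and assumes *utf8* is installed.

```r
library(utf8)

# "é" as one precomposed code point vs. "e" followed by a combining acute accent
precomposed <- "\u00e9"
decomposed <- "e\u0301"

# the two strings render alike but are different code point sequences
precomposed == decomposed

# after NFC normalization both use the precomposed form and compare equal
utf8_normalize(precomposed) == utf8_normalize(decomposed)

# compatibility mapping (NFKC) expands the single-character ligature U+FB01
utf8_normalize("\ufb01", map_compat = TRUE)
```

Normalizing both sides before comparing, joining, or deduplicating text avoids mismatches of this kind.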

### Print emoji

On some platforms (including macOS), the R implementation of `print()` uses an
outdated version of the Unicode standard to determine which characters are
printable. Use `utf8_print()` for an updated print function:

```{r}
print(intToUtf8(0x1F600 + 0:79)) # with default R print function

utf8_print(intToUtf8(0x1F600 + 0:79)) # with utf8_print, truncates line

utf8_print(intToUtf8(0x1F600 + 0:79), chars = 1000) # higher character limit
```
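
The truncation above happens because `intToUtf8()` returns a single string by default, so all 80 emoji sit in one element. A small sketch, assuming *utf8* is installed, that contrasts this with one element per code point (the variable names are illustrative):

```r
library(utf8)

# default: one string holding all 80 code points
one_string <- intToUtf8(0x1F600 + 0:79)
length(one_string) # 1
nchar(one_string)  # 80

# multiple = TRUE: one element per code point
emoji <- intToUtf8(0x1F600 + 0:9, multiple = TRUE)
length(emoji)      # 10

# print as a vector, and query the display width utf8 assigns to each element
utf8_print(emoji)
utf8_width(emoji)
```

The widths reported by `utf8_width()` depend on whether the output device is treated as UTF-8 capable, so no expected values are shown here.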

## Citation

Cite *utf8* with the following BibTeX entry:

```{r echo = FALSE, comment = NA}
print(suppressWarnings(citation("utf8")), "Bibtex")
```

## Contributing

The project maintainer welcomes contributions in the form of feature requests,
bug reports, comments, unit tests, vignettes, or other code. If you'd like to
contribute, either

- fork the repository and submit a pull request;

- [file an issue][issues];

- or contact the maintainer via e-mail.

This project is released with a [Contributor Code of Conduct][conduct],
and if you choose to contribute, you must adhere to its terms.

[apache]: https://www.apache.org/licenses/LICENSE-2.0.html "Apache License, Version 2.0"
[apache-badge]: https://img.shields.io/badge/License-Apache%202.0-blue.svg "Apache License, Version 2.0"
[building]: #development-version "Building from Source"
[codecov]: https://app.codecov.io/github/patperry/r-utf8?branch=main "Code Coverage"
[codecov-badge]: https://codecov.io/github/patperry/r-utf8/coverage.svg?branch=main "Code Coverage"
[conduct]: https://github.com/patperry/r-utf8/blob/main/CONDUCT.md "Contributor Code of Conduct"
[cran]: https://cran.r-project.org/package=utf8 "CRAN Page"
[cran-badge]: https://www.r-pkg.org/badges/version/utf8 "CRAN Page"
[cranlogs-badge]: https://cranlogs.r-pkg.org/badges/utf8 "CRAN Downloads"
[issues]: https://github.com/patperry/r-utf8/issues "Issues"