UTF-8 Text Processing (R Package)
https://github.com/patperry/r-utf8


---
output: downlit::readme_document
---

```{r, echo = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  fig.path = "README-"
)
options(width = 95)
```

utf8
====

[![rcc](https://github.com/patperry/r-utf8/workflows/rcc/badge.svg)](https://github.com/patperry/r-utf8/actions)
[![Coverage Status][codecov-badge]][codecov]
[![CRAN Status][cran-badge]][cran]
[![License][apache-badge]][apache]
[![CRAN RStudio Mirror Downloads][cranlogs-badge]][cran]

*utf8* is an R package for manipulating and printing UTF-8 text that fixes
[multiple][windows-enc2utf8] [bugs][emoji-print] in R's UTF-8 handling.

Installation
------------

### Stable version

*utf8* is [available on CRAN][cran]. To install the latest released version,
run the following command in R:

```r
install.packages("utf8")
```

### Development version

To install the latest development version, run the following:

```r
devtools::install_github("patperry/r-utf8")
```

Usage
-----

```{r}
library(utf8)
```

### Validate character data and convert to UTF-8

Use `as_utf8()` to validate input text and convert to UTF-8 encoding. The
function alerts you if the input text has the wrong declared encoding:

```{r, error = TRUE}
# second entry is encoded in latin-1, but declared as UTF-8
x <- c("fa\u00E7ile", "fa\xE7ile", "fa\xC3\xA7ile")
Encoding(x) <- c("UTF-8", "UTF-8", "bytes")
as_utf8(x) # fails

# mark the correct encoding
Encoding(x[2]) <- "latin1"
as_utf8(x) # succeeds
```
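
When the offending entries are not known in advance, one possible approach is to locate them with `utf8_valid()` before converting. The sketch below assumes, purely for illustration, that every invalid entry is really latin-1; real data may need a different fallback encoding.

```r
library(utf8)

# Two copies of the same word: one valid UTF-8, one latin-1 bytes mislabeled as UTF-8
x <- c("fa\u00E7ile", "fa\xE7ile")
Encoding(x) <- c("UTF-8", "UTF-8")

# utf8_valid() reports, element by element, which entries are valid UTF-8
valid <- utf8_valid(x)
which(!valid)

# Assumption for this sketch: every invalid entry is actually latin-1
Encoding(x[!valid]) <- "latin1"

# With the declared encodings corrected, the conversion goes through
y <- as_utf8(x)
```

Checking validity first keeps the relabeling step explicit, so a wrong guess about the source encoding is easier to spot than it would be inside a blanket conversion.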

### Normalize data

Use `utf8_normalize()` to convert to Unicode composed normal form (NFC).
Optionally, set `map_compat = TRUE` to apply compatibility maps (NFKC normal
form) or `map_case = TRUE` to apply case folding.

```{r}
# three ways to encode an angstrom character
(angstrom <- c("\u00c5", "\u0041\u030a", "\u212b"))
utf8_normalize(angstrom) == "\u00c5"

# perform full Unicode case-folding
utf8_normalize("Größe", map_case = TRUE)

# apply compatibility maps to NFKC normal form
# (example from https://twitter.com/aprilarcus/status/367557195186970624)
utf8_normalize("𝖸𝗈 𝐔𝐧𝐢𝐜𝐨𝐝𝐞 𝗅 𝗁𝖾𝗋𝖽 𝕌 𝗅𝗂𝗄𝖾 𝑡𝑦𝑝𝑒𝑓𝑎𝑐𝑒𝑠 𝗌𝗈 𝗐𝖾 𝗉𝗎𝗍 𝗌𝗈𝗆𝖾 𝚌𝚘𝚍𝚎𝚙𝚘𝚒𝚗𝚝𝚜 𝗂𝗇 𝗒𝗈𝗎𝗋 𝔖𝔲𝔭𝔭𝔩𝔢𝔪𝔢𝔫𝔱𝔞𝔯𝔶 𝔚𝔲𝔩𝔱𝔦𝔩𝔦𝔫𝔤𝔳𝔞𝔩 𝔓𝔩𝔞𝔫𝔢 𝗌𝗈 𝗒𝗈𝗎 𝖼𝖺𝗇 𝓮𝓷𝓬𝓸𝓭𝓮 𝕗𝕠𝕟𝕥𝕤 𝗂𝗇 𝗒𝗈𝗎𝗋 𝒇𝒐𝒏𝒕𝒔.",
               map_compat = TRUE)
```
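
Normalization matters most when strings are compared or used as keys. The following is a minimal sketch of that use, reusing the angstrom example; the variable names are illustrative only.

```r
library(utf8)

# Three visually identical strings with different code point sequences
angstrom <- c("\u00c5", "\u0041\u030a", "\u212b")

# Without normalization, they count as three distinct values
length(unique(angstrom))

# After NFC normalization, they collapse to a single value
nfc <- utf8_normalize(angstrom)
length(unique(nfc))

# Case folding gives a caseless comparison key for two spellings of one word
folded <- utf8_normalize(c("Gr\u00f6\u00dfe", "GR\u00d6SSE"), map_case = TRUE)
folded[1] == folded[2]
```

Normalizing once at the point where text enters a pipeline tends to be simpler than normalizing at every comparison.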

### Print emoji

On some platforms (including macOS), R's implementation of `print()` uses an
outdated version of the Unicode standard to determine which characters are
printable. Use `utf8_print()` for an updated print function:

```{r}
print(intToUtf8(0x1F600 + 0:79)) # with default R print function

utf8_print(intToUtf8(0x1F600 + 0:79)) # with utf8_print, truncates line

utf8_print(intToUtf8(0x1F600 + 0:79), chars = 1000) # higher character limit
```
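
Whether emoji display at all depends on the output device. As a hedged sketch, a script could check `output_utf8()` and fall back to escaped text from `utf8_encode()` when UTF-8 output is unavailable; the choice of fallback here is an assumption, not a recommendation from the package.

```r
library(utf8)

# 80 one-character strings, one per emoji code point
emoji <- intToUtf8(0x1F600 + 0:79, multiple = TRUE)

# output_utf8() reports whether the current output can display UTF-8
if (output_utf8()) {
  utf8_print(emoji[1:10])
} else {
  # fall back to escaped code points on non-UTF-8 output
  print(utf8_encode(emoji[1:10]))
}

# display width of each string, as computed by the package
widths <- utf8_width(emoji[1:10])
```

The reported widths can differ between environments, since characters that cannot be displayed are measured in their escaped form.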

Citation
--------

Cite *utf8* with the following BibTeX entry:

```{r echo = FALSE, comment = NA}
print(suppressWarnings(citation("utf8")), "Bibtex")
```

Contributing
------------

The project maintainer welcomes contributions in the form of feature requests,
bug reports, comments, unit tests, vignettes, or other code. If you'd like to
contribute, either

- fork the repository and submit a pull request;

- [file an issue][issues];

- or contact the maintainer via e-mail.

This project is released with a [Contributor Code of Conduct][conduct],
and if you choose to contribute, you must adhere to its terms.

[apache]: https://www.apache.org/licenses/LICENSE-2.0.html "Apache License, Version 2.0"
[apache-badge]: https://img.shields.io/badge/License-Apache%202.0-blue.svg "Apache License, Version 2.0"
[building]: #development-version "Building from Source"
[codecov]: https://codecov.io/github/patperry/r-utf8?branch=master "Code Coverage"
[codecov-badge]: https://codecov.io/github/patperry/r-utf8/coverage.svg?branch=master "Code Coverage"
[conduct]: https://github.com/patperry/r-utf8/blob/master/CONDUCT.md "Contributor Code of Conduct"
[cran]: https://cran.r-project.org/package=utf8 "CRAN Page"
[cran-badge]: https://www.r-pkg.org/badges/version/utf8 "CRAN Page"
[cranlogs-badge]: https://cranlogs.r-pkg.org/badges/utf8 "CRAN Downloads"
[emoji-print]: https://twitter.com/ptrckprry/status/887732831161425920 "MacOS Emoji Printing"
[issues]: https://github.com/patperry/r-utf8/issues "Issues"
[windows-enc2utf8]: https://twitter.com/ptrckprry/status/901494853758054401 "Windows enc2utf8 Bug"