https://github.com/krlmlr/r-utf8
UTF-8 Text Processing (R Package)
https://github.com/krlmlr/r-utf8
Last synced: 19 days ago
JSON representation
UTF-8 Text Processing (R Package)
- Host: GitHub
- URL: https://github.com/krlmlr/r-utf8
- Owner: krlmlr
- License: apache-2.0
- Created: 2017-10-19T13:40:58.000Z (over 7 years ago)
- Default Branch: main
- Last Pushed: 2025-05-05T20:54:51.000Z (20 days ago)
- Last Synced: 2025-05-05T21:49:48.658Z (19 days ago)
- Language: C
- Homepage: https://ptrckprry.com/r-utf8/
- Size: 2.85 MB
- Stars: 112
- Watchers: 4
- Forks: 4
- Open Issues: 5
-
Metadata Files:
- Readme: README.Rmd
- Changelog: NEWS.md
- License: LICENSE
Awesome Lists containing this project
README
---
output:
github_document:
html_preview: false
---```{r, include = FALSE}
knitr::opts_chunk$set(
collapse = TRUE,
comment = "#>",
fig.path = "man/figures/README-",
out.width = "100%"
)pkgload::load_all()
set.seed(20230702)
clean_output <- function(x, options) {
x <- gsub("0x[0-9a-f]+", "0xdeadbeef", x)
x <- gsub("dataframe_[0-9]*_[0-9]*", " dataframe_42_42 ", x)
x <- gsub("[0-9]*\\.___row_number ASC", "42.___row_number ASC", x)index <- x
index <- gsub("โ", "-", index)
index <- strsplit(paste(index, collapse = "\n"), "\n---\n")[[1]][[2]]
writeLines(index, "index.md")x <- fansi::strip_sgr(x)
x
}options(
cli.num_colors = 256,
cli.width = 80,
width = 80,
pillar.bold = TRUE
)local({
hook_source <- knitr::knit_hooks$get("document")
knitr::knit_hooks$set(document = clean_output)
})
```# utf8
[](https://github.com/patperry/r-utf8/actions)
[![Coverage Status][codecov-badge]][codecov]
[![CRAN Status][cran-badge]][cran]
[![License][apache-badge]][apache]
[![CRAN RStudio Mirror Downloads][cranlogs-badge]][cran]*utf8* is an R package for manipulating and printing UTF-8 text that fixes multiple bugs in R's UTF-8 handling.
## Installation
### Stable version
*utf8* is [available on CRAN][cran]. To install the latest released version,
run the following command in R:```r
install.packages("utf8")
```### Development version
To install the latest development version, run the following:
```r
devtools::install_github("patperry/r-utf8")
```## Usage
```{r}
library(utf8)
```### Validate character data and convert to UTF-8
Use `as_utf8()` to validate input text and convert to UTF-8 encoding. The
function alerts you if the input text has the wrong declared encoding:```{r, error = TRUE}
# second entry is encoded in latin-1, but declared as UTF-8
x <- c("fa\u00E7ile", "fa\xE7ile", "fa\xC3\xA7ile")
Encoding(x) <- c("UTF-8", "UTF-8", "bytes")
as_utf8(x) # fails# mark the correct encoding
Encoding(x[2]) <- "latin1"
as_utf8(x) # succeeds
```### Normalize data
Use `utf8_normalize()` to convert to Unicode composed normal form (NFC).
Optionally apply compatibility maps for NFKC normal form or case-fold.```{r}
# three ways to encode an angstrom character
(angstrom <- c("\u00c5", "\u0041\u030a", "\u212b"))
utf8_normalize(angstrom) == "\u00c5"# perform full Unicode case-folding
utf8_normalize("Grรถรe", map_case = TRUE)# apply compatibility maps to NFKC normal form
# (example from https://twitter.com/aprilarcus/status/367557195186970624)
utf8_normalize("๐ธ๐ ๐๐ง๐ข๐๐จ๐๐ ๐ ๐๐พ๐๐ฝ ๐ ๐ ๐๐๐พ ๐ก๐ฆ๐๐๐๐๐๐๐ ๐๐ ๐๐พ ๐๐๐ ๐๐๐๐พ ๐๐๐๐๐๐๐๐๐๐ ๐๐ ๐๐๐๐ ๐๐ฒ๐ญ๐ญ๐ฉ๐ข๐ช๐ข๐ซ๐ฑ๐๐ฏ๐ถ ๐๐ฒ๐ฉ๐ฑ๐ฆ๐ฉ๐ฆ๐ซ๐ค๐ณ๐๐ฉ ๐๐ฉ๐๐ซ๐ข ๐๐ ๐๐๐ ๐ผ๐บ๐ ๐ฎ๐ท๐ฌ๐ธ๐ญ๐ฎ ๐๐ ๐๐ฅ๐ค ๐๐ ๐๐๐๐ ๐๐๐๐๐.",
map_compat = TRUE)
```### Print emoji
On some platforms (including MacOS), the R implementation of `print()` uses an
outdated version of the Unicode standard to determine which characters are
printable. Use `utf8_print()` for an updated print function:```{r}
print(intToUtf8(0x1F600 + 0:79)) # with default R print functionutf8_print(intToUtf8(0x1F600 + 0:79)) # with utf8_print, truncates line
utf8_print(intToUtf8(0x1F600 + 0:79), chars = 1000) # higher character limit
```## Citation
Cite *utf8* with the following BibTeX entry:
```{r echo = FALSE, comment = NA}
print(suppressWarnings(citation("utf8")), "Bibtex")
```## Contributing
The project maintainer welcomes contributions in the form of feature requests,
bug reports, comments, unit tests, vignettes, or other code. If you'd like to
contribute, either- fork the repository and submit a pull request
- [file an issue][issues];
- or contact the maintainer via e-mail.
This project is released with a [Contributor Code of Conduct][conduct],
and if you choose to contribute, you must adhere to its terms.[apache]: https://www.apache.org/licenses/LICENSE-2.0.html "Apache License, Version 2.0"
[apache-badge]: https://img.shields.io/badge/License-Apache%202.0-blue.svg "Apache License, Version 2.0"
[building]: #development-version "Building from Source"
[codecov]: https://app.codecov.io/github/patperry/r-utf8?branch=main "Code Coverage"
[codecov-badge]: https://codecov.io/github/patperry/r-utf8/coverage.svg?branch=main "Code Coverage"
[conduct]: https://github.com/patperry/r-utf8/blob/main/CONDUCT.md "Contributor Code of Conduct"
[cran]: https://cran.r-project.org/package=utf8 "CRAN Page"
[cran-badge]: https://www.r-pkg.org/badges/version/utf8 "CRAN Page"
[cranlogs-badge]: https://cranlogs.r-pkg.org/badges/utf8 "CRAN Downloads"
[issues]: https://github.com/patperry/r-utf8/issues "Issues"