Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://github.com/patperry/r-utf8
UTF-8 Text Processing (R Package)
Last synced: 3 months ago
- Host: GitHub
- URL: https://github.com/patperry/r-utf8
- Owner: patperry
- License: apache-2.0
- Created: 2017-10-19T13:40:58.000Z (over 7 years ago)
- Default Branch: main
- Last Pushed: 2024-01-24T00:49:21.000Z (12 months ago)
- Last Synced: 2024-05-08T16:02:46.299Z (9 months ago)
- Language: C
- Size: 3.55 MB
- Stars: 113
- Watchers: 4
- Forks: 4
- Open Issues: 5
Metadata Files:
- Readme: README.Rmd
- License: LICENSE
Awesome Lists containing this project
- awesome-R - utf8 - Manipulating and printing UTF-8 text that fixes multiple bugs in R's UTF-8 handling. ![utf8](https://cranlogs.r-pkg.org/badges/utf8) (2017)
README
---
output: downlit::readme_document
---

```{r, echo = FALSE}
knitr::opts_chunk$set(
collapse = TRUE,
comment = "#>",
fig.path = "README-"
)
options(width = 95)
```

utf8
====

[![rcc](https://github.com/patperry/r-utf8/workflows/rcc/badge.svg)](https://github.com/patperry/r-utf8/actions)
[![Coverage Status][codecov-badge]][codecov]
[![CRAN Status][cran-badge]][cran]
[![License][apache-badge]][apache]
[![CRAN RStudio Mirror Downloads][cranlogs-badge]][cran]

*utf8* is an R package for manipulating and printing UTF-8 text that fixes
[multiple][windows-enc2utf8] [bugs][emoji-print] in R's UTF-8 handling.

Installation
------------

### Stable version

*utf8* is [available on CRAN][cran]. To install the latest released version,
run the following command in R:

```r
install.packages("utf8")
```

### Development version

To install the latest development version, run the following:
```r
devtools::install_github("patperry/r-utf8")
```

Usage
-----

```{r}
library(utf8)
```

### Validate character data and convert to UTF-8

Use `as_utf8()` to validate input text and convert to UTF-8 encoding. The
function alerts you if the input text has the wrong declared encoding:

```{r, error = TRUE}
# second entry is encoded in latin-1, but declared as UTF-8
x <- c("fa\u00E7ile", "fa\xE7ile", "fa\xC3\xA7ile")
Encoding(x) <- c("UTF-8", "UTF-8", "bytes")
as_utf8(x) # fails

# mark the correct encoding
Encoding(x[2]) <- "latin1"
as_utf8(x) # succeeds
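
# A small sketch, not part of the original example: utf8_valid() reports,
# element by element, whether each string can be translated to valid UTF-8,
# which may help locate offending entries before calling as_utf8().
# The vector `y` is a hypothetical input introduced only for illustration.
y <- c("fa\u00E7ile", "fa\xE7ile")
Encoding(y) <- "UTF-8"  # the second entry is really latin-1
utf8_valid(y)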
```

### Normalize data

Use `utf8_normalize()` to convert to Unicode composed normal form (NFC).
Optionally apply compatibility maps for NFKC normal form or case-fold.

```{r}
# three ways to encode an angstrom character
(angstrom <- c("\u00c5", "\u0041\u030a", "\u212b"))
utf8_normalize(angstrom) == "\u00c5"

# perform full Unicode case-folding
utf8_normalize("Größe", map_case = TRUE)

# apply compatibility maps to NFKC normal form
# (example from https://twitter.com/aprilarcus/status/367557195186970624)
utf8_normalize("๐ธ๐ ๐๐ง๐ข๐๐จ๐๐ ๐ ๐๐พ๐๐ฝ ๐ ๐ ๐๐๐พ ๐ก๐ฆ๐๐๐๐๐๐๐ ๐๐ ๐๐พ ๐๐๐ ๐๐๐๐พ ๐๐๐๐๐๐๐๐๐๐ ๐๐ ๐๐๐๐ ๐๐ฒ๐ญ๐ญ๐ฉ๐ข๐ช๐ข๐ซ๐ฑ๐๐ฏ๐ถ ๐๐ฒ๐ฉ๐ฑ๐ฆ๐ฉ๐ฆ๐ซ๐ค๐ณ๐๐ฉ ๐๐ฉ๐๐ซ๐ข ๐๐ ๐๐๐ ๐ผ๐บ๐ ๐ฎ๐ท๐ฌ๐ธ๐ญ๐ฎ ๐๐ ๐๐ฅ๐ค ๐๐ ๐๐๐๐ ๐๐๐๐๐.",
map_compat = TRUE)
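
# A further sketch, assuming the `map_quote` and `remove_ignorable` arguments
# of utf8_normalize(); the input strings are made up for illustration.
# map_quote replaces curly apostrophes with the ASCII apostrophe, and
# remove_ignorable drops default-ignorable characters such as soft hyphens.
utf8_normalize("don\u2019t", map_quote = TRUE)
utf8_normalize("co\u00ADoperate", remove_ignorable = TRUE)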
```

### Print emoji

On some platforms (including MacOS), the R implementation of `print()` uses an
outdated version of the Unicode standard to determine which characters are
printable. Use `utf8_print()` for an updated print function:

```{r}
print(intToUtf8(0x1F600 + 0:79)) # with default R print function

utf8_print(intToUtf8(0x1F600 + 0:79)) # with utf8_print, truncates line
utf8_print(intToUtf8(0x1F600 + 0:79), chars = 1000) # higher character limit
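
# Related sketch, assuming utf8_width() from this package: it returns the
# number of columns each string is expected to occupy when printed, which can
# be useful when aligning emoji output. `emoji` is a hypothetical input.
emoji <- intToUtf8(0x1F600 + 0:3, multiple = TRUE)
utf8_width(emoji)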
```

Citation
--------

Cite *utf8* with the following BibTeX entry:

```{r echo = FALSE, comment = NA}
print(suppressWarnings(citation("utf8")), "Bibtex")
```

Contributing
------------

The project maintainer welcomes contributions in the form of feature requests,
bug reports, comments, unit tests, vignettes, or other code. If you'd like to
contribute, either

- fork the repository and submit a pull request;
- [file an issue][issues];
- or contact the maintainer via e-mail.

This project is released with a [Contributor Code of Conduct][conduct],
and if you choose to contribute, you must adhere to its terms.

[apache]: https://www.apache.org/licenses/LICENSE-2.0.html "Apache License, Version 2.0"
[apache-badge]: https://img.shields.io/badge/License-Apache%202.0-blue.svg "Apache License, Version 2.0"
[building]: #development-version "Building from Source"
[codecov]: https://codecov.io/github/patperry/r-utf8?branch=master "Code Coverage"
[codecov-badge]: https://codecov.io/github/patperry/r-utf8/coverage.svg?branch=master "Code Coverage"
[conduct]: https://github.com/patperry/r-utf8/blob/master/CONDUCT.md "Contributor Code of Conduct"
[cran]: https://cran.r-project.org/package=utf8 "CRAN Page"
[cran-badge]: https://www.r-pkg.org/badges/version/utf8 "CRAN Page"
[cranlogs-badge]: https://cranlogs.r-pkg.org/badges/utf8 "CRAN Downloads"
[emoji-print]: https://twitter.com/ptrckprry/status/887732831161425920 "MacOS Emoji Printing"
[issues]: https://github.com/patperry/r-utf8/issues "Issues"
[windows-enc2utf8]: https://twitter.com/ptrckprry/status/901494853758054401 "Windows enc2utf8 Bug"