https://github.com/paithiov909/audubon
An R package for Japanese text processing
- Host: GitHub
- URL: https://github.com/paithiov909/audubon
- Owner: paithiov909
- License: other
- Created: 2020-09-24T18:18:01.000Z (about 5 years ago)
- Default Branch: master
- Last Pushed: 2023-12-15T02:38:54.000Z (almost 2 years ago)
- Last Synced: 2024-03-15T05:20:27.540Z (over 1 year ago)
- Topics: japanese, javascript, r, r-package
- Language: R
- Homepage: https://paithiov909.github.io/audubon/
- Size: 2.28 MB
- Stars: 8
- Watchers: 2
- Forks: 2
- Open Issues: 3
Metadata Files:
- Readme: README.Rmd
- License: LICENSE.md
README
---
output: github_document
---

```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  fig.path = "man/figures/README-",
  out.width = "100%"
)
pkgload::load_all()
```

[r-universe](https://paithiov909.r-universe.dev)
[GitHub Actions](https://github.com/paithiov909/audubon/actions)
[Codecov](https://app.codecov.io/gh/paithiov909/audubon)
[CRAN](https://cran.r-project.org/package=audubon)

audubon provides Japanese text processing tools for:

- filling Japanese iteration marks
- hiraganization, katakanization and romanization using [hakatashi/japanese.js](https://github.com/hakatashi/japanese.js)
- segmentation by phrase using [google/budoux](https://github.com/google/budoux) and 'TinySegmenter.js'
- text normalization based on the rules for the 'Sudachi' morphological analyzer and 'NEologd' (Neologism dictionary for 'MeCab')

Some of these features are not implemented in 'ICU' (i.e., the stringi package); the goal of the audubon package is to provide them.

## Installation
```r
remotes::install_github("paithiov909/audubon")
```

## Usage

### Fill Japanese iteration marks (Odori-ji)
`strj_fill_iter_mark` repeats the previous character and replaces the iteration marks if the element has more than 5 characters.
You can use this feature with `strj_normalize` or `strj_rewrite_as_def`.

```{r fill_iter_mark}
strj_fill_iter_mark(c(
  "あいうゝ〃かき",
  "金子みすゞ",
  "のたり〳〵かな",
  "しろ／″＼とした"
))

strj_fill_iter_mark("いすゞエルフトラック") |>
  strj_normalize()
```

### Character class conversion

Character class conversion uses [hakatashi/japanese.js](https://github.com/hakatashi/japanese.js).
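
As a rough illustration of what a character class conversion involves, the following base R sketch maps hiragana to katakana by shifting code points. It is not how audubon or japanese.js implements the conversion: `to_katakana_sketch` is a hypothetical helper that relies only on the fact that the Unicode hiragana range U+3041 to U+3096 sits exactly 0x60 below the corresponding katakana, and it makes no attempt at romanization.

```r
# Hypothetical helper (not part of audubon): hiragana -> katakana by code point offset.
to_katakana_sketch <- function(x) {
  vapply(x, function(s) {
    cp <- utf8ToInt(s)                        # string -> integer code points
    is_hira <- cp >= 0x3041L & cp <= 0x3096L  # hiragana letters only
    cp[is_hira] <- cp[is_hira] + 0x60L        # shift into the katakana range
    intToUtf8(cp)                             # code points -> single string
  }, character(1), USE.NAMES = FALSE)
}

# "\u3059\u304d" is the hiragana pair "su" "ki"; other characters pass through unchanged.
to_katakana_sketch(c("\u3059\u304d", "ABC"))
```

The package functions shown next handle the cases this sketch ignores, such as katakana-to-hiragana conversion and romanization.
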
```{r japanesejs}
strj_hiraganize("あのイーハトーヴォのすきとおった風")
strj_katakanize("あのイーハトーヴォのすきとおった風")
strj_romanize("あのイーハトーヴォのすきとおった風")
```

### Segmentation by phrase

`strj_tokenize` splits Japanese text into phrases using [google/budoux](https://github.com/google/budoux), TinySegmenter, or other tokenizers.

```{r tokenize}
strj_tokenize("あのイーハトーヴォのすきとおった風", engine = "budoux")
```

### Japanese text normalization

`strj_normalize` normalizes text following rules based on the [NEologd](https://github.com/neologd/mecab-ipadic-neologd/wiki/Regexp.ja) style.
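
To make the idea concrete, here is a minimal base R sketch of two rules of the kind that NEologd-style normalization applies: folding full-width alphanumerics to ASCII and collapsing runs of spaces. `normalize_sketch` is a hypothetical helper, not the implementation of `strj_normalize`, which covers many more cases (hyphen-like characters, prolonged sound marks, tildes, and so on).

```r
# Hypothetical helper (not part of audubon): two simplified normalization rules.
normalize_sketch <- function(x) {
  # Rule 1: full-width digits and Latin letters -> ASCII (fixed offset of 0xFEE0).
  x <- vapply(x, function(s) {
    cp <- utf8ToInt(s)
    fw <- (cp >= 0xFF10L & cp <= 0xFF19L) |  # full-width 0-9
      (cp >= 0xFF21L & cp <= 0xFF3AL) |      # full-width A-Z
      (cp >= 0xFF41L & cp <= 0xFF5AL)        # full-width a-z
    cp[fw] <- cp[fw] - 0xFEE0L
    intToUtf8(cp)
  }, character(1), USE.NAMES = FALSE)
  # Rule 2: ideographic space -> ASCII space, then collapse runs and trim the ends.
  x <- gsub("\u3000", " ", x, fixed = TRUE)
  x <- gsub(" +", " ", x)
  trimws(x)
}

# Full-width "Sp", two ideographic spaces, full-width "1".
normalize_sketch("\uff33\uff50\u3000\u3000\uff11")
```
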
```{r normalize}
strj_normalize("――南アルプスの 天然水- Sparking* Lemon+ レモン一絞り")
```

`strj_rewrite_as_def` is an R port of [SudachiCharNormalizer](https://gist.github.com/sorami/bde9d441a147e0fc2e6e5fdd83f4f770) that normalizes characters following the rules in a `*.def` file.
The audubon package bundles several `*.def` files, so you can use them or write your own 'rewrite.def' file as follows.

```
# single characters will **never** be normalized.
…
# if two characters are separated with a tab,
# left side forms are always rewritten to right side forms
# before normalization.
斎	斉
齋	斉
齊	斉
# only rewriting a single character to a single character is supported,
# i.e., this does not work.
アッ	ア
```

This feature is more powerful than `stringi::stri_trans_*` because it allows users to control which characters are normalized. For instance, this function can be used to convert _kyuji-tai_ characters to _shinji-tai_ characters.

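
The kind of control a def file gives can be sketched in base R as follows. This is a simplified illustration under stated assumptions, not audubon's parser: `parse_def_sketch` and `rewrite_sketch` are hypothetical names, `toupper` merely stands in for the real normalization step (such as NFKC), and the way replaced characters interact with that step is a simplification.

```r
# Hypothetical helpers (not part of audubon) for a rewrite.def-like format.
parse_def_sketch <- function(lines) {
  lines <- lines[nzchar(lines) & !startsWith(lines, "#")]  # drop blanks and comments
  parts <- strsplit(lines, "\t", fixed = TRUE)
  pairs <- parts[lengths(parts) == 2L]
  list(
    ignore = unlist(parts[lengths(parts) == 1L], use.names = FALSE),  # never normalized
    from = vapply(pairs, function(p) p[1], character(1)),
    to = vapply(pairs, function(p) p[2], character(1))
  )
}

rewrite_sketch <- function(x, def, normalize = toupper) {
  vapply(x, function(s) {
    ch <- intToUtf8(utf8ToInt(s), multiple = TRUE)  # split into single characters
    hit <- match(ch, def$from)                      # multi-character keys never match
    replaced <- !is.na(hit)
    ch[replaced] <- def$to[hit[replaced]]
    free <- !replaced & !(ch %in% def$ignore)       # neither rewritten nor protected
    ch[free] <- normalize(ch[free])
    paste(ch, collapse = "")
  }, character(1), USE.NAMES = FALSE)
}

def <- parse_def_sketch(c(
  "# lines starting with '#' are comments",
  "a",                   # protected from normalization
  "\u9f4b\t\u6589",      # one of the kanji pairs from the def example above
  "\u30a2\u30c3\t\u30a2" # two characters on the left: has no effect
))
rewrite_sketch("a\u9f4bb", def)
```

Here "a" is left alone because it is on the ignore list, the kanji is rewritten through the tab-separated pair, and "b" goes through the stand-in normalizer.
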
```{r rewrite_as_def}
stringi::stri_trans_nfkc("Ⅹⅳ")
strj_rewrite_as_def("Ⅹⅳ")
strj_rewrite_as_def("惡と假面のルール", read_rewrite_def(system.file("def/kyuji.def", package = "audubon")))
```

## License

© 2024 Akiru Kato
Licensed under the [Apache License, Version 2.0](https://www.apache.org/licenses/LICENSE-2.0.html).
Icons made by iconixar from flaticon.