https://github.com/uribo/sudachir
R Interface to 'Sudachi'
- Host: GitHub
- URL: https://github.com/uribo/sudachir
- Owner: uribo
- License: other
- Created: 2020-10-23T05:03:30.000Z (over 5 years ago)
- Default Branch: main
- Last Pushed: 2023-02-01T00:10:04.000Z (over 3 years ago)
- Last Synced: 2025-10-18T21:55:25.080Z (8 months ago)
- Topics: japanese-language, nlp, rpackage
- Language: R
- Homepage: https://uribo.github.io/sudachir/
- Size: 268 KB
- Stars: 6
- Watchers: 1
- Forks: 1
- Open Issues: 2
Metadata Files:
- Readme: README.Rmd
- Funding: .github/FUNDING.yml
- License: LICENSE.md
README
---
output: github_document
---
```{r, include = FALSE}
knitr::opts_chunk$set(
collapse = TRUE,
comment = "#>",
fig.path = "man/figures/README-",
out.width = "100%"
)
```
SudachiR is an R version of [Sudachi](https://github.com/WorksApplications/sudachi.rs), a Japanese morphological analyzer.
[CRAN status](https://CRAN.R-project.org/package=sudachir)
[GitHub Actions](https://github.com/uribo/sudachir/actions)
[Lifecycle: experimental](https://www.tidyverse.org/lifecycle/#experimental)
## Installation
You can install the released version of `{sudachir}` from CRAN with:
``` r
install.packages("sudachir")
```
You can also install the development version from GitHub:
``` r
if (!requireNamespace("remotes")) {
  install.packages("remotes")
}
remotes::install_github("uribo/sudachir")
```
## Usage
### Set up 'r-sudachipy' environment
`{sudachir}` works with [sudachipy](https://github.com/WorksApplications/sudachi.rs/tree/develop/python) (>= 0.6.\*) via the [reticulate](https://github.com/rstudio/reticulate/) package.
To get started, you need a Python environment with sudachipy and its dictionaries installed.
The `install_sudachipy` function prepares a Python virtual environment and installs the required modules (`sudachipy`, `sudachidict_core`, `pandas`) for you; you can also install them manually.
```{r}
library(reticulate)
library(sudachir)
if (!virtualenv_exists("r-sudachipy")) {
  install_sudachipy()
}
use_virtualenv("r-sudachipy", required = TRUE)
```
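If you prefer the manual route mentioned above, the same environment can be prepared with reticulate directly. The following is a minimal sketch, assuming the three modules are installable from PyPI under the names listed above; `setup_sudachipy_manually` is a hypothetical helper name and not part of `{sudachir}`.

``` r
# Hypothetical helper (not part of {sudachir}): create the virtual
# environment by hand and install the modules named above.
setup_sudachipy_manually <- function(envname = "r-sudachipy") {
  # Create the environment only if it does not exist yet
  if (!reticulate::virtualenv_exists(envname)) {
    reticulate::virtualenv_create(envname)
  }
  # Install the Python modules that {sudachir} relies on
  reticulate::virtualenv_install(
    envname,
    packages = c("sudachipy", "sudachidict_core", "pandas")
  )
  invisible(envname)
}

# Usage (needs a working Python installation):
# setup_sudachipy_manually()
# reticulate::use_virtualenv("r-sudachipy", required = TRUE)
```

Keeping the environment name as an argument lets you reuse the helper for a differently named environment, as long as you pass the same name to `use_virtualenv`.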
### Tokenize sentences
Use `tokenize_to_df` for tokenization.
```{r}
txt <- c(
  "国家公務員は鳴門海峡に行きたい",
  "吾輩は猫である。\n名前はまだない。"
)
tokenize_to_df(data.frame(doc_id = c(1, 2), text = txt))
```
You can control which dictionary features are parsed using the `col_select` argument.
```{r}
tokenize_to_df(txt, col_select = 1:3) |>
  dplyr::glimpse()
tokenize_to_df(
  txt,
  into = dict_features("en"),
  col_select = c("pos1", "pos2")
) |>
  dplyr::glimpse()
```
The `as_tokens` function tidies tokens and their first part-of-speech information into a list of named tokens. You can also use the `form` function as a shorthand for `tokenize_to_df(txt) |> as_tokens()`.
```{r}
tokenize_to_df(txt) |> as_tokens(type = "surface")
form(txt, type = "surface")
form(txt, type = "normalized")
form(txt, type = "dictionary")
form(txt, type = "reading")
```
### Change split mode
```{r}
tokenize_to_df(txt, instance = rebuild_tokenizer("B")) |>
  as_tokens("surface", pos = FALSE)
tokenize_to_df(txt, instance = rebuild_tokenizer("A")) |>
  as_tokens("surface", pos = FALSE)
```
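Sudachi offers three split modes: A yields the shortest units, C the longest (closest to named entities), and B sits in between. To compare their effect on the same text, the calls above can be wrapped in a loop. This is a minimal sketch; `compare_split_modes` is a hypothetical helper built only from the functions already shown in this README.

``` r
# Hypothetical helper: tokenize the same text under several split modes
# and return the surface forms, one list element per mode.
compare_split_modes <- function(text, modes = c("A", "B", "C")) {
  results <- lapply(modes, function(mode) {
    tokenize_to_df(text, instance = rebuild_tokenizer(mode)) |>
      as_tokens("surface", pos = FALSE)
  })
  # Name each result after the mode that produced it
  setNames(results, modes)
}

# Usage (requires the 'r-sudachipy' environment set up earlier):
# compare_split_modes(txt)
```

Returning a list named by mode makes it easy to inspect one mode at a time, for example `compare_split_modes(txt)[["A"]]`.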
### Change dictionary edition
You can change dictionary options, such as the dictionary edition, using the `rebuild_tokenizer` function.
```{r}
if (py_module_available("sudachidict_full")) {
  tokenizer_full <- rebuild_tokenizer(mode = "C", dict_type = "full")
  tokenize_to_df(txt, instance = tokenizer_full) |>
    as_tokens("surface", pos = FALSE)
}
```
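The chunk above does nothing when `sudachidict_full` is not installed. One way to make it available is to install it into the same virtual environment with reticulate. This is a sketch under two assumptions: that the dictionary can be installed from PyPI under the module name used above, and that the environment is called `r-sudachipy`. `ensure_full_dict` is a hypothetical helper name.

``` r
# Hypothetical helper (not part of {sudachir}): install the full
# dictionary into the virtual environment if Python cannot find it.
ensure_full_dict <- function(envname = "r-sudachipy") {
  if (!reticulate::py_module_available("sudachidict_full")) {
    # Assumption: the PyPI distribution name matches the module name
    reticulate::virtualenv_install(envname, packages = "sudachidict_full")
  }
  invisible(envname)
}

# Usage (requires the 'r-sudachipy' environment to be active):
# ensure_full_dict()
# tokenizer_full <- rebuild_tokenizer(mode = "C", dict_type = "full")
```

If the module is still not found after installation, restarting the R session before calling `rebuild_tokenizer` may help, since reticulate binds to one Python process per session.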
