An open API service indexing awesome lists of open source software.

https://github.com/uribo/sudachir

R Interface to 'Sudachi'
https://github.com/uribo/sudachir

japanese-language nlp rpackage

Last synced: 8 months ago
JSON representation

R Interface to 'Sudachi'

Awesome Lists containing this project

README

          

---
output: github_document
---

```{r, include = FALSE}
knitr::opts_chunk$set(
collapse = TRUE,
comment = "#>",
fig.path = "man/figures/README-",
out.width = "100%"
)
```

# sudachir

SudachiR is an R version of [Sudachi](https://github.com/WorksApplications/sudachi.rs), a Japanese morphological analyzer.

[![CRAN status](https://www.r-pkg.org/badges/version/sudachir)](https://CRAN.R-project.org/package=sudachir)
[![R build status](https://github.com/uribo/sudachir/workflows/R-CMD-check/badge.svg)](https://github.com/uribo/sudachir/actions)
[![Lifecycle: experimental](https://img.shields.io/badge/lifecycle-experimental-orange.svg)](https://www.tidyverse.org/lifecycle/#experimental)

## Installation

You can install the released version of `{sudachir}` from CRAN with:

``` r
install.packages("sudachir")
```

and also, the developmment version from GitHub.

``` r
if (!requireNamespace("remotes"))
install.packages("remotes")

remotes::install_github("uribo/sudachir")
```

## Usage

### Set up 'r-sudachipy' environment

`{sudachir}` works with [sudachipy](https://github.com/WorksApplications/sudachi.rs/tree/develop/python) (>= 0.6.\*) via the [reticulate](https://github.com/rstudio/reticulate/) package.

To get started, it requires a Python environment that has sudachipy and its dictionaries already installed and available.

This package provides a function `install_sudachipy` which helps users prepare a Python virtual environment. The desired modules (`sudachipy`, `sudachidict_core`, `pandas`) can be installed with this function, but can also be installed manually.

```{r}
library(reticulate)
library(sudachir)

if (!virtualenv_exists("r-sudachipy")) {
install_sudachipy()
}

use_virtualenv("r-sudachipy", required = TRUE)
```

### Tokenize sentences

Use `tokenize_to_df` for tokenization.

```{r}
txt <- c(
"国家公務員は鳴門海峡に行きたい",
"吾輩は猫である。\n名前はまだない。"
)
tokenize_to_df(data.frame(doc_id = c(1, 2), text = txt))
```

You can control which dictionary features are parsed using the `col_select` argument.

```{r}
tokenize_to_df(txt, col_select = 1:3) |>
dplyr::glimpse()

tokenize_to_df(
txt,
into = dict_features("en"),
col_select = c("pos1", "pos2")
) |>
dplyr::glimpse()
```

The `as_tokens` function can tidy up tokens and the first part-of-speech informations into a list of named tokens. Also, you can use the `form` function as a shorthand of `tokenize_to_df(txt) |> as_tokens()`.

```{r}
tokenize_to_df(txt) |> as_tokens(type = "surface")

form(txt, type = "surface")
form(txt, type = "normalized")
form(txt, type = "dictionary")
form(txt, type = "reading")
```

### Change split mode

```{r}
tokenize_to_df(txt, instance = rebuild_tokenizer("B")) |>
as_tokens("surface", pos = FALSE)

tokenize_to_df(txt, instance = rebuild_tokenizer("A")) |>
as_tokens("surface", pos = FALSE)
```

### Change dictionary edition

You can touch dictionary options using the `rebuild_tokenizer` function.

```{r}
if (py_module_available("sudachidict_full")) {
tokenizer_full <- rebuild_tokenizer(mode = "C", dict_type = "full")
tokenize_to_df(txt, instance = tokenizer_full) |>
as_tokens("surface", pos = FALSE)
}
```