https://github.com/uribo/sudachir

R Interface to 'Sudachi'

japanese-language nlp rpackage



Host: GitHub
URL: https://github.com/uribo/sudachir
Owner: uribo
License: other
Created: 2020-10-23T05:03:30.000Z (over 5 years ago)
Default Branch: main
Last Pushed: 2023-02-01T00:10:04.000Z (over 3 years ago)
Last Synced: 2025-10-18T21:55:25.080Z (8 months ago)
Topics: japanese-language, nlp, rpackage
Language: R
Homepage: https://uribo.github.io/sudachir/
Size: 268 KB
Stars: 6
Watchers: 1
Forks: 1
Open Issues: 2
Metadata Files:
- Readme: README.Rmd
- Funding: .github/FUNDING.yml
- License: LICENSE.md

---
output: github_document
---

```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  fig.path = "man/figures/README-",
  out.width = "100%"
)
```

# sudachir 

sudachir is an R interface to [Sudachi](https://github.com/WorksApplications/sudachi.rs), a Japanese morphological analyzer.

[![CRAN status](https://www.r-pkg.org/badges/version/sudachir)](https://CRAN.R-project.org/package=sudachir)

[![R build status](https://github.com/uribo/sudachir/workflows/R-CMD-check/badge.svg)](https://github.com/uribo/sudachir/actions)

[![Lifecycle: experimental](https://img.shields.io/badge/lifecycle-experimental-orange.svg)](https://www.tidyverse.org/lifecycle/#experimental)

## Installation

You can install the released version of `{sudachir}` from CRAN with:

``` r
install.packages("sudachir")
```

You can also install the development version from GitHub:

``` r
if (!requireNamespace("remotes"))
  install.packages("remotes")

remotes::install_github("uribo/sudachir")
```

## Usage

### Set up 'r-sudachipy' environment

`{sudachir}` works with [sudachipy](https://github.com/WorksApplications/sudachi.rs/tree/develop/python) (>= 0.6.\*) via the [reticulate](https://github.com/rstudio/reticulate/) package.

To get started, you need a Python environment in which sudachipy and its dictionaries are installed.

The `install_sudachipy` function helps you prepare such a Python virtual environment by installing the modules it needs (`sudachipy`, `sudachidict_core`, `pandas`). You can also install these modules manually.

```{r}
library(reticulate)
library(sudachir)

if (!virtualenv_exists("r-sudachipy")) {
  install_sudachipy()
}

use_virtualenv("r-sudachipy", required = TRUE)
```
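
For the manual route, the following is a minimal sketch that uses only reticulate's virtual environment helpers. The helper name `setup_sudachipy_env` and the variable `sudachipy_packages` are hypothetical and not part of `{sudachir}`; the environment name and module list are taken from the text above, and no versions are pinned.

``` r
# Sketch only: `setup_sudachipy_env` is a hypothetical helper, not part of sudachir.
# Modules named in this README; versions are left unpinned here.
sudachipy_packages <- c("sudachipy", "sudachidict_core", "pandas")

setup_sudachipy_env <- function(envname = "r-sudachipy",
                                packages = sudachipy_packages) {
  # Create the virtual environment only if it does not exist yet.
  if (!reticulate::virtualenv_exists(envname)) {
    reticulate::virtualenv_create(envname)
  }
  # Install the Python modules into that environment.
  reticulate::virtualenv_install(envname, packages = packages)
  # Bind the current R session to the environment.
  reticulate::use_virtualenv(envname, required = TRUE)
  invisible(envname)
}
```

Calling `setup_sudachipy_env()` once at the start of a session would then stand in for the `install_sudachipy()` step, assuming a suitable Python installation is already available to reticulate.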

### Tokenize sentences

Use `tokenize_to_df` for tokenization. It accepts either a character vector or a data frame with `doc_id` and `text` columns.

```{r}
txt <- c(
  "国家公務員は鳴門海峡に行きたい",
  "吾輩は猫である。\n名前はまだない。"
)

tokenize_to_df(data.frame(doc_id = c(1, 2), text = txt))
```

You can control which dictionary features are parsed using the `col_select` argument.

```{r}
tokenize_to_df(txt, col_select = 1:3) |>
  dplyr::glimpse()

tokenize_to_df(
  txt,
  into = dict_features("en"),
  col_select = c("pos1", "pos2")
) |>
  dplyr::glimpse()
```

The `as_tokens` function tidies the tokens and their first part-of-speech information into a list of named tokens. The `form` function is a shorthand for `tokenize_to_df(txt) |> as_tokens()`.

```{r}
tokenize_to_df(txt) |> as_tokens(type = "surface")

form(txt, type = "surface")

form(txt, type = "normalized")

form(txt, type = "dictionary")

form(txt, type = "reading")
```
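
A token list like this is easy to summarize with base R. The sketch below counts how often each token occurs; `count_tokens` is a hypothetical helper, not part of `{sudachir}`, and it assumes only that its input is a list of character vectors, one per document, which is what the "list of named tokens" description above suggests.

``` r
# Sketch only: `count_tokens` is a hypothetical helper, not part of sudachir.
# Assumes `tokens` is a list of character vectors, one element per document.
count_tokens <- function(tokens) {
  # Flatten all documents into one vector, dropping the element names.
  all_tokens <- unlist(tokens, use.names = FALSE)
  # Tabulate and put the most frequent tokens first.
  sort(table(all_tokens), decreasing = TRUE)
}
```

With a working environment, `count_tokens(form(txt, type = "normalized"))` would give token frequencies across both example sentences.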

### Change split mode

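Sudachi provides three split modes: A (shortest units), B (middle-length units), and C (longest units, close to named entities). Pass the mode to `rebuild_tokenizer` to get a tokenizer that uses it.

```{r}
# Mode "B": middle-length units
tokenize_to_df(txt, instance = rebuild_tokenizer("B")) |>
  as_tokens("surface", pos = FALSE)

# Mode "A": shortest units
tokenize_to_df(txt, instance = rebuild_tokenizer("A")) |>
  as_tokens("surface", pos = FALSE)
```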
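
To see the modes side by side, the calls above can be wrapped in a loop. This is a minimal sketch; `compare_split_modes` is a hypothetical helper, not part of `{sudachir}`, and it relies only on the `rebuild_tokenizer`, `tokenize_to_df`, and `as_tokens` calls already shown in this README.

``` r
# Sketch only: `compare_split_modes` is a hypothetical helper, not part of sudachir.
compare_split_modes <- function(txt, modes = c("A", "B", "C")) {
  # Build one tokenizer per split mode and collect the surface forms.
  tokens <- lapply(modes, function(mode) {
    tokenizer <- sudachir::rebuild_tokenizer(mode)
    sudachir::as_tokens(
      sudachir::tokenize_to_df(txt, instance = tokenizer),
      "surface",
      pos = FALSE
    )
  })
  # Name each result after the mode that produced it.
  stats::setNames(tokens, modes)
}
```

`compare_split_modes(txt)` would return a list with one element per mode, so differences in granularity (for example, how 国家公務員 is split) can be inspected directly.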

### Change dictionary edition

You can change the dictionary edition with the `rebuild_tokenizer` function.

```{r}
if (py_module_available("sudachidict_full")) {
  tokenizer_full <- rebuild_tokenizer(mode = "C", dict_type = "full")
  tokenize_to_df(txt, instance = tokenizer_full) |>
    as_tokens("surface", pos = FALSE)
}
```
