An open API service indexing awesome lists of open source software.

https://github.com/paithiov909/sudachir2

(Unofficial) R wrapper for 'sudachi.rs'๐Ÿฆ€
https://github.com/paithiov909/sudachir2

pos-tagging r r-package rust

Last synced: 4 months ago
JSON representation

(Unofficial) R wrapper for 'sudachi.rs'๐Ÿฆ€

Awesome Lists containing this project

README

          

---
output: github_document
---

```{r, include = FALSE}
knitr::opts_chunk$set(
collapse = TRUE,
comment = "#>",
fig.path = "man/figures/README-",
out.width = "100%"
)
pkgload::load_all(export_all = FALSE)
```

# sudachir2 sudachir2 logo

[![sudachir2 status badge](https://paithiov909.r-universe.dev/sudachir2/badges/version)](https://paithiov909.r-universe.dev/sudachir2)
[![Lifecycle: experimental](https://img.shields.io/badge/lifecycle-experimental-orange.svg)](https://lifecycle.r-lib.org/articles/stages.html#experimental)
[![R-CMD-check](https://github.com/paithiov909/sudachir2/actions/workflows/R-CMD-check.yaml/badge.svg)](https://github.com/paithiov909/sudachir2/actions/workflows/R-CMD-check.yaml)

An R wrapper for 'Sudachi'; a modern reimagining of
[uribo/sudachir](https://github.com/uribo/sudachir) and [yutannihilation/fledgingr](https://github.com/yutannihilation/fledgingr)
that directly wraps [sudachi.rs](https://github.com/WorksApplications/sudachi.rs) with [savvy](https://github.com/yutannihilation/savvy).

## Installation

To install from source package, the Rust toolchain is required.

```r
install.packages("sudachir2", repos = c("https://paithiov909.r-universe.dev", "https://cloud.r-project.org"))
```

## Usage

To use the package, you need to download a dictionary first.
You can use `sudachir2::fetch_dict()` to download the [SudachiDict](https://github.com/WorksApplications/SudachiDict).

```{r}
library(sudachir2)

small_dict <-
file.path(tempdir(),
"sudachi-dictionary-20250129",
"system_small.dic"
)

if (!file.exists(small_dict)) {
fetch_dict(tempdir(), dict_version = "20250129", dict_type = "small")
}
```

After downloading the dictionary, you can create a tagger function.
`sudachir2::create_tagger()` returns a function that can be used to tokenize texts.

```{r}
my_tagger <- create_tagger(small_dict, mode = "C")
my_tagger("ๆ–ฐใ—ใ„ๆœใŒๆฅใŸ")
```

For convenience, you can use `sudachir2::tokenize()` to tokenize a data.frame as well as a character vector
with a tagger function, and `sudachir2::prettify()` to parse comma-delimited features.

```{r}
tokenize("ๆ–ฐใ—ใ„ๆœใŒๆฅใŸ", tagger = my_tagger)

dat <-
dplyr::tibble(
text = c("ๆ–ฐใ—ใ„ๆœใŒๆฅใŸ", "ๅธŒๆœ›ใฎๆœใ ", "ๅ–œใณใซ่ƒธใ‚’้–‹ใ‘", "ๅคง็ฉบใ‚ใŠใ’"),
doc_id = seq_along(text)
)

toks <-
tokenize(
dat, text, doc_id,
tagger = my_tagger
)

str(toks)

prettify(toks) |> str()
```

## Versioning

This package is versioned by copying the version number of [sudachi.rs](https://github.com/WorksApplications/sudachi.rs),
where the first three digits represent that version number
and the fourth digit (if any) represents the patch release for this package.