An open API service indexing awesome lists of open source software.

https://github.com/paithiov909/ldccr

Utilities for various Japanese corpora
https://github.com/paithiov909/ldccr

r r-package

Last synced: 6 months ago
JSON representation

Utilities for various Japanese corpora

Awesome Lists containing this project

README

          

---
output: github_document
---

```{r, include = FALSE}
knitr::opts_chunk$set(
collapse = TRUE,
comment = "#>"
)
pkgload::load_all(export_all = FALSE)
```

# ldccr

[![ldccr status badge](https://paithiov909.r-universe.dev/badges/ldccr)](https://paithiov909.r-universe.dev)

## Overview

ldccr is utilities for various Japanese corpora.

The goal of ldccr package is to make easy to use Japanese language resources.

This package provides:

1. parsers for several Japanese corpora that are free or open licensed (non proprietary).
2. a downloader of zipped text files published on [Aozora Bunko](https://www.aozora.gr.jp/).

## Installation

```r
# install.packages("pak")
pak::pak("paithiov909/ldccr")
```

## Supported Corpora

### Monolingual

| ... | Name | License | Link |
| --- | ---- | ------- | ---- |
| :heavy_check_mark: | Live Door News Corpus | [CC BY-ND 2.1 JP](http://creativecommons.org/licenses/by-nd/2.1/jp/) | [#](http://www.rondhuit.com/download.html#ldcc) |
| :heavy_check_mark: | Japanese Realistic Textual Entailment Corpus | [CC BY-NC-SA 4.0](https://creativecommons.org/licenses/by-nc-sa/4.0/) | [#](https://github.com/megagonlabs/jrte-corpus) |
| :heavy_check_mark: | ja.text8 corpus | [CC BY-SA](https://creativecommons.org/licenses/by-sa/4.0/) | [#](https://github.com/Hironsan/ja.text8) |

### Multilingual

> Currently not supported.

## Download text file from Aozora Bunko

You can download a text file by specifying `テキストファイルURL` with `read_aozora()`:

```{r}
if (!dir.exists("cache")) dir.create("cache")

text <- ldccr::AozoraBunkoSnapshot |>
dplyr::slice_sample(n = 1L) |>
dplyr::pull("テキストファイルURL") |>
ldccr::read_aozora(directory = "cache") |>
readr::read_lines()

dplyr::glimpse(text)
```

If you want to read a large part of texts published at Aozora Bunko, alternatively,
you can download them at once via [globis-university/aozorabunko-clean](https://huggingface.co/datasets/globis-university/aozorabunko-clean).

For example, you can read those texts as follows:

```{r}
if (require("polars", quietly = TRUE)) {
# We are setting `HUGGINGFACE_HUB_CACHE` to a temporary directory.
# If you don't mind where the cache goes, you don't need to set this.
withr::with_envvar(c(HUGGINGFACE_HUB_CACHE = tempdir()), {
path <- hfhub::hub_download(
"datasets/globis-university/aozorabunko-clean",
"aozorabunko-dedupe-clean.jsonl.gz"
)
})

df <- pl$read_ndjson(path)

df$unnest()$
select(
pl$col("作品ID", "人物ID")$str$to_integer(),
pl$col("作品名", "text")
)
# To convert this into a tibble, follow with `$to_dataframe() |> dplyr::as_tibble()`.
}
```

> NOTE: This example requires [polars](https://pola-rs.github.io/r-polars/) to read a gzipped NDJSON file.
> For installation of polars, please see [Installation details](https://pola-rs.github.io/r-polars/vignettes/install.html).

## License

MIT license.