https://github.com/paithiov909/ldccr

Utilities for various Japanese corpora

Host: GitHub
URL: https://github.com/paithiov909/ldccr
Owner: paithiov909
License: other
Created: 2021-01-02T04:09:03.000Z (almost 5 years ago)
Default Branch: main
Last Pushed: 2025-03-01T04:55:31.000Z (7 months ago)
Last Synced: 2025-03-01T05:25:49.244Z (7 months ago)
Topics: r, r-package
Language: C++
Homepage: https://paithiov909.github.io/ldccr/
Size: 24.4 MB
Stars: 1
Watchers: 2
Forks: 0
Open Issues: 2
Metadata Files:
- Readme: README.Rmd
- License: LICENSE

---
output: github_document
---

```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
pkgload::load_all(export_all = FALSE)
```

# ldccr

[![ldccr status badge](https://paithiov909.r-universe.dev/badges/ldccr)](https://paithiov909.r-universe.dev)

## Overview

ldccr is a collection of utilities for working with various Japanese corpora.

The goal of the ldccr package is to make Japanese language resources easy to use.

This package provides:

1. parsers for several Japanese corpora that are free or openly licensed (non-proprietary).

2. a downloader of zipped text files published on [Aozora Bunko](https://www.aozora.gr.jp/).

## Installation

```r
# install.packages("pak")
pak::pak("paithiov909/ldccr")
```
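The status badge at the top of this README points to an r-universe repository, so installing from there with base R may also work. The following is only a sketch under that assumption: the CRAN mirror URL and the `install_requested` switch are illustrative choices, not something this package prescribes.

```r
# Repositories to search: the r-universe named in the badge above, followed by
# a CRAN mirror for dependencies (the mirror URL is an illustrative choice).
repos <- c(
  paithiov909 = "https://paithiov909.r-universe.dev",
  CRAN = "https://cloud.r-project.org"
)

# Hypothetical switch so that sourcing this snippet has no side effects.
# Set it to TRUE to actually attempt the installation.
install_requested <- FALSE

if (install_requested && !requireNamespace("ldccr", quietly = TRUE)) {
  install.packages("ldccr", repos = repos)
}
```

Listing the r-universe repository first lets `install.packages()` find ldccr there while still resolving its dependencies from CRAN.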

## Supported Corpora

### Monolingual

| Supported | Name | License | Link |
| --- | ---- | ------- | ---- |
| :heavy_check_mark: | Livedoor News Corpus | [CC BY-ND 2.1 JP](http://creativecommons.org/licenses/by-nd/2.1/jp/) | [#](http://www.rondhuit.com/download.html#ldcc) |
| :heavy_check_mark: | Japanese Realistic Textual Entailment Corpus | [CC BY-NC-SA 4.0](https://creativecommons.org/licenses/by-nc-sa/4.0/) | [#](https://github.com/megagonlabs/jrte-corpus) |
| :heavy_check_mark: | ja.text8 corpus | [CC BY-SA](https://creativecommons.org/licenses/by-sa/4.0/) | [#](https://github.com/Hironsan/ja.text8) |

### Multilingual

> Currently not supported.

## Download text files from Aozora Bunko

You can download a text file by passing its URL to `read_aozora()`. The URLs are listed in the `テキストファイルURL` ("text file URL") column of `AozoraBunkoSnapshot`:

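```{r}
# Make sure the download directory exists.
if (!dir.exists("cache")) dir.create("cache")

# Pick one work at random from the bundled snapshot, download its text file
# into `cache`, and read the result line by line.
text <- ldccr::AozoraBunkoSnapshot |>
  dplyr::slice_sample(n = 1L) |>
  dplyr::pull("テキストファイルURL") |>
  ldccr::read_aozora(directory = "cache") |>
  readr::read_lines()

dplyr::glimpse(text)
```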
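To fetch a specific work rather than a random one, you can filter the snapshot before pulling the URL. The sketch below uses a toy data frame in place of `ldccr::AozoraBunkoSnapshot` so that it runs offline. Its rows, the `作品名` ("work title") column, and the helper `pick_url()` are assumptions made for illustration, so check the real column names with `names(ldccr::AozoraBunkoSnapshot)` first.

```r
# Toy stand-in for `ldccr::AozoraBunkoSnapshot`. The rows and the `作品名`
# column are hypothetical; only `テキストファイルURL` appears in this README.
snapshot <- data.frame(
  "作品名" = c("作品A", "作品B"),
  "テキストファイルURL" = c(
    "https://example.com/files/a.zip",
    "https://example.com/files/b.zip"
  ),
  check.names = FALSE,
  stringsAsFactors = FALSE
)

# Hypothetical helper: return the text file URL of the first work whose
# title matches `title` exactly, or fail if there is no such work.
pick_url <- function(df, title) {
  hit <- df[df[["作品名"]] == title, "テキストファイルURL"]
  if (length(hit) == 0L) stop("no work titled: ", title)
  hit[[1L]]
}

url <- pick_url(snapshot, "作品B")

# With the real snapshot, the URL would then be passed on as in the chunk above:
# path <- ldccr::read_aozora(url, directory = "cache")
```

Failing loudly when no title matches avoids passing an empty value on to the download step.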

Alternatively, if you want to read a large portion of the texts published on Aozora Bunko, you can download them all at once from [globis-university/aozorabunko-clean](https://huggingface.co/datasets/globis-university/aozorabunko-clean).

For example, you can read those texts as follows:

```{r}
if (require("polars", quietly = TRUE)) {
  # We are setting `HUGGINGFACE_HUB_CACHE` to a temporary directory.
  # If you don't mind where the cache goes, you don't need to set this.
  withr::with_envvar(c(HUGGINGFACE_HUB_CACHE = tempdir()), {
    path <- hfhub::hub_download(
      "datasets/globis-university/aozorabunko-clean",
      "aozorabunko-dedupe-clean.jsonl.gz"
    )
  })
  df <- pl$read_ndjson(path)
  df$unnest()$
    select(
      pl$col("作品ID", "人物ID")$str$to_integer(),
      pl$col("作品名", "text")
    )
  # To convert this into a tibble, follow with `$to_dataframe() |> dplyr::as_tibble()`.
}
```

> NOTE: This example requires [polars](https://pola-rs.github.io/r-polars/) to read a gzipped NDJSON file.
> To install polars, see [Installation details](https://pola-rs.github.io/r-polars/vignettes/install.html).
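If polars is not available, a gzipped NDJSON file can also be read with the jsonlite package, although this is likely to be much slower on a file of this size. The sketch below is not part of ldccr. It writes a tiny two-record file to a temporary path so that it runs offline, and the records and their field names are made up for illustration.

```r
# Write a tiny gzipped NDJSON file (one JSON object per line) to a temp path.
# These records are made up and only mimic a nested `meta` field.
tmp <- tempfile(fileext = ".jsonl.gz")
con <- gzfile(tmp, "w")
writeLines(
  c(
    '{"text":"alpha","meta":{"id":"1"}}',
    '{"text":"beta","meta":{"id":"2"}}'
  ),
  con
)
close(con)

records <- NULL
if (requireNamespace("jsonlite", quietly = TRUE)) {
  # `stream_in()` parses line-delimited JSON from a connection;
  # `gzfile()` decompresses the file transparently while reading.
  records <- jsonlite::stream_in(gzfile(tmp), verbose = FALSE)
  # Turn the nested `meta` data frame column into ordinary columns.
  records <- jsonlite::flatten(records)
}
```

With the real dataset, `tmp` would be replaced by the `path` returned by `hfhub::hub_download()` in the chunk above.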

## License

MIT license.
