https://github.com/paithiov909/ldccr
Utilities for various Japanese corpora
https://github.com/paithiov909/ldccr
r r-package
Last synced: 6 months ago
JSON representation
Utilities for various Japanese corpora
- Host: GitHub
- URL: https://github.com/paithiov909/ldccr
- Owner: paithiov909
- License: other
- Created: 2021-01-02T04:09:03.000Z (almost 5 years ago)
- Default Branch: main
- Last Pushed: 2025-03-01T04:55:31.000Z (7 months ago)
- Last Synced: 2025-03-01T05:25:49.244Z (7 months ago)
- Topics: r, r-package
- Language: C++
- Homepage: https://paithiov909.github.io/ldccr/
- Size: 24.4 MB
- Stars: 1
- Watchers: 2
- Forks: 0
- Open Issues: 2
-
Metadata Files:
- Readme: README.Rmd
- License: LICENSE
Awesome Lists containing this project
README
---
output: github_document
---```{r, include = FALSE}
knitr::opts_chunk$set(
collapse = TRUE,
comment = "#>"
)
pkgload::load_all(export_all = FALSE)
```# ldccr
[](https://paithiov909.r-universe.dev)
## Overview
ldccr is utilities for various Japanese corpora.
The goal of ldccr package is to make easy to use Japanese language resources.
This package provides:
1. parsers for several Japanese corpora that are free or open licensed (non proprietary).
2. a downloader of zipped text files published on [Aozora Bunko](https://www.aozora.gr.jp/).## Installation
```r
# install.packages("pak")
pak::pak("paithiov909/ldccr")
```## Supported Corpora
### Monolingual
| ... | Name | License | Link |
| --- | ---- | ------- | ---- |
| :heavy_check_mark: | Live Door News Corpus | [CC BY-ND 2.1 JP](http://creativecommons.org/licenses/by-nd/2.1/jp/) | [#](http://www.rondhuit.com/download.html#ldcc) |
| :heavy_check_mark: | Japanese Realistic Textual Entailment Corpus | [CC BY-NC-SA 4.0](https://creativecommons.org/licenses/by-nc-sa/4.0/) | [#](https://github.com/megagonlabs/jrte-corpus) |
| :heavy_check_mark: | ja.text8 corpus | [CC BY-SA](https://creativecommons.org/licenses/by-sa/4.0/) | [#](https://github.com/Hironsan/ja.text8) |### Multilingual
> Currently not supported.
## Download text file from Aozora Bunko
You can download a text file by specifying `テキストファイルURL` with `read_aozora()`:
```{r}
if (!dir.exists("cache")) dir.create("cache")text <- ldccr::AozoraBunkoSnapshot |>
dplyr::slice_sample(n = 1L) |>
dplyr::pull("テキストファイルURL") |>
ldccr::read_aozora(directory = "cache") |>
readr::read_lines()dplyr::glimpse(text)
```If you want to read a large part of texts published at Aozora Bunko, alternatively,
you can download them at once via [globis-university/aozorabunko-clean](https://huggingface.co/datasets/globis-university/aozorabunko-clean).For example, you can read those texts as follows:
```{r}
if (require("polars", quietly = TRUE)) {
# We are setting `HUGGINGFACE_HUB_CACHE` to a temporary directory.
# If you don't mind where the cache goes, you don't need to set this.
withr::with_envvar(c(HUGGINGFACE_HUB_CACHE = tempdir()), {
path <- hfhub::hub_download(
"datasets/globis-university/aozorabunko-clean",
"aozorabunko-dedupe-clean.jsonl.gz"
)
})df <- pl$read_ndjson(path)
df$unnest()$
select(
pl$col("作品ID", "人物ID")$str$to_integer(),
pl$col("作品名", "text")
)
# To convert this into a tibble, follow with `$to_dataframe() |> dplyr::as_tibble()`.
}
```> NOTE: This example requires [polars](https://pola-rs.github.io/r-polars/) to read a gzipped NDJSON file.
> For installation of polars, please see [Installation details](https://pola-rs.github.io/r-polars/vignettes/install.html).## License
MIT license.