https://github.com/scriptin/kanji-frequency

Kanji usage frequency data collected from various sources

# Kanji usage frequency

Datasets built from various Japanese language corpora

See the project website for the dataset description. This README describes only the technical aspects.

You can download the datasets here:

## Building the datasets

You'll need Node.js 18 or later.

See the `scripts` section in [package.json](./package.json).

Aozora:

- `aozora:download` - use a crawler/scraper to collect the data
- `aozora:gaiji:extract` - extract gaiji notation data from the scraped pages. Gaiji are kanji characters that are replaced with images in the documents because the Shift-JIS encoding cannot represent them
- `aozora:gaiji:replacements` - build the gaiji replacements file. This produces only partial results, which may need to be completed manually
- `aozora:clean` - clean the scraped pages (apply gaiji replacements)
- `aozora:count` - create the dataset
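
To illustrate the idea behind the gaiji steps, here is a minimal sketch of replacing gaiji images with characters. It is not the code used by the scripts: the function name `applyGaijiReplacements`, the `Map`-based replacements format, and the assumption that gaiji appear as `<img ... alt="..." class="gaiji" />` tags are all illustrative, and the sample `alt` text is simplified. Unresolved entries are collected, since the replacements file may be incomplete.

```typescript
// Hypothetical format: gaiji description (the <img> alt text) -> Unicode character.
// The real replacements file produced by the scripts may look different.
type GaijiReplacements = Map<string, string>;

// Matches <img> tags marked with class="gaiji" and captures the alt text.
// Assumes the alt attribute precedes the class attribute, as in the sample below.
const GAIJI_IMG = /<img\b[^>]*\balt="([^"]*)"[^>]*\bclass="gaiji"[^>]*>/g;

function applyGaijiReplacements(
  html: string,
  replacements: GaijiReplacements,
): { text: string; unresolved: string[] } {
  const unresolved: string[] = [];
  const text = html.replace(GAIJI_IMG, (tag: string, alt: string) => {
    const char = replacements.get(alt);
    if (char === undefined) {
      unresolved.push(alt); // no known replacement: keep the tag for manual review
      return tag;
    }
    return char;
  });
  return { text, unresolved };
}

// Usage with made-up sample data:
const sample = '森<img src="x.png" alt="「區＋鳥」" class="gaiji" />外';
const result = applyGaijiReplacements(sample, new Map([["「區＋鳥」", "鷗"]]));
// result.text is "森鷗外" and result.unresolved is empty
```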

Wikipedia:

- `wikipedia:fetch` - fetch random pages using the MediaWiki API
- `wikipedia:count` - create the dataset
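
As a rough illustration of fetching random pages, the sketch below requests random article titles through the MediaWiki Action API (`action=query` with `list=random`). The function names, the choice of parameters, and the endpoint passed in are assumptions made for this example; the actual script may query the API differently. It relies on the global `fetch` available in Node.js 18 and later.

```typescript
import { URL, URLSearchParams } from "node:url";

// Builds an Action API URL that asks for `limit` random pages
// from the main (article) namespace.
function buildRandomPagesUrl(endpoint: string, limit: number): string {
  const url = new URL(endpoint);
  url.search = new URLSearchParams({
    action: "query",
    list: "random",
    rnnamespace: "0", // 0 = articles only
    rnlimit: String(limit),
    format: "json",
    formatversion: "2",
  }).toString();
  return url.toString();
}

interface RandomPage {
  id: number;
  ns: number;
  title: string;
}

// Fetches random article titles; throws on a non-2xx HTTP response.
async function fetchRandomTitles(endpoint: string, limit: number): Promise<string[]> {
  const response = await fetch(buildRandomPagesUrl(endpoint, limit));
  if (!response.ok) {
    throw new Error(`MediaWiki API returned HTTP ${response.status}`);
  }
  const body = (await response.json()) as { query: { random: RandomPage[] } };
  return body.query.random.map((page) => page.title);
}

// Usage (performs a network request):
// const titles = await fetchRandomTitles("https://ja.wikipedia.org/w/api.php", 10);
```

Page contents would still have to be fetched in a separate request for each title.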

News:

- `news:wikinews:fetch` - fetch random pages from Wikinews using the MediaWiki API
- `news:count` - create the dataset
- `news:dates` - create an additional file with the dates of the articles
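
The `*:count` scripts above each produce a frequency dataset. The core idea can be sketched as follows, assuming that "kanji" means any character in the Unicode Han script (which also includes marks such as 々). This is only an illustration: `countKanji` is a hypothetical name, and the real scripts may define kanji differently and write a different output format.

```typescript
// Counts Han-script characters across the given texts and returns
// [character, count] pairs sorted by descending count
// (ties are broken by code point order).
function countKanji(texts: Iterable<string>): [string, number][] {
  const isHan = /\p{Script=Han}/u;
  const counts = new Map<string, number>();
  for (const text of texts) {
    // for...of iterates by code point, so characters outside the BMP stay intact
    for (const char of text) {
      if (isHan.test(char)) {
        counts.set(char, (counts.get(char) ?? 0) + 1);
      }
    }
  }
  return [...counts.entries()].sort(
    ([a, countA], [b, countB]) =>
      countB - countA || a.codePointAt(0)! - b.codePointAt(0)!,
  );
}

// Usage: kana and punctuation are ignored, only kanji are counted.
const frequencies = countKanji(["日本語の本", "日本"]);
// frequencies is [["本", 3], ["日", 2], ["語", 1]]
```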

## Building the website

See [Astro](https://astro.build/) [docs](https://docs.astro.build/en/getting-started/) and the `scripts` section in [package.json](./package.json).