https://github.com/scriptin/kanji-frequency

Kanji usage frequency data collected from various sources



Host: GitHub
URL: https://github.com/scriptin/kanji-frequency
Owner: scriptin
License: cc-by-4.0
Created: 2016-01-24T01:51:10.000Z (almost 10 years ago)
Default Branch: master
Last Pushed: 2026-01-15T22:25:52.000Z (7 days ago)
Last Synced: 2026-01-16T01:50:02.962Z (7 days ago)
Topics: cjk, cjk-characters, corpus, corpus-linguistics, data, data-visualization, frequency-lists, japanese, japanese-language, kanji, kanji-frequency
Language: Astro
Homepage: http://scriptin.github.io/kanji-frequency/
Size: 4.26 MB
Stars: 155
Watchers: 3
Forks: 22
Open Issues: 0
Metadata Files:
- Readme: README.md
- Contributing: CONTRIBUTING.md
- License: LICENSE.txt

README

# Kanji usage frequency

Datasets built from various Japanese language corpora

See the [project website](http://scriptin.github.io/kanji-frequency/) for the dataset description. This readme describes only technical aspects.

You can download the datasets here:

## Building the datasets

You'll need Node.js 18 or later.

See the `scripts` section in [package.json](./package.json).

Aozora:

- `aozora:download` - use crawler/scraper to collect the data
- `aozora:gaiji:extract` - extract gaiji notation data from the scraped pages. Gaiji are kanji characters that are replaced with images in the documents because the Shift-JIS encoding cannot represent them
- `aozora:gaiji:replacements` - build the gaiji replacements file. This produces only partial results, which may need to be completed manually
- `aozora:clean` - clean the scraped pages (apply gaiji replacements)
- `aozora:count` - create the dataset

Wikipedia:

- `wikipedia:fetch` - fetch random pages using MediaWiki API
- `wikipedia:count` - create the dataset
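
As a rough illustration of the fetch step, here is a minimal sketch of requesting random article titles through the MediaWiki Action API (`action=query` with `list=random`). It is not the project's actual script: the function names `randomPagesUrl` and `fetchRandomTitles`, the host `ja.wikipedia.org`, and the default limit are assumptions made for the example. It relies on the global `fetch` available in Node.js 18.

```javascript
// Hypothetical helper: build a MediaWiki API URL that lists random pages.
// rnnamespace=0 restricts results to the main (article) namespace.
function randomPagesUrl(host, limit = 10) {
  const url = new URL(`https://${host}/w/api.php`);
  url.search = new URLSearchParams({
    action: "query",
    format: "json",
    list: "random",
    rnnamespace: "0",
    rnlimit: String(limit),
  }).toString();
  return url;
}

// Hypothetical helper: fetch the titles of `limit` random pages.
async function fetchRandomTitles(host, limit = 10) {
  const res = await fetch(randomPagesUrl(host, limit));
  if (!res.ok) {
    throw new Error(`MediaWiki API request failed: HTTP ${res.status}`);
  }
  const body = await res.json();
  return body.query.random.map((page) => page.title);
}

// Example call (performs a network request, so it is left commented out):
// fetchRandomTitles("ja.wikipedia.org", 5).then(console.log);
```

Keeping the URL construction separate from the request makes it easy to point the same code at another MediaWiki host by changing the `host` argument.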

News:

- `news:wikinews:fetch` - fetch random pages from Wikinews using MediaWiki API
- `news:count` - create the dataset
- `news:dates` - create an additional file with the dates of the articles
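
Each group above ends with a `count` step that creates the dataset. The sketch below shows one way such a count could work: tally every kanji in a text and sort by descending frequency. It is an assumption-laden illustration, not the project's implementation. The names `countKanji` and `toSortedRows` are hypothetical, and kanji are detected with the Unicode property escape `\p{Script=Han}`, which may differ from the character ranges the real scripts use.

```javascript
// Hypothetical helper: count how often each kanji occurs in a text.
// \p{Script=Han} with the `u` flag matches Han-script characters,
// including those outside the Basic Multilingual Plane.
function countKanji(text, counts = new Map()) {
  for (const match of text.matchAll(/\p{Script=Han}/gu)) {
    const ch = match[0];
    counts.set(ch, (counts.get(ch) ?? 0) + 1);
  }
  return counts;
}

// Hypothetical helper: turn the counts into [kanji, count] rows,
// sorted by descending count, with ties broken by code point.
function toSortedRows(counts) {
  return [...counts.entries()].sort(
    (a, b) => b[1] - a[1] || a[0].codePointAt(0) - b[0].codePointAt(0)
  );
}

// Hiragana such as の, を, で and む are ignored; only kanji are counted.
const rows = toSortedRows(countKanji("日本語の本を日本で読む"));
console.log(rows);
```

Passing the same `Map` to `countKanji` across many documents accumulates totals for a whole corpus before sorting once at the end.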

## Building the website

See [Astro](https://astro.build/) [docs](https://docs.astro.build/en/getting-started/) and the `scripts` section in [package.json](./package.json).
