https://github.com/scriptin/kanji-frequency
Kanji usage frequency data collected from various sources
- Host: GitHub
- URL: https://github.com/scriptin/kanji-frequency
- Owner: scriptin
- License: cc-by-4.0
- Created: 2016-01-24T01:51:10.000Z (almost 10 years ago)
- Default Branch: master
- Last Pushed: 2026-01-15T22:25:52.000Z (7 days ago)
- Last Synced: 2026-01-16T01:50:02.962Z (7 days ago)
- Topics: cjk, cjk-characters, corpus, corpus-linguistics, data, data-visualization, frequency-lists, japanese, japanese-language, kanji, kanji-frequency
- Language: Astro
- Homepage: http://scriptin.github.io/kanji-frequency/
- Size: 4.26 MB
- Stars: 155
- Watchers: 3
- Forks: 22
- Open Issues: 0
Metadata Files:
- Readme: README.md
- Contributing: CONTRIBUTING.md
- License: LICENSE.txt
README
# Kanji usage frequency
Datasets built from various Japanese language corpora
See the [project website](http://scriptin.github.io/kanji-frequency/) for the dataset description. This readme covers only the technical aspects.
You can download the datasets here:
## Building the datasets
You'll need Node.js 18 or later.
The build steps listed below are defined in the `scripts` section of [package.json](./package.json).
Aozora:
- `aozora:download` - use crawler/scraper to collect the data
- `aozora:gaiji:extract` - extract gaiji notation data from the scraped pages. Gaiji are kanji characters that are replaced with images in the documents because the Shift-JIS encoding cannot represent them
- `aozora:gaiji:replacements` - build the gaiji replacements file; this produces only partial results, which may need to be completed manually
- `aozora:clean` - clean the scraped pages (apply gaiji replacements)
- `aozora:count` - create the dataset
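
As a rough illustration of what a counting step such as `aozora:count` involves, the sketch below tallies the kanji in a string and sorts them by frequency. It is not the repository's actual script: the function name `countKanji` and the use of the Unicode `Han` script property to decide what counts as a kanji are assumptions made for this example.

```javascript
// Hypothetical helper, not taken from the repository's scripts.
// Assumption: a "kanji" is any code point with the Unicode property Script=Han.
const HAN = /\p{Script=Han}/u;

function countKanji(text) {
  const counts = new Map();
  // for...of iterates by code point, so characters outside the BMP
  // are not split into surrogate halves.
  for (const ch of text) {
    if (HAN.test(ch)) {
      counts.set(ch, (counts.get(ch) ?? 0) + 1);
    }
  }
  // Most frequent first; ties keep first-seen order because sort is stable.
  return [...counts.entries()].sort((a, b) => b[1] - a[1]);
}

console.log(countKanji("日本語の本を日本で読む"));
// → [ [ '本', 3 ], [ '日', 2 ], [ '語', 1 ], [ '読', 1 ] ]
```

Hiragana such as の and を are skipped because they do not match the `Han` script property. A real dataset build may need to apply further rules, for example for iteration marks or variant forms.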
Wikipedia:
- `wikipedia:fetch` - fetch random pages using the MediaWiki API
- `wikipedia:count` - create the dataset
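
For readers unfamiliar with the MediaWiki API, here is a minimal sketch of requesting random article titles. The endpoint (Japanese Wikipedia), the `list=random` query, and the helper names `randomPagesUrl` and `fetchRandomTitles` are assumptions for illustration; the repository's fetch script may use different parameters.

```javascript
// Sketch only: the endpoint and parameter choices are assumptions,
// not copied from the repository's fetch script.
const API_ENDPOINT = "https://ja.wikipedia.org/w/api.php";

// Build a request URL for `limit` random pages in the main (article) namespace.
function randomPagesUrl(limit = 10) {
  const params = new URLSearchParams({
    action: "query",
    list: "random",
    rnnamespace: "0", // 0 = articles only
    rnlimit: String(limit),
    format: "json",
  });
  return `${API_ENDPOINT}?${params}`;
}

// Fetch the titles of random pages.
// Relies on the global fetch available in Node.js 18 and later.
async function fetchRandomTitles(limit = 10) {
  const response = await fetch(randomPagesUrl(limit));
  if (!response.ok) {
    throw new Error(`MediaWiki API request failed: ${response.status}`);
  }
  const data = await response.json();
  return data.query.random.map((page) => page.title);
}
```

Calling `fetchRandomTitles(5).then(console.log)` would print five titles. The same URL builder could be pointed at a Wikinews host by changing `API_ENDPOINT`.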
News:
- `news:wikinews:fetch` - fetch random pages from Wikinews using the MediaWiki API
- `news:count` - create the dataset
- `news:dates` - create an additional file with the dates of the articles
## Building the website
See [Astro](https://astro.build/) [docs](https://docs.astro.build/en/getting-started/) and the `scripts` section in [package.json](./package.json).