https://github.com/uncomputable/frequency-data
Raw Japanese frequency data.
dictionary japanese japanese-study language raw-data
- Host: GitHub
- URL: https://github.com/uncomputable/frequency-data
- Owner: uncomputable
- License: other
- Created: 2023-07-23T12:51:31.000Z (almost 2 years ago)
- Default Branch: master
- Last Pushed: 2023-11-09T16:55:09.000Z (over 1 year ago)
- Last Synced: 2025-03-29T19:11:32.846Z (3 months ago)
- Topics: dictionary, japanese, japanese-study, language, raw-data
- Homepage: https://www.ninjal.ac.jp/
- Size: 96.4 MB
- Stars: 0
- Watchers: 1
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
README
# Raw Japanese frequency data
This repository hosts data from [NINJAL](https://www.ninjal.ac.jp/).
The data is already public. I mirror it here to prevent link rot.
## Balanced Corpus of Contemporary Written Japanese (BCCWJ)
One of the largest and most popular corpora out there. It focuses on written language.
[See the university website](https://clrd.ninjal.ac.jp/bccwj/bcc-chu.html).
## Corpus of Spontaneous Japanese (CSJ)
Another popular corpus with a focus on spoken language.
[See the university website](https://clrd.ninjal.ac.jp/csj/chunagon.html).
## NINJAL Web Japanese Corpus (NWJC)
A corpus which was created by crawling the web.
[The official website](https://masayu-a.github.io/NWJC/) doesn't seem to host any data.
[See NINJAL's repository](https://repository.ninjal.ac.jp/) and navigate like so:
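1. 言語資源 (Language Resources)
2. 国語研日本語ウェブコーパス (NINJAL Web Japanese Corpus)
3. 『国語研日本語ウェブコーパス』中納言搭載データ語彙表 (word list for the NWJC data loaded into Chunagon)
## Corpus of Historical Japanese (CHJ)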
A corpus that covers different eras of Japanese history.
[See the university website](https://clrd.ninjal.ac.jp/chj/chj-wc.html).
## Showa-Heisei corpus of written Japanese (SHC)
A corpus that covers the Showa and Heisei eras of Japanese history.
[See the university website](https://clrd.ninjal.ac.jp/shc/stats.html).