{"id":22700437,"url":"https://github.com/uncomputable/frequency-dict","last_synced_at":"2025-10-27T00:33:17.009Z","repository":{"id":183198646,"uuid":"663128755","full_name":"uncomputable/frequency-dict","owner":"uncomputable","description":"Frequency dictionaries for CHJ (Corpus of Historical Japanese), SHC (Showa-Heisei Corpus of written Japanese) and NWJC (NINJAL Web Japanese Corpus).","archived":false,"fork":false,"pushed_at":"2023-12-11T14:44:02.000Z","size":47,"stargazers_count":0,"open_issues_count":0,"forks_count":0,"subscribers_count":1,"default_branch":"master","last_synced_at":"2025-02-04T19:46:42.134Z","etag":null,"topics":["dictionary","japanese","japanese-learning","japanese-study","language","yomichan"],"latest_commit_sha":null,"homepage":"https://www.ninjal.ac.jp/","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"cc0-1.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/uncomputable.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2023-07-06T16:01:25.000Z","updated_at":"2023-11-08T14:14:40.000Z","dependencies_parsed_at":null,"dependency_job_id":"f8b7bd61-6e07-48b0-b45d-7dc300938977","html_url":"https://github.com/uncomputable/frequency-dict","commit_stats":null,"previous_names":["uncomputable/frequency-dict"],"tags_count":2,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/uncomputable%2Ffrequency-dict","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/uncomputable%2Ffrequency-dict/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/
uncomputable%2Ffrequency-dict/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/uncomputable%2Ffrequency-dict/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/uncomputable","download_url":"https://codeload.github.com/uncomputable/frequency-dict/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":246230541,"owners_count":20744349,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["dictionary","japanese","japanese-learning","japanese-study","language","yomichan"],"created_at":"2024-12-10T06:12:11.577Z","updated_at":"2025-10-27T00:33:16.953Z","avatar_url":"https://github.com/uncomputable.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# Frequency dictionaries for Yomichan\n\nHigh-quality frequency dictionaries ready to be imported into [Yomichan](https://foosoft.net/projects/yomichan/).\n\nGenerate frequency dictionaries from source for customization.\n\nA frequency dictionary displays the ranked frequency (1st most frequent, 2nd most frequent, ...) 
of a word inside a context (written language, spoken language, web, Showa era, Heisei era, ...).\n\nFrequency dictionaries can help language learners distinguish common words from uncommon ones.\n\n## Features\n\n### Latest data\n\nThe data is kept up to date with NINJAL.\n\n### Unique dictionaries\n\nLearn how words changed in frequency throughout history (CHJ, SHC).\n\nLearn about frequent words on the Japanese web (NWJC).\n\n### Careful merging of files\n\nWhen compiling a frequency dictionary, one has to be careful not to count the same word occurrence twice. Double-counting would corrupt the resulting word frequency.\n\nThe dictionaries in this repo are vetted against double-counting.\n\n### Frequency rank cap\n\nThe default dictionaries include the 50k most frequent words only. This keeps the files small and keeps the learner focused on what is important: frequent words. Language fluency requires around 10k to 20k words of vocabulary.\n\n## Included dictionaries\n\nYou can find the dictionaries of the following corpora as GitHub releases.\n\nEach dictionary file shares the same license as its source data.\n\n### [Corpus of Historical Japanese (CHJ)](https://clrd.ninjal.ac.jp/chj/index.html)\n\n\u003ca rel=\"license\" href=\"http://creativecommons.org/licenses/by-nc-sa/4.0/\"\u003e\u003cimg alt=\"Creative Commons License\" style=\"border-width:0\" src=\"https://i.creativecommons.org/l/by-nc-sa/4.0/88x31.png\" /\u003e\u003c/a\u003e\n\nA corpus that covers different eras of Japanese history.\n\nThe corpus ranges from the Nara period through the Edo period and Meiji era up to the Taishō era.\n\nTo track words across eras, two dictionaries are generated:\n\n1. A dictionary for the premodern part (Nara to Edo)\n2. 
A dictionary for the modern part (Meiji to Taishō)\n\n_The corpus is likely too small to generate dictionaries for each era._\n\n### [Showa-Heisei Corpus of written Japanese (SHC)](https://clrd.ninjal.ac.jp/shc/index.html)\n\n\u003ca rel=\"license\" href=\"http://creativecommons.org/licenses/by-nc-sa/4.0/\"\u003e\u003cimg alt=\"Creative Commons License\" style=\"border-width:0\" src=\"https://i.creativecommons.org/l/by-nc-sa/4.0/88x31.png\" /\u003e\u003c/a\u003e\n\nA corpus that covers the Showa and Heisei eras of Japanese history.\n\nThere is one dictionary for both eras.\n\n### [NINJAL Web Japanese Corpus (NWJC)](https://masayu-a.github.io/NWJC/)\n\n\u003ca rel=\"license\" href=\"http://creativecommons.org/licenses/by/4.0/\"\u003e\u003cimg alt=\"Creative Commons License\" style=\"border-width:0\" src=\"https://i.creativecommons.org/l/by/4.0/88x31.png\" /\u003e\u003c/a\u003e\n\nA corpus that was created by crawling the web.\n\n## Supported dictionaries\n\nThe licenses of the following corpora don't allow me to upload derived dictionaries.\n\nMy solution is to publish the [raw data in a separate repo](https://github.com/uncomputable/frequency-data).\n\nUse my script to generate a frequency dictionary on your local machine.\n\n### [Balanced Corpus of Contemporary Written Japanese (BCCWJ)](https://clrd.ninjal.ac.jp/bccwj/index.html)\n\n\u003ca rel=\"license\" href=\"http://creativecommons.org/licenses/by-nc-nd/3.0/\"\u003e\u003cimg alt=\"Creative Commons License\" style=\"border-width:0\" src=\"https://i.creativecommons.org/l/by-nc-nd/3.0/88x31.png\" /\u003e\u003c/a\u003e\n\nOne of the largest and most popular corpora out there. 
It focuses on written language.\n\n### [Corpus of Spontaneous Japanese (CSJ)](https://clrd.ninjal.ac.jp/csj/index.html)\n\n\u003ca rel=\"license\" href=\"http://creativecommons.org/licenses/by-nc-nd/3.0/\"\u003e\u003cimg alt=\"Creative Commons License\" style=\"border-width:0\" src=\"https://i.creativecommons.org/l/by-nc-nd/3.0/88x31.png\" /\u003e\u003c/a\u003e\n\nAnother popular corpus with a focus on spoken language.\n\n## Set up the runtime environment\n\n### Use nix\n\nEnter the provided nix shell.\n\n```bash\nnix-shell\n```\n\n### Use pip\n\nCreate a virtual environment and use pip to install the dependencies.\n\n```bash\npython3 -m venv venv \u0026\u0026 source venv/bin/activate\npip install -r requirements.txt\n```\n\n## Run the script\n\nRun the script on the command line with the desired arguments.\n\n```bash\npython3 main.py [arguments...]\n```\n\nFor example, generate the frequency dictionary for BCCWJ (short-unit words) like so:\n\n```bash\npython3 main.py bccjw BCCWJ_frequencylist_suw_ver1_1.tsv\n```\n\nThere is help in case you get stuck.\n\n```bash\npython3 main.py --help\npython3 main.py bccjw --help\n```\n\n## Import the dictionary\n\nOpen the Yomichan settings in your browser and click \"Import Dictionary\".\n\nSelect the zip file and wait for it to be processed.\n\nThe dictionary should now be working.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Funcomputable%2Ffrequency-dict","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Funcomputable%2Ffrequency-dict","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Funcomputable%2Ffrequency-dict/lists"}