{"id":13656639,"url":"https://github.com/rspeer/wordfreq","last_synced_at":"2026-03-17T16:02:15.182Z","repository":{"id":11473453,"uuid":"13941524","full_name":"rspeer/wordfreq","owner":"rspeer","description":"Access a database of word frequencies, in various natural languages.","archived":false,"fork":false,"pushed_at":"2025-01-04T20:59:31.000Z","size":452815,"stargazers_count":1592,"open_issues_count":11,"forks_count":109,"subscribers_count":51,"default_branch":"master","last_synced_at":"2025-12-31T10:13:40.548Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":null,"language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"other","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/rspeer.png","metadata":{"files":{"readme":"README.md","changelog":"CHANGELOG.md","contributing":null,"funding":null,"license":"LICENSE.txt","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2013-10-28T23:28:12.000Z","updated_at":"2025-12-24T17:59:50.000Z","dependencies_parsed_at":"2023-01-13T18:00:29.236Z","dependency_job_id":"48b7239a-5a80-46ca-ac06-b8a0f16ab10f","html_url":"https://github.com/rspeer/wordfreq","commit_stats":{"total_commits":548,"total_committers":13,"mean_commits":42.15384615384615,"dds":"0.33211678832116787","last_synced_commit":"79eb74e9bf74e8ee965dbb3e2de49fc273c62e03"},"previous_names":[],"tags_count":125,"template":false,"template_full_name":null,"purl":"pkg:github/rspeer/wordfreq","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/rspeer%2Fwordfreq","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/rspeer%2Fwordfreq/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/rspeer%2Fwordfreq/releases","manifests_ur
l":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/rspeer%2Fwordfreq/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/rspeer","download_url":"https://codeload.github.com/rspeer/wordfreq/tar.gz/refs/heads/master","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/rspeer%2Fwordfreq/sbom","scorecard":{"id":787579,"data":{"date":"2025-08-11","repo":{"name":"github.com/rspeer/wordfreq","commit":"912caf64b657478d1dff1138efdc078947d54bb1"},"scorecard":{"version":"v5.2.1-40-gf6ed084d","commit":"f6ed084d17c9236477efd66e5b258b9d4cc7b389"},"score":2.9,"checks":[{"name":"Code-Review","score":0,"reason":"Found 0/30 approved changesets -- score normalized to 0","details":null,"documentation":{"short":"Determines if the project requires human code review before pull requests (aka merge requests) are merged.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#code-review"}},{"name":"Packaging","score":-1,"reason":"packaging workflow not detected","details":["Warn: no GitHub/GitLab publishing workflow detected."],"documentation":{"short":"Determines if the project is published as a package that others can easily download, install, easily update, and uninstall.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#packaging"}},{"name":"Token-Permissions","score":-1,"reason":"No tokens found","details":null,"documentation":{"short":"Determines if the project's workflows follow the principle of least privilege.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#token-permissions"}},{"name":"Maintained","score":0,"reason":"0 commit(s) and 0 issue activity found in the last 90 days -- score normalized to 0","details":null,"documentation":{"short":"Determines if the project is \"actively 
maintained\".","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#maintained"}},{"name":"Dangerous-Workflow","score":-1,"reason":"no workflows found","details":null,"documentation":{"short":"Determines if the project's GitHub Action workflows avoid dangerous patterns.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#dangerous-workflow"}},{"name":"Binary-Artifacts","score":10,"reason":"no binaries found in the repo","details":null,"documentation":{"short":"Determines if the project has generated executable (binary) artifacts in the source repository.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#binary-artifacts"}},{"name":"CII-Best-Practices","score":0,"reason":"no effort to earn an OpenSSF best practices badge detected","details":null,"documentation":{"short":"Determines if the project has an OpenSSF (formerly CII) Best Practices Badge.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#cii-best-practices"}},{"name":"SAST","score":0,"reason":"no SAST tool detected","details":["Warn: no pull requests merged into dev branch"],"documentation":{"short":"Determines if the project uses static code analysis.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#sast"}},{"name":"Fuzzing","score":0,"reason":"project is not fuzzed","details":["Warn: no fuzzer integrations found"],"documentation":{"short":"Determines if the project uses fuzzing.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#fuzzing"}},{"name":"Security-Policy","score":0,"reason":"security policy file not detected","details":["Warn: no security policy file detected","Warn: no security file to analyze","Warn: no security file to analyze","Warn: no security file to 
analyze"],"documentation":{"short":"Determines if the project has published a security policy.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#security-policy"}},{"name":"License","score":9,"reason":"license file detected","details":["Info: project has a license file: LICENSE.txt:0","Warn: project license file does not contain an FSF or OSI license."],"documentation":{"short":"Determines if the project has defined a license.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#license"}},{"name":"Signed-Releases","score":-1,"reason":"no releases found","details":null,"documentation":{"short":"Determines if the project cryptographically signs release artifacts.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#signed-releases"}},{"name":"Branch-Protection","score":0,"reason":"branch protection not enabled on development/release branches","details":["Warn: branch protection not enabled for branch 'master'"],"documentation":{"short":"Determines if the default and release branches are protected with GitHub's branch protection settings.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#branch-protection"}},{"name":"Pinned-Dependencies","score":-1,"reason":"no dependencies found","details":null,"documentation":{"short":"Determines if the project has declared and pinned the dependencies of its build process.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#pinned-dependencies"}},{"name":"Vulnerabilities","score":9,"reason":"1 existing vulnerabilities detected","details":["Warn: Project is vulnerable to: PYSEC-2025-49 / GHSA-5rjg-fvgr-3xxf"],"documentation":{"short":"Determines if the project has open, known unfixed 
vulnerabilities.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#vulnerabilities"}}]},"last_synced_at":"2025-08-23T06:29:38.300Z","repository_id":11473453,"created_at":"2025-08-23T06:29:38.300Z","updated_at":"2025-08-23T06:29:38.300Z"},"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":30626906,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-03-17T14:16:03.965Z","status":"ssl_error","status_checked_at":"2026-03-17T14:16:03.380Z","response_time":56,"last_error":"SSL_read: unexpected eof while reading","robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":false,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2024-08-02T05:00:27.949Z","updated_at":"2026-03-17T16:02:15.116Z","avatar_url":"https://github.com/rspeer.png","language":"Python","funding_links":[],"categories":["Python","Libraries/Packages","Developer Resources","Feature Extraction"],"sub_categories":["Python","Frequency Lists","Text/NLP"],"readme":"wordfreq is a Python library for looking up the frequencies of words in many\nlanguages, based on many sources of data.\n\nThe word frequencies are a snapshot of language usage through about 2021. I may\ncontinue to make packaging updates, but the data is unlikely to be updated again.\nThe world where I had a reasonable way to collect reliable word frequencies is\nnot the world we live in now. 
See [SUNSET.md](./SUNSET.md) for more information.\n\nAuthor: Robyn Speer\n\n## Installation\n\nwordfreq requires Python 3 and depends on a few other Python modules\n(msgpack, langcodes, and regex). You can install it and its dependencies\nin the usual way, either by getting it from pip:\n\n    pip3 install wordfreq\n\nor by getting the repository and installing it for development, using [poetry][]:\n\n    poetry install\n\n[poetry]: https://python-poetry.org/\n\nSee [Additional CJK installation](#additional-cjk-installation) for extra\nsteps that are necessary to get Chinese, Japanese, and Korean word frequencies.\n\n## Usage\n\nwordfreq provides access to estimates of the frequency with which a word is\nused, in over 40 languages (see *Supported languages* below). It uses many\ndifferent data sources, not just one corpus.\n\nIt provides both 'small' and 'large' wordlists:\n\n- The 'small' lists take up very little memory and cover words that appear at\n  least once per million words.\n- The 'large' lists cover words that appear at least once per 100 million\n  words.\n\nThe default list is 'best', which uses 'large' if it's available for the\nlanguage, and 'small' otherwise.\n\nThe most straightforward function for looking up frequencies is:\n\n    word_frequency(word, lang, wordlist='best', minimum=0.0)\n\nThis function looks up a word's frequency in the given language, returning its\nfrequency as a decimal between 0 and 1.\n\n    \u003e\u003e\u003e from wordfreq import word_frequency\n    \u003e\u003e\u003e word_frequency('cafe', 'en')\n    1.23e-05\n\n    \u003e\u003e\u003e word_frequency('café', 'en')\n    5.62e-06\n\n    \u003e\u003e\u003e word_frequency('cafe', 'fr')\n    1.51e-06\n\n    \u003e\u003e\u003e word_frequency('café', 'fr')\n    5.75e-05\n\n`zipf_frequency` is a variation on `word_frequency` that aims to return the\nword frequency on a human-friendly logarithmic scale. The Zipf scale was\nproposed by Marc Brysbaert, who created the SUBTLEX lists. 
The Zipf frequency\nof a word is the base-10 logarithm of the number of times it appears per\nbillion words. A word with Zipf value 6 appears once per thousand words, for\nexample, and a word with Zipf value 3 appears once per million words.\n\nReasonable Zipf values are between 0 and 8, but because of the cutoffs\ndescribed above, the minimum Zipf value appearing in these lists is 1.0 for the\n'large' wordlists and 3.0 for 'small'. We use 0 as the default Zipf value\nfor words that do not appear in the given wordlist, although it should mean\none occurrence per billion words.\n\n    \u003e\u003e\u003e from wordfreq import zipf_frequency\n    \u003e\u003e\u003e zipf_frequency('the', 'en')\n    7.73\n\n    \u003e\u003e\u003e zipf_frequency('word', 'en')\n    5.26\n\n    \u003e\u003e\u003e zipf_frequency('frequency', 'en')\n    4.36\n\n    \u003e\u003e\u003e zipf_frequency('zipf', 'en')\n    1.49\n\n    \u003e\u003e\u003e zipf_frequency('zipf', 'en', wordlist='small')\n    0.0\n\nThe parameters to `word_frequency` and `zipf_frequency` are:\n\n- `word`: a Unicode string containing the word to look up. Ideally the word\n  is a single token according to our tokenizer, but if not, there is still\n  hope -- see *Tokenization* below.\n\n- `lang`: the BCP 47 or ISO 639 code of the language to use, such as 'en'.\n\n- `wordlist`: which set of word frequencies to use. Current options are\n  'small', 'large', and 'best'.\n\n- `minimum`: If the word is not in the list or has a frequency lower than\n  `minimum`, return `minimum` instead. You may want to set this to the minimum\n  value contained in the wordlist, to avoid a discontinuity where the wordlist\n  ends.\n\n## Frequency bins\n\nwordfreq's wordlists are designed to load quickly and take up little space in\nthe repository.  
We accomplish this by avoiding meaningless precision and\npacking the words into frequency bins.\n\nIn wordfreq, all words that have the same Zipf frequency rounded to the nearest\nhundredth have the same frequency. We don't store any more precision than that.\nSo instead of having to store that the frequency of a word is\n.000011748975549395302, where most of those digits are meaningless, we just store\nthe frequency bins and the words they contain.\n\nBecause the Zipf scale is a logarithmic scale, this preserves the same relative\nprecision no matter how far down you are in the word list. The frequency of any\nword is precise to within 1%.\n\n(This is not a claim about *accuracy*, but about *precision*. We believe that\nthe way we use multiple data sources and discard outliers makes wordfreq a\nmore accurate measurement of the way these words are really used in written\nlanguage, but it's unclear how one would measure this accuracy.)\n\n## The figure-skating metric\n\nWe combine word frequencies from different sources in a way that's designed\nto minimize the impact of outliers. The method reminds me of the scoring system\nin Olympic figure skating:\n\n- Find the frequency of each word according to each data source.\n- For each word, drop the sources that give it the highest and lowest frequency.\n- Average the remaining frequencies.\n- Rescale the resulting frequency list to add up to 1.\n\n## Numbers\n\nThese wordlists would be enormous if they stored a separate frequency for every\nnumber, such as if we separately stored the frequencies of 484977 and 484978\nand 98.371 and every other 6-character sequence that could be considered a number.\n\nInstead, we have a frequency-bin entry for every number of the same \"shape\", such\nas `##` or `####` or `#.#####`, with `#` standing in for digits. (For compatibility\nwith earlier versions of wordfreq, our stand-in character is actually `0`.) 
This\nis the same form of aggregation that the word2vec vocabulary does.\n\nSingle-digit numbers are unaffected by this process; \"0\" through \"9\" have their own\nentries in each language's wordlist.\n\nWhen asked for the frequency of a token containing multiple digits, we multiply\nthe frequency of that aggregated entry by a distribution estimating the frequency\nof those digits. The distribution only looks at two things:\n\n- The value of the first digit\n- Whether it is a 4-digit sequence that's likely to represent a year\n\nThe first digits are assigned probabilities by Benford's law, and years are assigned\nprobabilities from a distribution that peaks at the \"present\". I explored this in\na Twitter thread at \u003chttps://twitter.com/r_speer/status/1493715982887571456\u003e.\n\nThe part of this distribution representing the \"present\" is not strictly a peak and\ndoesn't move forward with time as the present does. Instead, it's a 20-year-long\nplateau from 2019 to 2039. (2019 is the last time Google Books Ngrams was updated,\nand 2039 is a time by which I will probably have figured out a new distribution.)\n\nSome examples:\n\n    \u003e\u003e\u003e word_frequency(\"2022\", \"en\")\n    5.15e-05\n    \u003e\u003e\u003e word_frequency(\"1922\", \"en\")\n    8.19e-06\n    \u003e\u003e\u003e word_frequency(\"1022\", \"en\")\n    1.28e-07\n\nAside from years, the distribution does not care about the meaning of the numbers:\n\n    \u003e\u003e\u003e word_frequency(\"90210\", \"en\")\n    3.34e-10\n    \u003e\u003e\u003e word_frequency(\"92222\", \"en\")\n    3.34e-10\n    \u003e\u003e\u003e word_frequency(\"802.11n\", \"en\")\n    9.04e-13\n    \u003e\u003e\u003e word_frequency(\"899.19n\", \"en\")\n    9.04e-13\n\nThe digit rule applies to other systems of digits, and only cares about the numeric\nvalue of the digits:\n\n    \u003e\u003e\u003e word_frequency(\"٥٤\", \"ar\")\n    6.64e-05\n    \u003e\u003e\u003e word_frequency(\"54\", \"ar\")\n    
6.64e-05\n\nIt doesn't know which language uses which writing system for digits:\n\n    \u003e\u003e\u003e word_frequency(\"٥٤\", \"en\")\n    5.4e-05\n\n## Sources and supported languages\n\nThis data comes from a Luminoso project called [Exquisite Corpus][xc], whose\ngoal is to download good, varied, multilingual corpus data, process it\nappropriately, and combine it into unified resources such as wordfreq.\n\n[xc]: https://github.com/LuminosoInsight/exquisite-corpus\n\nExquisite Corpus compiles 8 different domains of text, some of which themselves\ncome from multiple sources:\n\n- **Wikipedia**, representing encyclopedic text\n- **Subtitles**, from OPUS OpenSubtitles 2018 and SUBTLEX\n- **News**, from NewsCrawl 2014 and GlobalVoices\n- **Books**, from Google Books Ngrams 2012\n- **Web** text, from OSCAR\n- **Twitter**, representing short-form social media\n- **Reddit**, representing potentially longer Internet comments\n- **Miscellaneous** word frequencies: in Chinese, we import a free wordlist\n  that comes with the Jieba word segmenter, whose provenance we don't really\n  know\n\nThe following languages are supported, with reasonable tokenization and at\nleast 3 different sources of word frequencies:\n\n    Language    Code    #  Large?   WP    Subs  News  Books Web   Twit. Redd. 
Misc.\n    ──────────────────────────────┼────────────────────────────────────────────────\n    Arabic      ar      5  Yes    │ Yes   Yes   Yes   -     Yes   Yes   -     -\n    Bangla      bn      5  Yes    │ Yes   Yes   Yes   -     Yes   Yes   -     -\n    Bosnian     bs [1]  3  -      │ Yes   Yes   -     -     -     Yes   -     -\n    Bulgarian   bg      4  -      │ Yes   Yes   -     -     Yes   Yes   -     -\n    Catalan     ca      5  Yes    │ Yes   Yes   Yes   -     Yes   Yes   -     -\n    Chinese     zh [3]  7  Yes    │ Yes   Yes   Yes   Yes   Yes   Yes   -     Jieba\n    Croatian    hr [1]  3         │ Yes   Yes   -     -     -     Yes   -     -\n    Czech       cs      5  Yes    │ Yes   Yes   Yes   -     Yes   Yes   -     -\n    Danish      da      4  -      │ Yes   Yes   -     -     Yes   Yes   -     -\n    Dutch       nl      5  Yes    │ Yes   Yes   Yes   -     Yes   Yes   -     -\n    English     en      7  Yes    │ Yes   Yes   Yes   Yes   Yes   Yes   Yes   -\n    Finnish     fi      6  Yes    │ Yes   Yes   Yes   -     Yes   Yes   Yes   -\n    French      fr      7  Yes    │ Yes   Yes   Yes   Yes   Yes   Yes   Yes   -\n    German      de      7  Yes    │ Yes   Yes   Yes   Yes   Yes   Yes   Yes   -\n    Greek       el      4  -      │ Yes   Yes   -     -     Yes   Yes   -     -\n    Hebrew      he      5  Yes    │ Yes   Yes   -     Yes   Yes   Yes   -     -\n    Hindi       hi      4  Yes    │ Yes   -     -     -     Yes   Yes   Yes   -\n    Hungarian   hu      4  -      │ Yes   Yes   -     -     Yes   Yes   -     -\n    Icelandic   is      3  -      │ Yes   Yes   -     -     Yes   -     -     -\n    Indonesian  id      3  -      │ Yes   Yes   -     -     -     Yes   -     -\n    Italian     it      7  Yes    │ Yes   Yes   Yes   Yes   Yes   Yes   Yes   -\n    Japanese    ja      5  Yes    │ Yes   Yes   -     -     Yes   Yes   Yes   -\n    Korean      ko      4  -      │ Yes   Yes   -     -     -     Yes   Yes   -\n    Latvian     lv      4  -      │ Yes  
 Yes   -     -     Yes   Yes   -     -\n    Lithuanian  lt      3  -      │ Yes   Yes   -     -     Yes   -     -     -\n    Macedonian  mk      5  Yes    │ Yes   Yes   Yes   -     Yes   Yes   -     -\n    Malay       ms      3  -      │ Yes   Yes   -     -     -     Yes   -     -\n    Norwegian   nb [2]  5  Yes    │ Yes   Yes   -     -     Yes   Yes   Yes   -\n    Persian     fa      4  -      │ Yes   Yes   -     -     Yes   Yes   -     -\n    Polish      pl      6  Yes    │ Yes   Yes   Yes   -     Yes   Yes   Yes   -\n    Portuguese  pt      5  Yes    │ Yes   Yes   Yes   -     Yes   Yes   -     -\n    Romanian    ro      3  -      │ Yes   Yes   -     -     Yes   -     -     -\n    Russian     ru      5  Yes    │ Yes   Yes   Yes   Yes   -     Yes   -     -\n    Slovak      sk      3  -      │ Yes   Yes   -     -     Yes   -     -     -\n    Slovenian   sl      3  -      │ Yes   Yes   -     -     Yes   -     -     -\n    Serbian     sr [1]  3  -      │ Yes   Yes   -     -     -     Yes   -     -\n    Spanish     es      7  Yes    │ Yes   Yes   Yes   Yes   Yes   Yes   Yes   -\n    Swedish     sv      5  Yes    │ Yes   Yes   -     -     Yes   Yes   Yes   -\n    Tagalog     fil     3  -      │ Yes   Yes   -     -     Yes   -     -     -\n    Tamil       ta      3  -      │ Yes   -     -     -     Yes   Yes   -     -\n    Turkish     tr      4  -      │ Yes   Yes   -     -     Yes   Yes   -     -\n    Ukrainian   uk      5  Yes    │ Yes   Yes   -     -     Yes   Yes   Yes   -\n    Urdu        ur      3  -      │ Yes   -     -     -     Yes   Yes   -     -\n    Vietnamese  vi      3  -      │ Yes   Yes   -     -     Yes   -     -     -\n\n[1] Bosnian, Croatian, and Serbian use the same underlying word list, because\nthey share most of their vocabulary and grammar, they were once considered the\nsame language, and language detection cannot distinguish them. 
This word list\ncan also be accessed with the language code `sh`.\n\n[2] The Norwegian text we have is specifically written in Norwegian Bokmål, so\nwe give it the language code 'nb' instead of the vaguer code 'no'. We would use\n'nn' for Nynorsk, but there isn't enough data to include it in wordfreq.\n\n[3] This data represents text written in both Simplified and Traditional\nChinese, with primarily Mandarin Chinese vocabulary. See \"Multi-script\nlanguages\" below.\n\nSome languages provide 'large' wordlists, including words with a Zipf frequency\nbetween 1.0 and 3.0. These are available in 14 languages that are covered by\nenough data sources.\n\n## Other functions\n\n`tokenize(text, lang)` splits text in the given language into words, in the same\nway that the words in wordfreq's data were counted in the first place. See\n*Tokenization*.\n\n`top_n_list(lang, n, wordlist='best')` returns the most common *n* words in\nthe list, in descending frequency order.\n\n    \u003e\u003e\u003e from wordfreq import top_n_list\n    \u003e\u003e\u003e top_n_list('en', 10)\n    ['the', 'to', 'and', 'of', 'a', 'in', 'i', 'is', 'for', 'that']\n\n    \u003e\u003e\u003e top_n_list('es', 10)\n    ['de', 'la', 'que', 'el', 'en', 'y', 'a', 'los', 'no', 'un']\n\n`iter_wordlist(lang, wordlist='best')` iterates through all the words in a\nwordlist, in descending frequency order.\n\n`get_frequency_dict(lang, wordlist='best')` returns all the frequencies in\na wordlist as a dictionary, for cases where you'll want to look up a lot of\nwords and don't need the wrapper that `word_frequency` provides.\n\n`available_languages(wordlist='best')` returns a dictionary whose keys are\nlanguage codes, and whose values are the data file that will be loaded to\nprovide the requested wordlist in each language.\n\n`get_language_info(lang)` returns a dictionary of information about how we\npreprocess text in this language, such as what script we expect it to be\nwritten in, which characters we normalize 
together, and how we tokenize it.\nSee its docstring for more information.\n\n`random_words(lang='en', wordlist='best', nwords=5, bits_per_word=12)`\nreturns a selection of random words, separated by spaces. `bits_per_word=n`\nwill select each random word from 2^n words.\n\nIf you happen to want an easy way to get [a memorable, xkcd-style\npassword][xkcd936] with 60 bits of entropy, this function will almost do the\njob. In this case, you should actually run the similar function\n`random_ascii_words`, limiting the selection to words that can be typed in\nASCII. But maybe you should just use [xkpa][].\n\n[xkcd936]: https://xkcd.com/936/\n[xkpa]: https://github.com/beala/xkcd-password\n\n## Tokenization\n\nwordfreq uses the Python package `regex`, which is a more advanced\nimplementation of regular expressions than the standard library, to\nseparate text into tokens that can be counted consistently. `regex`\nproduces tokens that follow the recommendations in [Unicode\nAnnex #29, Text Segmentation][uax29], including the optional rule that\nsplits words between apostrophes and vowels.\n\nThere are exceptions where we change the tokenization to work better\nwith certain languages:\n\n- In Arabic and Hebrew, it additionally normalizes ligatures and removes\n  combining marks.\n\n- In Japanese and Korean, instead of using the regex library, it uses the\n  external library `mecab-python3`. This is an optional dependency of wordfreq,\n  and compiling it requires the `libmecab-dev` system package to be installed.\n\n- In Chinese, it uses the external Python library `jieba`, another optional\n  dependency.\n\n- While the @ sign is usually considered a symbol and not part of a word,\n  wordfreq will allow a word to end with \"@\" or \"@s\". 
This is one way of\n  writing gender-neutral words in Spanish and Portuguese.\n\n[uax29]: http://unicode.org/reports/tr29/\n\nWhen wordfreq's frequency lists are built in the first place, the words are\ntokenized according to this function.\n\n    \u003e\u003e\u003e from wordfreq import tokenize\n    \u003e\u003e\u003e tokenize('l@s niñ@s', 'es')\n    ['l@s', 'niñ@s']\n    \u003e\u003e\u003e zipf_frequency('l@s', 'es')\n    3.03\n\nBecause tokenization in the real world is far from consistent, wordfreq will\nalso try to deal gracefully when you query it with texts that actually break\ninto multiple tokens:\n\n    \u003e\u003e\u003e zipf_frequency('New York', 'en')\n    5.32\n    \u003e\u003e\u003e zipf_frequency('北京地铁', 'zh')  # \"Beijing Subway\"\n    3.29\n\nThe word frequencies are combined with the half-harmonic-mean function in order\nto provide an estimate of what their combined frequency would be. In Chinese,\nwhere the word breaks must be inferred from the frequency of the resulting\nwords, there is also a penalty to the word frequency for each word break that\nmust be inferred.\n\nThis method of combining word frequencies implicitly assumes that you're asking\nabout words that frequently appear together. It's not multiplying the\nfrequencies, because that would assume they are statistically unrelated. So if\nyou give it an uncommon combination of tokens, it will hugely over-estimate\ntheir frequency:\n\n    \u003e\u003e\u003e zipf_frequency('owl-flavored', 'en')\n    3.3\n\n## Multi-script languages\n\nTwo of the languages we support, Serbian and Chinese, are written in multiple\nscripts. To avoid spurious differences in word frequencies, we automatically\ntransliterate the characters in these languages when looking up their words.\n\nSerbian text written in Cyrillic letters is automatically converted to Latin\nletters, using standard Serbian transliteration, when the requested language is\n`sr` or `sh`. 
If you request the word list as `hr` (Croatian) or `bs`\n(Bosnian), no transliteration will occur.\n\nChinese text is converted internally to a representation we call\n\"Oversimplified Chinese\", where all Traditional Chinese characters are replaced\nwith their Simplified Chinese equivalent, *even if* they would not be written\nthat way in context. This representation lets us use a straightforward mapping\nthat matches both Traditional and Simplified words, unifying their frequencies\nwhen appropriate, and does not appear to create clashes between unrelated words.\n\nEnumerating the Chinese wordlist will produce some unfamiliar words, because\npeople don't actually write in Oversimplified Chinese, and because in\npractice Traditional and Simplified Chinese also have different word usage.\n\n## Similar, overlapping, and varying languages\n\nAs much as we would like to give each language its own distinct code and its\nown distinct word list with distinct source data, there aren't actually sharp\nboundaries between languages.\n\nSometimes, it's convenient to pretend that the boundaries between languages\ncoincide with national borders, following the maxim that \"a language is a\ndialect with an army and a navy\" (Max Weinreich). This gets complicated when the\nlinguistic situation and the political situation diverge. Moreover, some of our\ndata sources rely on language detection, which of course has no idea which\ncountry the writer of the text belongs to.\n\nSo we've had to make some arbitrary decisions about how to represent the\nfuzzier language boundaries, such as those within Chinese, Malay, and\nCroatian/Bosnian/Serbian.\n\nSmoothing over our arbitrary decisions is the fact that we use the `langcodes`\nmodule to find the best match for a language code. 
If you ask for word\nfrequencies in `cmn-Hans` (the fully specific language code for Mandarin in\nSimplified Chinese), you will get the `zh` wordlist, for example.\n\n## Additional CJK installation\n\nChinese, Japanese, and Korean have additional external dependencies so that\nthey can be tokenized correctly. They can all be installed at once by requesting\nthe 'cjk' feature:\n\n    pip install wordfreq[cjk]\n\nYou can put `wordfreq[cjk]` in a list of dependencies, such as the\n`[tool.poetry.dependencies]` list of your own project.\n\nTokenizing Chinese depends on the `jieba` package, tokenizing Japanese depends\non `mecab-python3` and `ipadic`, and tokenizing Korean depends on `mecab-python3`\nand `mecab-ko-dic`.\n\nAs of version 2.4.2, you no longer have to install dictionaries separately.\n\n## License\n\n`wordfreq` is freely redistributable under the Apache license (see\n`LICENSE.txt`), and it includes data files that may be\nredistributed under a Creative Commons Attribution-ShareAlike 4.0\nlicense (\u003chttps://creativecommons.org/licenses/by-sa/4.0/\u003e).\n\n`wordfreq` contains data extracted from Google Books Ngrams\n(\u003chttp://books.google.com/ngrams\u003e) and Google Books Syntactic Ngrams\n(\u003chttp://commondatastorage.googleapis.com/books/syntactic-ngrams/index.html\u003e).\nThe terms of use of this data are:\n\n    Ngram Viewer graphs and data may be freely used for any purpose, although\n    acknowledgement of Google Books Ngram Viewer as the source, and inclusion\n    of a link to http://books.google.com/ngrams, would be appreciated.\n\n`wordfreq` also contains data derived from the following Creative Commons-licensed\nsources:\n\n- The Leeds Internet Corpus, from the University of Leeds Centre for Translation\n  Studies (\u003chttp://corpus.leeds.ac.uk/list.html\u003e)\n\n- Wikipedia, the free encyclopedia (\u003chttp://www.wikipedia.org\u003e)\n\n- ParaCrawl, a multilingual Web crawl (\u003chttps://paracrawl.eu\u003e)\n\nIt contains data 
from OPUS OpenSubtitles 2018\n(\u003chttp://opus.nlpl.eu/OpenSubtitles.php\u003e), whose data originates from the\nOpenSubtitles project (\u003chttp://www.opensubtitles.org/\u003e) and may be used with\nattribution to OpenSubtitles.\n\nIt contains data from various SUBTLEX word lists: SUBTLEX-US, SUBTLEX-UK,\nSUBTLEX-CH, SUBTLEX-DE, and SUBTLEX-NL, created by Marc Brysbaert et al.\n(see citations below) and available at\n\u003chttp://crr.ugent.be/programs-data/subtitle-frequencies\u003e.\n\nI (Robyn Speer) have obtained permission by e-mail from Marc Brysbaert to\ndistribute these wordlists in wordfreq, to be used for any purpose, not just\nfor academic use, under these conditions:\n\n- Wordfreq and code derived from it must credit the SUBTLEX authors.\n- It must remain clear that SUBTLEX is freely available data.\n\nThese terms are similar to the Creative Commons Attribution-ShareAlike license.\n\nSome additional data was collected by a custom application that watches the\nstreaming Twitter API, in accordance with Twitter's Developer Agreement \u0026\nPolicy. This software gives statistics about words that are commonly used on\nTwitter; it does not display or republish any Twitter content.\n\n## Can I convert wordfreq to a more convenient form for my purposes, like a CSV file?\n\nNo. The CSV format does not have any space for attribution or license\ninformation, and therefore does not follow the CC BY-SA license. Even if you\ntried to include the proper attribution in a header or in another file, someone\nwould likely just strip it out.\n\nwordfreq isn't particularly separable from its code, anyway. 
It depends on its\nnormalization and word segmentation process, which is implemented in Python\ncode, to give appropriate results.\n\nA reasonable way to transform wordfreq would be to port the library to another\nprogramming language, with all credits included and packaged in the usual way\nfor that language.\n\n\n## Citing wordfreq\n\nIf you use wordfreq in your research, please cite it! We publish the code\nthrough Zenodo so that it can be reliably cited using a DOI. The current\ncitation is:\n\n\u003e Robyn Speer. (2022). rspeer/wordfreq: v3.0 (v3.0.2). Zenodo. https://doi.org/10.5281/zenodo.7199437\n\nThe same citation in BibTeX format:\n\n```\n@software{robyn_speer_2022_7199437,\n  author       = {Robyn Speer},\n  title        = {rspeer/wordfreq: v3.0},\n  month        = sep,\n  year         = 2022,\n  publisher    = {Zenodo},\n  version      = {v3.0.2},\n  doi          = {10.5281/zenodo.7199437},\n  url          = {https://doi.org/10.5281/zenodo.7199437}\n}\n```\n\n## Citations to work that wordfreq is built on\n\n- Bojar, O., Chatterjee, R., Federmann, C., Haddow, B., Huck, M., Hokamp, C.,\n  Koehn, P., Logacheva, V., Monz, C., Negri, M., Post, M., Scarton, C.,\n  Specia, L., \u0026 Turchi, M. (2015). Findings of the 2015 Workshop on Statistical\n  Machine Translation.\n  \u003chttp://www.statmt.org/wmt15/results.html\u003e\n\n- Brysbaert, M. \u0026 New, B. (2009). Moving beyond Kucera and Francis: A Critical\n  Evaluation of Current Word Frequency Norms and the Introduction of a New and\n  Improved Word Frequency Measure for American English. Behavior Research\n  Methods, 41 (4), 977-990.\n  \u003chttp://sites.google.com/site/borisnew/pub/BrysbaertNew2009.pdf\u003e\n\n- Brysbaert, M., Buchmeier, M., Conrad, M., Jacobs, A.M., Bölte, J., \u0026 Böhl, A.\n  (2011). The word frequency effect: A review of recent developments and\n  implications for the choice of frequency estimates in German. 
Experimental\n  Psychology, 58, 412-424.\n\n- Cai, Q., \u0026 Brysbaert, M. (2010). SUBTLEX-CH: Chinese word and character\n  frequencies based on film subtitles. PLoS One, 5(6), e10729.\n  \u003chttp://journals.plos.org/plosone/article?id=10.1371/journal.pone.0010729\u003e\n\n- Davis, M. (2012). Unicode text segmentation. Unicode Standard Annex, 29.\n  \u003chttp://unicode.org/reports/tr29/\u003e\n\n- Halácsy, P., Kornai, A., Németh, L., Rung, A., Szakadát, I., \u0026 Trón, V.\n  (2004). Creating open language resources for Hungarian. In Proceedings of the\n  4th international conference on Language Resources and Evaluation (LREC2004).\n  \u003chttp://mokk.bme.hu/resources/webcorpus/\u003e\n\n- Keuleers, E., Brysbaert, M. \u0026 New, B. (2010). SUBTLEX-NL: A new frequency\n  measure for Dutch words based on film subtitles. Behavior Research Methods,\n  42(3), 643-650.\n  \u003chttp://crr.ugent.be/papers/SUBTLEX-NL_BRM.pdf\u003e\n\n- Kudo, T. (2005). Mecab: Yet another part-of-speech and morphological\n  analyzer.\n  \u003chttp://mecab.sourceforge.net/\u003e\n\n- Lin, Y., Michel, J.-B., Aiden, E. L., Orwant, J., Brockman, W., and Petrov,\n  S. (2012). Syntactic annotations for the Google Books Ngram Corpus.\n  Proceedings of the ACL 2012 system demonstrations, 169-174.\n  \u003chttp://aclweb.org/anthology/P12-3029\u003e\n\n- Lison, P. and Tiedemann, J. (2016). OpenSubtitles2016: Extracting Large\n  Parallel Corpora from Movie and TV Subtitles. In Proceedings of the 10th\n  International Conference on Language Resources and Evaluation (LREC 2016).\n  \u003chttp://stp.lingfil.uu.se/~joerg/paper/opensubs2016.pdf\u003e\n\n- Ortiz Suárez, P. J., Sagot, B., and Romary, L. (2019). Asynchronous pipelines\n  for processing huge corpora on medium to low resource infrastructures. 
In\n  Proceedings of the Workshop on Challenges in the Management of Large Corpora\n  (CMLC-7) 2019.\n  \u003chttps://oscar-corpus.com/publication/2019/clmc7/asynchronous/\u003e\n\n- ParaCrawl (2018). Provision of Web-Scale Parallel Corpora for Official\n  European Languages. \u003chttps://paracrawl.eu/\u003e\n\n- van Heuven, W. J., Mandera, P., Keuleers, E., \u0026 Brysbaert, M. (2014).\n  SUBTLEX-UK: A new and improved word frequency database for British English.\n  The Quarterly Journal of Experimental Psychology, 67(6), 1176-1190.\n  \u003chttp://www.tandfonline.com/doi/pdf/10.1080/17470218.2013.850521\u003e\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Frspeer%2Fwordfreq","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Frspeer%2Fwordfreq","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Frspeer%2Fwordfreq/lists"}