{"id":17101413,"url":"https://github.com/hackerb9/gwordlist","last_synced_at":"2025-04-13T00:03:06.185Z","repository":{"id":37770540,"uuid":"120914356","full_name":"hackerb9/gwordlist","owner":"hackerb9","description":"All the words from Google Books, sorted by frequency","archived":false,"fork":false,"pushed_at":"2023-07-04T18:42:56.000Z","size":127710,"stargazers_count":114,"open_issues_count":7,"forks_count":21,"subscribers_count":5,"default_branch":"master","last_synced_at":"2025-03-26T18:02:43.249Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":"","language":"Shell","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/hackerb9.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2018-02-09T14:18:31.000Z","updated_at":"2025-03-17T07:26:35.000Z","dependencies_parsed_at":"2023-01-23T12:01:00.651Z","dependency_job_id":null,"html_url":"https://github.com/hackerb9/gwordlist","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/hackerb9%2Fgwordlist","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/hackerb9%2Fgwordlist/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/hackerb9%2Fgwordlist/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/hackerb9%2Fgwordlist/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/hackerb9","download_url":"https://codeload.github.com/hackerb9/gwordlist/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":248
647225,"owners_count":21139084,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2024-10-14T15:25:09.004Z","updated_at":"2025-04-13T00:03:06.147Z","avatar_url":"https://github.com/hackerb9.png","language":"Shell","funding_links":[],"categories":[],"sub_categories":[],"readme":"# gwordlist\n\n| ***** |*NOTICE*| ***** |\n|-|:----:|-|\n| |This repository serves large files using GitHub's LFS, which now charges for bandwidth. If you receive a quota error, download the tiny [1gramsbyfreq.sh](1gramsbyfreq.sh) shell script. Running that on your own machine will download Google's entire corpus (over 15 GB) and then, after much processing, prune it down to 0.25 GB.|   |\n| ***** |     | ***** |\n*     *     *     *     *\n\nThis project includes wordlists derived from [Google's ngram corpora](http://books.google.com/ngrams/) plus the programs used to automatically download and derive the lists, should you wish to build them yourself. \n\nThe most important files:\n\n* [frequency-all.txt.gz](frequency-all.txt.gz) 266 MB. Compressed list\n  of all 29 billion words in the corpus, sorted by frequency.\n  Decompresses to over 2 GB. Includes words with weird symbols,\n  numbers, misspellings, OCR errors, and foreign languages. \n  \n* [frequency-alpha-alldicts.txt](frequency-alpha-alldicts.txt) 18 MB.\n  List of the 246,591 alphabetical words that could be\n  verified against various dictionaries (GCIDE/Websters1913, WordNet, and\n  OEDv2). 
Sorted by frequency.\n\n* [1gramsbyfreq.sh:](1gramsbyfreq.sh) \n  The main shell script which downloads the data from Google and\n  extracts frequency information.\n\n\n## What does the data look like?\n\nHere's a sample of one of the files:\n\n    #RANKING   WORD                             COUNT      PERCENT   CUMULATIVE\n    1          ,                      115,513,165,249    5.799422%    5.799422%\n    2          the                    109,892,823,605    5.517249%   11.316671%\n    3          .                       86,243,850,165    4.329935%   15.646607%\n    4          of                      66,814,250,204    3.354458%   19.001065%\n    5          and                     47,936,995,099    2.406712%   21.407776%\n\nInterestingly, if this data is right, only five words make up 20% of\nall the words in books from 1880 to 2020. And two of those \"words\" are\npunctuation marks!! (Don't believe comma is a word? I've also created\nwordlists that exclude punctuation. See the files named *\"alpha\"*).\n\n## Why does this exist?\nI needed my [XKCD 936]() compliant [password generator](https://github.com/hackerb9/mkpass) to have a good list of words in order to make memorable passphrases. Most lists I've seen are not terribly good for my purposes as the words are often from extremely narrow domains. The best I found was [SCOWL](), but I didn't like that the words weren't sorted by frequency so I couldn't easily take a slice of, say, the top 4096 most frequent words.\n\nThe obvious solution was to use Google's ngram corpus which claims to have a trillion different words pruned from all the books they've scanned for books.google.com (about 4% of all books ever published, they say). Unfortunately, while some people had posted small lists, nobody had the entire list of every word sorted by frequency. So, I made this and here it is.\n\n## What can this data be used for?\nAnything you want. 
While my programs are licensed under the GNU GPL ≥3, I'm explicitly releasing the data produced under the same license that Google granted me: Creative Commons Attribution 3.0. \n\n## How many words does it *really* have? \n\nThere are 37,235,985 entries in the V3 (20200217) corpus, but it's a\nmistake to think there are 37 million *different*, *useful* words. For\nexample, 6% of the words found are a single comma. Google used completely\nautomated OCR techniques to find the words and it made a lot of\nmistakes. Moreover, their definition of a “word” includes things like\n`s`, `A4oscow`, `IIIIIIIIIIIIIIIIIIIIIIIIIIIII`, `cuando`, `لاامش`,\n`ihm`, `SpecialMarkets@ThomasNelson`, `buisness`[sic], and `,`. \n\nTo compensate, they only included words in the corpus that appeared at\nleast 40 times, but even so there's so much dreck at the bottom of the\nlist that it's really not worth bothering with. Personally, I found that\nwords that appeared over 100,000 times tended to be worthwhile. In\naddition, I was getting so many obvious OCR errors that I decided to\nalso create some cleaner lists by using [dict](http://dict.org) to\ncheck every word against a dictionary. (IMPORTANT NOTE! If you run\nthese scripts, be sure to set up your own dictd so you're not pounding\nthe internet servers with a bazillion lookups.)\n\nAfter pruning with dictionaries, I found that 65536 words seemed like a\nmore reasonable cutoff. However, the script currently does\nnot limit the number of words. Because this part has not been\noptimized yet, it can take a *very* long time. For faster runs, set\n`maxcount=65536`.\n\n## How big are the files? \n\nIf you run my scripts (which are tiny) they\nwill download about 14 GiB of data from Google. However, if you simply\nwant [the final\nlist](https://github.com/hackerb9/gwordlist/blob/master/1gramsbyfreq.txt.gz),\nit uncompresses to over 350 MB. 
Alternatively, if you don't need so\nmany words, consider downloading one of the smaller files I created that\nhave been cleaned up and limited to only the top words verified in dictionaries, such as\n[frequency-alpha-alldicts.txt](https://github.com/hackerb9/gwordlist/blob/master/frequency-alpha-alldicts.txt).\n\n## What got thrown away in these subcorpora?\nAs you can guess, since the file size went down by 90%, I tossed a lot of info. The biggest changes were from losing the separate counts for each year, ignoring the tags for part of speech (e.g., I used only the count for \"watch\", which includes the counts for watch_VERB and watch_NOUN), and from combining different capitalizations into a single term. (Each word is listed under its most frequent capitalization: for example, \"London\", instead of \"london\".) If you need that data, it's not hard to modify the scripts. Let me know if you have trouble.\n\n## What got added?\nI counted up the total number of words in all the books so I could get a rough percentage of how often each word was being used in English. I also include a running total of the percentage so you can truncate the file wherever you want (e.g., to get a list covering 95% of all words used in English). \n\n## Part of Speech tags\n\nThe corpus includes words suffixed with an underscore and then a tag\nmarking the part of speech the word appears to have been used as. For\nexample: \n\n```\n#5101      watch               \t    76,770,311\t    0.001284%\t   85.124506%\n#8225      watch_VERB          \t    44,060,908\t    0.000737%\t   88.174382%\n#10464     watch_NOUN          \t    32,697,074\t    0.000547%\t   89.601624%\n```\n\n* Words tagged with part of speech appear to be simply duplicate\n  counts of the root word. 
In the example of `watch` above, note that\n  76,770,311 ≈ 44,060,908 + 32,697,074.\n\n* List of Part of Speech tags (from books.google.com/ngrams/info)\n  _NOUN_\tnoun\t\t(Examples: `time_NOUN`, `State_NOUN`, `Mr._NOUN`)\n  _VERB_\tverb\t\t(Examples: `is_VERB`, `be_VERB`, `have_VERB`)\n  _ADJ_\t\tadjective\t(Examples: `other_ADJ`, `such_ADJ`, `same_ADJ`)\n  _ADV_\t\tadverb\t\t(Examples: `not_ADV`, `when_ADV`, `so_ADV`)\n  _PRON_\tpronoun\t\t(Examples: `it_PRON`, `I_PRON`, `he_PRON`)\n  _DET_\t\tdeterminer or article\t(Examples: `the_DET`, `a_DET`, `this_DET`)\n  _ADP_\t\tan adposition: either a preposition or a postposition\t(Examples: `of_ADP`, `in_ADP`, `for_ADP`)\n  _NUM_ \tnumeral\t\t(Examples: `one_NUM`, `1_NUM`, `2001_NUM`)\n  _CONJ_\tconjunction (Examples: `and_CONJ`, `or_CONJ`, `but_CONJ`)\n  _PRT_\t\tparticle\t(Examples: `to_PRT`, `'s_PRT`, `'_PRT`, `out_PRT`)\n\n* Part of speech tags, undocumented by Google:\n  _._\t\tpunctuation (Example: `,_.`)\n  _X_\t\t??? (Examples: `[_X`, `*_X`, `=_X`, `etc._X`, `de_X`, `No_X`)\n\n* Google uses these tags for searching, but they don't appear (at least in 1-grams):\n  _ROOT_\troot of the parse tree\tThese tags must stand alone (e.g., _START_)\n  _START_\tstart of a sentence\n  _END_\t\tend of a sentence\n\n\n## Bugs\n\n* The script cannot run on a 32-bit machine because it briefly requires more\n  than 4 GiB of RAM while it builds a hash table of every word.\n\n\n## To Do\n\n* Use a Makefile for dependencies so that multiprocessing is built in\n  (using `make -j`), instead of having to append \u0026 to commands.\n\n* Use `comm` instead of `dict` to check wordlists against\n  dictionaries. \n\n* The number of books a word occurs in should help determine popularity.\n  Perhaps _popularity_ = _occurrences_ × _books_ ? \n\n## LFS\n\nGitHub does not allow files larger than 100 MB. 
The file\nfrequency-all.txt.gz is 266 MB, so it has been placed on\n[git-lfs](https://git-lfs.github.com).\n\n## Misc Notes\n\n* Hyphenated words do not appear in the 1-gram list. Why not? Perhaps\n  they are considered 2-grams?\n\n* I may need a manually created \"stopword\" list due to all the\n  obviously non-English words appearing in the list.\n\n* Some of the 1-grams I'm turning up as quite popular should actually\n  be 2-grams: e.g. York -\u003e New York. Maybe I should add in 2-grams to\n  the list, since some of them will clearly be in the list of most\n  common \"words\".\n\n* Some words *should* be capitalized, such as \"I\" and \"London\".\n  But it makes sense to accumulate \"the\" and \"The\", since otherwise\n  both will be listed among the most common words.\n\n  Solution: Accumulate twice. First time case-sensitive. Sort by\n  frequency. Then, second time, case-insensitive, outputting the\n  *first* variation found. \n\n\n* I'm currently getting some very strange, or at least\n  unexpected, results.  While the top 100 words seem reasonably common,\n  there are some strangely highly ranked words:\n\n\t124 s\n\t147 p\n\t151 J\n\t165 de\n\t202 M\n\t209 general\n\t214 B\n\t225 S\n\t226 Mr\n\t228 York\n\t238 D\n\t241 government\n\t254 R\n\t272 et\n\t282 E\n\t291 John\n\t292 University\n\t294 U\n\t309 H\n\t325 P\n\t328 pp\n\t359 English\n\t365 L\n\t371 v\n\t373 London\n\t390 W\n\t391 Fig\n\t399 e\n\t405 F\n\t422 Figure\n\t426 G\n\t444 British\n\t445 T\n\t446 c\n\t455 N\n\t466 II\n\t472 b\n\t478 French\n\t479 England\n\t508 St\n\t509 General\n\nCompare that with common words that are found much less frequently:\n\n\t2124 eat\n\t4004 TV\n\t6040 ate\n\t6041 bedroom\n\t6138 fool\n\t10007 foul\n\t10012 swim\n\t10017 sore\n\t15013 lone\n\t15020 doom\n\t\n* Certain domain-specific texts use the same words over and over.\n  For example, looking at just the total number of uses, _bats_ and\n  _psychosocial_ are equally common. 
However, _bats_ is used in twice\n  as many books. \n  \n  * I should weight the number of occurrences by the number of\n    different books the word is found in. But what is the proper weight?\n\t\n  * The most naive approach would be to multiply the two numbers together. \n    Is that sufficient? Defensible?\n\t\n        def f(occurrences, books):\n          return occurrences * books\n\n  * Can I ignore occurrences? What would a sort by number of books the\n    word has occurred in at least once look like?\n\n        def f(occurrences, books):\n          return books\n\n* Am I adding things up correctly? The fewest times any word\n  is found is 40. Was that a cutoff when they were creating the\n  corpus, presuming words that showed up fewer times than that were OCR\n  errors? \n\n  _Yes._ \n\n\n* There are a bunch of nonsense/units/foreign words mixed into this\n corpus. How can I get rid of them all easily?\n\n** Maybe I can get a list of unit abbreviations and grep them out?\n\n      lbs, J, gm, ppm\n\n** Maybe look up words in gcide and reject non-existent words? 
OED is too liberal.\n\n   cuando, aro, ihm\n\n** A lot of the words that are of type \"_X\" are suspicious and there are\n   only 159 of them in the over-1E6 list.\n\n*** Some are not in WordNet and can be easily discarded:\n\n    et dem bei durch deux der per je ibid wird und auf su comme lui que ch\n    della hoc quam del ou auch bien cette les zur sont seq ont du che\n    facto leur nur di una einer entre ich op sich avec um mais qui nicht\n    inasmuch zum peut dans por ah vel quae los eine vous esse sunt im quod\n    nach como une ein aux wie ist lo sie fait las aus werden dei\n\n*** However, that still leaves 83 that are not as easy:\n\n    de e el il au r u tout hell esp b d est sur iv pas sa nous ni z la f\n    se in das chap fig er oder des ii iii m mit als dear alas ma c le o h\n    ex para j vii mi no yes den x oh vi ut bye mm en die l zu v well pro w\n    ab al un si ne ce es k cf viii i y non ad g cum ha sind te\n\n*** Most of the real words (\"well\", \"hell\", \"dear\", \"chap\", \"no\",\n    \"den\", \"die\") show up as\n    other parts of speech. On the other hand, words like \"bye\", \"yes\",\n    and \"alas\" are definitely words, and they're not listed under any\n    other type than _X. (What does _X mean? Interjection?)\n\n* At first I tried accumulating a different count for each usage of a\n  word (e.g., watch_VERB and watch_NOUN), but that meant some words\n  would split their vote and not be listed among the most common. \n  [This does not seem to be the case in the 2020 dataset, in which the\n  count for the plain word equals the total of the various part-of-speech\n  versions].\n\n  Also, it meant I had many duplicates of the same word. \n\n  The current solution is to skip any words with an underscore in them. 
\n\n  \t       \n  r_ADP   1032605\t\t\t\tout_ADV 9199818\n  r_CONJ  1048981\t\t\t\tout_ADJ 8645123\n  r_PRON  1019601\t\t\t\tout_PRT 332451517\n  r_NUM   3316486\t\t\t\tout_NOUN        4462386\n  r_X     3125051\t\t\t\tout_ADP 159492310\n  r_NOUN  2975438\n  r_VERB  2931183\n  r_PRT   1044181\n  r_ADJ   2327691\n\n* There are 117 words with no vowels, none of them real words.\n  grep '^[^aeiouy]*_' foo\n\n* Some words are abbreviations:\n\n\tcit_NOUN (Webster's says it means citizen, but given how\n\t\t commonly used it is, more often than \"dogs\", maybe it\n\t\t was for citations?)\n\n* Some words make no sense whatsoever to me:\n\n        eds_NOUN\t12084339\n\n* Some words are British:\n\n\tprogramme\n\n\n* If I had some way to accumulate words to their lemmas (\"head word\"),\n  that would maybe allow me to accumulate them so they'll make the 1E6\n  threshold of useful words: (watching, watches, watched -\u003e watch)\n\n** Perhaps dict using wordnet? No. Websters? Sort of. It works for\n  'watching' -\u003e 'watch', but not 'dogs' -\u003e 'dog'.\n\n* There are some odd orderings. 
How can \"go\" be less common than\n  \"children\"?\n  \n* Some words appear to be misspellings.\n\n  buisness\n\n* Some words may be misspellings or OCR problems.\n\n  ADJOURNMEN, ADMINISTATION, bonjamin, Buddhisn\n\n* Some words are clearly OCR errors, not misspellings in the original\n\n  A1most, A1ways, A1uminum, a1titude\n  A1nerica, A1nerican\n  ADMlNlSTRATlON, LlBRARY, lNSTANT, lNTERNAL, LlVED, lDEAS, lNVERSE, lRELAND\n  lNTRODUCTlON, lNlTlAL, lNTERlM, (and on and on...)\n  areheology\n  anniverfary\n  beingdeveloped    \n  A0riculture, A0erage, 0paque, 0ndustry, 1nch\n  A9riculture, a9ain, a9ainst, a9ent, a9ked, \n  Aariculture\n  AAppppeennddiixx\n  Thmking\n  A4oscow (should be \"Moscow\")\n  \n* Some words have been mangled by Google on purpose:\n\n\tcan't, cannot -\u003e \"can not\" (bigram)\n\n* 2020 format is  WORD [ TAB YEAR,COUNT,BOOKS ]+\n      Alcohol\t1983,905,353    1984,1285,433   1985,1088,449\n\n* List of Corpora (from books.google.com/ngrams/info)\n\n  Informal corpus name   Shorthand         Persistent identifier\n  ==============================================================\n  American English 2012  eng_us_2012       googlebooks-eng-us-all-20120701\n  American English 2009  eng_us_2009       googlebooks-eng-us-all-20090715\n\t\t\t Books predominantly in the English language\n\t\t\t that were published in the United States.\n\n  British English 2012   eng_gb_2012       googlebooks-eng-gb-all-20120701\n  British English 2009   eng_gb_2009       googlebooks-eng-gb-all-20090715\n\t\t\t Books predominantly in the English language\n\t\t\t that were published in Great Britain.\n\n  English 2012           eng_2012          googlebooks-eng-all-20120701\n  English 2009           eng_2009          googlebooks-eng-all-20090715\n\t\t\t Books predominantly in the English language\n\t\t\t published in any country.\n\n  English Fiction 2012   eng_fiction_2012  googlebooks-eng-fiction-all-20120701\n  English Fiction 2009   eng_fiction_2009  
googlebooks-eng-fiction-all-20090715\n\t\t\t Books predominantly in the English language\n\t\t\t that a library or publisher identified as\n\t\t\t fiction.\n\n  English One Million    eng_1m_2009       googlebooks-eng-1M-20090715\n\t\t\t The \"Google Million\". All are in English with\n\t\t\t dates ranging from 1500 to 2008. No more than\n\t\t\t about 6000 books were chosen from any one year,\n\t\t\t which means that all of the scanned books from\n\t\t\t early years are present, and books from later\n\t\t\t years are randomly sampled. The random\n\t\t\t samplings reflect the subject distributions for\n\t\t\t the year (so there are more computer books in\n\t\t\t 2000 than 1980).\n\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fhackerb9%2Fgwordlist","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fhackerb9%2Fgwordlist","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fhackerb9%2Fgwordlist/lists"}