{"id":16294453,"url":"https://github.com/scriptsmith/topwords","last_synced_at":"2026-01-21T13:01:52.932Z","repository":{"id":97266010,"uuid":"201351301","full_name":"ScriptSmith/topwords","owner":"ScriptSmith","description":"A list of the top 3 million+ English words in Project Gutenberg, along with their frequency.","archived":false,"fork":false,"pushed_at":"2020-10-26T22:42:13.000Z","size":55699,"stargazers_count":13,"open_issues_count":0,"forks_count":2,"subscribers_count":2,"default_branch":"master","last_synced_at":"2025-04-09T12:47:29.262Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":"","language":null,"has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"cc-by-sa-4.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/ScriptSmith.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE.txt","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2019-08-08T23:17:35.000Z","updated_at":"2025-02-01T17:46:14.000Z","dependencies_parsed_at":"2023-05-06T17:00:33.941Z","dependency_job_id":null,"html_url":"https://github.com/ScriptSmith/topwords","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/ScriptSmith/topwords","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ScriptSmith%2Ftopwords","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ScriptSmith%2Ftopwords/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ScriptSmith%2Ftopwords/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ScriptSmith%2Ftopwords/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hos
ts/GitHub/owners/ScriptSmith","download_url":"https://codeload.github.com/ScriptSmith/topwords/tar.gz/refs/heads/master","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ScriptSmith%2Ftopwords/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":28633747,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-01-21T04:47:28.174Z","status":"ssl_error","status_checked_at":"2026-01-21T04:47:22.943Z","response_time":86,"last_error":"SSL_connect returned=1 errno=0 peeraddr=140.82.121.5:443 state=error: unexpected eof while reading","robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":false,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2024-10-10T20:15:24.955Z","updated_at":"2026-01-21T13:01:48.268Z","avatar_url":"https://github.com/ScriptSmith.png","language":null,"readme":"# Top English words\n\nA comprehensive list of the top 3 million+ English words in Project Gutenberg. 
Data is sourced from [Allison Parrish's](https://github.com/aparrish) awesome [gutenberg-dammit](https://github.com/aparrish/gutenberg-dammit) project.\n\n## Usage\n\nUse the word list:\n```\n$ head words.txt\nthe\nof\nand\nto\na\nin\nthat\ni\nhe\n```\n\nUse the word count list:\n```\n$ head counts.txt\n169852828 the\n92493412 of\n83626800 and\n69017783 to\n54796935 a\n47554786 in\n30598554 that\n30324861 i\n27900933 he\n```\n\n## Download\n\n- [Download words](https://raw.githubusercontent.com/ScriptSmith/topwords/master/words.txt)\n- [Download word counts](https://raw.githubusercontent.com/ScriptSmith/topwords/master/counts.txt)\n\nor\n\nClone this repo:\n```\ngit clone https://github.com/scriptsmith/topwords.git\ncd topwords\n```\n\n## Recreating\n\nTools used:\n\n- jq\n- parallel\n- grep\n- sed\n- GNU coreutils\n\t- tr\n\t- sort\n\t- uniq\n\t- cut\n\nThe following pattern was used to find words in the corpus:\n```regex\n[A-Za-z]+('[A-Za-z]+)?(?\u003c!('s))\n```\n\n### Clone this repo\n\n```\ngit clone https://github.com/scriptsmith/topwords.git\ncd topwords\n```\n\n### Get the data\n\nDownload and extract the [gutenberg-dammit](https://github.com/aparrish/gutenberg-dammit) data. 
This is a free resource, so don't abuse it.\n\n### Extract the words\n\nFind words in the 40000+ books whose only listed language is English:\n\n```\njq -r '.[] | select((.Language | length) == 1 and .Language[0] == \"English\") | \"gutenberg-dammit-files/\" + .\"gd-path\"' gutenberg-dammit-files/gutenberg-metadata.json | parallel \"grep -ohPf pattern.txt {}\" | tr '[:upper:]' '[:lower:]' \u003e allwords.txt\n```\n\n### Sort and count words\n\nIf your temporary directory can't hold more than 60 GiB, change the value of `TMP_DIR`.\n\n```\nTMP_DIR=/tmp\nsort -T \"$TMP_DIR\" allwords.txt | uniq -c | sed 's/^\\s*//' | sort -nr \u003e counts.txt\n```\n\n### Remove word counts\n\n```\ncut -d ' ' -f2 counts.txt \u003e words.txt\n```\n","funding_links":[],"categories":[],"sub_categories":[],"project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fscriptsmith%2Ftopwords","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fscriptsmith%2Ftopwords","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fscriptsmith%2Ftopwords/lists"}