https://github.com/scriptsmith/topwords
A list of the top 3 million+ English words in Project Gutenberg, along with their frequency.
- Host: GitHub
- URL: https://github.com/scriptsmith/topwords
- Owner: ScriptSmith
- License: cc-by-sa-4.0
- Created: 2019-08-08T23:17:35.000Z (over 6 years ago)
- Default Branch: master
- Last Pushed: 2020-10-26T22:42:13.000Z (about 5 years ago)
- Last Synced: 2025-02-15T06:44:31.462Z (11 months ago)
- Homepage:
- Size: 53.1 MB
- Stars: 13
- Watchers: 3
- Forks: 2
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE.txt
README
# Top English words
A comprehensive list of the top 3 million+ English words in Project Gutenberg. Data is sourced from [Allison Parrish's](https://github.com/aparrish) awesome [gutenberg-dammit](https://github.com/aparrish/gutenberg-dammit) project.
## Usage
Use the word list:
```
$ head words.txt
the
of
and
to
a
in
that
i
he
```
Use the word count list:
```
$ head counts.txt
169852828 the
92493412 of
83626800 and
69017783 to
54796935 a
47554786 in
30598554 that
30324861 i
27900933 he
```
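Because each line of `counts.txt` is a count followed by a word, standard tools can query it directly. The following is a minimal sketch, assuming the two-column layout shown above. The file `sample_counts.txt` is a hypothetical stand-in built from the first lines of that output, so the commands run without the full download:

```shell
# Hypothetical stand-in for counts.txt, copied from the output shown above.
cat > sample_counts.txt <<'EOF'
169852828 the
92493412 of
83626800 and
69017783 to
54796935 a
EOF

# Look up the count for one word (exact match on the second column).
awk -v w="and" '$2 == w { print $1 }' sample_counts.txt

# Percentage of all counted occurrences in this file covered by the top 2 words.
awk '{ total += $1; if (NR <= 2) top += $1 }
     END { printf "%.1f\n", 100 * top / total }' sample_counts.txt
```

To run the same queries against the real data, replace `sample_counts.txt` with `counts.txt`.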
## Download
- [Download words](https://raw.githubusercontent.com/ScriptSmith/topwords/master/words.txt)
- [Download word counts](https://raw.githubusercontent.com/ScriptSmith/topwords/master/counts.txt)
Or clone this repo:
```
git clone https://github.com/scriptsmith/topwords.git
cd topwords
```
## Recreating
Tools used:
- jq
- parallel
- grep
- sed
- GNU coreutils
  - tr
  - sort
  - uniq
  - cut
The following pattern was used to find words in the corpus. The pattern is cut off after `(?` here, and the step that applied it to produce `allwords.txt` is missing:
```regex
[A-Za-z]+('[A-Za-z]+)?(?
```
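To illustrate how the readable part of the pattern behaves, here is a minimal sketch using `grep`. It is not the original extraction command: it uses only the prefix `[A-Za-z]+('[A-Za-z]+)?`, the input file `sample.txt` is hypothetical, and lowercasing with `tr` is an assumption based on the lowercase entries in `words.txt`.

```shell
# Hypothetical sample input; the real corpus comes from gutenberg-dammit.
printf '%s\n' "The dog's bone, the Cat's toy." > sample.txt

# -o prints each match on its own line; -E enables extended regex syntax.
# Only the readable prefix of the pattern is used here.
# tr lowercases the matches (an assumption, see above).
grep -oE "[A-Za-z]+('[A-Za-z]+)?" sample.txt | tr 'A-Z' 'a-z' > sample_words.txt

cat sample_words.txt
```

With this prefix, an apostrophe is kept only when letters appear on both sides of it, so `dog's` stays a single word while the surrounding punctuation is dropped.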
### Sort and count words
`sort` writes its temporary files to the directory given by `-T`. If your temporary directory can't store more than 60GiB, change the value of `TMP_DIR` to a directory that can.
```
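# Directory where sort writes its temporary files.
TMP_DIR=/tmp
# Count each distinct word, strip uniq's leading padding, then order by count (highest first).
sort -T "$TMP_DIR" allwords.txt | uniq -c | sed 's/^\s*//' | sort -nr > counts.txt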
```
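The counting pipeline can be tried on a tiny input to see what each stage contributes. This is a sketch under stated assumptions: `mini_allwords.txt` is a hypothetical miniature of `allwords.txt`, and `sed 's/^ *//'` is swapped in for `\s` because `\s` is a GNU extension.

```shell
# Hypothetical miniature allwords.txt: one lowercase word per line.
printf '%s\n' the of the and the of > mini_allwords.txt

# 1. sort groups identical words together so uniq can count them.
# 2. uniq -c prefixes each distinct word with its count, padded with spaces.
# 3. sed strips that leading padding.
# 4. sort -nr orders the lines by count, highest first.
sort mini_allwords.txt | uniq -c | sed 's/^ *//' | sort -nr > mini_counts.txt

cat mini_counts.txt
```

The first `sort` is required because `uniq` only merges adjacent duplicate lines.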
### Remove word counts
```
cut -d ' ' -f2 counts.txt > words.txt
```