https://github.com/scriptsmith/topwords

A list of the top 3 million+ English words in Project Gutenberg, along with their frequency.

Host: GitHub
URL: https://github.com/scriptsmith/topwords
Owner: ScriptSmith
License: cc-by-sa-4.0
Created: 2019-08-08T23:17:35.000Z (over 6 years ago)
Default Branch: master
Last Pushed: 2020-10-26T22:42:13.000Z (over 5 years ago)
Last Synced: 2025-04-09T12:47:29.262Z (11 months ago)
Homepage:
Size: 53.1 MB
Stars: 13
Watchers: 2
Forks: 2
Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE.txt


# Top English words

A comprehensive list of the top 3 million+ English words in Project Gutenberg. Data is sourced from [Allison Parrish's](https://github.com/aparrish) awesome [gutenberg-dammit](https://github.com/aparrish/gutenberg-dammit) project.

## Usage

Use the word list:
```
$ head words.txt
the
of
and
to
a
in
that
i
he
```

Use the word count list:
```
$ head counts.txt
169852828 the
92493412 of
83626800 and
69017783 to
54796935 a
47554786 in
30598554 that
30324861 i
27900933 he
```
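
Each line of `counts.txt` is a count, a space, then the word, so a single word can be looked up with `awk`. A small sketch: `word_count` is a hypothetical helper (not part of this repo), and the sample file just copies three lines from the output above so the snippet runs without the full download.

```shell
# Hypothetical helper: print the count recorded for one exact word.
# Usage: word_count <word> [counts-file]   (defaults to counts.txt)
word_count() {
    awk -v w="$1" '$2 == w { print $1; exit }' "${2:-counts.txt}"
}

# Small sample in the same "<count> <word>" format as counts.txt
sample=$(mktemp)
printf '%s\n' '169852828 the' '92493412 of' '83626800 and' > "$sample"

word_count of "$sample"   # prints 92493412
```

Against the real file, `word_count of` reads `counts.txt` from the current directory.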

## Download

- [Download words](https://raw.githubusercontent.com/ScriptSmith/topwords/master/words.txt)
- [Download word counts](https://raw.githubusercontent.com/ScriptSmith/topwords/master/counts.txt)

Clone this repo:
```
git clone https://github.com/scriptsmith/topwords.git
cd topwords
```

## Recreating

Tools used:

- jq
- parallel
- grep
- sed
- GNU coreutils
  - tr
  - sort
  - uniq
  - cut

The following pattern was used to find words in the corpus:
```regex
[A-Za-z]+('[A-Za-z]+)?(? allwords.txt
```
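
A minimal sketch of how a pattern like this can be applied with `grep`, using only the `[A-Za-z]+('[A-Za-z]+)?` portion of the pattern. The sample sentence, the temporary files, and the lowercasing step are assumptions made for illustration; they are not the commands that were used to build this list.

```shell
# Hypothetical input standing in for one text from the corpus
sample=$(mktemp)
out=$(mktemp)
printf '%s\n' "The dog's bone, the DOG's bowl." > "$sample"

# -o prints every match on its own line, -E enables extended regex syntax.
# Lowercasing with tr is an assumption, so "The" and "the" count as one word.
grep -oE "[A-Za-z]+('[A-Za-z]+)?" "$sample" | tr '[:upper:]' '[:lower:]' > "$out"

cat "$out"
```

This writes one lowercased word per line, which is the shape the sorting step below expects as input.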

### Sort and count words

`sort` keeps its intermediate files in `TMP_DIR` (passed via `-T`). If your temporary directory can't store more than 60 GiB, change the value of `TMP_DIR` to a directory that can.

```
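# Directory where sort writes its temporary files
TMP_DIR=/tmp
# Group identical words, count each group, strip uniq's leading padding, then order by count (highest first)
sort -T "$TMP_DIR" allwords.txt | uniq -c | sed 's/^\s*//' | sort -nr > counts.txt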
```
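
To see what each stage does, the same pipeline can be run on a tiny input. This is an illustrative sketch: the six-word sample is made up, and `[[:space:]]` is used in place of `\s` so the `sed` step does not depend on GNU extensions.

```shell
# Tiny stand-in for allwords.txt (hypothetical data, one word per line)
words=$(mktemp)
printf '%s\n' the of the and the of > "$words"

# sort groups duplicates, uniq -c prefixes each group with a padded count,
# sed strips the padding, and sort -nr orders by count, highest first
counts=$(sort "$words" | uniq -c | sed 's/^[[:space:]]*//' | sort -nr)

printf '%s\n' "$counts"
# 3 the
# 2 of
# 1 and
```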

### Remove word counts

```
cut -d ' ' -f2 counts.txt > words.txt
```
