{"id":43447805,"url":"https://github.com/sts10/tidy","last_synced_at":"2026-02-12T05:48:50.727Z","repository":{"id":40361677,"uuid":"286079219","full_name":"sts10/tidy","owner":"sts10","description":"Combine and clean word lists","archived":false,"fork":false,"pushed_at":"2026-01-14T18:15:10.000Z","size":561,"stargazers_count":95,"open_issues_count":16,"forks_count":4,"subscribers_count":2,"default_branch":"main","last_synced_at":"2026-02-03T12:16:01.160Z","etag":null,"topics":["entropy","information-theory","rust","rust-lang","wordlist","wordlist-generator","wordlist-processing","wordlist-technique"],"latest_commit_sha":null,"homepage":"https://sts10.github.io/2021/12/09/tidy-0-2-0.html","language":"Rust","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/sts10.png","metadata":{"files":{"readme":"readme.markdown","changelog":"CHANGELOG.md","contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null,"notice":null,"maintainers":null,"copyright":null,"agents":null,"dco":null,"cla":null}},"created_at":"2020-08-08T16:29:13.000Z","updated_at":"2026-01-24T02:16:26.000Z","dependencies_parsed_at":"2023-12-03T17:41:11.191Z","dependency_job_id":"a911cf06-582c-41d0-a1c0-6605e786299b","html_url":"https://github.com/sts10/tidy","commit_stats":null,"previous_names":[],"tags_count":30,"template":false,"template_full_name":null,"purl":"pkg:github/sts10/tidy","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/sts10%2Ftidy","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/sts10%2Ftidy/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/sts10%2Ftidy/rele
ases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/sts10%2Ftidy/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/sts10","download_url":"https://codeload.github.com/sts10/tidy/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/sts10%2Ftidy/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":29359898,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-02-12T01:03:07.613Z","status":"online","status_checked_at":"2026-02-12T02:00:06.911Z","response_time":55,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["entropy","information-theory","rust","rust-lang","wordlist","wordlist-generator","wordlist-processing","wordlist-technique"],"created_at":"2026-02-03T01:07:47.004Z","updated_at":"2026-02-12T05:48:50.688Z","avatar_url":"https://github.com/sts10.png","language":"Rust","funding_links":[],"categories":[],"sub_categories":[],"readme":"# Tidy\n[![](https://deps.rs/repo/github/sts10/tidy/status.svg)](https://deps.rs/repo/github/sts10/tidy)\n\nA command-line tool for combining and cleaning large word list files.\n\n\u003e A throw of the dice will never abolish chance. 
— Stéphane Mallarmé\n\n## What this tool aims to help users do\n\nTidy aims to help users create \"_better_\" word lists -- generally word lists that will be used to create passphrases like \"block insoluble cardinal discounts\".\n\nTidy performs basic list-cleaning operations like removing duplicate words and blank lines by default. It additionally provides various optional standardizations and filters, like lowercasing all words (`-l`), or removing words with integers in them (`-I`), as well as protections against rare-but-possible passphrase pitfalls, such as prefix codes (`-P`) and low minimum word lengths (see below for explanations).\n\nTidy can also make word lists more \"typo-resistant\" by enforcing a minimum edit distance (`-e`), removing homophones, and/or enforcing a unique prefix length (`-x`), which can allow users to auto-complete words after a specified number of characters.\n\nTidy can be used to **create new word lists** (for example, if given more than one list, Tidy will combine and de-duplicate them) with desirable qualities. 
It can also be used to **edit** existing word lists.\n\n### Other resources\n* If you want to _audit_ an existing word list without editing it, Tidy can do that, but I'd suggest using my related [Word List Auditor](https://github.com/sts10/wla).\n* If you just want some word lists, you can check out my [Orchard Street Wordlists](https://github.com/sts10/orchard-street-wordlists).\n\n## Tidy's features\n\nGiven a text file with one word per line, this tool will create a new word list in which...\n\n-   duplicate lines (words) are removed\n-   empty lines have been removed\n-   whitespace from beginning and end of words is deleted\n-   words are sorted alphabetically (though this can be optionally prevented -- see below)\n\nand print that new word list to the terminal or to a text file.\n\nOptionally, the tool can...\n\n-   combine two or more inputted word lists\n-   make all characters lowercase (`-l`)\n-   set a minimum and maximum for word lengths\n-   handle words with integers and non-alphanumeric characters\n-   delete all characters after or before a delimiter (`-d`/`-D`)\n-   take lists of words to reject or allow\n-   remove homophones from a provided list of comma-separated pairs of homophones\n-   enforce a minimum [edit distance](https://en.wikipedia.org/wiki/Edit_distance) between words\n-   remove prefix words (see below) (`-P`)\n-   remove suffix words (`-S`)\n-   remove all words with non-alphabetic characters from new list\n-   straighten curly/smart quotes, i.e. replacing them with their \"straight\" equivalents (`-q`)\n-   guarantee a maximum shared prefix length (see below) (`-x`)\n-   normalize Unicode of all characters of all words on list to a specified [normalization form](https://www.unicode.org/faq/normalization.html) (NFC, NFKD, etc.) (`-z`)\n-   print corresponding dice rolls before words, separated by a tab. Dice can have 2 to 36 sides. 
(`--dice`)\n-   print information about the new list, such as entropy per word, to the terminal (`-A`, `-AA`, `-AAA`, or `-AAAA` depending on how much information you want printed)\n\nand more!\n\nNOTE: If you do NOT want Tidy to sort list alphabetically, you can use the `--no-sort` option.\n\n## Usage\n\n```txt\nUsage: tidy [OPTIONS] \u003cInputted Word Lists\u003e...\n\nArguments:\n  \u003cInputted Word Lists\u003e...\n          Word list input files. Can be more than one, in which case they'll be\n          combined and de-duplicated. Requires at least one file\n\nOptions:\n  -a, --approve \u003cAPPROVED_LIST\u003e\n          Path(s) for optional list of approved words. Can accept multiple files\n\n  -A, --attributes...\n          Print attributes about new list to terminal. Can be used more than once to\n          print more attributes. Some attributes may take a nontrivial amount of time\n          to calculate\n\n  -j, --json\n          Print attributes and word samples in JSON format\n\n      --cards\n          Print playing card abbreviation next to each word. Strongly recommend only\n          using on lists with lengths that are powers of 26 (26^1, 26^2, 26^3, etc.)\n\n      --debug\n          Debug mode\n\n  -d, --delete-after \u003cDELETE_AFTER_DELIMITER\u003e\n          Delete all characters after the first instance of the specified delimiter\n          until the end of line (including the delimiter). Delimiter must be a single\n          character (e.g., ','). Use 't' for tab and 's' for space. May not be used\n          together with -g or -G options\n\n  -D, --delete-before \u003cDELETE_BEFORE_DELIMITER\u003e\n          Delete all characters before and including the first instance of the specified\n          delimiter. Delimiter must be a single character (e.g., ','). Use 't' for tab\n          and 's' for space. 
May not be used together with -g or -G options\n\n  -i, --delete-integers\n          Delete all integers from all words on new list\n\n  -n, --delete-nonalphanumeric\n          Delete all non-alphanumeric characters from all words on new list. Characters\n          with diacritics will remain\n\n      --dice \u003cDICE_SIDES\u003e\n          Print dice roll before word in output. Set number of sides of dice. Must be\n          between 2 and 36. Use 6 for normal dice\n\n      --dry-run\n          Dry run. Don't write new list to file or terminal\n\n  -f, --force\n          Force overwrite of output file if it exists\n\n      --homophones \u003cHOMOPHONES_LIST\u003e\n          Path(s) to file(s) containing homophone pairs. There must be one pair of\n          homophones per line, separated by a comma (sun,son). If BOTH words are found\n          on a list, the SECOND word is removed. File(s) can be a CSV (with no column\n          headers) or TXT file(s)\n\n  -g, --ignore-after \u003cIGNORE_AFTER_DELIMITER\u003e\n          Ignore characters after the first instance of the specified delimiter until the\n          end of line, treating anything before the delimiter as a word. Delimiter must be\n          a single character (e.g., ','). Use 't' for tab and 's' for space. Helpful for\n          ignoring metadata like word frequencies. Works with attribute analysis and most\n          word removal options, but not with word modifications (like to lowercase).\n          May not be used together with -d, -D or -G options\n\n  -G, --ignore-before \u003cIGNORE_BEFORE_DELIMITER\u003e\n          Ignore characters before and including the first instance of the specified\n          delimiter, treating anything after the delimiter as a word. Delimiter must\n          be a single character (e.g., ','). Use 't' for tab and 's' for space. Helpful\n          for ignoring metadata like word frequencies. 
Works with attribute analysis\n          and most word removal options, but not with word modifications (like to lowercase).\n          May not be used together with -d, -D or -g options\n\n      --locale \u003cLOCALE\u003e\n          Specify a locale for words on the list. Aids with sorting. Examples: en-US,\n          es-ES. Defaults to system LANG. If LANG environmental variable is not set,\n          uses en-US\n\n  -l, --lowercase\n          Lowercase all words on new list\n\n  -M, --maximum-word-length \u003cMAXIMUM_LENGTH\u003e\n          Set maximum word length\n\n  -x, --shared-prefix-length \u003cMAXIMUM_SHARED_PREFIX_LENGTH\u003e\n          Set number of leading characters to get to a unique prefix, which can aid\n          auto-complete functionality. Setting this value to say, 4, means that knowing\n          the first 4 characters of any word on the generated list is enough to know\n          which word it is\n\n  -e, --minimum-edit-distance \u003cMINIMUM_EDIT_DISTANCE\u003e\n          Set minimum edit distance between words, which can reduce the cost of typos\n          when entering words\n\n  -m, --minimum-word-length \u003cMINIMUM_LENGTH\u003e\n          Set minimum word length\n\n      --sort-by-length\n          Sort by word length, with longest words first. First sorts words \n          alphabetically, respecting inputted locale\n\n  --concat\n        If multiple word list files give, concatenate word lists in order given. \n        Default behavior is to \"blend\" them, like dealing playing cards in reverse\n\n  -O, --no-sort\n          Do NOT sort outputted list alphabetically. Preserves original list order. Note\n          that duplicate lines and blank lines will still be removed\n\n  -z, --normalization-form \u003cNORMALIZATION_FORM\u003e\n          Normalize Unicode of all characters of all words. Accepts nfc, nfd, nfkc,\n          or nfkd (case insensitive)\n\n  -o, --output \u003cOUTPUT\u003e\n          Path for outputted list file. 
If none given, generated word list will be printed\n          to terminal\n\n      --sides-as-base\n          When printing dice roll before word in output, print dice values according to\n          the base selected through --dice option. Effectively this means that letters will\n          be used to represent numbers higher than 9. Note that this option also 0-indexes\n          the dice values. This setting defaults to `false`, which will 1-indexed\n          dice values, and use double-digit numbers when necessary (e.g. 18-03-08)\n\n      --print-first \u003cPRINT_FIRST\u003e\n          Just before printing generated list, cut list down to a set number of\n          words. Can accept expressions in the form of base**exponent (helpful\n          for generating diceware lists). Words are selected from the beginning\n          of processed list, and before it is sorted alphabetically\n\n      --print-rand \u003cPRINT_RAND\u003e\n          Just before printing generated list, cut list down to a set number of words.\n          Can accept expressions in the form of base**exponent (helpful for generating\n          diceware lists). Cuts are done randomly\n\n      --quiet\n          Do not print any extra information\n\n  -I, --remove-integers\n          Remove all words with integers in them from list\n\n  -N, --remove-nonalphanumeric\n          Remove all words with non-alphanumeric characters from new list. Words\n          with diacritics will remain\n\n      --remove-nonalphabetic\n          Remove all words with non-alphabetic characters from new list. Words with\n          diacritcis and other non-Latin characters will remain\n\n  -L, --remove-non-latin-alphabetic\n          Remove all words with any characters not in the Latin alphabet (A through\n          Z and a through z). 
All words with accented or diacritic characters\n          will be removed, as well as any words with puncuation and internal whitespace\n\n  -C, --remove-nonascii\n          Remove all words that have any non-ASCII characters from new list\n\n  -P, --remove-prefix\n          Remove prefix words from new list\n\n  -S, --remove-suffix\n          Remove suffix words from new list\n\n  -r, --reject \u003cREJECT_LIST\u003e\n          Path(s) for optional list of words to reject. Can accept multiple files\n\n  -s, --samples\n          Print a handful of pseudorandomly selected words from the created list\n          to the terminal. Should NOT be used as secure passphrases\n\n  -K, --schlinkert-prune\n          Use Sardinas-Patterson algorithm to remove words to make list\n          uniquely decodable. Experimental!\n\n      --skip-rows-start \u003cSKIP_ROWS_START\u003e\n          Skip first number of lines from inputted files. Useful for dealing\n          with headers like from PGP signatures\n\n      --skip-rows-end \u003cSKIP_ROWS_END\u003e\n          Skip last number of lines from inputted files. Useful for dealing\n          with footers like from PGP signatures\n\n  -q, --straighten\n          Replace “smart” quotation marks, both “double” and ‘single’, with\n          their \"straight\" versions\n\n      --take-first \u003cTAKE_FIRST\u003e\n          Only take first N words from inputted word list. If two or more word\n          list files are inputted, it will combine all given lists by alternating words\n          from the given word list files until it has N words\n\n      --take-rand \u003cTAKE_RAND\u003e\n          Only take a random N number of words from inputted word list. If two or more\n          word lists are inputted, it will combine arbitrarily and then take a random\n          N words. 
If you're looking to cut a list exactly to a specified size,\n          consider print-rand or whittle-to options\n\n  -W, --whittle-to \u003cWHITTLE_TO\u003e\n          Whittle list exactly to a specified length, only taking minimum number\n          of words from the beginning of inputted list(s). If the outputted list\n          is not exactly the specified length, it will try again by taking a\n          different amount of words form input list(s). As a result, this using this\n          option may cause Tidy to take a moment to produce the finished list. Can\n          accept expressions in the form of base**exponent (helpful for generating\n          diceware lists).\n\n          This option should generally only be used if all of the following conditions\n          are met: (a) the inputted word list is sorted by desirability (e.g. ordered\n          by word frequency); (b) the user is either removing prefix words, removing\n          suffix words, or doing a Schlinkert prune; (c) the user needs the resulting\n          list to be a specified length.\n\n          Optionally can also take a \"starting point\" after a comma. For example,\n          --whittle-to 7776,15000 would start by taking the first 15,000 words from\n          the inputted list(s) as a first attempt at making a list of 7,776 words,\n          iterating if necessary.\n\n  -h, --help\n          Print help (see a summary with '-h')\n\n  -V, --version\n          Print version\n```\n\n## Usage examples\n\n-   `tidy --output new_list.txt word_list1.txt word_list2.txt` Combines the word lists in `word_list1.txt` and `word_list2.txt`, removing whitespace, empty lines, and duplicate words into one list. It sorts this list alphabetically, and then prints this new, combined list to the specified output location, in this case: `new_list.txt`.\n\n-   `tidy -l -o new_list.txt inputted_word_list.txt` Deletes whitespace, removes empty lines and duplicate words from `inputted_word_list.txt`. 
Due to the `-l` flag, it makes all the words lowercase. It sorts this list alphabetically and removes duplicates once again. It then prints this new list to the specified output location, in this case: `new_list.txt`.\n\n-   `tidy -l inputted_word_list.txt \u003e new_list.txt` Alternatively, you can use `\u003e` to print tidy's output to a file.\n\n-   `tidy -lP -o new_list.txt inputted_word_list.txt` Same as above, but the added `-P` flag removes prefix words from the list. See below for more on prefix words.\n\n-   `tidy -lPi -o new_list.txt inputted_word_list.txt` Same as above, but the added `-i` flag deletes any integers in words. Words with integers in them are not removed, only the integers within them. For example, \"11326 agency\" becomes \"agency\".\n\n-   `tidy -lPiO -o new_list.txt inputted_word_list.txt` Same as above, but the added `-O` flag preserves the original order of the list, rather than sort it alphabetically. Note that duplicates and blank lines are still removed.\n\n-   `tidy -I -o new_list.txt inputted_word_list.txt` Using the `-I` flag removes any words with integers from the list. For example, \"hello1\" would be completely removed from the list, since it has an integer in it. Note that this is distinct from the lowercase `-i` flag, which would leave the word \"hello\" on the resulting list (removing the \"1\").\n\n-   `tidy -AA -I -o new_list.txt inputted_word_list.txt` Adding `-AA` prints some information about the created list to the terminal. You can add up to 4 `A` flags to get the maximum amount of information that Tidy can print about a list. See below for more information.\n\n-   `tidy -l -o new_list.txt -r profane_words.txt inputted_word_list.txt` Similar to above, but ensures that none of the words in the profane_words.txt file make it on to the final list that is printed to new_list.txt. The reject list is case sensitive, so you may want to run it through tidy using the `-l` flag before using it. 
(You can find lists of profane words [here](https://github.com/LDNOOBW/List-of-Dirty-Naughty-Obscene-and-Otherwise-Bad-Words) and [here](https://code.google.com/archive/p/badwordslist/downloads).)\n\n-   `tidy -l -o new_list.txt -a approved_words.txt inputted_word_list.txt` Similar to above, but ensures that only words in the approved_words.txt file make it on to the final list that is printed to new_list.txt. The approved list is case sensitive. (On Mac and some Linux distributions, `/usr/share/dict/words` should contain a list of words for spellcheck purposes.)\n\n-   `tidy -l -o new_list.txt --homophones homophone_pairs.txt inputted_word_list.txt` Similar to above, but expects `homophone_pairs.txt` to be a list of homophone pairs, one pair per line, separated by a comma (\"right,write\" then next line: \"epic,epoch\"). If both words in the pair are on the inputted_word_list, Tidy will remove the second one. If only one of the words in the pair is on the list, Tidy won't remove it. Each line must contain exactly two words.\n\n-   `tidy -lA -m 3 -o new-list.txt inputted_word_list.txt` Similar to above, but the `-m 3` means new list won't have any words under 3 characters in length. Have Tidy also print some attributes about the new list to the terminal screen.\n\n-   `tidy -z nfkd --locale fr -o bip-0039/french.txt --force bip-0039/french.txt` Verify that [the BIP-0039 French list](https://github.com/bitcoin/bips/blob/master/bip-0039/french.txt) is (a) normalized to [Unicode Normalization Form](https://www.unicode.org/reports/tr15/) Compatibility Decomposition (abbreviated as NFKD) (as per [the BIP-0039 specification](https://github.com/bitcoin/bips/blob/master/bip-0039.mediawiki#wordlist)) and (b) sorted appropriately for the French language (thanks to specifying `--locale fr`). Locales can also be specified like \"en-US\" or \"es-ES\". If a `locale` is not specified, locale uses system LANG. If no LANG is found, uses \"en-US\". 
This locale setting only really affects how the words on the outputted list are **sorted**, so it's not _crucial_ for most use-cases to specify one.\n\n-   `tidy -D t -o just_the_words.txt diceware_list.txt` If you've got [a diceware list with numbers and a tab before each word](https://www.eff.org/files/2016/07/18/eff_large_wordlist.txt), the `-D t` flag will delete everything up to and including the first tab in each line (\"11133 abruptly\" becomes \"abruptly\").\n\n-   `tidy --dice 6 -o diceware_list.txt just_words.txt` Add corresponding dice roll numbers to a list with `--dice`. Can accept dice sides between 2 and 36. Each dice roll and word are separated by a tab.\n\n-   `tidy -P -x 4 --print-rand 7776 --dice 6 --output diceware.txt 1password-2021.txt` Make a 7,776-word list from a [1Password (~18k) word list](https://1password.com/txt/agwordlist.txt), removing prefix words and guaranteeing 4 characters can auto-complete any word. Lastly, add the corresponding 6-sided dice roll for each word.\n\n-   `tidy -o d-and-d.txt --dice 20 --print-rand 20**3 wordlist.txt` Create an 8,000-word list where each word corresponds to 3 rolls of a 20-sided die (`06-07-07\tdragon`). `--print-rand` randomly truncates the resulting list to the specified amount -- can accept integers (`8000`) or informal exponent notation (`20**3`).\n\n-   `tidy -d s --whittle-to 7776 -PlL -m 3 -M 12 --dice 6 -o wiki-diceware.txt ~/Downloads/enwiki-20190320-words-frequency-sorted.txt` Carefully make a 7,776-word list by only taking the words needed from the top of `~/Downloads/enwiki-20190320-words-frequency-sorted.txt` [file](https://github.com/IlyaSemenov/wikipedia-word-frequency/blob/master/results/enwiki-20190320-words-frequency.txt). Assumes this file is sorted by word frequencies, with a frequency count after the word, separated by a space (example line: `located 1039008`). 
Since we only want to use the most common words, we'll use Tidy's `--whittle-to` option to only take exactly how many words we need to construct a list of 7,776 words. Note that this may take longer than usual Tidy executions, since Tidy will very likely need to make multiple attempts to make a list that's exactly the requested length. [More info on whittle](https://github.com/sts10/tidy/issues/15#issuecomment-1215907335).\n\n## Installation\n\n### Using Rust and cargo\n1. [Install Rust](https://www.rust-lang.org/tools/install) if you haven't already\n2. Run: `cargo install --git https://github.com/sts10/tidy --locked --branch main` (Run this same command to upgrade Tidy.)\n\nYou should then be able to run `tidy --help` for help text.\n\nUninstall Tidy by running `cargo uninstall tidy`.\n\n### Releases\nCheck the [GitHub Releases page](https://github.com/sts10/tidy/releases) for binaries suitable for Mac, Windows, and Linux users.\n\nTo install the executable on a Linux/macOS machine, download the `tidy` executable and move it to somewhere in your `$PATH`, like `$HOME/.local/bin` (you can do this on the command line with something like `mv ~/Downloads/tidy ~/.local/bin/`).\n\n## Tidy can print attributes about a word list\n\n**Note when using Tidy to audit a list**: Tidy will remove blank lines and duplicate lines (words) _before_ calculating these list attributes. For example, if your 4,000-word list has, say, 5 duplicate words, Tidy will report that the list has 3,995 words. No warning of duplicate words is given.\n\nIf you really want to _audit_ a word list, without making changes to it, try [Word List Auditor](https://github.com/sts10/wla).\n\nThat said, Tidy can calculate different attributes about a created list. 
`tidy -AAAA -G t --dry-run eff_long_list.txt` prints:\n\n```text\nAttributes of new list\n----------------------\nList length               : 7776 words\nMean word length          : 6.99 characters\nLength of shortest word   : 3 characters (aim)\nLength of longest word    : 9 characters (zoologist)\nFree of prefix words?     : true\nFree of suffix words?     : false\nUniquely decodable?       : true\nEntropy per word          : 12.925 bits\nEfficiency per character  : 1.849 bits\nAssumed entropy per char  : 4.308 bits\nAbove brute force line?   : true\nShortest edit distance    : 1\nMean edit distance        : 6.858\nLongest shared prefix     : 8\nUnique character prefix   : 9\n```\n\nUsing the `--samples` flag will print 5 sample passphrases to the terminal. (Note that these sample passphrases should not be used for security purposes, as Tidy has not been audited.)\n\n```txt\nWord samples\n------------\ndestruct subpar dizzy outshine stipend ovary\nslapstick hastily tremor visibly gizzard unloaded\nsalaried unwieldy churn vanity speak vessel\ndeserve humble pantyhose dayroom reprise unnatural\nvascular stencil visible sporty embellish submarine\n```\n\n## How Tidy counts the length of a word\n\nWhen counting the length of a word, Tidy counts the number of [grapheme clusters](https://www.unicode.org/reports/tr29/#Grapheme_Cluster_Boundaries) in the word. Generally, less common characters like accented letters and emoji all count as 1 grapheme cluster and thus, to Tidy, one character. I believe this better fits with how us humans intuitively count characters in a string/word.\n\n## What types of files does Tidy work with?\nIn general, Tidy expects inputted files to have one word per line.\n\n### Line endings\nTidy supports `\\n` and `\\r\\n` line endings.\n\n## On verbs used\n\nIn both Tidy's code and documentation, \"remove\" means that a word will be removed (e.g. 
words with integers will be removed from the list), while \"delete\" means that a word will only be modified (e.g. integers removed from words). Uppercase flags remove words, while lowercase flags delete specified characters. All delete calls and word modifications (like \"to lowercase\") occur _before_ any remove call.\n\n## Blog posts related to this project\n\n* [Read about how Tidy handles Unicode normalization, locales, and alphabetizing words](https://sts10.github.io/2023/01/29/sorting-words-alphabetically-rust.html)\n* [Read more about the 0.2 version of this project](https://sts10.github.io/2021/12/09/tidy-0-2-0.html)\n* [Read about uniquely decodable codes and \"Schlinkert pruning\"](https://sts10.github.io/2022/08/12/efficiently-pruning-until-uniquely-decodable.html) (introduced in Tidy version 0.2.60)\n* [Read about initial inspiration for the project](https://sts10.github.io/2020/09/30/making-a-word-list.html)\n\n## Using Tidy with non-English words and/or accented characters\n\nTidy does its best to work well with all languages. That said, I'm an English speaker and have not tested Tidy with other languages all that much.\n\nThere are a few steps you can take to help Tidy produce a good word list in all languages.\n\nIf you're using Tidy to work a word list with accented characters, it is highly recommended that you:\n1. have Tidy normalize the Unicode of all characters on the list (e.g. `-z nfc` or `-z nfkd`). This will better ensure that there are no duplicate-looking words on the list, which could cause Tidy and others to over-estimate the strength of passphrases generated from the outputted list. Note that if you're passing a reject list file or approved list file to Tidy, you should normalize those lists _before_ using them. 
For example: `tidy -z nfc --locale ES-es -l --force -o profane-spanish-words.txt profane-spanish-words.txt \u0026\u0026 tidy -z nfc --locale ES-es -r profane-spanish-words.txt -o my-new-spanish-word-list.txt -l a-bunch-of-spanish-words.txt`\n2. specify the \"locale\" of the words on your list (e.g. `--locale fr` or `--locale ES-es`). This will ensure that the outputted list is sorted correctly.\n3. if the language you're working with has or may have apostrophes in words, consider using the `-q` or `--straighten` option to standardize these characters across all words on the new list.\n\nSee [this blog post](https://sts10.github.io/2023/01/29/sorting-words-alphabetically-rust.html) for more. If you find Tidy not performing as expected with non-English words, please open an Issue on this repository with an example.\n\n## Using Tidy to remove homophones\n\nIf passphrases from your list will ever be spoken out loud, you may want to consider removing homophones -- words that sound alike -- from your list.\n\nI'd say that Tidy offers two ways of dealing with homophones.\n\nGiven a pair of homophones, like \"sun\" and \"son\":\n\n1. To ensure you don't have BOTH homophones in your generated list, you'd run `tidy` with a flag like `--homophones ../homophones/homophone-lists/homophones-large-as-pairs.txt` ([link](https://github.com/sts10/homophones/blob/main/homophone-lists/homophones-large-as-pairs.txt)). This will let either \"sun\" or \"son\" on your list but NOT both.\n2. To ensure you have NEITHER of the words in the homophone pair on your generated word list, you'd use the reject words flags: `-r ../homophones/homophone-lists/cleaned-as-singles.txt` ([link](https://github.com/sts10/homophones/blob/main/homophone-lists/cleaned-as-singles.txt)). 
This will remove _both_ \"sun\" and \"son\" from your generated list before it's outputted.\n\nIf you're looking for a relatively long list of English homophones, I'd humbly point you to [this other project of mine](https://github.com/sts10/homophones).\n\n## Prefix codes, suffix codes, and uniquely decodable codes\n\nIf a word list is \"uniquely decodable\" that means that words from the list can be safely combined _without_ a delimiter between each word, e.g. `enticingneurosistriflecubeshiningdupe`.\n\nAs a brief example, if a list has \"boy\", \"hood\", and \"boyhood\" on it, users who specified they wanted two words worth of randomness (entropy) might end up with \"boyhood\", which an attacker guessing single words would try. Removing the word \"boy\", which makes the remaining list uniquely decodable, prevents this possibility from occurring.\n\nTo make a list uniquely decodable, Tidy removes words. Tidy offers three (3) distinct procedures to make cuts until a list is uniquely decodable. Users can (1) remove all [prefix words](https://en.wikipedia.org/wiki/Prefix_code), (2) remove all suffix words, or (3) perform \"Schlinkert pruning,\" a procedure based on [the Sardinas–Patterson algorithm](https://en.wikipedia.org/wiki/Sardinas%E2%80%93Patterson_algorithm) that I developed for Tidy. Note that Schlinkert pruning a long inputted word list may take hours or days; removing prefix or suffix words should be significantly quicker. You can learn more about uniquely decodable codes and Schlinkert pruning by reading [this blog post](https://sts10.github.io/2022/08/12/efficiently-pruning-until-uniquely-decodable.html).\n\nTidy can also simply _check_ if the inputted list is (already) uniquely decodable. It does this using [the Sardinas–Patterson algorithm](https://en.wikipedia.org/wiki/Sardinas%E2%80%93Patterson_algorithm). You can do this by passing Tidy four `attributes` flags (`-AAAA`).\n\n## Whittling\n\nTidy offers an option `--whittle-to`. 
This option should **only** be used in specific situations -- users should generally prefer the `--print-rand` or `--print-first` options. Whittling gives an advantage over the `print` options only when both of the following conditions are met:\n(a) the inputted word list is sorted by desirability (e.g. ordered by word frequency) and\n(b) the user is removing prefix words (`-P`), removing suffix words (`-S`), and/or doing a Schlinkert prune (`-K`).\n\nTo see why whittling is best for this particular situation, see [this document](https://gist.github.com/sts10/25e75d39acdeeafddad943d4d32684ff).\n\n## On maximum shared prefix length\n\nTidy allows users to set a maximum shared prefix length.\n\nSetting this value to, say, 4 means that knowing the first 4 characters of any word on the generated list is sufficient to know which word it is.\n\nFor example, on a generated list where we told Tidy to make the maximum shared prefix length 4 characters, a word that starts with \"radi\" must be the word \"radius\" (if \"radical\" had been on the list, Tidy would have removed it).\n\nThis is useful if you intend the list to be used by software that uses auto-complete. For example, a user will only have to type the first 4 characters of any word before a program could successfully auto-complete the entire word.\n\n(Note that this setting is distinct from the operation of eliminating prefix words, though it can be used in conjunction with that feature.)\n\nUse the attributes flag twice (`-AA`) to get information about shared prefix length for a generated list. 
Tidy will print both \"Longest shared prefix\" and \"Unique character prefix\" (which is longest shared prefix + 1).\n\n## What is \"Efficiency per character\" and \"Assumed entropy per char\" and what's the difference?\n\nIf we take the entropy per word from a list (log\u003csub\u003e2\u003c/sub\u003e(list_length)) and divide it by the **average** word length of words on the list, we get a value we might call \"efficiency per character\". This just means that, on average, you get _E_ bits per character typed.\n\nIf we take the entropy per word from a list (log\u003csub\u003e2\u003c/sub\u003e(list_length)) and divide it by the length of the **shortest** word on the list, we get a value we might call \"assumed entropy per char\" (or character).\n\nFor example, if we're looking at the EFF long list, we see that it is 7,776 words long, so we'd assume an entropy of log\u003csub\u003e2\u003c/sub\u003e(7776) or 12.925 bits per word. The average word length is 7.0, so the efficiency is 1.8 bits per character. (I got this definition of \"efficiency\" from [an EFF blog post about their list](https://www.eff.org/deeplinks/2016/07/new-wordlists-random-passphrases).) And lastly, the shortest word on the list is three letters long, so we'd divide 12.925 by 3 and get an \"assumed entropy per character\" of about 4.31 bits per character.\n\nI contend that this \"assumed entropy per character\" value in particular may be useful when we ask the more theoretical question of \"how short should the shortest word on a good word list be?\" There may be an established method for determining what this minimum word length should be, but if there is I don't know about it yet! Here's the math I've worked out on my own.\n\n\u003c!-- Consider the story of a user who gets a passphrase composed of only the shortest words on the list. Does this passphrase genuinely have the entropy of `log2(list_length)` per word? 
--\u003e\n\n### The \"brute force line\"\n\nAssuming the list is comprised of 26 unique characters, if the shortest word on a word list is shorter than log\u003csub\u003e26\u003c/sub\u003e(list_length), there's a possibility that a user generates a passphrase such that the formula of entropy_per_word = log\u003csub\u003e2\u003c/sub\u003e(list_length) will _overestimate_ the entropy per word. This is because a brute-force character attack would have fewer guesses to run through than the number of guesses we'd assume given the word list we used to create the passphrase.\n\nAs an example, let's say we had a 10,000-word list that contained the one-character word \"a\" on it. Given that it's 10,000 words, we'd expect each word to add an additional ~13.29 bits of entropy. That would mean a three-word passphrase would give users 39.86 bits of entropy. However! If a user happened to get \"a-a-a\" as their passphrase, a brute force method shows that entropy to be only 14.10 bits (4.7 bits \\* 3 one-letter words). Thus we can say that it falls below the \"brute force line\", a phrase I made up.\n\nTo see if a given generated list falls above or below this line, use the `-A`/`--attributes` flag.\n\n#### Maximum word list lengths to clear the Brute Force Line\n\nFormula:\n\nWhere _S_ is the length of the shortest word on the list, 26 is the number of letters in the English alphabet, and _M_ is max list length: _M_ = 2\u003csup\u003e_S_ * log\u003csub\u003e2\u003c/sub\u003e(26)\u003c/sup\u003e. 
Conveniently, [this simplifies rather nicely](https://github.com/sts10/tidy/issues/9#issuecomment-1216003299) to _M_ = 26\u003csup\u003e_S_\u003c/sup\u003e.\n\n(or in Python: `max_word_list_length = 26**shortest_word_length`)\n\n| shortest word length | max list length |\n|----------------------|-----------------|\n| 2                    | 676             |\n| 3                    | 17576           |\n| 4                    | 456976          |\n| 5                    | 11881376        |\n\n### An even stricter \"line\"\n\nIf we go by [a 1951 Claude Shannon paper](https://www.princeton.edu/~wbialek/rome/refs/shannon_51.pdf), each letter in English actually only gives 2.6 bits of entropy. Users can see if their generated word list falls above this (stricter) line -- which I've dubbed the \"Shannon line\" -- by using the `-A`/`--attributes` flag.\n\n#### Maximum word list lengths to clear the Shannon Line\n\nFormula:\n\nWhere _S_ is the length of the shortest word on the list and _M_ is max list length: _M_ = 2\u003csup\u003e_S_ * 2.6\u003c/sup\u003e\n\n(or in Python: `max_word_list_length = 2**(shortest_word_length*2.6)`, which, to preserve the correct number of significant digits, should be `max_word_list_length = 6.1**shortest_word_length`)\n\n| shortest word length | max list length |\n|----------------------|-----------------|\n| 2                    | 37              |\n| 3                    | 226             |\n| 4                    | 1384            |\n| 5                    | 8445            |\n| 6                    | 51520           |\n\nAs you can see, the Shannon line is quite a bit more \"strict\" than the brute force line.\n\n## A separate tool to help you set dice rolls to correspond with your list\n\nA word list of 7,776 words \"fits\" nicely into five 6-sided dice rolls. 
But not all word lists are 7,776 words long.\n\nIf you'd like some help figuring out how to fit your list to a number of dice rolls, another tool I wrote called [Dice Tailor](https://github.com/sts10/dice-tailor) might help.\n\n## What's up with the memchr dependency?\n\nTidy's function for removing characters on either side of a given delimiter uses a library called [memchr](https://docs.rs/memchr/2.3.4/memchr/), which \"provides heavily optimized routines for searching bytes.\" The optimization gained from using this crate is far from noticeable or necessary for most uses of Tidy -- using Rust's built-in `find` is not much slower -- but I figured the extra speed was worth the dependency in this case.\n\nSee [this repo](https://github.com/sts10/splitter) for more information.\n\n## For Tidy developers\n\n* Run all code tests: `cargo test`\n* Generate docs: `cargo doc --document-private-items --no-deps`. Add the `--open` flag to open docs after generation. Locally, docs are written to `./target/doc/tidy/index.html`.\n* Check license compatibility of Tidy's dependencies: `cargo deny check licenses` (requires that you [have cargo-deny installed locally](https://github.com/EmbarkStudios/cargo-deny#install-cargo-deny))\n\nPull Requests welcome!\n\n### How to create a release\n\nThis project uses [cargo-dist](https://opensource.axo.dev/cargo-dist/) to create releases.\n\nSome of [my personal docs are here](https://sts10.github.io/docs/cargo-dist-tips.html); but basically, first, update dist with `cargo install cargo-dist`. Then, from within the tidy project folder, run `dist init` to ensure Tidy will use the latest version of dist when creating the next release.\n\nWhen you're ready to cut a new release, test the current state of the project with `dist build` and `dist plan`. If that went well, create a new git tag that matches the current project version in `Cargo.toml` with `git tag vX.X.X`. Finally, run `git push --tags` to kick off the release process. 
GitHub will handle it from here -- check your project's GitHub Releases page in about 5 to 10 minutes.\n\n## Appendix: Tools that seem similar to Tidy\n-   [cook](https://github.com/giteshnxtlvl/cook): \"An overpower[ed] wordlist generator, splitter, merger, finder, saver, create words permutation and combinations, apply different encoding/decoding and everything you need.\" Written in Go.\n-   [duplicut](https://github.com/nil0x42/duplicut): \"Remove duplicates from MASSIVE wordlist, without sorting it\". Seems to indeed be much faster (approximately 10x) than `tidy --no-sort` for de-duplicating large word lists. Written in C.\n-   [wordlist-knife](https://github.com/kazkansouh/wordlist-knife): \"Versatile tool for managing wordlists.\" Written in Python.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fsts10%2Ftidy","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fsts10%2Ftidy","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fsts10%2Ftidy/lists"}