https://github.com/ziman/dictpress
Preprocess wordlists before compression
https://github.com/ziman/dictpress
Last synced: 8 months ago
JSON representation
Preprocess wordlists before compression
- Host: GitHub
- URL: https://github.com/ziman/dictpress
- Owner: ziman
- Created: 2010-09-13T15:11:09.000Z (over 15 years ago)
- Default Branch: master
- Last Pushed: 2012-06-12T20:00:38.000Z (almost 14 years ago)
- Last Synced: 2025-07-05T04:03:57.253Z (11 months ago)
- Language: C
- Homepage:
- Size: 109 KB
- Stars: 2
- Watchers: 1
- Forks: 0
- Open Issues: 0
-
Metadata Files:
- Readme: README
Awesome Lists containing this project
README
dictpress.c
Encode/decode dictionaries before/after compression.
Copyright (c) 2010, Matus Tejiscak
Released under the BSD license.
http://creativecommons.org/licenses/BSD/
Purpose:
Encode dictionaries in the form that is better compressible by
general-purpose compression algorithms, exploiting their structure.
Usage:
$ cc -O2 dictpress.c -o dictpress
$ cat dictionary.txt | ./dictpress | bzip2 -9 > dictionary.dp.bz2
$ cat dictionary.dp.bz2 | bunzip2 | ./dictpress -d > dictionary-decompressed.txt
Used best with bzip2 -9. (lzma, gzip and 7z perform worse).
Approximate compression ratios:
My 96-megabyte dictionary compresses to
-> 24M with dictpress alone
-> 18M with bzip2
-> 11M with 7z
-> 1.4M witch dictpress+7z
-> 1M with dictpress+bzip2
Prerequisites:
There's no point in running this on an unsorted dictionary.
Words must not contain binary zeroes.
Algorithm used:
We exploit the fact that consecutive words differ only a little in the suffix.
Therefore, for each word we record a pair (n,s) saying "remove n chars from the
end of the last word and append the string s".
Warning:
Does not preserve CRs (#13, '\r', ...) in the input.