https://github.com/ziman/dictpress

Preprocess wordlists before compression

Host: GitHub
URL: https://github.com/ziman/dictpress
Owner: ziman
Created: 2010-09-13T15:11:09.000Z (almost 16 years ago)
Default Branch: master
Last Pushed: 2012-06-12T20:00:38.000Z (about 14 years ago)
Last Synced: 2025-07-05T04:03:57.253Z (about 1 year ago)
Language: C
Homepage:
Size: 109 KB
Stars: 2
Watchers: 1
Forks: 0
Open Issues: 0
Metadata Files:
- Readme: README

README

dictpress.c
Encode/decode dictionaries before/after compression.

Purpose:
Encode dictionaries in a form that general-purpose compression
algorithms compress better, by exploiting the dictionaries' structure.

Usage:
$ cc -O2 dictpress.c -o dictpress
$ cat dictionary.txt | ./dictpress | bzip2 -9 > dictionary.dp.bz2
$ cat dictionary.dp.bz2 | bunzip2 | ./dictpress -d > dictionary-decompressed.txt

Works best with bzip2 -9 (lzma, gzip and 7z perform worse).

Approximate compression ratios:
My 96-megabyte dictionary compresses to
-> 24M with dictpress alone
-> 18M with bzip2
-> 11M with 7z
-> 1.4M with dictpress+7z
-> 1M with dictpress+bzip2

Prerequisites:
There's no point in running this on an unsorted dictionary.
Words must not contain binary zeroes.

Algorithm used:
We exploit the fact that consecutive words in a sorted dictionary usually
share a long prefix and differ only in the suffix. Therefore, for each word
we record a pair (n,s) saying "remove n chars from the end of the previous
word and append the string s".
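
The (n,s) idea can be sketched in a few lines of C. This is only an illustration of the technique as described above, not the code from dictpress.c: the names struct delta, delta_encode and delta_apply are hypothetical, and the sketch says nothing about how dictpress actually lays the pairs out in its output stream.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical record: remove `n` chars from the end of the previous word,
 * then append the string `s`. */
struct delta {
    size_t n;       /* number of trailing chars to drop from the previous word */
    const char *s;  /* suffix to append (points into the current word) */
};

/* Compute the pair (n,s) that turns `prev` into `cur`. */
struct delta delta_encode(const char *prev, const char *cur)
{
    size_t common = 0;

    /* Length of the shared prefix of the two words. */
    while (prev[common] != '\0' && prev[common] == cur[common])
        common++;

    struct delta d = { strlen(prev) - common, cur + common };
    return d;
}

/* Apply `d` in place to the word stored in `buf` (capacity `cap` bytes).
 * `d.s` must not point into `buf`.
 * Returns 0 on success, -1 if the result would not fit. */
int delta_apply(char *buf, size_t cap, struct delta d)
{
    size_t len = strlen(buf);
    assert(d.n <= len);          /* cannot drop more chars than the word has */

    size_t keep = len - d.n;
    size_t add = strlen(d.s);
    if (keep + add + 1 > cap)
        return -1;

    memcpy(buf + keep, d.s, add + 1);  /* copies the terminating NUL too */
    return 0;
}
```

For example, going from "apple" to "apply" gives the pair (1, "y"): drop one char, append "y". The first word of a dictionary can be encoded against the empty string, which gives n = 0 and s equal to the whole word.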

Warning:
Carriage returns (CR, #13, '\r') in the input are not preserved.
