https://github.com/jibsen/lzdatagen


LZ data generator
=================

About
-----

It is sometimes useful to generate data that resembles real data for testing
or benchmarking purposes, for instance when it is impractical to distribute
large data sets with an application.

lzdatagen generates data suitable for dictionary compression techniques.

Usage
-----

lzdatagen comes with an example application, lzdgen, that provides a
command-line interface for generating data:

    usage: lzdgen [options] OUTFILE

    Generate compressible data for testing purposes.

    options:
      -b, --bulk             use faster, less precise method
      -f, --force            overwrite output file
      -h, --help             print this help and exit
      -l, --literal-exp EXP  literal distribution exponent [3.0]
      -m, --match-exp EXP    match length distribution exponent [3.0]
      -o, --output OUTFILE   write output to OUTFILE
      -r, --ratio RATIO      compression ratio target [3.0]
      -S, --seed SEED        use 64-bit SEED to seed PRNG
      -s, --size SIZE        size with opt. k/m/g suffix [1m]
      -V, --version          print version and exit
      -v, --verbose          verbose mode

    If OUTFILE is `-', write to standard output.

Examples
--------

Generate 1 MiB of data that should compress to roughly a quarter of its size:

    lzdgen -r 4.0 foo.bin

Generate 1 MiB of data that is compressible by entropy coding but contains no
LZ repetitions:

    lzdgen -r 1.0 foo.bin

Generate 1 GiB of data, piped to zstd:

    lzdgen -s 1g - | zstd -o foo.zstd

Details
-------

Data is generated by inserting sequences of either random bytes or repetitions
from a buffer of bytes, depending on the ratio parameter. This is based on the
[paper][SDGen] "SDGen: Mimicking Datasets for Content Generation in Storage
Benchmarks" by Raúl Gracia-Tinedo et al.
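
The idea of mixing literal runs with repetitions can be illustrated with a toy
generator. This is a minimal Python sketch, not lzdatagen's actual code: the
function name `generate`, the `window` and `max_run` parameters, the uniform
run lengths, and the rule that about `1/ratio` of the bytes are literals are
all assumptions made for illustration.

```python
import random

def generate(size, ratio, seed=0, window=4096, max_run=32):
    """Toy generator: mix random literal runs with copies of earlier output."""
    rng = random.Random(seed)
    out = bytearray()
    # Assumption: aim for roughly 1/ratio of the output bytes to be literals.
    literal_fraction = 1.0 / ratio
    while len(out) < size:
        run = min(rng.randint(1, max_run), size - len(out))
        if not out or rng.random() < literal_fraction:
            # Literal run: fresh random bytes.
            out.extend(rng.randrange(256) for _ in range(run))
        else:
            # Match: copy `run` bytes starting at a recent position.
            # Copying byte by byte allows the source to overlap the output,
            # as in LZ77-style matches.
            start = rng.randrange(max(0, len(out) - window), len(out))
            for i in range(run):
                out.append(out[start + i])
    return bytes(out)
```

With `ratio=1.0` every run is a literal run, so the output contains no
deliberate repetitions; larger ratios replace more of the output with copies.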

Instead of sampling actual data, lzdatagen uses a simple power function to
determine the distributions of literal values and match lengths. The exponents
used can be set using the `--literal-exp` and `--match-exp` options.
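
To see how an exponent shapes such a distribution, the following Python sketch
maps a uniform random number through a power function. The exact function used
by lzdatagen is not given here; the form `n * u ** exponent` and the name
`power_sample` are assumptions chosen for illustration.

```python
import random

def power_sample(rng, n, exponent):
    """Return an integer in [0, n), skewed toward 0 when exponent > 1."""
    u = rng.random()  # uniform in [0.0, 1.0)
    # Clamp to n - 1 in case floating-point rounding pushes the result to n.
    return min(n - 1, int(n * u ** exponent))

rng = random.Random(1234)
samples = [power_sample(rng, 256, 3.0) for _ in range(10000)]
low = sum(1 for v in samples if v < 64)
print(f"{low / len(samples):.0%} of samples fall in the lowest quarter")
```

An exponent of 1.0 gives a uniform distribution in this sketch, while larger
exponents concentrate more of the samples on small values.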

This simplification means it cannot generate data with a limited alphabet, like
DNA sequences.

The ratio parameter is approximate. Skewed literal distributions may create
matches, and the way matches are created from a buffer may affect the
distribution of byte values.
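
Because the target is approximate, it can help to measure what a compressor
actually achieves on the generated data. A minimal Python sketch, using zlib
purely as a stand-in compressor; the helper name `zlib_ratio` is hypothetical,
and other compressors or settings will report different ratios.

```python
import zlib

def zlib_ratio(data, level=6):
    """Return uncompressed size divided by zlib-compressed size."""
    if not data:
        raise ValueError("data must not be empty")
    return len(data) / len(zlib.compress(data, level))

# Example: measure a file previously written by lzdgen (path is illustrative).
# with open("foo.bin", "rb") as f:
#     print(f"{zlib_ratio(f.read()):.2f}")
```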

Please note that while data generated in this way may be useful for some kinds
of testing and benchmarking, it is no substitute for unit tests that cover the
limits of an algorithm.

lzdatagen uses a [PCG][] random number generator. In verbose mode it will print
the seed value to stderr. The `--seed` option can be used to generate
reproducible data.
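
One way to check reproducibility is to generate the same file twice and compare
hashes. The Python sketch below uses only the `-f`, `-S`, and `-s` options
listed under Usage. It assumes `lzdgen` is on the `PATH` and that the seed can
be passed as a decimal integer; the helper names and file names are
hypothetical.

```python
import hashlib
import subprocess

def lzdgen_args(outfile, seed, size="1m"):
    """Build an lzdgen command line from the documented -f, -S and -s options."""
    # Assumption: SEED is accepted in decimal notation.
    return ["lzdgen", "-f", "-S", str(seed), "-s", size, outfile]

def sha256_of(path):
    """Hash a file in chunks so large outputs need not fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()

def is_reproducible(seed, path_a="a.bin", path_b="b.bin"):
    """Run lzdgen twice with the same seed and compare the two outputs."""
    for path in (path_a, path_b):
        subprocess.run(lzdgen_args(path, seed), check=True)
    return sha256_of(path_a) == sha256_of(path_b)
```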

A few other projects in this area:

- [SDGen](https://github.com/iostackproject/SDGen)
- [uiq2](http://mattmahoney.net/dc/uiq/)
- [lzgen](http://encode.ru/threads/305-Searching-for-special-file-generator)

[SDGen]: https://www.usenix.org/node/188461
[PCG]: http://www.pcg-random.org/

License
-------

This project is licensed under the [Apache License, Version 2.0](LICENSE).