{"id":17724862,"url":"https://github.com/jibsen/lzdatagen","last_synced_at":"2025-05-13T01:05:01.741Z","repository":{"id":72895322,"uuid":"62487263","full_name":"jibsen/lzdatagen","owner":"jibsen","description":"LZ data generator","archived":false,"fork":false,"pushed_at":"2024-02-02T07:04:44.000Z","size":31,"stargazers_count":21,"open_issues_count":0,"forks_count":4,"subscribers_count":3,"default_branch":"master","last_synced_at":"2024-02-02T08:29:13.741Z","etag":null,"topics":["c","compression","data-generator"],"latest_commit_sha":null,"homepage":null,"language":"C","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/jibsen.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null}},"created_at":"2016-07-03T08:09:41.000Z","updated_at":"2024-02-02T08:29:15.180Z","dependencies_parsed_at":null,"dependency_job_id":"5ae9bb73-34c0-4e95-ac87-ce0e48134e72","html_url":"https://github.com/jibsen/lzdatagen","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/jibsen%2Flzdatagen","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/jibsen%2Flzdatagen/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/jibsen%2Flzdatagen/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/jibsen%2Flzdatagen/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/jibsen","download_url":"https://codeload.github.com/jibsen/lzdatagen/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://
github.com","kind":"github","repositories_count":246588691,"owners_count":20801522,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["c","compression","data-generator"],"created_at":"2024-10-25T15:48:56.500Z","updated_at":"2025-04-01T12:30:30.739Z","avatar_url":"https://github.com/jibsen.png","language":"C","funding_links":[],"categories":[],"sub_categories":[],"readme":"\nLZ data generator\n=================\n\nAbout\n-----\n\nSometimes it can be useful to be able to generate data that is similar to real\ndata for testing or benchmarking purposes. For instance it may be impractical\nto distribute large data sets with an application.\n\nlzdatagen generates data suitable for dictionary compression techniques.\n\n\nUsage\n-----\n\nlzdatagen comes with an example application lzdgen that provides a command-line\ninterface for generating data:\n\n    usage: lzdgen [options] OUTFILE\n\n    Generate compressible data for testing purposes.\n\n    options:\n      -b, --bulk             use faster, less precise method\n      -f, --force            overwrite output file\n      -h, --help             print this help and exit\n      -l, --literal-exp EXP  literal distribution exponent [3.0]\n      -m, --match-exp EXP    match length distribution exponent [3.0]\n      -o, --output OUTFILE   write output to OUTFILE\n      -r, --ratio RATIO      compression ratio target [3.0]\n      -S, --seed SEED        use 64-bit SEED to seed PRNG\n      -s, --size SIZE        size with opt. 
k/m/g suffix [1m]\n      -V, --version          print version and exit\n      -v, --verbose          verbose mode\n\n    If OUTFILE is `-', write to standard output.\n\n\nExamples\n--------\n\nGenerate 1 MiB data which should compress roughly 1:4:\n\n    lzdgen -r 4.0 foo.bin\n\nGenerate 1 MiB data compressible by entropy coding, but without LZ repetitions:\n\n    lzdgen -r 1.0 foo.bin\n\nGenerate 1 GiB of data, piped to zstd:\n\n    lzdgen -s 1g - | zstd -o foo.zstd\n\n\nDetails\n-------\n\nData is generated by inserting sequences of either random bytes or repetitions\nfrom a buffer of bytes, depending on the ratio parameter. This is based on the\n[paper][SDGen] \"SDGen: Mimicking Datasets for Content Generation in Storage\nBenchmarks\" by Raúl Gracia-Tinedo et al.\n\nInstead of sampling actual data, lzdatagen uses a simple power function to\ndetermine the distributions of literal values and match lengths. The exponents\nused can be set using the `--literal-exp` and `--match-exp` options.\n\nThis simplification means it cannot generate data with a limited alphabet, like\nDNA sequences.\n\nThe ratio parameter is approximate. Skewed literal distributions may create\nmatches, and the way matches are created from a buffer may affect the\ndistribution of byte values.\n\nPlease note that while data generated in this way may be useful for some kinds\nof testing and benchmarking, it is no substitute for unit tests that cover the\nlimits of an algorithm.\n\nlzdatagen uses a [PCG][] random number generator. In verbose mode it will print\nthe seed value to stderr. 
The `--seed` option can be used to generate\nreproducible data.\n\nA few other projects in this area:\n\n  - [SDGen](https://github.com/iostackproject/SDGen)\n  - [uiq2](http://mattmahoney.net/dc/uiq/)\n  - [lzgen](http://encode.ru/threads/305-Searching-for-special-file-generator)\n\n[SDGen]: https://www.usenix.org/node/188461\n[PCG]: http://www.pcg-random.org/\n\n\nLicense\n-------\n\nThis project is licensed under the [Apache License, Version 2.0](LICENSE).\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fjibsen%2Flzdatagen","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fjibsen%2Flzdatagen","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fjibsen%2Flzdatagen/lists"}