{"id":13549466,"url":"https://github.com/linkedin/migz","last_synced_at":"2025-08-17T01:35:04.683Z","repository":{"id":49766158,"uuid":"152490070","full_name":"linkedin/migz","owner":"linkedin","description":"Multithreaded, gzip-compatible compression and decompression, available as a platform-independent Java library and command-line utilities.","archived":false,"fork":false,"pushed_at":"2020-06-10T22:12:02.000Z","size":5035,"stargazers_count":79,"open_issues_count":7,"forks_count":11,"subscribers_count":4,"default_branch":"master","last_synced_at":"2025-04-02T22:35:11.050Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":"","language":"Java","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"bsd-2-clause","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/linkedin.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":"CONTRIBUTING.md","funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2018-10-10T21:07:32.000Z","updated_at":"2024-09-11T06:13:39.000Z","dependencies_parsed_at":"2022-09-17T13:51:19.443Z","dependency_job_id":null,"html_url":"https://github.com/linkedin/migz","commit_stats":null,"previous_names":[],"tags_count":2,"template":false,"template_full_name":null,"purl":"pkg:github/linkedin/migz","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/linkedin%2Fmigz","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/linkedin%2Fmigz/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/linkedin%2Fmigz/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/linkedin%2Fmigz/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/linkedin","download_url":"https://codeload.github.com/linkedin/migz/tar.gz/refs/h
eads/master","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/linkedin%2Fmigz/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":270796217,"owners_count":24647319,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","status":"online","status_checked_at":"2025-08-16T02:00:11.002Z","response_time":91,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2024-08-01T12:01:22.141Z","updated_at":"2025-08-17T01:35:04.664Z","avatar_url":"https://github.com/linkedin.png","language":"Java","funding_links":[],"categories":["Java"],"sub_categories":[],"readme":"# MiGz\n\n## Motivation\n\nCompressing and decompressing files with standard compression utilities like gzip is a single-threaded affair.  For\nlarge files on fast disks, that single thread becomes the bottleneck.\n\nThere are several utilities for multithreaded compression, including an extant Java library \n(https://github.com/shevek/parallelgzip), but no Java library (or GZip utility) also supports\nmultithreaded **de**compression, which is especially important for large files that are read repeatedly.  Hence, MiGz.\n\n## Benefits\n\nMiGz uses the GZip format, which has widespread support and offers both reasonably fast speed and a good compression \nratio.\n\nMiGz'ed files are also entirely valid GZip files, and can be read (single-threaded) by any GZip utility/library, \nincluding GZipInputStream!  
Better still, MiGz'ed files can be multithreadedly decompressed by the MiGz decompressor.\n\nOn multicore machines, MiGz compression is *much* faster for any reasonably large file (tens of megabytes or more);\n6x gains were seen on a MacBook with a large Wikipedia dump vs. the gzip command line utility (see Performance, below),\nwith only ~1% increase in file size vs. gzip at max compression.\n\nDecompression is also sped up for larger files (many tens of megabytes or more); for smaller files, it's about the same\nas Java's built-in single-threaded GZipInputStream.  Decompression of the aforementioned Wikipedia data was over 3x\nfaster.\n\n## Performance\n\nUsing default settings on a MacBook Pro (with an SSD) with four hyperthreaded physical cores (8 logical cores):\n\n### Shakespeare\n\nThe time to compress a 25.6MB collection of Shakespeare text was 25% that of GZip at max compression (~1.35s vs. ~6s),\nwith MiGz's output being ~1% larger.  However, the time to decompress, measured with the MUnzip command-line tool, is\n~0.25s vs. GZip's ~0.09s, mostly attributable to Java overhead: the time to decompress in Java with GZip is a slightly\nfaster ~0.23s.\n\nStill, using the Java API, in a tight loop decompressing the same in-memory data 100 times and discarding the result,\nthe decompression time per copy is ~0.019s vs. ~0.073s for GZipStream.  We suspect that MiGz requires either some\nJIT-related warm-up or amortizing the extra class loading cost vs. GZipStream before gains are seen on smaller files.\n\n### German Wikipedia\n\nThis is an 18GB XML dump of German Wikipedia articles.  At maximum compression, MiGz compresses it in 198.2s, vs. 810.2s\nfor GZip.  Decompression is 15.6s for MiGz and 65.2s for GZip.  
Compressed file size is roughly equal: 5.74GB for MiGz\nand 5.70GB for GZip (a difference of less than 1%).\n\n\n## Using MiGz in Java and other JVM Languages\n\nMiGz is used just like you would use GZipInputStream and GZipOutputStream, with the analogous MiGzInputStream and \nMiGzOutputStream classes.  For example, decompression is as simple as:\n\n```java\nInputStream is = ...\nMiGzInputStream mis = new MiGzInputStream(is);\n```\nCompression is just as simple:\n\n```java\nOutputStream os = ...\nMiGzOutputStream mos = new MiGzOutputStream(os);\n```\n## Using MiGz from the Command-line \n\nThe MiGz project also comes with modules for two simple command-line tools; you may build these yourself or use our\nprecompiled executables (for *nix platforms) or JARs (other platforms).\n\n### mzip\n\nmzip uses MiGz to compress data from stdin and outputs the compressed data to stdout.  For example, to compress \ndata.txt and write the result to data.gz, we can run:\n\n```bash\nmzip \u003c data.txt \u003e data.gz\n```\n### munzip\n\nmunzip likewise uses MiGz to decompress data from stdin and output the original, uncompressed data to stdout.  For \nexample, to decompress data.gz back to data.txt:\n\n```bash\nmunzip \u003c data.gz \u003e data.txt\n```\n## Recommended settings\n\nThe default block size is 512KB, which provides good speed (smaller block sizes -\u003e better parallelization) on\nrelatively \"small\" (10s of MB) files, while still maintaining file sizes very close to standard gzip, though you can\nreduce block size to ~100KB before the difference is really noticeable.\n\nThe default thread count is either the number of logical cores on your machine (decompression) or twice that\n(compression).  Extra threads are used for compression because MiGz uses the threads to effectively buffer the\noutput without using a dedicated writer thread.  
However, this may change in the future and we recommend sticking with\nthe default thread count as \"future proofing\".\n\n## Building MiGz\nTo build the MiGz Java library, use the command `gradle :migz:build`.\nTo build the munzip tool, use the command `gradle :munzip:build`.\nTo build the mzip tool, use the command `gradle :mzip:build`.","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Flinkedin%2Fmigz","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Flinkedin%2Fmigz","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Flinkedin%2Fmigz/lists"}