{"id":13741624,"url":"https://github.com/judofyr/minz","last_synced_at":"2025-03-22T15:30:43.996Z","repository":{"id":66280390,"uuid":"443133211","full_name":"judofyr/minz","owner":"judofyr","description":"Minimal string compression","archived":false,"fork":false,"pushed_at":"2023-07-20T09:43:06.000Z","size":11,"stargazers_count":49,"open_issues_count":0,"forks_count":1,"subscribers_count":2,"default_branch":"main","last_synced_at":"2025-03-18T12:50:59.712Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":null,"language":"Zig","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/judofyr.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null}},"created_at":"2021-12-30T16:51:03.000Z","updated_at":"2025-01-26T11:29:25.000Z","dependencies_parsed_at":"2024-01-25T05:09:14.085Z","dependency_job_id":"cfc6c627-105b-4f13-901e-6b71e2d23382","html_url":"https://github.com/judofyr/minz","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/judofyr%2Fminz","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/judofyr%2Fminz/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/judofyr%2Fminz/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/judofyr%2Fminz/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/judofyr","download_url":"https://codeload.github.com/judofyr/minz/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":24497840
4,"owners_count":20541850,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2024-08-03T04:01:01.043Z","updated_at":"2025-03-22T15:30:43.561Z","avatar_url":"https://github.com/judofyr.png","language":"Zig","funding_links":[],"categories":["Libraries"],"sub_categories":[],"readme":"# minz: A minimal compressor\n\nminz is a minimal string compressor based on the paper [FSST: Fast Random Access String Compression](http://www.vldb.org/pvldb/vol13/p2649-boncz.pdf).\n\nThe compressed format is very simple:\nIt uses a pre-computed dictionary of 255 entries, each word being at most 8 bytes long.\nBytes 0x00 to 0xFE each add a word from the dictionary, while byte 0xFF is an escape character that adds the next byte as-is.\n\n**Example:** If the dictionary contains `0x00 = hello` and `0x01 = world`,\nthen `0x00 0xFF 0x20 0x01 0xFF 0x21` (six bytes) decompresses into `hello world!`.\n\nThis has the following characteristics:\n\n* You'll have to build the dictionary on sample data before you can compress anything.\n* There's extremely little overhead in the compressed string. 
\n  This makes it usable for compressing small strings (\u003c200 bytes) directly.\n* The maximal compression ratio is 8x (since each word in the dictionary is at most 8 bytes long), but the typical ratio seems to be around 2x-3x.\n* This makes minz quite different from \"classical\" compression algorithms and it has different use cases.\n  In a database system you can use minz to compress the individual _entries_ in an index,\n  while with other compression schemes you typically have to compress a bigger block.\n  This is what the authors of the paper mean by \"random access string compression\".\n\n## Usage\n\nminz is currently provided as a **library in Zig**.\nThere's no documentation, so you'll have to look at the public functions and test cases.\n\nThere's also a small command-line tool which reads in a file, trains a dictionary (from 1% of the lines), compresses each line separately, and then reports the total ratio:\n\n```\n$ zig build\n$ ./zig-out/bin/line-compressor access.log\nReading file: access.log\nRead 689253 lines.\nTraining...\nCompressing...\nUncompressed: 135114557\nCompressed:   46209436\nRatio: 2.9239603140795745\n```\n\n## Current status\n\nThis is just a learning project for me to learn the algorithm in the paper.\nIt's not being used in any production systems, and I'm not actively developing it.\n\nIn addition, the dictionary-training algorithm presented in the paper is a bit vague on the exact details.\nThere is some choice in how you combine symbols, and right now it doesn't seem to create an \"optimal\" dictionary according to human inspection.\nIf you intend to use this for a \"real\" project you'll probably have to invest some more time.\n\n## Roadmap / pending work\n\n- [ ] Improve training algorithm.\n- [ ] Command-line tool (for training/encoding/decoding).\n- [ ] Plain JavaScript encoder/decoder.\n- [ ] Optimized encoder using AVX512.\n- [ ] Integrate encoder/decoder as a native Node 
module.","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fjudofyr%2Fminz","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fjudofyr%2Fminz","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fjudofyr%2Fminz/lists"}