{"id":17717288,"url":"https://github.com/bwesterb/go-ncrlite","last_synced_at":"2025-04-30T22:04:29.700Z","repository":{"id":248895028,"uuid":"827896670","full_name":"bwesterb/go-ncrlite","owner":"bwesterb","description":"Compress sets of integers efficiently","archived":false,"fork":false,"pushed_at":"2024-08-02T10:30:12.000Z","size":61,"stargazers_count":17,"open_issues_count":2,"forks_count":0,"subscribers_count":3,"default_branch":"main","last_synced_at":"2024-08-02T12:04:41.452Z","etag":null,"topics":["compression","huffman-coding","integers","ncr"],"latest_commit_sha":null,"homepage":"","language":"Go","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/bwesterb.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2024-07-12T16:04:21.000Z","updated_at":"2024-08-02T10:30:15.000Z","dependencies_parsed_at":"2024-07-17T19:40:23.780Z","dependency_job_id":"71c9051b-711b-4a69-b6a7-5c1f3c9e527a","html_url":"https://github.com/bwesterb/go-ncrlite","commit_stats":null,"previous_names":["bwesterb/go-ncrlite"],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/bwesterb%2Fgo-ncrlite","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/bwesterb%2Fgo-ncrlite/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/bwesterb%2Fgo-ncrlite/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/bwesterb%2Fgo-ncrlite/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/bwester
b","download_url":"https://codeload.github.com/bwesterb/go-ncrlite/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":246373955,"owners_count":20766815,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["compression","huffman-coding","integers","ncr"],"created_at":"2024-10-25T14:19:42.415Z","updated_at":"2025-04-30T22:04:29.693Z","avatar_url":"https://github.com/bwesterb.png","language":"Go","funding_links":[],"categories":[],"sub_categories":[],"readme":"go-ncrlite\n==========\n\n*ncrlite* is a simple and fast compression format specifically designed to compress an unordered\nset of positive integers (below 2⁶⁴).\nThis repository contains a [Go package](https://pkg.go.dev/github.com/bwesterb/go-ncrlite#Compress)\nthat implements it and a commandline tool.\n\n**Warning.** The file format is not yet final.\n\nPerformance\n-----------\n\n*ncrlite* achieves smaller compressed sizes than general-purpose compressors.\n\n| Dataset | Description | CSV | ncrlite | `gzip -9` | `xz -9` |\n| --- | --- | --- | --- | --- | --- |\n| [le.csv](https://westerbaan.name/~bas/ncrlite/le.csv.ncrlite) | Sequence numbers of Let's Encrypt certificates revoked on July 18th, 2024 | 4.8MB | 706kB | 1.7MB | 900kB |\n| [primes.csv](https://westerbaan.name/~bas/ncrlite/primes.csv.ncrlite) | First million prime numbers | 8.2MB | 674kB | 2.4MB | 941kB |\n| [sigs.csv](https://westerbaan.name/~bas/ncrlite/sigs.csv.ncrlite) | List of the 9 signature algorithms supported by Chrome 126 | 44B | 16B | 58B | 96B |\n| 
[9900.csv](https://westerbaan.name/~bas/ncrlite/9900.csv.ncrlite) | Numbers {9900, 9901, ..., 9999, 10000} | 506B | 24B | 181B | 200B |\n\nCompared to more specialized compressors, *ncrlite* outperforms [Elias–Fano](https://github.com/bwesterb/go-ncrlite/issues/2) on all but the tiny sigs.csv set.\n*ncrlite* performs slightly worse than [Rice coding](https://en.wikipedia.org/wiki/Golomb_coding) on random sets,\nbut is still close to the theoretical limit of *lg N choose k*. *ncrlite* does perform better than Rice coding on skewed sets like {9900, ..., 10000}.\n\n| Dataset | ncrlite | Rice | Elias–Fano | Limit for random sets |\n| --- | --- | --- | --- | --- |\n| le.csv | 706kB | 707kB | 734kB | 704kB |\n| primes.csv | 674kB | 669kB | 742kB | 668kB |\n| sigs.csv | 16B | 11B | 11B | 11B |\n| 9900.csv | 24B | 108B | 108B | 101B |\n\n### Theoretical limit for random sets\n\nThere are *N choose k* subsets of *k* positive integers below *N*.\nThus there is a hard limit: no compression method can encode *every*\nsuch set in fewer than *lg N choose k* bits.\n\nOf course a compression method can beat the limit for specific sets,\nbut it will have to compensate by using more bits for others.\n\n#### Origin of the name\n\nThe name *ncrlite* is a pun on this theoretical limit\nand [CRLite](https://blog.mozilla.org/security/2020/01/09/crlite-part-1-all-web-pki-revocations-compressed/).\nNamely *N choose k* is sometimes written as *N nCr k*, including\non my old [TI 83+](https://en.wikipedia.org/wiki/TI-83_series),\nand I studied this problem initially in the context of compressing\ncertificate transparency index numbers of revoked certificates.\n\nCommandline tool\n----------------\n\n### Installation\n\nInstall [Go](https://go.dev/doc/install) and run\n\n```\n$ go install github.com/bwesterb/go-ncrlite/cmd/ncrlite@latest\n```\n\nNow you can use `ncrlite`.\n\n### Basic operation\n\n`ncrlite` takes as input a text file with a positive number on each line.\n\n```\n$ cat dunbar\n5\n15\n35\n150\n500\n1500\n```\n\nTo 
compress simply run:\n\n```\n$ ncrlite dunbar\n```\n\nThis will create `dunbar.ncrlite` and remove `dunbar`.\n\nThe input file does not have to be sorted (numerically). If it is not, `ncrlite` will sort the input first, which is slower.\n\nTo decompress, run:\n\n```\n$ ncrlite -d dunbar.ncrlite\n```\n\nThis will create `dunbar` and remove `dunbar.ncrlite`. The output file is always sorted.\n\n### Other formats\n\nAt the moment, the `ncrlite` commandline tool only supports the simple text format.\n[Reach out](https://github.com/bwesterb/go-ncrlite/issues/1) if another is useful.\n\n### Other flags\n\n`ncrlite` supports several familiar flags.\n\n```\n  -f, --force\n    \toverwrite output\n  -k, --keep\n    \tkeep (don't delete) input file\n  -c, --stdout\n    \twrite to stdout; implies -k\n```\n\nWithout specifying a filename (or using `-`),\n`ncrlite` will read from `stdin` and write to `stdout`.\n\n### Inspect compressed file\n\nWith `-i` we can inspect a compressed file:\n\n```\n$ ncrlite -i le.csv.ncrlite \nmax bitlength        14\ncodelength h[0]      9\ndictionary size      56b\n\nCodebook bitlengths:\n 0 111111110\n 1 11111110\n 2 1111100\n 3 1111101\n 4 11110\n 5 1100\n 6 1101\n 7 100\n 8 00\n 9 01\n10 101\n11 1110\n12 1111110\n13 1111111110\n14 1111111111\n\nMaximum value    (N)  382584265\nNumber of values (k)  512652\nTheoretical best avg  703953.8B\nOverhead              0.4%\n```\n\nFormat\n------\nIn short: we store the deltas (differences) which are each prefixed by a Huffman\ncode for their bitlength. The Huffman code is stored using bzip2's method.\n\nNow, in detail. The file starts with the **size** of the set as an unsigned varint.\n\nThere are two special cases.\n\n1. If the size of the set is zero, the file ends immediately after the size\n   (without endmarker.)\n\n2. 
If the size of the set is one, then the value of that element is encoded\n   as an unsigned varint after it and the file ends (without endmarker.)\n\nThe values of the set are not encoded directly, but instead their **deltas**\nare encoded. The *n*th delta is the difference between the *n*th\nand the *n-1*th value, considering the set as a sorted list.\n\nThe first delta is special: it's the minimum value of the set plus one\nso that a delta is never zero.\n\nFor each delta *d*, we consider its **bitlength**. That is the least *l*\nsuch that *2^(l+1) \u003e d*. Note that this is different from the typical\ndefinition of bitlength being one smaller: the length of 1 is 0 and of 4 is 2.\n\nThe six least significant bits of the next byte encode the largest bitlength\nof any delta that occurs. We assign to that bitlength, and each smaller\nbitlength, a canonical Huffman code by encoding the length of each of their\ncodewords.\n\nThe next six bits (that is: the two most significant\nbits of the byte used for the largest bitlength, and the four least significant\nbits of the byte afterward) encode the length of the codeword for the\nzero bitlength.\n\nWe continue with the remaining bitlengths in order. If the next bitlength\nhas a codeword of the same length as the previous codeword, we encode this\nwith a single bit 1.\n\nInstead, if the codeword is one larger we encode this as first a single bit 0\nto say we're not done; then a single bit 1 to say the next is larger;\nand finally a 1 to say we're done. Together: `0b101`.\nIf it's two larger we repeat twice: `0b10101`. And so on. If the codeword\nis one smaller we use `0b001`, and repeat `00` if the difference is larger.\n\nAfter having encoded the Huffman code for the bitlengths, we encode\nthe deltas themselves. First we write the Huffman code for the bitlength.\nThen we write the delta with that many bits, without its most significant bit\nas it's implied.\n\nFinally, we write the endmarker `0xaa` = `0b10101010`. 
This allows for simpler\ndecompression using prefix tables. The remaining high bits in the final byte\nare set to zero.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fbwesterb%2Fgo-ncrlite","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fbwesterb%2Fgo-ncrlite","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fbwesterb%2Fgo-ncrlite/lists"}