{"id":17498410,"url":"https://github.com/pothos/zpaqlpy","last_synced_at":"2025-08-25T21:06:54.828Z","repository":{"id":142163494,"uuid":"66582823","full_name":"pothos/zpaqlpy","owner":"pothos","description":"Compiles a zpaqlpy source file (a Python-subset) to a ZPAQ configuration file for usage with zpaqd","archived":false,"fork":false,"pushed_at":"2022-08-30T15:40:42.000Z","size":18035,"stargazers_count":21,"open_issues_count":0,"forks_count":4,"subscribers_count":4,"default_branch":"master","last_synced_at":"2025-03-04T21:36:23.229Z","etag":null,"topics":["bytecode","compiler","compression","python-subset","zpaq","zpaql"],"latest_commit_sha":null,"homepage":null,"language":"Rust","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"gpl-3.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/pothos.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2016-08-25T18:24:50.000Z","updated_at":"2024-05-02T19:45:52.000Z","dependencies_parsed_at":null,"dependency_job_id":"90615473-02f7-4135-a333-8465284ac01e","html_url":"https://github.com/pothos/zpaqlpy","commit_stats":null,"previous_names":[],"tags_count":1,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/pothos%2Fzpaqlpy","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/pothos%2Fzpaqlpy/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/pothos%2Fzpaqlpy/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/pothos%2Fzpaqlpy/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owne
rs/pothos","download_url":"https://codeload.github.com/pothos/zpaqlpy/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":248896589,"owners_count":21179458,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["bytecode","compiler","compression","python-subset","zpaq","zpaql"],"created_at":"2024-10-19T16:58:11.401Z","updated_at":"2025-04-14T14:29:06.625Z","avatar_url":"https://github.com/pothos.png","language":"Rust","funding_links":[],"categories":[],"sub_categories":[],"readme":"zpaqlpy compiler\n================\n\nCompiles a zpaqlpy source file (a Python-subset) to a ZPAQ configuration file for usage with zpaqd.\n\nThat way it is easy to develop new compression algorithms with ZPAQ.\n\nOr to bring a decompression algorithm to the ZPAQ format so that the compressed data can be stored in a ZPAQ archive without breaking compatibility.\n\nAn example is the `brotlizpaq` wrapper around `zpaqd` which compresses the input files with brotli and stores them as valid blocks in a ZPAQ archive (which will decompress slower than native brotli decompression due to the less efficient ZPAQL implementation).\n\nThe Python source files are standalone executable with Python 3 (tested: 3.4, 3.5).\n\nJump to the end for a tutorial or look into [test/lz1.py](https://github.com/pothos/zpaqlpy/tree/master/test/lz1.py), [test/pnm.py](https://github.com/pothos/zpaqlpy/tree/master/test/pnm.py) or [test/brotli.py](https://github.com/pothos/zpaqlpy/tree/master/test/brotli.py) for an example.\n\nDownload from 
[releases](https://github.com/pothos/zpaqlpy/releases)\nor install with\n\n    git clone https://github.com/pothos/zpaqlpy.git\n    cd zpaqlpy\n    cargo install  # will build and copy the binary to ~/.cargo/bin/\n\nBuild in place with: `make zpaqlpy`\n\nTo build again: `make clean`\n\n[B.Sc. Thesis](https://pothos.github.io/papers/BSc_thesis_ZPAQL_compiler.pdf)\n\nCopyright (C) 2016 Kai Lüke kailueke at@ riseup.net\n\nThis program is free software: you can redistribute it and/or modify\nit under the terms of the GNU General Public License as published by\nthe Free Software Foundation, either version 3 of the License, or\n(at your option) any later version.\n\nThis program is distributed in the hope that it will be useful,\nbut WITHOUT ANY WARRANTY; without even the implied warranty of\nMERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the\nGNU General Public License for more details.\n\nYou should have received a copy of the GNU General Public License\nalong with this program.  If not, see \u003chttp://www.gnu.org/licenses/\u003e.\n\n\nThe ZPAQ format and the zpaq archiver\n=====================================\n\n**The ZPAQ Open Standard Format for Highly Compressed Data**\n\nBased on the idea to deliver the decompression algorithm together with\nthe compressed data this archive format wants to solve the problem that\nchanges to the algorithm need new software at the recipient's device.\nAlso it acknowledges the fact that different input data should be\nhandled with different compression techniques.\n\nThe PAQ compression programmes typically use context mixing i.e.\nmixing different predictors which are context-aware for usage in an\narithmetic encoder, and thus often achieve the best known compression\nresults. The ZPAQ archiver is the successor to them and also supports\nmore simple models like LZ77 and BWT depending on the input data.\n\nIt is only specified how decompression takes place. 
The format makes\nuse of predefined context model components which can be woven into\na network, a binary code for context computation for components and a\npostprocessor which reverts a transformation on the input data that\ntook place before the data was passed to the context mixing and\nencoding phase. The postprocessor is also delivered as bytecode,\nlike the context computation code, before the compressed data begins.\n\nSpecification: http://mattmahoney.net/dc/zpaq206.pdf\n\n**zpaq - Incremental Journaling Backup Utility and Archiver**\n\nThe end user archiver supports incremental backups with deduplication as\nwell as flat streaming archives (ZPAQ format Level 1). It picks simpler\nor more complex methods depending on how well they perform on the input data\nand which compression level was specified for the files to append\nto the archive. Arbitrary algorithms are not supported, but a good\nvariety of specialised and universal methods is available.\n\nHomepage: http://mattmahoney.net/dc/zpaq.html\n\nWorking principle: http://mattmahoney.net/dc/zpaq_compression.pdf\n\n**zpaqd - development tool for new algorithms**\n\nThe zpaqd development tool only allows creation of streaming mode\narchives, but in return accepts a ZPAQ configuration file containing\ninformation on the used context mixing components, the ZPAQL programme\nfor context computation and the ZPAQL postprocessing programme that\nreverts a possible transformation (LZ77, BWT,\nE8E9 for x86 files or any custom transformation). The transformation is applied\nbefore compression by an externally called programme named in the\nconfiguration. There are special configurations for JPG, BMP and more.\n\nHomepage: http://mattmahoney.net/dc/zpaqutil.html\n\nThe zpaqlpy Python-subset\n=========================\n\n**Grammar**\n\nFor user-defined sections of the template. Not everything listed is supported, but unsupported constructs are\nincluded anyway so that they yield specific error messages instead of parser errors (e.g. 
nonlocal,\ndicts, strings or the @-operator for matrix multiplication).\n\nListed here are productions with NUMBER, NAME, ”symbols”, NEWLINE, INDENT,\nDEDENT or STRING as terminals; nonterminals are defined on the left side of the -\u003e arrow.\n\n    Prog -\u003e (NEWLINE* stmt)* ENDMARKER?\n    funcdef -\u003e ”def” NAME Parameters ”:” suite\n    Parameters -\u003e ”(” Typedargslist? ”)”\n    Typedargslist -\u003e Tfpdef (”=” test)? (”,” Tfpdef (”=” test)?)* (”,” (”**” Tfpdef)?)?\n    Tfpdef -\u003e NAME (”:” test)?\n    stmt -\u003e simple_stmt | compound_stmt\n    simple_stmt -\u003e small_stmt (”;” small_stmt)* ”;”? NEWLINE\n    small_stmt -\u003e expr_stmt | pass_stmt | flow_stmt | global_stmt | nonlocal_stmt\n    expr_stmt -\u003e (store_assign augassign test) | ((store_assign ”=”)? test)\n    store_assign -\u003e NAME (”[” test ”]”)?\n    augassign -\u003e ”+=” | ”-=” | ”*=” | ”@=” | ”//=” | ”/=” | ”%=” | ”\u0026=” | ”|=” | ”^=” | ”\u003c\u003c=” | ”\u003e\u003e=” | ”**=”\n    pass_stmt -\u003e ”pass”\n    flow_stmt -\u003e break_stmt | continue_stmt | return_stmt\n    break_stmt -\u003e ”break”\n    continue_stmt -\u003e ”continue”\n    return_stmt -\u003e ”return” test\n    global_stmt -\u003e ”global” NAME (”,” NAME)*\n    nonlocal_stmt -\u003e ”nonlocal” NAME (”,” NAME)*\n    compound_stmt -\u003e if_stmt | while_stmt | funcdef\n    if_stmt -\u003e ”if” test ”:” suite (”elif” test ”:” suite)* (”else” ”:” suite)?\n    while_stmt -\u003e ”while” test ”:” suite (”else” ”:” suite)?\n    suite -\u003e simple_stmt | NEWLINE INDENT stmt+ DEDENT\n    test -\u003e or_test\n    test_nocond -\u003e or_test\n    or_test -\u003e and_test (”or” and_test)*\n    and_test -\u003e not_test (”and” not_test)*\n    not_test -\u003e comparison | (”not” not_test)\n    comparison -\u003e expr (comp_op expr)*\n    comp_op -\u003e ”\u003c” | ”\u003e” | ”==” | ”\u003e=” | ”\u003c=” | ”!=” | ”in” | ”not” ”in” | ”is” | ”is” ”not”\n    expr -\u003e xor_expr (”|” xor_expr)*\n    xor_expr 
-\u003e and_expr (”^” and_expr)*\n    and_expr -\u003e shift_expr (”\u0026” shift_expr)*\n    shift_expr -\u003e arith_expr | (arith_expr (shift_op arith_expr)+)\n    shift_op -\u003e ”\u003c\u003c” | ”\u003e\u003e”\n    arith_expr -\u003e term | (term (t_op term)+)\n    t_op -\u003e ”+” | ”-”\n    term -\u003e factor (f_op factor)*\n    f_op -\u003e ”*” | ”@” | ”/” | ”%” | ”//”\n    factor -\u003e (”+” factor) | (”-” factor) | (”~” factor) | power\n    power -\u003e atom_expr (”**” factor)?\n    atom_expr -\u003e (NAME ”(” arglist? ”)”) | (NAME ”[” test ”]”) | atom\n    atom -\u003e (”(” test ”)”) | (”{” dictorsetmaker? ”}”) | NUMBER | STRING+ | ”...” |\n            ”None” | ”True” | ”False” | NAME\n    dictorsetmaker -\u003e dictorsetmaker_t (”,” dictorsetmaker_t)* ”,”?\n    dictorsetmaker_t -\u003e test ”:” test\n    arglist -\u003e test (”,” test)* ”,”?\n\n**Notes**\n\nAn input has to be organised like the template, so it is best to fill it out with\nthe values for hh, hm, ph, pm like in a ZPAQ configuration to define the size of\nH and M in hcomp and pcomp sections. In the dict which serves for calculation of\nn (i.e. number of context mixing components) you have to specify the components\nas in a ZPAQ configuration file; the arguments are documented in the specification\n(see `--info-zpaq` for link).\n\nOnly valid Python programmes without exceptions are supported as input, so run\nthem standalone before compiling.\nFor the arrays on top of H or M there is no boundary check, so please make sure\nthe Python version works correctly. 
If you need a ringbuffer on H or M, you have\nto use `% len(hH)` or `\u0026((1\u003c\u003chh)-1)` and cannot rely on integer overflows or the\nmodulo-array-length operation on indices in H or M like in plain ZPAQL because\nH is expanded to contain the stack (and also due to the lack of overflows when\nrunning the plain Python script).\n\nOnly positive 32-bit integers can be used; there are no strings, lists, arbitrarily big\nnumbers, classes, closures or (function) objects.\n\n**Input File**\n\nMust be a runnable Python 3.5 file in the form of the template and encoded as UTF-8\nwithout a BOM (Byte-Order-Mark). The definitions at the beginning should be\naltered, and your own code should only be inserted after them. The other two editable sections can\nrefer to definitions in the first section.\n\n            Template Sections (--emit-template \u003e source.py)         |   Editable?\n    ----------------------------------------------------------------|--------------\n      Definition of the ZPAQ configuration header data (memory      |      yes\n      size, context mixing components) and optionally functions     |\n      and variables used by both hcomp and pcomp                    |\n      API functions for input and output, initialization of memory  |       no\n      function hcomp and associated global variables and functions  |      yes\n      function pcomp and associated global variables and functions  |      yes\n      code for standalone execution of the Python file analogous    |       no\n      to running a ZPAQL configuration with zpaqd `r [cfg] p|h`     |\n\n**Exposed API**\n\nThe 32-bit memory area H and the 8-bit memory area M are available as arrays `hH`, `hM`, `pH`, `pM`,\ndepending on whether the code is in the hcomp or pcomp section, with sizes `2**hh`, `2**hm`, `2**ph`,\n`2**pm` given by the constants hh, hm, ph, pm defined in the header.\nThere is support for `len(hH)`, `len(pH)`, `len(hM)`, `len(pM)` instead of calculating\n`2**hh`. In general, however, len() is not supported; see `len_hH()` below for dynamic\narrays. 
`NONE` is a shortcut for 0 - 1 = 4294967295.\n\n          Other functions       |                   Description\n    ----------------------------|--------------------------------------------------\n    c = read_b()                | Read one input byte, might leave VM execution and return to get next\n    push_b(c)                   | Put read byte c back, overwrites if already present (no buffer)\n    c = peek_b()                | Read but do not consume next byte, might leave VM execution and return to get next\n    out(c)                      | In pcomp: write c to output stream\n    error()                     | Execution fails with ”Bad ZPAQL opcode”\n    aref = alloc_pH(asize), …   | Allocate an array of size asize on pH/pM/hH/hM\n    aref = array_pH(intaddr), … | Cast an integer address back to a reference\n    len_pH(aref), …             | Get the length of an array in pH/pM/hH/hM\n    free_pH(aref), …            | Free the memory in pH/pM/hH/hM again by\n                                | destructing the array\n\nIf backend implementations `addr_alloc_pH(size)`, `addr_free_pH(addr)`, … are\ndefined then dynamic memory management is available through the API functions\n`alloc_pM` and `free_pM`. The cast `array_pH(numbervar)` is sometimes needed when the\narray reference is passed between functions, since it is then treated as a plain\ninteger again because no boxed types are used in general.\n\nThe template provides sample implementations of `addr_alloc_pM`, `addr_free_pM`, ….\nThe returned pointer is expected to point at the first element of the array. One\nentry before the first element is used to store whether this memory section is\nfree or not. 
Before that the length of the array is stored, i.e.\nH[arraypointer-2] for arrays in H and the four bytes\nM[arraypointer-5]…M[arraypointer-2] of the 32-bit length for arrays in M.\n\nThe last addressable starting point for any list is 2147483647 == (1\u003c\u003c31) - 1\nbecause the compiler uses the 32nd bit to distinguish between pointers to M/H.\n\nTutorial: Writing new code\n==========================\n\nA context mixing model with a preprocessor for run length encoding.\nThree components are used to form the network.\n\nCreate a new template which will then be modified at the beginning and the pcomp/hcomp sections:\n\n    ./zpaqlpy --emit-template \u003e rle_model.py\n    chmod +x rle_model.py\n\nFirst, the size of the arrays H and M needs to be specified for each section (hcomp and pcomp):\n\n    hh = 2  # i.e. size is 2**2 = 4, because H[0], H[1], H[2] are the inputs for the components\n\nThe first component should give predictions based on the byte value and the second component based on the run length;\nboth give predictions for the next count and the next value.\nThen the context-mixing components are combined into a network:\n\n    n = len({\n      0: \"cm 19 22\",  # context table size 2**19 with partly decoded byte as 9 bit hash xored with the context, count limit 22\n      1: \"cm 19 22\",\n      2: \"mix2 1 0 1 30 0\",  # will mix 0 and 1 together, context table size 2**1 with and-0 masking of the partly decoded byte which is added to the context, learning rate 30\n    })\n\nEach component i gets its context input from the entry in H[i] after each run of\nthe hcomp function, which is called for each input byte of the preprocessed data.\nThis byte is either to be stored through arithmetic coding in the compression phase\nor is retrieved through decoding in the decompression phase, followed by\npostprocessing done by calls of the pcomp function.\n\nThen we specify a preprocessor:\n\n    pcomp_invocation = \"./simple_rle\"\n\nThe context-mixing network is written to 
the archive in byte representation\nas well as the bytecode for hcomp and pcomp (if they are used).\nThe preprocessor command is needed when the compiled file is used with zpaqd\nif a pcomp section is present.\nAs the preprocessor might be any external programme or also included in the\ncompressing archiver and is of no use for decompression, it is not\nmentioned in the archive anymore.\n\nCreate the preprocessor file and fill it:\n\n    $ chmod +x simple_rle\n    $ cat ./simple_rle\n    #!/usr/bin/env python3\n    import sys\n    input = sys.argv[1]\n    output = sys.argv[2]\n    with open(input, mode='rb') as fi:\n      with open(output, mode='wb') as fo:\n          last = None\n          count = 0\n          data = []\n          for a in fi.read():\n            if a != last or count == 255:  # count only up to 255 to use one byte\n              if last is not None:  # write out the pair\n                data.append(last)\n                data.append(count)\n              last = a  # start counting\n              count = 1\n            else:\n              count += 1  # continue counting\n          if last is not None:\n            data.append(last)\n            data.append(count)\n          fo.write(bytes(data))\n\nThen we need code in the pcomp section to undo this transform:\n\n    case_loading = False\n    last = NONE\n    \n    def pcomp(c):\n      global case_loading, last\n      if c == NONE:  # start of new segment, so restart our code\n        case_loading = False\n        last = NONE\n        return\n      if not case_loading:  # c is byte to load\n        case_loading = True\n        last = c\n      else:  # write out content of last c times\n        case_loading = False\n        while c \u003e 0:\n          c -= 1\n          out(last)\n\nSo now it should produce the same file as the input file:\n\n    ./simple_rle INPUTFILE input.rle\n    ./rle_model.py pcomp input.rle input.norle\n    cmp INPUTFILE input.norle\n\nAnd we can already try it, even if 
hcomp does not compute the context data yet (so compression is not really good):\n\n    ./zpaqlpy rle_model.py\n    ./zpaqd c rle_model.cfg archive.zpaq FILE FILE FILE\n\nNow we can add hcomp code to improve compression by adaptive prediction:\n\n    at_counter = False  # if false, then c is byte, otherwise c is a counter\n    last_value = 0\n    last_counter = 0\n    \n    def hcomp(c):  # pcomp bytecode is passed first (or 0 if there is none)\n      global at_counter, last_value, last_counter\n      if at_counter:\n        last_counter = c\n      else:\n        last_value = c\n      # first part of the context for the first CM is the byte replicated and\n      # the second part is whether we are at a counter (then we predict for a byte) or vice versa\n      hH[0] = (last_value \u003c\u003c 1) + at_counter  # at_counter will occupy one bit, therefore shift\n      hH[0] \u003c\u003c= 9  # again shift to the side because of the xor with the partially decoded byte\n      # second CM same but uses the counter for prediction\n      hH[1] = (last_counter \u003c\u003c 1) + at_counter\n      hH[1] \u003c\u003c= 9\n      hH[2] = at_counter + 0  # context for mixer: is at counter (1) or not (0)\n      at_counter = not at_counter\n\nWe need to compile again before we run the final ZPAQ configuration file:\n\n    ./zpaqlpy rle_model.py\n    ./zpaqd c rle_model.cfg archive.zpaq FILE FILE FILE\n\nzpaqd needs to have simple_rle in the same folder because we specified `pcomp_invocation = \"./simple_rle\"`\n\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fpothos%2Fzpaqlpy","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fpothos%2Fzpaqlpy","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fpothos%2Fzpaqlpy/lists"}