An open API service indexing awesome lists of open source software.

https://github.com/pothos/zpaqlpy

Compiles a zpaqlpy source file (a Python-subset) to a ZPAQ configuration file for usage with zpaqd
https://github.com/pothos/zpaqlpy

bytecode compiler compression python-subset zpaq zpaql

Last synced: 10 months ago
JSON representation

Compiles a zpaqlpy source file (a Python-subset) to a ZPAQ configuration file for usage with zpaqd

Awesome Lists containing this project

README

          

zpaqlpy compiler
================

Compiles a zpaqlpy source file (a Python-subset) to a ZPAQ configuration file for usage with zpaqd.

That way it is easy to develop new compression algorithms with ZPAQ.

Or to bring a decompression algorithm to the ZPAQ format so that the compressed data can be stored in a ZPAQ archive without breaking compatibility.

An example is the `brotlizpaq` wrapper around `zpaqd` which compresses the input files with brotli and stores them as valid blocks in a ZPAQ archive (which will decompress slower than native brotli decompression due to the less efficient ZPAQL implementation).

The Python source files are standalone executable with Python 3 (tested: 3.4, 3.5).

Jump to the end for a tutorial or look into [test/lz1.py](https://github.com/pothos/zpaqlpy/tree/master/test/lz1.py), [test/pnm.py](https://github.com/pothos/zpaqlpy/tree/master/test/pnm.py) or [test/brotli.py](https://github.com/pothos/zpaqlpy/tree/master/test/brotli.py) for an example.

Download from [releases](https://github.com/pothos/zpaqlpy/releases)
or install with

git clone https://github.com/pothos/zpaqlpy.git
cd zpaqlpy
cargo install # will build and copy the binary to ~/.cargo/bin/

Build in place with: `make zpaqlpy`

To build again: `make clean`

[B.Sc. Thesis](https://pothos.github.io/papers/BSc_thesis_ZPAQL_compiler.pdf)

Copyright (C) 2016 Kai Lüke kailueke at@ riseup.net

This program is free software: you can redistribute it and/or modify
it under the terms of the GNU General Public License as published by
the Free Software Foundation, either version 3 of the License, or
(at your option) any later version.

This program is distributed in the hope that it will be useful,
but WITHOUT ANY WARRANTY; without even the implied warranty of
MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
GNU General Public License for more details.

You should have received a copy of the GNU General Public License
along with this program. If not, see .

The ZPAQ format and the zpaq archiver
=====================================

**The ZPAQ Open Standard Format for Highly Compressed Data**

Based on the idea to deliver the decompression algorithm together with
the compressed data this archive format wants to solve the problem that
changes to the algorithm need new software at the recipient's device.
Also it acknowledges the fact that different input data should be
handled with different compression techniques.

The PAQ compression programmes typically use context mixing i.e.
mixing different predictors which are context-aware for usage in an
arithmetic encoder, and thus often achieve the best known compression
results. The ZPAQ archiver is the successor to them and also supports
more simple models like LZ77 and BWT depending on the input data.

It is only specified how decompression takes place. The format makes
use of predefined context model components which can be woven into
a network, a binary code for context computation for components and a
postprocessor which reverts a transformation on the input data that
took place before the data was passed to the context mixing and
encoding phase. The postprocessor is also delivered as a bytecode
like the context computation code before the compressed data begins.

Specification: http://mattmahoney.net/dc/zpaq206.pdf

**zpaq - Incremental Journaling Backup Utility and Archiver**

The end user archiver supports incremental backups with deduplication as
well as flat streaming archives (ZPAQ format Level 1). It picks simple
or more complex depending on whether they perform for the input data
and which compression level was specified for the files to append
to the archive. Arbitrary algorithms are not supported, but a good
variety of specialised and universal methods is available.

Homepage: http://mattmahoney.net/dc/zpaq.html

Working principle: http://mattmahoney.net/dc/zpaq_compression.pdf

**zpaqd - development tool for new algorithms**

The zpaqd development tool only allows creation of streaming mode
archives, but therefore accepts a ZPAQ configuration file containing
information on the used context mixing components, the ZPAQL programme
for context computation and the ZPAQL postprocessing programme in order
to revert a possible transformation that took place (LZ77, BWT,
E8E9 for x86 files or any custom transformation), which is applied
before compression an externally called programme named in the
configuration. There are special configurations for JPG, BMP and more.

Homepage: http://mattmahoney.net/dc/zpaqutil.html

The zpaqlpy Python-subset
=========================

**Grammar**

For user-defined sections of the template. Not all is supported but anyway
included for specific error messages instead of parser errors (e.g. nonlocal,
dicts, strings or the @-operator for matrix multiplication).

Listed here are productions with NUMBER, NAME, ”symbols”, NEWLINE, INDENT,
DEDENT or STRING as terminals, nonterminals are defined on the left side of the -> arrow.

Prog -> (NEWLINE* stmt)* ENDMARKER?
funcdef -> ”def” NAME Parameters ”:” suite
Parameters -> ”(” Typedargslist? ”)”
Typedargslist -> Tfpdef (”=” test)? (”,” Tfpdef (”=” test)?)* (”,” (”**” Tfpdef)?)?
Tfpdef -> NAME (”:” test)?
stmt -> simple_stmt | compound_stmt
simple_stmt -> small_stmt (”;” small_stmt)* ”;”? NEWLINE
small_stmt -> expr_stmt, pass_stmt, flow_stmt, global_stmt, nonlocal_stmt
expr_stmt -> (store_assign augassign test) | ((store_assign ”=”)? test)
store_assign -> NAME (”[” test ”]”)?
augassign -> ”+=” | ”-=” | ”*=” | ”@=” | ”//=” | ”/=” | ”%=” | ”&=” | ”|=” | ”^=” | ”<<=” | ”>>=” | ”**=”
pass_stmt -> ”pass”
flow_stmt -> break_stmt | continue_stmt | return_stmt
break_stmt -> ”break”
continue_stmt -> ”continue”
return_stmt -> ”return” test
global_stmt -> ”global” NAME (”,” NAME)*
nonlocal_stmt -> ”nonlocal” NAME (”,” NAME)*
compound_stmt -> if_stmt | while_stmt | funcdef
if_stmt -> ”if” test ”:” suite (”elif” test ”:” suite)* (”else” ”:” suite)?
while_stmt -> ”while” test ”:” suite (”else” ”:” suite)?
suite -> simple_stmt, NEWLINE INDENT stmt+ DEDENT
test -> or_test
test_nocond -> or_test
or_test -> and_test (”or” and_test)*
and_test -> not_test (”and” not_test)*
not_test -> comparison | (”not” not_test)
comparison -> expr (comp_op expr)*
comp_op -> ”<” | ”>” | ”==” | ”>=” | ”<=” | ”!=” | ”in” | ”not” ”in” | ”is” | ”is” ”not”
expr -> xor_expr (”|” xor_expr)*
xor_expr -> and_expr (”^” and_expr)*
and_expr -> shift_expr (”&” shift_expr)*
shift_expr -> arith_expr | (arith_expr (shift_op arith_expr)+)
shift_op -> ”<<” | ”>>”
arith_expr -> term | (term (t_op term)+)
t_op -> ”+” | ”-”
term -> factor (f_op factor)*
f_op -> ”*” | ”@” | ”/” | ”%” | ”//”
factor -> (”+” factor) | (”-” factor) | (”~” factor) | power
power -> atom_expr (”**” factor)?
atom_expr -> (NAME ”(” arglist? ”)”) | (NAME ”[” test ”]”) | atom
atom -> (”(” test ”)”) | (”” dictorsetmaker? ””) | NUMBER | STRING+ | ”...”
”None” | ”True” | ”False” | NAME
dictorsetmaker -> dictorsetmaker_t (”,” dictorsetmaker_t)* ”,”?
dictorsetmaker_t -> test ”:” test
arglist -> test (”,” test)* ”,”?

**Notes**

An input has to be organised like the template, so best is to fill it out with
the values for hh, hm, ph, pm like in a ZPAQ configuration to define the size of
H and M in hcomp and pcomp sections. In the dict which serves for calculation of
n (i.e. number of context mixing components) you have to specify the components
as in a ZPAQ configuration file, arguments are documented in the specification
(see `--info-zpaq` for link).

Only valid Python programmes without exceptions are supported as input, so run
them standalone before compiling.
For the arrays on top of H or M there is no boundary check, please make sure
the Python version works correct. If you need a ringbuffer on H or M, you have
to use `% len(hH)` or `&((1< source.py) | Editable?
----------------------------------------------------------------|--------------
Definition of the ZPAQ configuration header data (memory size, context mixing components) and optionally functions and variables used by both hcomp and pcomp | yes
API functions for input and output, initialization of memory | no
function hcomp and associated global variables and functions | yes
function pcomp and associated global variables and functions | yes
code for standalone execution of the Python file analog to running a ZPAQL configuration with zpaqd `r [cfg] p|h` | no

**Exposed API**

The 32- or 8-bit memory areas H and M are available as arrays `hH`, `pH`, `hM`, `pM`
depending on being a hcomp or pcomp section with size `2**hh` , `2**hm` , `2**ph`,
`2**pm` defined in the header as available constants hh, hm, ph, pm.
There is support for `len(hH)`, `len(pH)`, `len(hM)`, `len(pM)` instead of calculating
`2**hh`. But in general len() is not supported, see `len_hH()` below for dynamic
arrays. `NONE` is a shortcut for 0 - 1 = 4294967295.

Other functions | Description
----------------------------|--------------------------------------------------
c = read_b() | Read one input byte, might leave VM execution and return to get next
push_b(c) | Put read byte c back, overwrites if already present (no buffer)
c = peek_b() | Read but do not consume next byte, might leave VM execution and return to get next
out(c) | In pcomp: write c to output stream
error() | Execution fails with ”Bad ZPAQL opcode”
aref = alloc_pH(asize), … | Allocate an array of size asize on pH/pM/hH/hM
aref = array_pH(intaddr), … | Cast an integer address back to a reference
len_pH(aref), … | Get the length of an array in pH/pM/hH/hM
free_pH(aref), … | Free the memory in pH/pM/hH/hM again by
| destructing the array

If backend implementations `addr_alloc_pH(size)`, `addr_free_pH(addr)`, … are
defined then dynamic memory management is available though the API functions
`alloc_pM` and `free_pM`. The cast `array_pH(numbervar)` is sometimes needed when the
array reference is passed between functions because then it is just treated as
integer again because no boxed types are used in general.

The template provides sample implementations of `addr_alloc_pM`, `addr_free_pM` , ….
The returned pointer is expected to point at the first element of the array. One
entry before the first element is used to store whether this memory section is
free or not. Before that the length of the array is store, i.e.
H[arraypointer-2] for arrays in H and the four bytes
M[arraypointer-5]…M[arraypointer-2] of the 32-bit length for arrays in M.

The last addressable starting point for any list is 2147483647 == (1<<31) - 1
because the compiler uses the 32nd bit to distinguish between pointers to M/H.

Tutorial: Writing new code
==========================

A context mixing model with a preprocessor for run length encoding.
Three components are used to form the network.

Create a new template which will then be modified at the beginning and the pcomp/hcomp sections:

./zpaqlpy --emit-template > rle_model.py
chmod +x rle_model.py

First the size of the arrays H and M for each section, hcomp and pcomp needs to be specified:

hh = 2 # i.e. size is 2**2 = 4, because H[0], H[1], H[2] are the inputs for the components

The first component should give predictions based on the byte value and the second component based on the run length,
both give predictions for the next count and the next value.
Then the context-mixing components are combined to a network:

n = len({
0: "cm 19 22", # context table size 2*19 with partly decoded byte as 9 bit hash xored with the context, count limit 22
1: "cm 19 22",
2: "mix2 1 0 1 30 0", # will mix 0 and 1 together, context table size 2**1 with and-0 masking of the partly decoded byte which is added to the context, learning rate 30
})

Each component i gets its context input from the entry in H[i] after each run of
the hcomp function, which is called for each input byte of the preprocessed data,
which either is to be stored through arithmetic coding in compression phase
or is retrieved through decoding in decompression phase with following
postprocessing done by calls of the pcomp function.

Then we specify a preprocessor:

pcomp_invocation = "./simple_rle"

The context-mixing network is written to the archive in byte representation
as well as the bytecode for hcomp and pcomp (if they are used).
The preprocessor command is needed when the compiled file is used with zpaqd
if a pcomp section is present.
As the preprocessor might be any external programme or also included in the
compressing archiver and is of no use for decompression it is therefore not
mentioned in the archive anymore.

Create the preprocessor file and fill it:

$ chmod +x simple_rle
$ cat ./simple_rle
#!/usr/bin/env python3
import sys
input = sys.argv[1]
output = sys.argv[2]
with open(input, mode='rb') as fi:
with open(output, mode='wb') as fo:
last = None
count = 0
data = []
for a in fi.read():
if a != last or count == 255: # count only up to 255 to use one byte
if last != None: # write out the pair
data.append(last)
data.append(count)
last = a # start counting
count = 1
else:
count += 1 # continue counting
if last != None:
data.append(last)
data.append(count)
fo.write(bytes(data))

Then we need code in the pcomp section to undo this transform:

case_loading = False
last = NONE

def pcomp(c):
global case_loading, last
if c == NONE: # start of new segment, so restart our code
case_loading = False
last = NONE
return
if not case_loading: # c is byte to load
case_loading = True
last = c
else: # write out content of last c times
case_loading = False
while c > 0:
c-= 1
out(last)

So now it should produce the same file as the input file:

./simple_rle INPUTFILE input.rle
./rle_model.py pcomp input.rle input.norle
cmp INPUTFILE input.norle

And we can already try it, even if hcomp does not compute the context data yet (so compression is not really good):

./zpaqlpy rle_model.py
./zpaqd c rle_model.cfg archive.zpaq FILE FILE FILE

Now we can add hcomp code to improve compression by adaptive prediction:

at_counter = False # if false, then c is byte, otherwise c is a counter
last_value = 0
last_counter = 0

def hcomp(c): # pcomp bytecode is passed first (or 0 if there is none)
global at_counter, last_value, last_counter
if at_counter:
last_counter = c
else:
last_value = c
# first part of the context for the first CM is the byte replicated and
# the second part is whether we are at a counter (then we predict for a byte) or vice versa
hH[0] = (last_value << 1) + at_counter # at_counter will occupy one bit, therefore shift
hH[0] <<= 9 # again shift to the side because of the xor with the partially decoded byte
# second CM same but uses the counter for prediction
hH[1] = (last_counter << 1) + at_counter
hH[1] <<= 9
hH[2] = at_counter + 0 # context for mixer: is at counter (1) or not (0)
at_counter = not at_counter

We need to compile again before we run the final ZPAQ configuration file:

./zpaqlpy rle_model.py
./zpaqd c rle_model.cfg archive.zpaq FILE FILE FILE

zpaqd needs to have simple_rle in the same folder because we specified `pcomp_invocation = "./simple_rle"`