https://github.com/christopherosthues/recompression
Recompression technique of Artur Jez
- Host: GitHub
- URL: https://github.com/christopherosthues/recompression
- Owner: christopherosthues
- License: apache-2.0
- Created: 2018-11-22T18:16:17.000Z (almost 7 years ago)
- Default Branch: master
- Last Pushed: 2023-12-18T19:33:39.000Z (almost 2 years ago)
- Last Synced: 2025-03-26T07:03:42.464Z (7 months ago)
- Topics: context-free-grammar, grammar-compression, longest-common-extension, parallel-recompression, recompression, run-length-straight-line-program, straight-line-program
- Language: C++
- Size: 1.89 MB
- Stars: 7
- Watchers: 1
- Forks: 1
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: COPYING
# Recompression
This library contains various implementations of the recompression technique invented by Artur Jez, including sequential and parallel variants. The recompression technique is quite similar to the grammar-based compression algorithm Recursive Pairing (RePair), but it differs in two points. First, recompression uses two compression schemes: the block compression *bcomp*, which compresses maximal blocks of equal symbols, and the pair compression *pcomp*, which compresses non-overlapping pairs of symbols based on a partitioning of the symbols in the text. Second, *pcomp* compresses multiple different pairs in one iteration instead of only the most frequent one, as RePair does. These two compressions are executed alternately, starting with *bcomp*. The generated context-free grammar is a run-length straight-line program, in which productions of the form *X->Y^d* are allowed, where *Y^d* denotes *d* repetitions of *Y*. Given an uncompressed text *T* of length *N*, the data structure uses only *O(z lg(N/z))* words, where *z* is the number of LZ77 factors without self-reference.

## Documentation
We provide a Doxygen-generated API reference that lists all types, classes and functions of this library.

## Requirements
This library requires:
* A C++14 compiler such as g++ 7.3 or higher
* A 64-bit operating system
* The cmake build system

### Dependencies
The build system checks whether all dependencies are installed on the target system. Any dependencies that are not found are downloaded and installed automatically. The dependencies are:
* [In-place Parallel Super Scalar Samplesort (IPS⁴o)](https://github.com/SaschaWitt/ips4o.git)
* [tlx](https://github.com/tlx/tlx.git)
* for testing purposes:
* [Google test](https://github.com/google/googletest.git)
* for benchmark purposes:
* [Succinct Data Structure Library 2.0](https://github.com/simongog/sdsl-lite.git)
* [Prezza-LCE](https://github.com/al-xyz/prezza-lce)
For the parallel algorithms, you need to enable OpenMP (``-fopenmp``). To support the in-place super scalar samplesort algorithm, please read the documentation of [IPS⁴o](https://github.com/SaschaWitt/ips4o.git).

## Installation
To download and build the library, use the following commands.
```bash
git clone https://github.com/christopherosthues/recompression.git
cd recompression
mkdir build
cd build
cmake ..
make
```

To enable the benchmark output, set ``CMAKE_BUILD_TYPE=Bench`` at build time.

```bash
cmake .. -DCMAKE_BUILD_TYPE=Bench
make
```

To build the documentation, tests and/or benchmark executables, enable them by setting the corresponding option to ``ON`` in the [CMakeLists.txt](CMakeLists.txt) file. For all benchmarks, the directory path to the input files must be specified. The benchmark scripts can read in more than one file and execute one or more algorithms; enclose the respective lists in double quotes.
```bash
./bench/bench_strong_scale_recompression / " " "parallel parallel_lp parallel_ls"
```
The directory path must end with a slash ``/``. Use ``./`` for the current directory.

To print the help, simply execute the benchmarks without specifying any argument.
```bash
./bench/bench_strong_scale_recompression
```
The benchmark scripts are:
* **lce_query_bench.cpp**: The benchmarks for random lce queries.
* **long_lce_query_bench.cpp**: The benchmarks with precomputed positions that generate lce queries with long results.
* **random_access_bench.cpp**: The benchmarks for randomly generated random access queries.
* **recompression_all_bench.cpp**: Benchmarks for a fixed number of cores; the sequential variants are also available here.
* **recompression_mem_bench.cpp**: Benchmarks for RAM usage; the sequential variants are also available here (the option RECOMPRESSION_ENABLE_MALLOC_COUNT in [CMakeLists.txt](CMakeLists.txt) has to be enabled).
* **store_rlslp_bench.cpp**: Benchmarks for compression. Writes the generated RLSLP to file using the specified coder.
* **strong_scale_recompression_bench.cpp**: Strong scaling benchmarks.
* **weak_scale_recompression_bench.cpp**: Weak scaling benchmarks.

## Getting started
The recompression.hpp header in the include directory provides all functionality of the library.
```cpp
#include <vector>
#include <recompression.hpp>  // assumed header name; see the include directory

using namespace recomp;

int main() {
    // var_t is assumed here for the symbol type; the original template
    // arguments were lost in transcription.
    std::vector<var_t> text = {0, 0, 3, 0, 4, 2, 5, 12};
    size_t cores = 4;
    size_t alphabet_size = 13;
    parallel::parallel_recompression<var_t> recompression{"dataset"};
    // Or use the following to create an instance of one of the recompression
    // classes by name (call recompression->recomp(...) in that case):
    // std::string data = "dataset";
    // std::string recomp_class = "parallel_rnd10";
    // std::string parhip = "";  // only needed for the recompression version using parhip
    // std::string dir = "";     // only needed for parhip
    // std::unique_ptr<recompression<var_t>> recompression = create_recompression(recomp_class, data, parhip, dir);
    rlslp<var_t> slp;
    recompression.recomp(text, slp, alphabet_size, cores);
}
```

There are several different versions available. These are:
* **fast_seq**: A fast sequential version of the recompression.
* **hash**: A sequential version using hash tables to store the blocks/pairs to replace them with a single text scan.
* **parallel**: A parallel version. The undirected cut is not parallelized due to data dependencies.
* **parallel_lp**: A parallel version that counts the number of possibly new productions introduced by the combinations of the partition sets and, if the values of the directed cut are equal, chooses the combination that generates fewer productions. The undirected cut is not parallelized due to data dependencies.
* **parallel_rnd*k***: A fully parallel version using a randomly generated partitioning of the symbols for *pcomp*. The computation of the *undirected maximum cut* is repeated *k* times and the best cut is used.
* **parallel_ls**: A fully parallel version using a parallel local search for the partition.
* **parallel_gr**: A fully parallel version using a parallel variant of the greedy MaxCut algorithm for the partition.
* **parallel_rnddir*k***: A fully parallel version using a randomly generated partitioning of the symbols for *pcomp*. The computation of the *directed maximum cut* is repeated *k* times and the best cut is used for *pcomp*. E.g., parallel_rnddir10 repeats the *directed cut* 10 times.

The library also provides coders to encode the rlslp and write it to files. The code below shows an example.
```cpp
#include <vector>
#include <recompression.hpp>  // assumed header name

using namespace recomp;

int main() {
    // var_t is assumed here for the symbol type; the original template
    // arguments were lost in transcription.
    std::vector<var_t> text = {0, 0, 3, 0, 4, 2, 5, 12};
    size_t cores = 4;
    size_t alphabet_size = 13;
    parallel::parallel_recompression<var_t> recompression{"dataset"};
    rlslp<var_t> slp;
    recompression.recomp(text, slp, alphabet_size, cores);
    // Stores the rlslp to testfile.
    encode("sorted", "testfile", slp);
    // Loads the rlslp from testfile.
    rlslp<var_t> in_slp = decode("sorted", "testfile");
}
```

The file extension depends on the used coder. The `sorted` coder, for example, uses the file extension `.rlslp`.
There are three coders available:
* **plain**: Writes plain integers using the full bit width of the underlying integer type.
* **fixed**: Uses a fixed bit width depending on the highest variable name of the rlslp.
* **sorted**: Renames all variables according to a permutation such that, for all productions deriving pairs, the first elements of the productions' right-hand sides are sorted increasingly. The productions for the blocks are **not** sorted. The sorted values are delta-coded and written as unary codes. All other components are encoded using fixed bit widths.

## License
The library is published under the [Apache License Version 2.0](LICENSE).

## Contributors
* [Christopher Osthues](https://github.com/ChristopherOsthues) (author)