https://github.com/vahidzee/zcode
My own Unicode compression algorithm
arithmetic-coding compression huffman-coding lzw-compression unicode
- Host: GitHub
- URL: https://github.com/vahidzee/zcode
- Owner: vahidzee
- License: mit
- Created: 2021-10-16T18:20:27.000Z (over 3 years ago)
- Default Branch: main
- Last Pushed: 2021-10-16T22:32:57.000Z (over 3 years ago)
- Last Synced: 2025-05-03T02:28:24.839Z (about 2 months ago)
- Topics: arithmetic-coding, compression, huffman-coding, lzw-compression, unicode
- Language: Python
- Homepage:
- Size: 7.81 KB
- Stars: 2
- Watchers: 1
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
README
# Zee Code
ZCode is a custom compression algorithm I originally developed for a competition held for the Spring 2019 Data Structures
and Algorithms course of [Dr. Mahdi Safarnejad-Boroujeni](https://scholar.google.com/citations?user=TNfL9SIAAAAJ&hl=en) at [Sharif University of Technology](http://ce.sharif.edu/), in which I placed first.
The code is fairly slow and leaves plenty of room for optimization, but it is quite readable and can serve as an
excellent educational resource for anyone starting out with compression algorithms.

The algorithm is a cocktail of classical compression algorithms mixed and served for Unicode documents. It hinges on
the [LZW algorithm](https://en.wikipedia.org/wiki/Lempel%E2%80%93Ziv%E2%80%93Welch) to create a finite-size symbol dictionary; the results are then byte-coded into variable-length custom
symbols, which I call `zee` codes! Finally, the symbol table is truncated accordingly, and the compressed document is
encoded into a byte stream.
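
To make the dictionary-building step concrete, here is a minimal LZW sketch in Python. It illustrates the general
technique rather than zcode's actual implementation; the per-input character seeding, the dictionary cap, and the
sample text are assumptions made for the example:

```python
def lzw_compress(text, max_dict_size=1 << 16):
    """Toy LZW: turn the input into a list of dictionary indices.

    The dictionary is seeded with the distinct characters of the input and
    grows one phrase at a time until it reaches max_dict_size, standing in
    for a finite-size symbol dictionary.
    """
    dictionary = {ch: i for i, ch in enumerate(dict.fromkeys(text))}
    output, phrase = [], ""
    for ch in text:
        candidate = phrase + ch
        if candidate in dictionary:
            phrase = candidate                    # keep extending the phrase
        else:
            output.append(dictionary[phrase])     # emit the longest known phrase
            if len(dictionary) < max_dict_size:
                dictionary[candidate] = len(dictionary)
            phrase = ch
    if phrase:
        output.append(dictionary[phrase])
    return dictionary, output


if __name__ == "__main__":
    sample = "متن تکراری برای آزمایش فشرده‌سازی. " * 20
    table, codes = lzw_compress(sample)
    print(f"{len(sample)} characters -> {len(codes)} symbols "
          f"({len(table)} dictionary entries)")
```

In zcode the resulting indices are not written out directly; as described above, they are first packed into
variable-length byte codes.
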
`zee` codes are heavily inspired by [Huffman trees](https://en.wikipedia.org/wiki/Huffman_coding), but in ordinary text, symbols are usually distributed far more
uniformly than the geometric (or exponential) distribution that effective Huffman coding assumes, so the gains of
variable-length byte codes outweighed those of bit-level Huffman encoding, both from an implementation and a
performance perspective.
Results may vary, but my tests showed a steady ~4-5x compression ratio on Farsi texts, which is pretty nice!
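
To illustrate what a variable-length byte code can look like, here is one common scheme (a 7-bit payload plus a
continuation bit per byte). It is only a stand-in for the real `zee` code format; the function names and the
code-size handling are assumptions made for this sketch:

```python
def byte_encode(index, max_code_bytes=2):
    """Pack a symbol index into at most max_code_bytes bytes.

    Small indices (e.g. frequent symbols, if they are rank-ordered) fit in a
    single byte; larger ones spill into further bytes, each carrying 7 value
    bits and a continuation flag in the top bit.
    """
    out = bytearray()
    while True:
        byte = index & 0x7F
        index >>= 7
        if index:
            if len(out) + 1 >= max_code_bytes:
                raise ValueError("symbol index too large for this code size")
            out.append(byte | 0x80)   # continuation bit: another byte follows
        else:
            out.append(byte)
            return bytes(out)


def byte_decode(stream):
    """Inverse of byte_encode over a stream of concatenated codes."""
    indices, value, shift = [], 0, 0
    for b in stream:
        value |= (b & 0x7F) << shift
        if b & 0x80:
            shift += 7                # more value bits follow in the next byte
        else:
            indices.append(value)
            value, shift = 0, 0
    return indices


if __name__ == "__main__":
    codes = [3, 120, 1500, 16000]
    encoded = b"".join(byte_encode(c) for c in codes)
    assert byte_decode(encoded) == codes
    print(f"{len(codes)} symbols -> {len(encoded)} bytes")
```
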
## Installation
ZCode is available on pip and only requires Python 3.6 or higher.
```shell
pip install -U zcode
```

## Usage
You can run the algorithm on any `utf-8` encoded file using the `zcode` command. It will automatically decompress files
ending with the `.zee` extension and compress others into `.zee` files, but you can always override the default behavior
by providing optional arguments:

```shell
zcode INPUTFILE [--output OUTPUT_FILE --action compress/decompress --symbol-size SYMBOL_SIZE --code-size CODE_SIZE]
```

The `symbol-size` argument controls the algorithm's buffer size for processing symbols (in bytes). It is set
automatically based on your input file size, but you can change it as you wish. `code-size` controls the maximum
length of the coded bytes used when encoding symbols (it defaults to 2 and must be provided to the algorithm upon
decompression).
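
If you prefer to drive the command from a script rather than the shell, something along these lines should work;
the file names are placeholders and only the flags documented above are used (this is a sketch, not part of zcode
itself):

```python
import subprocess

# Compress a UTF-8 text file into an explicitly named .zee archive
# (hypothetical file names; flags as documented above).
subprocess.run(
    ["zcode", "book.txt", "--output", "book.zee",
     "--action", "compress", "--code-size", "3"],
    check=True,
)

# Decompress it again; as noted above, the same code size must be supplied.
subprocess.run(
    ["zcode", "book.zee", "--output", "book.txt.out",
     "--action", "decompress", "--code-size", "3"],
    check=True,
)
```
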
## LICENSE
MIT LICENSE, see [vahidzee/zcode/LICENSE](https://github.com/vahidzee/zcode/blob/main/LICENSE)