https://github.com/vahidzee/zcode
My own Unicode compression algorithm
arithmetic-coding compression huffman-coding lzw-compression unicode
- Host: GitHub
- URL: https://github.com/vahidzee/zcode
- Owner: vahidzee
- License: mit
- Created: 2021-10-16T18:20:27.000Z (over 3 years ago)
- Default Branch: main
- Last Pushed: 2021-10-16T22:32:57.000Z (over 3 years ago)
- Last Synced: 2025-05-03T02:28:24.839Z (about 2 months ago)
- Topics: arithmetic-coding, compression, huffman-coding, lzw-compression, unicode
- Language: Python
- Homepage:
- Size: 7.81 KB
- Stars: 2
- Watchers: 1
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
README
# Zee Code
ZCode is a custom compression algorithm I originally developed for a competition held for the Spring 2019 Data Structures
and Algorithms course of [Dr. Mahdi Safarnejad-Boroujeni](https://scholar.google.com/citations?user=TNfL9SIAAAAJ&hl=en) at [Sharif University of Technology](http://ce.sharif.edu/), in which I placed first.
The code is fairly slow and leaves plenty of room for optimization, but it is quite readable and can serve as an
excellent educational resource for anyone starting out with compression algorithms.

The algorithm is a cocktail of classical compression algorithms mixed and served for Unicode documents. It hinges on
the [LZW algorithm](https://en.wikipedia.org/wiki/Lempel%E2%80%93Ziv%E2%80%93Welch) to create a finite-size symbol dictionary; the results are then byte-coded into variable-length custom
symbols, which I call `zee` codes! Finally, the symbol table is truncated accordingly, and the compressed document is
encoded into a byte stream.
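
To make the dictionary-building step concrete, here is a minimal LZW sketch in Python. It illustrates the general
technique rather than zcode's actual implementation; the per-input character seeding, the dictionary cap, and the
sample text are assumptions made for the example:

```python
def lzw_compress(text, max_dict_size=1 << 16):
    """Toy LZW: turn the input into a list of dictionary indices.

    The dictionary is seeded with the distinct characters of the input and
    grows one phrase at a time until it reaches max_dict_size, standing in
    for a finite-size symbol dictionary.
    """
    dictionary = {ch: i for i, ch in enumerate(dict.fromkeys(text))}
    output, phrase = [], ""
    for ch in text:
        candidate = phrase + ch
        if candidate in dictionary:
            phrase = candidate                    # keep extending the phrase
        else:
            output.append(dictionary[phrase])     # emit the longest known phrase
            if len(dictionary) < max_dict_size:
                dictionary[candidate] = len(dictionary)
            phrase = ch
    if phrase:
        output.append(dictionary[phrase])
    return dictionary, output


if __name__ == "__main__":
    sample = "متن تکراری برای آزمایش فشرده‌سازی. " * 20
    table, codes = lzw_compress(sample)
    print(f"{len(sample)} characters -> {len(codes)} symbols "
          f"({len(table)} dictionary entries)")
```

In zcode the resulting indices are not written out directly; as described above, they are first packed into
variable-length byte codes.
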
`zee` codes are heavily inspired by [Huffman trees](https://en.wikipedia.org/wiki/Huffman_coding), but in ordinary text, symbols are usually distributed far more
uniformly than the geometric (or exponential) distribution that effective Huffman coding assumes, so the gains of
variable-length byte codes outweighed those of bit-level Huffman encoding, both from an implementation and a
performance perspective.
Results may vary, but my tests showed a steady ~4-5x compression ratio on Farsi texts, which is pretty nice!
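
To illustrate what a variable-length byte code can look like, here is one common scheme (a 7-bit payload plus a
continuation bit per byte). It is only a stand-in for the real `zee` code format; the function names and the
code-size handling are assumptions made for this sketch:

```python
def byte_encode(index, max_code_bytes=2):
    """Pack a symbol index into at most max_code_bytes bytes.

    Small indices (e.g. frequent symbols, if they are rank-ordered) fit in a
    single byte; larger ones spill into further bytes, each carrying 7 value
    bits and a continuation flag in the top bit.
    """
    out = bytearray()
    while True:
        byte = index & 0x7F
        index >>= 7
        if index:
            if len(out) + 1 >= max_code_bytes:
                raise ValueError("symbol index too large for this code size")
            out.append(byte | 0x80)   # continuation bit: another byte follows
        else:
            out.append(byte)
            return bytes(out)


def byte_decode(stream):
    """Inverse of byte_encode over a stream of concatenated codes."""
    indices, value, shift = [], 0, 0
    for b in stream:
        value |= (b & 0x7F) << shift
        if b & 0x80:
            shift += 7                # more value bits follow in the next byte
        else:
            indices.append(value)
            value, shift = 0, 0
    return indices


if __name__ == "__main__":
    codes = [3, 120, 1500, 16000]
    encoded = b"".join(byte_encode(c) for c in codes)
    assert byte_decode(encoded) == codes
    print(f"{len(codes)} symbols -> {len(encoded)} bytes")
```
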
## Installation
ZCode is available on pip and only requires Python 3.6 or higher.
```shell
pip install -U zcode
```

## Usage
You can run the algorithm on any `utf-8` encoded file using the `zcode` command. It will automatically decompress files
ending with the `.zee` extension and compress others into `.zee` files, but you can always override the default behavior
by providing optional arguments:

```shell
zcode INPUTFILE [--output OUTPUT_FILE --action compress/decompress --symbol-size SYMBOL_SIZE --code-size CODE_SIZE]
```

The `symbol-size` argument controls the algorithm's buffer size for processing symbols (in bytes). It is set
automatically based on your input file size, but you can change it as you wish. `code-size` controls the maximum
length of the coded bytes used when encoding symbols (it defaults to 2 and must be provided to the algorithm upon
decompression).
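
If you prefer to drive the command from a script rather than the shell, something along these lines should work;
the file names are placeholders and only the flags documented above are used (this is a sketch, not part of zcode
itself):

```python
import subprocess

# Compress a UTF-8 text file into an explicitly named .zee archive
# (hypothetical file names; flags as documented above).
subprocess.run(
    ["zcode", "book.txt", "--output", "book.zee",
     "--action", "compress", "--code-size", "3"],
    check=True,
)

# Decompress it again; as noted above, the same code size must be supplied.
subprocess.run(
    ["zcode", "book.zee", "--output", "book.txt.out",
     "--action", "decompress", "--code-size", "3"],
    check=True,
)
```
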
## LICENSE
MIT LICENSE, see [vahidzee/zcode/LICENSE](https://github.com/vahidzee/zcode/blob/main/LICENSE)