https://github.com/zpoint/model-compressor
Compress JSON-like ORM models to a binary format before caching them to a backend such as Redis
- Host: GitHub
- URL: https://github.com/zpoint/model-compressor
- Owner: zpoint
- License: mit
- Created: 2020-06-03T14:57:15.000Z (over 5 years ago)
- Default Branch: develop
- Last Pushed: 2020-08-04T10:09:34.000Z (over 5 years ago)
- Last Synced: 2025-01-20T15:53:58.867Z (about 1 year ago)
- Language: Python
- Homepage:
- Size: 123 KB
- Stars: 0
- Watchers: 2
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
README
# model-compressor
## Problem
For a relational database, we may build a big table and need to cache as much of its data as possible.
We can directly stringify our model and store it in a cache backend such as Redis:
```json
{ "firstName":"Bill" , "lastName":"Gates", "house": "111", "married": true, "has_child": false, "id": "063dc500-cbb4-4512-acdd-240596567e65"}
```
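With the redis-py client, this naive version is just a `json.dumps` plus a `SET`. A minimal sketch follows; the key name and connection settings are placeholders, not anything defined by this project:
```python3
# Naive caching: stringify the model and SET it into Redis.
# Assumes the redis-py package; key name and connection are placeholders.
import json
import redis

record = {"firstName": "Bill", "lastName": "Gates", "house": "111",
          "married": True, "has_child": False,
          "id": "063dc500-cbb4-4512-acdd-240596567e65"}

r = redis.Redis(host="localhost", port=6379)
r.set(f"user:{record['id']}", json.dumps(record))
```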
Or we can rely on a language built-in serializer, such as `pickle` in Python:
```python3
>>> import pickle
>>> pickle.dumps(a)  # a is the dict from the JSON example above
'(dp0\nVfirstName\np1\nVBill\np2\nsVhas_child\np3\nI00\nsVlastName\np4\nVGates\np5\nsVmarried\np6\nI01\nsVhouse\np7\nV111\np8\nsVid\np9\nV063dc500-cbb4-4512-acdd-240596567e65\np10\ns.'
```
But what if we have a few hundred different fields, and millions of hot records that need to be cached?
We would need a few hundred GB of memory to store this hot data for each copy (for example, 50 million records at roughly 4 KB each is already ~200 GB), and under a modern HA setup more than one copy has to be kept on different servers.
If we also back up our data to a different region, our bandwidth may suffer as well.
## Solution
What if we compress our data into a binary format before storing it in the cache backend?
Since our records live in a database, the columns are fixed, and about half of the columns are of type `boolean`, `datetime`, or `uuid`, so we can do the following (a small packing sketch for items 1-4 follows the list):
1. drop all the key (field-name) strings and store only the values, in a fixed order
2. a `boolean` can be represented as 0 or 1, and 8 `boolean` fields can be packed together into 1 byte
3. a `datetime` can be represented as a timestamp, a 4-byte integer, instead of the 19-character string `2020-06-01 12:11:00`
4. a `uuid` can be represented as two 8-byte integers instead of the 36-character string `"063dc500-cbb4-4512-acdd-240596567e65"`
5. for string/unicode fields
   * for non-English text, scan all the words in the database, take the high-frequency words to build a word-to-binary map, and design a mixed binary/unicode format to represent these values (we can't lose any data)
   * we could try deflate/LZW or other popular compression algorithms, but they may not work well on a single short record field
   * for English text, a Huffman tree will be enough
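As an illustration of items 1-4, here is a minimal sketch in Python. The field names, the fixed field order, the extra `created_at` field, and the `pack_record` helper are assumptions made for this example, not the project's actual API:
```python3
# Sketch of the fixed-layout packing idea (items 1-4); illustrative only.
import struct
import uuid
from datetime import datetime, timezone

def pack_record(record: dict) -> bytes:
    # item 2: pack the boolean fields into one byte (bit 0 = married, bit 1 = has_child)
    flags = (record["married"] << 0) | (record["has_child"] << 1)
    # item 3: datetime -> 4-byte unsigned timestamp
    ts = int(record["created_at"].timestamp())
    # item 4: uuid -> two 8-byte unsigned integers
    uid = uuid.UUID(record["id"]).int
    hi, lo = uid >> 64, uid & ((1 << 64) - 1)
    # item 1: the keys are dropped, only the values are packed, in a fixed order
    return struct.pack(">BIQQ", flags, ts, hi, lo)

record = {
    "married": True,
    "has_child": False,
    "created_at": datetime(2020, 6, 1, 12, 11, tzinfo=timezone.utc),
    "id": "063dc500-cbb4-4512-acdd-240596567e65",
}
print(len(pack_record(record)))  # 21 bytes, versus roughly 110 bytes of JSON text for the same fields
```
Unpacking simply reverses the same `struct` format, since the order and width of every field are fixed.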
For a real record, manual compression along these lines produced a result 25 times smaller than the original stringified record.
## RoadMap
* [ ] a framework to support multiple ORM models and multiple cache backends in Python
* [ ] a pattern file design; it can be generated from ORM models automatically or written manually (see the hypothetical sketch after this list)
* [ ] a core that compresses models according to the pattern file and the source data, written in C/C++ for performance
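As a purely hypothetical illustration of what such a pattern file could describe, the sketch below lists field order and wire types for one model. The names and type labels are invented for this example and do not reflect the project's eventual format:
```python3
# Hypothetical pattern for one model: field order plus how each field is encoded.
USER_PATTERN = [
    ("married",    "bool"),       # packed into a shared flag byte
    ("has_child",  "bool"),
    ("created_at", "timestamp"),  # 4-byte unsigned integer
    ("id",         "uuid"),       # two 8-byte integers
    ("firstName",  "text"),       # dictionary/Huffman coded
]
```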

## Usage
To be continued...