https://github.com/kootenpv/shrynk

Using Machine Learning to learn how to Compress :zap:

Host: GitHub
URL: https://github.com/kootenpv/shrynk
Owner: kootenpv
Created: 2019-08-31T07:47:17.000Z (about 5 years ago)
Default Branch: master
Last Pushed: 2023-05-01T20:36:19.000Z (over 1 year ago)
Last Synced: 2024-07-28T17:38:48.871Z (about 2 months ago)
Language: Python
Size: 3.07 MB
Stars: 109
Watchers: 4
Forks: 5
Open Issues: 1
Metadata Files:
- Readme: README.md

README

[![Build Status](https://travis-ci.org/kootenpv/shrynk.svg?branch=master)](https://travis-ci.org/kootenpv/shrynk)

[![PyPI](https://img.shields.io/pypi/pyversions/shrynk.svg?style=flat-square&logo=python)](https://pypi.python.org/pypi/shrynk/)

[![PyPI](https://img.shields.io/pypi/v/shrynk.svg?style=flat-square&logo=pypi)](https://pypi.python.org/pypi/shrynk/)

[![HitCount](http://hits.dwyl.io/kootenpv/shrynk.svg)](http://hits.dwyl.io/kootenpv/shrynk)

You can read the [introductory blog post](https://vks.ai/2019-12-05-shrynk-using-machine-learning-to-learn-how-to-compress) or try it live at https://shrynk.ai

### Features

- ✓ Compress your data smartly based on **Machine Learning**

- ✓ Takes **User Requirements** in the form of weights for `size`, `write_time` and `read_time`

- ✓ Trains & caches a model based on **compression methods available** in the system, using packaged data

- ✓ **CLI** for compressing and decompressing

- ✓ Works with `CSV`, `JSON` and `Bytes` in general
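
The three quantities that the weights refer to (`size`, `write_time` and `read_time`) can be measured for any compression method. The following is a minimal sketch of such a measurement using only the Python standard library. It is not shrynk's own benchmarking code: the `CODECS` table and the `benchmark` function are hypothetical names chosen for illustration, and the set of methods shrynk actually considers depends on what is available on the system.

```python
import bz2
import gzip
import lzma
import time

# Hypothetical candidate set for illustration (not shrynk's real list).
CODECS = {
    "gzip": (gzip.compress, gzip.decompress),
    "bz2": (bz2.compress, bz2.decompress),
    "xz": (lzma.compress, lzma.decompress),
}


def benchmark(data: bytes) -> dict:
    """Measure size, write_time and read_time for each codec on `data`."""
    results = {}
    for name, (compress, decompress) in CODECS.items():
        # Time the compression step ("write").
        start = time.perf_counter()
        packed = compress(data)
        write_time = time.perf_counter() - start

        # Time the decompression step ("read").
        start = time.perf_counter()
        restored = decompress(packed)
        read_time = time.perf_counter() - start

        if restored != data:
            raise ValueError(f"{name} did not round-trip the input")
        results[name] = {
            "size": len(packed),
            "write_time": write_time,
            "read_time": read_time,
        }
    return results


sample = b"a,b,c\n1,2,3\n" * 1000
for name, metrics in benchmark(sample).items():
    print(name, metrics)
```

Timings vary between runs and machines, so only the compressed sizes are reproducible.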

### CLI

    shrynk compress myfile.json                   # will yield e.g. myfile.json.gz or myfile.json.bz2
    shrynk decompress myfile.json.gz              # will yield myfile.json
    shrynk compress myfile.csv --size 0 --write 1 --read 0
    shrynk benchmark myfile.csv                   # shows benchmark results
    shrynk benchmark --predict myfile.csv         # will also show the current prediction
    shrynk benchmark --save --predict myfile.csv  # will add the result to the training data too
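
Because the chosen method ends up in the file suffix (`myfile.json.gz`, `myfile.json.bz2`), decompression can be driven by the suffix alone. The sketch below illustrates that idea with standard-library openers. It is an assumption about how such a step could work, not shrynk's implementation: the `OPENERS` mapping and this `decompress` function are hypothetical.

```python
import bz2
import gzip
import lzma
import tempfile
from pathlib import Path

# Hypothetical mapping from file suffix to a standard-library opener.
OPENERS = {".gz": gzip.open, ".bz2": bz2.open, ".xz": lzma.open}


def decompress(path: str) -> Path:
    """Write the decompressed content next to `path`, minus its last suffix."""
    source = Path(path)
    opener = OPENERS.get(source.suffix)
    if opener is None:
        raise ValueError(f"unsupported suffix: {source.suffix!r}")
    target = source.with_suffix("")  # myfile.json.gz -> myfile.json
    with opener(source, "rb") as compressed, open(target, "wb") as plain:
        plain.write(compressed.read())
    return target


# Small demo in a throwaway directory.
with tempfile.TemporaryDirectory() as tmp:
    packed = Path(tmp) / "myfile.json.gz"
    with gzip.open(packed, "wb") as handle:
        handle.write(b'{"a": 1}')
    print(decompress(str(packed)).name)  # myfile.json
```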

### Usage in Docker

To try shrynk quickly, you can use the official Docker image from DockerHub. This avoids interfering with an existing Python installation.

You can also build the image yourself: go to [the docker folder here](./docker/), run `docker build -t shrynk .`, and use `shrynk` instead of `kootenpv/shrynk` in the commands below.

In the following commands, replace `~/Downloads` with the folder you want to share with the container (the one containing the file you want to compress).

```bash
# To see help
docker run --rm -v ~/.shrynk:/root/.shrynk -v ~/Downloads:/data kootenpv/shrynk shrynk --help

# To compress a file called train.csv in your ~/Downloads folder
docker run --rm -v ~/.shrynk:/root/.shrynk -v ~/Downloads:/data kootenpv/shrynk \
   shrynk compress /data/train.csv

# To benchmark and predict the train.csv file in your ~/Downloads folder
docker run --rm -v ~/.shrynk:/root/.shrynk -v ~/Downloads:/data kootenpv/shrynk \
   shrynk benchmark --predict /data/train.csv
```

### Usage in Python

Installation:

    pip install shrynk

Then in Python:

```python
import pandas as pd
from shrynk import save, load

# save dataframe compressed
my_df = pd.DataFrame({"a": [1]})
file_path = save(my_df, "mypath.csv")
# e.g. mypath.csv.bz2

# load compressed file
loaded_df = load(file_path)
```

If you just want the prediction, you can also:

```python
import pandas as pd
from shrynk import infer

infer(pd.DataFrame({"a": [1]}))
# {"engine": "csv", "compression": "bz2"}
```

### Add your own data

If you want more control you can do the following:

```python
import pandas as pd
from shrynk import PandasCompressor

df = pd.DataFrame({"a": [1, 2, 3]})

pdc = PandasCompressor("default")
pdc.run_benchmarks(df)  # adds data to the default
pdc.train_model(size=3, write=1, read=1)
pdc.predict(df)
```
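
To give an intuition for what weights such as `size=3, write=1, read=1` express, here is a rough sketch of one way benchmark results could be combined into a single ranking: each metric is standardized across the candidates, multiplied by its weight, and summed, with the lowest total winning. This is an illustrative assumption, not necessarily how shrynk labels its training data. The `pick_best` function is hypothetical and the numbers in `example` are made up.

```python
from statistics import mean, pstdev


def pick_best(results: dict, size: float, write: float, read: float) -> str:
    """Return the candidate with the lowest weighted, standardized score."""
    weights = {"size": size, "write_time": write, "read_time": read}
    scores = {name: 0.0 for name in results}
    for metric, weight in weights.items():
        values = [results[name][metric] for name in results]
        mu = mean(values)
        sigma = pstdev(values) or 1.0  # avoid dividing by zero when all values are equal
        for name in results:
            # Lower is better for all three metrics, so lower z-scores win.
            scores[name] += weight * (results[name][metric] - mu) / sigma
    return min(scores, key=scores.get)


# Made-up benchmark numbers, for illustration only.
example = {
    "csv+gzip": {"size": 120, "write_time": 0.02, "read_time": 0.01},
    "csv+bz2": {"size": 100, "write_time": 0.08, "read_time": 0.04},
    "csv+xz": {"size": 90, "write_time": 0.30, "read_time": 0.02},
}

print(pick_best(example, size=3, write=1, read=1))  # csv+xz
print(pick_best(example, size=0, write=1, read=0))  # csv+gzip
```

Standardizing first keeps a metric measured in bytes from drowning out one measured in fractions of a second, so the weights express relative importance rather than units.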