# gptzip
#### Losslessly encode text natively with arithmetic coding and HuggingFace Transformers

Did you know that every time you download a language model to your computer, you're downloading powerful compression technology as well?

`gptzip` is a Python library that uses pre-trained language models as string compressors. It's compatible out of the box with language models from HuggingFace Transformers and uses arithmetic coding (which is theoretically optimal) to compress strings based on language-model probability distributions.

This all works because of [Shannon's source coding theorem](https://en.wikipedia.org/wiki/Shannon%27s_source_coding_theorem) which connects probability distributions and compression. Since language models like GPT-3 give us probabilities over strings, we can literally use them as compressors. gptzip makes this trivial.
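To make the connection concrete, here is a minimal sketch (plain `transformers`, not gptzip's API) of the quantity arithmetic coding targets: the ideal code length of a string is −log₂ P(string) under the model, summed over its tokens.

```python
import math

import torch
import transformers

# Minimal sketch: compute the ideal code length -log2 P(string) that a
# language model assigns to a string. Arithmetic coding gets within a few
# bits of this bound.
tokenizer = transformers.AutoTokenizer.from_pretrained("gpt2")
lm = transformers.AutoModelForCausalLM.from_pretrained("gpt2")

text = "Sailing on the seven seas"
ids = tokenizer(text, return_tensors="pt").input_ids

with torch.no_grad():
    logits = lm(ids).logits  # shape: (1, num_tokens, vocab_size)

# Log-probability of each token given the tokens before it (the first token
# has no preceding context in this toy setup, so it is skipped).
log_probs = torch.log_softmax(logits[:, :-1], dim=-1)
token_log_probs = log_probs.gather(-1, ids[:, 1:, None]).squeeze(-1)

ideal_bits = -token_log_probs.sum().item() / math.log(2)
print(f"Ideal code length: {ideal_bits:.1f} bits (~{math.ceil(ideal_bits / 8)} bytes)")
```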

### Installation
```pip install gptzip```

### Encoding

You can use gptzip to check the number of bytes a language model requires to encode a string (to compare against e.g. gzip or the original byte count):

```python
import transformers
import gptzip

model = "gpt2"
lm = transformers.AutoModelForCausalLM.from_pretrained(model)
tokenizer = transformers.AutoTokenizer.from_pretrained(model)

string = "Sailing on the seven seas"
coder = gptzip.ArithmeticCoder(lm=lm, tokenizer=tokenizer)
code, num_padded_bits = coder.encode(
    string,
    return_num_padded_bits=True,
)
assert len(code) == 5  # GPT-2 compresses this 25-byte string down to 5 bytes
```
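For a rough sense of scale, you can compare that against gzip and the raw byte count (a quick sketch reusing `string` and `code` from the example above; note that gzip's fixed header overhead dominates on strings this short):

```python
import gzip

raw = string.encode("utf-8")
print(f"raw:    {len(raw)} bytes")                 # 25 bytes
print(f"gzip:   {len(gzip.compress(raw))} bytes")  # header overhead dominates here
print(f"gptzip: {len(code)} bytes")                # 5 bytes, per the assert above
```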

### Lossless encoding-and-decoding

Perhaps even more useful is to use gptzip as a true lossless compressor: encode a string to bytes, then decode those bytes back into exactly the original string. In this case, `code` is the compressed payload and `num_padded_bits` records how much padding was added, so the decoder can recover the original exactly.
```python
import transformers
import gptzip

def to_binary(data: bytes) -> str:
    # Small display helper: render the compressed bytes as a bit string.
    return " ".join(format(byte, "08b") for byte in data)

model = "gpt2"
lm = transformers.AutoModelForCausalLM.from_pretrained(model)
tokenizer = transformers.AutoTokenizer.from_pretrained(model)

string = "How much wood would a woodchuck chuck?"
coder = gptzip.ArithmeticCoder(lm=lm, tokenizer=tokenizer)
code, num_padded_bits = coder.encode(
    string,
    return_num_padded_bits=True,
)
print(f"Code: {to_binary(code)} ({len(code)} bytes)")

decoded_string = coder.decode(code, num_padded_bits=num_padded_bits)
assert decoded_string == string  # round-trips losslessly
```
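To actually use this as a file compressor, both `code` and `num_padded_bits` need to be stored. A minimal sketch, where the single-byte header holding the padding count is just my own convention (gptzip does not define a file format):

```python
from pathlib import Path

# Write a one-byte header with the padding count, then the compressed payload.
# (Assumes num_padded_bits < 256, which holds if padding is to a byte boundary.)
Path("message.gpz").write_bytes(bytes([num_padded_bits]) + code)

# Read it back and decode with the same model/tokenizer pair.
blob = Path("message.gpz").read_bytes()
restored = coder.decode(blob[1:], num_padded_bits=blob[0])
assert restored == string
```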

### Roadmap

Some features that would be nice to add:

- [ ] Other entropy-coding techniques such as Huffman coding
- [ ] Benchmark against other compressors (e.g. gzip) and add the numbers to the README
- [ ] Support for other language-modeling backends such as vLLM
- [ ] Compress multiple strings in a batch

### Citation

Thanks to DeepMind for helping me implement arithmetic coding in Python. I learned a lot from their [implementation](https://github.com/google-deepmind/language_modeling_is_compression) and their paper, [Language Modeling Is Compression](https://deepmind.google/research/publications/39768/).

I am also indebted to Mark Nelson for his incredible blog post [Data Compression With Arithmetic Coding](https://marknelson.us/posts/2014/10/19/data-compression-with-arithmetic-coding.html). It was invaluable while I was learning about this topic, especially the lossless implementation of arithmetic coding using binary fractions. It's one of the best blog posts I have ever read.