https://github.com/roman-koshchei/compact-huffman-coding
Huffman coding based not only on single-character frequency but also on two-character combinations. Oriented toward producing more compact results rather than speed.
- Host: GitHub
- URL: https://github.com/roman-koshchei/compact-huffman-coding
- Owner: roman-koshchei
- License: mit
- Created: 2024-10-17T06:47:27.000Z (8 months ago)
- Default Branch: main
- Last Pushed: 2024-10-17T18:46:12.000Z (8 months ago)
- Last Synced: 2024-10-19T11:28:07.856Z (8 months ago)
- Topics: encoding, huffman, huffman-coding, text-analysis
- Language: C#
- Homepage:
- Size: 6.26 MB
- Stars: 0
- Watchers: 1
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
# Compact Huffman Coding
Huffman coding based not only on single-character frequencies but also on two-character combinations.
It is oriented toward producing more compact results rather than encoding/decoding speed, which can be beneficial when sending data over a network.
The idea comes from keyboard layout development, which counts not only single-character usage but also consecutive character pairs to optimize typing speed.
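To make the idea concrete, here is a minimal sketch of building a Huffman code over a mixed alphabet whose symbols are either single characters or two-character pairs. This is illustrative code, not the repository's implementation; the `Node` record and `BuildCodes` helper are my assumptions.

```csharp
using System.Collections.Generic;

// A node in the Huffman tree; leaves hold a one- or two-character symbol.
record Node(string? Symbol, long Frequency, Node? Left = null, Node? Right = null);

static class Huffman
{
    // Builds a Huffman tree over symbols that may be single characters or
    // two-character combinations, and returns the resulting code table.
    public static Dictionary<string, string> BuildCodes(Dictionary<string, long> frequencies)
    {
        var queue = new PriorityQueue<Node, long>();
        foreach (var (symbol, freq) in frequencies)
            queue.Enqueue(new Node(symbol, freq), freq);

        // Repeatedly merge the two least frequent nodes into a parent.
        while (queue.Count > 1)
        {
            var left = queue.Dequeue();
            var right = queue.Dequeue();
            var parent = new Node(null, left.Frequency + right.Frequency, left, right);
            queue.Enqueue(parent, parent.Frequency);
        }

        var codes = new Dictionary<string, string>();
        Collect(queue.Dequeue(), "", codes);
        return codes;
    }

    static void Collect(Node node, string prefix, Dictionary<string, string> codes)
    {
        if (node.Symbol is not null)
        {
            // Guard against the degenerate single-symbol tree.
            codes[node.Symbol] = prefix.Length > 0 ? prefix : "0";
            return;
        }
        Collect(node.Left!, prefix + "0", codes);
        Collect(node.Right!, prefix + "1", codes);
    }
}
```

Encoding then requires segmenting the input into singles and pairs. One simple strategy is to greedily take a pair whenever its code is shorter than the two single-character codes combined; that segmentation choice is an assumption on my part, not necessarily what this repository does.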
## Datasets
First of all, I need to analyze plain English text to figure out the frequencies of single characters and consecutive character pairs. I use freely available datasets:
- [Yelp reviews](https://www.yelp.com/dataset) - informal language usage
- [Plain Text Wikipedia](https://www.kaggle.com/datasets/ltcmdrdata/plain-text-wikipedia-202011) - articles
- [Books on General Works](https://www.gutenberg.org/ebooks/results) - literature

The datasets aren't included in the repository because of licensing and other legal concerns. I may create a script to download all of them.
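The analysis step could look roughly like the sketch below (hypothetical code, not taken from the repository), counting single characters and consecutive character pairs in a corpus:

```csharp
using System.Collections.Generic;

static class FrequencyAnalysis
{
    // Counts single characters and consecutive two-character pairs in a text.
    public static (Dictionary<char, long> Singles, Dictionary<string, long> Pairs) Count(string text)
    {
        var singles = new Dictionary<char, long>();
        var pairs = new Dictionary<string, long>();

        for (int i = 0; i < text.Length; i++)
        {
            singles[text[i]] = singles.GetValueOrDefault(text[i]) + 1;

            // Count the pair starting at this position, if one exists.
            if (i + 1 < text.Length)
            {
                var pair = text.Substring(i, 2);
                pairs[pair] = pairs.GetValueOrDefault(pair) + 1;
            }
        }
        return (singles, pairs);
    }
}

// Example usage over a downloaded corpus file (the path is hypothetical):
// var (singles, pairs) = FrequencyAnalysis.Count(File.ReadAllText("datasets/yelp.txt"));
```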
## Results
My encoding produces a smaller result in about two thirds of the comparisons (13,300 smaller vs. 6,693 larger).
Unfortunately, the average difference is only about 4.2 units (which would be bits in a real-use scenario).
Currently, I don't think it's a good tradeoff, because the calculations for my encoding are more expensive.
```bash
Using frequencies from: ../datasets/results/index.json
Compact is smaller on average :)
Compact is smaller 13300 times
Compact is bigger 6693 times
Difference: 6607 times
Difference in size: 84816
Difference in size on average: 4.242284799679888
```
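For context, the size comparison behind these numbers can be thought of as measuring each sample's encoded bit length under both code tables. The sketch below is one reading of that idea, with greedy pair-first segmentation as an assumption; the method names are hypothetical:

```csharp
using System.Collections.Generic;
using System.Linq;

static class Comparison
{
    // Bit length of a text under a classic single-character Huffman table.
    public static int ClassicLength(string text, Dictionary<char, string> codes)
        => text.Sum(c => codes[c].Length);

    // Bit length under the compact table after segmenting the text into
    // pairs and single characters (greedy pair-first segmentation).
    public static int CompactLength(string text, Dictionary<string, string> codes)
    {
        int bits = 0;
        for (int i = 0; i < text.Length; )
        {
            // Take the pair if it has a code cheaper than its two singles.
            if (i + 1 < text.Length
                && codes.TryGetValue(text.Substring(i, 2), out var pairCode)
                && pairCode.Length < codes[text[i].ToString()].Length
                                   + codes[text[i + 1].ToString()].Length)
            {
                bits += pairCode.Length;
                i += 2;
            }
            else
            {
                bits += codes[text[i].ToString()].Length;
                i += 1;
            }
        }
        return bits;
    }
}
```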