https://github.com/sloganking/rs-text-compression
- Host: GitHub
- URL: https://github.com/sloganking/rs-text-compression
- Owner: sloganking
- License: mit
- Created: 2021-12-13T21:49:25.000Z (about 3 years ago)
- Default Branch: master
- Last Pushed: 2022-04-09T23:35:29.000Z (over 2 years ago)
- Last Synced: 2023-03-04T05:22:46.276Z (almost 2 years ago)
- Language: Rust
- Size: 63.5 KB
- Stars: 0
- Watchers: 1
- Forks: 0
- Open Issues: 1
Metadata Files:
- Readme: README.md
- License: LICENSE
# rs-text-compression
## Summary
A Rust-based English text compression algorithm inspired by a [Python algorithm](https://github.com/sloganking/text-compression). This Rust implementation's compression (single core) is around 6000x faster than its Python counterpart, and reduces file size by around 60% instead of 40%.
This repository's text compression algorithm is optimized for the English language. The key insight behind the algorithm is that there are around 400,000 English words and the average English word is 5.1 characters long (not including the spaces next to it). Since 2^19 = 524288 is greater than 400,000, every English word can be mapped to a 19-bit integer.
Due to [Zipf's law](https://en.wikipedia.org/wiki/Zipf%27s_law), a small number of words in any spoken language appear far more often than the rest. Generally, for any language, the 100 most common words make up around 50% of all words spoken. Using this insight, this algorithm encodes the 32 most common words in 1 byte, the 2048 most common words in 2 bytes, and the 524288 most common words in 3 bytes (this repo's English dictionary has only 466550 words). If a combination of characters is not in the known dictionary of words, or if encoding it would not decrease file size, it is stored as plain-text 7-bit ASCII.
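The tiering described above can be sketched as a small lookup on a word's position in the dictionary. This is a minimal illustration, not the repository's actual code: the function names `encoded_len` and `saves_space` are hypothetical, and it assumes the dictionary index equals the word's frequency rank (0 = most common) and that a compressed word also absorbs the one separator character before it.

```rust
/// Number of bytes needed to encode the word at `rank` (0 = most common),
/// or `None` if the rank does not fit in the 19-bit index space.
/// Hypothetical helper; assumes the index is the frequency rank.
fn encoded_len(rank: u32) -> Option<usize> {
    match rank {
        0..=31 => Some(1),          // 5 index bits  -> 32 words
        32..=2_047 => Some(2),      // 11 index bits -> 2048 words
        2_048..=524_287 => Some(3), // 19 index bits -> 524288 words
        _ => None,
    }
}

/// Whether compressing `word` would shrink the output.
/// Assumption: the plain-text cost is the word's bytes plus the one
/// separator character (space or newline) that the encoding absorbs.
fn saves_space(rank: u32, word: &str) -> bool {
    match encoded_len(rank) {
        Some(n) => n < word.len() + 1,
        None => false,
    }
}
```

Under these assumptions, a rare one-letter word such as `a` at rank 5000 would stay as plain text, because 3 encoded bytes are not fewer than the 2 plain-text bytes it replaces.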
## Encoding
There are four types of encodings:
### ASCII
``0XXXXXXX`` A non-compressed plaintext ASCII character (value 127 or lower).
### 1 byte
``101EEEEE`` A word with a space character before it.
### 2 byte
``110CDEEE EEEEEEEE`` A word whose previous character and case are determined by ``C`` and ``D``.
### 3 byte
``111CDEEE EEEEEEEE EEEEEEEE`` A word whose previous character and case are determined by ``C`` and ``D``.
## Encoding Key:
All encodings start with a byte of the form ``AXXXXXXX``. If ``A`` is 0, the byte is plaintext ASCII. If ``A`` is 1, the byte begins a compressed word, and the two ``B`` bits in ``ABBXXXXX`` determine the encoding type and thus the layout of the remaining bits.
- ``A`` - Is this a compressed word or an ASCII character? (0 = char, 1 = compressed)
- ``B`` - 2-bit integer storing the byte length of the compressed word (01 = 1 byte, 10 = 2 bytes, 11 = 3 bytes)
- ``C`` - What character came before this word? (0 = ' ', 1 = '\n')
- ``D`` - Case of the first character in the word (0 = lower, 1 = upper)
- ``E`` - Integer storing the index of the compressed word
- ``X`` - A bit that can be either a 0 or a 1
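
The key above can be turned into a pair of pack and unpack routines. The following is a sketch under stated assumptions rather than the repository's implementation: the names `Token`, `encode_word`, and `decode_token` are hypothetical, the index bits are assumed to be stored most significant first across the bytes, and the 1-byte form is assumed to imply a lowercase word preceded by a space.

```rust
/// One decoded unit of the stream (hypothetical type for illustration).
#[derive(Debug, PartialEq)]
enum Token {
    Ascii(u8),
    Word { index: u32, newline_before: bool, capitalized: bool },
}

/// Packs a dictionary index and its C/D flags into 1 to 3 bytes.
/// Assumption: index bits are laid out most significant first.
fn encode_word(index: u32, newline_before: bool, capitalized: bool) -> Option<Vec<u8>> {
    // C sits at bit 4 and D at bit 3 of the first byte.
    let flags = ((newline_before as u8) << 4) | ((capitalized as u8) << 3);
    if index < 32 && !newline_before && !capitalized {
        // 101EEEEE: no room for C/D, so only space + lowercase fits here.
        Some(vec![0b1010_0000 | index as u8])
    } else if index < 2_048 {
        // 110CDEEE EEEEEEEE
        Some(vec![0b1100_0000 | flags | (index >> 8) as u8, index as u8])
    } else if index < 524_288 {
        // 111CDEEE EEEEEEEE EEEEEEEE
        Some(vec![
            0b1110_0000 | flags | (index >> 16) as u8,
            (index >> 8) as u8,
            index as u8,
        ])
    } else {
        None // does not fit in 19 bits
    }
}

/// Decodes the token at the start of `bytes`, returning it together with
/// the number of bytes consumed. Returns `None` on truncated or unknown input.
fn decode_token(bytes: &[u8]) -> Option<(Token, usize)> {
    let first = *bytes.first()?;
    if (first & 0b1000_0000) == 0 {
        return Some((Token::Ascii(first), 1)); // A = 0: plain ASCII
    }
    let len = ((first >> 5) & 0b11) as usize; // B: byte length
    if len == 0 || bytes.len() < len {
        return None; // B = 00 is not described in the key
    }
    if len == 1 {
        let index = (first & 0b0001_1111) as u32;
        return Some((Token::Word { index, newline_before: false, capitalized: false }, 1));
    }
    let newline_before = (first & 0b0001_0000) != 0; // C
    let capitalized = (first & 0b0000_1000) != 0; // D
    let mut index = (first & 0b0000_0111) as u32;
    for &b in &bytes[1..len] {
        index = (index << 8) | b as u32;
    }
    Some((Token::Word { index, newline_before, capitalized }, len))
}
```

A decoder built this way only needs to inspect the first byte to know how many bytes to consume, which is why decompression is a simple linear pass over the input.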
## Notes
- The compression algorithm currently only works with input text files containing ASCII characters with values 127 and lower.
- This algorithm does not compress words in all uppercase characters.
- While decompression is straightforward, the compression algorithm must identify compressible words in order to compress them. Currently, it may not identify some words that end in ``'s``. Improving the identification of words may result in slightly greater compression.
- This algorithm could easily be optimized for other languages by using a words.txt mapping for that language, so long as the language uses ASCII characters and words.txt does not exceed 524288 (2^19) words.
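
The two constraints in the last note (ASCII only, at most 2^19 entries) could be checked before swapping in a word list for another language. This is a minimal sketch, assuming words.txt holds one word per line; `validate_word_list` and `MAX_WORDS` are hypothetical names, not part of the repository.

```rust
/// Largest dictionary the 19-bit index can address.
const MAX_WORDS: usize = 1 << 19; // 524288

/// Checks the contents of a word list (assumed: one word per line) against
/// the two constraints from the notes. Returns the word count on success.
fn validate_word_list(contents: &str) -> Result<usize, String> {
    let mut count = 0;
    for (line_no, word) in contents.lines().enumerate() {
        if !word.is_ascii() {
            return Err(format!("line {}: non-ASCII word {:?}", line_no + 1, word));
        }
        count += 1;
    }
    if count > MAX_WORDS {
        return Err(format!("{} words exceed the limit of {}", count, MAX_WORDS));
    }
    Ok(count)
}
```

The file contents could be loaded with `std::fs::read_to_string("words.txt")` and passed to this function before building the index.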