https://github.com/dito97/alphacodings
base26 and base52 encodings
encodings natural-language-processing tokenization uv vocabulary
- Host: GitHub
- URL: https://github.com/dito97/alphacodings
- Owner: DiTo97
- License: mit
- Created: 2024-08-24T19:52:25.000Z (about 1 year ago)
- Default Branch: main
- Last Pushed: 2025-03-22T14:19:56.000Z (7 months ago)
- Last Synced: 2025-04-09T04:38:13.295Z (6 months ago)
- Topics: encodings, natural-language-processing, tokenization, uv, vocabulary
- Language: Python
- Homepage:
- Size: 1.76 MB
- Stars: 2
- Watchers: 1
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
README
# Alphacodings
base26 ([A-Z]) and base52 ([A-Za-z]) encodings
## 🌟 overview
transform any string to alphabetic-only with base26 ([A-Z]) and base52 ([A-Za-z]) lossless encodings; useful for transmitting textual data over restrictive channels or for training AI models and tokenizers on simpler vocabularies.
**Alphacodings** is a fast and lightweight library using [GMP arithmetic](https://gmplib.org).
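To make the idea of a lossless alphabetic encoding concrete, here is a minimal pure-Python sketch. It is not the library's implementation: it uses Python's built-in integers instead of GMP, and the names `alpha_encode`, `alpha_decode`, and `BASE52`, as well as the `0x01` sentinel byte, are assumptions made for illustration only. The text is read as one large integer and rewritten in base 26 (or base 52) with letters as digits.

```python
import string

BASE26 = string.ascii_uppercase                           # [A-Z]
BASE52 = string.ascii_uppercase + string.ascii_lowercase  # [A-Za-z]


def alpha_encode(text: str, alphabet: str = BASE26) -> str:
    """Encode text as letters from `alphabet` (illustrative sketch only)."""
    base = len(alphabet)
    # Assumption: prefix a 0x01 sentinel byte so that leading zero bytes
    # survive the round trip through an integer.
    number = int.from_bytes(b"\x01" + text.encode("utf-8"), "big")
    digits = []
    while number:
        number, remainder = divmod(number, base)
        digits.append(alphabet[remainder])
    return "".join(reversed(digits))


def alpha_decode(encoded: str, alphabet: str = BASE26) -> str:
    """Invert alpha_encode for the same alphabet."""
    base = len(alphabet)
    index = {letter: value for value, letter in enumerate(alphabet)}
    number = 0
    for letter in encoded:
        number = number * base + index[letter]
    raw = number.to_bytes((number.bit_length() + 7) // 8, "big")
    return raw[1:].decode("utf-8")  # drop the sentinel byte


if __name__ == "__main__":
    sample = "sample page\nwelcome!\n"
    assert alpha_decode(alpha_encode(sample)) == sample
    assert alpha_decode(alpha_encode(sample, BASE52), BASE52) == sample
```

The larger alphabet produces shorter output because each base52 digit carries more information than a base26 digit.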
## ⚙️ installation
```shell
python -m pip install alphacodings
```
## 🚀 usage
```python
from alphacodings import base26_encode, base26_decode, base52_encode, base52_decode

string = """\
sample page
welcome!
you are reading a sample HTML string.
"""
if __name__ == "__main__":
    encoding_base26 = base26_encode(string)
    print(encoding_base26)
    # >>> ["YBPNLKVNQWZQCMDHMLNDTVQCCRKQLNCFGMQPNGQCIXHUUPHFUNKUFEPDLKIGARFOKTDEZKQHXGCPYHDZKKVIUDNFOAYYAUOQFBJFFGSTKAXNWGDPVUJNBARPNXBASHZBXIBSSEFTAIQRPEADSOVVNXUMQXVDWTAIVCIVWQZAHAGYAVZYKGMETJOOUQNOEXMSOOGSKVMFBYZIBZDAITICYVXMJTTCCHPMSCABLYUMFDUNLVSLNKHSBPKCGASXJSFYDHZFAOEQTUACEBIFKQGYC"]

    encoding_base52 = base52_encode(string)
    print(encoding_base52)
    # >>> ["EgcgYRPxckylMQWRLDADNZxPJiJcHaVwYHLnicahBgaotGGANZuvsvcpSSOJFLXvKPjRlNQCJqqdviiIdtnwJyDOnWojsrpkWSTZFHbMIREvREjpsODtSxoLlLjQZOoehsGFzawGQecyuomgpZQNyFnZQLWPiDhzClwxBFCCwdqduGJoshrwFdwHWMtJpSTmjxzaYmNvzOIOwLkJvyQHCaFtrODPhbhBpPBmC"]

    # both encodings are lossless: decoding returns the original string
    assert base26_decode(encoding_base26) == string
    assert base52_decode(encoding_base52) == string
```
## 🧠 motivation
The library is inspired by [@robert](https://github.com/robert)'s base26 implementation and his story of manipulating data transmission in restrictive network channels on long-distance flights using alphabetic-only encodings and tokenization.
## 📊 benchmarking
our implementation is orders of magnitude more efficient on 100k+ strings:
*Figure 1: runtime and memory usage performance against Heaton's original implementation with and without automatic chunking and SIMD on variable-length strings with a strict 60-second timeout; average over 5 trials.*
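The caption mentions automatic chunking. How **Alphacodings** chunks its input is not documented here, so the following is only a hedged sketch of one possible scheme, with hypothetical names (`BLOCK`, `width`, `encode_chunked`, `decode_chunked`). The motivation is general: converting one huge integer between bases with repeated division costs roughly quadratic time in the input length, while converting fixed-size blocks independently keeps the total cost linear.

```python
import string

ALPHABET = string.ascii_uppercase  # base26 digits; "A" acts as the zero digit
BLOCK = 64                         # hypothetical block size in bytes


def width(num_bytes: int) -> int:
    """Smallest number of base26 digits that can represent `num_bytes` bytes."""
    digits, capacity, limit = 0, 1, 256 ** num_bytes
    while capacity < limit:
        capacity *= 26
        digits += 1
    return digits


def encode_chunked(data: bytes) -> str:
    out = []
    for start in range(0, len(data), BLOCK):
        block = data[start:start + BLOCK]
        number = int.from_bytes(block, "big")
        digits = []
        # Emit a fixed number of digits per block so boundaries stay recoverable.
        for _ in range(width(len(block))):
            number, remainder = divmod(number, 26)
            digits.append(ALPHABET[remainder])
        out.append("".join(reversed(digits)))
    return "".join(out)


def decode_chunked(text: str) -> bytes:
    full = width(BLOCK)
    # width() is strictly increasing, so a short final block is identified
    # unambiguously by how many letters are left.
    tail_sizes = {width(n): n for n in range(BLOCK)}
    out, pos = [], 0
    while pos < len(text):
        remaining = len(text) - pos
        size, step = (BLOCK, full) if remaining >= full else (tail_sizes[remaining], remaining)
        number = 0
        for letter in text[pos:pos + step]:
            number = number * 26 + ALPHABET.index(letter)
        out.append(number.to_bytes(size, "big"))
        pos += step
    return b"".join(out)


if __name__ == "__main__":
    payload = ("you are reading a sample string. " * 1000).encode("utf-8")
    assert decode_chunked(encode_chunked(payload)) == payload
```

The trade-off in this sketch is a slightly longer output than a single big-integer conversion, since every block is rounded up to a whole number of letters.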
## 🤝 contributing
contributions to **Alphacodings** are welcome!
feel free to submit pull requests or open issues on our repository.
## 📄 license
see the [LICENSE](LICENSE) file for more details.