{"id":29567145,"url":"https://github.com/satvikg7/llm-text-compressor","last_synced_at":"2025-12-31T14:20:20.563Z","repository":{"id":303209763,"uuid":"866911457","full_name":"SatvikG7/LLM-Text-Compressor","owner":"SatvikG7","description":"LLM based text compressor which performs better than SOTA text compression algorithms","archived":false,"fork":false,"pushed_at":"2025-07-06T10:34:42.000Z","size":27,"stargazers_count":0,"open_issues_count":0,"forks_count":0,"subscribers_count":1,"default_branch":"main","last_synced_at":"2025-07-06T11:31:51.065Z","etag":null,"topics":["arithmetic-coding","llm","text-compression"],"latest_commit_sha":null,"homepage":"Reference: https://arxiv.org/abs/2306.04050","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/SatvikG7.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null}},"created_at":"2024-10-03T05:30:15.000Z","updated_at":"2025-07-06T11:01:43.000Z","dependencies_parsed_at":"2025-07-06T11:31:53.990Z","dependency_job_id":"830f7c5f-7f83-4538-bbc5-8c8c088771df","html_url":"https://github.com/SatvikG7/LLM-Text-Compressor","commit_stats":null,"previous_names":["satvikg7/llm-text-compressor"],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/SatvikG7/LLM-Text-Compressor","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/SatvikG7%2FLLM-Text-Compressor","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/SatvikG7%2FLLM-Text-Compressor/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/SatvikG7%2FL
LM-Text-Compressor/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/SatvikG7%2FLLM-Text-Compressor/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/SatvikG7","download_url":"https://codeload.github.com/SatvikG7/LLM-Text-Compressor/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/SatvikG7%2FLLM-Text-Compressor/sbom","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":265849315,"owners_count":23838250,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["arithmetic-coding","llm","text-compression"],"created_at":"2025-07-18T23:10:18.993Z","updated_at":"2025-12-31T14:20:20.515Z","avatar_url":"https://github.com/SatvikG7.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# LLM Text Compressor\n\nA novel text compression system that leverages Large Language Models (specifically GPT-2) to achieve high compression ratios by predicting token sequences and storing only prediction ranks instead of raw tokens.\n\n## Overview\n\nThis project implements an innovative approach to text compression that combines the predictive power of modern language models with traditional compression techniques. 
Instead of storing the actual tokens, the system stores the rank of each token within the language model's probability-ordered predictions, which are then further compressed using arithmetic coding.\n\n## Key Features\n\n- **High Compression Ratio**: Achieves ~74% compression on test data (Alice in Wonderland: 41,790 bytes → 10,620 bytes)\n- **Lossless Compression**: Perfect reconstruction of original text\n- **LLM-Powered**: Uses GPT-2's predictive capabilities for intelligent compression\n- **Sliding Window Approach**: Maintains context with configurable memory window size\n- **Arithmetic Coding**: Secondary compression layer for optimal storage efficiency\n\n## How It Works\n\n### Compression Process\n\n1. **Tokenization**: Input text is tokenized using GPT-2's tokenizer\n2. **Sliding Window**: A memory window of size M (default: 16) slides through the token sequence\n3. **Prediction**: For each position, GPT-2 predicts the probability distribution of the next token\n4. **Rank Calculation**: Instead of storing the actual token, the system stores its rank in the sorted probability distribution\n5. **Arithmetic Coding**: The sequence of ranks is further compressed using arithmetic coding\n6. **Storage**: The compressed data is stored in a binary file\n\n### Decompression Process\n\n1. **Decode**: Arithmetic coding is reversed to recover the rank sequence\n2. **Reconstruction**: For each rank, GPT-2 generates predictions and selects the token at that rank\n3. **Sliding Window**: The context window is updated with each predicted token\n4. 
**Detokenization**: The token sequence is converted back to readable text\n\n## Installation\n\n### Prerequisites\n\n- Python 3.8+\n- CUDA-compatible GPU (recommended for performance)\n- ~2GB free disk space for GPT-2 model\n\n### Dependencies\n\nInstall the required packages:\n\n```bash\npip install -r requirements.txt\n```\n\nThe main dependencies include:\n- `torch` - PyTorch framework\n- `transformers` - Hugging Face Transformers for GPT-2\n- `numpy` - Numerical computations\n- Standard library modules for file I/O and data structures\n\n## Usage\n\n### Basic Usage\n\nThe system is designed to work out-of-the-box with the provided sample text:\n\n```bash\npython main.py\n```\n\nThis will:\n1. Load the GPT-2 model and tokenizer\n2. Read `alice_in_wonderland.txt`\n3. Compress it to `compressed.bin`\n4. Decompress it to `decompressed.txt`\n\n### Custom Text Compression\n\nTo compress your own text file:\n\n1. Replace `alice_in_wonderland.txt` with your text file, or\n2. Modify the filename in `main.py` (line 17):\n\n```python\nwith open(\"your_text_file.txt\", \"r\") as file:\n    text = file.read()\n```\n\n### Configuration Options\n\n#### Memory Window Size (M)\n\nAdjust the context window size by modifying the `M` parameter in `main.py`:\n\n```python\nM = 16  # Default value, can be adjusted for different compression/quality tradeoffs\n```\n\n- **Larger M**: Better context, potentially better compression, but slower processing\n- **Smaller M**: Faster processing, but may reduce compression efficiency\n\n#### Arithmetic Coding Precision\n\nModify the precision in `arithmetic_coding.py`:\n\n```python\nclass ArithmeticCoder:\n    def __init__(self, precision=32):  # Adjust precision as needed\n```\n\n## Project Structure\n\n```\nLLM-Text-Compressor/\n│\n├── main.py                 # Main entry point and orchestration\n├── compress.py             # Core compression logic\n├── decompress.py           # Core decompression logic\n├── arithmetic_coding.py    # 
Arithmetic coding implementation\n├── requirements.txt        # Python dependencies\n├── alice_in_wonderland.txt # Sample input text\n├── compressed.bin          # Compressed output (generated)\n├── decompressed.txt        # Decompressed output (generated)\n└── README.md              # This documentation\n```\n\n## Technical Details\n\n### Algorithm Components\n\n#### 1. LLM Rank Compression (`compress.py`)\n\n```python\ndef compress(input_ids, model, M=4) -\u003e List[int]:\n```\n\n- Uses sliding window approach with memory size M\n- For each token position, generates GPT-2 predictions\n- Computes rank of actual token in sorted prediction probabilities\n- Returns list of ranks instead of original tokens\n\n#### 2. LLM Rank Decompression (`decompress.py`)\n\n```python\ndef decompress(ranks, input_ids, tokenizer, model, M=4) -\u003e str:\n```\n\n- Reconstructs text by using ranks to select tokens from GPT-2 predictions\n- Maintains sliding window context during reconstruction\n- Returns fully reconstructed text string\n\n#### 3. 
Arithmetic Coding (`arithmetic_coding.py`)\n\nImplements standard arithmetic coding with:\n- **Encoding**: Converts the rank sequence to a single compressed integer\n- **Decoding**: Recovers the original rank sequence from compressed data\n- **File I/O**: Handles binary storage with frequency tables\n\n### Performance Metrics\n\nBased on the included Alice in Wonderland sample:\n\n| Metric | Value |\n|--------|-------|\n| Original Size | 41,790 bytes |\n| Compressed Size | 10,620 bytes |\n| Space Savings | ~74.6% |\n| Compressed Size Relative to Original | ~25.4% |\n\n### Memory Requirements\n\n- **GPU Memory**: ~2GB for GPT-2 model\n- **System RAM**: ~1GB for processing\n- **Disk Space**: Original text size + ~25% for compressed output\n\n### Processing Time\n\nProcessing time scales with:\n- Text length (linear)\n- Memory window size M (linear)\n- GPU performance (significant impact)\n\n## Examples\n\n### Compression Example\n\n```python\nfrom compress import compress\nfrom transformers import GPT2LMHeadModel, GPT2Tokenizer\n\n# Load model\nmodel = GPT2LMHeadModel.from_pretrained(\"gpt2\")\ntokenizer = GPT2Tokenizer.from_pretrained(\"gpt2\")\n\n# Compress text\ntext = \"Your text here...\"\ninput_ids = tokenizer.encode(text, return_tensors=\"pt\")\nranks = compress(input_ids, model, M=16)\n```\n\n### Decompression Example\n\n```python\nfrom decompress import decompress\n\n# M must match the window size used during compression (16 in the example above)\nM = 16\n\n# Decompress ranks back to text, seeding the context with the first M tokens\nreconstructed_text = decompress(ranks, input_ids[0][:M], tokenizer, model, M)\n```\n\n## Limitations and Considerations\n\n### Current Limitations\n\n1. **GPU Dependency**: Requires CUDA-compatible GPU for practical performance\n2. **Model Size**: GPT-2 model requires significant disk space and memory\n3. **Processing Speed**: Compression/decompression is slower than traditional algorithms\n4. 
**Text Domain**: Performance may vary significantly across different text types\n\n### Best Use Cases\n\n- **Academic Research**: Novel compression algorithm research\n- **Long-form Text**: Books, articles, documents with rich linguistic structure\n- **Educational Purposes**: Understanding LLM applications in compression\n\n### Not Recommended For\n\n- **Real-time Applications**: Due to processing overhead\n- **Binary Data**: Designed specifically for natural language text\n- **Short Text Snippets**: Overhead may exceed benefits\n\n## Contributing\n\n1. Fork the repository\n2. Create a feature branch\n3. Make your changes\n4. Add tests if applicable\n5. Submit a pull request\n\n### Development Setup\n\n```bash\ngit clone https://github.com/SatvikG7/LLM-Text-Compressor.git\ncd LLM-Text-Compressor\npip install -r requirements.txt\n```\n\n## License\n\nThis project is open source. Please refer to the repository license for specific terms.\n\n## Research and References\n\nThis implementation is based on the concept of using language model predictions for text compression. The approach demonstrates how modern NLP models can be applied to traditional computer science problems like data compression.\n\n## Troubleshooting\n\n### Common Issues\n\n1. **CUDA Out of Memory**: Reduce batch size or use smaller memory window (M)\n2. **Model Download Issues**: Ensure stable internet connection for initial GPT-2 download\n3. 
**Performance Issues**: Verify CUDA installation and GPU availability\n\n### Getting Help\n\n- Check that all dependencies are correctly installed\n- Verify GPU drivers and CUDA installation\n- Ensure sufficient disk space for model and output files\n\n---\n\n*For questions, issues, or contributions, please visit the project repository.*","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fsatvikg7%2Fllm-text-compressor","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fsatvikg7%2Fllm-text-compressor","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fsatvikg7%2Fllm-text-compressor/lists"}