{"id":24567149,"url":"https://github.com/justnunuz/beyond_orion","last_synced_at":"2025-10-12T23:56:08.452Z","repository":{"id":225340665,"uuid":"511175939","full_name":"JustNunuz/Beyond_Orion","owner":"JustNunuz","description":"Unmatched compression.","archived":false,"fork":false,"pushed_at":"2022-10-27T12:54:01.000Z","size":456,"stargazers_count":0,"open_issues_count":0,"forks_count":0,"subscribers_count":1,"default_branch":"main","last_synced_at":"2025-03-17T03:41:41.987Z","etag":null,"topics":["compression","cpython","iot","python"],"latest_commit_sha":null,"homepage":"https://journals.mmupress.com/index.php/jiwe/article/view/782","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/JustNunuz.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null}},"created_at":"2022-07-06T14:42:19.000Z","updated_at":"2024-07-23T10:55:54.000Z","dependencies_parsed_at":"2024-03-01T14:55:01.324Z","dependency_job_id":null,"html_url":"https://github.com/JustNunuz/Beyond_Orion","commit_stats":null,"previous_names":["justnunuz/beyond_orion"],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/JustNunuz/Beyond_Orion","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/JustNunuz%2FBeyond_Orion","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/JustNunuz%2FBeyond_Orion/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/JustNunuz%2FBeyond_Orion/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/JustNunuz%2FBeyond_Orion/manifests","owner_url":"https://repos.ecos
yste.ms/api/v1/hosts/GitHub/owners/JustNunuz","download_url":"https://codeload.github.com/JustNunuz/Beyond_Orion/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/JustNunuz%2FBeyond_Orion/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":279013473,"owners_count":26085274,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","status":"online","status_checked_at":"2025-10-12T02:00:06.719Z","response_time":53,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["compression","cpython","iot","python"],"created_at":"2025-01-23T13:16:54.861Z","updated_at":"2025-10-12T23:56:08.413Z","avatar_url":"https://github.com/JustNunuz.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# Beyond Orion \n\n![LZHW(1)](https://user-images.githubusercontent.com/59164172/198283332-2c466fba-a603-4693-b804-888bde68a896.gif)\n\n[![forthebadge](https://forthebadge.com/images/badges/made-with-python.svg)](https://forthebadge.com) [![forthebadge](https://forthebadge.com/images/badges/open-source.svg)](https://forthebadge.com) ![maintained -yes](https://user-images.githubusercontent.com/59164172/195039473-7f725d9c-01fb-4b5e-90e3-367c3000f9e3.svg)\n\n## About the project\n### Traditional IoT compression schemes fail to:\n\n1. Minimise energy consumption within IoT nodes\n2. Provide high scalability\n3. 
Ensure fault tolerance (resilience to errors)\n4. Remain robust\n5. Keep complexity low\n\n**Data frame compression and decompression can run in parallel**.\n\n**Compatible with Excel workbooks, text files, comma-separated values, data frames, arrays and lists.**\n\n\n## Quick Start\n\n```bash\npip install -r requirements.txt\n```\n\n```python\nimport lzhw\n\nsample_data = [\"Sunny\", \"Sunny\", \"Overcast\", \"Rain\", \"Rain\", \"Rain\", \"Overcast\",\n               \"Sunny\", \"Sunny\", \"Rain\", \"Sunny\", \"Overcast\", \"Overcast\", \"Rain\",\n               \"Rain\", \"Rain\", \"Sunny\", \"Sunny\", \"Overcaste\"]\n\ncompressed = lzhw.LZHW(sample_data)\n## let's see what the compressed object looks like:\nprint(compressed.compressed)\n# (506460, 128794, 112504)\n\n## its size\nprint(compressed.size())\n# 72\n\n## size of original\nfrom sys import getsizeof\nprint(getsizeof(sample_data))\n# 216\n\nprint(compressed.space_saving())\n# space saving from original to compressed is 67%\n\n## Let's decompress and check whether there is any information loss\ndecomp = compressed.decompress()\nprint(decomp == sample_data)\n# True\n```\n\nAs we saw, the LZHW class has saved 67% of the space used to store the original list without any loss. 
This percentage can get better with bigger data that may have repeated sequences.\nThe class also has some useful helper methods, such as **space_saving**, **size**, and **decompress()**, which reverts to the original data.\n\nAnother example with numeric data.\n\n```python\nfrom random import sample, choices\n\nnumbers = choices(sample(range(0, 5), 5), k = 20)\ncomp_num = lzhw.LZHW(numbers)\n\nprint(getsizeof(numbers) \u003e comp_num.size())\n# True\n\nprint(numbers == list(map(int, comp_num.decompress()))) ## make it int again\n# True\n\nprint(comp_num.space_saving())\n# space saving from original to compressed is 73%\n```\n\nLet's look at how the compressed object is stored and what it looks like when printed:\nThe LZHW class has an attribute called **compressed**, which is a tuple of integers representing the encoded triplets.\n\n```python\nprint(comp_num.compressed) # how the compressed data is saved (as tuple of 3 integers)\n# (8198555, 620206, 3059308)\n```\n\nWe can also write the compressed data to files using the **save_to_file** method,\nand read it back and decompress it using the **decompress_from_file** function.\n\n```python\nstatus = [\"Good\", \"Bad\", \"Bad\", \"Bad\", \"Good\", \"Good\", \"Average\", \"Average\", \"Good\",\n          \"Average\", \"Average\", \"Bad\", \"Average\", \"Good\", \"Bad\", \"Bad\", \"Good\"]\ncomp_status = lzhw.LZHW(status)\ncomp_status.save_to_file(\"status.txt\")\ndecomp_status = lzhw.decompress_from_file(\"status.txt\")\nprint(status == decomp_status)\n# True\n```\n\n## Compressing DataFrames (in Parallel)\n\nlzhw doesn't only work on lists; it also compresses pandas DataFrames and saves them into compressed files that can be decompressed later.\n\n```python\nimport pandas as pd\n\ndf = pd.DataFrame({\"a\": [1, 1, 2, 2, 1, 3, 4, 4],\n                   \"b\": [\"A\", \"A\", \"B\", \"B\", \"A\", \"C\", \"D\", \"D\"]})\ncomp_df = lzhw.CompressedDF(df)\n# 100%|██████████████████████████████████████████████████████████████████████████████████| 2/2 
[00:00\u003c00:00, 2003.97it/s]\n```\n\nLet's check the space saved by compression:\n\n```python\ncomp_space = 0\nfor i in range(len(comp_df.compressed)):\n\tcomp_space += comp_df.compressed[i].size()\n\nprint(comp_space, getsizeof(df))\n# 296 712\n\n## Test information loss\nprint(list(map(int, comp_df.compressed[0].decompress())) == list(df.a))\n# True\n```\n\n#### Saving and Loading Compressed DataFrames\n\nWith lzhw we can save a data frame into a compressed file and then read it again\nusing the **save_to_file** method and the **decompress_df_from_file** function.\n\n```python\n## Save to file\ncomp_df.save_to_file(\"comp_df.txt\")\n\n## Load the file\noriginal = lzhw.decompress_df_from_file(\"comp_df.txt\")\n# 100%|██████████████████████████████████████████████████████████████████████████████████| 2/2 [00:00\u003c00:00, 2004.93it/s]\n\nprint(original)\n#   a  b\n#0  1  A\n#1  1  A\n#2  2  B\n#3  2  B\n#4  1  A\n#5  3  C\n#6  4  D\n#7  4  D\n```\n\n#### Compressing Bigger DataFrames\n\nLet's try to compress a real-world data frame, the **german_credit.xlsx** file from the [UCI Machine Learning Repository](\u003chttps://archive.ics.uci.edu/ml/datasets/statlog+(german+credit+data)\u003e) [1].\n\nThe original txt file is **219 KB** on disk.\n\nLet's have a look at how to use parallelism in this example:\n\n```python\ngc_original = pd.read_excel(\"examples/german_credit.xlsx\")\ncomp_gc = lzhw.CompressedDF(gc_original, parallel = True, n_jobs = 2) # two CPUs\n# 100%|█████████████████████████████████████████████████████████████████████████████████| 62/62 [00:00\u003c00:00, 257.95it/s]\n\n## Compare sizes in Python:\ncomp_space = 0\nfor i in range(len(comp_gc.compressed)):\n\tcomp_space += comp_gc.compressed[i].size()\n\nprint(comp_space, getsizeof(gc_original))\n# 4488 548852\n\nprint(list(map(int, comp_gc.compressed[0].decompress())) == list(gc_original.iloc[:, 0]))\n# True\n```\n\n**Huge space saving, 99%, with no information loss!**\n\nLet's now write the compressed dataframe into a file 
and compare the sizes of the files.\n\n```python\ncomp_gc.save_to_file(\"gc_compressed.txt\")\n```\n\nThe compressed file is **44 KB**, meaning that in total we saved around **79%**.\nFuture versions will be optimized to save more space.\n\nLet's now check whether we lose any information when we reload the file.\n\n```python\n## Load the file\ngc_original2 = lzhw.decompress_df_from_file(\"gc_compressed.txt\")\n# 100%|█████████████████████████████████████████████████████████████████████████████████| 62/62 [00:00\u003c00:00, 259.46it/s]\n\nprint(list(map(int, gc_original2.iloc[:, 13])) == list(gc_original.iloc[:, 13]))\n# True\n\nprint(gc_original.shape == gc_original2.shape)\n# True\n```\n\n**Perfect! There is no information loss at all.**\n\n## More Functionalities\n\nWith **lzhw** you can also choose which columns of a data frame you want to compress.\nThe **CompressedDF** class has a **selected_cols** argument for this. You can also choose how many rows to decompress with the **n_rows** argument.\n\nYou can also set the **sliding_window** argument to trade compression speed against compressed size.\nThe default value is 256, meaning that the algorithm searches the previous 256 values for similar sequences. Increasing this number can give a smaller compressed size at the cost of a slightly slower algorithm, though not by much, as **lz77_compress** scales up reasonably well.\n\nYou can also compress large CSV files in chunks, without loading the whole file into memory, using the **CompressedFromCSV** class, which reads a file in chunks using _pandas chunksize_ and compresses each chunk separately.\n\n**Please see the [documentation](https://mnoorfawi.github.io/lzhw/) for a deeper look.**\n\n## LZHW Comparison with joblib algorithms\n\nI love [joblib](https://joblib.readthedocs.io/en/latest/index.html). 
I usually use it for **parallelism** because it combines great performance with simplicity.\n\nI once saw this [article](https://joblib.readthedocs.io/en/latest/auto_examples/compressors_comparison.html#sphx-glr-auto-examples-compressors-comparison-py) in its documentation, which measures the performance of the different compressors available in it.\n\nBecause I am developing a compression library, I wanted to extend the code in this article by adding **lzhw** to the comparison, just to know where my library stands.\n\njoblib uses three main techniques in this article: **Zlib, LZMA and LZ4**.\n\nI will use [1500000 Sales Records Data](http://eforexcel.com/wp/wp-content/uploads/2017/07/1500000%20Sales%20Records.zip).\n\n**We will look at compression and decompression durations and the compressed file sizes.**\n\n_The downloaded compressed file is 53MB on the website._\n\nI will reproduce the code from the joblib documentation:\n\n```python\nimport os\nimport time\n\nimport pandas as pd\nfrom joblib import dump, load\n\nimport lzhw\n\ndata = pd.read_csv(\"1500000 Sales Records.csv\")\nprint(data.shape)\n\npickle_file = './pickle_data.joblib'\nstart = time.time()\nwith open(pickle_file, 'wb') as f:\n    dump(data, f)\nraw_dump_duration = time.time() - start\nprint(\"Raw dump duration: %0.3fs\" % raw_dump_duration)\n\nraw_file_size = os.stat(pickle_file).st_size / 1e6\nprint(\"Raw dump file size: %0.3fMB\" % raw_file_size)\n\nstart = time.time()\nwith open(pickle_file, 'rb') as f:\n    load(f)\nraw_load_duration = time.time() - start\nprint(\"Raw load duration: %0.3fs\" % raw_load_duration)\n\n## ZLIB\nstart = time.time()\nwith open(pickle_file, 'wb') as f:\n    dump(data, f, compress='zlib')\nzlib_dump_duration = time.time() - start\nprint(\"Zlib dump duration: %0.3fs\" % zlib_dump_duration)\n\nzlib_file_size = os.stat(pickle_file).st_size / 1e6\nprint(\"Zlib file size: %0.3fMB\" % zlib_file_size)\n\nstart = time.time()\nwith open(pickle_file, 'rb') as f:\n    load(f)\nzlib_load_duration = time.time() - start\nprint(\"Zlib load duration: %0.3fs\" % 
zlib_load_duration)\n\n## LZMA\nstart = time.time()\nwith open(pickle_file, 'wb') as f:\n    dump(data, f, compress=('lzma', 3))\nlzma_dump_duration = time.time() - start\nprint(\"LZMA dump duration: %0.3fs\" % lzma_dump_duration)\n\nlzma_file_size = os.stat(pickle_file).st_size / 1e6\nprint(\"LZMA file size: %0.3fMB\" % lzma_file_size)\n\nstart = time.time()\nwith open(pickle_file, 'rb') as f:\n    load(f)\nlzma_load_duration = time.time() - start\nprint(\"LZMA load duration: %0.3fs\" % lzma_load_duration)\n\n## LZ4\nstart = time.time()\nwith open(pickle_file, 'wb') as f:\n    dump(data, f, compress='lz4')\nlz4_dump_duration = time.time() - start\nprint(\"LZ4 dump duration: %0.3fs\" % lz4_dump_duration)\n\nlz4_file_size = os.stat(pickle_file).st_size / 1e6\nprint(\"LZ4 file size: %0.3fMB\" % lz4_file_size)\n\nstart = time.time()\nwith open(pickle_file, 'rb') as f:\n    load(f)\nlz4_load_duration = time.time() - start\nprint(\"LZ4 load duration: %0.3fs\" % lz4_load_duration)\n\n## LZHW\nstart = time.time()\nlzhw_data = lzhw.CompressedDF(data)\nlzhw_data.save_to_file(\"lzhw_data.txt\")\nlzhw_compression_duration = time.time() - start\nprint(\"LZHW compression duration: %0.3fs\" % lzhw_compression_duration)\n\nlzhw_file_size = os.stat(\"lzhw_data.txt\").st_size / 1e6\nprint(\"LZHW file size: %0.3fMB\" % lzhw_file_size)\n\nstart = time.time()\nlzhw_d = lzhw.decompress_df_from_file(\"lzhw_data.txt\", parallel = True, n_jobs = -3)\n# decompression is slower than compression\nlzhw_d_duration = time.time() - start\nprint(\"LZHW decompression duration: %0.3fs\" % lzhw_d_duration)\n\n# (1500000, 14)\n# Raw dump duration: 1.294s\n# Raw dump file size: 267.591MB\n# Raw load duration: 1.413s\n# Zlib dump duration: 6.583s\n# Zlib file size: 96.229MB\n# Zlib load duration: 2.430s\n# LZMA dump duration: 76.526s\n# LZMA file size: 72.476MB\n# LZMA load duration: 9.240s\n# LZ4 dump duration: 1.984s\n# LZ4 file size: 152.374MB\n# LZ4 load duration: 2.135s\n# LZHW compression 
duration: 53.958s\n# LZHW file size: 41.816MB\n# LZHW decompression duration: 56.687s\n```\n\nNow let's visualize the new results:\n\n```python\nimport numpy as np\nimport matplotlib.pyplot as plt\n\nN = 5\nload_durations = (raw_load_duration, zlib_load_duration,\n                  lzma_load_duration, lz4_load_duration, lzhw_d_duration)\ndump_durations = (raw_dump_duration, zlib_dump_duration,\n                  lzma_dump_duration, lz4_dump_duration, lzhw_compression_duration)\nfile_sizes = (raw_file_size, zlib_file_size, lzma_file_size, lz4_file_size, lzhw_file_size)\nind = np.arange(N)\nwidth = 0.5\n\nplt.figure(1, figsize=(5, 4))\np1 = plt.bar(ind, dump_durations, width)\np2 = plt.bar(ind, load_durations, width, bottom=dump_durations)\nplt.ylabel('Time in seconds')\nplt.title('Compression \u0026 Decompression durations\\nof different algorithms')\nplt.xticks(ind, ('Raw', 'Zlib', 'LZMA', \"LZ4\", \"LZHW\"))\nplt.legend((p1[0], p2[0]), ('Compression duration', 'Decompression duration'))\n```\n\n![](./img/lzhw_duration2.jpg)\n\n```python\nplt.figure(2, figsize=(5, 4))\nplt.bar(ind, file_sizes, width, log=True)\nplt.ylabel('File size in MB')\nplt.xticks(ind, ('Raw', 'Zlib', 'LZMA', \"LZ4\", \"LZHW\"))\nplt.title('Compressed data size\\nof different algorithms')\nfor index, value in enumerate(file_sizes):\n    plt.text(index, value, str(round(value)) + \"MB\")\n```\n\n![](./img/lzhw_size2.jpg)\n\n**LZHW produces by far the smallest file, with an acceptable time difference**, especially considering the other functionality it offers for working with compressed data.\n\n#### DEFLATE Note\n\nThe technique may seem similar to the [**DEFLATE**](https://en.wikipedia.org/wiki/DEFLATE) algorithm, which uses both LZSS (a variant of LZ77) and Huffman coding, but I am not sure how the Huffman coding further compresses the triplets. 
I believe it compresses the triplets all together, not as 3 separate lists as lzhw does.\nIt also doesn't use Lempel-Ziv-Welch for further compression.\n\nLZHW also uses a **modified version of LZ77**, which uses a dictionary, **a key-value data structure, to store the already-seen patterns with their locations during compression, so that instead of blindly going back to look for a match, the algorithm knows exactly where to go**. This **speeds up the compression process**.\n\nFor example, say the algorithm has just found \"A\" and needs to find the longest match in the previous sequences. It does so using the dictionary {\"A\": [1, 4, 5, 8]}: it starts looking from these locations instead of blindly searching for the indices of \"A\".\n\n# Contributing\n\n\n# References\n\n[1] P. Kumar Singh, B. K Bhargava, M. Paprzycki, N. Chand Kaushal and W. Hong, \"Handbook of Wireless Sensor Networks: Issues and Challenges in Current Scenario's\", Advances in Intelligent Systems and Computing, vol. 1132, 2020. Available: https://www.springer.com/gp/book/9783030403041. [Accessed 22 October 2021].\n[2] B. Gaur Sanjay, M. Purohit and O. Vyas, \"Recent Advances in Wireless Sensor Network for Secure and Energy Efficient Routing Protocol\", Advances in Intelligent Systems and Computing, pp. 260-274, 2020. Available: 10.1007/978-3-030-40305-8_13 [Accessed 25 October 2021].\n[3] S. Shah, D. Seker, S. Hameed and D. Draheim, \"The Rising Role of Big Data Analytics and IoT in Disaster Management: Recent Advances, Taxonomy and Prospects\", IEEE Access, vol. 7, pp. 54595-54614, 2019. Available: 10.1109/access.2019.2913340 [Accessed 19 September 2022].\n[4] A. Fang, W. Lim and T. Balakrishnan, \"Early warning score validation methodologies and performance metrics: a systematic review\", BMC Medical Informatics and Decision Making, vol. 20, no. 1, 2020. Available: 10.1186/s12911-020-01144-8 [Accessed 19 September 2022].\n[5] B. Farahani, F. 
Firouzi, V. Chang, M. Badaroglu, N. Constant and K. Mankodiya, \"Towards fog-driven IoT eHealth: Promises and challenges of IoT in medicine and healthcare\", Future Generation Computer Systems, vol. 78, pp. 659-676, 2018. Available: 10.1016/j.future.2017.04.036 [Accessed 19 September 2022].\n[6] S. Selvaraj and S. Sundaravaradhan, \"Challenges and opportunities in IoT healthcare systems: a systematic review\", SN Applied Sciences, vol. 2, no. 1, 2019. Available: 10.1007/s42452-019-1925-y [Accessed 19 September 2022].\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fjustnunuz%2Fbeyond_orion","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fjustnunuz%2Fbeyond_orion","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fjustnunuz%2Fbeyond_orion/lists"}