{"id":20632530,"url":"https://github.com/dylan-profiler/compressio","last_synced_at":"2025-04-15T19:09:45.527Z","repository":{"id":46822672,"uuid":"284797665","full_name":"dylan-profiler/compressio","owner":"dylan-profiler","description":"Lossless in-memory compression of pandas DataFrames and Series powered by the visions type system. Up to 10x less RAM needed for the same data.","archived":false,"fork":false,"pushed_at":"2022-11-10T17:51:47.000Z","size":1422,"stargazers_count":28,"open_issues_count":0,"forks_count":1,"subscribers_count":1,"default_branch":"master","last_synced_at":"2025-04-15T19:09:40.677Z","etag":null,"topics":["compression","data-science","dtype","hacktoberfest","pandas","python","types"],"latest_commit_sha":null,"homepage":"https://dylan-profiler.github.io/visions/visions/applications/compression.html","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/dylan-profiler.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2020-08-03T20:13:59.000Z","updated_at":"2023-11-28T18:24:59.000Z","dependencies_parsed_at":"2022-08-22T23:20:26.843Z","dependency_job_id":null,"html_url":"https://github.com/dylan-profiler/compressio","commit_stats":null,"previous_names":[],"tags_count":5,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/dylan-profiler%2Fcompressio","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/dylan-profiler%2Fcompressio/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/dylan-profiler%2Fcompressio/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitH
ub/repositories/dylan-profiler%2Fcompressio/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/dylan-profiler","download_url":"https://codeload.github.com/dylan-profiler/compressio/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":249135809,"owners_count":21218365,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["compression","data-science","dtype","hacktoberfest","pandas","python","types"],"created_at":"2024-11-16T14:16:28.230Z","updated_at":"2025-04-15T19:09:45.509Z","avatar_url":"https://github.com/dylan-profiler.png","language":"Python","readme":"![Compressio Logo](https://github.com/dylan-profiler/compressio/raw/master/images/logo/compressio-logos_transparent_banner.png)\n\n# Compress*io*\n\n\u003e Compressio provides lossless in-memory compression of pandas DataFrames and Series powered by the [visions](https://github.com/dylan-profiler/visions) type system. 
Use up to 10x less RAM with the same data!\n\nGetting started is as easy as:\n\n```python\nfrom compressio import Compress\n\ncompress = Compress()\ncompressed_data = compress.it(data)\n```\n\n\n## The Framework\n\nCompressio is a general framework for automated data compression and representation management, not limited to any specific compression algorithm or implementation.\nYou have complete control: define your own types, write your own compression algorithms, or get started with the large library of types provided by [visions](https://dylan-profiler.github.io/visions/visions/api/types.html) and the suite of powerful algorithms included in compressio by default.\n\n\nThese algorithms can be organized around three basic optimization strategies:\n\n1. Prefer smaller dtypes where possible without loss of information\n2. Consider (more efficient) data representations*\n3. Compress data using more efficient data structures\n\n\\* this is where things get messy without visions\n\n### 1. 
Smaller dtypes\n\nUnder the hood, pandas leverages numpy arrays to store data.\nEach numpy array has an associated `dtype` specifying the physical, in-memory representation of the data.\nFor instance, a sequence of integers might be stored as 64-bit integers (`int64`), 8-bit unsigned integers (`uint8`) or even 32-bit floating-point numbers (`float32`).\nAn overview of the numpy type system can be found [here](https://numpy.org/doc/stable/user/basics.types.html).\n\nThese type differences have numerous computational implications: for example, while an unsigned 8-bit integer can represent numbers between 0 and 255, the range of a signed 64-bit integer is between -9,223,372,036,854,775,808 and 9,223,372,036,854,775,807, at the cost of an 8x larger memory footprint.\nThere can also be computational performance implications for different sizes.\n\n```python\nimport numpy as np\n\narray_int_64 = np.ones((1000, 1000), dtype=np.int64)\nprint(array_int_64.nbytes)  # 8000000\n\narray_int_8 = np.ones((1000, 1000), dtype=np.int8)\nprint(array_int_8.nbytes)  # 1000000\n```\n\nAs you can see, the 8-bit integer array decreases the memory usage by 87.5%.\n\n### 2. 
Appropriate machine representation\n\nCompressio uses visions to infer the semantic type of data and coerce it into alternative computational representations that minimize memory impact while maintaining its semantic meaning.\n\n\nFor instance, although pandas can use the generic object dtype to store boolean sequences, this comes at the cost of a 4x larger memory footprint.\nVisions can automatically handle these circumstances and find an appropriate representation for your data.\n\n```python\n\u003e\u003e\u003e import pandas as pd\n\u003e\u003e\u003e # dtype: object\n\u003e\u003e\u003e series = pd.Series([True, False, None, None, None, None, True, False] * 1000)\n\u003e\u003e\u003e print(series.nbytes)\n64000\n\n\u003e\u003e\u003e # dtype: boolean (pandas' nullable boolean)\n\u003e\u003e\u003e new_series = series.astype(\"boolean\")\n\u003e\u003e\u003e print(new_series.nbytes)\n16000\n```\n\nFurther background information is available in the [visions documentation](https://dylan-profiler.github.io/visions/visions/applications/compression.html), [github repository](https://github.com/dylan-profiler/visions) and [JOSS publication](https://joss.theoj.org/papers/10.21105/joss.02145).\n\n### 3. Efficient data structures\n\nWithout additional instructions, pandas represents your data as *dense* arrays. This is a good all-round choice. \n\nWhen your data is not randomly distributed, it can be compressed ([Theory](https://simonbrugman.nl/2020/04/02/searching-for-neural-networks-with-low-kolmogorov-complexity.html#kolmogorov-complexity)).\n\nLow cardinality data can often be more efficiently stored using [sparse data structures](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.arrays.SparseArray.html#pandas.arrays.SparseArray), which are provided by pandas by default. 
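As a minimal sketch of the idea (plain pandas, not compressio's own API; the variable names are illustrative), converting a mostly-constant Series to a sparse dtype cuts its footprint considerably:

```python
import numpy as np
import pandas as pd

# A Series that is 99% zeros: a prime candidate for a sparse representation.
dense = pd.Series(np.zeros(100_000))
dense[::100] = 1.0

# The sparse dtype stores only the non-fill values and their integer indices.
sparse = dense.astype(pd.SparseDtype("float64", fill_value=0.0))

print(dense.memory_usage(deep=True))
print(sparse.memory_usage(deep=True))
print(sparse.sparse.density)  # fraction of non-fill values: 0.01
```

The rarer the non-fill values, the larger the saving; `fill_value` selects which value is stored implicitly.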
\nThese structures offer efficiency by storing the predominant values only once and instead keeping indices for all other values.\n\n[This notebook](https://github.com/dylan-profiler/compressio/raw/master/examples/notebooks/Sparse%20Data.ipynb) shows how to use compressio with sparse data structures.\n\nData structure optimization is not limited to sparse arrays but instead includes numerous domain-specific opportunities such as [run-length encoding (RLE)](https://www.dlsi.ua.es/~carrasco/papers/RLE%20-%20Run%20length%20Encoding.html), which can be applied to compress sequential data. \nWe note that a pandas-specific third-party implementation is currently under development: [RLEArray](https://github.com/JDASoftwareGroup/rle-array).\n\n## Usage\n\n### Installation\n\nYou can easily install compressio with pip:\n\n```\npip install compressio\n```\n\nOr, alternatively, install from source.\n\n```\ngit clone https://github.com/dylan-profiler/compressio.git\n```\n\n### Examples\n\n[![Code example](https://github.com/dylan-profiler/compressio/raw/master/images/notebook-example.png)](examples/notebooks/Compressio.ipynb).\n\nThere is a collection of example notebooks to play with in the [examples directory](https://github.com/dylan-profiler/compressio/raw/master/examples/notebooks/) with a quick start notebook available [here](https://github.com/dylan-profiler/compressio/raw/master/examples/notebooks/Compressio.ipynb).\n\n## Optimizing strings in pandas\n\nPandas allows for multiple ways of storing strings: as Python string objects or as the `category` dtype (`pandas.Categorical`). Recent versions of pandas also provide a dedicated `string` dtype (`pandas.StringDtype`).\n\nHow you store strings in pandas can significantly impact the RAM required. 
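To make the impact concrete, here is a hedged sketch (exact byte counts vary by pandas version and platform) comparing the storage options on a low-cardinality Series:

```python
import pandas as pd

# 10,000 strings drawn from only three distinct values: low cardinality.
values = ["red", "green", "blue"] * 3_333 + ["red"]

as_object = pd.Series(values, dtype=object)
as_string = pd.Series(values, dtype="string")
as_category = pd.Series(values, dtype="category")

for name, series in [("object", as_object), ("string", as_string), ("category", as_category)]:
    print(f"{name:>8}: {series.memory_usage(deep=True)} bytes")
```

With so few distinct values, the categorical representation stores each string once plus small integer codes, so it comes out far smaller than the object representation.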
\n\n[![Memory usage of string representations in pandas](https://github.com/dylan-profiler/compressio/raw/master/images/str-type-1.1.0.png)](examples/notebooks/pandas%20string%20type%20analysis.ipynb)\n\nThe key insights from this analysis are:\n- The `category` representation is more memory efficient when values recur frequently, while the `string` representation becomes preferable as the percentage of distinct values increases. \n- The size of the Series is _not_ decisive for the string representation choice.\n\nYou can find the full analysis [here](https://github.com/dylan-profiler/compressio/raw/master/examples/notebooks/pandas%20string%20type%20analysis.ipynb).\n\n## Gotchas\n\nCompressing DataFrames can be helpful in many situations, but not all.\nBe mindful of how to apply it in the following cases:\n\n- _Overflow_: compressing to a smaller dtype can lead to overflows if the array is manipulated afterwards. \nThis can be an issue, for instance, for [numpy integers](https://mortada.net/can-integer-operations-overflow-in-python.html). If this is a problem for your application, you can explicitly choose a precision.\n\n- _Compatibility_: other libraries may make different decisions about how to handle your compressed data.\nOne example where code needs to be adjusted to the compressed data is when the sparse data structure is used in combination with [`.groupby`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.groupby.html). 
(`observed` must be set to `True`).\nThis [article](https://pythonspeed.com/articles/numpy-memory-footprint/#when-these-strategies-wont-work) provides another example: for some functions, scikit-image immediately converts a given array to the float64 dtype.\n","funding_links":[],"categories":["Feature Extraction"],"sub_categories":["General Feature Extraction"],"project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fdylan-profiler%2Fcompressio","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fdylan-profiler%2Fcompressio","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fdylan-profiler%2Fcompressio/lists"}