https://github.com/Blosc/python-blosc2
- Host: GitHub
- URL: https://github.com/Blosc/python-blosc2
- Owner: Blosc
- License: other
- Created: 2021-03-29T08:59:55.000Z (over 3 years ago)
- Default Branch: main
- Last Pushed: 2024-10-29T00:09:46.000Z (16 days ago)
- Last Synced: 2024-10-29T17:18:56.771Z (15 days ago)
- Language: Jupyter Notebook
- Homepage: https://www.blosc.org/python-blosc2
- Size: 12.4 MB
- Stars: 82
- Watchers: 9
- Forks: 19
- Open Issues: 18
Metadata Files:
- Readme: README.rst
- Contributing: CONTRIBUTING.rst
- License: LICENSE.txt
- Code of conduct: code_of_conduct.md
=============
Python-Blosc2
=============

A Python wrapper for the extremely fast Blosc2 compression library
==================================================================

:Author: The Blosc development team
:Contact: [email protected]
:Github: https://github.com/Blosc/python-blosc2
:Actions: |actions|
:PyPi: |version|
:NumFOCUS: |numfocus|
:Code of Conduct: |Contributor Covenant|

.. |version| image:: https://img.shields.io/pypi/v/blosc2.svg
    :target: https://pypi.python.org/pypi/blosc2
.. |Contributor Covenant| image:: https://img.shields.io/badge/Contributor%20Covenant-v2.0%20adopted-ff69b4.svg
    :target: https://github.com/Blosc/community/blob/master/code_of_conduct.md
.. |numfocus| image:: https://img.shields.io/badge/powered%20by-NumFOCUS-orange.svg?style=flat&colorA=E1523D&colorB=007D8A
    :target: https://numfocus.org
.. |actions| image:: https://github.com/Blosc/python-blosc2/actions/workflows/build.yml/badge.svg
    :target: https://github.com/Blosc/python-blosc2/actions/workflows/build.yml

What it is
==========

`C-Blosc2 `_ is the latest major version of
`C-Blosc `_, and it is backward compatible with
both the C-Blosc1 API and its in-memory format. Python-Blosc2 is a Python package
that wraps C-Blosc2, the most recent version of the Blosc compressor.

Starting with version 3.0.0, Python-Blosc2 includes a powerful computing engine
capable of operating on compressed data stored in-memory, on-disk, or across the
network. This engine also supports advanced features such as reductions, filters,
user-defined functions, and broadcasting (the latter is still in beta).

You can read some of our tutorials on how to perform advanced computations at the
following links:

* https://github.com/Blosc/python-blosc2/blob/main/doc/getting_started/tutorials/03.lazyarray-expressions.ipynb
* https://github.com/Blosc/python-blosc2/blob/main/doc/getting_started/tutorials/03.lazyarray-udf.ipynb

Additionally, Python-Blosc2 aims to fully leverage the functionality of C-Blosc2, supporting
super-chunks (`SChunk `_),
multi-dimensional arrays (`NDArray `_),
metadata, serialization, and other features introduced in C-Blosc2.

**Note:** Blosc2 is designed to be backward compatible with Blosc(1) data.
This means it can read data generated by Blosc, but the reverse is not true
(i.e. there is no *forward* compatibility).

NDArray: an N-Dimensional store
===============================

One of the most useful abstractions in Python-Blosc2 is the
`NDArray `_ object.
It enables highly efficient reading and writing of n-dimensional datasets through
a two-level n-dimensional partitioning system. This allows for more fine-grained slicing
and manipulation of arbitrarily large and compressed data:

.. image:: https://github.com/Blosc/python-blosc2/blob/main/images/b2nd-2level-parts.png?raw=true
    :width: 75%

To pique your interest, here is how the ``NDArray`` object performs when retrieving slices
orthogonal to the different axes of a 4-dimensional dataset:

.. image:: https://github.com/Blosc/python-blosc2/blob/main/images/Read-Partial-Slices-B2ND.png?raw=true
    :width: 75%

We have written a blog post on this topic:
https://www.blosc.org/posts/blosc2-ndim-intro

We also have a ~2 min explanatory video on `why slicing in a pineapple-style (aka double partition)
is useful `_:

.. image:: https://github.com/Blosc/blogsite/blob/master/files/images/slicing-pineapple-style.png?raw=true
    :width: 50%
    :alt: Slicing a dataset in pineapple-style
    :target: https://www.youtube.com/watch?v=LvP9zxMGBng

Operating with NDArrays
=======================

The ``NDArray`` objects are easy to work with in Python-Blosc2.
Here is a simple example:

.. code-block:: python

    import numpy as np
    import blosc2

    N = 10_000
    na = np.linspace(0, 1, N * N, dtype=np.float32).reshape(N, N)
    nb = np.linspace(1, 2, N * N).reshape(N, N)
    nc = np.linspace(-10, 10, N * N).reshape(N, N)

    # Convert to blosc2
    a = blosc2.asarray(na)
    b = blosc2.asarray(nb)
    c = blosc2.asarray(nc)

    # Expression
    expr = ((a**3 + blosc2.sin(c * 2)) < b) & (c > 0)

    # Evaluate and get a NDArray as result
    out = expr.compute()
    print(out.info)

As you can see, the ``NDArray`` instances are very similar to NumPy arrays, but behind the scenes,
they store compressed data that can be processed efficiently using the new computing
engine included in Python-Blosc2.

To pique your interest, here is the performance (measured on a MacBook Air M2 with 24 GB of RAM)
you can achieve when the operands fit comfortably in memory:

.. image:: https://github.com/Blosc/python-blosc2/blob/main/images/eval-expr-full-mem-M2.png?raw=true
    :width: 100%
    :alt: Performance when operands fit in-memory

In this case, the performance is somewhat below that of top-tier libraries like Numexpr or Numba,
but it is still quite good. Using CPUs with more cores than the M2 could further reduce the
performance gap. One important point to note is that the memory consumption when
using the ``LazyArray.compute()`` method is very low because the output is an ``NDArray`` object, which
is compressed and stored in memory by default. On the other hand, the ``LazyArray.__getitem__()``
method returns an actual NumPy array, so it is not recommended for large datasets, as it can consume
a significant amount of memory (though it may still be convenient for small outputs).

It is also important to note that the ``NDArray`` object can utilize memory-mapped files, and the
benchmark above actually uses a memory-mapped file for operand storage. Memory-mapped files are
particularly useful when the operands do not fit in-memory, while still maintaining good
performance.

And here is the performance when the operands do not fit well in memory:

.. image:: https://github.com/Blosc/python-blosc2/blob/main/images/eval-expr-scarce-mem-M2.png?raw=true
    :width: 100%
    :alt: Performance when operands do not fit in-memory

In this latter case, the memory consumption figures may seem a bit extreme, but this is because
the displayed values represent actual memory consumption, not virtual memory. During evaluation,
the OS may need to swap some memory to disk. In this scenario, the performance compared to
top-tier libraries like Numexpr or Numba is quite competitive.

You can find the benchmark for the examples above at:
https://github.com/Blosc/python-blosc2/blob/main/bench/ndarray/lazyarray-expr.ipynb

Installing
==========

Blosc2 now provides Python wheels for the major operating systems (Windows, macOS and Linux) and platforms.
You can install the binary packages from PyPI using ``pip``:

.. code-block:: console

    pip install blosc2

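
To confirm that the installed package works, a quick compress/decompress round trip can help.
The following is only a minimal sketch: the helper name ``blosc2_roundtrip_ok`` and the sample
payload are made up for illustration, and it relies on nothing beyond ``blosc2.compress`` and
``blosc2.decompress``:

.. code-block:: python

    import importlib.util


    def blosc2_roundtrip_ok() -> bool:
        """Return True if blosc2 is importable and a round trip is lossless.

        Hypothetical helper for illustration; not part of the blosc2 package.
        """
        if importlib.util.find_spec("blosc2") is None:
            print("blosc2 is not installed; run 'pip install blosc2' first")
            return False

        import blosc2

        # Highly repetitive sample payload (an arbitrary choice for this sketch)
        payload = b"Make compression better! " * 1_000
        compressed = blosc2.compress(payload, typesize=1)
        restored = blosc2.decompress(compressed)
        print(f"blosc2 {blosc2.__version__}: {len(payload)} -> {len(compressed)} bytes")
        return restored == payload


    ok = blosc2_roundtrip_ok()
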

We are in the process of releasing 3.0.0, along with wheels for various
beta versions. For example, to install the first beta version, you can use:

.. code-block:: console

    pip install blosc2==3.0.0b1


Documentation
=============

The documentation is available here:
https://blosc.org/python-blosc2/python-blosc2.html

Additionally, you can find some examples at:
https://github.com/Blosc/python-blosc2/tree/main/examples
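
As a small taste of what those examples cover, here is a hedged sketch (not taken from the
``examples`` directory) that creates ``NDArray`` objects, reads a slice, and contrasts
``compute()`` with slicing on a lazy expression. It assumes Python-Blosc2 3.0 or later for the
lazy expression part; the array sizes and variable names are arbitrary choices for illustration:

.. code-block:: python

    import numpy as np
    import blosc2

    # Small demo operands (sizes chosen only to keep the sketch quick to run)
    na = np.linspace(0, 1, 1_000_000, dtype=np.float32).reshape(1_000, 1_000)
    nb = np.linspace(1, 2, 1_000_000, dtype=np.float32).reshape(1_000, 1_000)

    # Compress the NumPy arrays into NDArray containers
    a = blosc2.asarray(na)
    b = blosc2.asarray(nb)

    # The two-level partitioning: chunks, which are in turn split into blocks
    print("chunks:", a.chunks, "blocks:", a.blocks)
    print("compression ratio:", round(a.schunk.cratio, 1))

    # Slicing an NDArray returns a regular NumPy array
    row_block = a[10:12, :5]

    # Build a lazy expression; nothing is evaluated yet
    expr = a**2 + b

    # compute() evaluates everything and returns a compressed NDArray
    out = expr.compute()

    # Slicing the expression returns a NumPy array for just the requested region
    head = expr[:2, :3]
    print(type(out).__name__, type(head).__name__)

Keeping large results as ``NDArray`` objects via ``compute()`` and slicing only the small regions
you need as NumPy arrays tends to keep memory usage low.
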

Building from sources
=====================

``python-blosc2`` includes the C-Blosc2 source code and can be built in place:

.. code-block:: console

    git clone https://github.com/Blosc/python-blosc2/
    cd python-blosc2
    pip install .   # add -e for editable mode

That's it! You can now proceed to the testing section.

Testing
=======

After compiling, you can quickly verify that the package is functioning
correctly by running the tests:

.. code-block:: console

    pip install ".[test]"
    pytest  # add -v for verbose mode

Benchmarking
============

If you are curious, you may want to run a small benchmark that compares a plain
NumPy array copy against compression using different compressors in
your Blosc build:

.. code-block:: console

    python bench/pack_compress.py

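
To give an idea of what such a comparison looks like, here is a minimal sketch. It is not the
contents of ``bench/pack_compress.py``: the data, the ``best_of`` helper, the compression level
and the selection of codecs are all assumptions made for illustration:

.. code-block:: python

    import time

    import numpy as np
    import blosc2

    # Illustrative data: 2 million int64 values (about 16 MB)
    arr = np.arange(2_000_000, dtype=np.int64)
    raw = arr.tobytes()


    def best_of(func, repeats=3):
        """Run func several times; return (best elapsed seconds, last result)."""
        best = float("inf")
        result = None
        for _ in range(repeats):
            start = time.perf_counter()
            result = func()
            best = min(best, time.perf_counter() - start)
        return best, result


    # Baseline: a plain in-memory copy of the array
    copy_time, _ = best_of(arr.copy)
    print(f"NumPy copy: {copy_time * 1e3:.2f} ms")

    # Compress the same bytes with a few of the codecs shipped with Blosc2
    results = {}
    for codec in (blosc2.Codec.BLOSCLZ, blosc2.Codec.LZ4, blosc2.Codec.ZSTD):
        elapsed, packed = best_of(
            lambda: blosc2.compress(raw, typesize=arr.itemsize, clevel=5, codec=codec)
        )
        # Sanity check: decompression must give back the original bytes
        assert blosc2.decompress(packed) == raw
        ratio = len(raw) / len(packed)
        results[codec.name] = (elapsed, ratio)
        print(f"{codec.name}: {elapsed * 1e3:.2f} ms, ratio {ratio:.1f}x")

Timings depend heavily on the machine and on how compressible the data is, so treat the numbers
from a sketch like this as a rough indication only.
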

License
=======

This software is licensed under a 3-Clause BSD license. A copy of the
python-blosc2 license can be found in
`LICENSE.txt `_.

Mailing list
============

Discussions about this module are welcome on the Blosc mailing list:
https://groups.google.es/group/blosc

Mastodon
========

Please follow `@Blosc2 `_ to stay updated on the latest
developments. We recently moved from Twitter to Mastodon.

Citing Blosc
============

You can cite our work on the various libraries under the Blosc umbrella as follows:

.. code-block:: console

    @ONLINE{blosc,
        author = {{Blosc Development Team}},
        title = "{A fast, compressed and persistent data store library}",
        year = {2009-2024},
        note = {https://blosc.org}
    }

**Make compression better!**