https://github.com/blosc/b2h5py

Transparent optimized reading of n-dimensional Blosc2 slices for h5py
https://github.com/blosc/b2h5py

Last synced: about 1 month ago
JSON representation

Transparent optimized reading of n-dimensional Blosc2 slices for h5py

Host: GitHub
URL: https://github.com/blosc/b2h5py
Owner: Blosc
License: bsd-3-clause
Created: 2023-11-29T12:13:32.000Z (over 1 year ago)
Default Branch: main
Last Pushed: 2024-01-11T13:02:08.000Z (over 1 year ago)
Last Synced: 2024-10-29T14:53:12.066Z (7 months ago)
Language: Python
Size: 172 KB
Stars: 5
Watchers: 6
Forks: 2
Open Issues: 0
Metadata Files:
- Readme: README.rst
- License: LICENSE

Awesome Lists containing this project

README

        b2h5py

======

``b2h5py`` provides h5py_ with transparent, automatic optimized reading of n-dimensional slices of Blosc2_-compressed datasets. This optimized slicing leverages direct chunk access (skipping the slow HDF5 filter pipeline) and 2-level partitioning into chunks and then smaller blocks (so that less data is actually decompressed).

.. _h5py: https://www.h5py.org/

.. _Blosc2: https://www.blosc.org/

Benchmarks of this technique show 2x-5x speed-ups compared with normal filter-based access. Comparable results are obtained with a similar technique in PyTables, see `Optimized Hyper-slicing in PyTables with Blosc2 NDim`_.

.. image:: doc/benchmark.png

.. _Optimized Hyper-slicing in PyTables with Blosc2 NDim: https://www.blosc.org/posts/pytables-b2nd-slicing/

Usage

-----

This optimized access works for slices with step 1 on Blosc2-compressed datasets using the native byte order. It is enabled by monkey-patching the ``h5py.Dataset`` class to extend the slicing operation. The easiest way to do this is::

    import b2h5py.auto

After that, optimization will be attempted for any slicing of a dataset (of the form ``dataset[...]`` or ``dataset.__getitem__(...)``). If the optimization is not possible in a particular case, normal h5py slicing code will be used (which performs HDF5 filter-based access, backed by hdf5plugin_ to support Blosc2).

.. _hdf5plugin: https://github.com/silx-kit/hdf5plugin

You may instead just ``import b2h5py`` and explicitly enable the optimization globally by calling ``b2h5py.enable_fast_slicing()``, and disable it again with ``b2h5py.disable_fast_slicing()``. You may also enable it temporarily by using a context manager::

    with b2h5py.fast_slicing():

        # ... code that will use Blosc2 optimized slicing ...

Finally, you may explicitly enable optimizations for a given h5py dataset by wrapping it in a ``B2Dataset`` instance::

    b2dset = b2h5py.B2Dataset(dset)

    # ... slicing ``b2dset`` will use Blosc2 optimization ...

Building

--------

Just install PyPA build (e.g. ``pip install build``), enter the source code directory and run ``pyproject-build`` to get a source tarball and a wheel under the ``dist`` directory.

Installing

----------

To install as a wheel from PyPI, run ``pip install b2h5py``.

You may also install the wheel that you built in the previous section, or enter the source code directory and run ``pip install .`` from there.

Running tests

-------------

If you have installed ``b2h5py``, just run ``python -m unittest discover b2h5py.tests``.

Otherwise, just enter its source code directory and run ``python -m unittest``.

You can also run the h5py tests with the patched ``Dataset`` class to check that patching does not break anything. You may install the ``h5py-test`` extra (e.g. ``pip install b2h5py[h5py-test]`` and run ``python -m b2h5py.tests.test_patched_h5py``.

ecosyste.ms

Data

Tools

Indexes

Applications

Experiments

Awesome

https://github.com/blosc/b2h5py

Awesome Lists containing this project

README