https://github.com/djeada/hdf5-examples

This repository contains a collection of code examples demonstrating various techniques and methods for working with HDF5 (Hierarchical Data Format version 5) files. These examples are designed to help developers and data scientists efficiently manage, process, and analyze large datasets stored in HDF5 format.
compression hdf5-format serialization

Host: GitHub
URL: https://github.com/djeada/hdf5-examples
Owner: djeada
License: mit
Created: 2021-01-08T15:01:10.000Z (about 5 years ago)
Default Branch: main
Last Pushed: 2022-05-08T23:19:17.000Z (over 3 years ago)
Last Synced: 2025-02-05T11:51:56.306Z (11 months ago)
Topics: compression, hdf5-format, serialization
Language: Python
Homepage:
Size: 6.06 MB
Stars: 0
Watchers: 3
Forks: 0
Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE

# HDF5

Code examples for processing HDF5 files.

## About HDF5

HDF5 is a file format for storing data that is highly extensible and flexible.

For example, you can store a large number of images in a single HDF5 file.

* Stands for "Hierarchical Data Format".

* Current version is 5.

* Open-source and free.

* The core library is written in C, with official C++, Fortran, and Java interfaces. There are wrappers for several other languages, including Python.

## Using HDF5

To use HDF5 from Python, install the h5py package (for example with `pip install h5py`).

You can then use it to read and write HDF5 files.

For example, to read a file called "myfile.h5":

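```Python
import h5py

# Open the file read-only; the context manager closes it on exit.
with h5py.File('myfile.h5', 'r') as f:
    print(list(f.keys()))    # names of the top-level groups and datasets
    print(f['data'].shape)   # shape of the dataset named 'data'
    print(f['data'][:])      # read the whole dataset into a NumPy array
```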

To write data, open the file in write mode (`'w'`). This creates the file, or overwrites it if it already exists.

For example, to create a new file called "myfile.h5":

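```Python
import h5py
import numpy as np

# Mode 'w' creates the file, truncating it if it already exists.
with h5py.File('myfile.h5', 'w') as f:
    # 1D dataset of 100 integers ('i' is a C int, typically 32-bit)
    data_set = f.create_dataset('data', (100,), dtype='i')
    data_set[:] = np.arange(100)
```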

## Structure


* Groups (a concept similar to directories)

  - Groups can contain datasets and other groups.

  

* Datasets (a concept similar to files)

  - Shape (ex. 1D, 2D, 5D)

  - Datatype (ex. float, int32)

  - Attributes and storage properties (ex. units, chunking, compression)

  - Data (ex. data[:])

  - Subdatasets (ex. subdataset[:])
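The hierarchy above can be sketched with h5py as follows. This is a minimal illustration, not code from the repository: the file name, the group path `experiment/raw`, the dataset name `temperature`, and the `units` attribute are all hypothetical names chosen for the example.

```python
import os
import tempfile

import h5py
import numpy as np

# Hypothetical file and object names, chosen only for this sketch.
path = os.path.join(tempfile.mkdtemp(), "structure_demo.h5")

with h5py.File(path, "w") as f:
    # Groups nest like directories; intermediate groups are created as needed.
    raw = f.create_group("experiment/raw")
    # A dataset carries a shape and a datatype (here taken from the array).
    temps = raw.create_dataset("temperature", data=np.linspace(20.0, 25.0, 6))
    # Attributes are small pieces of metadata attached to a group or dataset.
    temps.attrs["units"] = "celsius"

with h5py.File(path, "r") as f:
    names = []
    f.visit(names.append)  # collect the path of every group and dataset
    dset = f["experiment/raw/temperature"]
    shape, dtype, units = dset.shape, dset.dtype, dset.attrs["units"]
    first_three = dset[:3]  # slicing reads only the requested part

print(names)
print(shape, dtype, units, first_three)
```

Objects are addressed with slash-separated paths, much like files in a directory tree, and slicing a dataset returns a NumPy array holding just the selected elements.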

## Linear vs Chunked

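This concept differentiates HDF5 from many other data formats.

Chunked datasets can be stored in a more compact way, because each chunk can be compressed.

Chunking also allows faster access to subsets of the data, because only the chunks that overlap a selection are read.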

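Linear (contiguous):

  - Data is stored as a single contiguous block in the file.

  - The shape is fixed when the dataset is created.

  - Filters such as compression cannot be applied.

Chunked:

  - Data is split into multiple fixed-size chunks.

  - Each chunk is stored as a separate block and found through an index.

  - All chunks still live in the same file, and the dataset can be resized.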

Chunk size must strike a balance:

  - maximizing I/O speed.

  - minimizing I/O of data that is read but not used.

  - minimizing the overhead of tracking many small chunks.
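As a rough sketch of the difference, the snippet below writes the same array once with the default linear (contiguous) layout and once with a chunked layout. The file name, dataset names, array size, and the 100 x 100 chunk shape are assumptions made for illustration, not recommendations.

```python
import os
import tempfile

import h5py
import numpy as np

# Hypothetical file name and sizes, chosen only for this sketch.
path = os.path.join(tempfile.mkdtemp(), "chunk_demo.h5")
data = np.arange(1000 * 1000, dtype="int32").reshape(1000, 1000)

with h5py.File(path, "w") as f:
    # Default layout: one contiguous block (h5py reports chunks as None).
    f.create_dataset("linear", data=data)
    # Chunked layout: 100 x 100 tiles; maxshape=None on axis 0 allows growth.
    f.create_dataset("chunked", data=data, chunks=(100, 100),
                     maxshape=(None, 1000))

with h5py.File(path, "r+") as f:
    linear_chunks = f["linear"].chunks
    chunked_chunks = f["chunked"].chunks
    # This selection lines up with exactly one 100 x 100 chunk.
    tile = f["chunked"][200:300, 500:600]
    # Only a chunked dataset can be resized after creation.
    f["chunked"].resize((1200, 1000))
    new_shape = f["chunked"].shape

print(linear_chunks, chunked_chunks, tile.shape, new_shape)
```

Reading a selection that matches the chunk grid avoids pulling in unused data, which is the trade-off the list above describes. A selection that cuts across many chunks forces each of them to be read in full.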

## Filter


A filter transforms data as it is written to and read from disk, most often to compress it.

  - Can be applied to chunked datasets.

  - It is a layer between the program and the data.

Program <- Filter (CPU) <- data (Disk).

Examples:

* Gzip (compression filter)

* ScaleOffset (stores each value as an offset from the chunk minimum using fewer bits, then adds the minimum back while reading)

* Szip (compression filter)

* Shuffle (reorders the bytes of the data so that a compression filter works better; does not compress by itself)

* Fletcher32 (checksum)
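A minimal sketch of enabling some of the filters listed above through h5py's `create_dataset` keywords. The file names, the test data, the chunk shape, and the gzip level of 4 are assumptions for illustration. How much the file shrinks depends entirely on the data.

```python
import os
import tempfile

import h5py
import numpy as np

# Hypothetical file names, chosen only for this sketch.
tmp = tempfile.mkdtemp()
plain_path = os.path.join(tmp, "plain.h5")
packed_path = os.path.join(tmp, "packed.h5")

# Slowly varying integers, which compress well.
data = (np.arange(1_000_000) // 1000).astype("int32")

with h5py.File(plain_path, "w") as f:
    f.create_dataset("data", data=data)  # no filters

with h5py.File(packed_path, "w") as f:
    f.create_dataset(
        "data",
        data=data,
        chunks=(10_000,),      # filters operate chunk by chunk
        shuffle=True,          # byte shuffle to help the compressor
        compression="gzip",    # gzip (deflate) compression filter
        compression_opts=4,    # gzip level, 0 to 9
        fletcher32=True,       # checksum stored with every chunk
    )

with h5py.File(packed_path, "r") as f:
    dset = f["data"]
    settings = (dset.compression, dset.compression_opts,
                dset.shuffle, dset.fletcher32)
    restored = dset[:]  # filters are undone transparently on read

plain_size = os.path.getsize(plain_path)
packed_size = os.path.getsize(packed_path)
print(settings, plain_size, packed_size)
```

The reading code is the same with or without filters, which reflects the "layer between the program and the data" idea: the filter pipeline runs inside the library, at the cost of some CPU time.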

## Code Samples


* Basic IO

* Groups

* Compression

* Attributes

* Custom Class
