https://github.com/djeada/hdf5-examples
This repository contains a collection of code examples demonstrating various techniques and methods for working with HDF5 (Hierarchical Data Format version 5) files. These examples are designed to help developers and data scientists efficiently manage, process, and analyze large datasets stored in HDF5 format.
compression hdf5-format serialization
- Host: GitHub
- URL: https://github.com/djeada/hdf5-examples
- Owner: djeada
- License: mit
- Created: 2021-01-08T15:01:10.000Z (about 5 years ago)
- Default Branch: main
- Last Pushed: 2022-05-08T23:19:17.000Z (over 3 years ago)
- Last Synced: 2025-02-05T11:51:56.306Z (11 months ago)
- Topics: compression, hdf5-format, serialization
- Language: Python
- Homepage:
- Size: 6.06 MB
- Stars: 0
- Watchers: 3
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
README
# HDF5
Code examples for processing HDF5 files.
About HDF5
HDF5 is a highly extensible and flexible file format for storing data.
For example, you can store a large number of images in a single HDF5 file.
* Stands for "Hierarchical Data Format".
* Current version is 5.
* Open-source and free.
* The core implementation can be used directly from C, C++, and Java. There are wrappers for several other languages, including Python.
Using HDF5
To use HDF5 from Python, install the h5py package (for example, with `pip install h5py`).
You can then use it to read and write HDF5 files.
For example, to read a file called "myfile.h5" that contains a dataset named "data":
```Python
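import h5py

# Open the file read-only; the context manager closes it on exit.
with h5py.File('myfile.h5', 'r') as f:
    print(list(f.keys()))    # names of the top-level objects
    print(f['data'].shape)   # shape of the dataset named 'data'
    print(f['data'][:])      # read the whole dataset into a NumPy array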
```
To write data, open a file object in write mode and create a dataset in it.
For example, to create a new file called "myfile.h5":
```Python
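import h5py
import numpy as np

# Mode 'w' creates the file and overwrites any existing file with that name.
with h5py.File('myfile.h5', 'w') as f:
    # A 1D dataset of 100 integers, filled with 0..99.
    data_set = f.create_dataset('data', (100,), dtype='i')
    data_set[:] = np.arange(100)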
```
Structure
* Groups (a concept similar to directories)
- Groups can contain datasets and other groups.
* Datasets (a concept similar to files)
- Shape (ex. 1D, 2D, 5D)
- Datatype (ex. float, int32)
- Storage properties (ex. chunking, compression)
- Attributes (ex. small named metadata such as units)
- Data (ex. data[:])
- Slices (ex. data[10:20])
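The group and dataset hierarchy described above can be sketched with h5py as follows. This is a minimal illustration, not code from the repository: the file name `structure.h5`, the group `images`, the dataset `train`, and the attribute names are all hypothetical.

```python
import os
import tempfile

import h5py
import numpy as np


def build_example(path):
    """Write a small file with one group, one dataset, and two attributes."""
    with h5py.File(path, 'w') as f:
        images = f.create_group('images')               # group, similar to a directory
        train = images.create_dataset(                  # dataset, similar to a file
            'train', data=np.zeros((10, 28, 28), dtype='float32'))
        train.attrs['units'] = 'normalized intensity'   # attribute on a dataset (hypothetical name)
        images.attrs['source'] = 'synthetic'            # attribute on a group (hypothetical name)


def describe(path):
    """Return {dataset path: (shape, dtype name)} for every dataset in the file."""
    found = {}

    def visitor(name, obj):
        # visititems calls this for every group and dataset below the root.
        if isinstance(obj, h5py.Dataset):
            found[name] = (obj.shape, obj.dtype.name)

    with h5py.File(path, 'r') as f:
        f.visititems(visitor)
    return found


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'structure.h5')
    build_example(path)
    print(describe(path))
```

Datasets nested in groups are addressed with slash-separated paths, so the dataset above can also be opened as `f['images/train']`.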
Linear vs Chunked
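This concept differentiates HDF5 from many other data formats.
Chunked datasets are split into fixed-size chunks that are stored and accessed independently.
This allows faster access to subsets of the data, and it is required for filters such as compression.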
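Linear (called contiguous in HDF5):
- Data is stored in a single file.
- Data is stored as a single contiguous block.
- Reading a subset may touch many separate parts of that block.
Chunked:
- Data is stored in a single file.
- Data is split into multiple chunks of equal shape.
- Each chunk is stored as a separate block and is read or written as a whole.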
Chunk size must strike a balance between:
- maximizing I/O speed.
- minimizing I/O of unused data.
- minimizing the overhead cost of chunking.
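As a rough sketch of the two layouts, the example below writes the same array once with the default contiguous layout and once with an explicit chunk shape, then reads the same block from each. The file name `layouts.h5`, the dataset names `linear` and `chunked`, and the 100 x 100 chunk shape are assumptions chosen for illustration, not recommendations.

```python
import os
import tempfile

import h5py
import numpy as np


def write_layouts(path, shape=(1000, 1000), chunk_shape=(100, 100)):
    """Write the same array with a contiguous (linear) and a chunked layout."""
    data = np.arange(shape[0] * shape[1], dtype='int32').reshape(shape)
    with h5py.File(path, 'w') as f:
        f.create_dataset('linear', data=data)                        # default: contiguous
        f.create_dataset('chunked', data=data, chunks=chunk_shape)   # explicit chunk shape


def read_block(path, name):
    """Read a 100 x 100 block and return it with the dataset's chunk shape."""
    with h5py.File(path, 'r') as f:
        dset = f[name]
        # dset.chunks is None for a contiguous dataset.
        return dset[200:300, 500:600], dset.chunks


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'layouts.h5')
    write_layouts(path)
    for name in ('linear', 'chunked'):
        block, chunks = read_block(path, name)
        print(name, block.shape, chunks)
```

The block requested here lines up with exactly one chunk, so the chunked dataset can serve it from a single chunk, while in the contiguous dataset the same block is spread over 100 separate row segments. A block that straddles chunk boundaries would instead require reading every chunk it touches, which is the unused-data cost listed above.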
Filter
A filter transforms data as it is written to and read from disk. Compression is the most common use.
- Can be applied to chunked datasets.
- It is a layer between the program and the data.
Program <- Filter (CPU) <- data (Disk).
Examples:
* Gzip (compression filter)
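* ScaleOffset (stores each value as an offset from the minimum using fewer bits, then adds the minimum back while reading)
* Szip (compression filter)
* Shuffle (rearranges the bytes of the values so that a following compression filter works better)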
* Fletcher32 (checksum)
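A minimal sketch of how some of these filters can be enabled through h5py's `create_dataset` keywords is shown below. The file name `filters.h5`, the dataset names `raw` and `filtered`, the gzip level, and the test data are assumptions made for illustration.

```python
import os
import tempfile

import h5py
import numpy as np


def write_filtered(path, data):
    """Store `data` once without filters and once with shuffle, gzip, and fletcher32."""
    with h5py.File(path, 'w') as f:
        f.create_dataset('raw', data=data)
        f.create_dataset(
            'filtered', data=data,
            chunks=True,           # filters need a chunked layout; h5py picks the chunk shape
            shuffle=True,          # Shuffle filter, applied before compression
            compression='gzip',    # Gzip filter
            compression_opts=4,    # gzip level (0-9), an arbitrary choice here
            fletcher32=True,       # Fletcher32 checksum on each chunk
        )


def stored_sizes(path):
    """Return {dataset name: bytes used on disk} for both datasets."""
    with h5py.File(path, 'r') as f:
        return {name: f[name].id.get_storage_size() for name in ('raw', 'filtered')}


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'filters.h5')
    write_filtered(path, np.arange(1_000_000, dtype='int32'))
    print(stored_sizes(path))
```

Filters are applied transparently: reading `f['filtered'][:]` returns the original values without any extra decompression step in user code. How much space is saved depends heavily on the data, so the sizes printed here should not be taken as typical.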
Code Samples
* Basic IO
* Groups
* Compression
* Attributes
* Custom Class