Ecosyste.ms: Awesome

An open API service indexing awesome lists of open source software.

Awesome Lists | Featured Topics | Projects

https://github.com/Vindaar/nimhdf5

Wrapper and some simple high-level bindings for the HDF5 library for the Nim language
https://github.com/Vindaar/nimhdf5

Last synced: about 2 months ago
JSON representation

Wrapper and some simple high-level bindings for the HDF5 library for the Nim language

Awesome Lists containing this project

README

        

* nimhdf5
[[https://github.com/Vindaar/nimhdf5/workflows/nimhdf5%20CI/badge.svg]]

This repository contains thin high-level bindings for the [[https://www.hdfgroup.org/HDF5/][HDF5 data
format]] for the Nim programming language. It also provides a wrapper of
the full HDF5 C library, importable using:
#+BEGIN_SRC nim
import nimhdf5/hdf5_wrapper
#+END_SRC

The raw wrapper dynamically links the libhdf5.so (the main library)
and libhdf5_hl.so (a library containing high-level convenience
functions) libraries at runtime. All public functions of the two
libraries are callable by their corresponding C names and C
arguments. That means the Nim datatypes need to be manually cast to
the corresponding compatible types, if only the wrapper is used.

The high-level bindings, while covering most general HDF5 features,
are still in a rough state, due to limited testing by actual
users. Most features should work fine (at least using linux), but some
known bugs are still there. See
[[file:examples/h5_high_level_example.nim]] as an overview (soon to be
cleaned up, split and put into a tutorial form) on the available
features and their usage. Also take a look at the tests for simple
examples of specific features. For a more indepth example of usage of
this library, take a look at:
- [[https://github.com/Vindaar/TimepixAnalysis/blob/master/InGridDatabase/src/ingridDatabase]]
as an example using HDF5 as a simple database
- https://github.com/Vindaar/TimepixAnalysis/blob/master/Analysis/ingrid/raw_data_manipulation.nim
for usage of more advanced features like variable length data,
writing hyperslabs, creating hard links, etc. in a bigger project.

The wrapper was built making heavy use of [[https://www.github.com/nim-lang/c2nim][c2nim]] and the main goal was
to have a usable interface for the HDF5 data format. More advanced
features (e.g. single writer / multiple reader) were of lower
priority. Via the wrapper all features should in principle work, but
many have not been tested.

** Compatibility

The wrapper is currently tested using Nim version =0.18.0= and the
current devel branch (=0.18.1=).

The wrapper is built from HDF5 version =1.10.1=.

Linking against the HDF5 =1.8= library is reasonably supported as
well, but requires to use an additional compiler flag for now:
#+BEGIN_SRC sh
-d:H5_LEGACY
#+END_SRC
With the recent release of version =1.10.3= / =1.10.4=, support for
this version currently is available under the
#+BEGIN_SRC
-d:H5_FUTURE
#+END_SRC
flag. Soon this will become the default version! The
=H5Oget/visit_...2= procedures are wrapped under a name without a =2=
suffix. These however add a =fields= argument. Therefore an overload
is available, which maps the =fields= to =H5O_INFO_ALL=.

*** Troubleshooting
If you compile a nim program using =nimhdf5= without any of those
flags and try to run it on a system with a HDF5 shared library of
version =1.10.3= or newer, you will be greeted by:
#+BEGIN_SRC sh
could not import: H5Oget_info
#+END_SRC
In that case, add the =-d:H5_FUTURE= flag to the compilation command
(or probably add it to your =nim.cfg= or =config.nims= of the
project).

On the other hand if you try to run such a compiled binary on a system
with a HDF5 library of version =1.8=, you will probably see:
#+BEGIN_SRC sh
could not import: H5P_LST_FILE_CREATE_g
#+END_SRC
add the =-d:H5_LEGACY= flag.

In case neither of these work, please open an issue!

*** Version checks in the wrapper
Currently no checks are done, which compare the library this wrapper
is built upon with the library linking against using some of the
provided HDF5 macros (e.g. H5check, H5get_libversion etc.). The main
reason is explained below. However, as far as the high-level
functionality is concerned at the moment, the only differences arise
in a few constant definitions, whose names slightly changed from =1.8=
to =1.10=. This is what the compiler flags sets accordingly.

The HDF5 headers contain macros for many variables, such as
#+BEGIN_SRC C
#define H5F_ACC_RDONLY (H5CHECK H5OPEN 0x0000u)
#+END_SRC
where
#+BEGIN_SRC C
#define H5CHECK H5check(),
#+END_SRC
and
#+BEGIN_SRC C
#define H5OPEN H5open(),
#+END_SRC
i.e. it makes use of C's comma operator. However, c2nim currently
[[https://nim-lang.org/docs/c2nim.html#limitations][has no support for it]]. Instead of porting them in some reasonable way,
these macros were converted to simple replacements with the values,
dropping the calls to H5check() and H5open().

The call to H5check() is currently not used at all. Compiling a Nim
program with this wrapper (based on version =1.10.1=) would normally
fail to check against the linked library, if that version is different.

As H5open() is important, the calls are replaced by a single call to
initialize the library at the beginning upon the first call of the
library via [[file:src/nimhdf5/wrapper/H5niminitialize.nim]].

As HDF5 is a very macro heavy library, other important macros may not
have been correctly wrapped to Nim, e.g. determination of correct
sizes of data types. This may cause some weird side-effects (to be
fair, I haven't noticed any!).

Additionally, Windows support is unknown at this time. The library
name is correctly set for Windows, however an additional header file
=H5FDwindows.h= might have to be wraped.

** Installation

Installation can either be done via nimble:
#+BEGIN_SRC sh
nimble install nimhdf5
#+END_SRC

or manually by cloning this git repository:
#+BEGIN_SRC sh
git clone https://github.com/vindaar/nimhdf5
#+END_SRC
in a folder of your choice and call nimble install afterwards:
#+BEGIN_SRC sh
cd nimhdf5
nimble install
#+END_SRC

Or simply make use of nimble's Github interfacing capabilities:
#+BEGIN_SRC sh
nimble install https://github.com/vindaar/nimhdf5
#+END_SRC

** Files

The folder [[file:c_headers/][c_headers]] contains the modified HDF5 headers in the state
they were in for a successful c2nim conversion. In some cases the C
header file had to be modified, in others modification to the
resulting .nim file was still necessary.

The folder [[file:examples/][examples]] contains the basic HDF5 C examples (see here:
[[https://support.hdfgroup.org/HDF5/examples/intro.html#c]]) converted to
Nim utilizing the wrapper.

[[file:examples/h5_high_level_example.nim][h5_high_level_example.nim]] serves as a replacement for a tutorial for
now (tutorial will be added soon!), showcasing (almost) all available
features and their usage.

** Known bugs and quirks

The high level bindings come with several quirks which are good to
know.

- when reading back a dataset with dimension > 1, the returned data is
returned in a flat =seq=, instead of e.g. a nested
=seq[seq[]]= as one might expect.
To get the data in the correct shape, use the =reshape= or
(=reshape2D=, =reshape3D=) procs from =util.nim=. See the example
file or the following tests: [[file:tests/tutil.nim][tutil.nim]], [[file:tests/treshape.nim][treshape.nim]] for the usage.
The exception is variable length data in case of a 1D dataset
containing seqs of varying sizes. Here a nested seq of the correct
elements is returned.
- when grabbing a group or dataset from a H5FileObj via =[](name:
string)=, a conversion of the string to a distinct =string= type
=grp_str= or =dset_str= is used to provide a uniform interface for
both from a file object.
- 1D datasets do not have shape =(N, )= as one would see in Python,
but are represented by =(N, 1)= instead.
- and many more

** Implemented HDF5 features
- groups
- creating (nested) groups
- iterating over groups (recursively)
- datasets
- writing / reading static sized N-D arrays of any type
- writing / reading variable length data
- chunked storage
- data types:
- any basic nim type, that is:
- SomeNumber (all ints and floats)
- string (not for datasets atm)
- compound datatypes of objects / tuples, where the fields have to
be of the above mentioned basic types.
- hyperslabs
- writing / reading hyperslabs using H5 notation
- compression / filters
- zlib compression
- szip compression
- blosc compression (external)
User needs to compile / install:
- https://github.com/Blosc/c-blosc
[[https://github.com/Vindaar/nblosc]]
Note: Windows / OSX not yet supported, due to wrong name of
=libblosc.so= in [[https://github.com/Vindaar/nblosc/blob/master/blosc.nim#L6][blosc.nim#L6]]. Change it appropriately.
- _sort of soon:_ fletcher32, shuffle, nbits
- attributes
- writing / reading on datasets, groups
- all types supported
- basic types (int, float, ...)
- seqs of basic types
- strings
- reading variable length strings
(different from static length strings in H5 attributes!)
- hardlink datasets and groups within a file
- iterators over:
- groups
- datasets
- attributes
- Single Writer Multiple Reader (SWMR). See for more info below.

*** Single Writer Multiple Reader

This wrapper fully supports the Single Writer Multiple Reader feature
of the HDF5 library, but it is still in an experimental state, as I've
never really needed it.

It allows to access a single HDF5 file from multiple threads or
processes, where one of these is a writer process and all others are
readers. When using this feature the user does not have to worry about
locks etc. between the different processes.

**** Usage
Open an HDF5 file in write mode and hand the =swmr= flag:
#+begin_src nim
# writer.nim
import nimhdf5
var h5f = H5open("/tmp/test.h5", "rw", swmr = true)
# do writing stuff
#+end_src
and in all reader threads / processes, simply do the same, but do
*not* hand a write:
#+begin_src nim
# reader.nim
import nimhdf5
var h5f = H5open("/tmp/test.h5", "r", swmr = true)
#+end_src
This *should* be all that is required.

I'm not sure if the writer process should make sure to flush the file
regularly or not. Feel free to tell me if you know. :)

***** Alternative writer

An alternative to the above for the writer process is to first open
the file in write mode without ~swmr = true~ and then later put it
into =swmr= mode via:
#+begin_src nim
import nimhdf5
var h5f = H5open("/tmp/test.h5", "rw")
# do some regular stuff
# and then activate SWMR later
h5f.activateSWMR()
#+end_src

*** Threadsafe HDF5 library & file locks

This wrapper can also be used with a HDF5 library that was compiled
with the =--enable-threadsafe= compilation flag.

Once the library has been compiled with it, in principle the user can
try to open a single file in write mode from multiple processes or
threads. For safe handling in these contexts, it may be up to the user
to lock access to the file / writing to individual datasets via some
locking mechanism.
In principle the threadsafe option of the library adds its own mutex
logic, so in theory it should work without them.

See these notes about the threadsafe library:
https://support.hdfgroup.org/HDF5/doc/TechNotes/ThreadSafeLibrary.html

One issue a user might encounter is that the second opening of a file
yields an error saying that the resource is temporarily unavailable.
In HDF5 version starting from =1.10=, file locking was added as a
feature.

This behavior is controlled via an environment variable:
#+begin_src sh
export HDF5_USE_FILE_LOCKING=FALSE
#+end_src
If set to false, file locking is *disabled*. With it multiple
processes may open the same file.

When doing this, keep in mind that each thread / process will receive
their own =FileID=. Some HDF5 functions may either give the user
information based on the specific file ID and others based on the
actual file. In the cases where one can choose, it is supported via
the =okLocal= (=ObjectKind= enum) or =fkLocal= (=FlushKind= enum).

Relevant part of the documentation:
https://docs.hdfgroup.org/hdf5/develop/_h5public_8h.html#title31

** Blosc support

To use blosc as a filter you need to import:
#+begin_src nim
import nimhdf5/blosc
#+end_src
Before =v0.3.12= this was done automatically if the =nblosc= library
is installed.