
Design new HDF5 interface #5033

Open
jngrad opened this issue Jan 27, 2025 · 1 comment
jngrad commented Jan 27, 2025

TL;DR: ESPResSo needs to rewrite its hdf5 bridge to replace the aging h5xx library.

Background

Here are some of the main hdf5 C++ APIs available:

  • h5md/h5xx is no longer actively maintained and doesn't compile on modern toolchains due to SFINAE bugs
  • HighFive is no longer funded and is currently maintained in a fork by a single person
  • HDFGroup/hdf5 has C++11 serial bindings and C bindings

Requirements:

  • either header-only, or easily installable on HPC via EESSI/Spack, or supports CMake integration via FetchContent
  • preferably a C++ API, although C is also an option
  • must support parallel I/O

The hdf5 project is actively maintained, with a 2.0.0 release planned for the end of 2025. That release will use CMake, which is a requirement for integration in ESPResSo since these bindings are not header-only. However, the C++ bindings currently do not support parallel I/O, which is a hard requirement for ESPResSo.
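If the planned CMake support materializes, integration could look roughly like the FetchContent sketch below. This is speculative: the GIT_TAG and the espresso_hdf5 target name are placeholders, and the option names should be checked against the actual 2.0.0 release.

```cmake
include(FetchContent)
FetchContent_Declare(
  hdf5
  GIT_REPOSITORY https://github.com/HDFGroup/hdf5
  GIT_TAG        develop # placeholder until 2.0.0 is tagged
)
# parallel I/O is a hard requirement; the C++ bindings are serial-only
set(HDF5_ENABLE_PARALLEL ON CACHE BOOL "" FORCE)
set(HDF5_BUILD_CPP_LIB OFF CACHE BOOL "" FORCE)
FetchContent_MakeAvailable(hdf5)
target_link_libraries(espresso_hdf5 PRIVATE hdf5-static)
```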

Problem statement

ESPResSo needs to interface with efficient parallel I/O libraries to read and write data in a portable file format, such as HDF5 [1] and GSD [2] for particle data and VTK for lattice data. File formats for particle data need to support variable particle numbers for Monte Carlo, as well as metadata specifications (e.g. H5md [3]) for applications that need SI units (e.g. VOTCA [4]). The HDF5 file format fulfills these criteria and is already supported by ESPResSo; however, in the current implementation its scaling is much poorer than MPI-IO, and it relies on a header-only C++ interface that cannot be compiled on HPC systems due to known portability bugs.

Roadmap

Investigate suitable hdf5 C++ API replacements for h5xx. If none is found, consider using the hdf5 C API directly. This has the advantage that we don't need CMake integration or an extra library, and the header files are already packaged on HPC (both EESSI and Spack) and on most Linux distributions.
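As an illustration of the C-API route, parallel file creation through the MPI-IO file driver could look roughly like this. This is a sketch only: it requires an MPI-enabled HDF5 build and is not runnable without libhdf5 and MPI; the calls used (H5Pcreate, H5Pset_fapl_mpio, H5Fcreate) are standard C-API functions.

```cpp
// Minimal sketch of parallel file creation with the hdf5 C API.
// Requires an MPI-enabled HDF5 build; compile with e.g. mpicc/mpicxx -lhdf5.
#include <hdf5.h>
#include <mpi.h>

int main(int argc, char **argv) {
  MPI_Init(&argc, &argv);

  // file access property list selecting the MPI-IO driver
  hid_t fapl = H5Pcreate(H5P_FILE_ACCESS);
  H5Pset_fapl_mpio(fapl, MPI_COMM_WORLD, MPI_INFO_NULL);

  // all ranks create/open the same file collectively
  hid_t file = H5Fcreate("parallel.h5", H5F_ACC_TRUNC, H5P_DEFAULT, fapl);

  // datasets would be created here; dataset writes can be made collective
  // via a transfer property list: H5Pset_dxpl_mpio(dxpl, H5FD_MPIO_COLLECTIVE)

  H5Fclose(file);
  H5Pclose(fapl);
  MPI_Finalize();
  return 0;
}
```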

The new bridge could be designed like this:

  • create a new CMake shared object target that depends on the chosen hdf5 API
  • implement an hdf5 File class that can write basic particle properties to a file, like masses and positions
  • implement the H5md specification to store SI units of those properties
  • write a C++ unit test to check the interface (warning: beware of file system latency when reading a file that was just created)
  • implement all remaining particle and box properties and replace the existing espresso_hdf5 target with the new one

This project is part of MultiXscale Task 3.4 from Deliverable 3.3 due M30 (June 2025).

Footnotes

  1. HDF Group, API Reference, Introduction to HDF5.

  2. Glotzer Group, GSD documentation, HOOMD Schema.

  3. de Buyl et al. 2014, H5MD: A structured, efficient, and portable file format for molecular data, Computer Physics Communications 185(6):1546.

  4. Mashayak et al. 2015, Relative Entropy and Optimization-Driven Coarse-Graining Methods in VOTCA, PLoS ONE 10(7): e0131754.

jngrad commented Jan 27, 2025

Here is a minimal working example (MWE) for the hdf5 C++ serial bindings:

/* * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * *
 * Copyright by The HDF Group.                                               *
 * All rights reserved.                                                      *
 *                                                                           *
 * This file is part of HDF5.  The full HDF5 copyright notice, including     *
 * terms governing use, modification, and redistribution, is contained in    *
 * the LICENSE file, which can be found at the root of the source code       *
 * distribution tree, or in https://www.hdfgroup.org/licenses.               *
 * If you do not have access to either file, you may request a copy from     *
 * [email protected].                                                        *
 * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * */

#define BOOST_TEST_MODULE hdf5 test
#define BOOST_TEST_DYN_LINK
#include <boost/test/unit_test.hpp>

#include "H5Cpp.h" // C++ API header file

#include <array>
#include <vector>

BOOST_AUTO_TEST_CASE(attr_basic_write_then_read) {
  const int     SPACE1_RANK = 3;
  const hsize_t SPACE1_DIM1 = 3;
  const hsize_t SPACE1_DIM2 = 15;
  const hsize_t SPACE1_DIM3 = 13;
  const int          ATTR1_RANK          = 1; // rank of the 1st data stream
  const hsize_t      ATTR1_DIM1          = 3; // shape of the 1st data stream
  std::vector<int>   attr_data1          = {512, -234, 98123}; // Test data for 1st attribute
  const int          ATTR2_RANK          = 2; // rank of the 2nd data stream
  const hsize_t      ATTR2_DIM1          = 3; // shape of the 2nd data stream
  const hsize_t      ATTR2_DIM2          = 2;
  std::vector<std::array<int, ATTR2_DIM2>> attr_data2{
    {7614, -416}, {197814, -3}, {-4, 12}}; // Test data for 2nd attribute

  /* Write .h5 file */
  {
    // Create file
    H5::FileAccPropList fapl;
    fapl.setLibverBounds(H5F_LIBVER_LATEST, H5F_LIBVER_LATEST);
    H5::H5File fid1("attr_basic.h5", H5F_ACC_TRUNC, H5::FileCreatPropList::DEFAULT, fapl);

    hsize_t dims1[] = {SPACE1_DIM1, SPACE1_DIM2, SPACE1_DIM3};
    hsize_t dims2[] = {ATTR1_DIM1};
    hsize_t dims3[] = {ATTR2_DIM1, ATTR2_DIM2};

    /* Test attribute with dataset */
    // Create a dataset with a dataspace for attributes
    H5::DataSpace ds_space(SPACE1_RANK, dims1);
    H5::DataSet dataset = fid1.createDataSet("Dataset1", H5::PredType::NATIVE_UCHAR, ds_space);
    H5::DataSpace att_space(ATTR1_RANK, dims2);
    // Create file and dataset attributes, but only write to one attribute
    H5::Attribute ds_attr1 = dataset.createAttribute(
        "dataset/attributes:units:particles/atoms/mass/value",
        H5::PredType::NATIVE_INT, att_space);
    H5::Attribute ds_attr2 = dataset.createAttribute(
        "dataset/attributes:units:software/release/major",
        H5::PredType::NATIVE_INT, att_space);
    // Write attribute information and read attribute immediately, without closing
    std::vector<int> read_data1(attr_data1.size()); // buffer
    ds_attr1.write(H5::PredType::NATIVE_INT, attr_data1.data());
    ds_attr1.read(H5::PredType::NATIVE_INT, read_data1.data());
    for (hsize_t i = 0; i < attr_data1.size(); ++i)
      BOOST_CHECK_EQUAL(attr_data1[i], read_data1[i]);
    // Close attributes
    ds_attr1.close();
    ds_attr2.close();

    /* Test attribute with group */
    // Create a group with a dataspace for attributes
    H5::Group group = fid1.createGroup("/Group1");
    H5::DataSpace sid3(ATTR2_RANK, dims3);
    // Create an attribute for the group
    H5::Attribute gr_attr = group.createAttribute("Attr2", H5::PredType::NATIVE_INT, sid3);
    // Check storage size for attribute
    hsize_t attr_size = gr_attr.getStorageSize();
    BOOST_CHECK_EQUAL(attr_size, (ATTR2_DIM1 * ATTR2_DIM2 * sizeof(int)));
    // Write attribute information
    gr_attr.write(H5::PredType::NATIVE_INT, attr_data2.data());
    // Check storage size for attribute
    attr_size = gr_attr.getStorageSize();
    BOOST_CHECK_EQUAL(attr_size, (ATTR2_DIM1 * ATTR2_DIM2 * sizeof(int)));
  }

  /* Read .h5 file */
  {
    // Open file
    H5::FileAccPropList fapl;
    fapl.setLibverBounds(H5F_LIBVER_LATEST, H5F_LIBVER_LATEST);
    H5::H5File fid1("attr_basic.h5", H5F_ACC_RDWR, fapl);

    /* Test attribute with dataset */
    // Verify the correct number of attributes
    H5::DataSet dataset = fid1.openDataSet("Dataset1");
    int num_file_attrs = dataset.getNumAttrs();
    BOOST_CHECK_EQUAL(num_file_attrs, 2);
    int num_ds_attrs = dataset.getNumAttrs();
    BOOST_CHECK_EQUAL(num_ds_attrs, 2);
    // Verify attribute information
    H5::Attribute ds_attr = dataset.openAttribute("dataset/attributes:units:particles/atoms/mass/value");
    std::vector<int> read_data1(attr_data1.size()); // buffer
    ds_attr.read(H5::PredType::NATIVE_INT, read_data1.data());
    for (hsize_t i = 0; i < ATTR1_DIM1; i++)
      BOOST_CHECK_EQUAL(attr_data1[i], read_data1[i]);

    /* Test attribute with group */
    // Verify the correct number of attributes
    H5::Group group = fid1.openGroup("/Group1");
    int num_gr_attrs = group.getNumAttrs();
    BOOST_CHECK_EQUAL(num_gr_attrs, 1);
    // Verify attribute information
    H5::Attribute gr_attr = group.openAttribute("Attr2");
    std::vector<std::array<int,ATTR2_DIM2>> read_data2(ATTR2_DIM1);
    gr_attr.read(H5::PredType::NATIVE_INT, read_data2.data());
    for (hsize_t i = 0; i < ATTR2_DIM1; i++)
      for (hsize_t j = 0; j < ATTR2_DIM2; j++)
        BOOST_CHECK_EQUAL(attr_data2[i][j], read_data2[i][j]);
  }
}

Compilation commands:

git clone -b develop https://github.com/HDFGroup/hdf5
cd hdf5
python3 -m venv venv
. venv/bin/activate
python3 -m pip install 'cmake==3.28' 'ninja==1.11'
mkdir build
cd build
CC=mpicc CXX=mpicxx cmake \
    -C ../config/cmake/cacheinit.cmake \
    -G Ninja \
    -DCMAKE_BUILD_TYPE=Release \
    -DHDF5_ENABLE_PARALLEL:BOOL=ON \
    -DHDF5_BUILD_FORTRAN:BOOL=OFF \
    -DHDF5_BUILD_JAVA:BOOL=OFF \
    -DHDF5_BUILD_CPP_LIB:BOOL=ON \
    -DHDF5_ALLOW_UNSUPPORTED:BOOL=ON \
    -DHDF5_PACK_EXAMPLES:BOOL=ON \
    -DHDF5_EXPORTED_TARGETS:STRING="hdf5-targets" \
    -DMPIEXEC_MAX_NUMPROCS:STRING="2" ..

cmake --build .
./bin/cpp_testhdf5
# here copy-paste the MWE into a file
vi mwe.cpp
mpicxx mwe.cpp -Wall -Wextra -I../src -I../src/H5FDsubfiling/ -I./src -I../c++/src \
    -L./bin -lhdf5 -lhdf5_cpp -lboost_unit_test_framework -Wl,-rpath,./bin
./a.out

Reading the file in Python:

import h5py
with h5py.File("attr_basic.h5", "r") as cur:
    print(f"{cur.keys()}")
    print(f"{cur['Dataset1']=}")
    print(f"cur['Dataset1'].attrs={dict(cur['Dataset1'].attrs.items())}")
    print(f"cur['Group1'].attrs={dict(cur['Group1'].attrs.items())}")

Output:

<KeysViewHDF5 ['Dataset1', 'Group1']>
cur['Dataset1']=<HDF5 dataset "Dataset1": shape (3, 15, 13), type "|u1">
cur['Dataset1'].attrs={
    'dataset/attributes:units:particles/atoms/mass/value': array([  512,  -234, 98123], dtype=int32),
    'dataset/attributes:units:software/release/major': array([0, 0, 0], dtype=int32)
}
cur['Group1'].attrs={'Attr2': array([[  7614,   -416],
                                     [197814,     -3],
                                     [    -4,     12]], dtype=int32)}

Reading the file in the terminal:

h5dump attr_basic.h5

Output:

HDF5 "attr_basic.h5" {
GROUP "/" {
   DATASET "Dataset1" {
      DATATYPE   H5T_STD_U8LE
      DATASPACE  SIMPLE { ( 3, 15, 13 ) / ( 3, 15, 13 ) }
      DATA {
          (0,0,0):  0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
          (0,1,0):  0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
          ...
          (2,13,0): 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
          (2,14,0): 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0
      }
      ATTRIBUTE "dataset/attributes:units:particles/atoms/mass/value" {
         DATATYPE   H5T_STD_I32LE
         DATASPACE  SIMPLE { ( 3 ) / ( 3 ) }
         DATA { (0): 512, -234, 98123 }
      }
      ATTRIBUTE "dataset/attributes:units:software/release/major" {
         DATATYPE   H5T_STD_I32LE
         DATASPACE  SIMPLE { ( 3 ) / ( 3 ) }
         DATA { (0): 0, 0, 0 }
      }
   }
   GROUP "Group1" {
      ATTRIBUTE "Attr2" {
         DATATYPE   H5T_STD_I32LE
         DATASPACE  SIMPLE { ( 3, 2 ) / ( 3, 2 ) }
         DATA {
             (0,0): 7614, -416,
             (1,0): 197814, -3,
             (2,0): -4, 12
         }
      }
   }
}
}
