Optimizing Data Layout

HDF5 supports advanced I/O features for optimizing the layout of large data arrays via Chunking and I/O Filters. Below we discuss these features in more detail. The main classes and methods involved in configuring chunking and I/O filters in HDF5 via AqNWB are:

- ArrayDataSetConfig configures datasets for writing, including chunking settings.
  - HDF5ArrayDataSetConfig extends the basic ArrayDataSetConfig with HDF5-specific configuration, such as support for I/O filters (e.g., for compression).
  - HDF5FilterConfig is used with HDF5ArrayDataSetConfig to configure a single I/O filter.
- HDF5IO::createArrayDataSet is the main method used for creating n-dimensional array datasets.

Chunking

For datasets intended for recording, AqNWB uses chunking to ensure the dataset can be extended as new data arrives during the recording process. Using chunking in HDF5, a dataset is divided into fixed-size blocks (called chunks), which are stored separately in the file. This technique is particularly beneficial for large datasets and offers several advantages:

- Extend datasets: Chunked datasets can be easily extended in any dimension. This flexibility is crucial for recording datasets where the size of the dataset is not known in advance.
- Performance Optimization: By carefully choosing the chunk size, you can optimize performance for your particular read/write access patterns. When only a portion of a chunked dataset is accessed, only the relevant chunks are read or written, reducing the number of I/O operations.
- Compression: Data within each chunk can be compressed independently, which can significantly reduce data size, especially for datasets with redundancy (see I/O Filters and Compression).

Warning: A chunking configuration that does not match the intended read/write pattern can reduce performance. Chunks are always read and written in full, so a poorly aligned layout may force the same chunk to be read, decompressed, and updated repeatedly, or may read far more data than was requested.
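The cost of a mismatched layout can be estimated with simple arithmetic before any file is written. The following is a minimal sketch in plain C++ that does not use HDF5 or AqNWB; the helper names chunksTouched and elementsRead are hypothetical and were introduced only for this illustration. It assumes a rectangular selection given by an offset and a count per dimension, and that every chunk intersecting the selection is transferred in full.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper (not part of HDF5 or AqNWB): number of chunks that a
// rectangular selection intersects. `offset` and `count` describe the
// selection and `chunk` is the chunk shape; all three have the same rank.
std::size_t chunksTouched(const std::vector<std::size_t>& offset,
                          const std::vector<std::size_t>& count,
                          const std::vector<std::size_t>& chunk)
{
  assert(offset.size() == count.size() && count.size() == chunk.size());
  std::size_t total = 1;
  for (std::size_t d = 0; d < chunk.size(); ++d) {
    assert(chunk[d] > 0 && count[d] > 0);
    // Index of the first and last chunk covered along dimension d
    const std::size_t first = offset[d] / chunk[d];
    const std::size_t last = (offset[d] + count[d] - 1) / chunk[d];
    total *= (last - first + 1);
  }
  return total;
}

// Hypothetical helper: number of stored elements that must be transferred
// for the selection, assuming every touched chunk is read in full.
std::size_t elementsRead(const std::vector<std::size_t>& offset,
                         const std::vector<std::size_t>& count,
                         const std::vector<std::size_t>& chunk)
{
  std::size_t chunkElements = 1;
  for (const std::size_t c : chunk) {
    chunkElements *= c;
  }
  return chunksTouched(offset, count, chunk) * chunkElements;
}
```

For a 100 x 100 dataset with 10 x 10 chunks, reading one full row (offset {0, 0}, count {1, 100}) touches 10 chunks, so 1000 stored elements are transferred for 100 requested. With 1 x 100 chunks the same read touches a single chunk, whereas reading one full column would then touch 100 chunks.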

I/O Filters and Compression

HDF5 filters are used to transform data as it is written to and read from the file. Filters are applied on a per-chunk basis (i.e., chunking is required to use filters), and are typically used for compression or error detection.

Compression filters reduce the amount of storage space required for datasets by eliminating redundancy in the data. One commonly used compression filter is GZIP (DEFLATE), which provides a good balance between compression ratio and speed. Multiple filters can be combined on the same dataset. For example, the Shuffle filter rearranges the bytes in the data to improve the effectiveness of compression filters. By shuffling the data, redundancy within the byte stream is increased, which can help improve compression ratios. For more information on the available filters in HDF5, please refer to the HDF5 Data Pipeline Filters Documentation by the HDF5 project.
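To make the byte rearrangement concrete, the following sketch illustrates the idea behind a byte shuffle in plain C++. It is not the HDF5 implementation, and the function names shuffleBytes and unshuffleBytes are hypothetical. The sketch assumes the buffer holds fixed-size elements and groups the first bytes of all elements together, then the second bytes, and so on; any trailing bytes that do not form a complete element are copied unchanged.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical illustration of a byte shuffle (not the HDF5 implementation).
// For n elements of elemSize bytes each, the output contains byte 0 of every
// element, followed by byte 1 of every element, and so on.
std::vector<std::uint8_t> shuffleBytes(const std::vector<std::uint8_t>& in,
                                       std::size_t elemSize)
{
  assert(elemSize > 0);
  const std::size_t n = in.size() / elemSize;  // number of whole elements
  std::vector<std::uint8_t> out(in);  // copy keeps any trailing bytes as-is
  for (std::size_t i = 0; i < n; ++i) {
    for (std::size_t b = 0; b < elemSize; ++b) {
      out[b * n + i] = in[i * elemSize + b];
    }
  }
  return out;
}

// Inverse transform: restores the original element-wise byte order.
std::vector<std::uint8_t> unshuffleBytes(const std::vector<std::uint8_t>& in,
                                         std::size_t elemSize)
{
  assert(elemSize > 0);
  const std::size_t n = in.size() / elemSize;
  std::vector<std::uint8_t> out(in);
  for (std::size_t i = 0; i < n; ++i) {
    for (std::size_t b = 0; b < elemSize; ++b) {
      out[i * elemSize + b] = in[b * n + i];
    }
  }
  return out;
}
```

For three little-endian 32-bit values 1, 2, 3 stored as the bytes 1 0 0 0 2 0 0 0 3 0 0 0, the shuffled stream is 1 2 3 followed by nine zero bytes. The long run of identical bytes is the kind of redundancy a compressor such as GZIP can exploit.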

Using Chunking and I/O Filters in AqNWB

When creating datasets, you can specify the filters to be applied using the HDF5ArrayDataSetConfig class. The following example demonstrates how to configure and apply the GZIP and Shuffle filters to a dataset. The HDF5ArrayDataSetConfig class allows you to specify the data type, shape, chunking, and filters for the dataset. The HDF5IO::createArrayDataSet method then creates the dataset with the specified configuration.

    // Create the HDF5IO object and open the file as usual
    std::string path = getTestFilePath("testWithFilters.h5");
    std::unique_ptr<AQNWB::IO::HDF5::HDF5IO> hdf5io =
        std::make_unique<AQNWB::IO::HDF5::HDF5IO>(path);
    hdf5io->open();
 
    // Define the data type, shape, and chunking
    AQNWB::IO::BaseDataType type(AQNWB::IO::BaseDataType::Type::T_I32, 1);
    SizeArray shape = {100, 100};
    SizeArray chunking = {10, 10};
 
    // Create HDF5ArrayDataSetConfig and add filters
    AQNWB::IO::HDF5::HDF5ArrayDataSetConfig config(type, shape, chunking);
    unsigned int gzip_level = 4;
    config.addFilter(
        AQNWB::IO::HDF5::HDF5FilterConfig::createGzipFilter(gzip_level));
    config.addFilter(AQNWB::IO::HDF5::HDF5FilterConfig::createShuffleFilter());
 
    // Create the dataset
    auto baseDataset = hdf5io->createArrayDataSet(config, "/filtered_dataset");
 
    // [Optional/Testing] Verify the dataset properties
    auto dataset =
        dynamic_cast<AQNWB::IO::HDF5::HDF5RecordingData*>(baseDataset.get());
    const H5::DataSet* h5Dataset = dataset->getDataSet();
    H5::DSetCreatPropList dcpl = h5Dataset->getCreatePlist();
    REQUIRE(dcpl.getNfilters() == 2);

Note: HDF5FilterConfig provides convenient factory methods to set up common filters, e.g., createGzipFilter. Alternatively, we can use any of the HDF5 filters directly via HDF5ArrayDataSetConfig::addFilter, e.g., in the case of GZIP via config.addFilter(H5Z_FILTER_DEFLATE, {4});

Single-Writer Multiple-Reader (SWMR) Mode

The HDF5IO I/O backend uses SWMR mode by default while recording data. Using SWMR, one process can write to the HDF5 file while multiple other processes read from the file concurrently, with the guarantee that the readers see a consistent view of the data.

Warning: There are known issues using SWMR mode on Windows due to file locking by the reader processes. One workaround is to set the environment variable HDF5_USE_FILE_LOCKING=FALSE to prevent file access errors when using a writer process with other reader processes.

Why does AqNWB use SWMR mode?

Using SWMR has several key advantages for data acquisition applications:

- Concurrent Access: Enables one writer process to update the file while multiple reader processes read from it without blocking each other.
- Data Consistency and Integrity: Ensures that readers see a consistent view of the data, even as it is being written. Readers will only see data that has been completely written and flushed to disk. Hence, SWMR mode maintains the integrity and consistency of the data, ensuring that the HDF5 file remains readable even if errors occur during the data acquisition process.
- Real-Time Data Access: Useful for applications that need to monitor and analyze data in real time as it is being generated.
- Simplified Workflow for Real-Time Analyses: Simplifies the architecture of applications that require real-time data consumption during acquisition, avoiding the need for intermediate storage solutions and complex inter-process communication or file locking mechanisms.

Note: While SWMR mode ensures data integrity, some data loss may still occur if the application crashes. Only data that has been completely written and flushed to disk will be readable. To manually flush data to disk use HDF5IO::flush.

Writing an NWB file with SWMR mode

SWMR mode is enabled when calling HDF5IO::startRecording. Once SWMR mode is enabled, no new data objects (Datasets, Groups, Attributes etc.) can be created; we can only append data to and set values in existing data objects. Since other processes may read from the HDF5 file, it is not possible to temporarily disable SWMR mode to add new objects, i.e., once SWMR mode is enabled, the only way to add new objects to the file is to close the file and reopen it in read/write mode. As such, the typical workflow when using SWMR mode during data acquisition is to:

1. Open the HDF5 file
2. Create all elements of the NWB file
3. Start the recording process
4. Stop recording and close the file

This workflow is applicable to a wide range of data acquisition use-cases. However, for use cases that require creation of new Groups and Datasets during acquisition, you can disable the use of SWMR mode by setting disableSWMRMode=true when constructing the AQNWB::IO::HDF5::HDF5IO object.

Warning: While disabling SWMR mode allows Groups and Datasets to be created during and after recording, this comes at the cost of losing the concurrent access and data integrity features that SWMR mode provides.
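The rules described above can be summarized as a small state model. The sketch below is a toy model in plain C++ and not the AqNWB API: the class RecordingFileModel and its createObject method are hypothetical, and the model only mirrors the behavior described in this section (object creation is blocked while recording in SWMR mode, and stopping a recording in SWMR mode closes the file).

```cpp
#include <cassert>
#include <set>
#include <string>

// Toy model of the recording workflow (hypothetical, NOT the AqNWB API).
class RecordingFileModel
{
public:
  explicit RecordingFileModel(bool disableSWMRMode = false)
      : m_disableSWMR(disableSWMRMode)
  {
  }

  void open() { m_open = true; }
  bool isOpen() const { return m_open; }

  // New objects may be created while the file is open, unless a recording
  // is running in SWMR mode.
  bool canModifyObjects() const
  {
    return m_open && !(m_recording && !m_disableSWMR);
  }

  // Returns false if creation is not allowed or the path already exists.
  bool createObject(const std::string& path)
  {
    assert(!path.empty());
    if (!canModifyObjects()) {
      return false;
    }
    return m_objects.insert(path).second;
  }

  // Recording can only start on an open file that is not already recording.
  bool startRecording()
  {
    if (!m_open || m_recording) {
      return false;
    }
    m_recording = true;
    return true;
  }

  // In SWMR mode, stopping the recording also closes the file.
  void stopRecording()
  {
    if (!m_recording) {
      return;
    }
    m_recording = false;
    if (!m_disableSWMR) {
      m_open = false;
    }
  }

private:
  bool m_disableSWMR;
  bool m_open = false;
  bool m_recording = false;
  std::set<std::string> m_objects;
};
```

In this model, a default-constructed instance rejects createObject after startRecording and cannot be restarted after stopRecording, while an instance constructed with disableSWMRMode set to true accepts new objects during the recording and can be restarted.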

Code Examples

The code examples in this section use the following includes:

    #include <filesystem>
    #include <future>
    #include <iostream>
    #include <memory>
    #include <numeric>
    #include <vector>

    #include <catch2/catch_test_macros.hpp>

    #include "io/hdf5/HDF5ArrayDataSetConfig.hpp"
    #include "io/hdf5/HDF5IO.hpp"
    #include "io/hdf5/HDF5RecordingData.hpp"
    #include "nwb/NWBFile.hpp"
    #include "nwb/file/ElectrodeTable.hpp"
    #include "testUtils.hpp"

    namespace fs = std::filesystem;

Workflow with SWMR

    // create and open the HDF5 file. SWMR mode is used by default
    std::string path = getTestFilePath("testWithSWMRMode.h5");
    std::unique_ptr<AQNWB::IO::HDF5::HDF5IO> hdf5io =
        std::make_unique<AQNWB::IO::HDF5::HDF5IO>(path);
    hdf5io->open();
 
    // add a dataset
    std::vector<int> testData(10000);
    std::iota(testData.begin(), testData.end(), 1);  // Initialize testData
    std::string dataPath = "/data";
    SizeType numBlocks = 10;  // number of data blocks to write
    SizeType numSamples = testData.size();
    AQNWB::IO::ArrayDataSetConfig datasetConfig(
        BaseDataType::I32,  // type
        SizeArray {0},  // size. Initial size of the dataset
        SizeArray {1000}  // chunking. Size of a data chunk
    );
    std::unique_ptr<BaseRecordingData> dataset = hdf5io->createArrayDataSet(
        datasetConfig,
        dataPath);  // path. Path to the dataset in the HDF5 file
 
    // Start recording. Starting the recording places the HDF5 file in SWMR mode
    Status status = hdf5io->startRecording();
    REQUIRE(status == Status::Success);
 
    // Once in SWMR mode we can add data to the file but we can no longer create
    // new data objects (Groups, Datasets, Attributes etc.).
    REQUIRE(hdf5io->canModifyObjects() == false);
 
    // write our testData to the file, one block per iteration
    for (SizeType b = 0; b < numBlocks; b++) {
      // write a single 1D block of data and flush to file
      std::vector<SizeType> dataShape = {numSamples};
      dataset->writeDataBlock(dataShape, BaseDataType::I32, testData.data());
      // Optionally we can flush all data to disk
      status = hdf5io->flush();
      REQUIRE(status == Status::Success);
    }
 
    // stop recording. In SWMR mode the file is now closed and recording cannot
    // be restarted
    status = hdf5io->stopRecording();
    REQUIRE(hdf5io->isOpen() == false);
    REQUIRE(hdf5io->startRecording() == Status::Failure);

Workflow with SWMR disabled

    // create and open the HDF5 file. With SWMR mode explicitly disabled
    std::string path = getTestFilePath("testWithoutSWMRMode.h5");
    std::unique_ptr<AQNWB::IO::HDF5::HDF5IO> hdf5io =
        std::make_unique<AQNWB::IO::HDF5::HDF5IO>(path,
                                                  true  // Disable SWMR mode
        );
    hdf5io->open();
 
    // add a dataset
    std::vector<int> testData(10000);
    std::iota(testData.begin(), testData.end(), 1);  // Initialize testData
    std::string dataPath = "/data";
    SizeType numBlocks = 10;  // number of data blocks to write
    SizeType numSamples = testData.size();
    AQNWB::IO::ArrayDataSetConfig datasetConfig(
        BaseDataType::I32,  // type
        SizeArray {0},  // size. Initial size of the dataset
        SizeArray {1000}  // chunking. Size of a data chunk
    );
    std::unique_ptr<BaseRecordingData> dataset = hdf5io->createArrayDataSet(
        datasetConfig,
        dataPath);  // path. Path to the dataset in the HDF5 file
 
    // Start recording. Because SWMR mode is disabled, starting the recording
    // does not place the HDF5 file in SWMR mode
    Status status = hdf5io->startRecording();
    REQUIRE(status == Status::Success);
 
    // With SWMR mode disabled we are still allowed to create new data objects
    // (Groups, Datasets, Attributes etc.) during the recording. However, with
    // SWMR mode disabled, we lose the data consistency and concurrent read
    // features that SWMR mode provides.
    REQUIRE(hdf5io->canModifyObjects() == true);
 
    // write our testData to the file, one block per iteration
    for (SizeType b = 0; b < numBlocks; b++) {
      // write a single 1D block of data and flush to file
      std::vector<SizeType> dataShape = {numSamples};
      dataset->writeDataBlock(dataShape, BaseDataType::I32, testData.data());
      // Optionally we can flush all data to disk
      status = hdf5io->flush();
      REQUIRE(status == Status::Success);
    }
 
    // stop recording.
    status = hdf5io->stopRecording();
 
    // Since SWMR mode is disabled, stopping the recording won't close the file
    // so that we can restart the recording if we want to
    REQUIRE(hdf5io->isOpen() == true);
 
    // Restart the recording
    REQUIRE(hdf5io->startRecording() == Status::Success);
 
    // Stop the recording and close the file
    hdf5io->stopRecording();
    hdf5io->close();
    REQUIRE(hdf5io->isOpen() == false);

Reading with SWMR mode

While the file is being written to in SWMR mode, readers must open the file read-only with SWMR read access enabled, i.e., by combining the H5F_ACC_RDONLY and H5F_ACC_SWMR_READ flags in the call to H5Fopen, e.g.:

    hid_t file_id = H5Fopen(
        "example.h5", H5F_ACC_RDONLY | H5F_ACC_SWMR_READ, H5P_DEFAULT);
