Unidata is in the process of developing a Zarr-based variant of netCDF. As part of this effort, it was necessary to implement some support for chunking. Specifically, the problem to be solved was that of extracting a hyperslab of data from an n-dimensional variable (array in Zarr parlance) that has been divided into chunks (in the HDF5 sense). Each chunk is stored independently in the underlying data storage -- Amazon S3, for example.
The algorithm takes a series of R slices of the form (first, stop, stride), where R is the rank of the variable. Note that a slice of the form (first, count, stride), as used by netCDF, is equivalent, since the (exclusive) stop = first + count*stride. Together these slices define a hyperslab.
The goal is to compute the set of chunks that intersect the hyperslab and then to extract the relevant data from that set of chunks to produce the hyperslab.
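To make the chunk-intersection step concrete, here is a minimal sketch of one way to compute it. This is illustrative only, not Unidata's implementation; the function intersecting_chunks and its parameters are assumptions made for the example.

```python
from itertools import product

def intersecting_chunks(slices, chunk_shape):
    """Yield the indices of every chunk touched by a hyperslab.

    `slices` is one (first, stop, stride) triple per dimension (stop is
    exclusive) and `chunk_shape` gives the chunk length per dimension.
    """
    per_dim = []
    for (first, stop, stride), clen in zip(slices, chunk_shape):
        # Chunk indices touched along this dimension; a real implementation
        # would compute chunk boundaries arithmetically instead of
        # enumerating every index in the slice.
        touched = sorted({i // clen for i in range(first, stop, stride)})
        per_dim.append(touched)
    # The intersecting chunks are the cross product of the per-dimension
    # chunk index lists.
    yield from product(*per_dim)

# A 2-D variable chunked 10x10, read with slice (5, 25, 2) in each
# dimension, touches the nine chunks (0,0) through (2,2).
print(list(intersecting_chunks([(5, 25, 2), (5, 25, 2)], (10, 10))))
```

Once the intersecting chunk indices are known, each chunk can be fetched from storage and the portion selected by the slices copied into the output hyperslab.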
In May 2018, The HDF Group announced a new support strategy for the HDF5 libraries that are included in netCDF-4. Because the netCDF-4 libraries need the HDF5 libraries to create fully-featured netCDF files, the change to The HDF Group's support strategy has raised questions in the netCDF community about netCDF's future path.
Unidata and the netCDF team have been in close contact with The HDF Group since their announcement, and we reiterate our commitment to providing netCDF libraries that do not require any paid software licenses in order to create or read files that conform to the netCDF standard. Read on for details.
In part 1, we explained what data chunking is about in the context of scientific data access libraries such as netCDF-4 and HDF5, presented a 38 GB 3-dimensional dataset as a motivating example, discussed benefits of chunking, and showed with some benchmarks what a huge difference chunk shapes can make in balancing read times for data that will be accessed in multiple ways.
In this post, I'll continue with that example dataset, looking at how we can derive good chunk shapes, how the approach generalizes to other datasets, how long it can take to rechunk a multidimensional dataset, and how Solid State Disk (SSD) performs for both accessing and rechunking data.
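As a preview of the chunk-shape question, here is a rough sketch of one simple way to balance time-series and spatial-slice reads of a 3-D (time, y, x) variable. It is not the exact method from the post; the function balanced_chunk_shape_3d, its parameters, and the example dimensions are illustrative assumptions.

```python
import math

def balanced_chunk_shape_3d(dims, chunk_bytes, value_size=4):
    """Sketch: pick a chunk shape (ct, cy, cx) for a (time, y, x) variable
    so that a full time series at one point and a full spatial slice at one
    time each touch about the same number of chunks, with each chunk close
    to `chunk_bytes` in size."""
    ntime, ny, nx = dims
    nvals = chunk_bytes / value_size              # values per chunk
    # Balance condition: ntime/ct == (ny/cy) * (nx/cx), with ct*cy*cx == nvals.
    ct = math.sqrt(nvals * ntime / (ny * nx))
    spatial = nvals / ct                          # cy * cx
    cy = math.sqrt(spatial * ny / nx)
    cx = spatial / cy

    def clamp(c, n):
        # keep each chunk length at least 1 and no longer than the dimension
        return max(1, min(n, int(round(c))))

    return clamp(ct, ntime), clamp(cy, ny), clamp(cx, nx)

# Illustrative dimensions (not necessarily those of the dataset in part 1):
# with ~1 MB chunks this yields roughly (504, 20, 25), so a time series reads
# about 195 chunks and a spatial slice about 196 -- nearly balanced.
print(balanced_chunk_shape_3d((98128, 277, 349), chunk_bytes=1_000_000))
```

The idea behind the balance condition is simply that the two access patterns we care about should pay roughly the same number of chunk reads, rather than one being fast and the other painfully slow.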
What is data chunking? How can chunking help to organize large multidimensional datasets for both fast and flexible data access? How should chunk shapes and sizes be chosen? Can software such as netCDF-4 or HDF5 provide better defaults for chunking? If you're interested in those questions and some of the issues they raise, read on ...
Unidata Program Center developer John Caron has been thinking a lot about HDF5's Dimension Scales, how they relate to netCDF's Shared Dimensions, and why data should be written with the netCDF-4 library using Shared Dimensions.
If you want to write HDF5 files directly without using the netCDF-4 library, or if you want to build a netCDF-4 compatible software layer on top of HDF5, read on.