Unidata is in the process of developing a Zarr [] based variant of netcdf. As part of this effort, it was necessary to implement some support for chunking. Specifically, the problem to be solved was that of extracting a hyperslab of data from an n-dimensional variable (array in Zarr parlance) that has been divided into chunks (in the HDF5 sense). Each chunk is stored independently in the data storage -- Amazon S3, for example.
The algorithm takes a series of R slices of the form (first,stop,stride), where R is the rank of the variable. Note that a slice of the form (first, count, stride), as used by netcdf, is equivalent because stop = first + count*stride. These slices form a hyperslab.
The goal is to compute the set of chunks that intersect the hyperslab and to then extract the relevant data from that set of chunks to produce the hyperslab.
From our support questions, it appears that the major feature of netCDF-4 attracting users to upgrade their libraries from netCDF-3 is compression. The netCDF-4 libraries inherit the capability for data compression from the HDF5 storage layer underneath the netCDF-4 interface. Linking a program that uses netCDF to a netCDF-4 library allows the program to read compressed data without changing a single line of the program source code. Writing netCDF compressed data only requires a few extra statements. And the nccopy utility program supports converting classic netCDF format data to or from compressed data without any programming.
In part 1, we explained what data chunking is about in the context of scientific data access libraries such as netCDF-4 and HDF5, presented a 38 GB 3-dimensional dataset as a motivating example, discussed benefits of chunking, and showed with some benchmarks what a huge difference chunk shapes can make in balancing read times for data that will be accessed in multiple ways.
In this post, I'll continue looking at that example dataset to see how we can derive good chunk shapes, generalize to other datasets, look at how long it can take to rechunk a multidimensional dataset, and look at the use of Solid State Disk (SSD) for both accessing and rechunking data.
What is data chunking? How can chunking help to organize large multidimensional datasets for both fast and flexible data access? How should chunk shapes and sizes be chosen? Can software such as netCDF-4 or HDF5 provide better defaults for chunking? If you're interested in those questions and some of the issues they raise, read on ...