At NSF Unidata, we have been supporting and developing netCDF standards and packages since the original release of netCDF in 1990. We strongly believe in the usefulness of the netCDF Common Data Model for Earth Systems Science data, and for other types of data as well! NetCDF files can be used efficiently in machine learning applications and can be accessed as virtual Zarr datasets.
NSF Unidata has been urged by our community to investigate options to allow netCDF to work more easily with modern cloud-based infrastructure. Based on the strong interest and rapid adoption of Zarr by the community, the netCDF team decided to begin working with the Zarr community to ensure that these two widely used data storage mechanisms can interoperate if necessary.
Beginning with netCDF version 4.8.0, the Unidata netCDF group has extended the netcdf-c library to provide access to cloud storage (e.g., Amazon S3) by mapping a subset of the full netCDF Enhanced (aka netCDF-4) data model to a variant of the Zarr data model, which already has mappings to key-value cloud storage systems.
NetCDF has historically offered two different storage formats for the netCDF data model: files based on the original netCDF binary format, and files based on the HDF5 format. While this has proven effective for traditional disk storage, it is less well suited to modern cloud object storage such as Amazon S3, Microsoft Azure, IBM Cloud Object Storage, and offerings from other cloud service providers. To that end, the Unidata development team is happy to announce that we are expanding the storage solutions available through the netCDF software libraries.
Unidata is in the process of developing a Zarr-based variant of netCDF. As part of this effort, it was necessary to implement some support for chunking. Specifically, the problem to be solved was that of extracting a hyperslab of data from an n-dimensional variable (an array in Zarr parlance) that has been divided into chunks (in the HDF5 sense). Each chunk is stored independently in the underlying storage (Amazon S3, for example).
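To make "stored independently" concrete, the sketch below builds the object key under which one chunk would live in a key-value store. The key format and names here are illustrative assumptions in the style of Zarr version 2 chunk keys, not necessarily what netcdf-c produces.

```c
#include <stdio.h>
#include <stddef.h>

/* Build a Zarr-v2-style chunk key such as "temperature/2.0.3" from the
   per-dimension chunk indices; each such key names an independent object
   in the store (e.g. an S3 object).  Assumes out is large enough. */
static void chunk_key(const char* varname, const size_t* chunkidx,
                      size_t rank, char* out, size_t outlen)
{
    int off = snprintf(out, outlen, "%s", varname);
    for (size_t r = 0; r < rank; r++)
        off += snprintf(out + off, outlen - (size_t)off, "%s%zu",
                        r == 0 ? "/" : ".", chunkidx[r]);
}
```

With the default "." separator, the chunk at chunk indices {2, 0, 3} of a rank-3 variable named temperature would be fetched from the key temperature/2.0.3.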
The algorithm takes a series of R slices of the form (first, stop, stride), where R is the rank of the variable. Note that a slice of the form (first, count, stride), as used by netCDF, is equivalent because stop = first + count*stride. Together these slices define a hyperslab.
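For illustration only (the names here are hypothetical, not the netcdf-c internals), a slice can be modeled as a small per-dimension struct, with a helper that converts the netCDF-style (first, count, stride) triple using the relation above:

```c
#include <stddef.h>

/* One slice per dimension of the variable (R slices in total). */
typedef struct Slice {
    size_t first;   /* index of the first selected element */
    size_t stop;    /* exclusive upper bound: first + count*stride */
    size_t stride;  /* distance between successive selected elements */
} Slice;

/* Convert a netCDF-style (first, count, stride) triple to a slice. */
static Slice slice_from_first_count_stride(size_t first, size_t count,
                                           size_t stride)
{
    Slice s = { first, first + count * stride, stride };
    return s;
}
```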
The goal is to compute the set of chunks that intersect the hyperslab and then extract the relevant data from those chunks to produce the requested hyperslab.
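A minimal sketch of the per-dimension arithmetic, reusing the hypothetical Slice struct above: along each dimension, the chunks a slice can touch run from the chunk containing its first selected index to the chunk containing its last selected index, and the cross product of these per-dimension ranges enumerates the candidate chunks.

```c
/* For one dimension, compute the inclusive range of chunk indices
   touched by the slice (first, stop, stride); chunklen is the chunk
   size along this dimension and stop is treated as exclusive. */
static void chunk_range(const Slice* s, size_t chunklen,
                        size_t* firstchunk, size_t* lastchunk)
{
    /* Largest selected index of the form first + k*stride that is < stop. */
    size_t last = s->first
                + ((s->stop - s->first - 1) / s->stride) * s->stride;
    *firstchunk = s->first / chunklen;
    *lastchunk  = last / chunklen;
}
```

Note that when the stride is larger than the chunk length, some chunks in this range may contain no selected elements, so the range is a superset of the chunks actually touched; a per-chunk check can skip those. For each candidate chunk, the portion of the hyperslab that falls inside it is then read and copied into place in the output array.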