Unidata Developer's Blog

https://www.unidata.ucar.edu/blogs/developer/entry/maintaining-netcdf-updating-java-tutorial
Maintaining netCDF: Updating Java Tutorial Code and Performance Testing in Python
Unidata News, 2021-08-04
<a class="lightbox" title="Isabelle Pfander" href="/blog_content/images/2021/20210519_Izzy_Pfander.jpg">
<img width="150" src="/blog_content/images/2021/20210519_Izzy_Pfander.jpg" alt="Isabelle Pfander" />
</a>
<div class="caption">
Izzy (Isabelle) Pfander
</div>
<p></div></p>
<p class="byline">
by
<a href="/community/internship/#2021ip">Isabelle Pfander</a>
<br />2021 Unidata summer intern
</p>
<p>
I came into this summer internship with a goal of working on the Network Common Data
Form (netCDF) libraries. NetCDF is a combination of software libraries and APIs
describing a data model for scientific multidimensional arrays. I planned to improve
the online user guide, write tutorial code, and learn about storage and efficiency.
</p>
<p style="font-style: italic;">
Editor's Note:<br>
Due to the COVID-19 pandemic, Unidata's 2021 summer interns did not travel to
Boulder to work on their projects in person. Instead, they interacted with Unidata
developers through Slack, Zoom, and other electronic means.
</p>
<div class="img_l" style="width: 150px;">
<a class="lightbox" title="Isabelle Pfander" href="/blog_content/images/2021/20210519_Izzy_Pfander.jpg">
<img width="150" src="/blog_content/images/2021/20210519_Izzy_Pfander.jpg" alt="Isabelle Pfander" />
</a>
<div class="caption">
Izzy (Isabelle) Pfander
</div>
</div>
<p class="byline">
by
<a href="/community/internship/#2021ip">Isabelle Pfander</a>
<br />2021 Unidata summer intern
</p>
<p>
I came into this summer internship with a goal of working on the Network Common Data
Form (netCDF) libraries. NetCDF is a combination of software libraries and APIs
describing a data model for scientific multidimensional arrays. I planned to improve
the online user guide, write tutorial code, and learn about storage and efficiency.
</p>
<p>
Before this project, I had only used netCDF by calling high level functions to read
and write data in MATLAB, which uses functionality from the netCDF-C library. The
netCDF data model is a standard across languages, with programming interfaces in C,
Java, Fortran, Python, MATLAB, R, and more. For the majority of this summer, I worked
closely with the netCDF-Java library, updating and expanding the
<a href="https://docs.unidata.ucar.edu/netcdf-java/6.0/userguide/index.html">online user's guide</a>.
</p>
<div class="img_r" style="width: 200px;">
<a class="lightbox" title="Updated UML diagram in the netCDF-Java documentation." href="/blog_content/images/2021/20210803_pfander_uml.png">
<img width="200" src="/blog_content/images/2021/20210803_pfander_uml.png" alt="netCDF-Java UML" />
</a>
<div class="caption">
A UML diagram for<br>netCDF-Java<br>(click to enlarge)
</div>
</div>
<p>
I maintained the netCDF-Java documentation by updating tutorial code, testing code
snippets, and modernizing tutorial text to improve user understanding. I started by
improving the documentation by replacing raw HTML with Markdown, changing formatting,
linking to relevant sites, updating UML diagrams, and including updated screenshots. I
next moved on to update and rewrite the tutorial code in Java. I created a tutorial
class for each page with every code snippet contained in a method. Viewing the code
snippets inside of IntelliJ, I was able to fix deprecations and update the code after
some major changes were made to the structure of the netCDF-Java library. I then used
netCDF-Java’s Jekyll plugin to insert the code snippets into the rendered HTML pages.
Finally, I created test classes to confirm the code runs properly. Moving code snippets
into Java classes rather than embedding them in the Markdown files ensures that when
future changes are made, errors in the user guide will not go
unnoticed. See one of my
<a href="https://github.com/Unidata/netcdf-java/pull/743">pull requests for user guide
updates</a>.
</p>
<div class="img_l" style="width: 200px;">
<a class="lightbox" title="Comparing performance when specifying different chunk sizes." href="/blog_content/images/2021/20210803_pfander_chunking.png">
<img width="200" src="/blog_content/images/2021/20210803_pfander_chunking.png" alt="Chunking performance" />
</a>
<div class="caption">
Performance with different chunk sizes.
</div>
</div>
<p>
After improving the user guide documentation, I embarked on the second focus of my
internship: performance testing in Python. Because of my interest in data storage and
efficiency, my mentors suggested that I look into comparing data formats, including
HDF5 and Zarr. HDF5 is a file format used by netCDF-4 providing compression and
chunking to the netCDF data model. I switched from working in Java to Python so that I
could compare reading times with Zarr, a Python-based data storage format. I compared
reads with netCDF-3, netCDF-4 Classic, netCDF-4, Zarr, and Zarr being read with
Xarray. The completed performance testing demonstrated that read times increase at
varying scales as chunk size decreases. When chunk size was large, a Zarr directory
store read was faster; however, as chunk size decreased, reads of netCDF-4 became much
faster. I learned that the difference in read times is due to how each format stores
data. A netCDF-4 file stores all data in one .nc file; consequently, more
operations are needed to find the appropriate data, but only one open is required.
Zarr directory stores save chunked data as many subdirectories and files, meaning the
more chunks, the more individual files in one Zarr directory store. You can see the
notebooks I created, testing data, and full results in my
<a href="https://github.com/irpfander/Comparing-Read-Times-of-NetCDF-and-Zarr-with-Python">GitHub
repository</a>.
</p>
<p>
My internship with Unidata allowed me to explore my own interests with netCDF software
while sharing findings with the public. I was able to contribute to the open source
community for the first time, conduct testing of my own, and gain professional development
skills through my mentors and UCAR/Unidata’s community. I am very grateful for this summer
opportunity and all the individuals who made this remote collaboration possible.
</p>
https://www.unidata.ucar.edu/blogs/developer/entry/nczarr-support-for-zarr-filters
NCZarr Support for Zarr Filters
Dennis Heimbigner, 2021-05-23 (last revised 2023-06-04)
<p>[Note: See GitHub issue <a href="https://github.com/Unidata/netcdf-c/issues/2006">2006</a> for additional comments.]</p>
<p>To date, filters in the netcdf-c library have referred to HDF5-style filters.
This style of filter is represented in the netcdf-c/HDF5 file by the following information:</p>
<ol>
<li>An unsigned integer, the "id", and</li>
<li>A vector of unsigned integers that encode the "parameters" for controlling the behavior of the filter.</li>
</ol>
<p>The "id" is a unique number assigned to the filter by the <a href="https://portal.hdfgroup.org/display/support/Registered+Filter+Plugins">HDF Group
filter authority</a>.
It identifies a specific filter algorithm.
The "parameters" of the filter are not defined explicitly but only by the implementation of the filter.</p>
<p>The inclusion of Zarr support in the netcdf-c library (called NCZarr) creates the need to provide a new representation consistent with the way that Zarr files store filter information.
For Zarr, filters are represented using the JSON notation.
Each filter is defined by a JSON dictionary, and each such filter dictionary
is guaranteed to have a key named "id" whose value is a unique string defining the filter algorithm: "lz4" or "bzip2", for example.</p>
<p>The parameters of the filter are defined by additional -- algorithm specific -- keys in the filter dictionary.
One commonly used filter is "blosc", which has a JSON dictionary of this form.</p>
<pre><code>{
"id": "blosc",
"cname": "lz4",
"clevel": 5,
"shuffle": 1
}
</code></pre>
<p>So in HDF5 terms, it has three parameters:</p>
<ol>
<li>"cname" -- the sub-algorithm used by the blosc compressor, LZ4 in this case.</li>
<li>"clevel" -- the compression level, 5 in this case.</li>
<li>"shuffle" -- is the input shuffled before compression, yes (1) in this case.</li>
</ol>
<p>NCZarr (netcdf Zarr) is required to store its filter information in its metadata in the above JSON dictionary format.
Simultaneously, NCZarr expects to use many of the existing HDF5 filter implementations.
This means that some mechanism is needed to translate between the HDF5 id+parameter model and the Zarr JSON dictionary model.</p>
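<p>To make the translation concrete, the following sketch packs the example blosc codec settings into an HDF5-style vector of unsigned ints. It is illustrative only: the numeric code standing in for "lz4" and the ordering of the parameters are assumptions, not the layout actually used by the registered blosc HDF5 filter.</p>
<pre><code>#include &lt;stdio.h&gt;

/* Illustrative only: pack the example blosc codec settings
 *   {"id": "blosc", "cname": "lz4", "clevel": 5, "shuffle": 1}
 * into an HDF5-style vector of unsigned ints. The numeric code for
 * "lz4" and the parameter ordering are hypothetical. */
int main(void) {
    unsigned int cname = 1;   /* hypothetical numeric code standing in for "lz4" */
    unsigned int clevel = 5;  /* compression level */
    unsigned int shuffle = 1; /* shuffle input before compression */
    unsigned int params[3] = { cname, clevel, shuffle };
    size_t nparams = sizeof(params) / sizeof(params[0]);
    for (size_t i = 0; i &lt; nparams; i++)
        printf("param[%zu] = %u\n", i, params[i]);
    return 0;
}
</code></pre>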
<p>The standardization authority for defining Zarr filters is the list supported by the <a href="https://numcodecs.readthedocs.io/en/stable/">NumCodecs project</a>. Comparing the set of standard filters (aka codecs) defined by NumCodecs to the set of standard filters defined by HDF5, it can be seen that the two sets overlap, but each has filters not defined by the other.</p>
<p>Note also that it is undesirable that a specific set of filters/codecs be built into the NCZarr implementation.
Rather, it is preferable for there be some extensible way to associate the JSON with the code implementing the codec. This mirrors the plugin model used by HDF5.</p>
<p>Currently, each HDF5 filter is implemented by a shared library that has certain well-defined entry points that allow the netcdf/HDF5 libraries to determine information about the filter, notably its id.
In order to use the codec JSON format, these entry points must be extended in some way to obtain the corresponding defining JSON.
But there is another desirable constraint.
It should be possible to associate an existing HDF5 filter -- one without codec JSON information -- with the corresponding codec JSON.
This association needs to be implemented by some mechanism external to the HDF5 filter.</p>
<h2>Pre-Processing Filter Libraries</h2>
<p>The process for using filters for NCZarr is defined to operate in several steps.
First, as with HDF5, all shared libraries in a specified directory
(<em>HDF5_PLUGIN_PATH</em>) are scanned.
They are interrogated to see what kind of library they implement, if any.
This interrogation operates by seeing if certain well-known (function) names are defined in this library.</p>
<p>There are two library types:</p>
<ol>
<li>HDF5 -- exports a specific API: "H5Z_plugin_type" and "H5Z_get_plugin_info".</li>
<li>Codec -- exports a specific API: "NCZ_codec_type" and "NCZ_get_codec_info"</li>
</ol>
<p>Note that a given library can export either or both of these APIs.
This means that we can have three types of libraries:</p>
<ol>
<li>HDF5 only</li>
<li>Codec only</li>
<li>HDF5 + Codec</li>
</ol>
<p>Suppose that our <em>HDF5_PLUGIN_PATH</em> location has an HDF5-only library.
Then by adding a corresponding, separate, Codec-only library to that same location, it is possible to make an HDF5 library usable by NCZarr.
It is possible to do this without having to modify the HDF5-only library.
Over time, it is possible to merge any given HDF5-only library with a Codec-only library to produce a single, combined library.</p>
<h2>Using Plugin Libraries</h2>
<p>The approach used by NCZarr is to have the netcdf-c library process all of the libraries by interrogating each one for the well-known APIs and recording the result.
Any libraries that do not export one or both of the well-known APIs are ignored.</p>
<p>Internally, the netcdf-c library pairs up each HDF5 library API with a corresponding Codec API by invoking the relevant well-known functions
(See <a href="#AppendixA">Appendix A</a>).
This results in the following table for the associated HDF5 and Codec APIs.</p>
<table>
<tr><th>HDF5 API<th>Codec API<th>Action
<tr><td>Not defined<td>Not defined<td>Ignore
<tr><td>Defined<td>Not defined<td>Ignore
<tr><td>Defined<td>Defined<td>NCZarr usable
</table>
<h2>Using the Codec API</h2>
<p>Given a set of filters for which the HDF5 API and the Codec API
are defined, it is then possible to use the APIs to invoke the
filters and to process the meta-data in Codec JSON format.</p>
<h3>Writing an NCZarr Container</h3>
<p>When writing, the user program invokes the NetCDF API function <em>nc_def_var_filter</em>.
This function is currently defined to operate using HDF5-style id and parameters (unsigned ints).
The netcdf-c library examines its list of known filters to find one matching the HDF5 id provided by <em>nc_def_var_filter</em>.
The set of parameters provided is stored internally.
Then during writing of data, the corresponding HDF5 filter is invoked to encode the data.</p>
<p>When it comes time to write out the meta-data, the stored HDF5-style parameters are passed to a specific Codec function to obtain the corresponding JSON representation. Again see <a href="#AppendixA">Appendix A</a>.
This resulting JSON is then written in the NCZarr metadata. </p>
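<p>The sketch below illustrates the existing HDF5-style call mentioned above. It uses <em>nc_def_var_filter</em> with filter id 1 (the registered HDF5 deflate filter) against an ordinary netCDF-4 file; under this proposal, an NCZarr target would accept the same call, with the JSON Codec form generated when the metadata is written.</p>
<pre><code>#include &lt;stdio.h&gt;
#include &lt;netcdf.h&gt;
#include &lt;netcdf_filter.h&gt;  /* declares nc_def_var_filter in recent releases */

#define CHECK(e) do { int s = (e); if (s != NC_NOERR) { \
    fprintf(stderr, "%s\n", nc_strerror(s)); return 1; } } while (0)

int main(void) {
    int ncid, dimid, varid;
    unsigned int level = 5;  /* the single HDF5-style parameter for deflate */

    CHECK(nc_create("filtered.nc", NC_NETCDF4 | NC_CLOBBER, &ncid));
    CHECK(nc_def_dim(ncid, "x", 1024, &dimid));
    CHECK(nc_def_var(ncid, "v", NC_FLOAT, 1, &dimid, &varid));
    /* HDF5-style request: filter id 1 (deflate), one unsigned parameter. */
    CHECK(nc_def_var_filter(ncid, varid, 1, 1, &level));
    CHECK(nc_close(ncid));
    return 0;
}
</code></pre>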
<h3>Reading an NCZarr Container</h3>
<p>When reading, the netcdf-c library reads the metadata for a given variable and sees that some set of filters are applied to this variable.
The metadata is encoded as Codec-style JSON.</p>
<p>Given a JSON Codec, it is parsed to provide a JSON dictionary containing the string "id" and the set of parameters as various keys.
The netcdf-c library examines its list of known filters to find one matching the Codec "id" string.
The JSON is passed to a Codec function to obtain the corresponding HDF5-style <em>unsigned int</em> parameter vector.
These parameters are stored for later use.</p>
<p>When it comes time to read the data, the stored HDF5-style filter is invoked with the parameters already produced and stored.</p>
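<p>From the user's point of view, decoding is transparent: an ordinary read call triggers the stored filter. A minimal sketch, reusing the file produced by the earlier write sketch (the NCZarr read path would follow the same pattern under this proposal):</p>
<pre><code>#include &lt;stdio.h&gt;
#include &lt;stdlib.h&gt;
#include &lt;netcdf.h&gt;

int main(void) {
    int ncid, varid;
    float* data = (float*)malloc(1024 * sizeof(float));
    if (data == NULL) return 1;
    /* "filtered.nc" is the file produced by the earlier write sketch. */
    if (nc_open("filtered.nc", NC_NOWRITE, &ncid) != NC_NOERR) return 1;
    if (nc_inq_varid(ncid, "v", &varid) != NC_NOERR) return 1;
    /* The stored filter and its parameters are applied transparently here. */
    if (nc_get_var_float(ncid, varid, data) != NC_NOERR) return 1;
    printf("first value: %f\n", data[0]);
    nc_close(ncid);
    free(data);
    return 0;
}
</code></pre>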
<h2>Supporting Filter Chains</h2>
<p>HDF5 supports <em>filter chains</em>: a sequence of filters in which the output of one filter is provided as input to the next filter in the sequence.
When encoding, the filters are executed in the "forward" direction,
while when decoding the filters are executed in the "reverse" direction.</p>
<p>In the Zarr meta-data, a filter chain is divided into two parts:
the "compressor" and the "filters". The former is a single JSON codec
as described above. The latter is an ordered JSON array of codecs.
So if the compressor is something like
<code>"compressor": {"id": "c"...}</code>
and the filters array is like
<code>"filters": [ {"id": "f1"...}, {"id": "f2"...}, ..., {"id": "fn"...} ]</code>,
then the filter chain is (f1, f2, ..., fn, c), with f1 applied first and c applied last when encoding. On decode, the filter chain is executed in the order (c, fn, ..., f2, f1).</p>
<p>So, an HDF5 filter chain is divided into two parts, where the last filter in the chain is assigned as the "compressor" and the remaining
filters are assigned as the "filters".
But independent of this, each codec, whether a compressor or a filter,
is stored in the JSON dictionary form described earlier.</p>
<h2>Extensions</h2>
<p>The Codec style, using JSON, has the ability to provide very complex parameters that may be hard to encode as a vector of unsigned integers.
It might be desirable to consider exporting a JSON-based API from the netcdf-c library to support user access to this complexity.
This would mean providing some alternate version of <em>nc_def_var_filter</em> that takes a string-valued argument instead of a vector of unsigned ints.</p>
<p>One bad side effect of this is that we may then have two classes of plugins: one class that can be used by both HDF5 and NCZarr, and a second class that is usable only with NCZarr.</p>
<p>This extension is unlikely to be implemented until a compelling use-case is encountered. </p>
<h2>Summary</h2>
<p>This document outlines the proposed process by which NCZarr utilizes existing HDF5 filters.
At the same time, it describes the mechanisms to support storing filter metadata in the NCZarr container using the Zarr compliant Codec style representation of filters and their parameters.</p>
<h2><a name="AppendixA">Appendix A. Codec API</a></h2>
<p>The Codec API mirrors the HDF5 API closely. It has one well-known function that can be invoked to obtain information about the Codec as well as pointers to special functions to perform conversions.</p>
<p>Note that this Appendix is only an initial proposal and is subject to change.</p>
<h3>NCZ_get_codec_info</h3>
<p>This function returns a pointer to a C struct that provides detailed information about the codec converter.</p>
<h4>Signature</h4>
<pre><code>struct NCZ_codec_t NCZ_get_codec_info(void);
</code></pre>
<h3>NCZ_codec_t</h3>
<pre><code>typedef struct NCZ_codec_t {
int version; /* Version number of the struct */
int sort; /* Format of remainder of the struct;
Currently always NCZ_CODEC_HDF5 */
const char* id; /* The name/id of the codec */
const unsigned int hdf5id; /* corresponding hdf5 id */
int (*NCZ_codec_to_hdf5)(const char* codec, int* nparamsp, unsigned** paramsp);
int (*NCZ_hdf5_to_codec)(int nparams, unsigned* params, char** codecp);
} NCZ_codec_t;
</code></pre>
<p>The key to this struct is the two function pointers that do the conversion between codec JSON and HDF5 parameters.</p>
<h3>NCZ_codec_to_hdf5</h3>
<p>Given a JSON Codec representation, it returns a corresponding vector of unsigned integers for use with HDF5.</p>
<h4>Signature</h4>
<pre><code>int (*NCZ_codec_to_hdf5)(const char* codec, int* nparamsp, unsigned** paramsp);
</code></pre>
<h4>Arguments</h4>
<ol>
<li>codec -- (in) ptr to JSON string representing the codec.</li>
<li>nparamsp -- (out) store the length of the converted HDF5 unsigned vector</li>
<li>paramsp -- (out) store a pointer to the converted HDF5 unsigned vector; caller must free the returned vector. Note the double indirection.</li>
</ol>
<p>Return Value: a netcdf-c error code.</p>
<h3>NCZ_hdf5_to_codec</h3>
<p>Given an HDF5 vector of unsigned integers and its length, return the corresponding JSON codec representation.</p>
<h4>Signature</h4>
<pre><code>int (*NCZ_hdf5_to_codec)(int nparamsp, unsigned* paramsp, char** codecp);
</code></pre>
<h4>Arguments</h4>
<ol>
<li>nparams -- (in) the length of the HDF5 unsigned vector</li>
<li>params -- (in) pointer to the HDF5 unsigned vector.</li>
<li>codecp -- (out) store the string representation of the codec; caller must free.</li>
</ol>
<p>Return Value: a netcdf-c error code.</p>
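<p>To show how the pieces might fit together, here is a sketch of a Codec-only plugin written against the proposed API. Everything in it is hypothetical: the struct and constants are inlined because no header is specified by this proposal, the "demo" filter and its HDF5 id are invented, and the JSON handling is reduced to hard-wired strings.</p>
<pre><code>#include &lt;stdio.h&gt;
#include &lt;stdlib.h&gt;
#include &lt;string.h&gt;

#define NC_NOERR 0          /* matches the netcdf-c success code */
#define NCZ_CODEC_HDF5 1    /* placeholder value for the sort field */

typedef struct NCZ_codec_t {
    int version;            /* version number of the struct */
    int sort;               /* format of the remainder; NCZ_CODEC_HDF5 */
    const char* id;         /* the name/id of the codec */
    unsigned int hdf5id;    /* corresponding HDF5 filter id */
    int (*NCZ_codec_to_hdf5)(const char* codec, int* nparamsp, unsigned** paramsp);
    int (*NCZ_hdf5_to_codec)(int nparams, unsigned* params, char** codecp);
} NCZ_codec_t;

/* Convert {"id": "demo", "level": N} to a one-element parameter vector.
 * Real code would parse the JSON; here the level is hard-wired. */
static int demo_codec_to_hdf5(const char* codec, int* nparamsp, unsigned** paramsp) {
    unsigned* p = (unsigned*)malloc(sizeof(unsigned));
    (void)codec;
    if (p == NULL) return 1;
    p[0] = 5;
    *nparamsp = 1;
    *paramsp = p;           /* caller frees, per the proposal */
    return NC_NOERR;
}

/* Convert a parameter vector back to the codec JSON form. */
static int demo_hdf5_to_codec(int nparams, unsigned* params, char** codecp) {
    char buf[64];
    unsigned level = (nparams &gt; 0) ? params[0] : 0;
    size_t n;
    char* out;
    snprintf(buf, sizeof(buf), "{\"id\": \"demo\", \"level\": %u}", level);
    n = strlen(buf) + 1;
    out = (char*)malloc(n);
    if (out == NULL) return 1;
    memcpy(out, buf, n);
    *codecp = out;          /* caller frees */
    return NC_NOERR;
}

static const NCZ_codec_t demo_codec = {
    1, NCZ_CODEC_HDF5, "demo", 32768u /* hypothetical HDF5 id */,
    demo_codec_to_hdf5, demo_hdf5_to_codec
};

/* Well-known entry point interrogated by the netcdf-c library. */
const NCZ_codec_t* NCZ_get_codec_info(void) { return &demo_codec; }
</code></pre>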
https://www.unidata.ucar.edu/blogs/developer/entry/netcdf-4-filter-support-changes
Netcdf-4 Filter Support Changes
Dennis Heimbigner, 2020-10-15
<p>The netcdf-c library filters API in version 4.7.4 has been deprecated in favor of a modified version that unfortunately may cause incompatibilities for users.</p>
<p>The initial reason for the incompatible changes was to support the use of filters in the new NCZarr code. The changes were not completely thought out so it was decided to remove them and revert to previous behaviors. At some future point, the filter mechanism will be extended to support filters for NCZarr, but these will be proper extensions: the existing, reverted, filter API will continue to be supported with no user-visible modifications.</p>
<p>Unfortunately, some advanced users of netcdf filters may experience some compilation or execution problems for previously working code because of these reversions. In that case, please revise your code. Apologies are extended for any inconvenience. Note that it is possible to detect which mechanism is in place at build time.</p>
<p>In summary, the changes are of the following kinds:</p>
<ul>
<li>Some functions were renamed for consistency.</li>
<li>Revert the way that the function <em>nc_inq_var_filter</em> was indicating no filters existed.</li>
<li>Some auxiliary functions for parsing textual filter specifications have been moved to <em>netcdf_aux.h</em>.</li>
<li>All of the "filterx" functions have been removed.</li>
<li>The undocumented function <em>nc_filter_remove</em> was deleted.</li>
</ul>
<p>See the GitHub document <a href="https://github.com/Unidata/netcdf/blob/master/NUG/filters.md">https://github.com/Unidata/netcdf/blob/master/NUG/filters.md</a> for details.</p>
https://www.unidata.ucar.edu/blogs/developer/entry/highlights-from-my-summer-internship
Highlights From My Summer Internship With Unidata
Unidata News, 2020-07-30
<a class="lightbox" title="Lauren Prox" href="/blog_content/images/2020/20200526_lprox.jpg">
<img width="150" src="/blog_content/images/2020/20200526_lprox.jpg" alt="Lauren Prox" />
</a>
<div class="caption">
Lauren Prox
</div>
<p></div></p>
<p class="byline">
by
Lauren Prox
<br />2020 Unidata summer intern
</p>
<p>
During the beginning of my internship, I devoted a great deal of time learning how to use Git and Github to collaborate on
software development projects. After gaining this experience, I began improving documentation for a
variety of Unidata remote repositories. I started with the netCDF-C repository and then moved on to the
MetPy, Siphon, and Python Training remote repositories. This work was significant as it ensured that
software users were able to locate resources, properly download software, and learn how to operate the
software via informational materials.</p>
<p style="font-style: italic;">
Editor's Note:<br>
Due to the COVID-19 pandemic, Unidata's 2020 summer interns did not travel to
Boulder to work on their projects in person. Instead, they interacted with Unidata
developers through Slack, Zoom, and other electronic means.
</p>
<div class="img_l" style="width: 150px;">
<a class="lightbox" title="Lauren Prox" href="/blog_content/images/2020/20200526_lprox.jpg">
<img width="150" src="/blog_content/images/2020/20200526_lprox.jpg" alt="Lauren Prox" />
</a>
<div class="caption">
Lauren Prox
</div>
</div>
<p class="byline">
by
<a href="/community/internship/#2020lp">Lauren Prox</a>
<br />2020 Unidata summer intern
</p>
<p>
At the beginning of my internship, I devoted a great deal of time to learning how to use <a
href="https://git-scm.com/">Git</a> and <a href="https://github.com/">Github</a> to collaborate on
software development projects. After gaining this experience, I began improving documentation for a
variety of Unidata remote repositories. I started with the <a href="https://github.com/Unidata/netcdf-c">netCDF-C</a> repository and then moved on to the
<a href="https://github.com/Unidata/MetPy">MetPy</a>, <a href="https://github.com/Unidata/siphon">Siphon</a>, and <a href="https://github.com/Unidata/python-training">Python Training</a> remote repositories. This work was significant as it ensured that
software users were able to locate resources, properly download software, and learn how to operate the
software via informational materials.</p>
<p>I also provided feedback regarding the installation process for
NetCDF and its various libraries. This led me to investigate how users access and work with NetCDF data
using tools such as MATLAB and AWS S3. From this work, I found that there are several resources
available for users who wish to use NetCDF data and AWS S3 buckets, including AWS S3’s <a
href="https://docs.aws.amazon.com/AmazonS3/latest/dev/Welcome.html">Developer Guide</a> and MATLAB’s <a
href="https://www.mathworks.com/help/matlab/getting-started-with-matlab.html?s_tid=CRUX_lftnav">Getting
Started</a> Guide. In mentioning these great resources, I would like to take the opportunity to mention
a Unidata resource that provides tutorials and resources concerning python skills and atmospheric
science education. Though I may be slightly biased after working on this website, Unidata’s <a
href="https://unidata.github.io/python-training/">Python Training</a> website is a wonderful resource
for people wishing to gain more coding experience while using real-world atmospheric science data.
</p>
<div class="img_r" style="width: 200px;">
<a class="lightbox" title="It's easy to contribute to an Open Source project and
project leaders welcome submissions from anyone who is interested in collaborating!"
href="/blog_content/images/2020/20200731_lprox_PR106.png">
<img width="200" src="/blog_content/images/2020/20200731_lprox_PR106.png" alt="GitHub PR screen shot"/>
</a>
<div class="caption">
An example of one of my merged pull requests<br>(click to enlarge)
</div>
</div>
<p>
While working on these projects, I simultaneously participated in educational courses that broadened my
knowledge of software development, data stewardship, and scientific research methods. In addition to
completing a software development mini-course with my institution, I also completed a Structured Query
Language (SQL) course offered by UCAR. I had frequently heard of SQL in my engineering courses, but I
never actually took the time to explore this language until this summer. Because my internship was
remote, I was also able to attend virtual conferences such as <a
href="https://www.earthcube.org/EC2020">EarthCube</a>, <a
href="https://2020esipsummermeeting.sched.com/">ESIP</a>, and <a
href="https://pearc.acm.org/pearc20/">PEARC</a>. These conferences were a great way to network with
users of Unidata software and to learn of new developments within the fields of scientific data storage
and sharing. The flexibility of my internship position allowed me to take advantage of countless
opportunities, which helped me gain invaluable knowledge that I will continue to use as a scientific
researcher.
</p>
<p>
This internship was such a rewarding experience because I actually applied what I learned in my
computer science courses to fix real-world problems. The moment when <a
href="https://github.com/Unidata/netcdf-c/pull/1741">my first pull request</a> —
a software documentation
correction in which I replaced a few broken web links — was merged onto a Unidata
repository’s master branch felt amazing. When I successfully revised the script to include
the proper links, my changes were approved and are now included in the official documentation for the
netCDF Github repository. What made this accomplishment even sweeter was the way that my coworkers
celebrated my progress. I am so appreciative for the Unidata and greater NCAR community because they
provided a supportive, engaging, and fun environment despite the remote setting of this internship. It
is because of this that I look forward to working with NCAR and Unidata in the future. Additionally, I
would highly recommend the <a
href="https://www.unidata.ucar.edu/community/internship/">Unidata Summer
Internship</a> program to any student seeking to
bridge the gap between their environmental science coursework and their computer science skills.
</p>
https://www.unidata.ucar.edu/blogs/developer/entry/overview-of-zarr-support-in
Overview of Zarr Support in netCDF-C
Dennis Heimbigner, 2020-06-25 (last revised 2023-06-04)
<p><b>Note: This document is obsolete. Please refer to this <u><a href="https://docs.unidata.ucar.edu/netcdf-c/current/md__media_psf_Home_Desktop_netcdf_releases_v4_9_2_release_netcdf_c_docs_nczarr.html">document</a></u></b></p>
<p>Beginning with netCDF version 4.8.0, the Unidata NetCDF group has extended the netcdf-c library to provide access to cloud storage (e.g. Amazon S3 <a href="#ref_aws">[1]</a>) by providing a mapping from a subset of the full netCDF Enhanced (aka netCDF-4) data model to a variant of the Zarr <a href="#ref_zarrv2">[6]</a> data model that already has mappings to key-value pair cloud storage systems.
The NetCDF version of this storage format is called NCZarr <a href="#ref_nczarr">[4]</a>.</p>
<h3>The NCZarr Data Model</h3>
<p>NCZarr uses a data model <a href="#ref_nczarr">[4]</a> that is, by design, similar to, but not identical with the Zarr Version 2 Specification <a href="#ref_zarrv2">[6]</a>. <br />
Briefly, the data model supported by NCZarr is netcdf-4 minus the user-defined types and the String type.
As with netcdf-4 it supports chunking.
Eventually it will also support filters in a manner similar to the way filters are supported in netcdf-4.</p>
<p>Specifically, the model supports the following.</p>
<ul>
<li>"Atomic" types: char, byte, ubyte, short, ushort, int, uint, int64, uint64.</li>
<li>Shared (named) dimensions</li>
<li>Attributes with specified types -- both global and per-variable</li>
<li>Chunking</li>
<li>Fill values</li>
<li>Groups</li>
<li>N-Dimensional variables</li>
<li>Per-variable endianness (big or little)</li>
</ul>
<p>With respect to full netCDF-4, the following concepts are
currently unsupported.</p>
<ul>
<li>String type</li>
<li>User-defined types (enum, opaque, VLEN, and Compound)</li>
<li>Unlimited dimensions</li>
<li>Contiguous or compact storage</li>
</ul>
<p>Note that contiguous and compact storage are not actually supported because they are HDF5 specific.
When specified, they are treated as chunked, where the file consists of only one chunk.
This means that testing for contiguous or compact storage is not possible; the <em>nc_inq_var_chunking</em> function will always return NC_CHUNKED, and the chunk sizes will be the same as the dimension sizes of the variable's dimensions.</p>
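<p>A minimal sketch of that behavior, assuming an NCZarr dataset already exists at the illustrative path below and contains a variable named "v":</p>
<pre><code>#include &lt;stdio.h&gt;
#include &lt;netcdf.h&gt;

int main(void) {
    int ncid, varid, storage, ndims;
    size_t chunks[NC_MAX_VAR_DIMS];
    /* Illustrative URL; any existing NCZarr dataset would do. */
    const char* url = "file:///tmp/dataset.file#mode=nczarr,file";

    if (nc_open(url, NC_NOWRITE, &ncid) != NC_NOERR) return 1;
    if (nc_inq_varid(ncid, "v", &varid) != NC_NOERR) return 1;
    nc_inq_varndims(ncid, varid, &ndims);
    nc_inq_var_chunking(ncid, varid, &storage, chunks);
    /* For NCZarr variables, storage is reported as NC_CHUNKED. */
    printf("storage = %s\n", storage == NC_CHUNKED ? "chunked" : "contiguous/compact");
    for (int i = 0; i &lt; ndims; i++)
        printf("chunk[%d] = %zu\n", i, chunks[i]);
    nc_close(ncid);
    return 0;
}
</code></pre>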
<h3>Enabling NCZarr Support</h3>
<p>NCZarr support is enabled if the <em>--enable-nczarr</em> option is used with './configure'.
If NCZarr support is enabled, then a usable version of <em>libcurl</em> must be specified using the <em>LDFLAGS</em> environment variable (similar to the way that the <em>HDF5</em> libraries are referenced).
Refer to the installation manual for details. NCZarr support can be disabled using the <em>--disable-nczarr</em> option.</p>
<h3>Accessing Data Using the NCZarr Protocol</h3>
<p>In order to access a NCZarr data source through the netCDF API, the file name normally used is replaced with a URL with a specific format.
Note specifically that there is no NC_NCZARR flag for the mode argument of <em>nc_create</em> or <em>nc_open</em>.
Instead, the NCZarr format is indicated by the URL path.</p>
<h4>URL Format</h4>
<p>The URL is the usual scheme://host:port/path?query#fragment format. There are some details that are important.</p>
<ul>
<li>Scheme: this should be <em>https</em>, <em>s3</em>, or <em>file</em>.
The <em>s3</em> scheme is equivalent to "https" plus setting "mode=nczarr,s3" (see below).
Specifying "file" is mostly used for testing, but is also used to support directory tree or zipfile format storage.</li>
<li>Host: Amazon S3 defines two forms: <em>Virtual</em> and <em>Path</em>.
<ul>
<li><em>Virtual</em>: the host includes the bucket name, as in <strong>bucket.s3.&lt;region&gt;.amazonaws.com</strong>.</li>
<li><em>Path</em>: the host does not include the bucket name; rather, the bucket name is the first segment of the path.
For example, <strong>s3.&lt;region&gt;.amazonaws.com/bucket</strong>.</li>
<li><em>Other</em>: It is possible to use other, non-Amazon cloud storage, but that is cloud library dependent.</li>
</ul></li>
<li>Query: currently not used.</li>
<li>Fragment: the fragment is of the form <em>key=value&key=value&...</em>.
Depending on the key, the <em>=value</em> part may be left out and some default value will be used.</li>
</ul>
<h4>Client Parameters</h4>
<p>The fragment part of a URL is used to specify information that is interpreted to specify what data format is to be used, as well as additional controls for that data format.
For NCZarr support, the following <em>key=value</em> pairs are allowed.</p>
<ul>
<li>mode=nczarr|zarr|s3|file|zip... -- The mode key specifies
the particular format to be used by the netcdf-c library for
interpreting the dataset specified by the URL. Using <em>mode=nczarr</em>
causes the URL to be interpreted as a reference to a dataset
that is stored in NCZarr format. The modes <em>s3</em>, <em>file</em>, and <em>zip</em>
tell the library what storage driver to use. The <em>s3</em> mode is the default
and indicates using Amazon S3 or some equivalent. The <em>file</em> format
stores data in a directory tree. The <em>zip</em> format stores data
in a local zip file. It should be the case that zipping a <em>file</em>
format directory tree will produce a file readable by the <em>zip</em>
storage format. The <em>zarr</em> mode tells the
library to use NCZarr, but to restrict its operation to operate on
pure Zarr Version 2 datasets.</li>
</ul>
<!--
- log=<output-stream>: this control turns on logging output,
which is useful for debugging and testing. If just _log_ is used
then it is equivalent to _log=stderr_.
-->
<h3>NCZarr Map Implementation</h3>
<p>Internally, the nczarr implementation has a map abstraction that allows different storage formats to be used.
This is closely patterned on the same approach used in the Python Zarr implementation, which relies on the Python <em>MutableMapping</em> <a href="#ref_python">[5]</a> class.</p>
<p>In NCZarr, the corresponding type is called <em>zmap</em>.
The <em>zmap</em> API essentially implements a simplified variant
of the Amazon S3 API. </p>
<p>As with Amazon S3, <em>keys</em> are utf8 strings with a specific structure:
that of a path similar to those of a Unix path with '/' as the
separator for the segments of the path.</p>
<p>As with Unix, all keys have this BNF syntax:</p>
<pre><code>key: '/' | keypath ;
keypath: '/' segment | keypath '/' segment ;
segment: &lt;sequence of UTF-8 characters except control characters and '/'&gt;
</code></pre>
<p>Obviously, one can infer a tree structure from this key structure.
A containment relationship is defined by key prefixes.
Thus one key is "contained" (possibly transitively)
by another if one key is a prefix (in the string sense) of the other.
So in this sense the key "/x/y/z" is contained by the key "/x/y".</p>
<p>In this model all keys "exist" but only some keys refer to
objects containing content -- <em>content bearing</em>.
An important restriction is placed on the structure of the tree,
namely that keys are only defined for content-bearing objects.
Further, all the leaves of the tree are these content-bearing objects.
This means that the key for one content-bearing object should not
be a prefix of any other key.</p>
<p>There are several other concepts of note.</p>
<ol>
<li><strong>Dataset</strong> - a dataset is the complete tree contained by the key defining
the root of the dataset. Technically, the root of the tree is the key &lt;dataset&gt;/.nczarr, where .nczarr can be considered the <em>superblock</em> of the dataset.</li>
<li><strong>Object</strong> - the equivalent of the S3 object; each object has a unique key
and "contains" data in the form of an arbitrary sequence of 8-bit bytes.</li>
</ol>
<p>The zmap API defined here isolates the key-value pair mapping
code from the Zarr-based implementation of NetCDF-4. It wraps
an internal C dispatch table manager for implementing an
abstract data structure that realizes the zmap key/object model.</p>
<p><strong>Search</strong>: The search function has two purposes:</p>
<ol>
<li>Support reading of pure zarr datasets (because they do not explicitly track their contents).</li>
<li>Debugging, to allow raw examination of the storage. See zdump for example.</li>
</ol>
<p>The search function takes a prefix path which has a key syntax
(see above). The set of legal keys is the set of keys such that
the key references a content-bearing object -- e.g. /x/y/.zarray
or /.zgroup. Essentially this is the set of keys pointing to the
leaf objects of the tree of keys constituting a dataset. This
set potentially limits the set of keys that need to be examined
during search.</p>
<p>The search function returns a limited set of names, where the
set of names are immediate suffixes of a given prefix path.
That is, if <em>&lt;prefix&gt;</em> is the prefix path, then search returns
all <em>&lt;name&gt;</em> such that <em>&lt;prefix&gt;/&lt;name&gt;</em> is itself a prefix
of a "legal" key. This can be used to implement glob-style
searches such as "/x/y/*" or "/x/y/**".</p>
<p>This semantics was chosen because it appears to be the minimum required to implement all other kinds of search using recursion. It was also chosen
to limit the number of names returned from the search. Specifically:</p>
<ol>
<li>Avoid returning keys that are not a prefix of some legal key.</li>
<li>Avoid returning all the legal keys in the dataset, because that set may be very large; although the implementation may still have to examine all legal keys to get the desired subset.</li>
<li>Allow for the use of partial read mechanisms such as iterators, if available. This can support processing a limited set of keys for each iteration. This is a straightforward tradeoff of space over time.</li>
</ol>
<p>As a side note, S3 supports this kind of search using common
prefixes with a delimiter of '/', although the implementation is
a bit tricky. For the file system zmap implementation, the legal
search keys can be obtained one level at a time, which directly
implements the search semantics. For the zip file
implementation, this semantics is not possible, so the whole
tree must be obtained and searched.</p>
<p><strong>Issues:</strong></p>
<ol>
<li>S3 limits key lengths to 1024 bytes. Some deeply nested netcdf files
will almost certainly exceed this limit.</li>
<li>Besides content, S3 objects can have an associated small set
of what may be called tags, which are themselves of the form of
key-value pairs, but where the key and value are always text. As
far as it is possible to determine, Zarr never uses these tags,
so they are not included in the zmap data structure.</li>
</ol>
<p><strong>A Note on Error Codes:</strong></p>
<p>The zmap API returns two distinguished error codes:</p>
<ol>
<li>NC_NOERR if an operation succeeded.</li>
<li>NC_EEMPTY, returned when accessing a key that has no content.</li>
</ol>
<p>Note that NC_EEMPTY is a new error code to signal that the
caller asked for a non-content-bearing key.</p>
<p>This does not preclude other errors being returned, such as
NC_EACCESS, NC_EPERM, or NC_EINVAL if there are permission
errors or illegal function arguments, for example. It also does
not preclude the use of other error codes internal to the zmap
implementation. So zmap_file, for example, uses NC_ENOTFOUND
internally because it is possible to detect the existence of
directories and files. This does not propagate outside the zmap_file
implementation.</p>
<h4>Zmap Implementations</h4>
<p>The primary zmap implementation is <em>s3</em> (i.e. <em>mode=nczarr,s3</em>)
and indicates that the Amazon S3 cloud storage
-- or some related appliance -- is to be used.
Another storage format uses a file system tree of directories and
files (<em>mode=nczarr,file</em>).
A third storage format uses a zip file (<em>mode=nczarr,zip</em>).
The latter two are used mostly for
debugging and testing. However, the <em>file</em> and <em>zip</em> formats
are important because they are intended to match the corresponding
storage formats used by the Python Zarr implementation. Hence
they should serve to provide interoperability between NCZarr and
the Python Zarr implementation. This has not been tested.</p>
<p>Examples of the typical URL form for <em>file</em> and <em>zip</em> are as follows.</p>
<pre><code>file:///xxx/yyy/testdata.file#mode=nczarr,file
file:///xxx/yyy/testdata.zip#mode=nczarr,zip
</code></pre>
<p>Note that the extension (e.g. ".file" in "testdata.file")
is arbitrary, so this would be equally acceptable.</p>
<pre><code>file:///xxx/yyy/testdata.anyext#mode=nczarr,file
</code></pre>
<p>As with other URLs (e.g. DAP), these kinds of URLs can be passed
as the path argument to <strong>ncdump</strong>, for example.</p>
<h3>NCZarr versus Pure Zarr.</h3>
<p>The NCZarr format extends the pure Zarr format by adding extra objects such as <em>.nczarr</em> and <em>.nczvar</em>.
It is possible to suppress the use of these extensions so that the netcdf library can read and write a pure zarr formatted file.
This is controlled by using the <em>mode=nczarr,zarr</em> combination.
The primary effects of using pure zarr are described
in the Translation section below.</p>
<h3>Notes on Debugging NCZarr Access</h3>
<p>The NCZarr support has a trace facility.
Enabling this can sometimes give important information.
Tracing can be enabled by setting the environment variable NCTRACING=n,
where <em>n</em> indicates the level of tracing. A good value of <em>n</em> is 9.</p>
<h3>Zip File Support</h3>
<p>In order to use the <em>zip</em> storage format, the libzip <a href="#ref_libzip">[3]</a>
library must be installed. Note that this is different from zlib.</p>
<h3>Amazon S3 Storage</h3>
<p>The Amazon AWS S3 storage driver currently uses the Amazon AWS S3 Software Development Kit for C++ (aws-s3-sdk-cpp).
In order to use it, the client must provide some configuration information.
Specifically, the <code>~/.aws/config</code> file should contain something like this.</p>
<pre><code>[default]
output = json
aws_access_key_id=XXXX...
aws_secret_access_key=YYYY...
</code></pre>
<h4>Addressing Style</h4>
<p>The notion of "addressing style" may need some expansion. Amazon S3 accepts two forms for specifying the endpoint for accessing the data.</p>
<ol>
<li><p>Virtual -- the virtual addressing style places the bucket in
the host part of a URL. For example:</p>
<pre><code>https://&lt;bucketname&gt;.s3.&lt;region&gt;.amazonaws.com/
</code></pre></li>
<li><p>Path -- the path addressing style places the bucket
at the front of the path part of a URL. For example:</p>
<pre><code>https://s3.&lt;region&gt;.amazonaws.com/&lt;bucketname&gt;/
</code></pre></li>
</ol>
<p>The NCZarr code will accept either form, although internally, it is standardized on path style.
The reason for this is that the bucket name forms the initial segment in the keys.</p>
<h3>Zarr vs NCZarr</h3>
<h4>Data Model</h4>
<p>The NCZarr storage format is almost identical to that of the
standard Zarr version 2 format. The data model differs as
follows.</p>
<ol>
<li>Zarr supports filters -- NCZarr as yet does not</li>
<li>Zarr only supports anonymous dimensions -- NCZarr supports
only shared (named) dimensions.</li>
<li>Zarr attributes are untyped -- or perhaps more correctly
characterized as of type string.</li>
</ol>
<h4>Storage Format</h4>
<p>Consider both NCZarr and Zarr, and assume S3 notions of bucket and object.
In both systems, Groups and Variables (Array in Zarr) map to S3 objects.
Containment is modeled using the fact that the container's key is a prefix of the variable's key.
So for example, if variable <em>v1</em> is contained in the top-level group <em>g1</em> -- key <em>/g1</em> -- then the key for <em>v1</em> is <em>/g1/v1</em>.
Additional information is stored in special objects whose names start with ".z".</p>
<p>In Zarr, the following special objects exist.</p>
<ol>
<li>Information about a group is kept in a special object named
<em>.zgroup</em>; so for example the object <em>/g1/.zgroup</em>.</li>
<li>Information about an array is kept as a special object named <em>.zarray</em>;
so for example the object <em>/g1/v1/.zarray</em>.</li>
<li>Group-level attributes and variable-level attributes are stored
in a special object named <em>.zattr</em>;
so for example the objects <em>/g1/.zattr</em> and <em>/g1/v1/.zattr</em>.</li>
</ol>
<p>The NCZarr format uses the same group and variable (array) objects as Zarr.
It also uses the Zarr special <em>.zXXX</em> objects.</p>
<p>However, NCZarr adds some additional special objects.</p>
<ol>
<li><p><em>.nczarr</em> -- this is in the top level group -- key <em>/.nczarr</em>.
It is in effect the "superblock" for the dataset and contains
any netcdf specific dataset level information. It is also used
to verify that a given key is the root of a dataset.</p></li>
<li><p><em>.nczgroup</em> -- this is a parallel object to <em>.zgroup</em> and contains any netcdf specific group information. Specifically it contains the following.</p>
<ul><li>dims -- the name and size of shared dimensions defined in this group.</li>
<li>vars -- the name of variables defined in this group.</li>
<li>groups -- the name of sub-groups defined in this group.</li></ul>
<p>These lists allow walking the NCZarr dataset without having to use
the potentially costly S3 list operation.</p></li>
<li><p><em>.nczvar</em> -- this is a parallel object to <em>.zarray</em> and contains
netcdf specific information. Specifically it contains the following.</p>
<ul><li>dimrefs -- the names of the shared dimensions referenced by the variable.</li>
<li>storage -- indicates if the variable is chunked vs contiguous
in the netcdf sense.</li></ul></li>
<li><p><em>.nczattr</em> -- this is parallel to the .zattr objects and stores
the attribute type information.</p></li>
</ol>
<h4>Translation</h4>
<p>With some constraints, it is possible for an nczarr library to read Zarr and for a zarr library to read the nczarr format.
The latter case, zarr reading nczarr, is possible if the zarr library is willing to ignore objects whose names it does not recognize; specifically, anything beginning with <em>.ncz</em>.</p>
<p>The former case, nczarr reading zarr, is also possible if the nczarr library can simulate or infer the contents of the missing <em>.nczXXX</em> objects.
As a rule this can be done as follows.</p>
<ol>
<li><em>.nczgroup</em> -- The list of contained variables and sub-groups
can be computed using the search API to list the keys
"contained" in the key for a group. By looking for occurrences
of <em>.zgroup</em>, <em>.zattr</em>, and <em>.zarray</em> to infer the keys for the
contained groups, attribute sets, and arrays (variables).
Constructing the set of "shared dimensions" is carried out
by walking all the variables in the whole dataset and collecting
the set of unique integer shapes for the variables.
For each such dimension length, a top level dimension is created
named ".zdim_<len>" where len is the integer length. The name
is subject to change.</li>
<li><em>.nczvar</em> -- The dimrefs are inferred by using the shape
in <em>.zarray</em> and creating references to the simulated shared dimension.
netcdf specific information.</li>
<li><em>.nczattr</em> -- The type of each attribute is inferred by trying to parse the first attribute value string.</li>
</ol>
<h3>Compatibility</h3>
<p>In order to accommodate existing implementations, certain mode tags are provided to tell the NCZarr code to look for information used by specific implementations.</p>
<h4>XArray</h4>
<p>The Xarray
<a href="#ref_xarray">[7]</a>
Zarr implementation uses its own mechanism for
specifying shared dimensions. It uses a special
attribute named <em>_ARRAY_DIMENSIONS</em>.
The value of this attribute is a list of dimension names (strings).
An example might be <code>["time", "lon", "lat"]</code>.
It is essentially equivalent to the
<code>.nczvar/dimrefs list</code>, but stored as a specific variable attribute.
It will be read/written if and only if the mode value "xarray" is specified.
If enabled and detected, then these dimension names are used
to define shared dimensions. Note that xarray implies pure zarr format.</p>
<h3>Examples</h3>
<p>Here are a couple of examples using the <em>ncgen</em> and <em>ncdump</em> utilities.</p>
<ol>
<li>Create an nczarr file using a local directory tree as storage.
<pre><code>ncgen -4 -lb -o "file:///home/user/dataset.file#mode=nczarr,file" dataset.cdl
</code></pre></li>
<li>Display the content of an nczarr file using a local zip file as storage.
<pre><code>ncdump "file:///home/user/dataset.zip#mode=nczarr,zip"
</code></pre></li>
<li>Create an nczarr file using S3 as storage.
<pre><code>ncgen -4 -lb -o "s3://s3.us-west-1.amazonaws.com/datasetbucket" dataset.cdl
</code></pre></li>
<li>Create an nczarr file using S3 as storage while keeping to the pure
zarr format.
<pre><code>ncgen -4 -lb -o "s3://s3.us-west-1.amazonaws.com/datasetbucket#mode=zarr" dataset.cdl
</code></pre></li>
</ol>
<h3>References</h3>
<p><a name="ref_aws"><a href="https://docs.aws.amazon.com/s3/index.html">1]</a> [Amazon Simple Storage Service Documentation</a><br>
<a name="ref_awssdk"><a href="https://github.com/aws/aws-sdk-cpp">2]</a> [Amazon Simple Storage Service Library</a><br>
<a name="ref_libzip"><a href="https://libzip.org/">3]</a> [The LibZip Library</a><br>
<a name="ref_nczarr"><a href="https://www.unidata.ucar.edu/blogs/developer/en/entry/netcdf-zarr-data-model-specification">4]</a> [NetCDF ZARR Data Model Specification</a><br>
<a name="ref_python"><a href="https://docs.python.org/2/library/collections.html">5]</a> [Python Documentation: 8.3. collections — High-performance container datatypes</a><br>
<a name="ref_zarrv2"><a href="https://zarr.readthedocs.io/en/stable/spec/v2.html">6]</a> [Zarr Version 2 Specification</a><br>
<a name="ref_xarray"><a href="http://xarray.pydata.org/en/latest/internals.html#zarr-encoding-specification">7]</a> [XArray Zarr Encoding Specification</a><br></p>
<h3>Appendix A. Building NCZarr Support</h3>
<p>Currently the following build cases are known to work.</p>
<table>
<tr><td><u>Operating System</u><td><u>Build System</u><td><u>NCZarr</u><td><u>S3 Support</u>
<tr><td>Linux <td> Automake <td> yes <td> yes
<tr><td>Linux <td> CMake <td> yes <td> yes
<tr><td>Cygwin <td> Automake <td> yes <td> no
<tr><td>OSX <td> Automake <td> unknown <td> unknown
<tr><td>OSX <td> CMake <td> unknown <td> unknown
<tr><td>Visual Studio <td> CMake <td> yes <td> tests fail
</table>
<p>Note: S3 support includes both compiling the S3 support code as well as running the S3 tests.</p>
<h3>Automake</h3>
<p>There are several options relevant to NCZarr support and to Amazon S3 support.
These are as follows.</p>
<ol>
<li><em>--enable-nczarr</em> -- enable the NCZarr support. If disabled, then all of the following options are disabled or irrelevant.</li>
<li><em>--enable-nczarr-s3</em> -- Enable NCZarr S3 support.</li>
<li><em>--enable-nczarr-s3-tests</em> -- the NCZarr S3 tests are currently only usable by Unidata personnel, so they are disabled by default.</li>
</ol>
<p>A note about using S3 with Automake. If S3 support is desired, and using Automake, then LDFLAGS must be properly set, namely to this.</p>
<pre><code>LDFLAGS="$LDFLAGS -L/usr/local/lib -laws-cpp-sdk-s3"
</code></pre>
<p>The above assumes that these libraries were installed in '/usr/local/lib', so the above requires modification if they were installed elsewhere.</p>
<p>Note also that if S3 support is enabled, then you need to have a C++ compiler installed because part of the S3 support code is written in C++.</p>
<h3>CMake</h3>
<p>The necessary CMake flags are as follows (with defaults)</p>
<ol>
<li>-DENABLE_NCZARR=on -- equivalent to the Automake <em>--enable-nczarr</em> option.</li>
<li>-DENABLE_NCZARR_S3=off -- equivalent to the Automake <em>--enable-nczarr-s3</em> option.</li>
<li>-DENABLE_NCZARR_S3_TESTS=off -- equivalent to the Automake <em>--enable-nczarr-s3-tests</em> option.</li>
</ol>
<p>Note that unlike Automake, CMake can properly locate C++ libraries, so it should not be necessary to specify <em>-laws-cpp-sdk-s3</em> assuming that the aws s3 libraries are installed in the default location.
For CMake with Visual Studio, the default location is here:</p>
<pre><code>C:/Program Files (x86)/aws-cpp-sdk-all
</code></pre>
<p>It is possible to install the sdk library in another location.
In this case, one must add the following flag to the cmake command.</p>
<pre><code>cmake ... -DAWSSDK_DIR=&lt;awssdkdir&gt;
</code></pre>
<p>where &lt;awssdkdir&gt; is the path to the sdk installation.
For example, this might be as follows.</p>
<pre><code>cmake ... -DAWSSDK_DIR="c:\tools\aws-cpp-sdk-all"
</code></pre>
<p>This can be useful if blanks in path names cause problems
in your build environment.</p>
<h4>Testing S3 Support</h4>
<p>The relevant tests for S3 support are in <em>nczarr_test</em>.
They will be run if <em>--enable-nczarr-s3-tests</em> is on.</p>
<p>Currently, by default, testing of S3 with NCZarr is supported only for Unidata members of the NetCDF Development Group.
This is because it uses a specific bucket on a specific internal S3 appliance that is inaccessible to the general user.</p>
<p>However, an untested mechanism exists by which others may be
able to run the tests. If someone else wants to attempt these
tests, then they need to define the following environment variables:</p>
<ul>
<li>NCZARR_S3_TEST_HOST=&lt;host&gt;</li>
<li>NCZARR_S3_TEST_BUCKET=&lt;bucket-name&gt;</li>
</ul>
<p>This assumes a Path Style address (see above) where:</p>
<ul>
<li>host -- the complete host part of the URL</li>
<li>bucket -- a bucket in which testing can occur without fear of damaging anything.</li>
</ul>
<p><em>Example:</em></p>
<pre><code>NCZARR_S3_TEST_HOST=s3.us-west-1.amazonaws.com
NCZARR_S3_TEST_BUCKET=testbucket
</code></pre>
<p>If anyone tries to use this mechanism, it would be appreciated
if any difficulties were reported to Unidata as a GitHub issue.</p>
<h3>Appendix B. Building aws-sdk-cpp</h3>
<p>In order to use the S3 storage driver, it is necessary to install the Amazon <a href="https://github.com/aws/aws-sdk-cpp.git">aws-sdk-cpp library</a>.</p>
<p>As a starting point, here are the CMake options used by Unidata to build that library.
It assumes that it is being executed in a build directory, <code>build</code> say, and that <code>build/../CMakeLists.txt</code> exists.</p>
<pre><code>cmake -DBUILD_ONLY=s3
</code></pre>
<p>The expected set of installed libraries are as follows:</p>
<ul>
<li>aws-cpp-sdk-s3</li>
<li>aws-cpp-sdk-core</li>
</ul>
<p>This library depends on libcurl, so you may need to install that
before building the sdk library.</p>
<h3>Appendix C. Amazon S3 Imposed Limits</h3>
<p>The Amazon S3 cloud storage imposes some significant limits that are inherited by NCZarr (and Zarr also, for that matter).</p>
<p>Some of the relevant limits are as follows:</p>
<ol>
<li>The maximum object size is 5 Gigabytes with a total for all objects limited to 5 Terabytes.</li>
<li>S3 key names can be any UNICODE name with a maximum length of 1024 bytes. Note that the limit is defined in terms of bytes and not (Unicode) characters. This affects the depth to which groups can be nested because the key encodes the full path name of a group.</li>
</ol>
<h3>Appendix D. Alternative Mechanisms for Accessing Remote Datasets</h3>
<p>The NetCDF-C library contains an alternate mechanism for accessing data stored in Amazon S3: the byte-range mechanism.
The idea is to treat the remote data as if it were a big file.
This remote "file" can be randomly accessed using the HTTP Byte-Range header.</p>
<p>In the Amazon S3 context, a copy of a dataset, a netcdf-3 or netcdf-4 file, is uploaded into a single object in some bucket.
Then using the key to this object, it is possible to tell the netcdf-c library to treat the object as a remote file and to use the HTTP Byte-Range protocol to access the contents of the object.
The dataset object is referenced using a URL with the trailing fragment containing the string <code>#mode=bytes</code>.</p>
<p>An examination of the test program <em>nc_test/test_byterange.sh</em> shows simple examples using the <em>ncdump</em> program.
One such test is specified as follows:</p>
<pre><code>https://s3.us-east-1.amazonaws.com/noaa-goes16/ABI-L1b-RadC/2017/059/03/OR_ABI-L1b-RadC-M3C13_G16_s20170590337505_e20170590340289_c20170590340316.nc#mode=bytes
</code></pre>
<p>Note that for S3 access, it is expected that the URL is in what is called "path" format where the bucket, <em>noaa-goes16</em> in this case, is part of the URL path instead of the host.</p>
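<p>A minimal sketch of byte-range access from C, using the GOES-16 object shown above (this assumes the library was built with byte-range support enabled):</p>
<pre><code>#include &lt;stdio.h&gt;
#include &lt;netcdf.h&gt;

int main(void) {
    int ncid;
    /* Remote object opened as if it were a local file, via the #mode=bytes fragment. */
    const char* url =
        "https://s3.us-east-1.amazonaws.com/noaa-goes16/ABI-L1b-RadC/2017/059/03/"
        "OR_ABI-L1b-RadC-M3C13_G16_s20170590337505_e20170590340289_c20170590340316.nc"
        "#mode=bytes";
    int stat = nc_open(url, NC_NOWRITE, &ncid);
    if (stat != NC_NOERR) {
        fprintf(stderr, "nc_open: %s\n", nc_strerror(stat));
        return 1;
    }
    /* ... ordinary netCDF inquiry and read calls go here ... */
    nc_close(ncid);
    return 0;
}
</code></pre>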
<p>The <em>#mode=bytes</em> mechanism generalizes to work with most servers that support byte-range access. <br />
Specifically, Thredds servers support such access using the HttpServer access method as can be seen from this URL taken from the above test program.</p>
<pre><code>https://thredds-test.unidata.ucar.edu/thredds/fileServer/irma/metar/files/METAR_20170910_0000.nc#bytes
</code></pre>
<h4>Byte-Range Authorization</h4>
<p>If using byte-range access, it may be necessary to tell the netcdf-c
library about the so-called secretid and accessid values.
These are usually stored in the file <code>~/.aws/config</code>
and/or <code>~/.aws/credentials</code>. In the latter file, this
might look like the following.</p>
<pre><code>[default]
aws_access_key_id=XXXXXXXXXXXXXXXXXXXX
aws_secret_access_key=YYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYY
</code></pre>
<h3>Point of Contact</h3>
<p><strong>Author</strong>: Dennis Heimbigner<br>
<strong>Email</strong>: dmh at ucar dot edu<br>
<strong>Initial Version</strong>: 4/10/2020<br>
<strong>Last Revised</strong>: 2/22/2021</p>
https://www.unidata.ucar.edu/blogs/developer/entry/enhancing-the-netcdf-c-libraryEnhancing the netCDF C++ Library and the Siphon PackageUnidata News2019-08-16T10:51:06-06:002019-08-16T10:51:07-06:00<div class="img_l" style="width: 125px;padding-bottom:0;margin-bottom:0;">
<img width="125" src="/blog_content/images/2019/20190611_asweeney_1_400.jpg" alt="Aodhan Sweeney" />
<div class="caption">
Aodhan Sweeney
</div>
<p></div></p>
<p class="byline">
by
<a href="https://www.unidata.ucar.edu/blogs/news/entry/welcome-summer-intern-aodhan-sweeney">Aodhan
Sweeney</a>
<br />2019 Unidata summer intern
</p>
<p>
This summer at Unidata I worked on expanding functionality for both the netCDF C++ library
and the Python data access tool Siphon. Previously, the netCDF C++ library was
lacking important functionality that was included in other netCDF libraries. Fortunately,
adding this functionality is a straightforward process. I created function wrappers in the
C++ library that would call previously made functions in the C library. This allows those
working in a C++ framework to continue to use the netCDF libraries without sacrificing
additional functionality.
</p>
<div class="img_l" style="width: 150px;">
<a class="lightbox" title="Aodhan Sweeney" href="/blog_content/images/2019/20190611_asweeney_1_400.jpg">
<img width="150" src="/blog_content/images/2019/20190611_asweeney_1_400.jpg" alt="Aodhan Sweeney" />
</a>
<div class="caption">
Aodhan Sweeney
</div>
<p></div></p>
<p class="byline">
by
<a href="https://www.unidata.ucar.edu/blogs/news/entry/welcome-summer-intern-aodhan-sweeney">Aodhan
Sweeney</a>
<br />2019 Unidata summer intern
</p>
<p>
This summer at Unidata I worked on expanding functionality for both the netCDF C++ library
and the Python data access tool Siphon. Previously, the <a href="https://www.unidata.ucar.edu/software/netcdf/">netCDF C++ library</a> was
lacking important functionality that was included in other netCDF libraries. Fortunately,
adding this functionality is a straightforward process. I created function wrappers in the
C++ library that would call previously made functions in the C library. This allows those
working in a C++ framework to continue to use the netCDF libraries without sacrificing
additional functionality.
</p>
<p style="font-style:italic;">
Editor's Note: Aodhan's additions to the netCDF C++ library will be included in the next release, expected in late summer 2019.
</p>
<div class="img_r" style="width: 200px;">
<a class="lightbox" title="Jupyter notebook plotting storm tracks. The data were retrieved from the National Hurricane Center using Siphon."
href="/blog_content/images/2019/20190809_aodhan_seminar_04.png">
<img width="200" src="/blog_content/images/2019/20190809_aodhan_seminar_04_s.png"
alt="Hurricane tracks" /> </a>
<div class="caption"> Storm tracks visualized in a Jupyter notebook<br />(click to enlarge) </div>
</div>
<p><a class="lightbox" title="Event information retrieved from the Storm Prediction Center using Siphon."
href="/blog_content/images/2019/20190809_aodhan_seminar_03.png"></a></p>
<p>
<a href="https://www.unidata.ucar.edu/software/siphon/">Siphon</a> is a data access module
written in Python. Originally, it was developed for easy remote access to data from THREDDS
Data Servers. In recent years, an offshoot of Siphon that focuses on remote access to data
servers not associated with a TDS has been developed. This summer I worked on expanding
Siphon's access to include data from the National Hurricane Center (NHC) and the Storm
Prediction Center (SPC). With easy-to-learn commands in a Python environment, we are
empowering our users to perform their own analysis of the data stored at the NHC and SPC. To
facilitate interaction with these servers, I also developed Jupyter notebook-based Graphical
User Interfaces (GUIs) to plot and visualize the data stored in the NHC and SPC.
</p>
<p style="font-style:italic;">
Editor's Note: Aodhan's additions to Siphon will be included in the next official release,
expected in the fall of 2019. The notebooks will be available in the
<a href="https://unidata.github.io/python-gallery/index.html">Unidata Python Gallery</a>.
</p>
<div class="img_l" style="width: 200px;">
<a class="lightbox" title="Surface temperature anomaly data visualized using Javascript in a web browser."
href="/blog_content/images/2019/20190809_aodhan_seminar_02.png">
<img width="200" src="/blog_content/images/2019/20190809_aodhan_seminar_02.png"
alt="Surface temperature anomalies" /> </a>
<div class="caption">Temperature anomalies visualized in a browser</div>
</div>
<p><a class="lightbox" title="Geopotential height manifolds visualised using Javascript in a web browser."
href="/blog_content/images/2019/20190809_aodhan_seminar_01.png"></a></p>
<p>
Because of my awesome mentors and the wealth of information here at the Unidata Program
Center and in the wider community, I was also encouraged to pursue projects that I was
curious about. I ended up creating and testing a few 3D visualization tools in JavaScript
that can be run out of a web browser. One of these, displaying average temperature anomalies
over land between the years of 1910 and 2019, was accepted by the
<a href="https://experiments.withgoogle.com/">Experiments with Google</a> program. You can see
the visualization and the code that creates it
<a href="https://experiments.withgoogle.com/a-century-of-surface-temperature-anomali">here</a>.
</p>
https://www.unidata.ucar.edu/blogs/developer/entry/netcdf-zarr-apiNetCDF Zarr APIDennis Heimbigner 2019-07-16T15:04:02-06:002019-07-16T15:04:02-06:00<p>This document defines the variant of the netcdf-c library API
that can be used to read/write NCZarr datasets. Additionally,
any special new flags or other parameter values are defined.
It is expected that this document should be consistent with the
NetCDF ZARR Data Model Specification [1].</p>
<ol>
<li><a href="#nczapi_intro">Introduction</a></li>
<li><a href="#nczapi_netcdf_zarr_api">The netCDF-Zarr API</a>
<ol><li><a href="#nczapi_file_functions">NetCDF File Functions</a></li>
<li><a href="#nczapi_dimensions">Dimensions</a></li>
<li><a href="#nczapi_types">Types</a></li>
<li><a href="#nczapi_variables">Variables</a></li>
<li><a href="#nczapi_representation_functions">Variable Representation Functions</a></li>
<li><a href="#nczapi_variable_io">Variable IO</a></li>
<li><a href="#nczapi_attributes">Attributes</a></li>
<li><a href="#nczapi_groups">Groups</a></li>
<li><a href="#nczapi_error_handling">NetCDF Error Handling</a></li>
<li><a href="#nczapi_misc">Miscellaneous Functions</a></li>
<li><a href="#nczapi_unimplemented">Unimplemented Functions</a></li>
<li><a href="#nczapi_parallelism">Parallelism Functions</a></li>
<li><a href="#nczapi_path_urls">Path URLS</a></li></ol></li>
</ol>
<h1>Introduction <a name="nczapi_intro"></a></h1>
<p>This document is a companion document to the
<em>NetCDF ZARR Data Model Specification</em>[1].
That document provides a semi-formal and abstract representation of
the NCZarr data model independent of any implementation.</p>
<p>This document describes a variant of the API provided by the netcdf-c
library as shown in its primary definition file <em>netcdf.h</em>.
Familiarity with the current netcdf-c library API is assumed.</p>
<h1>The netCDF-Zarr API <a name="nczapi_netcdf_zarr_api"></a></h1>
<p>This API takes the netcdf-c library API and divides it into sets
of related functions. Any semantic differences are described.
API functions that are disallowed are also described.
Functions are organized according to the netCDF data model.</p>
<h2>NetCDF File Functions <a name="nczapi_file_functions"></a></h2>
<pre><code>EXTERNL int
nc_create(const char* path, int cmode, int* ncidp);
EXTERNL int
nc__create(const char* path, int cmode, size_t initialsz, size_t* chunksizehintp, int* ncidp);
EXTERNL int
nc_open(const char* path, int mode, int* ncidp);
EXTERNL int
nc__open(const char* path, int mode, size_t* chunksizehintp, int* ncidp);
EXTERNL int
nc_inq_ncid(int ncid, const char* name, int* grp_ncid);
EXTERNL int
nc_redef(int ncid);
EXTERNL int
nc_enddef(int ncid);
EXTERNL int
nc__enddef(int ncid, size_t h_minfree, size_t v_align, size_t v_minfree, size_t r_align);
EXTERNL int
nc_sync(int ncid);
EXTERNL int
nc_abort(int ncid);
EXTERNL int
nc_close(int ncid);
EXTERNL int
nc_inq_path(int ncid, size_t* pathlen, char* path);
</code></pre>
<p>With the exceptions noted below, all of these functions are implemented with essentially standard semantics.</p>
<p>Notes:</p>
<ol>
<li>The double underscore functions (e.g. <em>nc__create</em>) are implemented in terms of the single underscore versions with the extra parameters ignored.</li>
<li><em>nc_sync</em>, <em>nc_redef</em>, and <em>nc_enddef</em> may be implemented as no-op
functions depending on the underlying implementation.</li>
<li>The syntax and interpretation of the <em>path</em> argument are implementation dependent (see <a href="#nczapi_path_urls">below</a>).</li>
</ol>
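<p>As a minimal sketch of the standard create/close sequence (the path shown is a placeholder, since the syntax of an NCZarr path is implementation dependent, per note 3 above):</p>
<pre><code>#include <stdio.h>
#include <netcdf.h>

int main(void) {
    int ncid, stat;
    /* "example.nczarr" is a placeholder path; see the Path URLS section */
    if ((stat = nc_create("example.nczarr", NC_CLOBBER, &ncid)) != NC_NOERR) {
        fprintf(stderr, "nc_create: %s\n", nc_strerror(stat));
        return 1;
    }
    /* definitions of dimensions, variables, and attributes go here */
    nc_enddef(ncid); /* possibly a no-op, per note 2 above */
    nc_close(ncid);
    return 0;
}
</code></pre>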
<h2>Dimensions <a name="nczapi_dimensions"></a></h2>
<pre><code>EXTERNL int
nc_def_dim(int ncid, const char* name, size_t len, int* idp);
EXTERNL int
nc_inq_dimid(int ncid, const char* name, int* idp);
EXTERNL int
nc_inq_dim(int ncid, int dimid, char* name, size_t* lenp);
EXTERNL int
nc_inq_dimname(int ncid, int dimid, char* name);
EXTERNL int
nc_inq_dimlen(int ncid, int dimid, size_t* lenp);
EXTERNL int
nc_rename_dim(int ncid, int dimid, const char* name);
</code></pre>
<p>All of these functions are implemented with essentially standard semantics.</p>
<p>Notes:</p>
<ol>
<li>These APIs all assume named dimensions. The management of named dimensions is still an open
issue for Zarr. For writing, anonymous dimensions are not allowed, but they are for reading.
When reading an anonymous dimension, a specially named dimension will be created to represent
the anonymous dimension.</li>
<li>Unlimited dimensions are currently unimplemented.</li>
</ol>
<h2>Types <a name="nczapi_types"></a></h2>
<pre><code>EXTERNL int
nc_inq_type(int ncid, nc_type xtype, char *name, size_t *size);
/* Get the id of a type from the name. */
EXTERNL int
nc_inq_typeid(int ncid, const char *name, nc_type *typeidp);
</code></pre>
<p>Notes:</p>
<ol>
<li>In the current implementation, only a selected set of atomic types
is implemented, namely: <em>NC_CHAR, NC_BYTE, NC_SHORT, NC_INT, NC_FLOAT, NC_DOUBLE, NC_UBYTE, NC_USHORT, NC_UINT, NC_INT64, and NC_UINT64</em>.</li>
</ol>
<h2>Variables <a name="nczapi_variables"></a></h2>
<pre><code>EXTERNL int
nc_def_var(int ncid, const char* name, nc_type xtype, int ndims, const int* dimidsp, int* varidp);
EXTERNL int
nc_inq_var(int ncid, int varid, char* name, nc_type* xtypep, int* ndimsp, int* dimidsp, int* nattsp);
EXTERNL int
nc_inq_varid(int ncid, const char* name, int* varidp);
EXTERNL int
nc_inq_varname(int ncid, int varid, char* name);
EXTERNL int
nc_inq_vartype(int ncid, int varid, nc_type* xtypep);
EXTERNL int
nc_inq_varndims(int ncid, int varid, int* ndimsp);
EXTERNL int
nc_inq_vardimid(int ncid, int varid, int* dimidsp);
EXTERNL int
nc_inq_varnatts(int ncid, int varid, int* nattsp);
EXTERNL int
nc_rename_var(int ncid, int varid, const char* name);
</code></pre>
<p>The basic variable definition/inquiry functions have the standard
netCDF-4 semantics.</p>
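<p>A minimal sketch of the usual definition sequence, assuming <em>ncid</em> refers to a dataset or group in define mode (the names and sizes are illustrative only):</p>
<pre><code>#include <netcdf.h>

/* Define two named dimensions and a 2-D float variable; each call
   returns NC_NOERR on success. */
static int define_temperature(int ncid, int* varidp) {
    int dimids[2];
    int stat;
    if ((stat = nc_def_dim(ncid, "lat", 180, &dimids[0]))) return stat;
    if ((stat = nc_def_dim(ncid, "lon", 360, &dimids[1]))) return stat;
    return nc_def_var(ncid, "temperature", NC_FLOAT, 2, dimids, varidp);
}
</code></pre>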
<h2>Variable Representation Functions <a name="nczapi_representation_functions"></a></h2>
<pre><code>EXTERNL int
nc_def_var_filter(int ncid, int varid, unsigned int id, size_t nparams, const unsigned int* parms);
EXTERNL int
nc_inq_var_filter(int ncid, int varid, unsigned int* idp, size_t* nparams, unsigned int* params);
EXTERNL int
nc_def_var_deflate(int ncid, int varid, int shuffle, int deflate, int deflate_level);
EXTERNL int
nc_inq_var_deflate(int ncid, int varid, int* shufflep, int* deflatep, int* deflate_levelp);
EXTERNL int
nc_inq_var_szip(int ncid, int varid, int* options_maskp, int* pixels_per_blockp);
EXTERNL int
nc_def_var_fletcher32(int ncid, int varid, int fletcher32);
EXTERNL int
nc_inq_var_fletcher32(int ncid, int varid, int* fletcher32p);
EXTERNL int
nc_def_var_chunking(int ncid, int varid, int storage, const size_t* chunksizesp);
EXTERNL int
nc_inq_var_chunking(int ncid, int varid, int* storagep, size_t* chunksizesp);
EXTERNL int
nc_def_var_fill(int ncid, int varid, int no_fill, const void* fill_value);
EXTERNL int
nc_inq_var_fill(int ncid, int varid, int* no_fill, void* fill_valuep);
</code></pre>
<p>These functions specify information about the layout and storage of variables.
The deflate and szip functions are all implemented as calls to the def/inq filter
functions. It appears that the semantics of the chunking functions
match those of Zarr, so they can be directly implemented.
Handling of the fill functions is still T.B.D.</p>
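<p>For example, a hedged sketch of how chunking and deflate compression might be requested for an already-defined 2-D variable (the chunk sizes and deflate level are illustrative):</p>
<pre><code>#include <netcdf.h>

/* Request chunked storage with 90x90 chunks, then deflate at level 5;
   the deflate call is implemented via the filter mechanism described above. */
static int set_layout(int ncid, int varid) {
    size_t chunks[2] = {90, 90};
    int stat;
    if ((stat = nc_def_var_chunking(ncid, varid, NC_CHUNKED, chunks))) return stat;
    return nc_def_var_deflate(ncid, varid, /*shuffle*/0, /*deflate*/1, /*level*/5);
}
</code></pre>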
<h2>Variable IO <a name="nczapi_variable_io"></a></h2>
<pre><code>EXTERNL int
nc_put_var(int ncid, int varid, const void* op);
EXTERNL int
nc_get_var(int ncid, int varid, void* ip);
EXTERNL int
nc_put_var1(int ncid, int varid, const size_t* indexp, const void* op);
EXTERNL int
nc_get_var1(int ncid, int varid, const size_t* indexp, void* ip);
EXTERNL int
nc_put_vara(int ncid, int varid, const size_t* startp, const size_t* countp, const void* op);
EXTERNL int
nc_get_vara(int ncid, int varid, const size_t* startp, const size_t* countp, void* ip);
EXTERNL int
nc_put_vars(int ncid, int varid, const size_t* startp, const size_t* countp, const ptrdiff_t* stridep, const void* op);
EXTERNL int
nc_get_vars(int ncid, int varid, const size_t* startp, const size_t* countp, const ptrdiff_t* stridep, void* ip);
EXTERNL int
nc_put_varm(int ncid, int varid, const size_t* startp, const size_t* countp, const ptrdiff_t* stridep, const ptrdiff_t* imapp, const void* op);
EXTERNL int
nc_put_var_T(int ncid, int varid, const T* op);
EXTERNL int
nc_get_var_T(int ncid, int varid, T* ip);
EXTERNL int
nc_put_var1_T(int ncid, int varid, const size_t* indexp, const T* op);
EXTERNL int
nc_get_var1_T(int ncid, int varid, const size_t* indexp, T* ip);
EXTERNL int
nc_put_vara_T(int ncid, int varid, const size_t* startp, const size_t* countp, const T* op);
EXTERNL int
nc_get_vara_T(int ncid, int varid, const size_t* startp, const size_t* countp, T* ip);
EXTERNL int
nc_put_vars_T(int ncid, int varid, const size_t* startp, const size_t* countp, const ptrdiff_t* stridep, const T* op);
EXTERNL int
nc_get_vars_T(int ncid, int varid, const size_t* startp, const size_t* countp, const ptrdiff_t* stridep, T* ip);
EXTERNL int
nc_put_varm_T(int ncid, int varid, const size_t* startp, const size_t* countp, const ptrdiff_t* stridep, const ptrdiff_t* imapp, const T* op);
EXTERNL int
nc_get_varm_T(int ncid, int varid, const size_t* startp, const size_t* countp, const ptrdiff_t* stridep, const ptrdiff_t* imapp, T* ip);
</code></pre>
<p>The primary variable I/O functions are defined by the first eight functions in this list,
as is the case in the existing netcdf library code.
The put/get varm functions are all implemented in terms of calls to put/get vars functions,
again as in the existing code.</p>
<p>The get/put var T functions primarily exist to support library implemented type conversion.
If the actual variable type is different than the function type (the T), then automatic
conversion is performed from the actual type to the desired type. With some judicious refactoring,
it should be possible to reuse the existing conversion code in the netcdf-c library.</p>
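<p>To illustrate the type-conversion behavior, the following sketch reads a small corner of a variable as <em>double</em> values regardless of the variable's stored type (the start/count values are illustrative):</p>
<pre><code>#include <netcdf.h>

/* Read a 2x3 corner of a 2-D variable; if the stored type differs from
   double, the library converts each value on the way out. */
static int read_corner(int ncid, int varid, double out[6]) {
    size_t start[2] = {0, 0};
    size_t count[2] = {2, 3};
    return nc_get_vara_double(ncid, varid, start, count, out);
}
</code></pre>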
<h2>Attributes <a name="nczapi_attributes"></a></h2>
<pre><code>EXTERNL int
nc_put_att(int ncid, int varid, const char* name, nc_type xtype, size_t len, const void* op);
EXTERNL int
nc_get_att(int ncid, int varid, const char* name, void* ip);
EXTERNL int
nc_inq_att(int ncid, int varid, const char* name, nc_type* xtypep, size_t* lenp);
EXTERNL int
nc_inq_attid(int ncid, int varid, const char* name, int* idp);
EXTERNL int
nc_inq_atttype(int ncid, int varid, const char* name, nc_type* xtypep);
EXTERNL int
nc_inq_attlen(int ncid, int varid, const char* name, size_t* lenp);
EXTERNL int
nc_inq_attname(int ncid, int varid, int attnum, char* name);
EXTERNL int
nc_copy_att(int ncid_in, int varid_in, const char* name, int ncid_out, int varid_out);
EXTERNL int
nc_rename_att(int ncid, int varid, const char* name, const char* newname);
EXTERNL int
nc_del_att(int ncid, int varid, const char* name);
EXTERNL int
nc_put_att_T(int ncid, int varid, const char* name, size_t len, const T* op);
EXTERNL int
nc_get_att_T(int ncid, int varid, const char* name, T* op);
</code></pre>
<p>The primary attribute put/get functions are defined by the first two functions in this list.
The get/put T functions are implemented in terms of these two more generic functions.</p>
<p>The get/put T functions primarily exist to support library implemented type conversion.
If the actual attribute type is different than the function type (the T), then automatic
conversion is performed from the actual type to the desired type. With some judicious refactoring,
it should be possible to reuse the existing conversion code in the netcdf-c library.</p>
<p>The put T functions specify the actual type of the attribute, so there is no conversion implied. </p>
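<p>A small sketch of the attribute round trip (the attribute name and values are illustrative): the put call fixes the stored type via its <em>xtype</em> argument, and the typed get call converts to the requested in-memory type.</p>
<pre><code>#include <netcdf.h>

/* Attach a two-element double attribute to a variable and read it back. */
static int roundtrip_att(int ncid, int varid) {
    double range[2] = {-40.0, 60.0};
    double readback[2];
    int stat;
    if ((stat = nc_put_att_double(ncid, varid, "valid_range", NC_DOUBLE, 2, range)))
        return stat;
    return nc_get_att_double(ncid, varid, "valid_range", readback);
}
</code></pre>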
<h2>Groups <a name="nczapi_groups"></a></h2>
<pre><code>EXTERNL int
nc_def_grp(int parent_ncid, const char* name, int* new_ncid);
EXTERNL int
nc_rename_grp(int grpid, const char* name);
</code></pre>
<p>The semantics of the group functions appear to be completely consistent with the
existing Zarr semantics. It is assumed that the graph of groups is a tree,
which implies no cycles and no shared subgroups.</p>
<h2>NetCDF Error Handling <a name="nczapi_error_handling"></a></h2>
<pre><code>EXTERNL const char*
nc_strerror(int ncerr);
EXTERNL int
nc_set_log_level(int new_level);
</code></pre>
<p>Error reporting and event logging is not defined for Zarr, so these are the
same as for the netcdf-c library.</p>
<h2>Miscellaneous Functions <a name="nczapi_misc"></a></h2>
<pre><code>EXTERNL const char*
nc_inq_libvers(void);
EXTERNL int
nc_initialize(void);
EXTERNL int
nc_finalize(void);
EXTERNL int
nc_set_fill(int ncid, int fillmode, int* old_modep);
EXTERNL int
nc_set_default_format(int format, int* old_formatp);
EXTERNL int
nc_inq_format(int ncid, int* formatp);
EXTERNL int
nc_inq_format_extended(int ncid, int* formatp, int* modep);
EXTERNL int
nc_set_chunk_cache(size_t size, size_t nelems, float preemption);
EXTERNL int
nc_get_chunk_cache(size_t* sizep, size_t* nelemsp, float* preemptionp);
EXTERNL int
nc_set_var_chunk_cache(int ncid, int varid, size_t size, size_t nelems, float preemption);
EXTERNL int
nc_get_var_chunk_cache(int ncid, int varid, size_t* sizep, size_t* nelemsp, float* preemptionp);
EXTERNL int
nc_inq(int ncid, int* ndimsp, int* nvarsp, int* nattsp, int* unlimdimidp);
EXTERNL int
nc_inq_ndims(int ncid, int* ndimsp);
EXTERNL int
nc_inq_nvars(int ncid, int* nvarsp);
EXTERNL int
nc_inq_natts(int ncid, int* nattsp);
EXTERNL int
nc_delete(const char* path);
</code></pre>
<p>Notes:</p>
<ol>
<li>It is unclear if the format related functions are sufficient for specifying cloud
format information. There may be significant implementation-dependent information
that these functions cannot provide as currently defined.</li>
<li>Use of the chunk caching functions may be completely implementation dependent.
The idea of using a chunk cache seems to be an obvious requirement for good
performance.</li>
<li>All the inq functions should be able to have standard netcdf semantics.</li>
<li>The <em>nc_delete</em> function has always been something of an outlier, but it is useful
to have a way to completely remove a dataset in a way that is implementation dependent.</li>
</ol>
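<p>As a hedged sketch of the chunk-cache tuning calls mentioned in note 2 (the values are illustrative, and the effect may be implementation dependent):</p>
<pre><code>#include <netcdf.h>

/* Give one variable a 16 MiB chunk cache with 1009 slots and a 0.75
   preemption policy. */
static int tune_cache(int ncid, int varid) {
    return nc_set_var_chunk_cache(ncid, varid,
                                  16 * 1024 * 1024, /* cache size in bytes */
                                  1009,             /* number of chunk slots */
                                  0.75f);           /* preemption */
}
</code></pre>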
<h2>Unimplemented Functions <a name="nczapi_unimplemented"></a></h2>
<p>Basically, any function not specified above will be unimplemented. The current list is as follows.</p>
<pre><code>EXTERNL int
nc_inq_unlimdim(int ncid, int* unlimdimidp);
EXTERNL int
nc_inq_unlimdims(int ncid, int* nunlimdimsp, int* unlimdimidsp);
EXTERNL int
nc_show_metadata(int ncid);
EXTERNL int
nc_copy_var(int ncid_in, int varid, int ncid_out);
EXTERNL int
nc_def_opaque(int ncid, size_t size, const char* name, nc_type* xtypep);
EXTERNL int
nc_inq_opaque(int ncid, nc_type xtype, char* name, size_t* sizep);
EXTERNL int
nc_def_compound(int ncid, size_t size, const char* name, nc_type* typeidp);
EXTERNL int
nc_insert_compound(int ncid, nc_type xtype, const char* name, size_t offset, nc_type field_typeid);
EXTERNL int
nc_insert_array_compound(int ncid, nc_type xtype, const char* name, size_t offset, nc_type field_typeid, int ndims, const int* dim_sizes);
EXTERNL int
nc_inq_compound(int ncid, nc_type xtype, char* name, size_t* sizep, size_t* nfieldsp);
EXTERNL int
nc_inq_compound_name(int ncid, nc_type xtype, char* name);
EXTERNL int
nc_inq_compound_size(int ncid, nc_type xtype, size_t* sizep);
EXTERNL int
nc_inq_compound_nfields(int ncid, nc_type xtype, size_t* nfieldsp);
EXTERNL int
nc_inq_compound_field(int ncid, nc_type xtype, int fieldid, char* name, size_t* offsetp, nc_type* field_typeidp, int* ndimsp, int* dim_sizesp);
EXTERNL int
nc_inq_compound_fieldname(int ncid, nc_type xtype, int fieldid, char* name);
EXTERNL int
nc_inq_compound_fieldindex(int ncid, nc_type xtype, const char* name, int* fieldidp);
EXTERNL int
nc_inq_compound_fieldoffset(int ncid, nc_type xtype, int fieldid, size_t* offsetp);
EXTERNL int
nc_inq_compound_fieldtype(int ncid, nc_type xtype, int fieldid, nc_type* field_typeidp);
EXTERNL int
nc_inq_compound_fieldndims(int ncid, nc_type xtype, int fieldid, int* ndimsp);
EXTERNL int
nc_inq_compound_fielddim_sizes(int ncid, nc_type xtype, int fieldid, int* dim_sizes);
EXTERNL int
nc_def_enum(int ncid, nc_type base_typeid, const char* name, nc_type* typeidp);
EXTERNL int
nc_insert_enum(int ncid, nc_type xtype, const char* name, const void* value);
EXTERNL int
nc_inq_enum(int ncid, nc_type xtype, char* name, nc_type* base_nc_typep, size_t* base_sizep, size_t* num_membersp);
EXTERNL int
nc_inq_enum_member(int ncid, nc_type xtype, int idx, char* name, void* value);
EXTERNL int
nc_inq_enum_ident(int ncid, nc_type xtype, long long value, char* identifier);
EXTERNL int
nc_def_vlen(int ncid, const char* name, nc_type base_typeid, nc_type* xtypep);
EXTERNL int
nc_inq_vlen(int ncid, nc_type xtype, char* name, size_t* datum_sizep, nc_type* base_nc_typep);
EXTERNL int
nc_free_vlen(nc_vlen_t* vl);
EXTERNL int
nc_free_vlens(size_t len, nc_vlen_t vlens[]);
EXTERNL int
nc_put_vlen_element(int ncid, int typeid1, void* vlen_element, size_t len, const void* data);
EXTERNL int
nc_get_vlen_element(int ncid, int typeid1, const void* vlen_element, size_t* len, void* data);
EXTERNL int
nc_def_var_endian(int ncid, int varid, int endian);
EXTERNL int
nc_inq_var_endian(int ncid, int varid, int* endianp);
</code></pre>
<p>These functions are currently "unimplemented" in the sense that they will return the error code <em>NC_ENOTBUILT</em>.</p>
<h2>Parallelism Functions <a name="nczapi_parallelism"></a></h2>
<pre><code>EXTERNL int
nc__create_mp(const char* path, int cmode, size_t initialsz, int basepe, size_t* chunksizehintp, int* ncidp);
EXTERNL int
nc__open_mp(const char* path, int mode, int basepe, size_t* chunksizehintp, int* ncidp);
EXTERNL int
nc_delete_mp(const char* path, int basepe);
EXTERNL int
nc_set_base_pe(int ncid, int pe);
EXTERNL int
nc_inq_base_pe(int ncid, int* pe);
</code></pre>
<p>The netcdf library parallelism-related functions are all heavily MPI oriented.
It is unclear what is to be done with these functions.</p>
<h2>Path URLS <a name="nczapi_path_urls"></a></h2>
<p>It is assumed that the format of a Zarr file will look like a
netcdf Enhanced file with some variations. However, the path for
specifying a cloud-based dataset will be more complicated than a
simple file path. As with DAP2 and DAP4, it will be some kind of
URL annotated with extra information relevant to its
interpretation.</p>
<h1>References</h1>
<p>[1] NetCDF ZARR Data Model Specification (https://www.unidata.ucar.edu/blogs/developer/en/entry/netcdf-zarr-data-model-specification)<br>
[2] Zarr Specification Version 2 (https://zarr.readthedocs.io/en/stable/spec/v2.html)<br></p>
<h1>Copyright</h1>
<p>Copyright 2018, UCAR/Unidata<br>
See netcdf/COPYRIGHT file for copying and redistribution conditions.</p>
<h1>Point of Contact</h1>
<p><strong>Author</strong>: Dennis Heimbigner<br>
<strong>Email</strong>: dmh at ucar dot edu<br>
<strong>Initial Version</strong>: 12/1/2018<br>
<strong>Last Revised</strong>: 7/16/2019</p>
https://www.unidata.ucar.edu/blogs/developer/entry/netcdf-zarr-data-model-specificationNetCDF ZARR Data Model SpecificationDennis Heimbigner 2019-07-02T16:01:22-06:002019-07-02T16:01:22-06:00<p>This document defines the initial netcdf Zarr (NCZarr)
data model to be implemented. As the Zarr version 3 specification progresses, this model will be extended to include new data types.</p>
<h1>Table of Contents</h1>
<ol>
<li><a href="#nczarr_intro">Introduction</a></li>
<li><a href="#nczarr_notation">Notation</a></li>
<li><a href="#nczarr_datamodel">Data Model</a>
<ol><li><a href="#nczarr_dataset">Dataset</a></li>
<li><a href="#nczarr_group">Group</a></li>
<li><a href="#nczarr_attribute">Attribute</a></li>
<li><a href="#nczarr_dimension">Dimension</a></li>
<li><a href="#nczarr_variable">Variable</a></li>
<li><a href="#nczarr_dimref">Dimension Reference</a></li>
<li><a href="#nczarr_types">Types</a></li></ol></li>
<li><a href="#nczarr_excluded">Excluded Elements</a></li>
<li><a href="#nczarr_lexemes">Appendix A. Supporting Lexical Tokens</a>
<ol><li><a href="#nczarr_fqn">Fully Qualified Names</a></li></ol></li>
<li><a href="#nczarr_supplement">Appendix B. Supplementary Material</a>
<ol><li><a href="#nczarr_csensitive_spec">Specifying Context-Sensitive Elements</a></li></ol></li>
<li><a href="#nczarr_complete_spec">Appendix C. Complete Version of the Abstract Representation Specification</a></li>
</ol>
<h1>Introduction <a name="nczarr_intro"></a></h1>
<p>This document describes the to-be-implemented NCZarr data model
by reference to the netcdf-4 (aka netcdf enhanced) data model.
Elements of the enhanced model included in this model will be listed.
Elements of the enhanced model not included are listed in a later section.</p>
<h1>Notation <a name="nczarr_notation"></a></h1>
<p>In order to represent the abstract structure of the NCZarr data
model, we must choose some suitable notation. This notation
must meet the requirement that it is typed, meaning that the
nodes of the tree have a type and the structure of the node must
conform to that type.</p>
<p>Ideally, we would use Json as our notation since that is the
target representation used by the Zarr specification.
Unfortunately, Json is effectively typeless so we do not consider it
powerful enough to properly represent the data model. If some
way exists to do this, then this may be viable.</p>
<p>We choose Antlr4 [1] as our formalism because it is designed for
such uses as this one, and it is quite concise. In the following
specification, upper-case names (such as NAME or ZARRVERSION)
are terminals in the parsing sense and are specified in
Appendix A.</p>
<h1>Data Model <a name="nczarr_datamodel"></a></h1>
<h3>Dataset <a name="nczarr_dataset"></a></h3>
<pre><code>dataset : NAME ZARRVERSION (dimension | variable | attribute | group)*
</code></pre>
<p>The unit of data storage in NCZarr, as with netcdf-4, is the
<em>Dataset</em>. A Dataset is also a Group (see below), so it can contain
variables, attributes, and (sub-)groups. These semantics are consistent
with the netcdf-4 Dataset semantics.</p>
<h3>Group <a name="nczarr_group"></a></h3>
<pre><code>group: NAME (dimension | variable | attribute | group)*
</code></pre>
<p>A Group contains a collection of dimension declarations, variable
declarations, attributes, and (sub-)groups. Note that user-defined
type declarations are not (yet) included.</p>
<h3>Attribute <a name="nczarr_attribute"></a></h3>
<pre><code>attribute : NAME value_type (CONSTANT)+
</code></pre>
<p>An Attribute contains an (ordered) set of values, where the values
are constants consistent with the specified type of the attribute.
An attribute must have at least one value.</p>
<h3>Dimension <a name="nczarr_dimension"></a></h3>
<pre><code>dimension: NAME SIZE
</code></pre>
<p>A Dimension declaration defines a named dimension where the
dimension has a specified size.</p>
<h3>Variable <a name="nczarr_variable"></a></h3>
<pre><code>variable: NAME type (dimref)* (attribute)*
</code></pre>
<p>A Variable declaration defines a named variable of a specified
type. It also can reference a set of dimensions defining the
rank and size of the variable. If no dimensions are referenced,
then the variable is a scalar.</p>
<p>Additionally, any number of attributes can be associated with the variable
to define properties about the variable.</p>
<h3>Dimension Reference <a name="nczarr_dimref"></a></h3>
<pre><code>dimref: SIZE | FQN
</code></pre>
<p>A Dimension reference specifies one of the dimensions of a variable
by either defining an anonymous dimension, where the size is specified
directly, or by providing the fully qualified name referring to some
dimension defined in some Group via a <code><Dimension></code> declaration.</p>
<h3>Types <a name="nczarr_types"></a></h3>
<pre><code>type: atomic_type ;
atomic_type: fixed_atomic_type | char_type ;
fixed_atomic_type:
BYTE_T // A signed 8 bit integer
| UBYTE_T // An unsigned 8 bit integer
| SHORT_T // A signed 16 bit integer
| USHORT_T // An unsigned 16 bit integer
| INT_T // A signed 32 bit integer
| UINT_T // An unsigned 32 bit integer
| INT64_T // A signed 64 bit integer
| UINT64_T // An unsigned 64 bit integer
;
char_type: CHAR_T ;
</code></pre>
<p>For now, NCZarr only supports the signed and unsigned integer types
of sizes 8, 16, 32, and 64 bits. It also supports an approximation
to the character type.
Addition of more complex types such as strings must await the Zarr
version 3 specification.</p>
<p>These atomic types are those that can be used when specifying the
type of a variable or an attribute; the names are taken from the
corresponding netCDF-4 specification.</p>
<h3>Character Type</h3>
<p>The character type is almost universally (except for Java)
associated with an 8-bit unsigned value.
But this has always caused problems because historically,
multiple encodings have been associated with it: ASCII, ISO-LATIN-8859,
UTF-8, for example.</p>
<p>Each encoding may support only a subset of the 256 possible values
that can be represented by an 8-bit unsigned value. In the case of UTF-8,
which supports multi-byte characters, a single 8-bit value may not even
be able to represent a legal UTF-8 character.</p>
<p>To deal with this, we essentially punt by declaring the character type
to be the same as UBYTE_T (an 8-bit unsigned integer). Interpretation
of the encoding of a character is then outside the scope of this document.</p>
<h1>Excluded Elements <a name="nczarr_excluded"></a></h1>
<p>The initial data model for NCZarr deliberately excludes
a number of netcdf-4 concepts so that a working implementation
can be achieved as rapidly as possible. Additionally,
implementation of some netcdf-4 features need to be coordinated
with the new version 3 Zarr specification.</p>
<h2>Strings</h2>
<p>The biggest omission is the netcdf-4 String type. The reason is
that it is a varying length type and proper representation
in Zarr is still incomplete. It is expected that this will
be the first new type to be added since it is so useful. For now,
the netcdf-3 approach of using arrays of characters will need to be
used.</p>
<h2>User-Defined Types.</h2>
<p>The netcdf-4 user-defined type constructors are enumeration,
compound, opaque, and vlen. Of these, the most problematic is vlen
because of its varying length. Without it, the others would all be
fixed size and could be implemented. In fact the v2 Zarr specification
does provide for compound types, but we choose to wait for v3
before implementing it.</p>
<h2>Unlimited Dimension Size</h2>
<p>The netcdf-4 notion of unlimited allows for the definition of
a dimension whose size is known at any given point in time, but
whose size can vary over time. It is still the case that
all references to it are required to have the same size and this
can cause some difficulties at the storage level where it can introduce
undefined values into existing variables.</p>
<h1>Appendix A. Supporting Lexical Tokens <a name="nczarr_lexemes"></a></h1>
<p>In order to completely interpret the above data model,
a number of supporting lexical definitions are required
and are described here.</p>
<pre><code>NAME: IDCHAR+
FQN: ([/])|([/](IDCHAR)+)+
SIZE: DIGITS // Non-negative integer
ZARRVERSION: DIGITS '.' DIGITS '.' DIGITS
// Type Lexemes
BYTE_T: 'byte'
UBYTE_T: 'ubyte'
SHORT_T: 'short'
USHORT_T: 'ushort'
INT_T: 'int'
UINT_T: 'uint'
INT64_T: 'int64'
UINT64_T: 'uint64'
CHAR_T: 'char'
// Exact form is as usual, but will leave out for now
CONSTANT: INTEGER | UNSIGNED | FLOAT | CHAR;
fragment DIGITS: ['0'-'9']+
fragment UTF8: // Assume base character set is UTF8
fragment ASCII: [0-9a-zA-Z !#$%()*+:;<=>?@\[\]\\^_`|{}~] // Printable ASCII
fragment IDCHAR: (IDASCII|UTF8)
fragment IDASCII: [0-9a-zA-Z!#$%()*+:;<=>?@\[\]^_`|{}~] | '\\\\' | '\\/' | '\\ '
</code></pre>
<p>A NAME consists of a sequence of any legal non-control UTF-8 characters. A control character is any UTF-8 character in the inclusive range 0x00 — 0x1F.</p>
<h2>Fully Qualified Names <a name="nczarr_fqn"></a></h2>
<p>Every dimension and variable in a NCZarr Dataset has a Fully Qualified Name
(FQN), which provides a way to unambiguously reference it
in a dataset. Currently, the only case where this
is used is for referencing named dimensions from within
variable declarations.</p>
<p>These FQNs follow the common conventions of names for lexically
scoped identifiers. In NCZarr scoping is provided by Groups
(and the group subtype <em>dataset</em>).
Just as with hierarchical file
systems or variables in many programming languages, a simple
grammar formally defines how the names are built using the names
of the FQN's components (see lexical grammar above).</p>
<p>The FQN for a "top-level" variable or dimension is defined purely by
the sequence of enclosing groups plus the variable's simple
name.</p>
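<p>For example, a dimension <code>d1</code> declared in a group <code>g2</code> that is itself nested in a top-level group <code>g1</code> would have the FQN <code>/g1/g2/d1</code>, while a dimension declared directly in the root dataset would have an FQN such as <code>/time</code>.</p>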
<p>Notes:</p>
<ol>
<li>Every dataset has a single outermost <em>dataset</em> node
which, semantically, acts like the root group.
Whatever name that dataset has is ignored for the purposes of forming the FQN; instead it is treated as if it had the empty name ("").</li>
<li>There is no limit to the nesting of groups.</li>
</ol>
<p>The character "/" has special meaning in the context of a fully qualified name. This means that if a name added to an FQN contains this character, then that character must be specially escaped so that it will not be misinterpreted. The escape character itself must also be escaped, as must a blank.</p>
<p>The defined escapes are as follows.</p>
<table border=1 width="25%">
<tr><th>Character<th>Escaped Form
<tr><th>/<th>\/
<tr><th>\<th>\\
<tr><th>blank <th>\blank
</table>
<h1>Appendix B. Supplementary Material <a name="nczarr_supplement"></a></h1>
<h2>Specifying Context-Sensitive Elements <a name="nczarr_csensitive_spec"></a></h2>
<p>The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY" and "OPTIONAL" in this document are to be interpreted as described in RFC 2119.</p>
<h1>Appendix C. Complete Version of the Abstract Representation Specification <a name="nczarr_complete_spec"></a></h1>
<p>This is the complete Antlr specification in a form that can
be processed by Antlr.</p>
<pre><code>grammar z ;
dataset : NAME ZARRVERSION (dimension | variable | attribute | group)* ;
group: NAME (dimension | variable | attribute | group)* ;
attribute : NAME value_type (CONSTANT)+ ;
dimension: NAME SIZE ;
variable: NAME type (dimref)* (attribute)* ;
dimref: SIZE | FQN ;
type: atomic_type ;
atomic_type: fixed_atomic_type | char_type ;
fixed_atomic_type:
BYTE_T // A signed 8 bit integer
| UBYTE_T // An unsigned 8 bit integer
| SHORT_T // A signed 16 bit integer
| USHORT_T // An unsigned 16 bit integer
| INT_T // A signed 32 bit integer
| UINT_T // An unsigned 32 bit integer
| INT64_T // A signed 64 bit integer
| UINT64_T // An unsigned 64 bit integer
;
char_type: CHAR_T ;
// Lexemes
NAME: IDCHAR+ ;
FQN: ([/])|([/](IDCHAR)+)+ ;
SIZE: DIGITS ; // Non-negative integer ;
ZARRVERSION: DIGITS '.' DIGITS '.' DIGITS ;
// Type Lexemes
BYTE_T: 'byte' ;
UBYTE_T: 'ubyte' ;
SHORT_T: 'short' ;
USHORT_T: 'ushort' ;
INT_T: 'int' ;
UINT_T: 'uint' ;
INT64_T: 'int64' ;
UINT64_T: 'uint64' ;
CHAR_T: 'char' ;
// Exact form is as usual, but will leave out for now
CONSTANT: INTEGER | UNSIGNED | FLOAT | CHAR ;
fragment INTEGER: [+-]?DIGITS ;
fragment UNSIGNED: DIGITS ;
fragment FLOAT: [+-]?DIGITS '.' DIGITS ;
fragment STRING: '"' ~["] '"' ;
fragment DIGITS: [0-9]+ ;
fragment UTF8: ASCII ; // Assume base character set is UTF8 ;
fragment IDCHAR: (IDASCII|UTF8) ;
fragment IDASCII: [0-9a-zA-Z]|[!#$%()*+:;<=>?@]|'['|']'|'\\'|[^_`|{}~]
|'\\\\'|'\\/'|'\\ ' ;
fragment ASCII: [0-9a-zA-Z]|[ !#$%()*+:;<=>?@]|'['|']'|'\\'|[^_`|{}~] ; // Printable ASCII
</code></pre>
<h1>References</h1>
<p>[1] https://www.antlr.org/</p>
<h1>Copyright</h1>
<p>Copyright 2018, UCAR/Unidata<br>
See netcdf/COPYRIGHT file for copying and redistribution conditions.</p>
<h1>Point of Contact</h1>
<p><strong>Author</strong>: Dennis Heimbigner<br>
<strong>Email</strong>: dmh at ucar dot edu<br>
<strong>Initial Version</strong>: 11/28/2018<br>
<strong>Last Revised</strong>: 07/2/2019</p>
https://www.unidata.ucar.edu/blogs/developer/entry/nczarr-overviewNCZarr OverviewDennis Heimbigner 2019-07-02T12:54:34-06:002019-07-31T10:52:09-06:00<p>The Unidata NetCDF group is proposing to provide access to cloud
storage (e.g. Amazon S3) by providing a mapping from a subset of
the full netCDF Enhanced (aka netCDF-4) data model to one or
more existing data models that already have mappings to
key-value pair cloud storage systems.</p>
<p>The initial target is to map that subset of netCDF-4 to the Zarr
data model [1]. As part of that effort, we intend to produce a
set of related documents that provide a semi-formal definition
of the following.</p>
<ol>
<li>A description of the initial NCZarr data model.</li>
<li>A description of the subset of the netCDF API that conforms
to the NCZARR data model. This interface will be the basis
for programmatically reading and writing cloud data via the
netcdf-c library.</li>
<li>A mapping of the NCZarr data model to some variant of the
Zarr storage representation. This representation is a
combination of a mapping to Json plus a mapping to an
abstract key-value pair interface.</li>
<li>The internal architecture of the cloud support in the netcdf-c
library.</li>
<li>Any other documents required in support of the preceding documents
(the chunking algorithm documents, for example).</li>
</ol>
<p>The term "semi-formal" is used because rather than provide
complete mathematical or operational semantics, prose text will
be used to describe the context-sensitive features of the model.
A complete formalization in order to produce an operationally defined
specification is a possible future activity.</p>
<h2>References</h2>
<p>[1] Zarr storage specification version 2 (https://zarr.readthedocs.io/en/stable/spec/v2.html)</p>
<h2>Copyright</h2>
<p>Copyright 2018, UCAR/Unidata<br>
See netcdf/COPYRIGHT file for copying and redistribution conditions.</p>
<h2>Point of Contact</h2>
<p><strong>Author</strong>: Dennis Heimbigner<br>
<strong>Email</strong>: dmh at ucar dot edu<br>
<strong>Initial Version</strong>: 11/28/2018<br>
<strong>Last Revised</strong>: 7/2/2019</p>
https://www.unidata.ucar.edu/blogs/developer/entry/chunking-algorithms-for-netcdf-cChunking Algorithms for NetCDF-CDennis Heimbigner 2019-05-22T14:14:54-06:002019-05-22T14:14:54-06:00<p>Unidata is in the process of developing a Zarr [] based variant
of netcdf. As part of this effort, it was necessary to
implement some support for chunking. Specifically, the problem
to be solved was that of extracting a hyperslab of data from an
n-dimensional variable (array in Zarr parlance) that has been divided
into chunks (in the HDF5 sense). Each chunk is stored independently
in the data storage -- Amazon S3, for example.</p>
<p>The algorithm takes a series of R slices of the form (first,stop,stride),
where R is the rank of the variable. Note that a slice of the form
(first, count, stride), as used by netcdf, is equivalent because
stop = first + count*stride. These slices form a hyperslab.</p>
<p>The goal is to compute the set of chunks that intersect the hyperslab
and to then extract the relevant data from that set of chunks to
produce the hyperslab.</p>
<h1>Introduction</h1>
<p>It appears from web searches that this algorithm is nowhere documented
in the form of a high level pseudo code algorithm. It appears to only
exist in the form of code in HDF5, Zarr, and probably TileDB, and maybe
elsewhere.</p>
<p>What follows is an attempt to reverse engineer the algorithm used by
the Zarr code to compute this intersection. This is intended to be
the bare-bones algorithm with no optimizations included. Thus, its
performance is probably not the best, but it should work in all cases.
Some understanding of how HDF5 chunking is used is probably essential
for understanding this algorithm.</p>
<p>The original python code relies heavily on the use of Python
iterators, which might be considered a mistake. Iterators
are generally most useful in two situations: (1) it makes
the code clearer -- arguably false in this case, and (2) there is a reasonable
probability that the iterators will be terminated before they end, thus
improving efficiency. This is demonstrably false for the Zarr code.</p>
<p>This code instead uses the concept of an <em>odometer</em>,
which is a way to iterate over all the elements of an
n-dimensional object. Odometer code already exists in several
places in the existing netcdf-c library, among them
<em>ncdump/nciter.c</em>, <em>ncgen/odom.c</em>, and <em>libdap4/d4odom.c</em>. It
is also equivalent to the Python <em>itertools.product</em> iterator
function.</p>
<p>This algorithm assumes the following items of input:</p>
<ol>
<li>variable (aka array) - multidimensional array of data values</li>
<li>dimensions - the vector of dimensions defining the rank of the array;
this set comes from the dimensions associated with the array;
so given v(d1,d2), where (in netcdf terms) d1 and d2 are the
dimension names and where, for example, d1=10, d2=20,
the dimensions given to this algorithm are the ordered vector (d1,d2),
or equivalently (10,20).</li>
<li>Chunk sizes - for each dimension in the dimension set, there is defined
a chunk size along that dimension. Thus, there is also an ordered vector
of chunk sizes, (c1,c2), corresponding to the dimension vector (d1,d2).</li>
<li>slices - a vector of slice instances defining the subset of data to be
extracted from the array. A single slice as used here is of the form
(start,stop,stride), where start is the starting position with respect to
a dimension, stop is the last position + 1 to extract, and stride is
the number of positions to skip when extracting data. Note that this
is different than the netcdf-c nc_get_vars slices, which are of the form
(start,count,stride). The two are equivalent since stop = start+(count*stride).
When extracting data from a variable of rank R, one needs to specify R slices,
where each slice corresponds to a dimension of the variable.</li>
</ol>
<p>At a high-level, the algorithm works by separately analyzing
each dimension of the array using the corresponding slice and
corresponding chunk size. The result is a set of <em>projections</em>
specific to each dimension. By taking the cross-product of these
projections, one gets a vector of projection-vectors that can be
evaluated to extract a subset of the desired data for storage in
an output array.</p>
<p>It is important to note that this algorithm operates
in two phases. In the first phase, it constructs the projections
for each dimension. In the second phase, these projections
are combined as a cross-product to provide subsetting for the
true chunks, which are R-dimensional rectangles.</p>
<p>What follows is the algorithm, written as a set of pseudo-code procedures
that produce the final output given the above inputs.</p>
<h2>Notations:</h2>
<ul>
<li>floordiv(x,y) = floor((x / y))</li>
<li>ceildiv(x,y) = ceil((x / y))</li>
</ul>
<h2>Notes:</h2>
<ul>
<li>The ith slice is matched to the ith dimension for the variable</li>
<li>The ith slice is matched to the ith chunksize for the variable</li>
<li>The zarr code uses iterators, but this code converts to using
vectors and odometers for (one hopes) some clarity and consistency
with existing netcdf-c code.</li>
</ul>
<h2>Global Type Declarations</h2>
<pre><code>class Slice {
int start
int stop
int stride
}
class SliceIndex { // taken from zarr code see SliceDimIndexer
int chunk0 // index of the first chunk touched by the slice
int nchunks // number of chunks touched by this slice index
    int count; // total number of output items defined by this slice index
Projection projections[nchunks]; // There are multiple projections
// derived from the original slice:
// one for each chunk touched by
// the original slice
}
class Projection {
int chunkindex;
Slice slice; // slice specific to this chunk
int outpos; // start point in the output to store the extracted data
}
</code></pre>
<h2>Global variables</h2>
<p>In order to keep argument lists short, certain values are
assumed to be globally defined and accessible.</p>
<ul>
<li>R - the rank of the variable</li>
<li>dimlen[R] - the length of the dimensions associated with the variable</li>
<li>chunklen[R] - the length of the chunk sizes associated with the variable</li>
<li>int zeros[R] - a vector of zeros</li>
<li>int ones[R] - a vector of ones</li>
</ul>
<h2>Procedure EvaluateSlices</h2>
<pre><code>// Goal: Given the projections for each slice being applied to the
// variable, create and walk all possible combinations of projection
// vectors that can be evaluated to provide the output data
void EvaluateSlices(
Slice slices[R], // the complete set of slices
T output // the target storage for the extracted data: its type is T
)
{
int i;
SliceIndex allsliceindices[R];
Odometer odometer;
    int nchunks[R]; // the vector of nchunks from projections
    int chunkindices[R];
// Compute the slice index vector
allsliceindices = compute_all_slice_indices(slices);
// Extract the chunk0 and nchunks vectors
for(i=0;i<R;i++) {
nchunks[i] = allsliceindices[i].nchunks;
}
// Create an odometer to walk nchunk combinations
odometer = Odometer.new(R,zeros,nchunks); // iterate each "wheel[i]" over 0..nchunk[i] with R wheels
// iterate over the odometer: all combination of chunk indices in the projections
for(;odometer.more();odometer.next()) {
chunkindices = odometer.indices();
ApplyChunkIndices(chunkindices,output,allsliceindices);
}
}
</code></pre>
<h2>Procedure ApplyChunkIndices</h2>
<pre><code>// Goal: given a vector of chunk indices from projections,
// extract the corresponding data and store it into the
// output target
void ApplyChunkIndices(
int chunkindices[R], // indices chosen by the parent odometer
T output, // the target storage for the extracted data
SliceIndex allsliceindices[R]
)
{
    int i;
    SliceIndex subsliceindices[R];
    int chunk0[R]; // the vector of chunk0 values from projections
    Projection projections[R];
    Slice slices[R]; // the per-chunk slices taken from the projections
    int[R] outpos; // capture the outpos values across the projections
    int outputstart;
    // This is complicated. We need to construct a vector of slices
    // of size R where the ith slice is determined from a projection
    // for the ith chunk index of chunkindices. We then iterate over
    // that odometer to extract values and store them in the output.
    for(i=0;i<R;i++) {
        int chunkindex = chunkindices[i];
        projections[i] = allsliceindices[i].projections[chunkindex];
        slices[i] = projections[i].slice;
        outpos[i] = projections[i].outpos;
    }
// Compute where the extracted data will go in the output vector
outputstart = computelinearoffset(R,outpos,dimlen);
GetData(slices,outputstart,output);
}
</code></pre>
<h2>Procedure GetData</h2>
<pre><code>// Goal: given a set of indices pointing to projections,
// extract the corresponding data and store it into the
// output target.
void GetData(
Slice slices[R],
int chunksize, // total # T instances in chunk
T chunk[chunksize],
int outputstart,
T output
)
{
int i;
    Odometer sliceodom;
    sliceodom = Odometer.new(R, slices);
    // iterate over the odometer to get a point in the chunk space
    for(;sliceodom.more();sliceodom.next()) {
        int chunkpos[R] = sliceodom.indices(); // point in the chunk space to copy to the output
}
}
</code></pre>
<h2>Procedure compute_all_slice_projections</h2>
<pre><code>// Goal: create a vector of SliceIndex instances: one for each slice in the top-level input
SliceIndex[R]
compute_all_slice_projections(
    Slice slice[R] // the complete set of slices
)
{
    int i;
    SliceIndex projections[R];
    for(i=0;i<R;i++) {
        projections[i] = compute_perslice_projections(dimlen[i],chunklen[i],slice[i]);
    }
    return projections;
}
</code></pre>
<h2>Procedure compute_perslice_projections</h2>
<h3>Goal:</h3>
<p>For each slice, compute a set of projections from it wrt a
dimension and a chunk size associated with that dimension.</p>
<h3>Inputs:</h3>
<ul>
<li>dimlen -- dimension length</li>
<li>chunklen -- chunk length associated with the input dimension</li>
<li>slice=(start,stop,stride) -- associated with the dimension</li>
</ul>
<h3>Outputs:</h3>
<ul>
<li>Instance of SliceIndex</li>
</ul>
<h3>Computations:</h3>
<ul>
<li>count = max(0, ceildiv((stop - start), stride))
<ul><li>total number of output items defined by this slice (equivalent to count as used by nc_get_vars)</li></ul></li>
<li>nchunks = ceildiv(dim_len, dim_chunk_len)
<ul><li>number of chunks touched by this slice</li></ul></li>
<li>chunk0 = floordiv(start,chunklen)
<ul><li>index (in 0..nchunks-1) of the first chunk touched by the slice</li></ul></li>
<li>chunkn = ceildiv(stop,chunklen)
<ul><li>index (in 0..nchunks-1) of the last chunk touched by the slice</li></ul></li>
<li>n = ((chunkn - chunk0) + 1)
<ul><li>total number of touched chunks</li>
<li>the index i will range over 0..(n-1)</li>
<li>is this value the same as nchunks?</li></ul></li>
<li>For each touched chunk index we compute a projection specific to that chunk, hence
there are n of them.</li>
<li>projections.index[i] = i</li>
<li>projections.offset[i] = chunk0 * i
<ul><li>remember: offset is only WRT this dimension, not global</li></ul></li>
<li>projections.limit[i] = min(dimlen, (i + 1) * chunklen)
<ul><li>end of this chunk but no greater than dimlen</li></ul></li>
<li>projections.len[i] = projections.limit[i] - projections.offset[i]
<ul><li>actual limit of the ith touched chunk; should be same as chunklen except for last length because of the min function in computing limit[i]</li></ul></li>
<li>projections.start[i]:
<ul><li>This is somewhat complex because for the first projection, the start is the slice start,
but after that, we have to take into account that for a non-one stride, the start point
in a projection may be offset by some value in the range of 0..(stride-1)</li>
<li>i == 0 => projections.start[i] = start - projections.offset[i]
<ul><li>initial case the original slice start is within the first projection</li></ul></li>
<li>i > 0 => projections.start[i] = start - projections.offset[i]
<ul><li>prevunused[i] = (projections.offset[i] - start) % stride
<ul><li>prevunused[i] is an intermediate computation and need not be saved</li>
<li>amount unused in previous chunk => we need to skip (stride-prevunused[i]) in this chunk</li></ul></li>
<li>prevunused[i] > 0 => projections.start[i] = stride - prevunused[i]</li></ul></li></ul></li>
<li>projections.stop[i]:
<ul><li>stop > projections.limit[i] => projections.stop[i] = projections.len[i]</li>
<li>stop <= projections.limit[i] => projections.stop[i] = stop - projections.offset[i]
<ul><li>selection ends within current chunk</li></ul></li></ul></li>
<li>projections.outpos[i] = ceildiv(offset - start, stride)
<ul><li>"location" in the output array to start storing items; again, per slice, not global</li></ul></li>
</ul>
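<p>As a concrete (hypothetical) illustration of some of the quantities above, take dimlen = 10, chunklen = 4, and slice = (start=2, stop=9, stride=2), which selects indices 2, 4, 6, and 8:</p>
<ul>
<li>count = max(0, ceildiv(9 - 2, 2)) = 4</li>
<li>nchunks = ceildiv(10, 4) = 3</li>
<li>chunk0 = floordiv(2, 4) = 0</li>
</ul>
<p>The selected indices fall into chunk 0 ({2}), chunk 1 ({4, 6}), and chunk 2 ({8}), so a projection is constructed for each of the three touched chunks.</p>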
<h2>Procedure computelinearoffset(outpos,dimlen);</h2>
<p>Goal: Given a set of per-dimension indices, compute the corresponding linear position.</p>
<pre><code>int
computelinearoffset(int R,
int outpos[R],
int dimlen[R]
)
{
int offset;
int i;
offset = 0;
for(i=0;i<R;i++) {
offset *= dimlen[i];
offset += outpos[i];
}
return offset;
}
</code></pre>
<h1>Appendix: Odometer Code</h1>
<pre><code>class Odometer
{
int R; // rank
int start[R];
    int stop[R];
int stride[R];
int index[R]; // current value of the odometer
procedure new(int R, int start[R], int stop[R]) { return new(R, start,stop,ones);}
procedure new(int rank, Slice slices[R])
{
        int i;
        R = rank;
        for(i=0;i<R;i++) {
            start[i] = slices[i].start;
            stop[i] = slices[i].stop;
            stride[i] = slices[i].stride;
        }
for(i=0;i<R;i++) {index[i] = start[i];}
}
boolean
procedure more(void)
{
return (index[0] < stop[0]);
}
procedure next(void)
{
int i;
for(i=R-1;i>=0;i--) {
index[i] += stride[i];
if(index[i] < stop[i]) break;
if(i == 0) break; // leave the 0th entry if it overflows
index[i] = start[i]; // reset this position
}
}
// Get the value of the odometer
int[R]
procedure indices(void)
{
        return index;
}
}
</code></pre>
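<p>For readers who want something compilable, here is a minimal standalone C sketch of the same odometer idea for a fixed rank of 2; the start/stop/stride values are illustrative only.</p>
<pre><code>#include <stdio.h>

#define R 2 /* rank, fixed for this illustration */

int main(void) {
    int start[R]  = {0, 1};
    int stop[R]   = {4, 7};
    int stride[R] = {2, 3};
    int index[R];
    int i;
    for (i = 0; i < R; i++) index[i] = start[i];
    while (index[0] < stop[0]) {           /* "more" test */
        printf("(%d,%d)\n", index[0], index[1]);
        for (i = R - 1; i >= 0; i--) {     /* "next": increment with carry */
            index[i] += stride[i];
            if (index[i] < stop[i]) break; /* no carry needed */
            if (i == 0) break;             /* let the 0th wheel overflow to terminate */
            index[i] = start[i];           /* reset this wheel and carry left */
        }
    }
    return 0;
}
</code></pre>
<p>This prints (0,1), (0,4), (2,1), (2,4), i.e. every combination of the per-dimension index sequences.</p>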