Unidata Developer's Blog

https://www.unidata.ucar.edu/blogs/developer/entry/maintaining-netcdf-updating-java-tutorial
Maintaining netCDF: Updating Java Tutorial Code and Performance Testing in Python
Unidata News, 2021-08-04
<a class="lightbox" title="Isabelle Pfander" href="/blog_content/images/2021/20210519_Izzy_Pfander.jpg">
<img width="150" src="/blog_content/images/2021/20210519_Izzy_Pfander.jpg" alt="Isabelle Pfander" />
</a>
<div class="caption">
Izzy (Isabelle) Pfander
</div>
<p></div></p>
<p class="byline">
by
<a href="/community/internship/#2021ip">Isabelle Pfander</a>
<br />2021 Unidata summer intern
</p>
<p>
I came into this summer internship with a goal of working on the Network Common Data
Form (netCDF) libraries. NetCDF is a combination of software libraries and APIs
describing a data model for scientific multidimensional arrays. I planned to improve
the online user guide, write tutorial code, and learn about storage and efficiency.
</p>
<p style="font-style: italic;">
Editor's Note:<br>
Due to the COVID-19 pandemic, Unidata's 2021 summer interns did not travel to
Boulder to work on their projects in person. Instead, they interacted with Unidata
developers through Slack, Zoom, and other electronic means.
</p>
<div class="img_l" style="width: 150px;">
<a class="lightbox" title="Isabelle Pfander" href="/blog_content/images/2021/20210519_Izzy_Pfander.jpg">
<img width="150" src="/blog_content/images/2021/20210519_Izzy_Pfander.jpg" alt="Isabelle Pfander" />
</a>
<div class="caption">
Izzy (Isabelle) Pfander
</div>
</div>
<p class="byline">
by
<a href="/community/internship/#2021ip">Isabelle Pfander</a>
<br />2021 Unidata summer intern
</p>
<p>
I came into this summer internship with a goal of working on the Network Common Data
Form (netCDF) libraries. NetCDF is a combination of software libraries and APIs
describing a data model for scientific multidimensional arrays. I planned to improve
the online user guide, write tutorial code, and learn about storage and efficiency.
</p>
<p>
Before this project, I had only used netCDF by calling high level functions to read
and write data in MATLAB, which uses functionality from the netCDF-C library. The
netCDF data model is a standard across languages, with programming interfaces in C,
Java, Fortran, Python, MATLAB, R, and more. For the majority of this summer, I worked
closely with the netCDF-Java library, updating and expanding the
<a href="https://docs.unidata.ucar.edu/netcdf-java/6.0/userguide/index.html">online user's guide</a>.
</p>
<div class="img_r" style="width: 200px;">
<a class="lightbox" title="Updated UML diagram in the netCDF-Java documentation." href="/blog_content/images/2021/20210803_pfander_uml.png">
<img width="200" src="/blog_content/images/2021/20210803_pfander_uml.png" alt="netCDF-Java UML" />
</a>
<div class="caption">
A UML diagram for<br>netCDF-Java<br>(click to enlarge)
</div>
</div>
<p>
I maintained the netCDF-Java documentation by updating tutorial code, testing code
snippets, and modernizing tutorial text to improve user understanding. I started by
improving the documentation by replacing raw HTML with Markdown, changing formatting,
linking to relevant sites, updating UML diagrams, and including updated screenshots. I
next moved on to update and rewrite the tutorial code in Java. I created a tutorial
class for each page with every code snippet contained in a method. Viewing the code
snippets inside of IntelliJ, I was able to fix deprecations and update the code after
some major changes were made to the structure of the netCDF-Java library. I then used
netCDF-Java’s Jekyll plugin to insert the code snippets into the rendered HTML pages.
Finally, I created test classes to confirm the code runs properly. Moving code snippets
into Java classes rather than embedding them in the Markdown files ensures that when
future changes are made, errors in the user guide will not go
unnoticed. See one of my
<a href="https://github.com/Unidata/netcdf-java/pull/743">pull requests for user guide
updates</a>.
</p>
<div class="img_l" style="width: 200px;">
<a class="lightbox" title="Comparing performance when specifying different chunk sizes." href="/blog_content/images/2021/20210803_pfander_chunking.png">
<img width="200" src="/blog_content/images/2021/20210803_pfander_chunking.png" alt="Chunking performance" />
</a>
<div class="caption">
Performance with different chunk sizes.
</div>
</div>
<p>
After improving the user guide documentation, I embarked on the second focus of my
internship: performance testing in Python. Because of my interest in data storage and
efficiency, my mentors suggested that I look into comparing data formats, including
HDF5 and Zarr. HDF5 is a file format used by netCDF-4 providing compression and
chunking to the netCDF data model. I switched from working in Java to Python so that I
could compare reading times with Zarr, a Python-based data storage format. I compared
reads with netCDF-3, netCDF-4 Classic, netCDF-4, Zarr, and Zarr being read with
Xarray. The completed performance testing demonstrated that read times increase at
varying scales as chunk size decreases. When chunk size was large, a Zarr directory
store read was faster; however, as chunk size decreased, reads of netCDF-4 became much
faster. I learned that the difference in read times is due to how each format stores
data. A netCDF-4 file stores all data in one .nc file; consequently, more
operations are needed to find the appropriate data, but only one open is required.
Zarr directory stores save chunked data as many subdirectories and files, meaning the
more chunks, the more individual files in one Zarr directory store. You can see the
notebooks I created, testing data, and full results in my
<a href="https://github.com/irpfander/Comparing-Read-Times-of-NetCDF-and-Zarr-with-Python">GitHub
repository</a>.
</p>
<p>
My internship with Unidata allowed me to explore my own interests with netCDF software
while sharing findings with the public. I was able to contribute to the open source
community for the first time, conduct testing of my own, and gain professional development
skills through my mentors and UCAR/Unidata’s community. I am very grateful for this summer
opportunity and all the individuals who made this remote collaboration possible.
</p>
https://www.unidata.ucar.edu/blogs/developer/entry/nczarr-support-for-zarr-filters
NCZarr Support for Zarr Filters
Dennis Heimbigner, 2021-05-23 (last revised 2023-06-04)
<p>[Note: See GitHub issue <a href="https://github.com/Unidata/netcdf-c/issues/2006">2006</a> for additional comments.]</p>
<p>To date, filters in the netcdf-c library have referred to HDF5-style filters.
This style of filter is represented in the netcdf-c/HDF5 file by the following information:</p>
<ol>
<li>An unsigned integer, the "id", and</li>
<li>A vector of unsigned integers that encode the "parameters" for controlling the behavior of the filter.</li>
</ol>
<p>The "id" is a unique number assigned to the filter by the <a href="https://portal.hdfgroup.org/display/support/Registered+Filter+Plugins">HDF Group
filter authority</a>.
It identifies a specific filter algorithm.
The "parameters" of the filter are not defined explicitly but only by the implementation of the filter.</p>
<p>The inclusion of Zarr support in the netcdf-c library (called NCZarr) creates the need to provide a new representation consistent with the way that Zarr files store filter information.
For Zarr, filters are represented using the JSON notation.
Each filter is defined by a JSON dictionary, and each such filter dictionary
is guaranteed to have a key named "id" whose value is a unique string defining the filter algorithm: "lz4" or "bzip2", for example.</p>
<p>The parameters of the filter are defined by additional -- algorithm specific -- keys in the filter dictionary.
One commonly used filter is "blosc", which has a JSON dictionary of this form.</p>
<pre><code>{
"id": "blosc",
"cname": "lz4",
"clevel": 5,
"shuffle": 1
}
</code></pre>
<p>So in HDF5 terms, it has three parameters:</p>
<ol>
<li>"cname" -- the sub-algorithm used by the blosc compressor, LZ4 in this case.</li>
<li>"clevel" -- the compression level, 5 in this case.</li>
<li>"shuffle" -- is the input shuffled before compression, yes (1) in this case.</li>
</ol>
<p>NCZarr (netcdf Zarr) is required to store its filter information in its metadata in the above JSON dictionary format.
Simultaneously, NCZarr expects to use many of the existing HDF5 filter implementations.
This means that some mechanism is needed to translate between the HDF5 id+parameter model and the Zarr JSON dictionary model.</p>
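<p>To make the translation concrete, the following sketch packs the example blosc codec settings into an HDF5-style vector of unsigned ints. It is illustrative only: the numeric code standing in for "lz4" and the ordering of the parameters are assumptions, not the layout actually used by the registered blosc HDF5 filter.</p>
<pre><code>#include &lt;stdio.h&gt;

/* Illustrative only: pack the example blosc codec settings
 *   {"id": "blosc", "cname": "lz4", "clevel": 5, "shuffle": 1}
 * into an HDF5-style vector of unsigned ints. The numeric code for
 * "lz4" and the parameter ordering are hypothetical. */
int main(void) {
    unsigned int cname = 1;   /* hypothetical numeric code standing in for "lz4" */
    unsigned int clevel = 5;  /* compression level */
    unsigned int shuffle = 1; /* shuffle input before compression */
    unsigned int params[3] = { cname, clevel, shuffle };
    size_t nparams = sizeof(params) / sizeof(params[0]);
    for (size_t i = 0; i &lt; nparams; i++)
        printf("param[%zu] = %u\n", i, params[i]);
    return 0;
}
</code></pre>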
<p>The standardization authority for defining Zarr filters is the list supported by the <a href="https://numcodecs.readthedocs.io/en/stable/">NumCodecs project</a>. Comparing the set of standard filters (aka codecs) defined by NumCodecs to the set of standard filters defined by HDF5, it can be seen that the two sets overlap, but each has filters not defined by the other.</p>
<p>Note also that it is undesirable that a specific set of filters/codecs be built into the NCZarr implementation.
Rather, it is preferable for there be some extensible way to associate the JSON with the code implementing the codec. This mirrors the plugin model used by HDF5.</p>
<p>Currently, each HDF5 filter is implemented by a shared library that has certain well-defined entry points that allow the netcdf/HDF5 libraries to determine information about the filter, notably its id.
In order to use the codec JSON format, these entry points must be extended in some way to obtain the corresponding defining JSON.
But there is another desirable constraint.
It should be possible to associate an existing HDF5 filter -- one without codec JSON information -- with the corresponding codec JSON.
This association needs to be implemented by some mechanism external to the HDF5 filter.</p>
<h2>Pre-Processing Filter Libraries</h2>
<p>The process for using filters for NCZarr is defined to operate in several steps.
First, as with HDF5, all shared libraries in a specified directory
(<em>HDF5_PLUGIN_PATH</em>) are scanned.
They are interrogated to see what kind of library they implement, if any.
This interrogation operates by seeing if certain well-known (function) names are defined in this library.</p>
<p>There are two library types:</p>
<ol>
<li>HDF5 -- exports a specific API: "H5Z_plugin_type" and "H5Z_get_plugin_info".</li>
<li>Codec -- exports a specific API: "NCZ_codec_type" and "NCZ_get_codec_info"</li>
</ol>
<p>Note that a given library can export either or both of these APIs.
This means that we can have three types of libraries:</p>
<ol>
<li>HDF5 only</li>
<li>Codec only</li>
<li>HDF5 + Codec</li>
</ol>
<p>Suppose that our <em>HDF5_PLUGIN_PATH</em> location has an HDF5-only library.
Then by adding a corresponding, separate, Codec-only library to that same location, it is possible to make an HDF5 library usable by NCZarr.
It is possible to do this without having to modify the HDF5-only library.
Over time, it is possible to merge any given HDF5-only library with a Codec-only library to produce a single, combined library.</p>
<h2>Using Plugin Libraries</h2>
<p>The approach used by NCZarr is to have the netcdf-c library process all of the libraries by interrogating each one for the well-known APIs and recording the result.
Any libraries that do not export one or both of the well-known APIs are ignored.</p>
<p>Internally, the netcdf-c library pairs up each HDF5 library API with a corresponding Codec API by invoking the relevant well-known functions
(See <a href="#AppendixA">Appendix A</a>).
This results in the following table for the associated HDF5 and Codec APIs.</p>
<table>
<tr><th>HDF5 API<th>Codec API<th>Action
<tr><td>Not defined<td>Not defined<td>Ignore
<tr><td>Defined<td>Not defined<td>Ignore
<tr><td>Defined<td>Defined<td>NCZarr usable
</table>
<h2>Using the Codec API</h2>
<p>Given a set of filters for which the HDF5 API and the Codec API
are defined, it is then possible to use the APIs to invoke the
filters and to process the meta-data in Codec JSON format.</p>
<h3>Writing an NCZarr Container</h3>
<p>When writing, the user program invokes the NetCDF API function <em>nc_def_var_filter</em>.
This function is currently defined to operate using HDF5-style id and parameters (unsigned ints).
The netcdf-c library examines its list of known filters to find one matching the HDF5 id provided by <em>nc_def_var_filter</em>.
The set of parameters provided is stored internally.
Then during writing of data, the corresponding HDF5 filter is invoked to encode the data.</p>
<p>When it comes time to write out the meta-data, the stored HDF5-style parameters are passed to a specific Codec function to obtain the corresponding JSON representation. Again see <a href="#AppendixA">Appendix A</a>.
This resulting JSON is then written in the NCZarr metadata. </p>
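<p>The sketch below illustrates the existing HDF5-style call mentioned above. It uses <em>nc_def_var_filter</em> with filter id 1 (the registered HDF5 deflate filter) against an ordinary netCDF-4 file; under this proposal, an NCZarr target would accept the same call, with the JSON Codec form generated when the metadata is written.</p>
<pre><code>#include &lt;stdio.h&gt;
#include &lt;netcdf.h&gt;
#include &lt;netcdf_filter.h&gt;  /* declares nc_def_var_filter in recent releases */

#define CHECK(e) do { int s = (e); if (s != NC_NOERR) { \
    fprintf(stderr, "%s\n", nc_strerror(s)); return 1; } } while (0)

int main(void) {
    int ncid, dimid, varid;
    unsigned int level = 5;  /* the single HDF5-style parameter for deflate */

    CHECK(nc_create("filtered.nc", NC_NETCDF4 | NC_CLOBBER, &ncid));
    CHECK(nc_def_dim(ncid, "x", 1024, &dimid));
    CHECK(nc_def_var(ncid, "v", NC_FLOAT, 1, &dimid, &varid));
    /* HDF5-style request: filter id 1 (deflate), one unsigned parameter. */
    CHECK(nc_def_var_filter(ncid, varid, 1, 1, &level));
    CHECK(nc_close(ncid));
    return 0;
}
</code></pre>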
<h3>Reading an NCZarr Container</h3>
<p>When reading, the netcdf-c library reads the metadata for a given variable and sees that some set of filters are applied to this variable.
The metadata is encoded as Codec-style JSON.</p>
<p>Given a JSON Codec, it is parsed to provide a JSON dictionary containing the string "id" and the set of parameters as various keys.
The netcdf-c library examines its list of known filters to find one matching the Codec "id" string.
The JSON is passed to a Codec function to obtain the corresponding HDF5-style <em>unsigned int</em> parameter vector.
These parameters are stored for later use.</p>
<p>When it comes time to read the data, the stored HDF5-style filter is invoked with the parameters already produced and stored.</p>
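<p>From the user's point of view, decoding is transparent: an ordinary read call triggers the stored filter. A minimal sketch, reusing the file produced by the earlier write sketch (the NCZarr read path would follow the same pattern under this proposal):</p>
<pre><code>#include &lt;stdio.h&gt;
#include &lt;stdlib.h&gt;
#include &lt;netcdf.h&gt;

int main(void) {
    int ncid, varid;
    float* data = (float*)malloc(1024 * sizeof(float));
    if (data == NULL) return 1;
    /* "filtered.nc" is the file produced by the earlier write sketch. */
    if (nc_open("filtered.nc", NC_NOWRITE, &ncid) != NC_NOERR) return 1;
    if (nc_inq_varid(ncid, "v", &varid) != NC_NOERR) return 1;
    /* The stored filter and its parameters are applied transparently here. */
    if (nc_get_var_float(ncid, varid, data) != NC_NOERR) return 1;
    printf("first value: %f\n", data[0]);
    nc_close(ncid);
    free(data);
    return 0;
}
</code></pre>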
<h2>Supporting Filter Chains</h2>
<p>HDF5 supports <em>filter chains</em>: a sequence of filters in which the output of one filter is provided as input to the next filter in the sequence.
When encoding, the filters are executed in the "forward" direction,
while when decoding the filters are executed in the "reverse" direction.</p>
<p>In the Zarr meta-data, a filter chain is divided into two parts:
the "compressor" and the "filters". The former is a single JSON codec
as described above. The latter is an ordered JSON array of codecs.
So if the compressor is something like
<code>"compressor": {"id": "c"...}</code>
and the filters array is like
<code>"filters": [ {"id": "f1"...}, {"id": "f2"...}, ..., {"id": "fn"...} ]</code>,
then the filter chain is (f1, f2, ..., fn, c), with f1 applied first and c applied last when encoding. On decode, the filter chain is executed in the order (c, fn, ..., f2, f1).</p>
<p>So, an HDF5 filter chain is divided into two parts, where the last filter in the chain is assigned as the "compressor" and the remaining
filters are assigned as the "filters".
But independent of this, each codec, whether a compressor or a filter,
is stored in the JSON dictionary form described earlier.</p>
<h2>Extensions</h2>
<p>The Codec style, using JSON, has the ability to provide very complex parameters that may be hard to encode as a vector of unsigned integers.
It might be desirable to consider exporting a JSON-based API from the netcdf-c library to support user access to this complexity.
This would mean providing some alternate version of <em>nc_def_var_filter</em> that takes a string-valued argument instead of a vector of unsigned ints.</p>
<p>One bad side effect of this is that we may then have two classes of plugins: one class that can be used by both HDF5 and NCZarr, and a second class that is usable only with NCZarr.</p>
<p>This extension is unlikely to be implemented until a compelling use-case is encountered. </p>
<h2>Summary</h2>
<p>This document outlines the proposed process by which NCZarr utilizes existing HDF5 filters.
At the same time, it describes the mechanisms to support storing filter metadata in the NCZarr container using the Zarr compliant Codec style representation of filters and their parameters.</p>
<h2><a name="AppendixA">Appendix A. Codec API</a></h2>
<p>The Codec API mirrors the HDF5 API closely. It has one well-known function that can be invoked to obtain information about the Codec as well as pointers to special functions to perform conversions.</p>
<p>Note that this Appendix is only an initial proposal and is subject to change.</p>
<h3>NCZ_get_codec_info</h3>
<p>This function returns a pointer to a C struct that provides detailed information about the codec converter.</p>
<h4>Signature</h4>
<pre><code>struct NCZ_codec_t NCZ_get_codec_info(void);
</code></pre>
<h3>NCZ_codec_t</h3>
<pre><code>typedef struct NCZ_codec_t {
int version; /* Version number of the struct */
int sort; /* Format of remainder of the struct;
Currently always NCZ_CODEC_HDF5 */
const char* id; /* The name/id of the codec */
const unsigned int hdf5id; /* corresponding hdf5 id */
int (*NCZ_codec_to_hdf5)(const char* codec, int* nparamsp, unsigned** paramsp);
int (*NCZ_hdf5_to_codec)(int nparams, unsigned* params, char** codecp);
} NCZ_codec_t;
</code></pre>
<p>The key to this struct is the two function pointers that do the conversion between codec JSON and HDF5 parameters.</p>
<h3>NCZ_codec_to_hdf5</h3>
<p>Given a JSON Codec representation, it returns a corresponding vector of unsigned integers for use with HDF5.</p>
<h4>Signature</h4>
<pre><code>int (*NCZ_codec_to_hdf5)(const char* codec, int* nparamsp, unsigned** paramsp);
</code></pre>
<h4>Arguments</h4>
<ol>
<li>codec -- (in) ptr to JSON string representing the codec.</li>
<li>nparamsp -- (out) store the length of the converted HDF5 unsigned vector</li>
<li>paramsp -- (out) store a pointer to the converted HDF5 unsigned vector; caller must free the returned vector. Note the double indirection.</li>
</ol>
<p>Return Value: a netcdf-c error code.</p>
<h3>NCZ_hdf5_to_codec</h3>
<p>Given an HDF5 vector of unsigned integers and its length, return the corresponding JSON codec representation.</p>
<h4>Signature</h4>
<pre><code>int (*NCZ_hdf5_to_codec)(int nparamsp, unsigned* paramsp, char** codecp);
</code></pre>
<h4>Arguments</h4>
<ol>
<li>nparams -- (in) the length of the HDF5 unsigned vector</li>
<li>params -- (in) pointer to the HDF5 unsigned vector.</li>
<li>codecp -- (out) store the string representation of the codec; caller must free.</li>
</ol>
<p>Return Value: a netcdf-c error code.</p>
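<p>To show how the pieces might fit together, here is a sketch of a Codec-only plugin written against the proposed API. Everything in it is hypothetical: the struct and constants are inlined because no header is specified by this proposal, the "demo" filter and its HDF5 id are invented, and the JSON handling is reduced to hard-wired strings.</p>
<pre><code>#include &lt;stdio.h&gt;
#include &lt;stdlib.h&gt;
#include &lt;string.h&gt;

#define NC_NOERR 0          /* matches the netcdf-c success code */
#define NCZ_CODEC_HDF5 1    /* placeholder value for the sort field */

typedef struct NCZ_codec_t {
    int version;            /* version number of the struct */
    int sort;               /* format of the remainder; NCZ_CODEC_HDF5 */
    const char* id;         /* the name/id of the codec */
    unsigned int hdf5id;    /* corresponding HDF5 filter id */
    int (*NCZ_codec_to_hdf5)(const char* codec, int* nparamsp, unsigned** paramsp);
    int (*NCZ_hdf5_to_codec)(int nparams, unsigned* params, char** codecp);
} NCZ_codec_t;

/* Convert {"id": "demo", "level": N} to a one-element parameter vector.
 * Real code would parse the JSON; here the level is hard-wired. */
static int demo_codec_to_hdf5(const char* codec, int* nparamsp, unsigned** paramsp) {
    unsigned* p = (unsigned*)malloc(sizeof(unsigned));
    (void)codec;
    if (p == NULL) return 1;
    p[0] = 5;
    *nparamsp = 1;
    *paramsp = p;           /* caller frees, per the proposal */
    return NC_NOERR;
}

/* Convert a parameter vector back to the codec JSON form. */
static int demo_hdf5_to_codec(int nparams, unsigned* params, char** codecp) {
    char buf[64];
    unsigned level = (nparams &gt; 0) ? params[0] : 0;
    size_t n;
    char* out;
    snprintf(buf, sizeof(buf), "{\"id\": \"demo\", \"level\": %u}", level);
    n = strlen(buf) + 1;
    out = (char*)malloc(n);
    if (out == NULL) return 1;
    memcpy(out, buf, n);
    *codecp = out;          /* caller frees */
    return NC_NOERR;
}

static const NCZ_codec_t demo_codec = {
    1, NCZ_CODEC_HDF5, "demo", 32768u /* hypothetical HDF5 id */,
    demo_codec_to_hdf5, demo_hdf5_to_codec
};

/* Well-known entry point interrogated by the netcdf-c library. */
const NCZ_codec_t* NCZ_get_codec_info(void) { return &demo_codec; }
</code></pre>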
https://www.unidata.ucar.edu/blogs/developer/entry/netcdf-4-filter-support-changes
Netcdf-4 Filter Support Changes
Dennis Heimbigner, 2020-10-15
<p>The netcdf-c library filters API in version 4.7.4 has been deprecated in favor of a modified version that unfortunately may cause incompatibilities for users.</p>
<p>The initial reason for the incompatible changes was to support the use of filters in the new NCZarr code. The changes were not completely thought out so it was decided to remove them and revert to previous behaviors. At some future point, the filter mechanism will be extended to support filters for NCZarr, but these will be proper extensions: the existing, reverted, filter API will continue to be supported with no user-visible modifications.</p>
<p>Unfortunately, some advanced users of netcdf filters may experience some compilation or execution problems for previously working code because of these reversions. In that case, please revise your code. Apologies are extended for any inconvenience. Note that it is possible to detect which mechanism is in place at build time.</p>
<p>In summary, the changes are of the following kinds:</p>
<ul>
<li>Some functions were renamed for consistency.</li>
<li>Revert the way that the function <em>nc_inq_var_filter</em> was indicating no filters existed.</li>
<li>Some auxiliary functions for parsing textual filter specifications have been moved to <em>netcdf_aux.h</em>.</li>
<li>All of the "filterx" functions have been removed.</li>
<li>The undocumented function <em>nc_filter_remove</em> was deleted.</li>
</ul>
<p>See the GitHub document <a href="https://github.com/Unidata/netcdf/blob/master/NUG/filters.md">https://github.com/Unidata/netcdf/blob/master/NUG/filters.md</a> for details.</p>
https://www.unidata.ucar.edu/blogs/developer/entry/highlights-from-my-summer-internship
Highlights From My Summer Internship With Unidata
Unidata News, 2020-07-30
<a class="lightbox" title="Lauren Prox" href="/blog_content/images/2020/20200526_lprox.jpg">
<img width="150" src="/blog_content/images/2020/20200526_lprox.jpg" alt="Lauren Prox" />
</a>
<div class="caption">
Lauren Prox
</div>
<p></div></p>
<p class="byline">
by
Lauren Prox
<br />2020 Unidata summer intern
</p>
<p>
During the beginning of my internship, I devoted a great deal of time learning how to use Git and Github to collaborate on
software development projects. After gaining this experience, I began improving documentation for a
variety of Unidata remote repositories. I started with the netCDF-C repository and then moved on to the
MetPy, Siphon, and Python Training remote repositories. This work was significant as it ensured that
software users were able to locate resources, properly download software, and learn how to operate the
software via informational materials.</p>
<p style="font-style: italic;">
Editor's Note:<br>
Due to the COVID-19 pandemic, Unidata's 2020 summer interns did not travel to
Boulder to work on their projects in person. Instead, they interacted with Unidata
developers through Slack, Zoom, and other electronic means.
</p>
<div class="img_l" style="width: 150px;">
<a class="lightbox" title="Lauren Prox" href="/blog_content/images/2020/20200526_lprox.jpg">
<img width="150" src="/blog_content/images/2020/20200526_lprox.jpg" alt="Lauren Prox" />
</a>
<div class="caption">
Lauren Prox
</div>
</div>
<p class="byline">
by
<a href="/community/internship/#2020lp">Lauren Prox</a>
<br />2020 Unidata summer intern
</p>
<p>
At the beginning of my internship, I devoted a great deal of time to learning how to use <a
href="https://git-scm.com/">Git</a> and <a href="https://github.com/">Github</a> to collaborate on
software development projects. After gaining this experience, I began improving documentation for a
variety of Unidata remote repositories. I started with the <a href="https://github.com/Unidata/netcdf-c">netCDF-C</a> repository and then moved on to the
<a href="https://github.com/Unidata/MetPy">MetPy</a>, <a href="https://github.com/Unidata/siphon">Siphon</a>, and <a href="https://github.com/Unidata/python-training">Python Training</a> remote repositories. This work was significant as it ensured that
software users were able to locate resources, properly download software, and learn how to operate the
software via informational materials.</p>
<p>I also provided feedback regarding the installation process for
NetCDF and its various libraries. This led me to investigate how users access and work with NetCDF data
using tools such as MATLAB and AWS S3. From this work, I found that there are several resources
available for users who wish to use NetCDF data and AWS S3 buckets, including AWS S3’s <a
href="https://docs.aws.amazon.com/AmazonS3/latest/dev/Welcome.html">Developer Guide</a> and MATLAB’s <a
href="https://www.mathworks.com/help/matlab/getting-started-with-matlab.html?s_tid=CRUX_lftnav">Getting
Started</a> Guide. In mentioning these great resources, I would like to take the opportunity to mention
a Unidata resource that provides tutorials and resources concerning python skills and atmospheric
science education. Though I may be slightly biased after working on this website, Unidata’s <a
href="https://unidata.github.io/python-training/">Python Training</a> website is a wonderful resource
for people wishing to gain more coding experience while using real-world atmospheric science data.
</p>
<div class="img_r" style="width: 200px;">
<a class="lightbox" title="It's easy to contribute to an Open Source project and
project leaders welcome submissions from anyone who is interested in collaborating!"
href="/blog_content/images/2020/20200731_lprox_PR106.png">
<img width="200" src="/blog_content/images/2020/20200731_lprox_PR106.png" alt="GitHub PR screen shot"/>
</a>
<div class="caption">
An example of one of my merged pull requests<br>(click to enlarge)
</div>
</div>
<p>
While working on these projects, I simultaneously participated in educational courses that broadened my
knowledge of software development, data stewardship, and scientific research methods. In addition to
completing a software development mini-course with my institution, I also completed a Structured Query
Language (SQL) course offered by UCAR. I had frequently heard of SQL in my engineering courses, but I
never actually took the time to explore this language until this summer. Because my internship was
remote, I was also able to attend virtual conferences such as <a
href="https://www.earthcube.org/EC2020">EarthCube</a>, <a
href="https://2020esipsummermeeting.sched.com/">ESIP</a>, and <a
href="https://pearc.acm.org/pearc20/">PEARC</a>. These conferences were a great way to network with
users of Unidata software and to learn of new developments within the fields of scientific data storage
and sharing. The flexibility of my internship position allowed me to take advantage of countless
opportunities, which helped me gain invaluable knowledge that I will continue to use as a scientific
researcher.
</p>
<p>
This internship was such a rewarding experience because I actually applied what I learned in my
computer science courses to fix real-world problems. The moment when <a
href="https://github.com/Unidata/netcdf-c/pull/1741">my first pull request</a> —
a software documentation
correction in which I replaced a few broken web links — was merged onto a Unidata
repository’s master branch felt amazing. When I successfully revised the script to include
the proper links, my changes were approved and are now included in the official documentation for the
netCDF Github repository. What made this accomplishment even sweeter was the way that my coworkers
celebrated my progress. I am so appreciative for the Unidata and greater NCAR community because they
provided a supportive, engaging, and fun environment despite the remote setting of this internship. It
is because of this that I look forward to working with NCAR and Unidata in the future. Additionally, I
would highly recommend the <a
href="https://www.unidata.ucar.edu/community/internship/">Unidata Summer
Internship</a> program to any student seeking to
bridge the gap between their environmental science coursework and their computer science skills.
</p>
https://www.unidata.ucar.edu/blogs/developer/entry/overview-of-zarr-support-in
Overview of Zarr Support in netCDF-C
Dennis Heimbigner, 2020-06-25 (last revised 2023-06-04)
<p><b>Note: This document is obsolete. Please refer to this <u><a href="https://docs.unidata.ucar.edu/netcdf-c/current/md__media_psf_Home_Desktop_netcdf_releases_v4_9_2_release_netcdf_c_docs_nczarr.html">document</a></u></b></p>
<p>Beginning with netCDF version 4.8.0, the Unidata NetCDF group has extended the netcdf-c library to provide access to cloud storage (e.g. Amazon S3 <a href="#ref_aws">[1]</a>) by providing a mapping from a subset of the full netCDF Enhanced (aka netCDF-4) data model to a variant of the Zarr <a href="#ref_zarrv2">[6]</a> data model that already has mappings to key-value pair cloud storage systems.
The NetCDF version of this storage format is called NCZarr <a href="#ref_nczarr">[4]</a>.</p>
<h3>The NCZarr Data Model</h3>
<p>NCZarr uses a data model <a href="#ref_nczarr">[4]</a> that is, by design, similar to, but not identical with the Zarr Version 2 Specification <a href="#ref_zarrv2">[6]</a>. <br />
Briefly, the data model supported by NCZarr is netcdf-4 minus the user-defined types and the String type.
As with netcdf-4 it supports chunking.
Eventually it will also support filters in a manner similar to the way filters are supported in netcdf-4.</p>
<p>Specifically, the model supports the following.</p>
<ul>
<li>"Atomic" types: char, byte, ubyte, short, ushort, int, uint, int64, uint64.</li>
<li>Shared (named) dimensions</li>
<li>Attributes with specified types -- both global and per-variable</li>
<li>Chunking</li>
<li>Fill values</li>
<li>Groups</li>
<li>N-Dimensional variables</li>
<li>Per-variable endianness (big or little)</li>
</ul>
<p>With respect to full netCDF-4, the following concepts are
currently unsupported.</p>
<ul>
<li>String type</li>
<li>User-defined types (enum, opaque, VLEN, and Compound)</li>
<li>Unlimited dimensions</li>
<li>Contiguous or compact storage</li>
</ul>
<p>Note that contiguous and compact storage are not actually supported because they are HDF5 specific.
When specified, they are treated as chunked, where the file consists of only one chunk.
This means that testing for contiguous or compact storage is not possible; the <em>nc_inq_var_chunking</em> function will always return NC_CHUNKED, and the chunk sizes will be the same as the dimension sizes of the variable's dimensions.</p>
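<p>A minimal sketch of that behavior, assuming an NCZarr dataset already exists at the illustrative path below and contains a variable named "v":</p>
<pre><code>#include &lt;stdio.h&gt;
#include &lt;netcdf.h&gt;

int main(void) {
    int ncid, varid, storage, ndims;
    size_t chunks[NC_MAX_VAR_DIMS];
    /* Illustrative URL; any existing NCZarr dataset would do. */
    const char* url = "file:///tmp/dataset.file#mode=nczarr,file";

    if (nc_open(url, NC_NOWRITE, &ncid) != NC_NOERR) return 1;
    if (nc_inq_varid(ncid, "v", &varid) != NC_NOERR) return 1;
    nc_inq_varndims(ncid, varid, &ndims);
    nc_inq_var_chunking(ncid, varid, &storage, chunks);
    /* For NCZarr variables, storage is reported as NC_CHUNKED. */
    printf("storage = %s\n", storage == NC_CHUNKED ? "chunked" : "contiguous/compact");
    for (int i = 0; i &lt; ndims; i++)
        printf("chunk[%d] = %zu\n", i, chunks[i]);
    nc_close(ncid);
    return 0;
}
</code></pre>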
<h3>Enabling NCZarr Support</h3>
<p>NCZarr support is enabled if the <em>--enable-nczarr</em> option is used with './configure'.
If NCZarr support is enabled, then a usable version of <em>libcurl</em> must be specified using the <em>LDFLAGS</em> environment variable (similar to the way that the <em>HDF5</em> libraries are referenced).
Refer to the installation manual for details. NCZarr support can be disabled using the <em>--disable-nczarr</em> option.</p>
<h3>Accessing Data Using the NCZarr Protocol</h3>
<p>In order to access a NCZarr data source through the netCDF API, the file name normally used is replaced with a URL with a specific format.
Note specifically that there is no NC_NCZARR flag for the mode argument of <em>nc_create</em> or <em>nc_open</em>.
Instead, the NCZarr format is indicated by the URL path.</p>
<h4>URL Format</h4>
<p>The URL is the usual scheme://host:port/path?query#fragment format. There are some details that are important.</p>
<ul>
<li>Scheme: this should be <em>https</em>, <em>s3</em>, or <em>file</em>.
The <em>s3</em> scheme is equivalent to "https" plus setting "mode=nczarr,s3" (see below).
Specifying "file" is mostly used for testing, but is also used to support directory tree or zipfile format storage.</li>
<li>Host: Amazon S3 defines two forms: <em>Virtual</em> and <em>Path</em>.
<ul>
<li><em>Virtual</em>: the host includes the bucket name, as in <strong>bucket.s3.&lt;region&gt;.amazonaws.com</strong>.</li>
<li><em>Path</em>: the host does not include the bucket name; rather, the bucket name is the first segment of the path.
For example, <strong>s3.&lt;region&gt;.amazonaws.com/bucket</strong>.</li>
<li><em>Other</em>: It is possible to use other, non-Amazon cloud storage, but that is cloud library dependent.</li>
</ul></li>
<li>Query: currently not used.</li>
<li>Fragment: the fragment is of the form <em>key=value&key=value&...</em>.
Depending on the key, the <em>=value</em> part may be left out and some default value will be used.</li>
</ul>
<h4>Client Parameters</h4>
<p>The fragment part of a URL is used to specify information that is interpreted to specify what data format is to be used, as well as additional controls for that data format.
For NCZarr support, the following <em>key=value</em> pairs are allowed.</p>
<ul>
<li>mode=nczarr|zarr|s3|file|zip... -- The mode key specifies
the particular format to be used by the netcdf-c library for
interpreting the dataset specified by the URL. Using <em>mode=nczarr</em>
causes the URL to be interpreted as a reference to a dataset
that is stored in NCZarr format. The modes <em>s3</em>, <em>file</em>, and <em>zip</em>
tell the library what storage driver to use. The <em>s3</em> mode is the default
and indicates using Amazon S3 or some equivalent. The <em>file</em> format
stores data in a directory tree. The <em>zip</em> format stores data
in a local zip file. It should be the case that zipping a <em>file</em>
format directory tree will produce a file readable by the <em>zip</em>
storage format. The <em>zarr</em> mode tells the
library to use NCZarr, but to restrict its operation to operate on
pure Zarr Version 2 datasets.</li>
</ul>
<!--
- log=<output-stream>: this control turns on logging output,
which is useful for debugging and testing. If just _log_ is used
then it is equivalent to _log=stderr_.
-->
<h3>NCZarr Map Implementation</h3>
<p>Internally, the nczarr implementation has a map abstraction that allows different storage formats to be used.
This is closely patterned on the same approach used in the Python Zarr implementation, which relies on the Python <em>MutableMapping</em> <a href="#ref_python">[5]</a> class.</p>
<p>In NCZarr, the corresponding type is called <em>zmap</em>.
The <em>zmap</em> API essentially implements a simplified variant
of the Amazon S3 API. </p>
<p>As with Amazon S3, <em>keys</em> are utf8 strings with a specific structure:
that of a path similar to those of a Unix path with '/' as the
separator for the segments of the path.</p>
<p>As with Unix, all keys have this BNF syntax:</p>
<pre><code>key: '/' | keypath ;
keypath: '/' segment | keypath '/' segment ;
segment: &lt;sequence of UTF-8 characters except control characters and '/'&gt;
</code></pre>
<p>Obviously, one can infer a tree structure from this key structure.
A containment relationship is defined by key prefixes.
Thus one key is "contained" (possibly transitively)
by another if one key is a prefix (in the string sense) of the other.
So in this sense the key "/x/y/z" is contained by the key "/x/y".</p>
<p>In this model all keys "exist" but only some keys refer to
objects containing content -- <em>content bearing</em>.
An important restriction is placed on the structure of the tree,
namely that keys are only defined for content-bearing objects.
Further, all the leaves of the tree are these content-bearing objects.
This means that the key for one content-bearing object should not
be a prefix of any other key.</p>
<p>There are several other concepts of note.</p>
<ol>
<li><strong>Dataset</strong> - a dataset is the complete tree contained by the key defining
the root of the dataset. Technically, the root of the tree is the key &lt;dataset&gt;/.nczarr, where .nczarr can be considered the <em>superblock</em> of the dataset.</li>
<li><strong>Object</strong> - the equivalent of the S3 object; each object has a unique key
and "contains" data in the form of an arbitrary sequence of 8-bit bytes.</li>
</ol>
<p>The zmap API defined here isolates the key-value pair mapping
code from the Zarr-based implementation of NetCDF-4. It wraps
an internal C dispatch table manager for implementing an
abstract data structure that realizes the zmap key/object model.</p>
<p><strong>Search</strong>: The search function has two purposes:</p>
<ol>
<li>Support reading of pure zarr datasets (because they do not explicitly track their contents).</li>
<li>Debugging, to allow raw examination of the storage. See zdump for example.</li>
</ol>
<p>The search function takes a prefix path which has a key syntax
(see above). The set of legal keys is the set of keys such that
the key references a content-bearing object -- e.g. /x/y/.zarray
or /.zgroup. Essentially this is the set of keys pointing to the
leaf objects of the tree of keys constituting a dataset. This
set potentially limits the set of keys that need to be examined
during search.</p>
<p>The search function returns a limited set of names, where the
set of names are immediate suffixes of a given prefix path.
That is, if <em>&lt;prefix&gt;</em> is the prefix path, then search returns
all <em>&lt;name&gt;</em> such that <em>&lt;prefix&gt;/&lt;name&gt;</em> is itself a prefix
of a "legal" key. This can be used to implement glob-style
searches such as "/x/y/*" or "/x/y/**".</p>
<p>This semantics was chosen because it appears to be the minimum required to implement all other kinds of search using recursion. It was also chosen
to limit the number of names returned from the search. Specifically:</p>
<ol>
<li>Avoid returning keys that are not a prefix of some legal key.</li>
<li>Avoid returning all the legal keys in the dataset, because that set may be very large; although the implementation may still have to examine all legal keys to get the desired subset.</li>
<li>Allow for the use of partial read mechanisms such as iterators, if available. This can support processing a limited set of keys for each iteration. This is a straightforward tradeoff of space over time.</li>
</ol>
<p>As a side note, S3 supports this kind of search using common
prefixes with a delimiter of '/', although the implementation is
a bit tricky. For the file system zmap implementation, the legal
search keys can be obtained one level at a time, which directly
implements the search semantics. For the zip file
implementation, this semantics is not possible, so the whole
tree must be obtained and searched.</p>
<p><strong>Issues:</strong></p>
<ol>
<li>S3 limits key lengths to 1024 bytes. Some deeply nested netcdf files
will almost certainly exceed this limit.</li>
<li>Besides content, S3 objects can have an associated small set
of what may be called tags, which are themselves of the form of
key-value pairs, but where the key and value are always text. As
far as it is possible to determine, Zarr never uses these tags,
so they are not included in the zmap data structure.</li>
</ol>
<p><strong>A Note on Error Codes:</strong></p>
<p>The zmap API returns two distinguished error codes:</p>
<ol>
<li>NC_NOERR if an operation succeeded.</li>
<li>NC_EEMPTY, returned when accessing a key that has no content.</li>
</ol>
<p>Note that NC_EEMPTY is a new error code to signal that the
caller asked for a non-content-bearing key.</p>
<p>This does not preclude other errors being returned, such as
NC_EACCESS, NC_EPERM, or NC_EINVAL if there are permission
errors or illegal function arguments, for example. It also does
not preclude the use of other error codes internal to the zmap
implementation. So zmap_file, for example, uses NC_ENOTFOUND
internally because it is possible to detect the existence of
directories and files. This does not propagate outside the zmap_file
implementation.</p>
<h4>Zmap Implementations</h4>
<p>The primary zmap implementation is <em>s3</em> (i.e. <em>mode=nczarr,s3</em>)
and indicates that the Amazon S3 cloud storage
-- or some related appliance -- is to be used.
Another storage format uses a file system tree of directories and
files (<em>mode=nczarr,file</em>).
A third storage format uses a zip file (<em>mode=nczarr,zip</em>).
The latter two are used mostly for
debugging and testing. However, the <em>file</em> and <em>zip</em> formats
are important because they are intended to match the corresponding
storage formats used by the Python Zarr implementation. Hence
they should serve to provide interoperability between NCZarr and
the Python Zarr implementation. This has not been tested.</p>
<p>Examples of the typical URL form for <em>file</em> and <em>zip</em> are as follows.</p>
<pre><code>file:///xxx/yyy/testdata.file#mode=nczarr,file
file:///xxx/yyy/testdata.zip#mode=nczarr,zip
</code></pre>
<p>Note that the extension (e.g. ".file" in "testdata.file")
is arbitrary, so this would be equally acceptable.</p>
<pre><code>file:///xxx/yyy/testdata.anyext#mode=nczarr,file
</code></pre>
<p>As with other URLs (e.g. DAP), these kinds of URLs can be passed
as the path argument to <strong>ncdump</strong>, for example.</p>
<h3>NCZarr versus Pure Zarr.</h3>
<p>The NCZarr format extends the pure Zarr format by adding extra objects such as <em>.nczarr</em> and <em>.nczvar</em>.
It is possible to suppress the use of these extensions so that the netcdf library can read and write a pure zarr formatted file.
This is controlled by using the <em>mode=nczarr,zarr</em> combination.
The primary effects of using pure zarr are described
in the Translation section below.</p>
<h3>Notes on Debugging NCZarr Access</h3>
<p>The NCZarr support has a trace facility.
Enabling this can sometimes give important information.
Tracing can be enabled by setting the environment variable NCTRACING=n,
where <em>n</em> indicates the level of tracing. A good value of <em>n</em> is 9.</p>
<h3>Zip File Support</h3>
<p>In order to use the <em>zip</em> storage format, the libzip <a href="#ref_libzip">[3]</a>
library must be installed. Note that this is different from zlib.</p>
<h3>Amazon S3 Storage</h3>
<p>The Amazon AWS S3 storage driver currently uses the Amazon AWS S3 Software Development Kit for C++ (aws-s3-sdk-cpp).
In order to use it, the client must provide some configuration information.
Specifically, the <code>~/.aws/config</code> file should contain something like this.</p>
<pre><code>[default]
output = json
aws_access_key_id=XXXX...
aws_secret_access_key=YYYY...
</code></pre>
<h4>Addressing Style</h4>
<p>The notion of "addressing style" may need some expansion. Amazon S3 accepts two forms for specifying the endpoint for accessing the data.</p>
<ol>
<li><p>Virtual -- the virtual addressing style places the bucket in
the host part of a URL. For example:</p>
<pre><code>https://&lt;bucketname&gt;.s3.&lt;region&gt;.amazonaws.com/
</code></pre></li>
<li><p>Path -- the path addressing style places the bucket
at the front of the path part of a URL. For example:</p>
<pre><code>https://s3.&lt;region&gt;.amazonaws.com/&lt;bucketname&gt;/
</code></pre></li>
</ol>
<p>The NCZarr code will accept either form, although internally, it is standardized on path style.
The reason for this is that the bucket name forms the initial segment in the keys.</p>
<h3>Zarr vs NCZarr</h3>
<h4>Data Model</h4>
<p>The NCZarr storage format is almost identical to that of the
standard Zarr version 2 format. The data model differs as
follows.</p>
<ol>
<li>Zarr supports filters -- NCZarr as yet does not</li>
<li>Zarr only supports anonymous dimensions -- NCZarr supports
only shared (named) dimensions.</li>
<li>Zarr attributes are untyped -- or perhaps more correctly
characterized as of type string.</li>
</ol>
<h4>Storage Format</h4>
<p>Consider both NCZarr and Zarr, and assume S3 notions of bucket and object.
In both systems, Groups and Variables (Array in Zarr) map to S3 objects.
Containment is modeled using the fact that the container's key is a prefix of the variable's key.
So for example, if variable <em>v1</em> is contained in the top-level group <em>g1</em> -- key <em>/g1</em> -- then the key for <em>v1</em> is <em>/g1/v1</em>.
Additional information is stored in special objects whose names start with ".z".</p>
<p>In Zarr, the following special objects exist.</p>
<ol>
<li>Information about a group is kept in a special object named
<em>.zgroup</em>; so for example the object <em>/g1/.zgroup</em>.</li>
<li>Information about an array is kept as a special object named <em>.zarray</em>;
so for example the object <em>/g1/v1/.zarray</em>.</li>
<li>Group-level attributes and variable-level attributes are stored
in a special object named <em>.zattr</em>;
so for example the objects <em>/g1/.zattr</em> and <em>/g1/v1/.zattr</em>.</li>
</ol>
<p>The NCZarr format uses the same group and variable (array) objects as Zarr.
It also uses the Zarr special <em>.zXXX</em> objects.</p>
<p>However, NCZarr adds some additional special objects.</p>
<ol>
<li><p><em>.nczarr</em> -- this is in the top level group -- key <em>/.nczarr</em>.
It is in effect the "superblock" for the dataset and contains
any netcdf specific dataset level information. It is also used
to verify that a given key is the root of a dataset.</p></li>
<li><p><em>.nczgroup</em> -- this is a parallel object to <em>.zgroup</em> and contains any netcdf specific group information. Specifically it contains the following.</p>
<ul><li>dims -- the name and size of shared dimensions defined in this group.</li>
<li>vars -- the name of variables defined in this group.</li>
<li>groups -- the name of sub-groups defined in this group.</li></ul>
<p>These lists allow walking the NCZarr dataset without having to use
the potentially costly S3 list operation.</p></li>
<li><p><em>.nczvar</em> -- this is a parallel object to <em>.zarray</em> and contains
netcdf specific information. Specifically it contains the following.</p>
<ul><li>dimrefs -- the names of the shared dimensions referenced by the variable.</li>
<li>storage -- indicates if the variable is chunked vs contiguous
in the netcdf sense.</li></ul></li>
<li><p><em>.nczattr</em> -- this is parallel to the .zattr objects and stores
the attribute type information.</p></li>
</ol>
<h4>Translation</h4>
<p>With some constraints, it is possible for an nczarr library to read Zarr and for a zarr library to read the nczarr format.
The latter case, zarr reading nczarr, is possible if the zarr library is willing to ignore objects whose names it does not recognize; specifically, anything beginning with <em>.ncz</em>.</p>
<p>The former case, nczarr reading zarr, is also possible if the nczarr library can simulate or infer the contents of the missing <em>.nczXXX</em> objects.
As a rule this can be done as follows.</p>
<ol>
<li><em>.nczgroup</em> -- The list of contained variables and sub-groups
can be computed using the search API to list the keys
"contained" in the key for a group. By looking for occurrences
of <em>.zgroup</em>, <em>.zattr</em>, and <em>.zarray</em> to infer the keys for the
contained groups, attribute sets, and arrays (variables).
Constructing the set of "shared dimensions" is carried out
by walking all the variables in the whole dataset and collecting
the set of unique integer shapes for the variables.
For each such dimension length, a top level dimension is created
named ".zdim_<len>" where len is the integer length. The name
is subject to change.</li>
<li><em>.nczvar</em> -- The dimrefs are inferred by using the shape
in <em>.zarray</em> and creating references to the simulated shared dimension.
netcdf specific information.</li>
<li><em>.nczattr</em> -- The type of each attribute is inferred by trying to parse the first attribute value string.</li>
</ol>
<h3>Compatibility</h3>
<p>In order to accommodate existing implementations, certain mode tags are provided to tell the NCZarr code to look for information used by specific implementations.</p>
<h4>XArray</h4>
<p>The Xarray
<a href="#ref_xarray">[7]</a>
Zarr implementation uses its own mechanism for
specifying shared dimensions. It uses a special
attribute named <em>_ARRAY_DIMENSIONS</em>.
The value of this attribute is a list of dimension names (strings).
An example might be <code>["time", "lon", "lat"]</code>.
It is essentially equivalent to the
<code>.nczvar/dimrefs list</code>, but stored as a specific variable attribute.
It will be read/written if and only if the mode value "xarray" is specified.
If enabled and detected, then these dimension names are used
to define shared dimensions. Note that xarray implies pure zarr format.</p>
<h3>Examples</h3>
<p>Here are a couple of examples using the <em>ncgen</em> and <em>ncdump</em> utilities.</p>
<ol>
<li>Create an nczarr file using a local directory tree as storage.
<pre><code>ncgen -4 -lb -o "file:///home/user/dataset.file#mode=nczarr,file" dataset.cdl
</code></pre></li>
<li>Display the content of an nczarr file using a local zip file as storage.
<pre><code>ncdump "file:///home/user/dataset.zip#mode=nczarr,zip"
</code></pre></li>
<li>Create an nczarr file using S3 as storage.
<pre><code>ncgen -4 -lb -o "s3://s3.us-west-1.amazonaws.com/datasetbucket" dataset.cdl
</code></pre></li>
<li>Create an nczarr file using S3 as storage while keeping to the pure
zarr format.
<pre><code>ncgen -4 -lb -o "s3://s3.us-west-1.amazonaws.com/datasetbucket#mode=zarr" dataset.cdl
</code></pre></li>
</ol>
<h3>References</h3>
<p><a name="ref_aws"><a href="https://docs.aws.amazon.com/s3/index.html">1]</a> [Amazon Simple Storage Service Documentation</a><br>
<a name="ref_awssdk"><a href="https://github.com/aws/aws-sdk-cpp">2]</a> [Amazon Simple Storage Service Library</a><br>
<a name="ref_libzip"><a href="https://libzip.org/">3]</a> [The LibZip Library</a><br>
<a name="ref_nczarr"><a href="https://www.unidata.ucar.edu/blogs/developer/en/entry/netcdf-zarr-data-model-specification">4]</a> [NetCDF ZARR Data Model Specification</a><br>
<a name="ref_python"><a href="https://docs.python.org/2/library/collections.html">5]</a> [Python Documentation: 8.3. collections — High-performance container datatypes</a><br>
<a name="ref_zarrv2"><a href="https://zarr.readthedocs.io/en/stable/spec/v2.html">6]</a> [Zarr Version 2 Specification</a><br>
<a name="ref_xarray"><a href="http://xarray.pydata.org/en/latest/internals.html#zarr-encoding-specification">7]</a> [XArray Zarr Encoding Specification</a><br></p>
<h3>Appendix A. Building NCZarr Support</h3>
<p>Currently the following build cases are known to work.</p>
<table>
<tr><td><u>Operating System</u><td><u>Build System</u><td><u>NCZarr</u><td><u>S3 Support</u>
<tr><td>Linux <td> Automake <td> yes <td> yes
<tr><td>Linux <td> CMake <td> yes <td> yes
<tr><td>Cygwin <td> Automake <td> yes <td> no
<tr><td>OSX <td> Automake <td> unknown <td> unknown
<tr><td>OSX <td> CMake <td> unknown <td> unknown
<tr><td>Visual Studio <td> CMake <td> yes <td> tests fail
</table>
<p>Note: S3 support includes both compiling the S3 support code as well as running the S3 tests.</p>
<h3>Automake</h3>
<p>There are several options relevant to NCZarr support and to Amazon S3 support.
These are as follows.</p>
<ol>
<li><em>--enable-nczarr</em> -- enable the NCZarr support. If disabled, then all of the following options are disabled or irrelevant.</li>
<li><em>--enable-nczarr-s3</em> -- Enable NCZarr S3 support.</li>
<li><em>--enable-nczarr-s3-tests</em> -- the NCZarr S3 tests are currently only usable by Unidata personnel, so they are disabled by default.</li>
</ol>
<p>A note about using S3 with Automake. If S3 support is desired, and using Automake, then LDFLAGS must be properly set, namely to this.</p>
<pre><code>LDFLAGS="$LDFLAGS -L/usr/local/lib -laws-cpp-sdk-s3"
</code></pre>
<p>The above assumes that these libraries were installed in '/usr/local/lib', so the above requires modification if they were installed elsewhere.</p>
<p>Note also that if S3 support is enabled, then you need to have a C++ compiler installed because part of the S3 support code is written in C++.</p>
<h3>CMake</h3>
<p>The necessary CMake flags are as follows (with defaults)</p>
<ol>
<li>-DENABLE_NCZARR=on -- equivalent to the Automake <em>--enable-nczarr</em> option.</li>
<li>-DENABLE_NCZARR_S3=off -- equivalent to the Automake <em>--enable-nczarr-s3</em> option.</li>
<li>-DENABLE_NCZARR_S3_TESTS=off -- equivalent to the Automake <em>--enable-nczarr-s3-tests</em> option.</li>
</ol>
<p>Note that unlike Automake, CMake can properly locate C++ libraries, so it should not be necessary to specify <em>-laws-cpp-sdk-s3</em> assuming that the aws s3 libraries are installed in the default location.
For CMake with Visual Studio, the default location is here:</p>
<pre><code>C:/Program Files (x86)/aws-cpp-sdk-all
</code></pre>
<p>It is possible to install the sdk library in another location.
In this case, one must add the following flag to the cmake command.</p>
<pre><code>cmake ... -DAWSSDK_DIR=&lt;awssdkdir&gt;
</code></pre>
<p>where &lt;awssdkdir&gt; is the path to the sdk installation.
For example, this might be as follows.</p>
<pre><code>cmake ... -DAWSSDK_DIR="c:\tools\aws-cpp-sdk-all"
</code></pre>
<p>This can be useful if blanks in path names cause problems
in your build environment.</p>
<h4>Testing S3 Support</h4>
<p>The relevant tests for S3 support are in <em>nczarr_test</em>.
They will be run if <em>--enable-nczarr-s3-tests</em> is on.</p>
<p>Currently, by default, testing of S3 with NCZarr is supported only for Unidata members of the NetCDF Development Group.
This is because it uses a specific bucket on a specific internal S3 appliance that is inaccessible to the general user.</p>
<p>However, an untested mechanism exists by which others may be
able to run the tests. If someone else wants to attempt these
tests, then they need to define the following environment variables:</p>
<ul>
<li>NCZARR_S3_TEST_HOST=&lt;host&gt;</li>
<li>NCZARR_S3_TEST_BUCKET=&lt;bucket-name&gt;</li>
</ul>
<p>This assumes a Path Style address (see above) where:</p>
<ul>
<li>host -- the complete host part of the URL</li>
<li>bucket -- a bucket in which testing can occur without fear of damaging anything.</li>
</ul>
<p><em>Example:</em></p>
<pre><code>NCZARR_S3_TEST_HOST=s3.us-west-1.amazonaws.com
NCZARR_S3_TEST_BUCKET=testbucket
</code></pre>
<p>If anyone tries to use this mechanism, it would be appreciated
if any difficulties were reported to Unidata as a GitHub issue.</p>
<h3>Appendix B. Building aws-sdk-cpp</h3>
<p>In order to use the S3 storage driver, it is necessary to install the Amazon <a href="https://github.com/aws/aws-sdk-cpp.git">aws-sdk-cpp library</a>.</p>
<p>As a starting point, here are the CMake options used by Unidata to build that library.
It assumes that it is being executed in a build directory, <code>build</code> say, and that <code>build/../CMakeLists.txt</code> exists.</p>
<pre><code>cmake -DBUILD_ONLY=s3
</code></pre>
<p>The expected set of installed libraries are as follows:</p>
<ul>
<li>aws-cpp-sdk-s3</li>
<li>aws-cpp-sdk-core</li>
</ul>
<p>This library depends on libcurl, so you may need to install that
before building the sdk library.</p>
<h3>Appendix C. Amazon S3 Imposed Limits</h3>
<p>The Amazon S3 cloud storage imposes some significant limits that are inherited by NCZarr (and Zarr also, for that matter).</p>
<p>Some of the relevant limits are as follows:</p>
<ol>
<li>The maximum object size is 5 Gigabytes with a total for all objects limited to 5 Terabytes.</li>
<li>S3 key names can be any UNICODE name with a maximum length of 1024 bytes. Note that the limit is defined in terms of bytes and not (Unicode) characters. This affects the depth to which groups can be nested because the key encodes the full path name of a group.</li>
</ol>
<h3>Appendix D. Alternative Mechanisms for Accessing Remote Datasets</h3>
<p>The NetCDF-C library contains an alternate mechanism for accessing data stored in Amazon S3: the byte-range mechanism.
The idea is to treat the remote data as if it were a big file.
This remote "file" can be randomly accessed using the HTTP Byte-Range header.</p>
<p>In the Amazon S3 context, a copy of a dataset, a netcdf-3 or netcdf-4 file, is uploaded into a single object in some bucket.
Then using the key to this object, it is possible to tell the netcdf-c library to treat the object as a remote file and to use the HTTP Byte-Range protocol to access the contents of the object.
The dataset object is referenced using a URL with the trailing fragment containing the string <code>#mode=bytes</code>.</p>
<p>An examination of the test program <em>nc_test/test_byterange.sh</em> shows simple examples using the <em>ncdump</em> program.
One such test is specified as follows:</p>
<pre><code>https://s3.us-east-1.amazonaws.com/noaa-goes16/ABI-L1b-RadC/2017/059/03/OR_ABI-L1b-RadC-M3C13_G16_s20170590337505_e20170590340289_c20170590340316.nc#mode=bytes
</code></pre>
<p>Note that for S3 access, it is expected that the URL is in what is called "path" format where the bucket, <em>noaa-goes16</em> in this case, is part of the URL path instead of the host.</p>
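<p>A minimal sketch of byte-range access from C, using the GOES-16 object shown above (this assumes the library was built with byte-range support enabled):</p>
<pre><code>#include &lt;stdio.h&gt;
#include &lt;netcdf.h&gt;

int main(void) {
    int ncid;
    /* Remote object opened as if it were a local file, via the #mode=bytes fragment. */
    const char* url =
        "https://s3.us-east-1.amazonaws.com/noaa-goes16/ABI-L1b-RadC/2017/059/03/"
        "OR_ABI-L1b-RadC-M3C13_G16_s20170590337505_e20170590340289_c20170590340316.nc"
        "#mode=bytes";
    int stat = nc_open(url, NC_NOWRITE, &ncid);
    if (stat != NC_NOERR) {
        fprintf(stderr, "nc_open: %s\n", nc_strerror(stat));
        return 1;
    }
    /* ... ordinary netCDF inquiry and read calls go here ... */
    nc_close(ncid);
    return 0;
}
</code></pre>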
<p>The <em>#mode=bytes</em> mechanism generalizes to work with most servers that support byte-range access. <br />
Specifically, Thredds servers support such access using the HttpServer access method as can be seen from this URL taken from the above test program.</p>
<pre><code>https://thredds-test.unidata.ucar.edu/thredds/fileServer/irma/metar/files/METAR_20170910_0000.nc#bytes
</code></pre>
<h4>Byte-Range Authorization</h4>
<p>If using byte-range access, it may be necessary to tell the netcdf-c
library about the so-called secretid and accessid values.
These are usually stored in the file <code>~/.aws/config</code>
and/or <code>~/.aws/credentials</code>. In the latter file, this
might look like the following.</p>
<pre><code>[default]
aws_access_key_id=XXXXXXXXXXXXXXXXXXXX
aws_secret_access_key=YYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYY
</code></pre>
<h3>Point of Contact</h3>
<p><strong>Author</strong>: Dennis Heimbigner<br>
<strong>Email</strong>: dmh at ucar dot edu<br>
<strong>Initial Version</strong>: 4/10/2020<br>
<strong>Last Revised</strong>: 2/22/2021</p>
https://www.unidata.ucar.edu/blogs/developer/entry/enhancing-the-netcdf-c-libraryEnhancing the netCDF C++ Library and the Siphon PackageUnidata News2019-08-16T10:51:06-06:002019-08-16T10:51:07-06:00<div class="img_l" style="width: 125px;padding-bottom:0;margin-bottom:0;">
<img width="125" src="/blog_content/images/2019/20190611_asweeney_1_400.jpg" alt="Aodhan Sweeney" />
<div class="caption">
Aodhan Sweeney
</div>
<p></div></p>
<p class="byline">
by
<a href="https://www.unidata.ucar.edu/blogs/news/entry/welcome-summer-intern-aodhan-sweeney">Aodhan
Sweeney</a>
<br />2019 Unidata summer intern
</p>
<p>
This summer at Unidata I worked on expanding functionality for both the netCDF C++ library
and the Python data access tool Siphon. Previously, the netCDF C++ library was
lacking important functionality that was included in other netCDF libraries. Fortunately,
adding this functionality is a straightforward process. I created function wrappers in the
C++ library that would call previously made functions in the C library. This allows those
working in a C++ framework to continue to use the netCDF libraries without sacrificing
additional functionality.
</p>
<div class="img_l" style="width: 150px;">
<a class="lightbox" title="Aodhan Sweeney" href="/blog_content/images/2019/20190611_asweeney_1_400.jpg">
<img width="150" src="/blog_content/images/2019/20190611_asweeney_1_400.jpg" alt="Aodhan Sweeney" />
</a>
<div class="caption">
Aodhan Sweeney
</div>
<p></div></p>
<p class="byline">
by
<a href="https://www.unidata.ucar.edu/blogs/news/entry/welcome-summer-intern-aodhan-sweeney">Aodhan
Sweeney</a>
<br />2019 Unidata summer intern
</p>
<p>
This summer at Unidata I worked on expanding functionality for both the netCDF C++ library
and the Python data access tool Siphon. Previously, the <a href="https://www.unidata.ucar.edu/software/netcdf/">netCDF C++ library</a> was
lacking important functionality that was included in other netCDF libraries. Fortunately,
adding this functionality is a straightforward process. I created function wrappers in the
C++ library that would call previously made functions in the C library. This allows those
working in a C++ framework to continue to use the netCDF libraries without sacrificing
additional functionality.
</p>
<p style="font-style:italic;">
Editor's Note: Aodhan's additions to the netCDF C++ library will be included in the next release, expected in late summer 2019.
</p>
<div class="img_r" style="width: 200px;">
<a class="lightbox" title="Jupyter notebook plotting storm tracks. The data were retrieved from the National Hurricane Center using Siphon."
href="/blog_content/images/2019/20190809_aodhan_seminar_04.png">
<img width="200" src="/blog_content/images/2019/20190809_aodhan_seminar_04_s.png"
alt="Hurricane tracks" /> </a>
<div class="caption"> Storm tracks visualized in a Jupyter notebook<br />(click to enlarge) </div>
</div>
<p><a class="lightbox" title="Event information retrieved from the Storm Prediction Center using Siphon."
href="/blog_content/images/2019/20190809_aodhan_seminar_03.png"></a></p>
<p>
<a href="https://www.unidata.ucar.edu/software/siphon/">Siphon</a> is a data access module
written in Python. Originally, it was developed for easy remote access to data from THREDDS
Data Servers. In recent years, an offshoot of Siphon that focuses on remote access to data
servers not associated with a TDS has been developed. This summer I worked on expanding
Siphon's access to include data from the National Hurricane Center (NHC) and the Storm
Prediction Center (SPC). With easy-to-learn commands in a Python environment, we are
empowering our users to perform their own analysis of the data stored at the NHC and SPC. To
facilitate interaction with these servers, I also developed Jupyter notebook-based Graphical
User Interfaces (GUIs) to plot and visualize the data stored in the NHC and SPC.
</p>
<p style="font-style:italic;">
Editor's Note: Aodhan's additions to Siphon will be included in the next official release,
expected in the fall of 2019. The notebooks will be available in the
<a href="https://unidata.github.io/python-gallery/index.html">Unidata Python Gallery</a>.
</p>
<div class="img_l" style="width: 200px;">
<a class="lightbox" title="Surface temperature anomaly data visualized using Javascript in a web browser."
href="/blog_content/images/2019/20190809_aodhan_seminar_02.png">
<img width="200" src="/blog_content/images/2019/20190809_aodhan_seminar_02.png"
alt="Surface temperature anomalies" /> </a>
<div class="caption">Temperature anomalies visualized in a browser</div>
</div>
<p><a class="lightbox" title="Geopotential height manifolds visualised using Javascript in a web browser."
href="/blog_content/images/2019/20190809_aodhan_seminar_01.png"></a></p>
<p>
Because of my awesome mentors and the wealth of information here at the Unidata Program
Center and in the wider community, I was also encouraged to pursue projects that I was
curious about. I ended up creating and testing a few 3D visualization tools in JavaScript
that can be run out of a web browser. One of these, displaying average temperature anomalies
over land between the years of 1910 and 2019, was accepted by the
<a href="https://experiments.withgoogle.com/">Experiments with Google</a> program. You can see
the visualization and the code that creates it
<a href="https://experiments.withgoogle.com/a-century-of-surface-temperature-anomali">here</a>.
</p>
https://www.unidata.ucar.edu/blogs/developer/entry/netcdf-zarr-apiNetCDF Zarr APIDennis Heimbigner 2019-07-16T15:04:02-06:002019-07-16T15:04:02-06:00<p>This document defines the variant of the netcdf-c library API
that can be used to read/write NCZarr datasets. Additionally,
any special new flags or other parameter values are defined.
It is expected that this document should be consistent with the
NetCDF ZARR Data Model Specification [1].</p>
<ol>
<li><a href="#nczapi_intro">Introduction</a></li>
<li><a href="#nczapi_netcdf_zarr_api">The netCDF-Zarr API</a>
<ol><li><a href="#nczapi_file_functions">NetCDF File Functions</a></li>
<li><a href="#nczapi_dimensions">Dimensions</a></li>
<li><a href="#nczapi_types">Types</a></li>
<li><a href="#nczapi_variables">Variables</a></li>
<li><a href="#nczapi_representation_functions">Variable Representation Functions</a></li>
<li><a href="#nczapi_variable_io">Variable IO</a></li>
<li><a href="#nczapi_attributes">Attributes</a></li>
<li><a href="#nczapi_groups">Groups</a></li>
<li><a href="#nczapi_error_handling">NetCDF Error Handling</a></li>
<li><a href="#nczapi_misc">Miscellaneous Functions</a></li>
<li><a href="#nczapi_unimplemented">Unimplemented Functions</a></li>
<li><a href="#nczapi_parallelism">Parallelism Functions</a></li>
<li><a href="#nczapi_path_urls">Path URLS</a></li></ol></li>
</ol>
<h1>Introduction <a name="nczapi_intro"></a></h1>
<p>This document is a companion document to the
<em>NetCDF ZARR Data Model Specification</em>[1].
That document provides a semi-formal and abstract representation of
the NCZarr data model independent of any implementation.</p>
<p>This document describes a variant of the API provided by the netcdf-c
library as shown in its primary definition file <em>netcdf.h</em>.
Familiarity with the current netcdf-c library API is assumed.</p>
<h1>The netCDF-Zarr API <a name="nczapi_netcdf_zarr_api"></a></h1>
<p>This API takes the netcdf-c library API and divides it into sets
of related functions. Any semantic differences are described.
API functions that are disallowed are also described.
Functions are organized according to the netCDF data model.</p>
<h2>NetCDF File Functions <a name="nczapi_file_functions"></a></h2>
<pre><code>EXTERNL int
nc_create(const char* path, int cmode, int* ncidp);
EXTERNL int
nc__create(const char* path, int cmode, size_t initialsz, size_t* chunksizehintp, int* ncidp);
EXTERNL int
nc_open(const char* path, int mode, int* ncidp);
EXTERNL int
nc__open(const char* path, int mode, size_t* chunksizehintp, int* ncidp);
EXTERNL int
nc_inq_ncid(int ncid, const char* name, int* grp_ncid);
EXTERNL int
nc_redef(int ncid);
EXTERNL int
nc_enddef(int ncid);
EXTERNL int
nc__enddef(int ncid, size_t h_minfree, size_t v_align, size_t v_minfree, size_t r_align);
EXTERNL int
nc_sync(int ncid);
EXTERNL int
nc_abort(int ncid);
EXTERNL int
nc_close(int ncid);
EXTERNL int
nc_inq_path(int ncid, size_t* pathlen, char* path);
</code></pre>
<p>With the exceptions noted below, all of these functions are implemented with essentially standard semantics.</p>
<p>Notes:</p>
<ol>
<li>The double underscore functions (e.g. <em>nc__create</em>) are implemented in terms of the single underscore versions with the extra parameters ignored.</li>
<li><em>nc_sync</em>, <em>nc_redef</em>, and <em>nc_enddef</em> may be implemented as no-op
functions depending on the underlying implementation.</li>
<li>The syntax and interpretation of the <em>path</em> argument are implementation dependent (see <a href="#nczapi_path_urls">below</a>).</li>
</ol>
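<p>As a minimal sketch of the standard create/close sequence (the path shown is a placeholder, since the syntax of an NCZarr path is implementation dependent, per note 3 above):</p>
<pre><code>#include <stdio.h>
#include <netcdf.h>

int main(void) {
    int ncid, stat;
    /* "example.nczarr" is a placeholder path; see the Path URLS section */
    if ((stat = nc_create("example.nczarr", NC_CLOBBER, &ncid)) != NC_NOERR) {
        fprintf(stderr, "nc_create: %s\n", nc_strerror(stat));
        return 1;
    }
    /* definitions of dimensions, variables, and attributes go here */
    nc_enddef(ncid); /* possibly a no-op, per note 2 above */
    nc_close(ncid);
    return 0;
}
</code></pre>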
<h2>Dimensions <a name="nczapi_dimensions"></a></h2>
<pre><code>EXTERNL int
nc_def_dim(int ncid, const char* name, size_t len, int* idp);
EXTERNL int
nc_inq_dimid(int ncid, const char* name, int* idp);
EXTERNL int
nc_inq_dim(int ncid, int dimid, char* name, size_t* lenp);
EXTERNL int
nc_inq_dimname(int ncid, int dimid, char* name);
EXTERNL int
nc_inq_dimlen(int ncid, int dimid, size_t* lenp);
EXTERNL int
nc_rename_dim(int ncid, int dimid, const char* name);
</code></pre>
<p>All of these functions are implemented with essentially standard semantics.</p>
<p>Notes:</p>
<ol>
<li>These APIs all assume named dimensions. The management of named dimensions is still an open
issue for Zarr. For writing, anonymous dimensions are not allowed, but they are for reading.
When reading an anonymous dimension, a specially named dimension will be created to represent
the anonymous dimension.</li>
<li>Unlimited dimensions are currently unimplemented.</li>
</ol>
<h2>Types <a name="nczapi_types"></a></h2>
<pre><code>EXTERNL int
nc_inq_type(int ncid, nc_type xtype, char *name, size_t *size);
/* Get the id of a type from the name. */
EXTERNL int
nc_inq_typeid(int ncid, const char *name, nc_type *typeidp);
</code></pre>
<p>Notes:</p>
<ol>
<li>In the current implementation, only a selected set of atomic types
is implemented, namely: <em>NC_CHAR, NC_BYTE, NC_SHORT, NC_INT, NC_FLOAT, NC_DOUBLE, NC_UBYTE, NC_USHORT, NC_UINT, NC_INT64, and NC_UINT64</em>.</li>
</ol>
<h2>Variables <a name="nczapi_variables"></a></h2>
<pre><code>EXTERNL int
nc_def_var(int ncid, const char* name, nc_type xtype, int ndims, const int* dimidsp, int* varidp);
EXTERNL int
nc_inq_var(int ncid, int varid, char* name, nc_type* xtypep, int* ndimsp, int* dimidsp, int* nattsp);
EXTERNL int
nc_inq_varid(int ncid, const char* name, int* varidp);
EXTERNL int
nc_inq_varname(int ncid, int varid, char* name);
EXTERNL int
nc_inq_vartype(int ncid, int varid, nc_type* xtypep);
EXTERNL int
nc_inq_varndims(int ncid, int varid, int* ndimsp);
EXTERNL int
nc_inq_vardimid(int ncid, int varid, int* dimidsp);
EXTERNL int
nc_inq_varnatts(int ncid, int varid, int* nattsp);
EXTERNL int
nc_rename_var(int ncid, int varid, const char* name);
</code></pre>
<p>The basic variable definition/inquiry functions have the standard
netCDF-4 semantics.</p>
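<p>A minimal sketch of the usual definition sequence, assuming <em>ncid</em> refers to a dataset or group in define mode (the names and sizes are illustrative only):</p>
<pre><code>#include <netcdf.h>

/* Define two named dimensions and a 2-D float variable; each call
   returns NC_NOERR on success. */
static int define_temperature(int ncid, int* varidp) {
    int dimids[2];
    int stat;
    if ((stat = nc_def_dim(ncid, "lat", 180, &dimids[0]))) return stat;
    if ((stat = nc_def_dim(ncid, "lon", 360, &dimids[1]))) return stat;
    return nc_def_var(ncid, "temperature", NC_FLOAT, 2, dimids, varidp);
}
</code></pre>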
<h2>Variable Representation Functions <a name="nczapi_representation_functions"></a></h2>
<pre><code>EXTERNL int
nc_def_var_filter(int ncid, int varid, unsigned int id, size_t nparams, const unsigned int* parms);
EXTERNL int
nc_inq_var_filter(int ncid, int varid, unsigned int* idp, size_t* nparams, unsigned int* params);
EXTERNL int
nc_def_var_deflate(int ncid, int varid, int shuffle, int deflate, int deflate_level);
EXTERNL int
nc_inq_var_deflate(int ncid, int varid, int* shufflep, int* deflatep, int* deflate_levelp);
EXTERNL int
nc_inq_var_szip(int ncid, int varid, int* options_maskp, int* pixels_per_blockp);
EXTERNL int
nc_def_var_fletcher32(int ncid, int varid, int fletcher32);
EXTERNL int
nc_inq_var_fletcher32(int ncid, int varid, int* fletcher32p);
EXTERNL int
nc_def_var_chunking(int ncid, int varid, int storage, const size_t* chunksizesp);
EXTERNL int
nc_inq_var_chunking(int ncid, int varid, int* storagep, size_t* chunksizesp);
EXTERNL int
nc_def_var_fill(int ncid, int varid, int no_fill, const void* fill_value);
EXTERNL int
nc_inq_var_fill(int ncid, int varid, int* no_fill, void* fill_valuep);
</code></pre>
<p>These functions specify information about the layout and storage of variables.
The deflate and szip functions are all implemented as calls to the def/inq filter
functions. It appears that the semantics of the chunking functions
match those of Zarr, so they can be directly implemented.
Handling of the fill functions is still T.B.D.</p>
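<p>For example, a hedged sketch of how chunking and deflate compression might be requested for an already-defined 2-D variable (the chunk sizes and deflate level are illustrative):</p>
<pre><code>#include <netcdf.h>

/* Request chunked storage with 90x90 chunks, then deflate at level 5;
   the deflate call is implemented via the filter mechanism described above. */
static int set_layout(int ncid, int varid) {
    size_t chunks[2] = {90, 90};
    int stat;
    if ((stat = nc_def_var_chunking(ncid, varid, NC_CHUNKED, chunks))) return stat;
    return nc_def_var_deflate(ncid, varid, /*shuffle*/0, /*deflate*/1, /*level*/5);
}
</code></pre>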
<h2>Variable IO <a name="nczapi_variable_io"></a></h2>
<pre><code>EXTERNL int
nc_put_var(int ncid, int varid, const void* op);
EXTERNL int
nc_get_var(int ncid, int varid, void* ip);
EXTERNL int
nc_put_var1(int ncid, int varid, const size_t* indexp, const void* op);
EXTERNL int
nc_get_var1(int ncid, int varid, const size_t* indexp, void* ip);
EXTERNL int
nc_put_vara(int ncid, int varid, const size_t* startp, const size_t* countp, const void* op);
EXTERNL int
nc_get_vara(int ncid, int varid, const size_t* startp, const size_t* countp, void* ip);
EXTERNL int
nc_put_vars(int ncid, int varid, const size_t* startp, const size_t* countp, const ptrdiff_t* stridep, const void* op);
EXTERNL int
nc_get_vars(int ncid, int varid, const size_t* startp, const size_t* countp, const ptrdiff_t* stridep, void* ip);
EXTERNL int
nc_put_varm(int ncid, int varid, const size_t* startp, const size_t* countp, const ptrdiff_t* stridep, const ptrdiff_t* imapp, const void* op);
EXTERNL int
nc_put_var_T(int ncid, int varid, const T* op);
EXTERNL int
nc_get_var_T(int ncid, int varid, T* ip);
EXTERNL int
nc_put_var1_T(int ncid, int varid, const size_t* indexp, const T* op);
EXTERNL int
nc_get_var1_T(int ncid, int varid, const size_t* indexp, T* ip);
EXTERNL int
nc_put_vara_T(int ncid, int varid, const size_t* startp, const size_t* countp, const T* op);
EXTERNL int
nc_get_vara_T(int ncid, int varid, const size_t* startp, const size_t* countp, T* ip);
EXTERNL int
nc_put_vars_T(int ncid, int varid, const size_t* startp, const size_t* countp, const ptrdiff_t* stridep, const T* op);
EXTERNL int
nc_get_vars_T(int ncid, int varid, const size_t* startp, const size_t* countp, const ptrdiff_t* stridep, T* ip);
EXTERNL int
nc_put_varm_T(int ncid, int varid, const size_t* startp, const size_t* countp, const ptrdiff_t* stridep, const ptrdiff_t* imapp, const T* op);
EXTERNL int
nc_get_varm_T(int ncid, int varid, const size_t* startp, const size_t* countp, const ptrdiff_t* stridep, const ptrdiff_t* imapp, T* ip);
</code></pre>
<p>The primary variable I/O functions are defined by the first eight functions in this list,
as is the case in the existing netcdf library code.
The put/get varm functions are all implemented in terms of calls to put/get vars functions,
again as in the existing code.</p>
<p>The get/put var T functions primarily exist to support library implemented type conversion.
If the actual variable type is different than the function type (the T), then automatic
conversion is performed from the actual type to the desired type. With some judicious refactoring,
it should be possible to reuse the existing conversion code in the netcdf-c library.</p>
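<p>To illustrate the type-conversion behavior, the following sketch reads a small corner of a variable as <em>double</em> values regardless of the variable's stored type (the start/count values are illustrative):</p>
<pre><code>#include <netcdf.h>

/* Read a 2x3 corner of a 2-D variable; if the stored type differs from
   double, the library converts each value on the way out. */
static int read_corner(int ncid, int varid, double out[6]) {
    size_t start[2] = {0, 0};
    size_t count[2] = {2, 3};
    return nc_get_vara_double(ncid, varid, start, count, out);
}
</code></pre>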
<h2>Attributes <a name="nczapi_attributes"></a></h2>
<pre><code>EXTERNL int
nc_put_att(int ncid, int varid, const char* name, nc_type xtype, size_t len, const void* op);
EXTERNL int
nc_get_att(int ncid, int varid, const char* name, void* ip);
EXTERNL int
nc_inq_att(int ncid, int varid, const char* name, nc_type* xtypep, size_t* lenp);
EXTERNL int
nc_inq_attid(int ncid, int varid, const char* name, int* idp);
EXTERNL int
nc_inq_atttype(int ncid, int varid, const char* name, nc_type* xtypep);
EXTERNL int
nc_inq_attlen(int ncid, int varid, const char* name, size_t* lenp);
EXTERNL int
nc_inq_attname(int ncid, int varid, int attnum, char* name);
EXTERNL int
nc_copy_att(int ncid_in, int varid_in, const char* name, int ncid_out, int varid_out);
EXTERNL int
nc_rename_att(int ncid, int varid, const char* name, const char* newname);
EXTERNL int
nc_del_att(int ncid, int varid, const char* name);
EXTERNL int
nc_put_att_T(int ncid, int varid, const char* name, size_t len, const T* op);
EXTERNL int
nc_get_att_T(int ncid, int varid, const char* name, T* op);
</code></pre>
<p>The primary attribute put/get functions are defined by the first two functions in this list.
The get/put T functions are implemented in terms of these two more generic functions.</p>
<p>The get/put T functions primarily exist to support library implemented type conversion.
If the actual attribute type is different than the function type (the T), then automatic
conversion is performed from the actual type to the desired type. With some judicious refactoring,
it should be possible to reuse the existing conversion code in the netcdf-c library.</p>
<p>The put T functions specify the actual type of the attribute, so there is no conversion implied. </p>
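<p>A small sketch of the attribute round trip (the attribute name and values are illustrative): the put call fixes the stored type via its <em>xtype</em> argument, and the typed get call converts to the requested in-memory type.</p>
<pre><code>#include <netcdf.h>

/* Attach a two-element double attribute to a variable and read it back. */
static int roundtrip_att(int ncid, int varid) {
    double range[2] = {-40.0, 60.0};
    double readback[2];
    int stat;
    if ((stat = nc_put_att_double(ncid, varid, "valid_range", NC_DOUBLE, 2, range)))
        return stat;
    return nc_get_att_double(ncid, varid, "valid_range", readback);
}
</code></pre>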
<h2>Groups <a name="nczapi_groups"></a></h2>
<pre><code>EXTERNL int
nc_def_grp(int parent_ncid, const char* name, int* new_ncid);
EXTERNL int
nc_rename_grp(int grpid, const char* name);
</code></pre>
<p>The semantics of the group functions appear to be completely consistent with the
existing Zarr semantics. It is assumed that the graph of groups is a tree,
which implies no cycles and no shared subgroups.</p>
<h2>NetCDF Error Handling <a name="nczapi_error_handling"></a></h2>
<pre><code>EXTERNL const char*
nc_strerror(int ncerr);
EXTERNL int
nc_set_log_level(int new_level);
</code></pre>
<p>Error reporting and event logging is not defined for Zarr, so these are the
same as for the netcdf-c library.</p>
<h2>Miscellaneous Functions <a name="nczapi_misc"></a></h2>
<pre><code>EXTERNL const char*
nc_inq_libvers(void);
EXTERNL int
nc_initialize(void);
EXTERNL int
nc_finalize(void);
EXTERNL int
nc_set_fill(int ncid, int fillmode, int* old_modep);
EXTERNL int
nc_set_default_format(int format, int* old_formatp);
EXTERNL int
nc_inq_format(int ncid, int* formatp);
EXTERNL int
nc_inq_format_extended(int ncid, int* formatp, int* modep);
EXTERNL int
nc_set_chunk_cache(size_t size, size_t nelems, float preemption);
EXTERNL int
nc_get_chunk_cache(size_t* sizep, size_t* nelemsp, float* preemptionp);
EXTERNL int
nc_set_var_chunk_cache(int ncid, int varid, size_t size, size_t nelems, float preemption);
EXTERNL int
nc_get_var_chunk_cache(int ncid, int varid, size_t* sizep, size_t* nelemsp, float* preemptionp);
EXTERNL int
nc_inq(int ncid, int* ndimsp, int* nvarsp, int* nattsp, int* unlimdimidp);
EXTERNL int
nc_inq_ndims(int ncid, int* ndimsp);
EXTERNL int
nc_inq_nvars(int ncid, int* nvarsp);
EXTERNL int
nc_inq_natts(int ncid, int* nattsp);
EXTERNL int
nc_delete(const char* path);
</code></pre>
<p>Notes:</p>
<ol>
<li>It is unclear if the format related functions are sufficient for specifying cloud
format information. There may be significant implementation-dependent information
that these functions cannot provide as currently defined.</li>
<li>Use of the chunk caching functions may be completely implementation dependent.
The idea of using a chunk cache seems to be an obvious requirement for good
performance.</li>
<li>All the inq functions should be able to have standard netcdf semantics.</li>
<li>The <em>nc_delete</em> function has always been something of an outlier, but it is useful
to have a way to completely remove a dataset in a way that is implementation dependent.</li>
</ol>
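<p>As a hedged sketch of the chunk-cache tuning calls mentioned in note 2 (the values are illustrative, and the effect may be implementation dependent):</p>
<pre><code>#include <netcdf.h>

/* Give one variable a 16 MiB chunk cache with 1009 slots and a 0.75
   preemption policy. */
static int tune_cache(int ncid, int varid) {
    return nc_set_var_chunk_cache(ncid, varid,
                                  16 * 1024 * 1024, /* cache size in bytes */
                                  1009,             /* number of chunk slots */
                                  0.75f);           /* preemption */
}
</code></pre>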
<h2>Unimplemented Functions <a name="nczapi_unimplemented"></a></h2>
<p>Basically, any function not specified above will be unimplemented. The current list is as follows.</p>
<pre><code>EXTERNL int
nc_inq_unlimdim(int ncid, int* unlimdimidp);
EXTERNL int
nc_inq_unlimdims(int ncid, int* nunlimdimsp, int* unlimdimidsp);
EXTERNL int
nc_show_metadata(int ncid);
EXTERNL int
nc_copy_var(int ncid_in, int varid, int ncid_out);
EXTERNL int
nc_def_opaque(int ncid, size_t size, const char* name, nc_type* xtypep);
EXTERNL int
nc_inq_opaque(int ncid, nc_type xtype, char* name, size_t* sizep);
EXTERNL int
nc_def_compound(int ncid, size_t size, const char* name, nc_type* typeidp);
EXTERNL int
nc_insert_compound(int ncid, nc_type xtype, const char* name, size_t offset, nc_type field_typeid);
EXTERNL int
nc_insert_array_compound(int ncid, nc_type xtype, const char* name, size_t offset, nc_type field_typeid, int ndims, const int* dim_sizes);
EXTERNL int
nc_inq_compound(int ncid, nc_type xtype, char* name, size_t* sizep, size_t* nfieldsp);
EXTERNL int
nc_inq_compound_name(int ncid, nc_type xtype, char* name);
EXTERNL int
nc_inq_compound_size(int ncid, nc_type xtype, size_t* sizep);
EXTERNL int
nc_inq_compound_nfields(int ncid, nc_type xtype, size_t* nfieldsp);
EXTERNL int
nc_inq_compound_field(int ncid, nc_type xtype, int fieldid, char* name, size_t* offsetp, nc_type* field_typeidp, int* ndimsp, int* dim_sizesp);
EXTERNL int
nc_inq_compound_fieldname(int ncid, nc_type xtype, int fieldid, char* name);
EXTERNL int
nc_inq_compound_fieldindex(int ncid, nc_type xtype, const char* name, int* fieldidp);
EXTERNL int
nc_inq_compound_fieldoffset(int ncid, nc_type xtype, int fieldid, size_t* offsetp);
EXTERNL int
nc_inq_compound_fieldtype(int ncid, nc_type xtype, int fieldid, nc_type* field_typeidp);
EXTERNL int
nc_inq_compound_fieldndims(int ncid, nc_type xtype, int fieldid, int* ndimsp);
EXTERNL int
nc_inq_compound_fielddim_sizes(int ncid, nc_type xtype, int fieldid, int* dim_sizes);
EXTERNL int
nc_def_enum(int ncid, nc_type base_typeid, const char* name, nc_type* typeidp);
EXTERNL int
nc_insert_enum(int ncid, nc_type xtype, const char* name, const void* value);
EXTERNL int
nc_inq_enum(int ncid, nc_type xtype, char* name, nc_type* base_nc_typep, size_t* base_sizep, size_t* num_membersp);
EXTERNL int
nc_inq_enum_member(int ncid, nc_type xtype, int idx, char* name, void* value);
EXTERNL int
nc_inq_enum_ident(int ncid, nc_type xtype, long long value, char* identifier);
EXTERNL int
nc_def_vlen(int ncid, const char* name, nc_type base_typeid, nc_type* xtypep);
EXTERNL int
nc_inq_vlen(int ncid, nc_type xtype, char* name, size_t* datum_sizep, nc_type* base_nc_typep);
EXTERNL int
nc_free_vlen(nc_vlen_t* vl);
EXTERNL int
nc_free_vlens(size_t len, nc_vlen_t vlens[]);
EXTERNL int
nc_put_vlen_element(int ncid, int typeid1, void* vlen_element, size_t len, const void* data);
EXTERNL int
nc_get_vlen_element(int ncid, int typeid1, const void* vlen_element, size_t* len, void* data);
EXTERNL int
nc_def_var_endian(int ncid, int varid, int endian);
EXTERNL int
nc_inq_var_endian(int ncid, int varid, int* endianp);
</code></pre>
<p>These functions are currently "unimplemented" in the sense that they will return the error code <em>NC_ENOTBUILT</em>.</p>
<h2>Parallelism Functions <a name="nczapi_parallelism"></a></h2>
<pre><code>EXTERNL int
nc__create_mp(const char* path, int cmode, size_t initialsz, int basepe, size_t* chunksizehintp, int* ncidp);
EXTERNL int
nc__open_mp(const char* path, int mode, int basepe, size_t* chunksizehintp, int* ncidp);
EXTERNL int
nc_delete_mp(const char* path, int basepe);
EXTERNL int
nc_set_base_pe(int ncid, int pe);
EXTERNL int
nc_inq_base_pe(int ncid, int* pe);
</code></pre>
<p>The netcdf library parallelism-related functions are all heavily MPI oriented.
It is unclear what is to be done with these functions.</p>
<h2>Path URLS <a name="nczapi_path_urls"></a></h2>
<p>It is assumed that the format of a Zarr file will look like a
netcdf Enhanced file with some variations. However, the path for
specifying a cloud-based dataset will be more complicated than a
simple file path. As with DAP2 and DAP4, it will be some kind of
URL annotated with extra information relevant to its
interpretation.</p>
<h1>References</h1>
<p>[1] NetCDF ZARR Data Model Specification (https://www.unidata.ucar.edu/blogs/developer/en/entry/netcdf-zarr-data-model-specification)<br>
[2] Zarr Specification Version 2 (https://zarr.readthedocs.io/en/stable/spec/v2.html)<br></p>
<h1>Copyright</h1>
<p>Copyright 2018, UCAR/Unidata<br>
See netcdf/COPYRIGHT file for copying and redistribution conditions.</p>
<h1>Point of Contact</h1>
<p><strong>Author</strong>: Dennis Heimbigner<br>
<strong>Email</strong>: dmh at ucar dot edu<br>
<strong>Initial Version</strong>: 12/1/2018<br>
<strong>Last Revised</strong>: 7/16/2019</p>
https://www.unidata.ucar.edu/blogs/developer/entry/netcdf-zarr-data-model-specificationNetCDF ZARR Data Model SpecificationDennis Heimbigner 2019-07-02T16:01:22-06:002019-07-02T16:01:22-06:00<p>This document defines the initial netcdf Zarr (NCZarr)
data model to be implemented. As the Zarr version 3 specification progresses, this model will be extended to include new data types.</p>
<h1>Table of Contents</h1>
<ol>
<li><a href="#nczarr_intro">Introduction</a></li>
<li><a href="#nczarr_notation">Notation</a></li>
<li><a href="#nczarr_datamodel">Data Model</a>
<ol><li><a href="#nczarr_dataset">Dataset</a></li>
<li><a href="#nczarr_group">Group</a></li>
<li><a href="#nczarr_attribute">Attribute</a></li>
<li><a href="#nczarr_dimension">Dimension</a></li>
<li><a href="#nczarr_variable">Variable</a></li>
<li><a href="#nczarr_dimref">Dimension Reference</a></li>
<li><a href="#nczarr_types">Types</a></li></ol></li>
<li><a href="#nczarr_excluded">Excluded Elements</a></li>
<li><a href="#nczarr_lexemes">Appendix A. Supporting Lexical Tokens</a>
<ol><li><a href="#nczarr_fqn">Fully Qualified Names</a></li></ol></li>
<li><a href="#nczarr_supplement">Appendix B. Supplementary Material</a>
<ol><li><a href="#nczarr_csensitive_spec">Specifying Context-Sensitive Elements</a></li></ol></li>
<li><a href="#nczarr_complete_spec">Appendix C. Complete Version of the Abstract Representation Specification</a></li>
</ol>
<h1>Introduction <a name="nczarr_intro"></a></h1>
<p>This document describes the to-be-implemented NCZarr data model
by reference to the netcdf-4 (aka netcdf enhanced) data model.
Elements of the enhanced model included in this model will be listed.
Elements of the enhanced model not included are listed in a later section.</p>
<h1>Notation <a name="nczarr_notation"></a></h1>
<p>In order to represent the abstract structure of the NCZarr data
model, we must choose some suitable notation. This notation
must meet the requirement that it is typed, meaning that the
nodes of the tree have a type and the structure of the node must
conform to that type.</p>
<p>Ideally, we would use Json as our notation since that is the
target representation used by the Zarr specification.
Unfortunately, Json is effectively typeless so we do not consider it
powerful enough to properly represent the data model. If some
way exists to do this, then this may be viable.</p>
<p>We choose Antlr4 [1] as our formalism because it is designed for
such uses as this one, and it is quite concise. In the following
specification, upper-case names (such as NAME or ZARRVERSION)
are terminals in the parsing sense and are specified in
Appendix A.</p>
<h1>Data Model <a name="nczarr_datamodel"></a></h1>
<h3>Dataset <a name="nczarr_dataset"></a></h3>
<pre><code>dataset : NAME ZARRVERSION (dimension | variable | attribute | group)*
</code></pre>
<p>The unit of data storage in NCZarr, as with netcdf-4, is the
<em>Dataset</em>. A Dataset is also a Group (see below), so it can contain
variables, attributes, and (sub-)groups. These semantics are consistent
with the netcdf-4 Dataset semantics.</p>
<h3>Group <a name="nczarr_group"></a></h3>
<pre><code>group: NAME (dimension | variable | attribute | group)*
</code></pre>
<p>A Group contains a collection of dimension declarations, variable
declarations, attributes, and (sub-)groups. Note that user-defined
type declarations are not (yet) included.</p>
<h3>Attribute <a name="nczarr_attribute"></a></h3>
<pre><code>attribute : NAME value_type (CONSTANT)+
</code></pre>
<p>An Attribute contains an (ordered) set of values, where the values
are constants consistent with the specified type of the attribute.
An attribute must have at least one value.</p>
<h3>Dimension <a name="nczarr_dimension"></a></h3>
<pre><code>dimension: NAME SIZE
</code></pre>
<p>A Dimension declaration defines a named dimension where the
dimension has a specified size.</p>
<h3>Variable <a name="nczarr_variable"></a></h3>
<pre><code>variable: NAME type (dimref)* (attribute)*
</code></pre>
<p>A Variable declaration defines a named variable of a specified
type. It also can reference a set of dimensions defining the
rank and size of the variable. If no dimensions are referenced,
then the variable is a scalar.</p>
<p>Additionally, any number of attributes can be associated with the variable
to define properties about the variable.</p>
<h3>Dimension Reference <a name="nczarr_dimref"></a></h3>
<pre><code>dimref: SIZE | FQN
</code></pre>
<p>A Dimension reference specifies one of the dimensions of a variable
by either defining an anonymous dimension, where the size is specified
directly, or by providing the fully qualified name referring to some
dimension defined in some Group via a <code><Dimension></code> declaration.</p>
<h3>Types <a name="nczarr_types"></a></h3>
<pre><code>type: atomic_type ;
atomic_type: fixed_atomic_type | char_type ;
fixed_atomic_type:
BYTE_T // A signed 8 bit integer
| UBYTE_T // An unsigned 8 bit integer
| SHORT_T // A signed 16 bit integer
| USHORT_T // An unsigned 16 bit integer
| INT_T // A signed 32 bit integer
| UINT_T // An unsigned 32 bit integer
| INT64_T // A signed 64 bit integer
| UINT64_T // An unsigned 64 bit integer
;
char_type: CHAR_T ;
</code></pre>
<p>For now, NCZarr only supports the signed and unsigned integer types
of sizes 8, 16, 32, and 64 bits. It also supports an approximation
to the character type.
Addition of more complex types such as strings must await the Zarr
version 3 specification.</p>
<p>These atomic types are those that can be used when specifying the
type of a variable or an attribute; the names are taken from the
corresponding netCDF-4 specification.</p>
<h3>Character Type</h3>
<p>The character type is almost universally (except for Java)
associated with an 8-bit unsigned value.
But this has always caused problems because historically,
multiple encodings have been associated with it: ASCII, ISO-LATIN-8859,
UTF-8, for example.</p>
<p>Each encoding may support only a subset of the 256 possible values
that can be represented by an 8-bit unsigned value. In the case of UTF-8,
which supports multi-byte characters, a single 8-bit value may not even
be able to represent a legal UTF-8 character.</p>
<p>To deal with this, we essentially punt by declaring the character type
to be the same as UBYTE_T (an 8-bit unsigned integer). Interpretation
of the encoding of a character is then outside the scope of this document.</p>
<h1>Excluded Elements <a name="nczarr_excluded"></a></h1>
<p>The initial data model for NCZarr deliberately excludes
a number of netcdf-4 concepts so that a working implementation
can be achieved as rapidly as possible. Additionally,
implementation of some netcdf-4 features need to be coordinated
with the new version 3 Zarr specification.</p>
<h2>Strings</h2>
<p>The biggest omission is the netcdf-4 String type. The reason is
that it is a varying length type and proper representation
in Zarr is still incomplete. It is expected that this will
be the first new type to be added since it is so useful. For now,
the netcdf-3 approach of using arrays of characters will need to be
used.</p>
<h2>User-Defined Types.</h2>
<p>The netcdf-4 user-defined type constructors are enumeration,
compound, opaque, and vlen. Of these, the most problematic is vlen
because of its varying length. Without it, the others would all be
fixed size and could be implemented. In fact the v2 Zarr specification
does provide for compound types, but we choose to wait for v3
before implementing it.</p>
<h2>Unlimited Dimension Size</h2>
<p>The netcdf-4 notion of unlimited allows for the definition of
a dimension whose size is known at any given point in time, but
whose size can vary over time. It is still the case that
all references to it are required to have the same size and this
can cause some difficulties at the storage level where it can introduce
undefined values into existing variables.</p>
<h1>Appendix A. Supporting Lexical Tokens <a name="nczarr_lexemes"></a></h1>
<p>In order to completely interpret the above data model,
a number of supporting lexical definitions are required
and are described here.</p>
<pre><code>NAME: IDCHAR+
FQN: ([/])|([/](IDCHAR)+)+
SIZE: DIGITS // Non-negative integer
ZARRVERSION: DIGITS '.' DIGITS '.' DIGITS
// Type Lexemes
BYTE_T: 'byte'
UBYTE_T: 'ubyte'
SHORT_T: 'short'
USHORT_T: 'ushort'
INT_T: 'int'
UINT_T: 'uint'
INT64_T: 'int64'
UINT64_T: 'uint64'
CHAR_T: 'char'
// Exact form is as usual, but will leave out for now
CONSTANT: INTEGER | UNSIGNED | FLOAT | CHAR;
fragment DIGITS: ['0'-'9']+
fragment UTF8: // Assume base character set is UTF8
fragment ASCII: [0-9a-zA-Z !#$%()*+:;<=>?@\[\]\\^_`|{}~] // Printable ASCII
fragment IDCHAR: (IDASCII|UTF8)
fragment IDASCII: [0-9a-zA-Z!#$%()*+:;<=>?@\[\]^_`|{}~] | '\\\\' | '\\/' | '\\ '
</code></pre>
<p>A NAME consists of a sequence of any legal non-control UTF-8 characters. A control character is any UTF-8 character in the inclusive range 0x00 — 0x1F.</p>
<h2>Fully Qualified Names <a name="nczarr_fqn"></a></h2>
<p>Every dimension and variable in a NCZarr Dataset has a Fully Qualified Name
(FQN), which provides a way to unambiguously reference it
in a dataset. Currently, the only case where this
is used is for referencing named dimensions from within
variable declarations.</p>
<p>These FQNs follow the common conventions of names for lexically
scoped identifiers. In NCZarr scoping is provided by Groups
(and the group subtype <em>dataset</em>).
Just as with hierarchical file
systems or variables in many programming languages, a simple
grammar formally defines how the names are built using the names
of the FQN's components (see lexical grammar above).</p>
<p>The FQN for a "top-level" variable or dimension is defined purely by
the sequence of enclosing groups plus the variable's simple
name.</p>
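<p>For example, a dimension <code>d1</code> declared in a group <code>g2</code> that is itself nested in a top-level group <code>g1</code> would have the FQN <code>/g1/g2/d1</code>, while a dimension declared directly in the root dataset would have an FQN such as <code>/time</code>.</p>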
<p>Notes:</p>
<ol>
<li>Every dataset has a single outermost <em>dataset</em> node
which, semantically, acts like the root group.
Whatever name that dataset has is ignored for the purposes of forming the FQN; instead it is treated as if it had the empty name ("").</li>
<li>There is no limit to the nesting of groups.</li>
</ol>
<p>The character "/" has special meaning in the context of a fully qualified name. This means that if a name added to an FQN contains this character, then that character must be specially escaped so that it will not be misinterpreted. The escape character itself must also be escaped, as must a blank.</p>
<p>The defined escapes are as follows.</p>
<table border=1 width="25%">
<tr><th>Character<th>Escaped Form
<tr><th>/<th>\/
<tr><th>\<th>\\
<tr><th>blank <th>\blank
</table>
<h1>Appendix B. Supplementary Material <a name="nczarr_supplement"></a></h1>
<h2>Specifying Context-Sensitive Elements <a name="nczarr_csensitive_spec"></a></h2>
<p>The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY" and "OPTIONAL" in this document are to be interpreted as described in RFC 2119.</p>
<h1>Appendix C. Complete Version of the Abstract Representation Specification <a name="nczarr_complete_spec"></a></h1>
<p>This is the complete Antlr specification in a form that can
be processed by Antlr.</p>
<pre><code>grammar z ;
dataset : NAME ZARRVERSION (dimension | variable | attribute | group)* ;
group: NAME (dimension | variable | attribute | group)* ;
attribute : NAME value_type (CONSTANT)+ ;
dimension: NAME SIZE ;
variable: NAME type (dimref)* (attribute)* ;
dimref: SIZE | FQN ;
type: atomic_type ;
atomic_type: fixed_atomic_type | char_type ;
fixed_atomic_type:
BYTE_T // A signed 8 bit integer
| UBYTE_T // An unsigned 8 bit integer
| SHORT_T // A signed 16 bit integer
| USHORT_T // An unsigned 16 bit integer
| INT_T // A signed 32 bit integer
| UINT_T // An unsigned 32 bit integer
| INT64_T // A signed 64 bit integer
| UINT64_T // An unsigned 64 bit integer
;
char_type: CHAR_T ;
// Lexemes
NAME: IDCHAR+ ;
FQN: ([/])|([/](IDCHAR)+)+ ;
SIZE: DIGITS ; // Non-negative integer ;
ZARRVERSION: DIGITS '.' DIGITS '.' DIGITS ;
// Type Lexemes
BYTE_T: 'byte' ;
UBYTE_T: 'ubyte' ;
SHORT_T: 'short' ;
USHORT_T: 'ushort' ;
INT_T: 'int' ;
UINT_T: 'uint' ;
INT64_T: 'int64' ;
UINT64_T: 'uint64' ;
CHAR_T: 'char' ;
// Exact form is as usual, but will leave out for now
CONSTANT: INTEGER | UNSIGNED | FLOAT | CHAR ;
fragment INTEGER: [+-]?DIGITS ;
fragment UNSIGNED: DIGITS ;
fragment FLOAT: [+-]?DIGITS '.' DIGITS ;
fragment STRING: '"' ~["] '"' ;
fragment DIGITS: [0-9]+ ;
fragment UTF8: ASCII ; // Assume base character set is UTF8 ;
fragment IDCHAR: (IDASCII|UTF8) ;
fragment IDASCII: [0-9a-zA-Z]|[!#$%()*+:;<=>?@]|'['|']'|'\\'|[^_`|{}~]
|'\\\\'|'\\/'|'\\ ' ;
fragment ASCII: [0-9a-zA-Z]|[ !#$%()*+:;<=>?@]|'['|']'|'\\'|[^_`|{}~] ; // Printable ASCII
</code></pre>
<h1>References</h1>
<p>[1] https://www.antlr.org/</p>
<h1>Copyright</h1>
<p>Copyright 2018, UCAR/Unidata<br>
See netcdf/COPYRIGHT file for copying and redistribution conditions.</p>
<h1>Point of Contact</h1>
<p><strong>Author</strong>: Dennis Heimbigner<br>
<strong>Email</strong>: dmh at ucar dot edu<br>
<strong>Initial Version</strong>: 11/28/2018<br>
<strong>Last Revised</strong>: 07/2/2019</p>
https://www.unidata.ucar.edu/blogs/developer/entry/nczarr-overviewNCZarr OverviewDennis Heimbigner 2019-07-02T12:54:34-06:002019-07-31T10:52:09-06:00<p>The Unidata NetCDF group is proposing to provide access to cloud
storage (e.g. Amazon S3) by providing a mapping from a subset of
the full netCDF Enhanced (aka netCDF-4) data model to one or
more existing data models that already have mappings to
key-value pair cloud storage systems.</p>
<p>The initial target is to map that subset of netCDF-4 to the Zarr
data model [1]. As part of that effort, we intend to produce a
set of related documents that provide a semi-formal definition
of the following.</p>
<ol>
<li>A description of the initial NCZarr data model.</li>
<li>A description of the subset of the netCDF API that conforms
to the NCZARR data model. This interface will be the basis
for programmatically reading and writing cloud data via the
netcdf-c library.</li>
<li>A mapping of the NCZarr data model to some variant of the
Zarr storage representation. This representation is a
combination of a mapping to Json plus a mapping to an
abstract key-value pair interface.</li>
<li>The internal architecture of the cloud support in the netcdf-c
library.</li>
<li>Any other documents required in support of the preceding documents
(the chunking algorithm documents, for example).</li>
</ol>
<p>The term "semi-formal" is used because rather than provide
complete mathematical or operational semantics, prose text will
be used to describe the context-sensitive features of the model.
A complete formalization in order to produce an operationally defined
specification is a possible future activity.</p>
<h2>References</h2>
<p>[1] Zarr storage specification version 2 (https://zarr.readthedocs.io/en/stable/spec/v2.html)</p>
<h2>Copyright</h2>
<p>Copyright 2018, UCAR/Unidata<br>
See netcdf/COPYRIGHT file for copying and redistribution conditions.</p>
<h2>Point of Contact</h2>
<p><strong>Author</strong>: Dennis Heimbigner<br>
<strong>Email</strong>: dmh at ucar dot edu<br>
<strong>Initial Version</strong>: 11/28/2018<br>
<strong>Last Revised</strong>: 7/2/2019</p>
https://www.unidata.ucar.edu/blogs/developer/entry/chunking-algorithms-for-netcdf-cChunking Algorithms for NetCDF-CDennis Heimbigner 2019-05-22T14:14:54-06:002019-05-22T14:14:54-06:00<p>Unidata is in the process of developing a Zarr [] based variant
of netcdf. As part of this effort, it was necessary to
implement some support for chunking. Specifically, the problem
to be solved was that of extracting a hyperslab of data from an
n-dimensional variable (array in Zarr parlance) that has been divided
into chunks (in the HDF5 sense). Each chunk is stored independently
in the data storage -- Amazon S3, for example.</p>
<p>The algorithm takes a series of R slices of the form (first,stop,stride),
where R is the rank of the variable. Note that a slice of the form
(first, count, stride), as used by netcdf, is equivalent because
stop = first + count*stride. These slices form a hyperslab.</p>
<p>The goal is to compute the set of chunks that intersect the hyperslab
and to then extract the relevant data from that set of chunks to
produce the hyperslab.</p>
<h1>Introduction</h1>
<p>It appears from web searches that this algorithm is nowhere documented
in the form of a high level pseudo code algorithm. It appears to only
exist in the form of code in HDF5, Zarr, and probably TileDB, and maybe
elsewhere.</p>
<p>What follows is an attempt to reverse engineer the algorithm used by
the Zarr code to compute this intersection. This is intended to be
the bare-bones algorithm with no optimizations included. Thus, its
performance is probably not the best, but it should work in all cases.
Some understanding of how HDF5 chunking is used is probably essential
for understanding this algorithm.</p>
<p>The original python code relies heavily on the use of Python
iterators, which might be considered a mistake. Iterators
are generally most useful in two situations: (1) it makes
the code clearer -- arguably false in this case, and (2) there is a reasonable
probability that the iterators will be terminated before they end, thus
improving efficiency. This is demonstrably false for the Zarr code.</p>
<p>This code instead uses the concept of an <em>odometer</em>,
which is a way to iterate over all the elements of an
n-dimensional object. Odometer code already exists in several
places in the existing netcdf-c library, among them
<em>ncdump/nciter.c</em>, <em>ncgen/odom.c</em>, and <em>libdap4/d4odom.c</em>. It
is also equivalent to the Python <em>itertools.product</em> iterator
function.</p>
<p>This algorithm assumes the following items of input:</p>
<ol>
<li>variable (aka array) - multidimensional array of data values</li>
<li>dimensions - the vector of dimensions defining the rank of the array;
this set comes from the dimensions associated with the array;
so given v(d1,d2), where (in netcdf terms) d1 and d2 are the
dimension names and where, for example, d1=10, d2=20,
the dimensions given to this algorithm are the ordered vector (d1,d2),
or equivalently (10,20).</li>
<li>Chunk sizes - for each dimension in the dimension set, there is defined
a chunk size along that dimension. Thus, there is also an ordered vector
of chunk sizes, (c1,c2), corresponding to the dimension vector (d1,d2).</li>
<li>slices - a vector of slice instances defining the subset of data to be
extracted from the array. A single slice as used here is of the form
(start,stop,stride), where start is the starting position with respect to
a dimension, stop is the last position + 1 to extract, and stride is
the number of positions to skip when extracting data. Note that this
is different than the netcdf-c nc_get_vars slices, which are of the form
(start,count,stride). The two are equivalent since stop = start+(count*stride).
When extracting data from a variable of rank R, one needs to specify R slices,
where each slice corresponds to a dimension of the variable.</li>
</ol>
<p>At a high-level, the algorithm works by separately analyzing
each dimension of the array using the corresponding slice and
corresponding chunk size. The result is a set of <em>projections</em>
specific to each dimension. By taking the cross-product of these
projections, one gets a vector of projection-vectors that can be
evaluated to extract a subset of the desired data for storage in
an output array.</p>
<p>It is important to note that this algorithm operates
in two phases. In the first phase, it constructs the projections
for each dimension. In the second phase, these projections
are combined as a cross-product to provide subsetting for the
true chunks, which are R-dimensional rectangles.</p>
<p>What follows is the algorithm, written as a set of pseudo-code procedures
that produce the final output given the above inputs.</p>
<h2>Notations:</h2>
<ul>
<li>floordiv(x,y) = floor((x / y))</li>
<li>ceildiv(x,y) = ceil((x / y))</li>
</ul>
<h2>Notes:</h2>
<ul>
<li>The ith slice is matched to the ith dimension for the variable</li>
<li>The ith slice is matched to the ith chunksize for the variable</li>
<li>The zarr code uses iterators, but this code converts to using
vectors and odometers for (one hopes) some clarity and consistency
with existing netcdf-c code.</li>
</ul>
<h2>Global Type Declarations</h2>
<pre><code>class Slice {
int start
int stop
int stride
}
class SliceIndex { // taken from zarr code see SliceDimIndexer
int chunk0 // index of the first chunk touched by the slice
int nchunks // number of chunks touched by this slice index
    int count; // total number of output items defined by this slice index
Projection projections[nchunks]; // There are multiple projections
// derived from the original slice:
// one for each chunk touched by
// the original slice
}
class Projection {
int chunkindex;
Slice slice; // slice specific to this chunk
int outpos; // start point in the output to store the extracted data
}
</code></pre>
<h2>Global variables</h2>
<p>In order to keep argument lists short, certain values are
assumed to be globally defined and accessible.</p>
<ul>
<li>R - the rank of the variable</li>
<li>dimlen[R] - the length of the dimensions associated with the variable</li>
<li>chunklen[R] - the length of the chunk sizes associated with the variable</li>
<li>int zeros[R] - a vector of zeros</li>
<li>int ones[R] - a vector of ones</li>
</ul>
<h2>Procedure EvaluateSlices</h2>
<pre><code>// Goal: Given the projections for each slice being applied to the
// variable, create and walk all possible combinations of projection
// vectors that can be evaluated to provide the output data
void EvaluateSlices(
Slice slices[R], // the complete set of slices
T output // the target storage for the extracted data: its type is T
)
{
int i;
SliceIndex allsliceindices[R];
Odometer odometer;
    int nchunks[R]; // the vector of nchunks from projections
    int chunkindices[R];
// Compute the slice index vector
allsliceindices = compute_all_slice_indices(slices);
// Extract the chunk0 and nchunks vectors
for(i=0;i<R;i++) {
nchunks[i] = allsliceindices[i].nchunks;
}
// Create an odometer to walk nchunk combinations
odometer = Odometer.new(R,zeros,nchunks); // iterate each "wheel[i]" over 0..nchunk[i] with R wheels
// iterate over the odometer: all combination of chunk indices in the projections
for(;odometer.more();odometer.next()) {
chunkindices = odometer.indices();
ApplyChunkIndices(chunkindices,output,allsliceindices);
}
}
</code></pre>
<h2>Procedure ApplyChunkIndices</h2>
<pre><code>// Goal: given a vector of chunk indices from projections,
// extract the corresponding data and store it into the
// output target
void ApplyChunkIndices(
int chunkindices[R], // indices chosen by the parent odometer
T output, // the target storage for the extracted data
SliceIndex allsliceindices[R]
)
{
    int i;
    SliceIndex subsliceindices[R];
    int chunk0[R]; // the vector of chunk0 values from projections
    Projection projections[R];
    Slice slices[R]; // the per-chunk slices taken from the projections
    int[R] outpos; // capture the outpos values across the projections
    int outputstart;
    // This is complicated. We need to construct a vector of slices
    // of size R where the ith slice is determined from a projection
    // for the ith chunk index of chunkindices. We then iterate over
    // that odometer to extract values and store them in the output.
    for(i=0;i<R;i++) {
        int chunkindex = chunkindices[i];
        projections[i] = allsliceindices[i].projections[chunkindex];
        slices[i] = projections[i].slice;
        outpos[i] = projections[i].outpos;
    }
// Compute where the extracted data will go in the output vector
outputstart = computelinearoffset(R,outpos,dimlen);
GetData(slices,outputstart,output);
}
</code></pre>
<h2>Procedure GetData</h2>
<pre><code>// Goal: given a set of indices pointing to projections,
// extract the corresponding data and store it into the
// output target.
void GetData(
Slice slices[R],
int chunksize, // total # T instances in chunk
T chunk[chunksize],
int outputstart,
T output
)
{
int i;
    Odometer sliceodom;
    sliceodom = Odometer.new(R, slices);
    // iterate over the odometer to get a point in the chunk space
    for(;sliceodom.more();sliceodom.next()) {
        int chunkpos[R] = sliceodom.indices(); // point in the chunk space to copy to the output
}
}
</code></pre>
<h2>Procedure compute_all_slice_projections</h2>
<pre><code>// Goal: create a vector of SliceIndex instances: one for each slice in the top-level input
SliceIndex[R]
compute_all_slice_projections(
    Slice slice[R] // the complete set of slices
)
{
    int i;
    SliceIndex projections[R];
    for(i=0;i<R;i++) {
        projections[i] = compute_perslice_projections(dimlen[i],chunklen[i],slice[i]);
    }
    return projections;
}
</code></pre>
<h2>Procedure compute_perslice_projections</h2>
<h3>Goal:</h3>
<p>For each slice, compute a set of projections from it wrt a
dimension and a chunk size associated with that dimension.</p>
<h3>Inputs:</h3>
<ul>
<li>dimlen -- dimension length</li>
<li>chunklen -- chunk length associated with the input dimension</li>
<li>slice=(start,stop,stride) -- associated with the dimension</li>
</ul>
<h3>Outputs:</h3>
<ul>
<li>Instance of SliceIndex</li>
</ul>
<h3>Computations:</h3>
<ul>
<li>count = max(0, ceildiv((stop - start), stride))
<ul><li>total number of output items defined by this slice (equivalent to count as used by nc_get_vars)</li></ul></li>
<li>nchunks = ceildiv(dim_len, dim_chunk_len)
<ul><li>number of chunks touched by this slice</li></ul></li>
<li>chunk0 = floordiv(start,chunklen)
<ul><li>index (in 0..nchunks-1) of the first chunk touched by the slice</li></ul></li>
<li>chunkn = ceildiv(stop,chunklen)
<ul><li>index (in 0..nchunks-1) of the last chunk touched by the slice</li></ul></li>
<li>n = ((chunkn - chunk0) + 1)
<ul><li>total number of touched chunks</li>
<li>the index i will range over 0..(n-1)</li>
<li>is this value the same as nchunks?</li></ul></li>
<li>For each touched chunk index we compute a projection specific to that chunk, hence
there are n of them.</li>
<li>projections.index[i] = i</li>
<li>projections.offset[i] = chunk0 * i
<ul><li>remember: offset is only WRT this dimension, not global</li></ul></li>
<li>projections.limit[i] = min(dimlen, (i + 1) * chunklen)
<ul><li>end of this chunk but no greater than dimlen</li></ul></li>
<li>projections.len[i] = projections.limit[i] - projections.offset[i]
<ul><li>actual limit of the ith touched chunk; should be same as chunklen except for last length because of the min function in computing limit[i]</li></ul></li>
<li>projections.start[i]:
<ul><li>This is somewhat complex because for the first projection, the start is the slice start,
but after that, we have to take into account that for a non-one stride, the start point
in a projection may be offset by some value in the range of 0..(stride-1)</li>
<li>i == 0 => projections.start[i] = start - projections.offset[i]
<ul><li>initial case the original slice start is within the first projection</li></ul></li>
<li>i > 0 => projections.start[i] = start - projections.offset[i]
<ul><li>prevunused[i] = (projections.offset[i] - start) % stride
<ul><li>prevunused[i] is an intermediate computation and need not be saved</li>
<li>amount unused in previous chunk => we need to skip (stride-prevunused[i]) in this chunk</li></ul></li>
<li>prevunused[i] > 0 => projections.start[i] = stride - prevunused[i]</li></ul></li></ul></li>
<li>projections.stop[i]:
<ul><li>stop > projections.limit[i] => projections.stop[i] = projections.len[i]</li>
<li>stop <= projections.limit[i] => projections.stop[i] = stop - projections.offset[i]
<ul><li>selection ends within current chunk</li></ul></li></ul></li>
<li>projections.outpos[i] = ceildiv(offset - start, stride)
<ul><li>"location" in the output array to start storing items; again, per slice, not global</li></ul></li>
</ul>
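<p>As a concrete (hypothetical) illustration of some of the quantities above, take dimlen = 10, chunklen = 4, and slice = (start=2, stop=9, stride=2), which selects indices 2, 4, 6, and 8:</p>
<ul>
<li>count = max(0, ceildiv(9 - 2, 2)) = 4</li>
<li>nchunks = ceildiv(10, 4) = 3</li>
<li>chunk0 = floordiv(2, 4) = 0</li>
</ul>
<p>The selected indices fall into chunk 0 ({2}), chunk 1 ({4, 6}), and chunk 2 ({8}), so a projection is constructed for each of the three touched chunks.</p>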
<h2>Procedure computelinearoffset(outpos,dimlen);</h2>
<p>Goal: Given a set of per-dimension indices, compute the corresponding linear position.</p>
<pre><code>int
computelinearoffset(int R,
int outpos[R],
int dimlen[R]
)
{
int offset;
int i;
offset = 0;
for(i=0;i<R;i++) {
offset *= dimlen[i];
offset += outpos[i];
}
return offset;
}
</code></pre>
<h1>Appendix: Odometer Code</h1>
<pre><code>class Odometer
{
int R; // rank
int start[R];
    int stop[R];
int stride[R];
int index[R]; // current value of the odometer
procedure new(int R, int start[R], int stop[R]) { return new(R, start,stop,ones);}
procedure new(int rank, Slice slices[R])
{
        int i;
        R = rank;
        for(i=0;i<R;i++) {
            start[i] = slices[i].start;
            stop[i] = slices[i].stop;
            stride[i] = slices[i].stride;
        }
for(i=0;i<R;i++) {index[i] = start[i];}
}
boolean
procedure more(void)
{
return (index[0] < stop[0]);
}
procedure next(void)
{
int i;
for(i=R-1;i>=0;i--) {
index[i] += stride[i];
if(index[i] < stop[i]) break;
if(i == 0) break; // leave the 0th entry if it overflows
index[i] = start[i]; // reset this position
}
}
// Get the value of the odometer
int[R]
procedure indices(void)
{
        return index;
}
}
</code></pre>
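<p>For readers who want something compilable, here is a minimal standalone C sketch of the same odometer idea for a fixed rank of 2; the start/stop/stride values are illustrative only.</p>
<pre><code>#include <stdio.h>

#define R 2 /* rank, fixed for this illustration */

int main(void) {
    int start[R]  = {0, 1};
    int stop[R]   = {4, 7};
    int stride[R] = {2, 3};
    int index[R];
    int i;
    for (i = 0; i < R; i++) index[i] = start[i];
    while (index[0] < stop[0]) {           /* "more" test */
        printf("(%d,%d)\n", index[0], index[1]);
        for (i = R - 1; i >= 0; i--) {     /* "next": increment with carry */
            index[i] += stride[i];
            if (index[i] < stop[i]) break; /* no carry needed */
            if (i == 0) break;             /* let the 0th wheel overflow to terminate */
            index[i] = start[i];           /* reset this wheel and carry left */
        }
    }
    return 0;
}
</code></pre>
<p>This prints (0,1), (0,4), (2,1), (2,4), i.e. every combination of the per-dimension index sequences.</p>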