Re: Meeting about improving the GRD API.

Hello all, comments are in-line:

Ian Barrodale wrote:
Hi Ted, John, Russ, and John:

Thank you all for taking the time yesterday to both listen to our story and to further enlighten us about your work. It was much appreciated.

The note below provides a possible implementation route, and some questions. Please feel free to point out any shortcomings in our proposed approach, and please provide any answers that come to mind regarding our questions.

Thanks again,
Ian
======================
Goal
-------

Based on feedback from BCS Grid DataBlade customers and, in particular, Ted Habermann, we feel that there may be some value in providing alternate ways of accessing data from a Grid DataBlade (GRD)-powered database through existing widely-used protocols and methods. Note that by "accessing", we really mean just the reading part, as we already provide, through the BCS Gridded Data Loader client, a means of conveniently ingesting data from many forms into a GRD-powered database. One method of accessing the data would be to cast it in the form of the Common Data Model (CDM) supported by the Java netCDF API from UCAR. The advantages of this are that:

    * users would be able to write software using the Java netCDF API
      (which is fairly straightforward to use and well documented) for
      accessing GRD data, and
    * data providers can use a GRD-powered database and provide access
      to it through OPeNDAP, WCS, netCDF files, etc. using the Java
      netCDF API (see page 53 attachment, modified from the slide on
      page 53 of
http://www.unidata.ucar.edu/staff/caron/presentations/CDM.ppt).
Our understanding of a possible implementation
---------------------------------------------------------------------

To handle GRD data from the Java netCDF API, we would have to:

(i) Create a GRD I/O service provider for the Java netCDF API (see page 38 attachment) that can communicate with the GRD database using a combination of JDBC and the existing Java GRD API. The Java netCDF API uses a service provider architecture to handle reading multiple different file formats and casting them in the form of the CDM.

(ii) Create a GRD content manager to handle the georeferencing information in the GRD.

One possible method for allowing users to access GRD data without a full THREDDS catalog is to supply some type of unique URL to the database:

  grd://user:pass@server/database

and the service provider would construct a CDM instance that contains a main group of all the grids in the database and allow the user to access those grids through the API.
For example:

  grd://peter:test123@xxxxxxxxxxxxxxxxxx/coastwatch

might be a reference to a GRD database running at Barrodale that contains gridded NOAA CoastWatch satellite-derived data for some number of geographic areas and time periods. The resulting netCDF dataset would be one that contains a list of grids under a root group like a directory structure:

  /
  /sst/
  /sst/northeast/
  /sst/northeast/jan01_2007    <---- a grid
  /sst/northeast/jan02_2007    <---- another grid
  ...
  /chlorophyll/northeast/jan01_2007   <---- a third grid
  /chlorophyll/northeast/jan02_2007   <---- and so on
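
A locator of this form can be parsed with standard Java URI handling before being handed to the service provider. A minimal sketch, assuming the pieces are user, password, host, and database name (the GrdLocator class and the example host name are hypothetical; the real server name is elided above):

```java
import java.net.URI;
import java.net.URISyntaxException;

public class GrdLocator {
    // java.net.URI parses custom schemes like grd:// without complaint,
    // so the service provider can pull the pieces apart directly.
    public static String[] parse(String location) {
        try {
            URI uri = new URI(location);
            if (!"grd".equals(uri.getScheme()))
                throw new IllegalArgumentException("not a grd:// locator: " + location);
            String[] userPass = uri.getUserInfo().split(":", 2); // "user:pass"
            String database = uri.getPath().substring(1);        // strip leading "/"
            return new String[] { userPass[0], userPass[1], uri.getHost(), database };
        } catch (URISyntaxException e) {
            throw new IllegalArgumentException(e);
        }
    }

    public static void main(String[] args) {
        // Hypothetical host name, standing in for the elided server above.
        String[] parts = parse("grd://peter:test123@grd.example.com/coastwatch");
        System.out.println(String.join(" | ", parts));
        // prints: peter | test123 | grd.example.com | coastwatch
    }
}
```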

Whether users would require a more sophisticated catalog with querying ability, such as THREDDS could supply, depends on the desired complexity of the grids in the database.

See the last answer below.

BTW, the TDS will soon have the ability to do proper HTTP-based authentication, 
and we are hoping to make that a standard in OPeNDAP clients, which can act 
like browsers and pop up a username/password dialog window, instead of 
embedding the user:pass@ in the URL.


Questions
---------------

We have the following questions:

1) Where in the netCDF API would the content manager that handles GRD georeferencing information sit?

2) How does the I/O SP architecture determine the I/O SP for a given file:// style URL? How would it know to handle a grd:// URL differently?

Very perceptive questions; let me answer these two together:

The IOSP architecture is, in fact, file based (on RandomAccessFile). Since you 
will be URL based, we have to fit you in at a higher level, namely 
NetcdfDataset.openFile(). If you look there you will see that we look for 
opendap (http: or dods:) and thredds: URLs. It might make sense to generalize 
this to allow plugging in external handlers for your protocol, similar to how 
java.net.ContentHandler works. Otherwise we might put your code in the core, 
which is also a possibility.

Anyway, NetcdfDataset.openFile() would detect your URL scheme and call 
NetcdfFile with your IOSP. We will have to add a new constructor for that. (You 
could alternatively just subclass NetcdfFile, which is what DODSNetcdfFile does).
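
The scheme detection described above can be sketched in plain Java. This is a simplified, self-contained illustration of the idea, not the actual NetcdfDataset code; the Opener interface and handler registry are invented for this sketch:

```java
import java.util.HashMap;
import java.util.Map;

public class SchemeDispatch {
    // Hypothetical stand-in for an IOSP or opener tied to a URL scheme.
    public interface Opener { String open(String location); }

    private static final Map<String, Opener> handlers = new HashMap<>();

    public static void register(String scheme, Opener o) { handlers.put(scheme, o); }

    // Mimics NetcdfDataset.openFile() looking for known prefixes
    // (http:, dods:, thredds:, and here grd:) before falling back
    // to the RandomAccessFile-based IOSP machinery.
    public static String openFile(String location) {
        for (Map.Entry<String, Opener> e : handlers.entrySet())
            if (location.startsWith(e.getKey() + ":"))
                return e.getValue().open(location);
        return "default RandomAccessFile path for " + location;
    }

    public static void main(String[] args) {
        register("grd", loc -> "GRD IOSP handles " + loc);
        System.out.println(openFile("grd://peter:test123@server/coastwatch"));
        // prints: GRD IOSP handles grd://peter:test123@server/coastwatch
    }
}
```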

As for the "content manager that handles GRD georeferencing information": it could be a 
CoordSysBuilder subclass. However, this is actually unnecessary if you use an existing Convention, and we 
would highly recommend using the CF Convention for gridded data. Since you are creating the "file", 
you can add the attributes and variables needed by that Convention. This makes your data "CF 
compliant" automatically, which is a real win.


3) Have we interpreted the slide on page 53 correctly -- is there a server that can serve out data using the CDM (via the Java netCDF API) as an intermediate step?

Yes, the THREDDS Data Server.


4) Does a group structure to represent GRD contents map to an OPeNDAP connection, WCS, or netCDF file or do those types of data representations only have netCDF variables and no groups?

In principle you could use Groups, but they really won't be fully supported 
until we get the netCDF-4 file format finished and tested. I would advise 
starting with the simpler case of no groups.
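
Following that advice, the group-style paths in the listing above could simply be flattened into plain variable names (or used to name separate datasets). A trivial sketch; the underscore-joining convention is our own invention, not an existing standard:

```java
public class FlattenPath {
    // Turn a group-style path like /sst/northeast/jan01_2007 into a flat
    // variable name usable in a group-less (classic netCDF) dataset.
    public static String flatten(String path) {
        return path.replaceAll("^/", "").replace('/', '_');
    }

    public static void main(String[] args) {
        System.out.println(flatten("/sst/northeast/jan01_2007"));
        // prints: sst_northeast_jan01_2007
    }
}
```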


5) Our understanding of the netCDF Java library is that it has, in particular, the following two entry points:

    * NetcdfFile : this is the bare netCDF access to files of various
      types. It doesn't understand anything about coordinate systems.
      You can add an I/O service provider to handle your favorite file
      format via a class method. The variables it returns are instances
      of Variable (which of course don't know anything about coordinate
      systems).
    * NetcdfDataset : this is a layer built above the NetcdfFile layer
      and is the usual interface for applications (e.g., a WCS). It
      handles converting various attributes into a coordinate system. It
      has a number of methods relating to adding or getting coordinate
      systems. These methods seem to be applied to the entire file,
rather than to individual variables (or groups).

Coordinate systems are really variable-specific. However, the common case is 
that each dataset has a single coordinate system (or a set of closely related 
ones).



    CoordinateSystem findCoordinateSystem(java.lang.String name)
        // Retrieve the CoordinateSystem with the specified name.
<http://www.unidata.ucar.edu/software/netcdf-java/v2.2.18/javadoc/ucar/nc2/dataset/NetcdfDataset.html#findCoordinateSystem%28java.lang.String%29>

    java.util.List getCoordinateAxes()
        // Get the list of all CoordinateAxis objects used by this dataset.
<http://www.unidata.ucar.edu/software/netcdf-java/v2.2.18/javadoc/ucar/nc2/dataset/NetcdfDataset.html#getCoordinateAxes%28%29>

    java.util.List getCoordinateTransforms()
        // Get the list of all CoordinateTransform objects used by this dataset.
<http://www.unidata.ucar.edu/software/netcdf-java/v2.2.18/javadoc/ucar/nc2/dataset/NetcdfDataset.html#getCoordinateTransforms%28%29>

    boolean getCoordSysWereAdded()
        // Has Coordinate System metadata been added?
<http://www.unidata.ucar.edu/software/netcdf-java/v2.2.18/javadoc/ucar/nc2/dataset/NetcdfDataset.html#getCoordSysWereAdded%28%29>

The NetcdfDataset object contains instances of VariableDS. They are like a wrapper for the Variable objects found in the NetcdfFile object. There is a method to ask a VariableDS for the list of coordinate systems associated with it.

Exactly.


If we interpret things correctly, when a NetcdfDataset object is built from a NetcdfFile object, the NetcdfDataset object is responsible for figuring out the coordinate system information from attributes in the NetcdfFile, and composing a VariableDS from the coordinate system information and each Variable. In theory, by implementing our own CoordSysBuilder class and registering it, we should be able to add coordinate system information to each VariableDS individually.

Yes, or as I mentioned, use an existing Convention and CoordSysBuilder.


A question then is: do applications like the web coverage server and OPeNDAP server get their coordinate information from VariableDS objects or from the NetcdfDataset object?


OPeNDAP is (more or less) at the same level as NetcdfFile, and so just 
faithfully transmits Variables, Attributes, and Dimensions across the wire. The 
coordinate systems are then added by clients (like the CDM) that understand the 
convention. We are expecting that DAP4, the future OPeNDAP protocol, will add 
Groups.

WCS, OTOH, works at the coordinate system level, and so uses the GridDatatype, which is specialized for "coverage" data, and gets its coordinate systems from NetcdfDataset. The client makes requests in coordinate space, and we know how to translate that into index space. Currently we can send back either GeoTIFF or netCDF/CF files. There are some limitations: the grid spacing must be uniform in WCS 1.0. We expect to move to WCS 1.1 later this year, which removes that limitation. We haven't implemented reprojection/resampling, and I'm not sure that we will.

If it is from the NetcdfDataset object, then the strategy of grouping all the grids in a database into a single NetcdfDataset, as outlined above, won't work, and we'd be obliged to use a THREDDS server. Is this correct?

It would likely be a mistake to put a lot of disparate data into the same 
NetcdfDataset. Better to find the right granularity, which is typically 
homogeneous data that shares the same discovery metadata. So I would not use 
the Group mechanism to break the data into granules; better to make separate 
datasets. It's possible that such an idiom will develop with netCDF-4, but 
better to get something working that stays within existing practice, and then 
decide if you want to forge ahead. Let me emphasize that it's really important 
to find the right dataset granularity.

This means you want to use THREDDS catalogs to publish the dataset URLs and 
associated metadata, and possibly use the TDS to serve your data. Once you have 
an IOSP or equivalent for your data, the main work is to develop the catalogs. 
These can be pretty minimal, but automatically populating catalogs with 
high-quality metadata is a huge win in the long run.

I think that would be a powerful value-added product, but of course I don't know 
what your customers really want. As Ted mentioned, it's a good time to help 
influence TDS strategy, and it appears to me that your small company with 
extensive scientific experience would be a good fit with Unidata.

John