Re: [thredds] Station Data Subset Service vs. OPeNDAP constraint expressions

On 10/13/2010 5:00 PM, Bob Simons wrote:
John,

I read about the new Station Data Subset Service (I'll call it SDSS in this 
email)
http://www.unidata.ucar.edu/projects/THREDDS/tech/interfaceSpec/StationDataSubsetService.html
(version 0.2), which lists you as the contact. I understand that the UAF group 
is considering using SDSS to deal with station data.

I noticed that SDSS queries are very similar to OPeNDAP constraint expression 
queries (http://www.opendap.org/user/guide-html/guide_33.html).
Yet SDSS seems limited to one type of dataset (stations with time, latitude, longitude, 
... data), because it uses specific variable names (e.g., stn, north, south, west, east, 
time) for the constraints, while OPeNDAP constraint expressions can be used with a much 
broader range of datasets (notably, any dataset that can be represented as a database-like 
table), because they aren't tied to any specific variable names. And OPeNDAP's bigger set of 
operators (=, <, <=, >, >=, !=, =~) can be applied to any variable, not just 
longitude/latitude/depth/time/stn.

The sample queries in the SDSS documentation can easily be converted to OPeNDAP 
constraint expression queries, for example:

SDSS: ?north=17.3&south=12.088&west=140.2&east=160.0
OPeNDAP: ?latitude<=17.3&latitude>=12.088&longitude>=140.2&longitude<=160.0

SDSS: ?stn=KDEN
OPeNDAP: ?stn="KDEN"

SDSS: ?stn=KDEN&stn=KPAL&stn=SDOL
OPeNDAP: ?stn=~"KDEN|KPAL|SDOL"
(=~ lets you specify a regular expression to be matched)

SDSS: ?time_start=2007-03-29T12:00:00Z&time_end=2007-03-29T13:00:00Z
OPeNDAP: ?time>="2007-03-29T12:00:00Z"&time<="2007-03-29T13:00:00Z"
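The conversions above are mechanical enough to sketch in a few lines. This is my own illustrative sketch (the function name and parameter layout are invented, not part of either spec); it assumes the dataset's variables really are named latitude, longitude, stn, and time:

```python
def sdss_to_opendap(north=None, south=None, west=None, east=None,
                    stations=None, time_start=None, time_end=None):
    """Translate SDSS-style query parameters into an OPeNDAP
    constraint expression. Assumes the dataset actually names its
    variables latitude, longitude, stn, and time."""
    parts = []
    if north is not None:
        parts.append(f"latitude<={north}")
    if south is not None:
        parts.append(f"latitude>={south}")
    if west is not None:
        parts.append(f"longitude>={west}")
    if east is not None:
        parts.append(f"longitude<={east}")
    if stations:
        if len(stations) == 1:
            parts.append(f'stn="{stations[0]}"')
        else:
            # =~ takes a regular expression, so OR the station ids
            parts.append('stn=~"' + "|".join(stations) + '"')
    if time_start is not None:
        parts.append(f'time>="{time_start}"')
    if time_end is not None:
        parts.append(f'time<="{time_end}"')
    return "?" + "&".join(parts)
```

For example, sdss_to_opendap(north=17.3, south=12.088, west=140.2, east=160.0) reproduces the first conversion above.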

SDSS's accept=mime_type could be mimicked by having the OPeNDAP server support 
file extensions in addition to .dods and .asc (or by some other means if 
necessary). And MIME types have a problem: two file types can share the same 
MIME type.

OPeNDAP's sequence data type is well-suited to this type of data query and to 
the API described at
http://www.unidata.ucar.edu/software/netcdf-java/reference/FeatureDatasets/PointFeatures.html

I have worked quite a lot with OPeNDAP constraint expressions and I have found 
them to be
* Very flexible (well-suited to a wide range of datasets and queries),
* Very easy for non-programmers to read, write, and understand,
* Easy to convert into queries for other types of data servers (e.g., SQL, SOS, 
OBIS),
* Easy for data servers to handle and optimize.
They are sort of like a nice subset of SQL with a really simple syntax.
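The "subset of SQL" resemblance can be made concrete. Here is a rough sketch of my own (the function name is invented) that turns a flat, AND-ed constraint expression into an SQL WHERE clause; it handles =~ only for a simple alternation like "A|B|C", not general regular expressions:

```python
import re

# Map OPeNDAP comparison operators to SQL equivalents.
_OPS = {"=": "=", "!=": "<>", "<=": "<=", ">=": ">=", "<": "<", ">": ">"}

def ce_to_sql_where(ce):
    """Convert a flat, AND-ed OPeNDAP constraint expression such as
    '?latitude<=17.3&stn="KDEN"' into an SQL WHERE clause.
    '=~' is supported only for a plain alternation like "A|B|C",
    which becomes an IN list."""
    clauses = []
    for term in ce.lstrip("?").split("&"):
        m = re.match(r'(\w+)(<=|>=|!=|=~|<|>|=)(.+)$', term)
        if not m:
            raise ValueError("unsupported term: " + term)
        name, op, value = m.groups()
        if op == "=~":
            alts = value.strip('"').split("|")
            quoted = ", ".join("'" + a + "'" for a in alts)
            clauses.append(name + " IN (" + quoted + ")")
        else:
            # OPeNDAP string literals use double quotes; SQL uses single
            sql_value = value.replace('"', "'")
            clauses.append(f"{name} {_OPS[op]} {sql_value}")
    return "WHERE " + " AND ".join(clauses)
```

So '?stn=~"KDEN|KPAL|SDOL"' becomes WHERE stn IN ('KDEN', 'KPAL', 'SDOL'), which is one reason these queries map so cleanly onto database-backed servers.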


All of this discussion leads up to this:
I'm very curious: why did you decide to define a new protocol instead of using 
the existing standard OPeNDAP constraint expression protocol? And/or, would you 
consider switching to the OPeNDAP constraint expression protocol?

Instead of creating a new service with one server implementation (THREDDS) and 
one client implementation (netcdf-java), switching to OPeNDAP constraint 
expressions would hook your service into the realm of other servers and clients 
that already support OPeNDAP constraint expressions.

And supporting OPeNDAP constraint expressions in THREDDS seems like a logical 
extension for a data server which already supports OPeNDAP grid/hyperslab 
queries.

I am very curious to hear your thoughts on this.

Thanks for considering this.


Sincerely,

Bob Simons
IT Specialist
Environmental Research Division
NOAA Southwest Fisheries Science Center
1352 Lighthouse Ave
Pacific Grove, CA 93950-2079
Phone: (831)658-3205
Fax: (831)648-8440
Email: bob.simons@xxxxxxxx

The contents of this message are mine personally and
do not necessarily reflect any position of the
Government or the National Oceanic and Atmospheric
Administration.
<>< <>< <>< <>< <>< <>< <>< <>< <><

Hi Bob:

The original motivation for the Netcdf Subset Service was to provide subsets of 
gridded data in netCDF-CF format. The subsetting request is specified in 
coordinate (lat/lon/alt/time) space, so that it can be done from a web form 
or from a simple wget script. The service has continued to evolve, and it's time 
to evaluate where it is and where it should go, so your question comparing it 
to OPeNDAP is timely.

Background

The NetCDF Subset Services (NCSS) are a family of experimental web protocols for making 
queries in coordinate space (rather than index space) against CDM "Feature 
Type" datasets; see:

http://www.unidata.ucar.edu/projects/THREDDS/tech/interfaceSpec/NetcdfSubsetService.html

Functionally, they are intended to be a simplified version of the OGC 
protocols, and are most directly an alternative to OGC web services. To 
support queries in coordinate space, the data model has to have a general 
notion of coordinates; in particular, the use case I want to cover is 
space/time subsetting. The data models of OPeNDAP, netCDF, and HDF5 handle 
coordinate systems only partially; see:

http://www.unidata.ucar.edu/software/netcdf-java/CoordinateSystemsNeeded.htm

This is one reason why the OGC protocols have the mind share that they do (plus 
lots of $$$ and commercial effort, etc.). It is also the reason that the CDM 
is an extension of OPeNDAP, netCDF, and HDF5, rather than just their union; see:

http://www.unidata.ucar.edu/software/netcdf-java/CDM/index.html

As I mentioned, NCSS is intended to return results in commonly used formats 
(netCDF, CSV, XML, etc.) that can be used directly in other applications, rather 
than requiring a smart client that can convert binary DODS objects.

OPeNDAP

To answer your specific questions:

Yet SDSS seems limited to one type of dataset (stations with time, latitude, longitude, 
... data), because it uses specific variable names (e.g., stn, north, south, west, east, 
time) for the constraints, while OPeNDAP constraint expressions can be used with a much 
broader range of datasets (notably, any dataset that can be represented as a database-like 
table), because they aren't tied to any specific variable names. And OPeNDAP's bigger set of 
operators (=, <, <=, >, >=, !=, =~) can be applied to any variable, not just 
longitude/latitude/depth/time/stn.

"stn, north, south, west, east, time" are not variable names; they are names 
for those semantic concepts, and they don't depend on those names being present in the 
dataset. In that sense they are more general than an OPeNDAP request, where you have to 
know the actual names of the variables.
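This distinction between semantic roles and variable names can be sketched concretely. A server can resolve "which variable is the latitude?" from metadata, for example CF-style standard_name and units attributes, rather than requiring the client to know the variable's name. The function and the dataset layout below are invented for illustration:

```python
def find_coordinate(variables, role):
    """Return the name of the variable playing the given semantic
    role ('latitude', 'longitude', or 'time'), using CF-style
    standard_name and units attributes rather than the variable name."""
    units_hint = {"latitude": "degrees_north", "longitude": "degrees_east"}
    for name, attrs in variables.items():
        if attrs.get("standard_name") == role:
            return name
        hint = units_hint.get(role)
        if hint is not None and attrs.get("units") == hint:
            return name
    raise KeyError("no variable found for role " + repr(role))

# A dataset whose coordinates are NOT named latitude/longitude/time
# (layout invented for illustration):
dataset = {
    "lat": {"units": "degrees_north"},
    "lon": {"units": "degrees_east"},
    "obs_time": {"standard_name": "time", "units": "hours since 2007-01-01"},
    "T": {"standard_name": "air_temperature", "units": "K"},
}
```

With this kind of lookup, a query parameter like north=17.3 can be applied to the "lat" variable without the client ever knowing it is called "lat".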

OPeNDAP constraint expressions are very powerful but they have two major 
problems:

1) They operate at the syntactic level, so, for example, they don't know that 
lon means longitude, and so they can't deal with the longitude seam at +/-180 (or 
wherever it is). Another example: if your dataset does not include lat/lon 
variables, but instead is on a projection, your client has to know how to do 
the projective-geometry math.

2) It's hard to efficiently implement the full relational constraint expressions 
unless you are using an RDBMS. For that reason, you rarely see them implemented 
in OPeNDAP servers. NCSS implements only space, time, and variable 
subsetting. That is hard enough to do in a general way, but not as hard as 
supporting relational constraints on all fields. (On the other hand, the relational 
queries are very nice to use; it's just the server implementation that's hard.)
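The longitude-seam problem in (1) is easy to make concrete: in a purely syntactic constraint expression, '&' means AND and there is no OR, so a bounding box that crosses the +/-180 seam cannot be expressed as a single range request. A coordinate-aware server (or a smart client) has to split it in two. A minimal sketch, assuming the variable happens to be named "longitude":

```python
def lon_constraints(west, east, varname="longitude"):
    """Return OPeNDAP-style constraint term(s) for a longitude range
    in -180..180 coordinates, splitting the request when it crosses
    the +/-180 seam (i.e. when west > east)."""
    if west <= east:
        # ordinary box: a single conjunction of two terms
        return [f"{varname}>={west}&{varname}<={east}"]
    # box crosses the seam: '&' means AND and there is no OR between
    # terms, so the client must issue two separate requests
    return [f"{varname}>={west}&{varname}<=180.0",
            f"{varname}>=-180.0&{varname}<={east}"]
```

For example, a box from 170E eastward to 170W becomes two requests, one for 170..180 and one for -180..-170; a syntax-only server has no way to know it should do this.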

I have made various suggestions to James over the years about what extensions to OPeNDAP could 
be used for this use case, but there's no point in Unidata creating non-standard OPeNDAP 
implementations, since the whole point of OPeNDAP is interoperability between clients and 
servers. If a standard OPeNDAP way to do coordinate-space subsetting emerged, we would be 
willing to implement it. The DAPPER protocol, for example, seems to be the best fit I've seen 
for the Station Data Subset Service use case; essentially, DAPPER is a small set of conventions 
on top of OPeNDAP. In my opinion these need to be clarified and extended a bit to be generally 
useful, but they are a good start. (By the way, are you using it?)

In the meantime, it's much faster for us to roll our own, since we own both the 
server and the client stack, so we can experiment with what works without 
worrying about breaking OPeNDAP or OGC standards. Most of the work is in the 
server implementation, so if there were a different but functionally equivalent 
query protocol, we could easily switch to it. So I'm pretty confident that the 
software we have been implementing can be reused, no matter what protocol clients 
eventually want us to support. I am aware of the dangers of proprietary 
protocols, but also of the frustration of complex standards, and of standards 
that don't move for ten years.

Smart clients like the ones you have been writing can do a lot on top of OPeNDAP, but dumb(er) 
clients can't. We need to push as much of that intelligence into the server as possible, and to 
do that, we need to operate on "higher-level semantic" objects rather than indexed arrays. In 
the CDM, these objects are intended to be the "Feature Types". The "Grid" Feature Type allows 
the TDS to support the OGC WCS and WMS protocols, which are becoming more important for getting 
our data out to a wider community; those protocols, however, have the problem of being overly 
complex. The NCSS protocols are looking for the sweet spot of functionality and simplicity.

would you consider switching to the OPeNDAP constraint expression protocol?

I'd be willing to add something like DAPPER as another way for the Station Data 
Subset Service to deliver data, if there were an important class of clients 
that needed it and could use it. On the other hand, if your software is using the 
CDM stack, do you care how the objects are delivered to it?

switching to OPeNDAP constraint expressions would hook your service into the 
realm of other servers and clients that already support OPeNDAP constraint 
expressions.

I'd be interested in knowing which clients can handle relational constraint 
expressions. The netCDF clients cannot, because that falls outside their data 
model and API. I know you do a lot with relational databases, so it's not 
surprising if your software does. I've been working almost exclusively on top 
of collections of files (netCDF, HDF, GRIB, BUFR, etc.). I have been on the 
lookout for new solutions, but for now it seems that people need services that 
run on top of those file collections.

Comments, please

I'm looking forward to an extended discussion of these issues and of where remote-access 
protocols should evolve. Anyone who would like to comment, please feel 
free. Note that I've cross-posted to two groups; beware of cross-posting if you're 
not on both. (Now that I think of it, I'm not sure that I'm on both.)

John Caron