Re: [thredds] [uaf_tech] Re: Station Data Subset Service vs. OPeNDAP constraint expressions

  • To: Steve Hankin <Steven.C.Hankin@xxxxxxxx>
  • Subject: Re: [thredds] [uaf_tech] Re: Station Data Subset Service vs. OPeNDAP constraint expressions
  • From: Bob Simons <Bob.Simons@xxxxxxxx>
  • Date: Thu, 14 Oct 2010 11:20:11 -0700
Comments below...

On 10/14/2010 10:23 AM, Steve Hankin wrote:
  Hi John,

A really great, thoughtful discussion. You point out below, "Smart
clients like the ones you [Roy and Bob] have been writing can do a lot
on top of OPeNDAP [and the netCDF API], but dumb(er) clients can't."

I'm not so comfortable with this smart/dumb client distinction. Yes, in a pure opendap world, I guess a smart client is one that can process an opendap binary stream and a dumb one is one that can't. But the distinction fades when the server allows the response to be in different formats. John's NCSS, OGC services, and ERDDAP all support multiple response formats. Then, the client doesn't have to be very smart. For example, Matlab needs an opendap/netcdf toolbox in order to be smart enough to get data from a pure opendap server. But if the server can take a URL request and return a Matlab file, Matlab doesn't need to be very smart. It (and other "dumb" clients) just needs to be smart enough to read its native file type.

Yes, opendap + netcdf has been very successful. Netcdf is a great client, but it is just one client. I hope we (notably UAF) won't limit ourselves to just this one client. With a little extra effort, servers can connect to other clients (smart? dumb?) by providing a response in a format the client can already understand.

Essentially this is an argument that the underlying data access problem
probably doesn't have a one-size-fits-all solution. With the SDSS you
have proposed a new protocol intended to address the needs of simple
clients. In this email I'd like to talk about the other case -- what to
do for "smart" clients?

Would we agree that much of the power and success of the marriage
between netCDF and OPeNDAP stems from the fact that it defines a service
(OPeNDAP) that is nearly 100% compatible with a local file API (netCDF)?
Scientists will continue to need local files for the foreseeable future.
The power of the marriage is that it unifies access to local files with
access to remote (virtual) files. I don't think this is a contentious
point; if it is I hope others will follow up.

With https://cf-pcmdi.llnl.gov/trac/wiki/PointObservationConventions you
have proposed extensions of the CDM to represent collections of
observations in a netCDF file. Let's assume that these conventions will
be embraced by the community and will prove popular. There are good
reasons to believe so. A number of today's netCDF clients will be able
to read these new conventions with little or no modification; relatively
small mods in the clients will add a level of convenience that may
initially be missing.

Will it be possible to retain the tight marriage of remote and local
file access for the Point Data extensions of the CDM? The potential killer
is performance. Using netCDF array syntax, when a client wants to read
(say) all of the time series stations found within a particular lat/long
box, it typically has to do so with many separate netCDF calls, because
the data are not contiguous in the file. This works ok for local file
access, but network latency can make it impractical for remote data access.


Yes. This is a very important issue.

Is there a way to side-step the network latency? I think so ... and I'd
like to hear your thoughts on it. Suppose we want to serve the (large)
file http://server/time_series_collection.nc. If we take a liberal view
of the OPeNDAP syntax and allow relational constraints to be applied to
array variables, then this expression

    http://server/time_series_collection.nc?lat<=17&lat>=12&lon>=140&lon<=160


represents the collection of stations in a lat-lon box (assuming
variables named "lon" and "lat"). The expression also describes a
virtual file that contains *only* the time series that lie within this
box. In that virtual file the time series of interest are stored
contiguously. So in principle, a client that executes

ncopn("http://server/time_series_collection.nc?lat<=17&lat>=12&lon>=140&lon<=160",
...);

can then read an entire variable with a single netCDF read operation.
The latency is eliminated; the work has been shifted from the client to
the server.
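To make the idea concrete, here is a minimal sketch of a helper that assembles such a constrained URL from a bounding box. The function name and signature are hypothetical, for illustration only; no existing client library provides this call.

```python
# Hypothetical helper (illustration only): build an OPeNDAP-style URL
# with relational constraints on the lat/lon variables, matching the
# example request above.
def constrained_url(base_url, lat_range, lon_range,
                    lat_var="lat", lon_var="lon"):
    """Append relational constraints selecting stations in a lat/lon box."""
    lat_min, lat_max = lat_range
    lon_min, lon_max = lon_range
    constraint = (
        f"{lat_var}<={lat_max}&{lat_var}>={lat_min}"
        f"&{lon_var}>={lon_min}&{lon_var}<={lon_max}"
    )
    return f"{base_url}?{constraint}"

url = constrained_url("http://server/time_series_collection.nc",
                      lat_range=(12, 17), lon_range=(140, 160))
# url == "http://server/time_series_collection.nc?lat<=17&lat>=12&lon>=140&lon<=160"
```

A client would then hand this URL to its normal open call (ncopn above); the subsetting work happens server-side.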

Yes. And I think that is the nature of a service. A server provides a service via a specified interface. How it does its work is the server's problem.

For PointFeatures datasets (and for the coming UAF project for those datasets), I think opendap constraint expressions and opendap sequence responses (and hopefully other file format responses) are the most appropriate, flexible, already-standardized protocol for this type of data. And as Steve emphasizes ("Don't solve problems. Copy success."), there are existing successes which use this approach: opendap's RDBMS, ERDDAP, PyDap, Dapper, Oceanotron/Dap4cor.

Hopefully, we can get this to work with the netcdf-java library (as Dennis Heimbigner is doing with the netcdf C library) *and* other clients. Then the solution isn't just limited to THREDDS+netcdf and the solution will be more general than the station/lat/lon/time NCSS service (which is fine for its purposes).

If I can slightly redirect my request to John:
* I know there is already some support for opendap constraint expressions in netcdf-java. Is this feature fully operational / could it be made fully operational?
* Would you consider adding support in THREDDS for opendap constraint expressions and opendap sequence responses for PointFeatures-type datasets?


This approach would require new caching logic on the server --
presumably extracting the requested data subset and serving it as new
(virtual) datasets. That's the down side. But the up side is that the
solution would build naturally on the success story of netCDF-CF-OPeNDAP.

Discussion?

- Steve

===============================

On 10/13/2010 8:52 PM, John Caron wrote:
On 10/13/2010 5:00 PM, Bob Simons wrote:
John,

I read about the new Station Data Subset Service (I'll call it SDSS
in this email)
http://www.unidata.ucar.edu/projects/THREDDS/tech/interfaceSpec/StationDataSubsetService.html

(version 0.2), which lists you as the contact. I understand that the
UAF group is considering using SDSS to deal with station data.

I noticed that SDSS queries are very similar to OPeNDAP constraint
expression queries (
http://www.opendap.org/user/guide-html/guide_33.html ).
Yet, SDSS seems limited to one type of dataset (stations with time,
latitude, longitude, ... data, because it uses specific variable
names, e.g., stn, north, south, west, east, time for the constraints)
while OPeNDAP constraint expressions can be used with a much broader
range of datasets, notably, any dataset that can be represented as a
database-like table, because it isn't tied to any specific variable
names. And OPeNDAP's bigger set of operators (=, <, <=, >, >=, !=,
=~) can be applied to any variable, not just
longitude/latitude/depth/time/stn.

The sample queries in the SDSS documentation can easily be converted
to OPeNDAP constraint expression queries, for example:

SDSS: ?north=17.3&south=12.088&west=140.2&east=160.0
OPeNDAP:
?latitude<=17.3&latitude>=12.088&longitude>=140.2&longitude<=160.0

SDSS: ?stn=KDEN
OPeNDAP: ?stn="KDEN"

SDSS: ?stn=KDEN&stn=KPAL&stn=SDOL
OPeNDAP: ?stn=~"KDEN|KPAL|SDOL"
(=~ lets you specify a regular expression to be matched)

SDSS: ?time_start=2007-03-29T12:00:00Z&time_end=2007-03-29T13:00:00Z
OPeNDAP: ?time>="2007-03-29T12:00:00Z"&time<="2007-03-29T13:00:00Z"
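The correspondence above is mechanical enough that it can be captured in a few lines. The sketch below is a hypothetical translator (illustration only, not part of SDSS or any OPeNDAP library) from SDSS-style parameters to an equivalent OPeNDAP constraint expression, using the same variable names as the examples.

```python
# Hypothetical translator (illustration only) from SDSS query parameters
# to an equivalent OPeNDAP constraint expression.
def sdss_to_opendap(params):
    parts = []
    if "south" in params and "north" in params:
        parts.append(f'latitude<={params["north"]}')
        parts.append(f'latitude>={params["south"]}')
    if "west" in params and "east" in params:
        parts.append(f'longitude>={params["west"]}')
        parts.append(f'longitude<={params["east"]}')
    stns = params.get("stn", [])
    if len(stns) == 1:
        parts.append(f'stn="{stns[0]}"')
    elif len(stns) > 1:
        parts.append('stn=~"' + "|".join(stns) + '"')  # =~ is a regex match
    if "time_start" in params:
        parts.append(f'time>="{params["time_start"]}"')
    if "time_end" in params:
        parts.append(f'time<="{params["time_end"]}"')
    return "?" + "&".join(parts)

sdss_to_opendap({"north": "17.3", "south": "12.088",
                 "west": "140.2", "east": "160.0"})
# -> '?latitude<=17.3&latitude>=12.088&longitude>=140.2&longitude<=160.0'
sdss_to_opendap({"stn": ["KDEN", "KPAL", "SDOL"]})
# -> '?stn=~"KDEN|KPAL|SDOL"'
```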

SDSS' accept=mime_type could be mimicked by having the OPeNDAP server
support file extensions in addition to .dods and .asc (or by some
other means if necessary). And mime types have a problem if two file
types share the same mime type.

OPeNDAP's sequence data type is well-suited to this type of data
query and to the API described at
http://www.unidata.ucar.edu/software/netcdf-java/reference/FeatureDatasets/PointFeatures.html
.

I have worked quite a lot with OPeNDAP constraint expressions and I
have found them to be
* Very flexible (well-suited to a wide range of datasets and queries),
* Very easy for non-programmers to read, write, and understand,
* Easy to convert into queries for other types of data servers (e.g.,
SQL, SOS, OBIS),
* Easy for data servers to handle and optimize.
They are sort of like a nice subset of SQL with a really simple syntax.


All of this discussion leads up to this:
I'm very curious: why did you decide to define a new protocol instead
of using the existing standard OPeNDAP constraint expression
protocol? And/or, would you consider switching to the OPeNDAP
constraint expression protocol?

Instead of creating a new service with one server implementation
(THREDDS) and one client implementation (netcdf-java), switching to
OPeNDAP constraint expressions would hook your service into the realm
of other servers and clients that already support OPeNDAP constraint
expressions.

And supporting OPeNDAP constraint expressions in THREDDS seems like a
logical extension for a data server which already supports OPeNDAP
grid/hyperslab queries.

I am very curious to hear your thoughts on this.

Thanks for considering this.


Sincerely,

Bob Simons
IT Specialist
Environmental Research Division
NOAA Southwest Fisheries Science Center
1352 Lighthouse Ave
Pacific Grove, CA 93950-2079
Phone: (831)658-3205
Fax: (831)648-8440
Email: bob.simons@xxxxxxxx

The contents of this message are mine personally and
do not necessarily reflect any position of the
Government or the National Oceanic and Atmospheric
Administration.
<>< <>< <>< <>< <>< <>< <>< <>< <><

Hi Bob:

The original motivation of the Netcdf Subset Service was to provide
subsets of gridded data in netCDF-CF format. The subsetting request is
specified in coordinate (lat/lon/alt/time) space, so that it could be
done from a web form, or from a simple wget script. The service has
continued to evolve, and it's time to evaluate where it is and where it
should go, so your question comparing it to OPeNDAP is timely.

Background

The NetCDF Subset Services (NCSS) are a family of experimental web
protocols for making queries in coordinate space (rather than index
space), against CDM "Feature Type" datasets; see:

http://www.unidata.ucar.edu/projects/THREDDS/tech/interfaceSpec/NetcdfSubsetService.html


Functionally, they are intended to be a simplified version of the OGC
protocols, and are most directly an alternative to OGC web services.
In order to support queries in coordinate space the data model has to
have a general notion of coordinates, and in particular, the use case
I want to cover is to support space/time subsetting. The data models
of OPeNDAP, netCDF and HDF5 have only partially handled coordinate
systems; see:

http://www.unidata.ucar.edu/software/netcdf-java/CoordinateSystemsNeeded.htm


This is one reason why the OGC protocols have the mind share that they
do (plus lots of $$$ and commercial effort, etc). This is also the
reason that the CDM is an extension of OPeNDAP, netCDF and HDF5,
rather than just their union, see:

http://www.unidata.ucar.edu/software/netcdf-java/CDM/index.html

As I mentioned, NCSS are intended to return results in commonly used
formats (netCDF, CSV, XML, etc) that can be used in other applications
directly, rather than having to have a smart client that can convert
binary dods objects.

OPeNDAP

To answer your specific questions:

Yet, SDSS seems limited to one type of dataset (stations with time,
latitude, longitude, ... data, because it uses specific variable
names, e.g., stn, north, south, west, east, time for the constraints)
while OPeNDAP constraint expressions can be used with a much broader
range of datasets, notably, any dataset that can be represented as a
database-like table, because it isn't tied to any specific variable
names. And OPeNDAP's bigger set of operators (=, <, <=, >, >=, !=,
=~) can be applied to any variable, not just
longitude/latitude/depth/time/stn.

"stn, north, south, west, east, time" are not variable names, they are
names for those semantic concepts, and don't depend on those names
being present in the dataset. In that sense they are more general than
an OPeNDAP request, where you have to know what the actual names of
the variables are.

OPeNDAP constraint expressions are very powerful but they have two
major problems:

1) they operate on the syntactic level, so, for example, they don't
know that lon == longitude, and so can't deal with the longitude seam
at +/- 180 (or wherever it is). Another example: if your dataset does
not include lat/lon variables, but instead is on a projection, your
client has to know how to do the projective geometry math.
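The seam problem can be made concrete: a single pair of comparisons "lon>=west&lon<=east" cannot describe a box that crosses the +/-180 seam; a semantics-aware client or server has to split it into two ranges. A hypothetical sketch (illustration only):

```python
# Illustration of the longitude-seam problem: a purely syntactic
# constraint "lon>=west&lon<=east" breaks when the requested box
# crosses the +/-180 seam, so the request must be split into two
# disjoint ranges. (Hypothetical helper, not from any library.)
def lon_constraints(west, east, lon_var="lon"):
    if west <= east:
        return [f"{lon_var}>={west}&{lon_var}<={east}"]
    # box crosses the seam: two disjoint ranges are needed
    return [f"{lon_var}>={west}&{lon_var}<=180",
            f"{lon_var}>=-180&{lon_var}<={east}"]

lon_constraints(140, 160)   # one ordinary range
lon_constraints(170, -170)  # crosses the dateline -> two constraints
```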

2) it's hard to efficiently implement the full relational constraint
expressions unless you are using an RDBMS. For that reason, you rarely
see them implemented in OPeNDAP servers. The NCSS only implements space,
time, and variable subsetting. This is hard enough to do in a
general way, but not as hard as supporting relational constraints on
all fields. (OTOH, the relational queries are very nice to use; it's
just the server implementation that's hard.)
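The implementation burden is easy to see in miniature: without an RDBMS index, a server answering an arbitrary relational constraint has little choice but to scan every record and evaluate each comparison. A sketch of that naive full-scan evaluation (illustration only; the record layout and function names are hypothetical):

```python
# Sketch of why relational constraints are costly to serve without an
# index: absent an RDBMS, the server must scan every record and
# evaluate each comparison. (Illustration only.)
import operator

OPS = {"<=": operator.le, ">=": operator.ge, "<": operator.lt,
       ">": operator.gt, "!=": operator.ne, "=": operator.eq}

def evaluate(records, constraints):
    """Full scan: keep records satisfying every (name, op, value) triple."""
    return [r for r in records
            if all(OPS[op](r[name], value) for name, op, value in constraints)]

stations = [{"stn": "KDEN", "lat": 39.8}, {"stn": "KPAL", "lat": 15.2}]
evaluate(stations, [("lat", "<=", 17), ("lat", ">=", 12)])
# -> [{'stn': 'KPAL', 'lat': 15.2}]
```

An RDBMS gets the same answer from an index lookup; a file-backed server pays the full scan on every query.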

I have made various suggestions to James over the years on what
extensions to OPeNDAP could be used for this use case, but there's no
point in Unidata creating non-standard OPeNDAP implementations, since
the whole point of OPeNDAP is interoperability between clients and
servers. If a standard OPeNDAP way to do coordinate space subsetting
emerged, we would be willing to implement it. The "DAPPER protocol"
for example seems to be the best fit that I've seen for the "Station
Data Subset Service" use case; essentially DAPPER is a small set of
conventions on top of OPeNDAP. These need to be clarified and extended
a bit IMO to be generally useful, but are a good start. (BTW, Are you
using it?)

In the meanwhile, it's much faster for us to roll our own, since we own
both the server and the client stack, so we can experiment with what
works without worrying about breaking OPeNDAP or OGC standards. Most
of the work is in the server implementation, so if there was a
different but functionally equivalent query protocol, we could easily
switch to it. So I'm pretty confident that the software we have been
implementing can be used, no matter what protocol clients eventually
want us to support. I am aware of the dangers of proprietary
protocols, but also the frustration of complex standards and ones that
don't move for 10 years.

Smart clients like the ones you have been writing can do a lot on top
of OPeNDAP, but dumb(er) clients can't. We need to push as much of
those smarts into the server as possible, and in order to do that, we
need to operate on "higher level semantic" objects than indexed
arrays. In the CDM, these objects are intended to be the "Feature
Types". The "Grid" Feature Type allows the TDS to support the OGC WCS
and WMS protocols, which are becoming more important to getting our
data out to a wider community. Those have the problem of being overly
complex. The NCSS protocols are looking for the sweet spot of
functionality and simplicity.

would you consider switching to the OPeNDAP constraint expression
protocol?

I'd be willing to add something like DAPPER as another way that the
Station Data Subset Service can deliver data, if there was an
important class of clients that needed it and could use it. OTOH, if
your software is using the CDM stack, do you care how the objects are
delivered to it?

switching to OPeNDAP constraint expressions would hook your service
into the realm of other servers and clients that already support
OPeNDAP constraint expressions.

I'd be interested in knowing which clients can handle relational
constraint expressions. The netCDF clients cannot, because such
expressions fall outside of the data model and API. I know you guys
do a lot with relational databases, so it's not surprising if your
software does. I've been working almost exclusively on top of collections of files
(netcdf, hdf, grib, bufr, etc). I have been on the lookout for new
solutions, but for now it seems that people need services that run on
top of those file collections.

Comments, please

I'm looking forward to an extended discussion of these issues and where
remote access protocols should evolve. Anyone who would like to
comment, please feel free. Note that I've cross-posted to two groups;
beware of cross-posting if you're not on both. (Now that I think of
it, I'm not sure that I'm on both.)

John Caron

_______________________________________________
thredds mailing list
thredds@xxxxxxxxxxxxxxxx
For list information or to unsubscribe, visit:
http://www.unidata.ucar.edu/mailing_lists/

--
Sincerely,

Bob Simons
IT Specialist
Environmental Research Division
NOAA Southwest Fisheries Science Center
1352 Lighthouse Ave
Pacific Grove, CA 93950-2079
Phone: (831)658-3205
Fax:   (831)648-8440
Email: bob.simons@xxxxxxxx

The contents of this message are mine personally and
do not necessarily reflect any position of the
Government or the National Oceanic and Atmospheric
Administration.
<>< <>< <>< <>< <>< <>< <>< <>< <><


