More on conventions

Ethan Alpert (ethan@niwot.scd.ucar.EDU)
Wed, 30 Sep 92 14:02:27 MDT

Hello again,

	There's been so much to talk about in the past few days that I'm not
sure I know how to respond. First, I think we all agree that conventions
are needed and that these conventions should support the "possibility of
writing applications that access generic netCDF files and operate on them."
So, what do we need to do to support this? I contend that the current
document, in its present form, is somewhat inadequate, for a few reasons
which I intend to cover.

	I'd like to start by looking at some examples based on what Tim
Holt stated. Before I start I should state that I am a "computer tweak?". 
Although I've never heard this term before, I'm sure it could be applied to me. 

Tim writes:

	What it comes down to for the average tech/PI with raw data is this -- 
"I want to make a graph of time vs temperature", "I want to plot the tracklines from the cruise", or "Where were we when we took water sample 38, and what was 
the flow-through temp and conductivity?"

	I think discussing what a "generic application" will need to know
about the data in order to accomplish these tasks will highlight some areas
where conventions will do a lot of good, and some areas where conventions of
the wrong type may inhibit the production of general tools. So let's look at
each of these examples and what's involved in implementing them from the
application's perspective.

"I want to make a graph of time vs temperature"

	Seems simple enough. First, a generic application may know about the
concept of time, but it really doesn't need to know that the dependent
variable is temperature. In fact, all the application really needs is an
array that represents coordinates in the X direction, an array that
represents coordinates in the Y direction, and an indication of which one is
the independent variable. With these arrays it can determine the ranges of
the values in the data and set up a window->viewport mapping for
transforming the data onto a location on the screen. Not really much of a
problem, except: how does the application know which of the possibly many
variables in the file are the appropriate ones to use for this plot?
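
The window->viewport step itself is simple; a minimal sketch in Python
(the function and default viewport are my own illustration, not part of
any netCDF library):

```python
def window_to_viewport(x, y, viewport=(0, 0, 640, 480)):
    """Map data coordinates (the 'window') onto screen coordinates
    (the 'viewport') with a linear transform in each direction."""
    vx0, vy0, vx1, vy1 = viewport
    xmin, xmax = min(x), max(x)
    ymin, ymax = min(y), max(y)
    sx = (vx1 - vx0) / (xmax - xmin)
    sy = (vy1 - vy0) / (ymax - ymin)
    return [((xi - xmin) * sx + vx0, (yi - ymin) * sy + vy0)
            for xi, yi in zip(x, y)]

# time (independent) vs. temperature (dependent)
points = window_to_viewport([0, 1, 2, 3], [10.0, 12.5, 11.0, 15.0])
```

Note that nothing here is the hard part; the hard part is deciding which
two arrays in the file to hand to this function.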

	Now what happens when the data is not stored in two simple arrays? 
Whose responsibility is it to state how variables in the netCDF file should
be selected and ordered to produce the two arrays needed for this task? 
For example, a data set could be collected that contains temperature,
pressure, and humidity. The following are a couple of the many possible
ways to put this data into a netCDF file.

netcdf file1 {
dimensions:
	values = 5;
	time = UNLIMITED;
variables:
	float dataset1(time, values);
		dataset1:index0 = "temperature";
		dataset1:index1 = "pressure";
		dataset1:index2 = "humidity";
		dataset1:index3 = "latitude";
		dataset1:index4 = "longitude";
	long time(time);
}

netcdf file2 {
dimensions:
	values = 3;
	latlon = 2;
	time = UNLIMITED;
variables:
	float dataset2(time, values);
		dataset2:index0 = "temperature";
		dataset2:index1 = "pressure";
		dataset2:index2 = "humidity";
	long time(time);
	float location(time, latlon);
		location:index0 = "latitude";
		location:index1 = "longitude";
}
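
To make the problem concrete, here is a sketch in Python (with the file
contents mocked up as plain dictionaries rather than real netCDF calls)
of what it takes to pull the temperature series out of a layout like
file1. Note that it only works because I invented an "indexN" attribute
convention; without some agreed-upon convention the application has no
way to perform even this much:

```python
# Mock of file1: one variable holding all five quantities per time step.
file1 = {
    "dataset1": {
        "attrs": {"index0": "temperature", "index1": "pressure",
                  "index2": "humidity", "index3": "latitude",
                  "index4": "longitude"},
        "data": [[10.0, 990.0, 0.5, 40.0, -105.0],
                 [11.0, 985.0, 0.6, 40.1, -105.1]],
    }
}

def extract(variables, quantity):
    """Find which variable and column hold `quantity` by scanning the
    hypothetical indexN attributes, then slice that column out."""
    for var in variables.values():
        for attr, name in var["attrs"].items():
            if name == quantity and attr.startswith("index"):
                col = int(attr[len("index"):])
                return [row[col] for row in var["data"]]
    raise KeyError(quantity)

temps = extract(file1, "temperature")
```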

The reasons why someone would want to organize their data in this fashion
are inconsequential; they may be related to how the instrument measuring
the data works. In these two examples, one file uses 3 netCDF dimensions
and three variables and the other uses 2 netCDF dimensions and two
variables to represent the same data. So now I ask the question again: how
is the application supposed to know what it means to plot time vs
temperature? These are VERY VERY simple examples. The complexity of
"understanding" the organization of the data, from simply looking at the
organization of the variables and dimensions in a file, grows as higher
dimensional datasets are looked at. The number of permutations in the
organization of a dataset grows as the dimensionality of the data grows.

"I want to plot the tracklines from the cruise" 

	What information is needed by the application in this case? The app
needs to know which variables in the netCDF file are "latitude" and
"longitude" and that the data is in fact geographic data. It then needs to
determine the extent of the latitude and longitude variables so it can
select the appropriate map projection. Again, as in the previous example,
this data could exist in the netCDF file in various organizations.
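
Computing the extent is the easy part; the hard part is knowing which
variables are the coordinates in the first place. A sketch of the extent
calculation, assuming the application has somehow already identified the
latitude and longitude arrays:

```python
def geographic_extent(lats, lons):
    """Return the bounding box of a cruise track so the application
    can pick an appropriate map projection and set its limits."""
    return {"lat": (min(lats), max(lats)),
            "lon": (min(lons), max(lons))}

extent = geographic_extent([40.0, 40.5, 41.2], [-105.0, -104.2, -103.9])
```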

"Where were we when we took water sample 38, and what was the flow-through temp
and conductivity?"

	This type of request, if made directly to a generic application,
would require the application to "know" what "sample 38", "flow-through",
"temp", and "conductivity" are, where they're stored, and how to access and
display them. This certainly seems to be outside the reasonable scope of
capability of a generic application.

	As can be seen, there are several things that a self-describing
netCDF file cannot possibly describe to an application. IMHO the primary
problem is a lack of standard organizations of data, or a lack of a
mechanism for communicating the organization of the data. By organization
I mean: what are the geometries of the data (1D, 2D, 3D, ...), what set of
variables and dimensions make up a single data set, which class of data is
the set (rectilinear grid, scattered, line, irregular grid, mesh, ...),
and does a given variable represent an independent or dependent variable.
I maintain that these are the types of information for which conventions
are needed in order to realize "applications that access generic netCDF
files and operate on them."
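
The kind of information I'm arguing conventions should carry might look
like this, sketched as a Python data structure (the field names are
purely illustrative, not a proposal for specific attribute names):

```python
# A hypothetical description of one logical data set within a file,
# grouping variables and stating structure rather than quantity names.
dataset_description = {
    "members": ["dataset2", "time", "location"],  # variables that belong together
    "geometry": "2D",              # 1D, 2D, 3D, ...
    "data_class": "scattered",     # rectilinear grid, scattered, line, mesh, ...
    "independent": ["time", "location"],
    "dependent": ["dataset2"],
}

def is_independent(desc, var):
    """With structure recorded, an application can ask the question
    directly instead of guessing from variable names."""
    return var in desc["independent"]
```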

	The current document only standardizes names. Although names are
important for allowing humans to understand the data, they are inadequate
for communicating to the application how the data is organized.
Understanding the organization is needed to allow the application to
determine which methods could be used to visualize the data. If the
intention of standardizing names is to allow applications to "understand"
the data based on the names of variables in a netCDF file, it won't work
as well as standardizing data representations (organizations). Why?
Because many types of data from different disciplines can be classified
and visualized based on the geometry information (coordinate system) of
the data, which does not depend on the names or type of data, but on the
structure. Using names like "sfc_t" for surface temperature does nothing
to communicate the organization of the data or allow an application to
infer a visualization method unless the application has been configured to
"understand" all of the names in the document. This is completely
unnecessary due to the fact that most data fit into simple classes
(organizations, structures) of data.

Consider a boat moving around on the surface of the ocean collecting data.
The structure or class of data for these data sets can be classified as a
"2D Random data set." Why 2D? Because in each case there are 2 coordinates
(lat, lon) that define the location of the sample point. Why Random?
Because there are no functional relationships between the coordinate pairs.
Similar abstractions can be made for gridded data and other classes of
data. I feel very strongly that these are the areas that need to be
standardized: not names but structures. Until there is a method of grouping
variables in a netCDF file such that the geometric properties of the data
can be inferred, a generic visualization application is really impossible.
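
As one illustration of how structure alone can drive the choice of
visualization method, here is a sketch (my own invented check, not an
existing convention) that distinguishes a 2D random (scattered) data set
from a 2D rectilinear grid by looking only at the sample locations, never
at what the variables are named:

```python
def classify_2d(points):
    """Classify a set of 2D sample locations from structure alone:
    'rectilinear grid' if the points are exactly the cross product of
    their distinct x and y values, otherwise '2D random' (scattered)."""
    xs = sorted(set(p[0] for p in points))
    ys = sorted(set(p[1] for p in points))
    if (len(xs) * len(ys) == len(points) and
            set(points) == {(x, y) for x in xs for y in ys}):
        return "rectilinear grid"
    return "2D random"

grid = [(x, y) for x in (0, 1, 2) for y in (0, 1)]      # regular mesh
track = [(40.0, -105.0), (40.5, -104.2), (41.2, -103.9)]  # cruise track
```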

Ethan Alpert  internet: | Standard Disclaimer:
Scientific Visualization Group,             | I represent myself only.
Scientific Computing Division               |-------------------------------
National Center for Atmospheric Research, PO BOX 3000, Boulder Co, 80307-3000