« building NetCDF with... | Main | Indexes R'nt us »

04 June 2011

Lets look at Gray's 2nd criterion for Scientific Data Management Systems (SDMS):

Schema language: powerful data definition tools allow one to specify the abstract data formats and to specify how the data is organized.

Gray also mentions the following specific capabilities that I think are part of this topic:

allow data to be replicated - and organized in multiple ways
materialized view

NetCDF CDL is a declarative language that allows one to specify the data in a netCDF file and how its organized on disk:

netcdf jan {
 dimensions:
   time = 31;
   lat = 3;
   lon = 4;
 variables:
   double P(time, lat, lon);
   double T(time, lat, lon);
     T:long_name = "surface temperature";
     T:units = "degC";
   float lat(lat);
     lat:units = "degrees_north";
   float lon(lon);
     lon:units = "degrees_east";
   int time(time);
     time:units = "days since Jan 1, 2000";
 // global attributes:
 :title = "Example Data";
}

In its simplest form, NcML is just an XML version of CDL:

<?xml version="1.0" encoding="UTF-8"?>
<netcdf xmlns="https://www.unidata.ucar.edu/namespaces/netcdf/ncml-2.2" location="C:/dev/tds4.3/cdm/src/test/data/jan.nc">
  <dimension name="time" length="31" />
  <dimension name="lat" length="3" />
  <dimension name="lon" length="4" />
  <attribute name="title" value="Example Data" />
  <variable name="P" shape="time lat lon" type="double" />
  <variable name="T" shape="time lat lon" type="double">
    <attribute name="long_name" value="surface temperature" />
    <attribute name="units" value="degC" />
  </variable>
  <variable name="lat" shape="lat" type="float">
    <attribute name="units" value="degrees_north" />
  </variable>
  <variable name="lon" shape="lon" type="float">
    <attribute name="units" value="degrees_east" />
  </variable>
  <variable name="time" shape="time" type="int">
    <attribute name="units" value="days since Jan 1, 2000" />
  </variable>
</netcdf>

A powerful part of netCDF is that a binary form of the schema is encoded in the file itself, and one can move back and forth between the binary encoding and the ascii CDL or XML. This is one sense of the meaning of a netCDF file being self contained (the other refers to the metadata being contained in the file, as opposed to using external tables that may be difficult to find).

Its can be an advantage that the schema, metadata, and data can be contained in a single file. One can move this file around to different machines using the OS file system primitives. That's as simple as it gets.

Tools such as the NCO operators, NcML in the netcdf-java library, Unidata's new nccopy utility, and many others allow users to fairly easily reorganize the data in netcdf files. This is equivilent to the data replication / materialized views of retaional databases. The main limitation in netCDF is that such reorganization may break index-based applications, as previously noted.

Gray recognizes that scientific file formats like netCDF/HDF have schema languages:

We see NetCDF, HDF, FITS, and Google Map-Reduce as nascent database systems (others might think of them as file systems). They have a schema language (metadata) to define the metadata. They have a few indexing strategies, and a simple data manipulation language. They have the start of non-procedural and parallel programming. And, they have a collection of tools to create, access, search, and visualize the data. So, in our view they are simple database systems.

Perhaps interestingly, the netCDF documentation disclaims any aspirations to being a database system.

So for Gray's 2nd criterion for SDMS, I will give netCDF a grade of "pass". The functionality is all there, although arguably the seperate tools could be packaged together into a convenient scripting language. Is there a Python programmer in the house?

An interesting exercise is to compare SQL to CDL, which will also be an exercise in comparing the multidimensional data model to the relational table data model. To keep it from getting completely out of hand, lets use a netCDF file that fits the relational table model:

netcdf point {   dimensions:     sample= UNLIMITED; // currently 2344    variables:     float lon(sample);       lon:long_name = "longitude";       lon:units = "degrees_east";     float lat(sample);       lat:long_name = "latitude";       lat:units = "degrees_north";     int time(sample);       time:long_name = "time";       time:units = "hours since 1989-01-01 00:00:00";     float data(sample);       data:long_name = "skin temperature";       data:units = "Celsius";       data:coordinates = "time lon lat";     float data2(sample);       data2:long_name = "humidity";       data2:units = "%";       data2:coordinates = "time lon lat";  
 // global attributes:   :featureType = "point";   :Conventions = "CF-1.5";  }

So we have two data values (we could have as many as we want), with associated lat, lon and time coordinates at each "sample point". This is an example of point data in the CF discrete sampling Convention. The points are unconnected in that there is no particular relationship between data at sample i and sample i+1.

The corresponding SQL might look something like:

CREATE TABLE point( data REAL, data2 REAL, lat REAL, lon REAL, time DATE )

(Im not sure where the metadata (long name, units, etc) goes, I think as a seperate table somewhere, but we will ignore it for this conversation.)

Each row of the table corresponds to the data for a particular value of the sample index in the netCDF file. By making the sample dimension the unlimited dimension, the data is even laid out in the same way on disk for both the database and the netCDF file, namely all the data for sample=0, then all the data for sample=1, and so on. This is called a row-store, and is the usual way that data is stored in an RDBMS. Recent work on database column stores, where the data for a single column is stored together, has shown significant storage improvements in some cases. Interestingly, this disk storage strategy corresponds to our example netCDF file when the sample dimension is not unlimited. Of course Im sweeping a lot of details under the rug in these comparisons.

In the netCDF classic model, one is limited to a single unlimited dimension and therefore a single table per file. In the netCDF extended model, implemented in the CDM and in netCDF-4 files, one can explicitly use structures (aka compound types in the netCDF-4 C library API) to mimic tables. Using the CDM notation here, we could rewrite the example CDL as:

netcdf point {   dimensions:    sample=UNLIMITED; // currently 2344   variables:    structure {      float lon;      float lat;      int time;      float data;      float data2;    } sample(sample)  }

where I am leaving out the attribute metadata for clarity. So we have an array of structures with each structure having the same columns as in the relational table above. We can make this look even more like tables by using CDM sequences (aka vlen of compound types in the netCDF-4 C library API) :

netcdf point {   variables:    structure {      float lon;      float lat;      int time;      float data;      float data2;    } sample(*)  }

The (*) means an unlimited number of structures in this example. There can be any number of these in a netCDF-4 file. Suppose for example, that the data is actually located at a small number of stations, and you'd like to not store the station information more than once. One might use:

netcdf point {   variables:    structure {      float lon;      float lat;      int id;      String name;      String desc;    } station(*)    structure {      int time;      float data;      float data2;      int station_id;    } sample(*)  }

where each data point defines which station it belongs to through the station_id. A somewhat more natural way to associate these in a netCDF file is to use the station index instead of the id.

Its pretty obvious that the corresponding SQL is something like:

CREATE TABLE station( lat REAL, lon REAL, PRIMARY KEY id INT, name VARCHAR(80, desc VARCHAR(200))  CREATE TABLE point( data REAL, data2 REAL, time DATE, KEY station_id INT )

where I've put an index on the station id and data station_id fields to allow fast lookup of all the data for a station. These indexes, of course, are where relational databases offer significant advantages over netCDF files, although at some cost in complexity. But the complexity is hidden from the user doing data access, eg get all data for station 'KDEN':

SELECT * FROM station, point WHERE station.id = point.station_id AND station.id = 'KDEN'

Very nice. Compare this to the 42-page CF convention for discrete sample data, where we limited ourselves to the netCDF classic model and variable names have to be 6 letters or less and all caps.

(Apologies to those of you who actually know SQL. I havent used it since 1967, when we had to use manual wall switches from Ace hardware for logic gates. Thank God we now have Home Depot ;^)

Posted by $entry.creator.screenName