An Essay on Domain Specific Models

A domain specific model is one which has constructs that are specific to a particular domain. Examples are the CDM Coordinate System and Scientific Feature Type models, which augment CDM with features such as lat-lon based indexing, grids, and point data.

It is often the case that a domain specific model is implemented by providing an API that in turn is implemented with respect to some underlying, more generic data model. Again, CDM is an example, where the Coordinate System and Scientific Feature Type models are implemented using the underlying CDM Data Access Layer model.

DAP4 also provides a good, generic model capable of supporting domain specific models. In fact, it should be possible to implement the equivalent of the CDM Coordinate System and Scientific Feature Type models on top of DAP4.

Figure 1. Notional Domain Model over DAP4 Architecture

Figure 1 shows a notional architecture for using DAP4 as the basis for domain specific modeling.

The client is given an API supporting a domain specific model. It is assumed here that an instance of the domain model meta-data is represented as a traversable abstract syntax tree. So, the model API would provide the following operations.

  1. Request the meta-data for a specific dataset. The result would be a reference to the root of the abstract tree for that meta-data. This is analogous to asking DAP2 for a DDS.
  2. Traverse the meta-data tree.
  3. Request some subset of the data associated with the dataset. The request would be in terms of the domain-specific model.
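
As a rough illustration of operation 2, the following C sketch models the meta-data as a first-child/next-sibling tree and walks it depth first. The names meta_node, meta_traverse, and meta_count are hypothetical and do not come from any existing CDM or DAP4 library. Operations 1 and 3 would need a server round trip and are omitted.

```c
#include <stddef.h>

/* Hypothetical node of a domain-model meta-data tree (all names assumed). */
struct meta_node {
    const char *name;               /* e.g. "grid", "lat", "lon" */
    const struct meta_node *child;  /* first child, or NULL */
    const struct meta_node *next;   /* next sibling, or NULL */
};

/* Callback invoked once per node during traversal. */
typedef void (*meta_visitor)(const struct meta_node *node, int depth, void *state);

/* Operation 2: depth-first, pre-order traversal of the tree. */
static void meta_traverse(const struct meta_node *root, int depth,
                          meta_visitor visit, void *state)
{
    for (const struct meta_node *n = root; n != NULL; n = n->next) {
        visit(n, depth, state);
        meta_traverse(n->child, depth + 1, visit, state);
    }
}

/* Example visitor: counts the nodes it is shown. */
static void count_visitor(const struct meta_node *node, int depth, void *state)
{
    (void)node;
    (void)depth;
    ++*(size_t *)state;
}

/* Convenience wrapper: number of nodes reachable from root. */
static size_t meta_count(const struct meta_node *root)
{
    size_t n = 0;
    meta_traverse(root, 0, count_visitor, &n);
    return n;
}
```

A client would substitute its own visitor, for example one that prints each node's name indented by depth.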

The domain-specific library implementing the API would be responsible for making requests to the server for information. Those requests would indicate to the server the domain model being used as well as the dataset being requested.

The reply from the server would be in the form of a DAP4 DDX and data with annotations (i.e. attributes and extra variables) sufficient to allow the client-side domain model library to convert the returned information to the domain model for presentation to the client.

The server side operation is similar. It is assumed that a request from a client contains sufficient identifying information to allow the server (a servlet server such as Tomcat) to forward the request to the servlet capable of interpreting such a domain specific model request.

The domain specific model servlet is capable of translating the dataset into DAP4 as the reply to the client's request. As expected by the client, the reply is annotated with sufficient information to allow the generic DAP4 reply to be converted to the domain specific model.

Dennis Heimbigner

GRIB renaming in 4.3

What's in a name?

Unidata Users Committee GRIB recommendations for WMO consideration

The Unidata Users Committee has asked the WMO to consider establishing a web registry of GRIB and BUFR tables. Your comments on this idea would be helpful. And if you have any influence with the WMO, please use it!

DAP4 Commentary: DAP4 On-The-Wire Format

Background

The current DAP2 clients use two different approaches to managing the packet of data that is sent by the server.

The C++ libdap library uses what I will call an "eager" evaluation method. By this I mean that the whole packet is processed when received, is decomposed into its constituent parts (e.g. data arrays, sequence records, etc) and those parts are used to annotate the parsed DDS.

In contrast, the oc library uses a "lazy" evaluation method. That is, the incoming packet is sent immediately into a file or into a chunk of heap memory. Almost no preprocessing occurs. Data extraction occurs only when requested by the user code through the API.

Problem addressed

The relative merits and demerits of lazy versus eager are well known and will not be repeated here.

Lazy evaluation of the DAP2 packet is hampered by the inlining of variable length data: sequences and strings specifically. If it were not for those, the lazy evaluator could compute directly the location of the desired subset of data as requested by the user, and do so without having to read any intermediate information. But when, for example, Strings are inlined, then it is necessary to walk the packet piece by piece to step over the strings.
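
The difference can be sketched in C. The layout assumed below for an inlined string (a 4-byte length in host byte order, then the bytes, padded to a multiple of four) is an XDR-like simplification chosen for illustration, and both function names are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Fixed-size elements: the offset of element `index` is one multiplication. */
static size_t fixed_offset(size_t index, size_t elem_size)
{
    return index * elem_size;
}

/* Inlined strings: reaching string `index` means stepping over every
 * earlier string. Returns the offset of its length field, or (size_t)-1
 * if the buffer ends first. Layout and byte order are assumptions. */
static size_t inlined_string_offset(const unsigned char *buf, size_t buflen,
                                    size_t index)
{
    size_t pos = 0;
    for (size_t i = 0; i < index; i++) {
        uint32_t len;
        if (buflen - pos < sizeof len) return (size_t)-1;
        memcpy(&len, buf + pos, sizeof len);
        pos += sizeof len;
        size_t padded = ((size_t)len + 3u) & ~(size_t)3u;
        if (buflen - pos < padded) return (size_t)-1;
        pos += padded;
    }
    return pos;
}
```

The first function costs the same for any index; the second reads every earlier length field, which is the walk that a lazy evaluator would like to avoid.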

I plan to use lazy evaluation for my implementations of DAP4, and propose here the outline of a format for the on-the-wire data packet that makes lazy operation fast and simple without, I believe, interfering with eager evaluation.

Proposed solution

Since we have previously agreed on the use of multipart-mime, the incoming data is presumed to be a sequence of variable length parts with a unique id for each part and (optionally) a known length for each part. In order to accommodate streaming, the length is allowed to have the value -1, which indicates that the length was unknown at the time the part was sent out by the server and must be computed by the client when received.

Under these assumptions, I propose the format described in this grammar. That grammar has a number of semantic (context sensitive) constraints not representable in the given context-free grammar. In order to disambiguate the grammar, I had to add some extra tokens (BOA, EOA, BOS, EOS) to delimit certain unbounded lists. In a real implementation, the equivalent of these tokens would be handled by the enforcement of the semantic constraints.

Notes:

  1. The concept of group does not appear in the grammar. This is because a group is a lexical notion and thus only affects names, not the structure of the on-the-wire format.

  2. The format presented here supports a limited form of self-description in that it includes tags and counts sufficient to allow parsing an instance of the format.

Grammar Overview

A narrative description of the grammar is as follows. The names enclosed in {...} are non-terminals in the grammar.

{request}
The complete on-the-wire {request} consists of a {mainpart} followed by zero or more {sequencepart} instances.

{mainpart}
The {mainpart} consists of a {partheader} followed by a {structure} followed by a {stringannex} part. The {mainpart} instance is assumed to be a multi-part mime part, and the {structure} represents a single, top-level dataset.

The {mainpart} is of known computable length. That is, its size can be computed solely knowing the DDX for the incoming data. This means that strings and sequences are not represented inline, but instead are represented by "pointers" into subsequent {sequencepart} and {stringannex} parts that contain the sequence records and/or string data. Note that the "pointers" are actually the unique ids of those parts.

{partheader}
The {partheader} is at the beginning of every multipart-mime part. It includes a unique identifier, which is the multipart-mime unique identifier. The {partheader} also contains the length in bytes of the data part of the mime part. If this is -1, then the length is unknown and must be computed by the client. The {partheader} also contains a {parttype} that indicates the type of the part.
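
A minimal C sketch of how a client might hold a {partheader} in memory and handle the -1 length follows. The field widths, the numeric id, and the function names are assumptions made for illustration; the text does not fix them.

```c
#include <stdbool.h>
#include <stdint.h>

enum parttype { mainpart = 1, sequencepart = 2, stringannex = 3 };

#define PART_LENGTH_UNKNOWN (-1)

/* In-memory view of a {partheader}; the field widths are assumptions. */
struct partheader {
    uint64_t id;         /* multipart-mime unique identifier (assumed numeric) */
    int64_t length;      /* bytes of data in the part, or -1 if unknown */
    enum parttype type;  /* kind of part that follows */
};

static bool part_length_known(const struct partheader *h)
{
    return h->length != PART_LENGTH_UNKNOWN;
}

/* When the server streamed the part, the client derives the length itself,
 * here from the byte offsets at which the part's data starts and ends. */
static int64_t part_effective_length(const struct partheader *h,
                                     int64_t data_start, int64_t data_end)
{
    return part_length_known(h) ? h->length : data_end - data_start;
}
```

A lazy client can skip directly to the next part when the length is known, and must scan for the multipart boundary only in the streamed case.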

{parttype}
The {parttype} defines the type of part and is equivalent to this enumeration.
enum parttype {mainpart=1, sequencepart=2, stringannex=3};

{structure}
A {structure} consists of a {tag} and a {fieldlist}. The {tag}'s {name} gives the name attribute value of the DDX element for this structure. The {tag}'s {typecode} indicates that this is a {structure} and the {tag}'s {count} gives the number of fields in the structure. The {tag} is part of the self description feature.

{tag}
A {tag} consists of a {name}, a {typecode} and a {count}. The {name} gives the name attribute of some DDX element. The {typecode} gives the type of the following component of the part. The count gives the number of items in that component.

It should be noted that the tag could be removed from the on-the-wire format because it is reconstructable from the DDX. Removing it, however, would mean that the format is not parseable at all without knowing the DDX.

{name}
A {name} is a character array of size 16. The purpose of the name is solely to aid in self-description, so it is an inlined fixed size string that is presumed to represent the 16 character prefix of the name attribute of some DDX element.

{count}
A {count} is a 64-bit unsigned integer.

{typecode}
A {typecode} is a 32-bit unsigned integer encoding the equivalent of the following enumeration listing all the defined DAP4 primitive data types plus some structural types (structure,sequence,record).
enum typecode {char=1,
int8=2,int16=3,int32=4,int64=5,
uint8=6,uint16=7,uint32=8,uint64=9,
float32=10,float64=11,
opaque=12,string=13,
structure=14,sequence=15,record=16};
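
Putting {name}, {typecode}, and {count} together, a {tag} could be modeled in C roughly as below. The struct layout, the NUL fill for names shorter than 16 bytes, and the helper make_tag are assumptions for illustration. Note that cutting a UTF-8 name at 16 bytes can split a multi-byte character, which this sketch does not guard against.

```c
#include <stdint.h>
#include <string.h>

#define TAG_NAME_LEN 16

/* Illustrative subset of the typecode values listed above. */
enum { TC_INT32 = 4, TC_STRUCTURE = 14 };

/* In-memory {tag}; a compiler may pad this struct, so it is not itself
 * the wire layout. */
struct tag {
    char name[TAG_NAME_LEN];  /* first 16 bytes of the DDX name, NUL-filled */
    uint32_t typecode;        /* 32-bit unsigned typecode */
    uint64_t count;           /* 64-bit unsigned item count */
};

/* Bytes a tag occupies on the wire if its fields are packed back to back. */
enum { TAG_WIRE_SIZE = TAG_NAME_LEN + 4 + 8 };

/* Hypothetical helper: builds a tag, keeping only the 16-byte name prefix. */
static struct tag make_tag(const char *ddx_name, uint32_t typecode,
                           uint64_t count)
{
    struct tag t;
    memset(&t, 0, sizeof t);
    size_t n = strlen(ddx_name);
    if (n > TAG_NAME_LEN) n = TAG_NAME_LEN;
    memcpy(t.name, ddx_name, n);
    t.typecode = typecode;
    t.count = count;
    return t;
}
```

Because every tag has the same size, a reader can step from tag to tag without consulting the DDX, which is the self-description property the format aims for.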

{fieldlist}
A {fieldlist} is a sequence of {field}s. The number of fields was defined by the count in the tag for the enclosing {structure}.

{field}
A {field} consists of a {tag}, which gives the field name, a {typecode} indicating the type of the field, and a {count}, followed by an {array}; the count specifies the number of elements in the {array}.

{array}
An {array}, generically, consists of the concatenation of one or more elements, where the elements all have the same {typecode} value, and whose length is defined by the {count} in the {tag} of the defining {field}.
  • For most of the simple types, the {array} is a simple concatenation of instances of the specified type.

  • For three of the simple types (char, int16, and uint16), the {array} is a simple concatenation of instances of the specified type that is then padded with ASCII NUL characters until the length of the array in bytes is a multiple of four; this is an accommodation to XDR.

  • For an array of opaque, the array is preceded by a count of the size of each opaque instance (this is separate from the count in the field of the total number of opaque instances). Each instance is an array of 8 bit bytes (uint8).

  • For an array of strings, the array is a concatenation of {stringref}s (see below).

  • For {structure} arrays, the array is a concatenation of the structure instances.

  • For {sequence} arrays, the array is a concatenation of {sequenceref}s.
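
The size rule for arrays of fixed-size primitives can be sketched in C as follows. The function names are hypothetical. The sketch pads only the three types named in the bullet above; int8 and uint8 are not named there, so they are left unpadded here and the condition would need adjusting if they are meant to follow the same rule.

```c
#include <stdint.h>

/* Element size in bytes for the fixed-size primitive typecodes (1..11);
 * returns 0 for types whose array is not a plain concatenation. */
static uint64_t elem_size(uint32_t typecode)
{
    switch (typecode) {
    case 1: case 2: case 6:  return 1;  /* char, int8, uint8 */
    case 3: case 7:          return 2;  /* int16, uint16 */
    case 4: case 8: case 10: return 4;  /* int32, uint32, float32 */
    case 5: case 9: case 11: return 8;  /* int64, uint64, float64 */
    default:                 return 0;
    }
}

/* The three types the text says are NUL-padded: char, int16, uint16. */
static int is_padded_type(uint32_t typecode)
{
    return typecode == 1 || typecode == 3 || typecode == 7;
}

/* Bytes occupied on the wire by an array of `count` elements. */
static uint64_t array_wire_bytes(uint32_t typecode, uint64_t count)
{
    uint64_t raw = elem_size(typecode) * count;
    if (is_padded_type(typecode))
        raw = (raw + 3u) & ~(uint64_t)3u;  /* round up to a multiple of 4 */
    return raw;
}
```

For example, five chars occupy eight bytes on the wire and three int16 values also occupy eight, while three int32 values need no padding.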

{stringref}
A {stringref} is a "pointer" to a string in the {stringannex}. It consists of an {offset}, which indicates the relative start of the string in the annex, and a {count} to indicate the length of the string in utf-8 bytes. Note that this count is technically redundant with respect to the same count in the string in the annex. Note also that the unique id of the annex is not needed because the string annex part is presumed to immediately follow the part containing the {stringref}.
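
One way a lazy client might resolve a {stringref} is sketched below in C. The 64-bit width of {offset} is an assumption, as is the reading that the offset addresses the start of the {string} instance, so that the characters begin after its 8-byte {count}. The function name is hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

/* A {stringref}: where the string starts in the annex and how long it is. */
struct stringref {
    uint64_t offset;  /* start of the {string}, relative to the annex data */
    uint64_t count;   /* length of the string in UTF-8 bytes */
};

/* Returns a pointer to the string's bytes inside the annex, or NULL if the
 * reference falls outside it. Assumes `offset` addresses the {string}
 * instance, whose 8-byte {count} precedes the characters. */
static const unsigned char *resolve_stringref(const unsigned char *annex,
                                              uint64_t annex_len,
                                              struct stringref ref)
{
    const uint64_t count_size = 8;
    if (ref.offset > annex_len) return NULL;
    uint64_t remaining = annex_len - ref.offset;
    if (remaining < count_size || remaining - count_size < ref.count)
        return NULL;
    return annex + ref.offset + count_size;
}
```

No earlier string in the annex is read, which is the point of carrying the offset in the reference.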

{sequenceref}
A {sequenceref} is a concatenation of "pointers", where each "pointer" gives the unique id of some corresponding {sequencepart} instance in some following multi-part mime part.

{stringannex}
A {stringannex} is a part and consists of a {partheader} followed by a {stringlist}. Each {mainpart} or each {sequencepart} part is presumed to be followed immediately with an associated {stringannex} part. The annex holds the content of all the strings referenced in that preceding part. If the annex is empty, then it may be elided.

{stringlist}
A {stringlist} is a concatenation of {string} instances.

{string}
A {string} is a {count} followed by a {chararray} whose length is specified by that {count}. The string content is followed by padding of ASCII NUL characters sufficient to bring the total size of the {string} instance up to a multiple of four (an accommodation to XDR).
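
The size of a {string} instance, and from it the offsets a server would place in each {stringref}, can be computed as in this C sketch. The function names are hypothetical; the 8-byte count follows from {count} being a 64-bit unsigned integer.

```c
#include <stdint.h>

#define COUNT_SIZE 8u  /* a {count} is a 64-bit unsigned integer */

/* Total bytes of one {string}: the count, the characters, then NUL padding
 * so that the whole instance is a multiple of four bytes long. */
static uint64_t string_wire_bytes(uint64_t utf8_len)
{
    uint64_t raw = COUNT_SIZE + utf8_len;
    return (raw + 3u) & ~(uint64_t)3u;
}

/* Offset of string `index` within a {stringlist}, given the UTF-8 byte
 * length of each string in order. */
static uint64_t stringlist_offset(const uint64_t *lengths, uint64_t index)
{
    uint64_t off = 0;
    for (uint64_t i = 0; i < index; i++)
        off += string_wire_bytes(lengths[i]);
    return off;
}
```

The server pays for this walk once while writing the annex, so the client does not have to repeat it.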

{sequencepartlist}
A {sequencepartlist} represents the content of the sequences referenced in the {mainpart} followed by the transitive closure of the content of all nested sequence instances referenced in all the sequence parts.

{sequencepart}
A {sequencepart} is similar in form to a {mainpart} in that it consists of a {partheader} followed by data and followed by an optional {stringannex}. It differs from a {mainpart} in that its data is a variable number of {record} instances (i.e. a {sequence}). The number of records can be computed knowing the length given in the {partheader} and the size of a {record}. Note that the record's size is computable from the DDX because, as in a {mainpart}, all nested sequences and strings have been moved to subsequent parts.
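
The record count computation can be sketched in C as below. The sketch assumes the part's data consists of records only; if a leading {tag} and {count} are present in the part, their size would have to be subtracted first. The function name is hypothetical.

```c
#include <stdint.h>

/* Returns the number of records in a {sequencepart}, or -1 when it cannot
 * be derived: unknown part length (-1), a zero record size, or a data
 * length that is not a whole number of records. */
static int64_t record_count(int64_t data_length, uint64_t record_size)
{
    if (data_length < 0 || record_size == 0) return -1;
    if ((uint64_t)data_length % record_size != 0) return -1;
    return (int64_t)((uint64_t)data_length / record_size);
}
```

With the count in hand, record i starts at i times the record size, so a lazy reader can seek to any record directly.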

{sequence}
A {sequence} is a tag indicating the {name} of the sequence from the DDX, the {typecode}, which is always the sequence typecode, followed by a {count} of the number of records, and a {recordarray}.

{recordarray}
A {recordarray} is a concatenation of {record} instances.

{record}
A {record} is essentially identical in format to a single {structure}.

Discussion

The format described above assumes that the on-the-wire encoding is loosely based on the XDR encoding. However, some variations are assumed.
  • The encoding assumes "receiver makes it right", which means that the byte order on the wire is that of the sender, and that order may differ from that of the receiver. It is the duty of the receiver to determine if byte re-ordering is necessary and perform it as needed.

  • The above format includes short and unsigned short types. Normally, in XDR, all short/ushort instances are promoted to int/uint values. This encoding does not assume that, but in order to be consistent with the XDR four-byte rule, arrays of shorts are treated like arrays of bytes and are extended with ASCII NUL characters to a multiple of four bytes in length.
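
A receiver-side byte order fix-up might look like the following C sketch. How the receiver learns the sender's byte order is not specified in the text, so it is passed in here as a flag; that flag and the function names are assumptions.

```c
#include <stdint.h>

/* Reverse the byte order of a 32-bit value. */
static uint32_t swap32(uint32_t v)
{
    return (v >> 24) | ((v >> 8) & 0x0000FF00u) |
           ((v << 8) & 0x00FF0000u) | (v << 24);
}

/* Reverse the byte order of a 64-bit value using two 32-bit swaps. */
static uint64_t swap64(uint64_t v)
{
    return ((uint64_t)swap32((uint32_t)v) << 32) | swap32((uint32_t)(v >> 32));
}

/* "Receiver makes it right": swap only when the sender's order differs
 * from the host's. The sender's order arrives as a flag (an assumption). */
static uint32_t receive_u32(uint32_t wire, int sender_is_little_endian)
{
    const uint16_t probe = 1;
    int host_is_little_endian = *(const unsigned char *)&probe == 1;
    return (sender_is_little_endian != 0) == host_is_little_endian
               ? wire
               : swap32(wire);
}
```

When sender and receiver share a byte order, which is the common case, no per-value work is done at all.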

netCDF Identifiers and Character Escape Mechanisms (sigh!)

Ideally, netCDF should allow any printable UTF-8 character to be used in an identifier. Currently, that is almost the case, with forward slash being the exception because of the syntax of HDF5 identifiers.

More and more, the netCDF API is being used as a wrapper for a wide variety of other formats: HDF5, HDF4, GRIB, BUFR, DAP2, DAP4, etc. During the process of defining translations to/from netCDF and these other formats, it is necessary to implicitly or explicitly define netCDF identifiers from the schemas of these other formats.

The canonical example is HDF5. In HDF5, many API functions take a path, which is a sequence of identifiers separated by '/'. A path may be absolute ("/g1/g2/x") or relative ("y"). It appears that HDF5 provides no way to specify an identifier containing '/'; such cases are always interpreted as paths. So, if one naively defined, through the netCDF-4 API, a variable named "/x/y", there is no apparent way to actually get this defined properly in HDF5. It is this fact that has led to the current, IMO undesirable, restriction that netCDF identifiers may not contain '/'.

Super Escapes

This situation is going to recur as the netCDF API is used to wrap other data formats. What we will need is a mechanism by which we can convert an identifier containing arbitrary UTF-8 characters into another identifier in some rather restricted set of legal identifier characters. In addition, I would impose the rule that the conversion is invertible.

This kind of "super-escaping" is very hard because in the worst case, we are likely to encounter the situation where legal identifier characters are restricted to something like the alphanumerics plus underscore.
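
To make the worst case concrete, here is one possible invertible scheme, sketched in C. It is purely hypothetical and not a proposal from this post: ASCII alphanumerics pass through unchanged, and every other byte, including underscore itself, becomes an underscore followed by two uppercase hex digits. The names super_escape and super_unescape are invented for the example.

```c
#include <ctype.h>
#include <stddef.h>
#include <string.h>

static const char HEX[] = "0123456789ABCDEF";

/* Writes the escaped, NUL-terminated form of `in` to `out`.
 * Returns the escaped length, or (size_t)-1 if `out` is too small. */
static size_t super_escape(const char *in, char *out, size_t outsize)
{
    size_t o = 0;
    if (outsize == 0) return (size_t)-1;
    for (const unsigned char *p = (const unsigned char *)in; *p; p++) {
        int plain = *p < 0x80 && isalnum(*p);
        size_t need = plain ? 1 : 3;
        if (outsize - o <= need) return (size_t)-1;  /* keep room for NUL */
        if (plain) {
            out[o++] = (char)*p;
        } else {
            out[o++] = '_';
            out[o++] = HEX[*p >> 4];
            out[o++] = HEX[*p & 0x0F];
        }
    }
    out[o] = '\0';
    return o;
}

/* Value of an uppercase hex digit, or -1 if `c` is not one. */
static int hex_value(int c)
{
    if (c == 0) return -1;
    const char *p = strchr(HEX, c);
    return p ? (int)(p - HEX) : -1;
}

/* Inverse of super_escape. Returns the decoded length, or (size_t)-1 on
 * malformed input or if `out` is too small. */
static size_t super_unescape(const char *in, char *out, size_t outsize)
{
    size_t o = 0;
    if (outsize == 0) return (size_t)-1;
    while (*in) {
        if (outsize - o <= 1) return (size_t)-1;
        if (*in == '_') {
            int hi = hex_value((unsigned char)in[1]);
            int lo = hi < 0 ? -1 : hex_value((unsigned char)in[2]);
            if (hi < 0 || lo < 0) return (size_t)-1;
            out[o++] = (char)(hi * 16 + lo);
            in += 3;
        } else {
            out[o++] = *in++;
        }
    }
    out[o] = '\0';
    return o;
}
```

Under this scheme "/x/y" becomes "_2Fx_2Fy". The cost is visible immediately: names can triple in length, an escaped name may begin with a digit, and the result is hard for a human to read, which illustrates why the problem is awkward.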

Unidata Developer's Blog
A weblog about software development by Unidata developers*