DAP4 Commentary: DDX Lexical Elements

This document describes the lexical elements that occur in the DAP4 grammar.

Within the Relax-NG (rng) DAP4 grammar, there are markers for occurrences of primitive type such as integers, floats, or strings. The markers typically look like this when defining an attribute that can occur in the DAP4 DDX.

<attribute name="namespace"><data type="string"/></attribute>
The "<data type="string"/>" specifies the lexical class for the values that this attribute can have. In this case, the namespace attribute is defined to have a String value. Similar notation is used for values occurring as text within an xml element. The lexical specification later in this document defines the legal lexical structure for such lexical items. Specifically, it defines the format of the following lexical items.
  1. Constants, namely: string, float, integer, and character.
  2. Identifiers

The specification is written using the ISO/IEC 9945-2:2003 Information technology -- Portable Operating System Interface (POSIX) -- Part 2: System Interfaces. This is the extended Posix regular expression specification.

I have augmented it in the following ways.

  1. Names are assigned to regular expressions using the notation
    name = regular-expression

  2. Named expressions can be used in subsequent regular expressions by using the notation {name}. Such occurrences are equivalent to textually substituting the expression associated with name for the {name} occurrence: More or less like a macro.

DAP4 Lexical elements

Notes:
  1. The definition of {UTF8} is deferred to the next section.

  2. Comments are indicated using the "//" notation.

  3. Standard xml escape formats (&xDD) are assumed to be allowed anywhere.

Basic character set definitions

CONTROLS   = [\x00-\x1F] // ASCII control characters
WHITESPACE = [ \r\t\f]+
HEXCHAR = [0-9a-zA-Z]
// ASCII printable characters
ASCII = [0-9a-zA-Z !"#$%&'()*+,-./:;<=>?@[\\\]\\^_`|{}~]

Ascii characters that may appear unescaped in Identifiers
This is assumed to be basically all ASCII printable characters except the characters ' ', '.', '/', '"', ''', and '&'. Occurrences of these characters are assumed to be representable using the standard xml '&xx;' notation.

IDASCII    = [0-9a-zA-Z!#$%'()*+,-:;<=>?@[\\\]\\^_`|{}~]

The numeric classes: integer and float

INTEGER    = {INT}|{UINT}|{HEXINT}
INT = [+-][0-9]+{INTTYPE}?
UINT = [0-9]+{INTTYPE}?
HEXINT = {HEXSTRING}{INTTYPE}?
INTTYPE = ([BbSsLl]|"ll"|"LL")
HEXSTRING = (0[xX]{HEXCHAR}+)

FLOAT = ({MANTISSA}{EXPONENT}?)|{NANINF}
EXPONENT = ([eE][+-]?[0-9]+)
MANTISSA = [+-]?[0-9]*\.[0-9]*
NANINF = (-?inf|nan|NaN)

The Character classes

STRING     = ([^"\&]|{XMLESCAPE})*
CHARACTER = ([^'\&]|{XMLESCAPE})

Note that the character type only supports ASCII characters because it can only hold a single 8-bit byte.

The Identifier class

ID         = {IDCHAR}+
IDCHAR = ({IDASCII}|{XMLESCAPE}|{UTF8})
XMLESCAPE = &x{HEXCHAR}{HEXCHAR};

Note that the above lexical element classes are not disjoint. For example, the sequence of characters 1234 can be either an identifer,a float, or an integer. So the order of testing is assumed to be this.

  1. INTEGER
  2. FLOAT
  3. ID
  4. STRING

UTF-8 Character Encodings

We discuss UTF-8 character encoding in the context of this document. http://www.w3.org/2005/03/23-lex-U.

The most correct (validating) version of UTF8 character set is as follows.

UTF8 =   ([\xC2-\xDF][\x80-\xBF])     
| (\xE0[\xA0-\xBF][\x80-\xBF])
| ([\xE1-\xEC][\x80-\xBF][\x80-\xBF])
| (\xED[\x80-\x9F][\x80-\xBF])
| ([\xEE-\xEF][\x80-\xBF][\x80-\xBF])
| (\xF0[\x90-\xBF][\x80-\xBF][\x80-\xBF])
| ([\xF1-\xF3][\x80-\xBF][\x80-\xBF][\x80-\xBF])
| (\xF4[\x80-\x8F][\x80-\xBF][\x80-\xBF])
The lines of the expression cover the UTF8 characters as follows:
  1. non-overlong 2-byte
  2. excluding overlongs
  3. straight 3-byte
  4. excluding surrogates
  5. straight 3-byte
  6. planes 1-3
  7. planes 4-15
  8. plane 16

Note that ASCII and control characters are not included.

The above reference also defines some alternative regular expressions.

The most relaxed version of UTF8 is this.

UTF8 = ([\xC0-\xD6].)
|([\xE0-\xEF]..)
|([\xF0-\xF7]...)

The partially relaxed version of UTF8 is this.

UTF8    = ([\xC0-\xD6][\x80-\xBF])        
| ([\xE0-\xEF][\x80-\xBF][\x80-\xBF])
| ([\xF0-\xF7][\x80-\xBF][\x80-\xBF][\x80-\xBF])

We deem it acceptable to use this last relaxed expression for validating UTF-8 character strings.

DAP4 Commentary: DAP4 Grammar

[Version: 1.0]

At the end of this document are instructions for accessing and testing a formal grammar for the DAP4 DDX using the Relax-NG schema language. I constructed it initially without any reference to any other explicit or implicit grammars so I could record my ideas. I have since modified it based on examining James' implied grammar and from comments from others and from a comparison with the xsd grammar.

Differences with DAP4 xsd Grammar

I converted the xsd grammar to an equivalent relax-ng (rng) grammar.

One major difference I see is in dimension handling.

  • I just used the name "dimension" rather than "shareddimension". For me, all dimensions (except anonymous ones) are shared.

  • The xsd separates out scalars from arrays. I always allowed the dimensions for a variable to be optional to handle the scalar case.

  • I attempted to be as consistent as possible, so I allowed any type including sequences and structures to be dimensioned. (but see previous commentary).

  • The dimensions of a variable are currently specified in the rng grammar as a sequence of elements named "Dimension" contained in the "variables" element type.
Other differences:
  • The Dataset element in the xsd has a couple of extra attributes. I added these.

  • The xsd appears to allow attributes to themselves have attributes. This needs discussion.

  • The URL basetype is in the xsd. But I do not see the justification for keeping it.

  • It appears that the Dataset contains a top level declaration. I chose to treat the Dataset itself as the top-level group.

  • Attribute declarations appear to have their own "namespace" attribute. Not sure why this is needed.

  • I do not understand the purpose of the "NewAttribute" attribute.

  • There may still be some minor differences in representing coordinate variables.

  • The xsd represents attribute values thus:
    <attribute name="a"><value>...</value><value>...</value></attribute>
    I chose to use attributes in the multi-valued case because I prefer not to use elements with content unless really necessary. So I represented the above as this.
    <attribute name="a"><value value="..."/><value value="..."/></attribute>

  • There is an issue of interleaving of definitions, or equivalently, what elements must occur in a fixed order.

  • Where should attributes be legal? I think the rng grammar and the xsd grammar agree on this: putting them almost everywhere, but it needs discussion.
  • I dropped Blobtype. I fail to see the need for this.

 

Testing the Relax-NG Grammar

You will need to copy three files:
  1. dap4.rng - this is the grammar file. it uses the Relax-NG schema language This grammar file can be obtained from http://dl.dropbox.com/u/53929684/dap4.rng.

  2. test.xml - this is a test file that I am growing to cover the whole grammar. This can be obtained from http://dl.dropbox.com/u/53929684/test.xml.

  3. jing.jar - Jing is a validator that takes the grammar and a test file and checks that the test file conforms to the grammar. This can be obtained from http://dl.dropbox.com/u/53929684/jing.jar.
To use this jar file, do the command:

 

java -jar jing.jar dap4.rng test.xml
No output is produced if the validation succeeds, otherwise, error messages are produced.

DAP4 Commentary: Possible Notation for Server Commands

Looking to the future, it is clear that eventually our query language, or more generically our previous discussion of URL Annotations must encompass three classes of computations.
  1. Queries in the DAP2 sense,
  2. Commands to control the client-side processing of requests on the server (i.e. thing like caching),
  3. Server-side processing.
I want to propose a notation for everything in the URL after the "?". I think this notation has ability to represent a wide variety of features without, I hope, being too generic.

The notation is basically nested functions combined with single assignment variables. A semantically nonsensical, but grammatical example would look something like this: "?_x=f(17,g(h(12))),f2(_x,[0:3:10])".

Everything past the "?" is in the form of a comma separated list of nested function invocations. Anything that begins with an underscore is considered a local, temporary, variable, anything that does not look like a function call (i.e. that is not a name followed immediately by a left parenthesis) is assumed to be a string constant. Each function has an arbitrary number of argument expressions separated by commas.

There would be several semantic rules.

  1. A variable may only be assigned to once (single assignment), but may be referenced as many times as desired after that.

  2. All functions have a defined "return type", which looks like a legal DDX minus certain things like groups, enumeration declarations, and dimension declarations. In addition, a function may be defined to have a "void" return type, which means it is executed for its side-effects on the server.

  3. Any expression that is not assigned to a variable and does not have a void return type will have its return value returned to the caller as part of a DATADDX.

Notes

My hypothesis is that this notation should also be able to handle most kinds of server side processing by defining and composing functions.

The standard projection+selection constraints of DAP2 can be represented using a special query() function whose argument is the standard DAP2 constraint, or alternatively, one could define a collection of nested functions to do the same thing, or alternatively, we could split the query part into two pieces separated by a semicolon. The first piece would be a constraint expression and the second piece (after the semicolon) would be in the nest function call form defined above.

An important aspect has to do with the construction of what may be referred to as a DATADDX. It defines the structure of a DDX that is the composition of the return types of the invoked functions that will return a (possibly structured) value. I need to work this out. BUT, in any case, the resulting DATADDX may have only have a loose relation to any DDX representing the raw dataset. This is because server-side computations will not have been represented in the original DDX, but only in the DATADDX.

I also hypothesize that Ferret notations

 http://.../thredds/dodsC/hfrnet/agg/6km_expr_{}{let deq1ubar=u[d=1,l=1:24@ave]}
could be represented in my proposed function notation without having to clutter up the URL format.

DAP4 Commentary: Sequences and Vlens

"Oh what a tangled web we weave when first we practice to build a type system."
Recently, I made the claim to James that the Sequence construct could serve to represent CDM/HDF5  vlen constructs.

His reply was as follows:

In my opinion we will want vlens to be in the data model, if not as a type, then as a feature of arrays - see the schema and my text on the Data Model page. The reason for that [is that] I have already tried using Sequence for vlen (in hdf4) and it was a failure. Not a failure from the technical POV but because neither server writers nor client writers nor client users ever 'got it.' I think one part of the problem for those people was that vlens in HDF4 do not support relational operators via the API while Sequences are supposed to. So the conceptual mismatch doomed the idea.

This is a very important observation, and has caused me to re-think how sequences and vlens should be handled in DAP4.

Vlens

Let us start by addressing the addition of vlens to the DAP4 data model.

Some possible ways to insert vlens into the dap4 data model include the following.

  1. In CDM, a vlen is marked by a "*" as a dimension name in the set of dimensions associated with a variable. the list of dimensions is allowed to have any number of occurrences of "*" [Aside, I will note that "unlimited" is also an option for CDM, but should not be needed for DAP4 because at the time of a request for data, the size of the unlimited dimension is known].

  2. James has proposed something similar except that the "*" is restricted to occurring as the last dimension.

  3. Another possibility is to create a new container object, call it Vlen, that (like Sequence) is inherently of variable length and is not dimensionable. [Aside: The term "vlen" is kind of odd. the term "list" would actually make more sense]

If we were to choose option 2 (terminal * only), then we must address the question of translation between CDM and DAP4. Going from DAP4 to CDM is straightforward because having the last dimension be "*" is legal in CDM. Going from CDM to DAP4 requires the introduction of a number of Structure elements.

Consider the following CDM example (using a pseudo-syntax)

 Int32 v[d1][*][d2][*][d3].
This would have to be represented in DAP4 something like this.
  Structure v_1 {
Structure v_2 {
Int32 v[d3];
}[d2][*];
}[d1][*];

In the lucky case that the last dimension is a already a "*":

  Int32 v[d1][*][d2][*];
then we have a simpler representation.
  Structure v_1 {
Int32 v[d2][*];
}[d1][*];

Commentary

  • As a personal matter, I would prefer to use the CDM representation. Adding a new container type (Vlen),while appealing semantically, only complicates the model more. James' approach requires the use of additional Structure definitions which, in my opinion, obscures the underlying semantics.

  • One thing that I need to check is how this affects the proposed on the wire format.

Sequences

Nathan Potter noted the following.
At one time we considered using ''nested'' sequences as a representation of an SQL database, where essentially the keys that link the tables in the DB define the structure of some nested sequence thing. I think it's a useful idea, but there is no implementation of it to play with. [Aside: I could swear that someone told me that they actually used a relational database as the back end to a sequence]
And later Nathan Potter noted the following:
I mentioned (in the same email that contained the nested sequence comment quoted above) that we wrote and released a server that represents a single database table or view as a single sequence. This is quite different from the point that I was making in the section quoted above that there may be a use case for nested sequences. (which was in response to your question "Are there any other legitimate examples for using NESTED sequences.") [Ndp 12:08, 26 February 2012 (PST)]

It is this ability to have selection constraints applied that separates sequences from vlens. It is clear that in the absence of selection constraints, there is no essential difference between sequences and vlens. Further, in the few examples I could find or were sent to me, it seemed that nested sequences were being used as, in effect, vlens.

So, we seem to have two very similar concepts (vlen and sequence), which complicates the DAP4 model. The question for me is:

Do we get rid of the Sequence concept, or at least define it as equivalent to the following?: Structure {...} [*].
My current belief is that we should keep sequences but with the following restrictions:
  1. Sequences can only occur as top-level, scalar, variables within a group.
  2. Sequences may not be nested in any other container (i.e. other sequences or structure)
This keeps sequences for the original purpose of acting as "relations". All other places where we might use a sequence before will now use a vlen.

As with vlens, the translation between DAP4 and CDM needs to be addressed.

  • The conversion from DAP4 to CDM can be addressed using the rule above, namely that a sequence is, in CDM, represented as
    Structure {...} [*]
    .

  • Translation from CDM to DAP4 allows for the option of never using sequences, but always using vlens. An alternate translation might be to say that if you have a top-level CDM structure whose only dimension is a vlen, then translate that to a sequence (in effect inverting the DAP4->CDM translation).

DAP4 Commentary: Characterization of URL Annotations

Characterization of URL Annotations

Requests for data using the DAP4 protocol will require a significant number of annotations specifying what is to be retrieved, commands to the server, and commands to the client.

This document is intended to just describe the information with which URLs need to annotated based on past experience. It also enumerates the possible URL components that can be used to encode the annotations. I will consider a specific encoding in a separate document.

Looking at the DAP2 URLs, we see three classes of annotations: protocol, server commands, client commands, and queries (aka constraints).

Protocol

For DAP2, the fact that the DAP2 protocol being used is inferred from context. In netcdf-C, for example, the fact that the dataset name is a URL is sufficient to indicate the use of the DAP2 protocol (although that will change). For some servers, such as TDS, the protocol is also inferrable from elements of the URL path. For example, in the URL
http://.../thredds/dodsC/...
the "dodsC" indicates the use of the DAP2 protocol. TDS also supports a schema called "dods:" that also indicates the use of DAP2.

Server Commands

Server commands in DAP2 are appended to the dataset URL to indicate attributes of the request. For example:
http://test.opendap.org/dataset.nc.dds
The defined kinds of server commands for DAP2 are as follows:
  • Component requests: ".dds", ".das"
  • Data requests with format: ".dods", ".asc"
  • Miscellaneous: ".html", ".ver"

Client commands

Client commands are interpreted by the client-side library to specify actions to be performed by the library. The existence of client commands is important because we want to communicate from the user to the library without requiring any knowledge by intermediate code layers. For example, netcdf C tools such as ncdump send URLs to the underlying netcdf DAP2 library without having to be cognizant of their structure. Currently, the primary use for client commands is caching, to indicate the degree of caching and prefetch to be used with a given request to the server.

Currently, client commands are represented as "name=value" pairs or just "name" enclosed in square braces: "[nocache]", for example. These commands are prefixed to the URL such as this.

[show=fetch]http://test.opendap.org/dataset.nc
The legal set of client commands is client library specific.

One notable problem with this form of client command is that it prevents generic URL parsers from parsing the URL because, of course, the square bracket notation is non-standard.

It should be noted that an alternative to using client commands in the URL is to use a configuration file (often referred to as the ".rc" file such as ".dodsrc"). This configuration file is assumed to be either in the caller's home directory or in the current working directory. It contains the necessary client commands to be applied. It is mildly less convenient for the user to use a .rc file than to embed a client command in the URL.

Queries

The third class of URL annotations specifies some form of query to control the information to be extracted from a dataset on the server. This information is then passed back to the client.

In DAP2, queries consisted of projections and selections specifying a subset of the data in a dataset.

A projection represents a path through the DDS parse tree annotated with constraints on the dimensions. For example, this query: "?P1.P2[0:2:10].F[1:3][4:4]".

A selection represents a boolean expression to be applied to the records of a sequence. Syntactically, a selection could cross sequences, thus implying a join of the sequences, but in practice this diss not allowed.

DAP2 queries also allowed the use of functions in the projections and selections to compute, for example, sums or averages. But the semantics was never very well defined. The set of allowable functions is server dependent.

Annotation Mechanisms

DAP4 will need to support at least the three classes of annotations described above. Whatever annotation mechanisms are chosen, the following properties seem desirable.
  • The resulting URL should be parseable by generic URL parsers =>Client commands should be embedded at the end of URLs, not the beginning.
  • Whatever annotation encoding is used, it is desirable if it is as uniform as possible.
As mechanisms, we have the following available to us:
  • The URL schema -- "http:" for example, or the TDS "dods:" schema. Using this is somewhat undesirable because it would need to encode also an underlying encrypted protocol like https: (versus http:).

  • URL path elements such as the current use of e.g. http://host/../dodsC/... by TDS.

  • URL query -- everything after the first '?' in the URL. URL queries technically have a defined form as name=value pairs, but in practice are pretty much free form.

  • URL fragment -- everything after the last '#'. Again these are pretty much free form.

  • Filename extensions -- everything between the data set name in the path and the start of the query. The DAP2 ".dds" and ".dods" are examples of this.

  • Alternate extension formats. Ethan Davis has proposed the use of a "+" notation instead of filename extensions: "+ddx+ascii", for example. This has the advantage of clearly not being confused with filename extensions while also making clear the additive nature of such annotation.
I should note that the Ferret server has taken to seriously abusing the URL format with URLs like this.
http://.../thredds/dodsC/hfrnet/agg/6km_expr_{}{let deq1ubar=u[d=1,l=1:24@ave]}
so we have much to aspire to :-)
Unidata Developer's Blog
A weblog about software development by Unidata developers*
Unidata Developer's Blog
A weblog about software development by Unidata developers*

Welcome

FAQs

News@Unidata blog

Take a poll!

What if we had an ongoing user poll in here?

Browse By Topic
  • feed AWIPS (17)
Browse by Topic
« March 2012 »
SunMonTueWedThuFriSat
    
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
28
29
30
31
       
Today