Unidata Developer's Blog



DAP4 Commentary: Characterization of URL Annotations

27 March 2012

Characterization of URL Annotations

Requests for data using the DAP4 protocol will require a significant number of annotations specifying what is to be retrieved, commands to the server, and commands to the client.

This document only describes the information with which URLs need to be annotated, based on past experience. It also enumerates the possible URL components that can be used to encode the annotations. I will consider a specific encoding in a separate document.

Looking at the DAP2 URLs, we see four classes of annotations: protocol, server commands, client commands, and queries (aka constraints).

Protocol

For DAP2, the fact that the DAP2 protocol is being used is inferred from context. In netcdf-C, for example, the fact that the dataset name is a URL is sufficient to indicate the use of the DAP2 protocol (although that will change). For some servers, such as TDS, the protocol is also inferable from elements of the URL path. For example, in the URL

http://.../thredds/dodsC/...

the "dodsC" indicates the use of the DAP2 protocol. TDS also supports a URL scheme called "dods:" that likewise indicates the use of DAP2.

Server Commands

Server commands in DAP2 are appended to the dataset URL to indicate attributes of the request. For example:

http://test.opendap.org/dataset.nc.dds

The defined kinds of server commands for DAP2 are as follows:

Component requests: ".dds", ".das"
Data requests with format: ".dods", ".asc"
Miscellaneous: ".html", ".ver"
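As a rough illustration of how these suffixes attach to a dataset URL, here is a minimal Python sketch. The helper name with_server_command is hypothetical, and the sketch assumes the dataset URL carries no query part yet; only the suffixes themselves come from the list above.

```python
# Server-command suffixes taken from the list above.
DAP2_SUFFIXES = {".dds", ".das", ".dods", ".asc", ".html", ".ver"}


def with_server_command(dataset_url, suffix):
    """Append a DAP2 server-command suffix to a dataset URL.

    Hypothetical helper: assumes dataset_url has no '?query' part.
    """
    if suffix not in DAP2_SUFFIXES:
        raise ValueError("unknown DAP2 server command: " + suffix)
    return dataset_url + suffix


dds_url = with_server_command("http://test.opendap.org/dataset.nc", ".dds")
print(dds_url)
```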

Client commands

Client commands are interpreted by the client-side library to specify actions to be performed by the library. The existence of client commands is important because we want to communicate from the user to the library without requiring any knowledge by intermediate code layers. For example, netcdf C tools such as ncdump send URLs to the underlying netcdf DAP2 library without having to be cognizant of their structure. Currently, the primary use for client commands is caching, to indicate the degree of caching and prefetch to be used with a given request to the server.

Currently, client commands are represented as "name=value" pairs or just "name" enclosed in square brackets: "[nocache]", for example. These commands are prefixed to the URL, as in this example:

[show=fetch]http://test.opendap.org/dataset.nc

The legal set of client commands is client library specific.

One notable problem with this form of client command is that it prevents generic URL parsers from parsing the URL because, of course, the square bracket notation is non-standard.
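To make the parsing problem concrete, the sketch below peels the bracketed prefixes off by hand and only then passes the remainder to a generic parser (Python's urllib.parse). The function name split_client_commands and the regular expression are my own assumptions about the prefix syntax described above, not code from any existing client library.

```python
import re
from urllib.parse import urlsplit

# One leading "[name]" or "[name=value]" group (assumed syntax).
_COMMAND = re.compile(r"\[([^\[\]=]+)(?:=([^\[\]]*))?\]")


def split_client_commands(annotated_url):
    """Return (commands, plain_url) for a URL with bracketed prefixes."""
    commands = {}
    pos = 0
    while True:
        match = _COMMAND.match(annotated_url, pos)
        if match is None:
            break
        # A bare "[name]" is stored with the value None.
        commands[match.group(1)] = match.group(2)
        pos = match.end()
    return commands, annotated_url[pos:]


commands, url = split_client_commands(
    "[show=fetch][nocache]http://test.opendap.org/dataset.nc"
)
parts = urlsplit(url)  # the generic parser only sees the plain URL
print(commands, parts.scheme, parts.netloc, parts.path)
```

The extra pre-parsing step is exactly what a suffix-style encoding would avoid.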

It should be noted that an alternative to using client commands in the URL is to use a configuration file (often referred to as the ".rc" file such as ".dodsrc"). This configuration file is assumed to be either in the caller's home directory or in the current working directory. It contains the necessary client commands to be applied. It is mildly less convenient for the user to use a .rc file than to embed a client command in the URL.

Queries

The final class of URL annotations specifies some form of query to control the information to be extracted from a dataset on the server. This information is then passed back to the client.

In DAP2, queries consisted of projections and selections specifying a subset of the data in a dataset.

A projection represents a path through the DDS parse tree annotated with constraints on the dimensions. For example, this query: "?P1.P2[0:2:10].F[1:3][4:4]".
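The structure of such a projection can be sketched in a few lines of Python. This is only an illustration: the function names are hypothetical, the name syntax is simplified to plain identifiers, and I am assuming the usual DAP2 reading of the brackets as [index], [start:stop], or [start:stride:stop].

```python
import re

# A path segment: a simple identifier followed by zero or more "[...]" groups.
_SEGMENT = re.compile(r"([A-Za-z_][A-Za-z0-9_]*)((?:\[[0-9:]+\])*)")
_SLICE = re.compile(r"\[([0-9:]+)\]")


def parse_slice(text):
    """Turn 'i', 'start:stop' or 'start:stride:stop' into (start, stride, stop)."""
    numbers = [int(n) for n in text.split(":")]
    if len(numbers) == 1:
        return (numbers[0], 1, numbers[0])
    if len(numbers) == 2:
        return (numbers[0], 1, numbers[1])
    if len(numbers) == 3:
        return (numbers[0], numbers[1], numbers[2])
    raise ValueError("bad dimension constraint: " + text)


def parse_projection(projection):
    """Split a projection into (name, [slices]) pairs along the path."""
    path = []
    for segment in projection.lstrip("?").split("."):
        match = _SEGMENT.fullmatch(segment)
        if match is None:
            raise ValueError("bad path segment: " + segment)
        name, slices = match.groups()
        path.append((name, [parse_slice(s) for s in _SLICE.findall(slices)]))
    return path


parsed = parse_projection("?P1.P2[0:2:10].F[1:3][4:4]")
print(parsed)
```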

A selection represents a boolean expression to be applied to the records of a sequence. Syntactically, a selection could cross sequences, thus implying a join of the sequences, but in practice this is not allowed.

DAP2 queries also allowed the use of functions in the projections and selections to compute, for example, sums or averages. But the semantics was never very well defined. The set of allowable functions is server dependent.

Annotation Mechanisms

DAP4 will need to support at least the classes of annotations described above. Whatever annotation mechanisms are chosen, the following properties seem desirable.

The resulting URL should be parseable by generic URL parsers => client commands should be embedded at the end of URLs, not the beginning.
Whatever annotation encoding is used, it should be as uniform as possible.

As mechanisms, we have the following available to us:

The URL scheme -- "http:" for example, or the TDS "dods:" scheme. Using this is somewhat undesirable because the scheme would then also need to encode the underlying transport, including whether it is encrypted (https: versus http:).
URL path elements such as the current use of e.g. http://host/../dodsC/... by TDS.
URL query -- everything after the first '?' in the URL, up to any fragment. By convention URL queries take the form of name=value pairs, but in practice they are pretty much free form.
URL fragment -- everything after the first '#'. Again these are pretty much free form.
Filename extensions -- everything between the data set name in the path and the start of the query. The DAP2 ".dds" and ".dods" are examples of this.
Alternate extension formats. Ethan Davis has proposed the use of a "+" notation instead of filename extensions: "+ddx+ascii", for example. This has the advantage of clearly not being confused with filename extensions while also making clear the additive nature of such annotation.
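Most of the mechanisms in this list correspond to pieces that a generic parser already hands back. The following sketch uses Python's urllib.parse and pathlib on a made-up URL (the host, path, query, and fragment are all hypothetical) to show where each kind of annotation would end up.

```python
from pathlib import PurePosixPath
from urllib.parse import urlsplit

# Hypothetical URL exercising scheme, path element, extension, query, fragment.
url = "http://host.example/thredds/dodsC/data/dataset.nc.dds?temp#nocache"

parts = urlsplit(url)
path = PurePosixPath(parts.path)

scheme = parts.scheme            # URL scheme
path_elements = path.parts       # includes the "dodsC" path element
extension = path.suffix          # trailing filename extension
query = parts.query              # everything between '?' and '#'
fragment = parts.fragment        # everything after '#'

print(scheme, path_elements, extension, query, fragment)
```

Note that a generic parser only sees these pieces when the annotations live inside the standard URL syntax, which is the point of the first desirable property above.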

I should note that the Ferret server has taken to seriously abusing the URL format with URLs like this.

http://.../thredds/dodsC/hfrnet/agg/6km_expr_{}{let deq1ubar=u[d=1,l=1:24@ave]}

so we have much to aspire to :-)


DAP4 Commentary: The on-the-wire format

27 March 2012

Background

The current DAP2 clients use two different approaches to managing the packet of data that is sent by the server.

The C++ libdap library uses what I will call an "eager" evaluation method. By this I mean that the whole packet is processed when received, is decomposed into its constituent parts (e.g. data arrays, sequence records, etc) and those parts are used to annotate the parsed DDS.

In contrast, the oc library uses a "lazy" evaluation method. That is, the incoming packet is sent immediately into a file or into a chunk of heap memory. Almost no preprocessing occurs. Data extraction occurs only when requested by the user code through the API.

Problem addressed

The relative merits and demerits of lazy versus eager are well known and will not be repeated here. Lazy evaluation of the DAP2 packet is hampered by the inlining of variable-length data, specifically sequences and strings. If it were not for those, the lazy evaluator could directly compute the location of the desired subset of data as requested by the user, and do so without having to read any intermediate information. But when, for example, Strings are inlined, it is necessary to walk the packet piece by piece to step over the strings.

I plan to use lazy evaluation for my implementations of DAP4, and propose here the outline of a format for the on-the-wire data packet that makes lazy operation fast and simple without, I believe, interfering with eager evaluation.

Proposed solution

Since we have previously agreed on the use of multipart-mime, the incoming data is presumed to be a sequence of variable-length packets with a known length (for each packet) and a unique id for each packet.

Under these assumptions, I propose the following format.

The initial packet is of known computable length, aka "fixed length" for short. That is, its size can be computed solely knowing the DXD for the incoming data. This means that strings and sequences are not represented inline, but instead are represented by fixed-size "pointers" into other, following packets that contain the sequence and/or string data.
Each element in a string array in the initial packet is represented by three pieces of fixed size info:
1. the unique id of the packet containing the contents of the string.
2. the offset in the packet identified in (1).
3. the length of the string in bytes (assuming utf-8 encoding).
As an optimization, the string packet can be directly appended to the fixed size initial packet, in which case, the first item is not strictly necessary.
Given a sequence object, either as a scalar or as an array of sequences, the sequence is replaced by the following fixed size item:
- The unique id of the packet containing the sequence records
Further, each record of the sequence packet is assumed to be "fixed length" by applying the rules above. This means that knowing the total size of the packet containing the sequence records, it is possible to know the exact number of records in the packet without actually having to walk the sequence packet to count them.
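The rules above can be sketched with Python's struct module. Everything concrete here is an assumption made for illustration: the 32-bit big-endian field widths, the example record layout (one int32 plus one string), and the function names. Only the idea of replacing each string with a fixed-size (packet id, offset, length) triple and deriving the record count from the packet size comes from the proposal.

```python
import struct

# Assumed record layout: int32 value, then a string reference made of
# three unsigned 32-bit fields (packet id, offset, length in bytes).
RECORD = struct.Struct(">iIII")


def build_packets(rows, string_packet_id=2):
    """Encode (int, str) rows as a fixed-length record packet plus a string packet."""
    records = bytearray()
    strings = bytearray()
    for value, text in rows:
        data = text.encode("utf-8")
        records += RECORD.pack(value, string_packet_id, len(strings), len(data))
        strings += data
    return bytes(records), bytes(strings)


def record_count(sequence_packet):
    """Number of records follows from the packet size alone."""
    if len(sequence_packet) % RECORD.size != 0:
        raise ValueError("packet size is not a multiple of the record size")
    return len(sequence_packet) // RECORD.size


def read_record(sequence_packet, index, packets):
    """Jump straight to record `index` and follow its string reference."""
    value, packet_id, offset, length = RECORD.unpack_from(
        sequence_packet, index * RECORD.size
    )
    text = packets[packet_id][offset:offset + length].decode("utf-8")
    return value, text


records, strings = build_packets([(7, "alpha"), (-3, "bêta")])
print(record_count(records), read_record(records, 1, {2: strings}))
```

Because every record has the same size, reading record 1 never touches record 0 or its string.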

Rationale for the solution

The above representation makes lazy evaluation very simple, and a given item in a packet can be reached in O(1) time. Even in the case of nested sequences/vlens, the proper item can be reached in O(log n) time, where n is the depth of the nesting. The cost is that a hash map is needed to map unique ids to offsets in the file or heap memory.

The lazy versus eager cases also apply on the server side. Currently, for example, the opendap code on the thredds server takes the underlying data source (a .nc file, for example), converts it to DAP2, and annotates the DDS with the data. Then, as a second pass, the annotations are converted as needed and sent out over the wire. A lazy version would associate elements of the underlying source with the DDS. Transfer of the data would then occur directly from the original source to the wire format as needed. As an aside, I have an (untested and unverified) hypothesis that the proposed encoding will also simplify the use of lazy evaluation on the server side.
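On the client side, the hash map from packet ids to offsets might look like the following sketch. The framing used here (an 8-byte header holding a packet id and a payload length) is purely hypothetical and stands in for whatever the multipart-mime layer would actually provide; the function names are likewise my own.

```python
import struct

# Hypothetical framing: unsigned 32-bit packet id, unsigned 32-bit payload length.
HEADER = struct.Struct(">II")


def frame(packet_id, payload):
    """Prefix a payload with the hypothetical header."""
    return HEADER.pack(packet_id, len(payload)) + payload


def index_packets(buffer):
    """One pass over the raw buffer: map packet id -> (payload offset, length)."""
    index = {}
    pos = 0
    while pos < len(buffer):
        packet_id, length = HEADER.unpack_from(buffer, pos)
        pos += HEADER.size
        index[packet_id] = (pos, length)
        pos += length
    return index


def read_int32(buffer, index, packet_id, element):
    """Fetch one element of an int32 array by arithmetic, without decoding the rest."""
    offset, length = index[packet_id]
    if not 0 <= element < length // 4:
        raise IndexError(element)
    return struct.unpack_from(">i", buffer, offset + 4 * element)[0]


buffer = frame(1, struct.pack(">3i", 10, 20, 30)) + frame(2, b"hello")
index = index_packets(buffer)
print(index, read_int32(buffer, index, 1, 2))
```

Building the index costs one pass over the packet headers; after that, each lookup is a dictionary access plus an offset computation.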

Updates

2012-02-20: The above encoding has as one consequence that all embedded counts that currently exist in DAP2 are superfluous. Ditto for the sequence record markers. It may still be desirable to include the counts for purposes of error checking, but they are not strictly necessary.
