Entries for: NetCDF

Chunking Data: Choosing Shapes

In part 1, we explained what data chunking is about in the context of scientific data access libraries such as netCDF-4 and HDF5, presented a 38 GB 3-dimensional dataset as a motivating example, discussed benefits of chunking, and showed with some benchmarks what a huge difference chunk shapes can make in balancing read times for data that will be accessed in multiple ways.

In this post, I'll continue looking at that example dataset to see how we can derive good chunk shapes, generalize the approach to other datasets, measure how long it can take to rechunk a multidimensional dataset, and examine the use of Solid State Disk (SSD) for both accessing and rechunking data.


[Read More]

Chunking Data: Why it Matters

What is data chunking? How can chunking help to organize large multidimensional datasets for both fast and flexible data access?  How should chunk shapes and sizes be chosen?  Can software such as netCDF-4 or HDF5 provide better defaults for chunking? If you're interested in those questions and some of the issues they raise, read on ...

[Read More]

DAP4 Commentary: DAP4 On-The-Wire Format

Background

The current DAP2 clients use two different approaches to managing the packet of data that is sent by the server.

The C++ libdap library uses what I will call an "eager" evaluation method. By this I mean that the whole packet is processed when received: it is decomposed into its constituent parts (e.g., data arrays, sequence records, etc.), and those parts are used to annotate the parsed DDS.

In contrast, the oc library uses a "lazy" evaluation method. That is, the incoming packet is written immediately to a file or into a chunk of heap memory. Almost no preprocessing occurs. Data extraction occurs only when requested by the user code through the API.

Problem addressed

The relative merits and demerits of lazy versus eager are well known and will not be repeated here.

Lazy evaluation of the DAP2 packet is hampered by the inlining of variable-length data: sequences and strings specifically. If it were not for those, the lazy evaluator could directly compute the location of the desired subset of data as requested by the user, without having to read any intermediate information. But when, for example, strings are inlined, it is necessary to walk the packet piece by piece to step over the strings.

I plan to use lazy evaluation for my implementations of DAP4, and I propose here the outline of a format for the on-the-wire data packet that makes lazy operation fast and simple without, I believe, interfering with eager evaluation.

Proposed solution

Since we have previously agreed on the use of multipart-mime, the incoming data is presumed to be a sequence of variable-length parts with a unique id for each part and (optionally) a known length for each part. In order to accommodate streaming, the length is allowed to have the value -1, which indicates that the length was unknown at the time the part was sent out by the server and must be computed by the client when received.

Under these assumptions, I propose the format described in this grammar. That grammar has a number of semantic (context-sensitive) constraints not representable in the given context-free grammar. In order to disambiguate the grammar, I had to add some extra tokens, BOA, EOA, BOS, and EOS, to delimit certain unbounded lists. In a real implementation, the equivalent of these tokens would be handled by the enforcement of the semantic constraints.

Notes:

  1. The concept of group does not appear in the grammar. This is because a group is a lexical notion and thus only affects names, not the structure of the on-the-wire format.

  2. The format presented here supports a limited form of self-description, in that various tags and counts are included in the format, sufficient to allow parsing an instance of the format.

Grammar Overview

A narrative description of the grammar follows. The names enclosed in {...} are non-terminals in the grammar.

{request}
The complete on-the-wire {request} consists of a {mainpart} followed by zero or more {sequencepart} instances.

{mainpart}
The {mainpart} consists of a {partheader} followed by a {structure} followed by a {stringannex} part. The {mainpart} is assumed to be a multipart-mime part, and the {structure} represents a single, top-level dataset.

The {mainpart} is of known, computable length. That is, its size can be computed solely from the DDX for the incoming data. This means that strings and sequences are not represented inline, but instead are represented by "pointers" into subsequent {sequencepart} and {stringannex} parts that contain the sequence records and/or string data. Note that the "pointers" are actually the unique ids of those parts.

{partheader}
The {partheader} appears at the beginning of every multipart-mime part. It includes the part's unique multipart-mime identifier. The {partheader} also contains the length in bytes of the data part of the mime part; if this is -1, then the length is unknown and must be computed by the client. Finally, the {partheader} contains a {parttype} that indicates the type of the part (a struct sketch follows the {parttype} enumeration below).

{parttype}
The {parttype} defines the type of part and is equivalent to this enumeration.
enum parttype {mainpart=1, sequencepart=2, stringannex=3}
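
For concreteness, here is a minimal C-style sketch of a {partheader} layout. The field names and the width of the id are illustrative assumptions only, and on the wire the fields would be serialized in order, without compiler-inserted padding.

#include <stdint.h>

/* Sketch of a {partheader}; the id width is an assumption. */
struct partheader {
    char     id[64];    /* unique multipart-mime part identifier */
    int64_t  length;    /* byte length of the part's data, or -1 if unknown */
    uint32_t parttype;  /* one of the parttype enumeration values above */
};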

{structure}
A {structure} consists of a {tag} and a {fieldlist}. The {tag}'s {name} gives the name attribute value of the DDX element for this structure. The {tag}'s {typecode} indicates that this is a {structure}, and the {tag}'s {count} gives the number of fields in the structure. The {tag} is part of the self-description feature.

{tag}
A {tag} consists of a {name}, a {typecode}, and a {count}. The {name} gives the name attribute of some DDX element. The {typecode} gives the type of the following component of the part. The {count} gives the number of items in that component (a struct sketch follows the {typecode} enumeration below).

It should be noted that the tag could be removed from the on-the-wire format because it is reconstructable from the DDX. Removing it, however, would mean that the format is not parseable at all without knowing the DDX.

{name}
A {name} is a character array of size 16. The purpose of the name is solely to aid in self-description, so it is an inlined, fixed-size string that is presumed to represent the 16-character prefix of the name attribute of some DDX element.

{count}
A {count} is a 64-bit unsigned integer.

{typecode}
A {typecode} is a 32-bit unsigned integer encoding the equivalent of the following enumeration, which lists all the defined DAP4 primitive data types plus some structural types (structure, sequence, record).
enum typecode {char=1,
    int8=2, int16=3, int32=4, int64=5,
    uint8=6, uint16=7, uint32=8, uint64=9,
    float32=10, float64=11,
    opaque=12, string=13,
    structure=14, sequence=15, record=16};
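
Assembling the three preceding components, a {tag} might be laid out as follows (a sketch; on the wire the three fields are simply concatenated, with no compiler-inserted padding):

#include <stdint.h>

/* Sketch of a {tag}: 16 + 4 + 8 = 28 bytes on the wire. */
struct tag {
    char     name[16];  /* fixed-size prefix of the DDX name attribute */
    uint32_t typecode;  /* one of the typecode enumeration values above */
    uint64_t count;     /* number of items in the following component */
};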

{fieldlist}
A {fieldlist} is a sequence of {field}s. The number of fields is given by the {count} in the {tag} of the enclosing {structure}.

{field}
A {field} consists of a {tag}, whose {name} gives the field name, whose {typecode} indicates the type of the field, and whose {count} specifies the number of elements, followed by an {array} holding those elements.

{array}
An {array}, generically, consists of the concatenation of one or more elements, where the elements all have the same {typecode} value, and whose length is defined by the {count} in the {tag} of the defining {field}.
  • For most of the simple types, the {array} is a simple concatenation of instances of the specified type.

  • For three of the simple types (char, int16, and uint16), the {array} is a simple concatenation of instances of the specified type that is then padded with ASCII NUL characters until the length of the array in bytes is a multiple of four; this is an accommodation to XDR.

  • For an array of opaque, the array is preceded by a count of the size of each opaque instance (this is separate from the {count} in the field's {tag}, which gives the total number of opaque instances). Each instance is an array of 8-bit bytes (uint8).

  • For an array of strings, the array is a concatenation of {stringref}s (see below).

  • For {structure} arrays, the array is a concatenation of the structure instances.

  • For {sequence} arrays, the array is a concatenation of {sequenceref}s.

{stringref}
A {stringref} is a "pointer" to a string in the {stringannex}. It consists of an {offset}, which indicates the relative start of the string in the annex, and a {count} giving the length of the string in UTF-8 bytes. Note that this count is technically redundant with the count stored with the string in the annex. Note also that the unique id of the annex is not needed, because the string annex part is presumed to immediately follow the part containing the {stringref}.
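
A {stringref} might therefore be laid out as follows (a sketch; the 64-bit width of the {offset} is an assumption, since only the width of a {count} is specified above):

#include <stdint.h>

/* Sketch of a {stringref}: a "pointer" into the following {stringannex}. */
struct stringref {
    uint64_t offset;  /* relative start of the string within the annex */
    uint64_t count;   /* length of the string in UTF-8 bytes */
};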

{sequenceref}
A {sequenceref} is a concatenation of "pointers", where each "pointer" gives the unique id of some corresponding {sequencepart} instance in some following multipart-mime part.

{stringannex}
A {stringannex} is a part consisting of a {partheader} followed by a {stringlist}. Each {mainpart} or {sequencepart} is presumed to be followed immediately by an associated {stringannex} part. The annex holds the content of all the strings referenced in that preceding part. If the annex is empty, then it may be elided.

{stringlist}
A {stringlist} is a concatenation of {string} instances.

{string}
A {string} is a {count} followed by a {chararray} whose length is specified by that {count}. The string content is followed by padding of ASCII NUL characters sufficient to bring the total size of the {string} instance up to a multiple of four (an accommodation to XDR).
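
The padded size of a {string} instance can thus be computed with the following arithmetic (a sketch; the function name is illustrative):

#include <stdint.h>

/* Total wire size of a {string}: an 8-byte count plus character
   data NUL-padded up to a multiple of four (the XDR rule). */
uint64_t string_size(uint64_t count) {
    return 8 + ((count + 3) & ~(uint64_t)3);
}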

{sequencepartlist}
A {sequencepartlist} represents the content of the sequences referenced in the {mainpart} followed by the transitive closure of the content of all nested sequence instances referenced in all the sequence parts.

{sequencepart}
A {sequencepart} is similar in form to a {mainpart} in that it consists of a {partheader} followed by data and then by an optional {stringannex}. It differs from a {mainpart} in that its data is a variable number of {record} instances (i.e., a {sequence}). The number of records can be computed from the length given in the {partheader} and the size of a {record}. Note that the record's size is computable from the DDX because, as with a {mainpart}, all nested sequences and strings have been moved to subsequent parts.
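
That computation might look like the following (a sketch; part_length and record_size are hypothetical variables, and the 28-byte {tag} size assumes the layout sketched earlier):

/* Records in a {sequencepart}, assuming its data is a {tag}
   followed directly by the concatenated records. */
uint64_t nrecords = (part_length - 28) / record_size;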

{sequence}
A {sequence} consists of a {tag}, whose {name} gives the name of the sequence from the DDX, whose {typecode} is always the sequence typecode, and whose {count} gives the number of records, followed by a {recordarray}.

{recordarray}
A {recordarray} is a concatenation of {record} instances.

{record}
A {record} is essentially identical in format to a single {structure}.

Discussion

The format described above assumes that the on-the-wire encoding is loosely based on the XDR encoding. However, some variations are assumed.
  • The encoding assumes "receiver makes it right", meaning that the byte order on the wire is that of the sender, and that order may differ from that of the receiver. It is the duty of the receiver to determine whether byte re-ordering is necessary and to perform it as needed (a byte-swap sketch follows this list).

  • The above format includes short and unsigned short types. Normally, in XDR, all short/ushort instances are promoted to int/uint values. This encoding does not assume that, but in order to be consistent with the XDR four-byte rule, arrays of shorts are treated like arrays of bytes and are extended with ASCII NUL characters to a multiple of four bytes in length.
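
A receiver-side swap for 32-bit quantities might look like this (a sketch; how the sender's byte order is signaled is not specified by this proposal):

#include <stdint.h>

/* Reverse the bytes of a 32-bit value; applied by the receiver
   only when the sender's byte order differs from the host's. */
static uint32_t swap32(uint32_t v) {
    return (v >> 24) | ((v >> 8) & 0x0000FF00u)
         | ((v << 8) & 0x00FF0000u) | (v << 24);
}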

netCDF Identifiers and Character Escape Mechanisms (sigh!)

Ideally, netCDF should allow any printable UTF-8 character to be used in an identifier. Currently, that is almost the case, with forward slash being the exception because of the syntax of HDF5 identifiers.

More and more, the netCDF API is being used as a wrapper for a wide variety of other formats: HDF5, HDF4, GRIB, BUFR, DAP2, DAP4, etc. In the process of defining translations between netCDF and these other formats, it is necessary to implicitly or explicitly define netCDF identifiers from the schemas of those other formats.

The canonical example is HDF5. In HDF5, many API functions take a path, which is a sequence of identifiers separated by '/'. A path may be absolute ("/g1/g2/x") or relative ("y"). There appears to be no way in HDF5 to specify an identifier containing '/'; such cases are always interpreted as paths. So, if one naively defined, through the netCDF-4 API, a variable named "/x/y", there would be no apparent way to actually get it defined properly in HDF5. It is this fact that has led to the current, IMO undesirable, restriction that netCDF identifiers may not contain '/'.

Super Escapes

This situation is going to recur as the netCDF API is used to wrap other data formats. What we will need is a mechanism by which we can convert an identifier containing arbitrary UTF-8 characters into another identifier drawn from some rather restricted set of legal identifier characters. In addition, I would impose the rule that the conversion be invertible.

This kind of "super-escaping" is very hard because, in the worst case, we are likely to encounter situations where the legal identifier characters are restricted to something like the alphanumerics plus underscore.
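
One possible invertible scheme, sketched below, escapes every byte outside [A-Za-z0-9], including '_' itself, as '_' followed by two hex digits; because '_' never appears unescaped, the mapping can be reversed unambiguously, and a multi-byte UTF-8 character simply becomes several such triples. The function name and buffer convention are illustrative; nothing like this currently exists in the netCDF library.

#include <ctype.h>
#include <stdio.h>

/* Invertible "super-escape" sketch: each byte outside [A-Za-z0-9]
   (including '_') becomes _XX, with XX two uppercase hex digits.
   The output buffer must hold at least 3*strlen(in)+1 bytes. */
void superescape(const char *in, char *out) {
    for (; *in != '\0'; in++) {
        unsigned char c = (unsigned char)*in;
        if (isalnum(c))
            *out++ = (char)c;
        else
            out += sprintf(out, "_%02X", c);
    }
    *out = '\0';
}

For example, "x/y" becomes "x_2Fy", and decoding recovers the original by replacing each _XX triple with the byte it encodes.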

NetCDF-5 format to be based on GRIB

NetCDF-5 will be taking advantage of recent advances in data storage technology.

[Read More]