 

NetCDF Streaming Format (Experimental)


Overview

NetCDF Streaming Format (ncstream) is a write-optimized encoding of CDM datasets. Ncstream consists of a series of header and data messages, in any order. Writes are always appended. Later messages override earlier ones whenever they overlap or conflict. To add or modify structural metadata, simply append a new header message. Each data message identifies the variable and the section (rectangular subset) of data it contains. A variable's data thus consists of the collection of data messages for it, if any.
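The override rule can be illustrated with a minimal Python sketch. It assumes a one-dimensional variable and models each data message as a hypothetical (varName, start, values) tuple rather than a real NcStreamProto.Data message; the function name assemble and the fill argument are illustrative only and are not part of any ncstream API.

```python
def assemble(var_name, length, messages, fill=None):
    """Replay data messages in stream order; later writes win on overlap."""
    values = [fill] * length
    for name, start, chunk in messages:
        if name != var_name:
            continue  # message belongs to another variable
        end = start + len(chunk)
        if start < 0 or end > length:
            raise ValueError("section lies outside the variable's shape")
        values[start:end] = chunk
    return values


# Messages in the order they were appended to the stream (hypothetical data).
stream = [
    ("temp", 0, [1, 2, 3, 4]),
    ("pres", 0, [9, 9]),
    ("temp", 2, [30, 40, 50]),  # overlaps indices 2..3 of the first message
]
temp = assemble("temp", 6, stream)
```

Index 5 of temp is never written, so it keeps the fill value. A real reader would work with n-dimensional sections, which this sketch does not attempt.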

Design Goals

Possible uses

Implementation

Messages are encoded using Google's Protobuf library.

"Protocol buffers are a flexible, efficient, automated mechanism for serializing structured data – think XML, but smaller, faster, and simpler. You define how you want your data to be structured once, then you can use special generated source code to easily write and read your structured data to and from a variety of data streams and using a variety of languages. You can even update your data structure without breaking deployed programs that are compiled against the "old" format."

The main advantage of protobuf over XML is performance: messages are smaller and parsing is faster. A very important feature of protobuf is the ability to evolve the message structure in a way that doesn't break previous code.

We don't use protobuf messages for the data itself, since protobuf messages are built in memory, and we need to be able to stream (write data directly from its source onto the output stream, e.g. a socket). The data is simply linearized in the usual netCDF way and written to the stream. Every data message includes a protobuf Data message that identifies the variable and the section that the data represents.

Variable length datatypes like String and Opaque use the vdataMessage for data transfer. First the number of objects is written, then each object, preceded by its length in bytes as a vlen. Strings are encoded as UTF-8 bytes. Opaque is just a bag of bytes.
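As a rough illustration of that layout, the Python sketch below encodes a list of strings as a count followed by length-prefixed UTF-8 bytes. It assumes that vlen and vn use the same base-128 varint encoding as protobuf, which this document does not state explicitly; the function names encode_varint and encode_strings are hypothetical.

```python
def encode_varint(n: int) -> bytes:
    """Base-128 varint, 7 bits per byte, low-order group first (assumed format)."""
    if n < 0:
        raise ValueError("varint must be non-negative")
    out = bytearray()
    while True:
        low7 = n & 0x7F
        n >>= 7
        if n:
            out.append(low7 | 0x80)  # high bit set: more bytes follow
        else:
            out.append(low7)
            return bytes(out)


def encode_strings(values: list[str]) -> bytes:
    """Number of objects, then each object preceded by its length in bytes."""
    out = bytearray(encode_varint(len(values)))
    for s in values:
        raw = s.encode("utf-8")
        out += encode_varint(len(raw))  # length counts bytes, not characters
        out += raw
    return bytes(out)


payload = encode_strings(["temp", "é"])
```

Note that "é" is one character but two UTF-8 bytes, so its length prefix is 2. This covers only the part of a vdataMessage that follows the NcStreamProto.Data message.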

TDS 4.0 currently has a prototype service, similar to OPeNDAP, that uses ncstream and can be used by the Netcdf-Java 4.0 library. The classic model has been tested, and the extended model processing is mostly complete.

Grammar

An ncstream is a sequence of one or more messages:

   ncstream = {message}
   message = headerMessage | dataMessage | vdataMessage | errorMessage
   headerMessage = MAGIC_HEADER, vlen, NcStreamProto.Header
   dataMessage = MAGIC_DATA, vlen, NcStreamProto.Data, vlen, (byte)*vlen
   vdataMessage = MAGIC_DATA, vlen, NcStreamProto.Data, vn, (vlen, bytes)*vn
   errorMessage = MAGIC_ERR, vlen, NcStreamProto.Error
   
   vlen = variable length encoded positive integer == length of the following object in bytes
   vn = variable length encoded positive integer == number of objects that follow
   NcStreamProto.Header = Header message encoded by protobuf
   NcStreamProto.Data = Data message encoded by protobuf
   data = actual bytes of data, encoding described by the NcStreamProto.Data message
     primitives: 


   MAGIC_HEADER = 0xad, 0xec, 0xce, 0xda
   MAGIC_DATA   = 0xab, 0xec, 0xce, 0xba
   MAGIC_ERR    = 0xab, 0xad, 0xba, 0xda
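The grammar above can be read with a small framing routine. The Python sketch below is one possible reading, not the Netcdf-Java implementation: it assumes vlen is a protobuf-style base-128 varint, it returns the protobuf bytes undecoded (decoding needs the generated NcStreamProto classes), and the helper names read_message_start and read_fixed_payload are hypothetical. The magic numbers are the ones listed above.

```python
import io

MAGIC_HEADER = bytes([0xAD, 0xEC, 0xCE, 0xDA])
MAGIC_DATA = bytes([0xAB, 0xEC, 0xCE, 0xBA])
MAGIC_ERR = bytes([0xAB, 0xAD, 0xBA, 0xDA])
KINDS = {MAGIC_HEADER: "header", MAGIC_DATA: "data", MAGIC_ERR: "error"}


def read_exact(stream, n: int) -> bytes:
    buf = stream.read(n)
    if len(buf) != n:
        raise EOFError("stream ended in the middle of a message")
    return buf


def read_varint(stream) -> int:
    """Decode a base-128 varint (assumed encoding of vlen and vn)."""
    result, shift = 0, 0
    while True:
        b = read_exact(stream, 1)[0]
        result |= (b & 0x7F) << shift
        if not b & 0x80:  # high bit clear: last byte of the varint
            return result
        shift += 7


def read_message_start(stream):
    """Read the magic number, then the length-prefixed protobuf bytes."""
    kind = KINDS.get(read_exact(stream, 4))
    if kind is None:
        raise ValueError("unknown magic number")
    return kind, read_exact(stream, read_varint(stream))


def read_fixed_payload(stream) -> bytes:
    """Read the 'vlen, (byte)*vlen' part of a dataMessage."""
    return read_exact(stream, read_varint(stream))


# Placeholder bytes b"hdr" and b"dat" stand in for real protobuf messages.
stream = io.BytesIO(
    MAGIC_HEADER + b"\x03hdr"
    + MAGIC_DATA + b"\x03dat" + b"\x04" + b"\x00\x00\x00\x2a"
)
first = read_message_start(stream)
second = read_message_start(stream)
payload = read_fixed_payload(stream)
```

Because dataMessage and vdataMessage share MAGIC_DATA, a real reader would have to decode the NcStreamProto.Data message to learn the data type before choosing between the fixed-length payload and the vn-prefixed form; the sketch handles only the former.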

The protobuf messages are defined by

(these are files on Unidata's SVN server)

An ncstream dataset starts with MAGIC_START, followed by a set of messages.

   ncstreamDataset = MAGIC_START, ncstream

   MAGIC_START = 0x43, 0x44, 0x46, 0x53 // 'CDFS'

Rules:

Data encoding

There is just enough information in the stream to break it into messages and to identify the kind of each message. To interpret a message correctly, one must have the correct proto file. To interpret the data stream correctly, one must have the header information. (Is that really true? Maybe only for structs.)

NcStreamProto.Data contains the full name of the variable the data belongs to, the DataType and Section, and whether the data is big-endian or little-endian. ?? Note that in Java, DataOutputStream always writes in big-endian order.

   message Data {
     required string varName = 1;
     required DataType dataType = 2;
     required Section section = 3;
     optional bool bigend = 4 [default=true];
   }

Primitive types (byte, char, short, int, long, float, double): arrays of primitives are stored in row-major order. The endian-ness is specified in the NcStreamProto.Data message when needed.
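To make row-major order and the bigend flag concrete, here is a minimal Python sketch that linearizes a 2 x 3 array of 32-bit ints. The function name encode_ints is hypothetical, and the sketch covers only the int type.

```python
import struct


def encode_ints(rows: list[list[int]], bigend: bool = True) -> bytes:
    """Linearize a 2D array of 32-bit ints in row-major order."""
    # Row-major: the last index varies fastest, so rows are written one after another.
    flat = [v for row in rows for v in row]
    byte_order = ">" if bigend else "<"
    return struct.pack(f"{byte_order}{len(flat)}i", *flat)


array = [[1, 2, 3], [4, 5, 6]]
big = encode_ints(array)                  # matches the proto default, bigend=true
little = encode_ints(array, bigend=False)
```

Both results are 24 bytes long and hold the values in the order 1, 2, 3, 4, 5, 6; only the byte order within each 4-byte value differs.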

Variable length types (String, Opaque): First the number of objects is written, then each object, preceded by its length in bytes as a vlen. Strings are encoded as UTF-8 bytes. Opaque is just a bag of bytes. What about vlen, e.g. int (3, *) ??

Structure types (Structure, Sequence): An array of StructureData. Can be encoded in row or column order (?). What about vlens ??

 


This document is maintained by John Caron and was last updated August 17, 2009

 

 
 