Ncstream Grammar

Version 1 (DRAFT)

An ncstream is an ordered sequence of one or more messages:

   ncstream = MAGIC_START, {message}, MAGIC_END
   message = headerMessage | dataMessage | sdataMessage | seqdataMessage | errorMessage
   headerMessage = MAGIC_HEADER, vlen, NcStreamProto.Header
   dataMessage = MAGIC_DATA, vlen, NcStreamProto.Data, vlen, (byte)*vlen
   sdataMessage = MAGIC_DATA, vlen, NcStreamProto.Data, vlen, NcStreamProto.StructureData
   seqdataMessage = MAGIC_DATA, vlen, NcStreamProto.Data, {MAGIC_VDATA, vlen, NcStreamProto.StructureData}, MAGIC_VEND
   errorMessage = MAGIC_ERR, vlen, NcStreamProto.Error
   
   vlen = variable length encoded positive integer == length of the following object in bytes
   vn = variable length encoded positive integer == number of objects that follow
   NcStreamProto.Header = Header message encoded by protobuf
   NcStreamProto.Data = Data message encoded by protobuf
   byte = actual bytes of data, encoding described by the NcStreamProto.Data message
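The grammar does not say how a vlen is laid out on the wire. As a minimal sketch, assuming vlen uses the same base-128 varint layout as protobuf (7 data bits per byte, low-order group first, high bit set when more bytes follow), encoding and decoding might look like this. The function names write_vlen and read_vlen are hypothetical, not part of any ncstream library.

```python
import io


def write_vlen(value: int) -> bytes:
    """Encode a non-negative integer as a base-128 varint (assumed vlen layout)."""
    if value < 0:
        raise ValueError("vlen must be non-negative")
    out = bytearray()
    while True:
        low7 = value & 0x7F
        value >>= 7
        if value:
            out.append(low7 | 0x80)  # high bit set: more bytes follow
        else:
            out.append(low7)         # high bit clear: last byte
            return bytes(out)


def read_vlen(stream) -> int:
    """Decode one varint from a binary stream, consuming only its bytes."""
    result = 0
    shift = 0
    while True:
        b = stream.read(1)
        if not b:
            raise EOFError("stream ended inside a vlen")
        result |= (b[0] & 0x7F) << shift
        if not (b[0] & 0x80):
            return result
        shift += 7


encoded = write_vlen(300)
decoded = read_vlen(io.BytesIO(encoded))
```

Under this assumption, values below 128 occupy a single byte, so short messages pay one byte of length overhead.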

primitives:

   MAGIC_START = 0x43, 0x44, 0x46, 0x53
   MAGIC_HEADER = 0xad, 0xec, 0xce, 0xda
   MAGIC_DATA = 0xab, 0xec, 0xce, 0xba
   MAGIC_VDATA = 0xab, 0xef, 0xfe, 0xba
   MAGIC_VEND = 0xed, 0xef, 0xfe, 0xda
   MAGIC_ERR = 0xab, 0xad, 0xba, 0xda
   MAGIC_END = 0xed, 0xed, 0xde, 0xde
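Because every magic marker is exactly four bytes, a reader can identify one with a single table lookup. The sketch below uses the byte values listed above; the dictionary and the classify_magic function are illustrative names only.

```python
# Four-byte markers from the primitives list, keyed by their raw bytes.
MAGIC_NAMES = {
    bytes([0x43, 0x44, 0x46, 0x53]): "MAGIC_START",
    bytes([0xAD, 0xEC, 0xCE, 0xDA]): "MAGIC_HEADER",
    bytes([0xAB, 0xEC, 0xCE, 0xBA]): "MAGIC_DATA",
    bytes([0xAB, 0xEF, 0xFE, 0xBA]): "MAGIC_VDATA",
    bytes([0xED, 0xEF, 0xFE, 0xDA]): "MAGIC_VEND",
    bytes([0xAB, 0xAD, 0xBA, 0xDA]): "MAGIC_ERR",
    bytes([0xED, 0xED, 0xDE, 0xDE]): "MAGIC_END",
}


def classify_magic(four_bytes: bytes) -> str:
    """Return the marker name for four bytes, or raise if they match none."""
    try:
        return MAGIC_NAMES[bytes(four_bytes)]
    except KeyError:
        raise ValueError(f"unknown magic bytes: {bytes(four_bytes).hex()}")


start_name = classify_magic(b"CDFS")
```

MAGIC_START is the only marker made of printable ASCII, which makes a stream easy to recognize in a hex dump.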

The protobuf messages are defined in files on Unidata's SVN server. The protobuf compiler compiles these into Java and C code that does the actual encoding and decoding of the stream.

Rules

Data encoding

There is just enough information in the stream to break the stream into messages and to know what kind of message it is. To interpret the message correctly, one must have the definition of the variable.
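As a rough illustration of that framing, the sketch below splits a stream into messages without interpreting any payload. It assumes vlen is a protobuf-style base-128 varint, which the grammar does not state, and it handles only headerMessage, errorMessage, and the plain dataMessage form. sdataMessage and seqdataMessage share MAGIC_DATA, so telling them apart would require decoding NcStreamProto.Data first. All function names here are hypothetical.

```python
import io

MAGIC_START = bytes([0x43, 0x44, 0x46, 0x53])
MAGIC_HEADER = bytes([0xAD, 0xEC, 0xCE, 0xDA])
MAGIC_DATA = bytes([0xAB, 0xEC, 0xCE, 0xBA])
MAGIC_ERR = bytes([0xAB, 0xAD, 0xBA, 0xDA])
MAGIC_END = bytes([0xED, 0xED, 0xDE, 0xDE])


def read_vlen(stream) -> int:
    """Decode a base-128 varint (assumed vlen layout)."""
    result, shift = 0, 0
    while True:
        b = stream.read(1)
        if not b:
            raise EOFError("stream ended inside a vlen")
        result |= (b[0] & 0x7F) << shift
        if not (b[0] & 0x80):
            return result
        shift += 7


def read_block(stream) -> bytes:
    """Read a vlen followed by that many payload bytes."""
    n = read_vlen(stream)
    payload = stream.read(n)
    if len(payload) != n:
        raise EOFError("stream ended inside a payload")
    return payload


def split_messages(stream) -> list:
    """Return the messages between MAGIC_START and MAGIC_END as tuples."""
    if stream.read(4) != MAGIC_START:
        raise ValueError("missing MAGIC_START")
    messages = []
    while True:
        magic = stream.read(4)
        if magic == MAGIC_END:
            return messages
        if magic == MAGIC_HEADER:
            messages.append(("header", read_block(stream)))
        elif magic == MAGIC_ERR:
            messages.append(("error", read_block(stream)))
        elif magic == MAGIC_DATA:
            data_proto = read_block(stream)  # undecoded NcStreamProto.Data
            raw = read_block(stream)         # undecoded data bytes
            messages.append(("data", data_proto, raw))
        else:
            raise ValueError(f"unexpected magic: {magic.hex()}")
```

The payloads come back as opaque bytes. Decoding them needs the generated protobuf classes, and interpreting the data bytes needs the variable's definition from the header.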

NcStreamProto.Data contains the full name of the variable the data belongs to, its DataType and Section, and whether the data is big-endian or little-endian. ?? Note that in Java, DataOutputStream always writes in big-endian order.

   message Data {
     required string varName = 1;
     required DataType dataType = 2;
     required Section section = 3;
     optional bool bigend = 4 [default=true];
   }

Primitive types (byte, char, short, int, long, float, double): arrays of primitives are stored in row-major order. The endianness is specified in the NcStreamProto.Data message when needed.
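To make the layout concrete, here is a small sketch that flattens a 2-D array of ints in row-major order and packs it with the byte order selected by a bigend flag, mirroring the bigend field of the Data message. It assumes int is a 4-byte signed value, as in Java. The helper names are hypothetical.

```python
import struct


def pack_row_major(rows, fmt_char="i", bigend=True) -> bytes:
    """Flatten a list of rows (last index varies fastest) and pack the values."""
    flat = [value for row in rows for value in row]
    prefix = ">" if bigend else "<"  # struct byte-order prefix, standard sizes
    return struct.pack(f"{prefix}{len(flat)}{fmt_char}", *flat)


def unpack_row_major(raw: bytes, shape, fmt_char="i", bigend=True) -> list:
    """Inverse of pack_row_major for a 2-D shape (nrows, ncols)."""
    nrows, ncols = shape
    prefix = ">" if bigend else "<"
    flat = struct.unpack(f"{prefix}{nrows * ncols}{fmt_char}", raw)
    return [list(flat[r * ncols:(r + 1) * ncols]) for r in range(nrows)]


raw = pack_row_major([[1, 2], [3, 4]])
restored = unpack_row_major(raw, (2, 2))
```

The reader needs the shape from the Section and the element type from the DataType to reverse the flattening; the data bytes themselves carry neither.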

Variable length types (String, Opaque): First the number of objects is written, then each object, preceded by its length in bytes as a vlen. Strings are encoded as UTF-8 bytes. Opaque is just a bag of bytes. What about vlen, e.g. int (3, *) ??
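A minimal sketch of the String case follows. It assumes both the object count and each vlen are written as protobuf-style base-128 varints, which this draft does not confirm. The function names are hypothetical.

```python
def write_varint(value: int) -> bytes:
    """Base-128 varint, low-order group first (assumed layout for counts and vlens)."""
    out = bytearray()
    while True:
        low7 = value & 0x7F
        value >>= 7
        if value:
            out.append(low7 | 0x80)
        else:
            out.append(low7)
            return bytes(out)


def encode_strings(strings) -> bytes:
    """Write the object count, then each string as (byte length, UTF-8 bytes)."""
    out = bytearray(write_varint(len(strings)))
    for s in strings:
        raw = s.encode("utf-8")
        out += write_varint(len(raw))  # length in bytes, not in characters
        out += raw
    return bytes(out)


encoded = encode_strings(["ab", "é"])
```

The length prefix counts bytes rather than characters: "é" is one character but two bytes in UTF-8, so its prefix is 2.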

Structure types (Structure, Sequence): An array of StructureData. Can be encoded in row or column order (?). What about vlens ??

There should be a way to efficiently encode sparse data. Look at Bigtable/HBase.

Ncstream Dataset (Incomplete)

An ncstream dataset can be stored in a file and read as a CDM dataset. It is an alternate encoding of CDM files. The intention is that these are "write-optimized" append-only files.

An ncstream dataset starts with MAGIC_START, followed by a sequence of messages, followed by MAGIC_END:

   ncstreamDataset = MAGIC_START, {message}, MAGIC_END
   MAGIC_START = 0x43, 0x44, 0x46, 0x53 // 'CDFS'
   MAGIC_END = 0xed, 0xed, 0xde, 0xde

Rules:


This document is maintained by John Caron and was last updated Dec 2010