An ncstream is an ordered sequence of one or more messages:
ncstream = MAGIC_START, {message}, MAGIC_END
message = headerMessage | dataMessage | sdataMessage | seqdataMessage | errorMessage
headerMessage = MAGIC_HEADER, vlen, NcStreamProto.Header
dataMessage = MAGIC_DATA, vlen, NcStreamProto.Data, vlen, (byte)*vlen
sdataMessage = MAGIC_DATA, vlen, NcStreamProto.Data, vlen, NcStreamProto.StructureData
seqdataMessage = MAGIC_DATA, vlen, NcStreamProto.Data, {MAGIC_VDATA, vlen, NcStreamProto.StructureData}, MAGIC_VEND
errorMessage = MAGIC_ERR, vlen, NcStreamProto.Error
vlen = variable length encoded positive integer == length of the following object in bytes
vn = variable length encoded positive integer == number of objects that follow
NcStreamProto.Header = Header message encoded by protobuf
NcStreamProto.Data = Data message encoded by protobuf
byte = actual bytes of data, encoding described by the NcStreamProto.Data message
primitives:
MAGIC_START = 0x43, 0x44, 0x46, 0x53
MAGIC_HEADER = 0xad, 0xec, 0xce, 0xda
MAGIC_DATA = 0xab, 0xec, 0xce, 0xba
MAGIC_VDATA = 0xab, 0xef, 0xfe, 0xba
MAGIC_VEND = 0xed, 0xef, 0xfe, 0xda
MAGIC_ERR = 0xab, 0xad, 0xba, 0xda
MAGIC_END = 0xed, 0xed, 0xde, 0xde
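As a rough illustration of the framing in the grammar above, the following Python sketch writes and reads one length-prefixed message. It assumes that vlen uses a base-128 varint (7 data bits per byte, least-significant group first, high bit set on every byte except the last), which the grammar itself does not spell out. The helper names encode_vlen, decode_vlen, frame and unframe are hypothetical, and the payload is treated as opaque bytes rather than a real protobuf message.

```python
import io

# Magic numbers copied from the grammar above.
MAGIC_HEADER = bytes([0xAD, 0xEC, 0xCE, 0xDA])
MAGIC_DATA = bytes([0xAB, 0xEC, 0xCE, 0xBA])
MAGIC_ERR = bytes([0xAB, 0xAD, 0xBA, 0xDA])
KINDS = {MAGIC_HEADER: "header", MAGIC_DATA: "data", MAGIC_ERR: "error"}


def encode_vlen(n: int) -> bytes:
    """Encode a non-negative integer as a base-128 varint (assumed scheme)."""
    if n < 0:
        raise ValueError("vlen must be non-negative")
    out = bytearray()
    while True:
        low7 = n & 0x7F
        n >>= 7
        if n:
            out.append(low7 | 0x80)  # high bit set: more bytes follow
        else:
            out.append(low7)         # high bit clear: last byte
            return bytes(out)


def decode_vlen(stream) -> int:
    """Read one varint from a binary stream."""
    result = 0
    shift = 0
    while True:
        b = stream.read(1)
        if not b:
            raise EOFError("stream ended inside a vlen")
        result |= (b[0] & 0x7F) << shift
        if not b[0] & 0x80:
            return result
        shift += 7


def frame(magic: bytes, payload: bytes) -> bytes:
    """Build magic + vlen + payload, the shape shared by header and error messages."""
    return magic + encode_vlen(len(payload)) + payload


def unframe(stream):
    """Read the magic number and the first length-prefixed object of a message."""
    magic = stream.read(4)
    if magic not in KINDS:
        raise ValueError(f"unknown magic {magic.hex()}")
    length = decode_vlen(stream)
    payload = stream.read(length)
    if len(payload) != length:
        raise EOFError("truncated payload")
    return KINDS[magic], payload
```

For a data message, unframe returns only the NcStreamProto.Data bytes; what follows (a second vlen and raw bytes, a StructureData, or a MAGIC_VDATA sequence) still has to be read according to that decoded message.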
The protobuf messages are defined by
(these are files on Unidata's SVN server). These are compiled by the protobuf compiler into Java and C code that does the actual encoding/decoding from the stream.
Rules
There is just enough information in the stream to break the stream into messages and to know what kind of message it is. To interpret the message correctly, one must have the definition of the variable.
NcStreamProto.Data contains the full name of the variable the data belongs to, the DataType and Section, and whether the data is big-endian or little-endian. ?? Note that in Java, DataOutputStream always writes in big-endian order.
message Data {
  required string varName = 1;
  required DataType dataType = 2;
  required Section section = 3;
  optional bool bigend = 4 [default=true];
}
Primitive types (byte, char, short, int, long, float, double): arrays of primitives are stored in row-major order. The endian-ness is specified in the NcStreamProto.Data message when needed.
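A minimal sketch of the primitive-array rule, using Python's struct module: a 2-D array of 32-bit ints is flattened in row-major order and packed with the byte order given by a bigend flag that mirrors the field in NcStreamProto.Data. The function names pack_int_array and unpack_int_array are hypothetical, and only the int type is shown.

```python
import struct


def pack_int_array(rows, bigend=True):
    """Flatten a 2-D list of 32-bit ints in row-major order and pack it."""
    # Row-major: the last index varies fastest, so rows are written one after another.
    flat = [value for row in rows for value in row]
    byte_order = ">" if bigend else "<"
    return struct.pack(f"{byte_order}{len(flat)}i", *flat)


def unpack_int_array(data, shape, bigend=True):
    """Rebuild the 2-D list from packed bytes and a (rows, cols) shape."""
    n_rows, n_cols = shape
    byte_order = ">" if bigend else "<"
    flat = struct.unpack(f"{byte_order}{n_rows * n_cols}i", data)
    return [list(flat[r * n_cols:(r + 1) * n_cols]) for r in range(n_rows)]
```

The shape is not part of the byte stream in this sketch; a reader would take it from the Section in the accompanying Data message.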
Variable length types (String, Opaque): First the number of objects is written, then each object, preceded by its length in bytes as a vlen. Strings are encoded as UTF-8 bytes. Opaque is just a bag of bytes. Open question: what about vlen dimensions, e.g. int (3, *) ??
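The String case can be sketched as follows. This is an illustration under two assumptions that the text does not confirm: the object count is written as a base-128 varint, and vlen uses the same varint scheme. The names encode_varint, read_varint, encode_strings and decode_strings are hypothetical.

```python
import io


def encode_varint(n):
    """Base-128 varint, least-significant 7-bit group first (assumed scheme)."""
    out = bytearray()
    while True:
        low7 = n & 0x7F
        n >>= 7
        if n:
            out.append(low7 | 0x80)
        else:
            out.append(low7)
            return bytes(out)


def read_varint(stream):
    """Read one varint from a binary stream."""
    result = 0
    shift = 0
    while True:
        b = stream.read(1)
        if not b:
            raise EOFError("stream ended inside a varint")
        result |= (b[0] & 0x7F) << shift
        if not b[0] & 0x80:
            return result
        shift += 7


def encode_strings(strings):
    """Write the count, then each string as (byte length, UTF-8 bytes)."""
    out = bytearray(encode_varint(len(strings)))
    for s in strings:
        raw = s.encode("utf-8")
        out += encode_varint(len(raw))  # length in bytes, not characters
        out += raw
    return bytes(out)


def decode_strings(data):
    """Inverse of encode_strings."""
    stream = io.BytesIO(data)
    count = read_varint(stream)
    result = []
    for _ in range(count):
        length = read_varint(stream)
        result.append(stream.read(length).decode("utf-8"))
    return result
```

Because the prefix counts bytes, a one-character string such as "é" is written with length 2. Opaque values would follow the same shape with the raw bytes in place of the UTF-8 encoding.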
Structure types (Structure, Sequence): An array of StructureData. It can be encoded in row or column order (?). Open question: what about vlens ??
There should be a way to encode sparse data efficiently. Look at Bigtable/HBase.
An ncstream dataset can be stored in a file and read as a CDM dataset. It is an alternate encoding of CDM files. The intention is that these are "write-optimized" append-only files.
An ncstream dataset starts with MAGIC_START, followed by a set of messages, followed by MAGIC_END:
ncstreamDataset = MAGIC_START, {message}, MAGIC_END
MAGIC_START = 0x43, 0x44, 0x46, 0x53 // 'CDFS'
MAGIC_END = 0xed, 0xed, 0xde, 0xde
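A small sketch of the dataset envelope, assuming the messages have already been serialized to bytes elsewhere. The function names wrap_dataset and looks_like_ncstream_dataset are hypothetical; the check only inspects the first and last four bytes and does not validate the messages in between.

```python
# Magic numbers copied from the definition above.
MAGIC_START = bytes([0x43, 0x44, 0x46, 0x53])  # ASCII "CDFS"
MAGIC_END = bytes([0xED, 0xED, 0xDE, 0xDE])


def wrap_dataset(messages):
    """Concatenate already-encoded messages between MAGIC_START and MAGIC_END."""
    return MAGIC_START + b"".join(messages) + MAGIC_END


def looks_like_ncstream_dataset(data):
    """Cheap sanity check on the envelope only."""
    return (
        len(data) >= len(MAGIC_START) + len(MAGIC_END)
        and data.startswith(MAGIC_START)
        and data.endswith(MAGIC_END)
    )
```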
This document is maintained by John Caron and was last updated Dec 2010