Old Problem - Sparse Data

Jeremy Beal (jbeal@nvmedia.com)
Mon, 19 May 1997 16:03:39 -0600

Greetings!

Forgive me for bringing up a difficulty which has been encountered
before, but I'm interested in seeing if anybody has made any headway.

We have an interest in storing a variety of numerical data as output
from a software simulation of a physical system. The data which must
be stored are physical quantities which depend on spatial coordinates
and time. The output data must be read by another program, possibly
running on a different platform, so we would like the data file to
be platform independent. The volume of output is large enough that
a binary file is a practical necessity. We are interested in netCDF
as a means of getting an easily read, platform-independent binary
data file. We would also like to use the tools which have been
developed for examining netCDF files.

Unfortunately, our data is not necessarily regularly patterned, and
it seems that it may not fit well within a standard netCDF file. Our
natural inclination would be to use the time value as the unlimited
dimension within the file and then define the coordinates of our
spatial grid points using three additional dimensions.

Problems:
	Our data is sparse in both space and time; i.e. not
	every physical quantity is written out at each time
	step, nor at every spatial grid point within a given
	time step.

	Our grids themselves can vary from one time step to
	the next, i.e. a finer grid might be created inside
	a cell for a single time step where extra accuracy is
	needed. The finer grid would exist for only one or two
	time steps and would then not be used again for the
	rest of the simulation.

As I understand the structure of a netCDF file, the only
way to hold all of the quantities in one file would be to
set up the dimensions to enumerate every grid location and
time step ever used within the simulation, and to define
every variable over those dimensions. Fill values would be
written for any values which are not explicitly placed in the
file. Because our data is so sparse, this would waste enough
space to make the file unusable.
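
To put rough numbers on the waste, here is a back-of-the-envelope
sketch. The run size (1000 time steps, a 100x100x100 grid, 10
quantities, 1% of values actually written) is invented for
illustration and is not taken from our simulation. The sparse
estimate assumes each written value carries five 4-byte indices
(time step, quantity id, i, j, k), which is only one possible
encoding.

```python
def dense_bytes(n_times, nx, ny, nz, n_quantities, bytes_per_value=8):
    """Bytes needed if every quantity is stored at every point and step."""
    return n_times * nx * ny * nz * n_quantities * bytes_per_value


def sparse_bytes(n_written, bytes_per_value=8, index_bytes=5 * 4):
    """Bytes needed if each written value carries its own indices.

    index_bytes covers five 4-byte integers (an assumed encoding):
    time step, quantity id, i, j, k.
    """
    return n_written * (bytes_per_value + index_bytes)


# Hypothetical run; none of these numbers come from the real code.
n_times, nx, ny, nz, n_quantities = 1000, 100, 100, 100, 10
total_slots = n_times * nx * ny * nz * n_quantities
n_written = total_slots // 100  # assume 1% of the slots are ever written

dense = dense_bytes(n_times, nx, ny, nz, n_quantities)
sparse = sparse_bytes(n_written)
print("dense :", dense, "bytes")
print("sparse:", sparse, "bytes")
print("ratio :", round(dense / sparse, 1))
```

Under these assumptions the fully enumerated layout comes to 80 GB
against 2.8 GB for the indexed one, and the gap widens as the data
gets sparser.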

I've seen from the archives that people have used some
tricks to get around the time sparseness issue, including
sub-record schemes (good if the data is regularly patterned
in time) and using separate netCDF files for quantities which
are written at different frequencies. These won't work easily
for our problem because we can't generally predict in advance 
when the quantities may need to be written, and because they
don't address the issue of spatial data sparseness.

Our current code writes a platform dependent binary file
which must be converted before our second program can load
it. The binary file is written efficiently in our own data
format, which puts a subheader at the beginning of each time
record indicating exactly what has been stored within that
record.
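
To make the subheader idea concrete, here is a minimal sketch of a
self-describing time record. The field layout (quantity ids, a flat
cell index, 8-byte values) and the function names are invented for
illustration and are not our actual format. The sketch also pins the
byte order to big-endian, which our current file does not do, so the
same bytes read back identically on any platform.

```python
import struct


def pack_record(time, blocks):
    """Pack one time record: subheader first, then the data it describes.

    blocks maps a quantity id (int) to a list of (cell_index, value) pairs.
    All fields are big-endian with no padding.
    """
    out = [struct.pack(">dI", time, len(blocks))]
    # Subheader: one (quantity id, entry count) pair per stored quantity.
    for qid, entries in sorted(blocks.items()):
        out.append(struct.pack(">II", qid, len(entries)))
    # Body: the entries, in the same order the subheader lists them.
    for qid, entries in sorted(blocks.items()):
        for cell, value in entries:
            out.append(struct.pack(">Id", cell, value))
    return b"".join(out)


def unpack_record(buf, offset=0):
    """Read one record at offset; return (time, blocks, next_offset)."""
    time, n_quantities = struct.unpack_from(">dI", buf, offset)
    offset += struct.calcsize(">dI")
    header = []
    for _ in range(n_quantities):
        header.append(struct.unpack_from(">II", buf, offset))
        offset += struct.calcsize(">II")
    blocks = {}
    for qid, count in header:
        entries = []
        for _ in range(count):
            entries.append(struct.unpack_from(">Id", buf, offset))
            offset += struct.calcsize(">Id")
        blocks[qid] = entries
    return time, blocks, offset


# Round trip: quantity 1 at two cells, quantity 2 at one cell.
record = pack_record(0.5, {2: [(7, 1.25)], 1: [(3, 2.0), (4, -1.0)]})
time, blocks, end = unpack_record(record)
print(time, blocks, end == len(record))
```

The reader only learns the length of a record by parsing its
subheader, so records are variable in size and must be walked in
order unless a separate index of offsets is kept.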

We see three possibilities for writing a platform independent
binary data file with a reasonable size:

1. Filter the output from our existing routines through
the XDR library in order to write a platform independent
binary file. The output side should be easy, just one
extra step in writing to the file. It would require
some amount of coding on the input side of the second
program, as we'd need to have it dissect the proprietary
binary (but platform independent) file. Here we keep the
efficient file size but lose the benefits of netCDF, such as
external utilities, simple function calls to retrieve values,
etc.

2. Do something clever using the existing netCDF routines.
This would be something like the sub-record scheme or multiple
file workaround, but would need to address all of our sparseness
problems. I haven't thought of anything very good yet.

3. Modify the netCDF library to allow for subheaders at each
record explicitly showing what is stored within that record.
I'm not sure yet how difficult this would be. We would still
lose compatibility with the existing netCDF utilities, etc.
In addition, we would need to keep our modified code in step
with netCDF updates if we wanted to benefit from them. However,
it would provide nice standard functions to retrieve arbitrary
pieces of data. I'm sure that there would be a performance hit
on random access reads/writes because you would no longer have
a nice fixed record size. I don't know how large that hit
would be.
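
For what it's worth, the closest thing to option 2 that I can
picture is a flat coordinate list: one long "sample" axis, with
parallel one-dimensional columns for time, quantity id, grid indices
and value. In netCDF terms each column would presumably become a
one-dimensional variable along the unlimited dimension, but that
mapping is an untested assumption on my part. The sketch below uses
plain Python lists rather than netCDF calls, and the column names
and sample values are made up. It does not address the temporary
fine grids, which would need something more (perhaps an extra
grid-id column).

```python
def flatten(samples):
    """Turn sparse output into parallel one-dimensional columns.

    samples is a list of (time, quantity_id, i, j, k, value) tuples,
    in whatever order the simulation happened to write them.
    """
    names = ("time", "quantity", "i", "j", "k", "value")
    columns = {name: [] for name in names}
    for sample in samples:
        for name, field in zip(names, sample):
            columns[name].append(field)
    return columns


def select(columns, quantity, time):
    """Return (i, j, k, value) for one quantity at one time.

    This is a linear scan over the whole sample axis.
    """
    rows = range(len(columns["value"]))
    return [
        (columns["i"][r], columns["j"][r], columns["k"][r], columns["value"][r])
        for r in rows
        if columns["quantity"][r] == quantity and columns["time"][r] == time
    ]


# Made-up samples: quantity 1 written at two points, then quantity 2 once.
samples = [
    (0.0, 1, 0, 0, 0, 300.0),
    (0.0, 1, 5, 2, 1, 310.5),
    (0.1, 2, 5, 2, 1, 0.75),
    (0.2, 1, 0, 0, 0, 301.0),
]
columns = flatten(samples)
hits = select(columns, quantity=1, time=0.0)
print(hits)
```

Nothing is stored for points that were never written, but every
lookup becomes a search rather than a direct index, and each value
pays for its own coordinates.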

Is anybody else facing a data storage problem with sparse
and general data?

Any suggestions?

Thanks sincerely,


Jeremy Beal
jbeal@nvmedia.com