
[netCDF #API-226991]: NetCDF file size reduction when using missing values



Hi Coy,

> I am working on a NetCDF file format using NetCDF-4 libraries but
> writing in NetCDF-3 format. In my scenario my URI is based on time, I
> believe I must use the same dimensions for each of my data arrays. But
> the amount of data I need to store for each time block changes. I have
> set up my code to set the dimensions to the largest needed size and
> write 'empty' values when data is missing. I believe I have done this
> correctly as ncdump shows a dash instead of a number for these missing
> values. The problem is that my files are the same size whether I have
> lots of missing values or none at all. Is there something I can do to
> reduce the size of the file when it contains a fair number of missing
> values? I have attached two netcdf files to illustrate this problem.
> They are both the same file size, but one contains a fair number of
> missing values towards the end.

NetCDF-3 classic files store every value of a fixed-size variable at full size,
fill values included, which is why your two files are the same size.  If you
can write netCDF-4 classic model format, then you can use compression on
variables that have missing values and get smaller files.  These files can be
read by netCDF-3 applications that have been relinked with a netCDF-4 library,
and the decompression will occur transparently, without any changes to the
reading program.  Reading compressed data is slower than reading uncompressed
data, but in many cases the smaller size is worth the time, and if enough cache
is allocated, each chunk of data is decompressed only once, on its first read.
See this FAQ for more information:

  http://www.unidata.ucar.edu/netcdf/docs/faq.html#fv9

You may also want to look at the other FAQs on format variants:

  http://www.unidata.ucar.edu/software/netcdf/docs/faq.html#formats-and-versions
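
As a rough illustration of why compression helps here, the sketch below runs
plain zlib deflate (the algorithm behind netCDF-4 compression) over two
arrays of the same length, one fully populated and one with its second half
set to the fill value.  It does not use the netCDF library at all; the array
length, the random test data, and the helper name raw_and_deflated_size are
assumptions made for the example, and a real file is compressed chunk by
chunk, so actual savings will differ.

```python
import random
import struct
import zlib

# Assumption: 32-bit floats, with missing entries set to netCDF's default
# float fill value (NC_FILL_FLOAT in netcdf.h).
NC_FILL_FLOAT = 9.969209968386869e36


def raw_and_deflated_size(values):
    """Return (raw_bytes, deflated_bytes) for a sequence of floats."""
    raw = struct.pack(f"<{len(values)}f", *values)
    return len(raw), len(zlib.compress(raw))


rng = random.Random(0)  # fixed seed so the sketch is repeatable
n = 10_000

# Hypothetical data: one array fully populated, one with its second half missing.
full = [rng.random() for _ in range(n)]
half_missing = full[: n // 2] + [NC_FILL_FLOAT] * (n // 2)

full_raw, full_deflated = raw_and_deflated_size(full)
half_raw, half_deflated = raw_and_deflated_size(half_missing)

print("fully populated:", full_raw, "->", full_deflated, "bytes")
print("half missing:   ", half_raw, "->", half_deflated, "bytes")
```

The raw sizes are identical, as with the two attached files, while the
deflated size shrinks once long runs of the same fill value appear.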

Alternatively, you can use multiple time dimensions.  For example, if some
variables have data once per second and others have data once per minute, it
would be best to use two different time dimensions and associated time
coordinate variables for the two data rates.
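
To put rough numbers on the once-per-second and once-per-minute case, the
sketch below counts how many values end up stored under each layout.  It
assumes one variable at each rate plus the time coordinate variables; the
one-hour duration and the function name stored_values are hypothetical.

```python
def stored_values(seconds):
    """Count stored values for one per-second and one per-minute variable.

    Returns (shared, separate): totals when both variables share a single
    per-second time dimension versus when each rate has its own dimension.
    Coordinate variables are included in both totals.
    """
    minutes = seconds // 60
    # Shared dimension: time coordinate + fast variable + slow variable,
    # with the slow variable padded out to the per-second length.
    shared = 3 * seconds
    # Separate dimensions: each variable and its own coordinate variable.
    separate = 2 * seconds + 2 * minutes
    return shared, separate


shared, separate = stored_values(3600)  # hypothetical: one hour of data
print("shared time dimension:   ", shared, "values")
print("separate time dimensions:", separate, "values")
```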

Another alternative would be to use an index for observations and store the
time of each observation as data, rather than trying to share a single time
dimension.  Some examples of this are illustrated in the proposed CF point
observation conventions:

  https://cf-pcmdi.llnl.gov/trac/wiki/PointObservationConventions
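
A minimal sketch of the observation-index idea, in plain Python rather than
netCDF calls: time blocks of varying length are flattened into two parallel
lists indexed by observation, so nothing is padded.  In a file, both lists
would be variables on a single observation dimension.  The function name
flatten_blocks and the sample data are hypothetical.

```python
def flatten_blocks(blocks):
    """Flatten (time, values) blocks into parallel per-observation lists.

    blocks: list of (time, list_of_values) pairs whose lengths may differ.
    Returns (obs_time, obs_value), each with one entry per observation.
    """
    obs_time, obs_value = [], []
    for time, values in blocks:
        for value in values:
            obs_time.append(time)    # time is stored as data
            obs_value.append(value)
    return obs_time, obs_value


# Hypothetical input: three time blocks holding 3, 1, and 0 values.
blocks = [(0, [1.0, 2.0, 3.0]), (60, [4.0]), (120, [])]
obs_time, obs_value = flatten_blocks(blocks)

# A padded layout would need len(blocks) * max block length slots.
padded_slots = len(blocks) * max(len(values) for _, values in blocks)
print(len(obs_value), "observations stored versus", padded_slots, "padded slots")
```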

> On a separate note, I don't always know how big my data array will need
> to be when I first create my NetCDF files. Is there a way to expand an
> array once it has been created or add a new array?

Yes, that's the purpose of declaring the size of a dimension to be "unlimited"
when you create it.  Data can be efficiently appended along an unlimited
dimension for any variable that has a shape that uses that unlimited dimension.
NetCDF-3 classic files may only have one unlimited dimension, but that
restriction is removed for netCDF-4 files.

You can also add a new variable to an existing netCDF file, but unless you have
planned for this in advance by allocating extra header space when you first
create the file, it can result in an expensive operation of moving all the data
to make space.  In netCDF-4 files, this is not a problem, and you can
efficiently add variables, dimensions, or attributes at any time without the
library moving or copying data.

For more on this, see the user's guide chapter on file structure and
performance:

  http://www.unidata.ucar.edu/netcdf/docs/netcdf.html#Structure

--Russ

Russ Rew                                         UCAR Unidata Program
address@hidden                      http://www.unidata.ucar.edu



Ticket Details
===================
Ticket ID: API-226991
Department: Support netCDF
Priority: Normal
Status: Closed