
[netCDF #AIQ-275071]: [netcdf-hdf] Unexpected overall file size jump



James,

Sorry it's taken so long to respond to your question about netCDF-4 file size.

The problem is revealed by running "ncdump -s -h" on the netCDF-4 files, which
shows that the variables using the unlimited dimension nsets get "chunked" into
3-D tiles, and that the netCDF-4 library chooses default chunk sizes that cause
the file expansion you see.
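
For example (the file name here is just a placeholder):

    ncdump -s -h data.nc

The "-s" option makes ncdump display the special virtual attributes, such as
_Storage and _ChunkSizes, that describe how each variable is stored in the
file.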

One simple solution would be to write the netCDF-4 data with the unlimited
dimension nsets changed to a fixed-size dimension, an operation supported by
the "-u" option of the nccopy utility.  Then the variable data would all be
stored contiguously instead of chunked, as is required when a variable uses
the unlimited dimension.
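
A minimal example, assuming the original file is named input.nc (a placeholder
name):

    nccopy -u input.nc output.nc

This copies input.nc to output.nc with the unlimited dimension nsets converted
to a fixed-size dimension of the same length.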

Another possibility would be to explicitly set the chunk sizes for the output
to better values than those determined by the current library algorithm for
selecting default chunk sizes.  We're discussing whether we could fix the
default chunk size algorithm to avoid extreme file size expansion such as you
have demonstrated in this case.

For example, the library currently sets the default chunk sizes for the
measurements variable as this output from "ncdump -h -s" shows:

        float measurements(nsets, n_variables, npoints) ;
                measurements:_Storage = "chunked" ;
                measurements:_ChunkSizes = 1, 9, 120669 ;

resulting in 20 chunks, each of size 1*9*120669*4 = 4344084 bytes, for a total
of 86881680 bytes, about 87 Mbytes.
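
To see where the 20 chunks come from: the dimension lengths implied by the
no-waste chunkings below are nsets = 5, n_variables = 11, and npoints = 152750,
so the variable is tiled into

    ceil(5/1) * ceil(11/9) * ceil(152750/120669) = 5 * 2 * 2 = 20

chunks, and the partial chunks along n_variables and npoints are padded out to
full chunk size on disk.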

Better choices of chunk sizes would be (1, 11, 152750) with 5 chunks,
(1, 1, 152750) with 55 chunks, or (1, 11, 76375) with 10 chunks, for example.
None of these would waste any space in the chunks, and all would result in
total storage of 33605000 bytes, about 34 Mbytes.

It looks like the current default chunking can result in a large amount of
wasted space in cases like this: here the padding in partial chunks accounts
for roughly 53 of the 87 Mbytes.

Thanks for pointing out this problem.  In summary, to work around it currently
you either have to avoid using the unlimited dimension for these netCDF-4
files, or you have to explicitly set the chunk sizes using the appropriate API
call so that less space is wasted than with the current choice of default
chunk sizes.
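
For the second workaround, here is a minimal sketch in C using the
nc_def_var_chunking() function, with the file name made up, the dimension
lengths taken from the example above, and error checking omitted for brevity:

    #include <netcdf.h>

    int main(void) {
        int ncid, dimids[3], varid;
        /* One of the no-waste chunk shapes discussed above. */
        size_t chunks[3] = {1, 11, 152750};

        nc_create("example.nc", NC_NETCDF4 | NC_CLOBBER, &ncid);
        nc_def_dim(ncid, "nsets", NC_UNLIMITED, &dimids[0]);
        nc_def_dim(ncid, "n_variables", 11, &dimids[1]);
        nc_def_dim(ncid, "npoints", 152750, &dimids[2]);
        nc_def_var(ncid, "measurements", NC_FLOAT, 3, dimids, &varid);

        /* Override the default chunk sizes; this must be called while
         * the file is still in define mode. */
        nc_def_var_chunking(ncid, varid, NC_CHUNKED, chunks);

        nc_enddef(ncid);
        /* ... write data with nc_put_vara_float() ... */
        nc_close(ncid);
        return 0;
    }

Each (1, 11, 152750) chunk divides the variable evenly, so no chunk is padded
and the file holds only the data actually written.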

I'm currently working on making it easy to specify chunk sizes in the output
of nccopy, but I don't know whether that will make the upcoming 4.1.2 release.
If not, it will be available separately in subsequent snapshot releases and
should help deal with problems like this, if we don't find a better algorithm
for selecting default chunk sizes.

--Russ



Russ Rew                                         UCAR Unidata Program
address@hidden                      http://www.unidata.ucar.edu



Ticket Details
===================
Ticket ID: AIQ-275071
Department: Support netCDF
Priority: Normal
Status: Closed