
[netCDF #AIQ-275071]: [netcdf-hdf] Unexpected overall file size jump



Hi James,

I wrote:
> The problem is revealed by running "ncdump -s -h" on the netCDF-4 files,
> which shows that the variables that use the unlimited dimension nsets get
> "chunked" into 3-D tiles, and the netCDF-4 library chooses default chunk
> sizes that cause the file expansion you see.
> 
> One simple solution would be to write the netCDF-4 data with the unlimited
> dimension nsets changed instead to a fixed-size dimension, an operation
> supported by the "-u" option of the nccopy utility.  Then the variable data
> would all be stored contiguously instead of chunked, as is required when a
> variable uses the unlimited dimension.

When I tested this by using "nccopy -u" on your file, it changed nsets from an
unlimited dimension to a fixed-size dimension, but the result was still chunked
the same as for the input file, rather than stored contiguously, so it didn't
make the file smaller.

However, when I used a new nccopy capability to rechunk the input in a way
that made it contiguous (1 chunk) or used chunk sizes that evenly divided the
dimension sizes, it worked as expected, reducing the size to essentially the
same as the netCDF-3 file:

  $ ls -l Fluid_Meas.*
  -rw-rw-r--  1 russ ustaff  42186296 Dec 22 13:26 Fluid_Meas.snc
  -rw-rw-r--  1 russ ustaff  95528366 Dec 22 13:27 Fluid_Meas.snc-nccopy-k3
  $ nccopy -k3 -c  nsets/,n_variables/,npoints/ Fluid_Meas.snc tmp.nc
  $ ls -l tmp.nc
  -rw-rw-r-- 1 russ ustaff 42251686 Jan  6 15:21 tmp.nc

This shows that the netCDF-4 version of the netCDF-3 file Fluid_Meas.snc, with
the right chunk sizes, has only slightly more overhead than the netCDF-3 file,
even without compression.  The "-c nsets/,n_variables/,npoints/" option says
to chunk all variables that use the dimensions "nsets", "n_variables", or
"npoints" with a chunksize the same as the dimension length.  The time for
nccopy to read and rechunk the original 42 MB file to the netCDF-4 output file
was fast on my desktop machine:

  real  0m0.34s
  user  0m0.07s
  sys   0m0.19s

and the times were similar if nccopy instead used the 95 MB netCDF-4 file that
used the default chunk sizes:

  real  0m0.25s
  user  0m0.08s
  sys   0m0.14s

> Another possibility would be to explicitly set the chunksizes for the output
> to better values than determined by the current library algorithm for
> selecting default chunk sizes.  We're discussing whether we could fix the
> default chunk size algorithm to avoid extreme file size expansion, such as
> you have demonstrated in this case.
> 
> For example, the library currently sets the default chunksizes for the
> measurements variable as this output from "ncdump -h -s" shows:
> 
> float measurements(nsets, n_variables, npoints) ;
>         measurements:_Storage = "chunked" ;
>         measurements:_ChunkSizes = 1, 9, 120669 ;
> 
> resulting in 20 chunks, each of size 1*9*120669*4 = 4344084 bytes, for a
> total of 86881680 bytes, about 87 Mbytes.
> 
> Better choices of chunksizes would be (1, 11, 152750) with 5 chunks,
> (1, 1, 152750) with 55 chunks, or (1, 1, 76375) with 110 chunks, for
> example, none of which would waste any space in the chunks and all of which
> would result in total storage of 33605000 bytes, about 34 Mbytes.
> 
> It looks like the current default chunking can result in a large amount
> of wasted space in cases like this.
> 
> Thanks for pointing out this problem.  In summary, to work around it
> currently you either have to avoid using the unlimited dimension for these
> netCDF-4 files or you have to explicitly set the chunk sizes using the
> appropriate API call to not waste as much space as for the current choice
> of default chunk sizes.
> 
> I'm currently working on making it easy to specify chunksizes in the output
> of nccopy, but I don't know whether that will make the upcoming 4.1.2
> release.  If not, it will be available separately in subsequent snapshot
> releases and should help deal with problems like this, if we don't find a
> better algorithm for selecting default chunksizes.

Options for specifying custom chunksizes to nccopy are now supported in the
latest snapshot release, so they will be in the upcoming version 4.1.2.  All
the choices for chunksizes described above work fine and result in a data file
essentially as small as the netCDF-3 data file.  For example, to specify 110
chunks, each of size 1 x 1 x 76375, for variables using the "nsets",
"n_variables", and "npoints" dimensions of size 5, 11, and 152750
respectively, an nccopy invocation would be:

  $ nccopy -c nsets/1,n_variables/1,npoints/76375 Fluid_Meas.snc tmp.nc
  $ ls -l tmp.nc
[TODO]
  $ ncdump -h -s tmp.nc | grep "measurements"
[TODO]

and the output file is a netCDF-4 classic model file if you didn't specify an
output file type and are copying a netCDF-3 classic file.  Also, for the case
of copying a netCDF classic format file with specified chunk sizes, a default
of size 1 is assumed for any dimension not mentioned, so the above could be
done more simply with just:

  $ nccopy -c npoints/76375 Fluid_Meas.snc tmp.nc
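
If you are writing these netCDF-4 files directly from your own program rather
than converting them with nccopy, the chunk sizes can also be set explicitly
through the C API at variable definition time, which is the "appropriate API
call" mentioned above.  Here is a minimal sketch, assuming the dimension and
variable names from your file (error checking omitted for brevity):

  #include <netcdf.h>

  /* Sketch: define the measurements variable with explicit 1 x 1 x 76375
   * chunks instead of letting the library pick default chunk sizes. */
  void define_measurements(int ncid)
  {
      int dimids[3], varid;
      size_t chunks[3] = {1, 1, 76375};

      nc_def_dim(ncid, "nsets", NC_UNLIMITED, &dimids[0]);
      nc_def_dim(ncid, "n_variables", 11, &dimids[1]);
      nc_def_dim(ncid, "npoints", 152750, &dimids[2]);
      nc_def_var(ncid, "measurements", NC_FLOAT, 3, dimids, &varid);

      /* request chunked storage with the chosen chunk sizes */
      nc_def_var_chunking(ncid, varid, NC_CHUNKED, chunks);
  }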

Note that you can do considerably better with output file size if you specify
compression, as can be accomplished with the -d (deflate) option to nccopy:

  $ nccopy -d 1 -c npoints/76375 Fluid_Meas.snc tmp.nc
  $ ls -l tmp.nc
  -rw-rw-r-- 1 russ ustaff 32686447 Jan 10 09:17 tmp.nc

and you can do even better by specifying the "shuffle" option to improve the
compression (which is still lossless) by reordering the bytes within each
chunk so that all the first bytes of the values are stored together, followed
by all the second bytes, all the third bytes, and all the fourth bytes:

  $ nccopy -d 1 -s -c npoints/76375 Fluid_Meas.snc tmp.nc
  $ ls -l tmp.nc
  -rw-rw-r-- 1 russ ustaff 27864220 Jan 10 09:17 tmp.nc
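
If you write the files through the C API instead of using nccopy, the
equivalent of the "-d 1 -s" options is a call to nc_def_var_deflate() for
each variable before leaving define mode.  A minimal sketch:

  #include <netcdf.h>

  /* Sketch: enable the shuffle filter and level-1 deflate compression for a
   * variable, the API counterpart of "nccopy -d 1 -s". */
  void compress_variable(int ncid, int varid)
  {
      int shuffle = 1;        /* reorder bytes within each chunk */
      int deflate = 1;        /* enable deflate compression */
      int deflate_level = 1;  /* 1 = fastest, 9 = smallest */

      nc_def_var_deflate(ncid, varid, shuffle, deflate, deflate_level);
  }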

You can verify with ncdump that all the data is still identical to the original.
Furthermore, by chunking the data, the compression is more useful for accessing
a subset of the data, as each chunk is independently compressed and the library
only uncompresses the chunks needed for the data requested, rather than all the
data in the file.  Uncompressed chunks are cached, so accessing the same data
usually doesn't incur the cost of uncompressing the data again.
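
As an illustration, reading a single set's worth of the measurements variable
from the compressed file only touches the chunks overlapping that slab.  A
minimal sketch, with the start and count values chosen to match the dimension
sizes in your file:

  #include <stdio.h>
  #include <stdlib.h>
  #include <netcdf.h>

  /* Sketch: read one "set" of the measurements variable; only the chunks
   * overlapping this slab are read and uncompressed by the library. */
  int main(void)
  {
      int ncid, varid, status;
      size_t start[3] = {0, 0, 0};           /* first set */
      size_t count[3] = {1, 11, 152750};     /* all variables, all points */
      float *slab = malloc(count[1] * count[2] * sizeof(float));

      if ((status = nc_open("tmp.nc", NC_NOWRITE, &ncid)))
          return status;
      if ((status = nc_inq_varid(ncid, "measurements", &varid)))
          return status;
      if ((status = nc_get_vara_float(ncid, varid, start, count, slab)))
          return status;
      printf("first value: %g\n", slab[0]);
      free(slab);
      return nc_close(ncid);
  }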

Anyway, the new version of nccopy is in the current snapshot.  More complete
documentation and more testing will be in the upcoming 4.1.2 release.  We're
still considering tweaking the current default chunk size algorithm to avoid
the file size increase in your example.  Your example has been quite helpful
for testing and refining the chunking/rechunking functionality in the nccopy
utility.

--Russ


Russ Rew                                         UCAR Unidata Program
address@hidden                      http://www.unidata.ucar.edu



Ticket Details
===================
Ticket ID: AIQ-275071
Department: Support netCDF
Priority: Normal
Status: Closed