
[netCDF #VCQ-846449]: Re: NetCDF Java Read API



Hi Greg,

Chunking is a property associated with variables rather than files.
Any compressed variable is chunked by default, with each chunk
compressed and uncompressed independently.  The chunking for a
variable is determined when it is created (as is the compression
level).  Chunking and compression are properties of a variable that
cannot be changed after the variable is defined (which for the C++
interface means after any data has been written to the file).  If
chunking parameters are not specified when a variable is defined,
default chunking is used, which may not be optimal.  Expected access
patterns for a variable can help determine good chunking parameters.

All this is documented in the netCDF-4 C Users Guide, but not in the
C++ Users Guide, which is still just for netCDF-3.  A better
introduction to chunking might be the 10 "slides" (really short web
pages) on chunking and compression from the 2008 netCDF training
workshop at

  http://www.unidata.ucar.edu/netcdf/workshops/2008/nc4chunking/

> Regarding chunking, I open the file with code that resembles:
>
> size_t ncChunkSize_bytes = yBins * xBins;
> size_t* chunkSizePtr = &ncChunkSize_bytes;
>
> NcFile* ncFile = new NcFile( tempName.c_str(), NcFile::Replace,
> chunkSizePtr, 0, ncFileFormat );

The unfortunately named chunkSizePtr argument of this NcFile
constructor has nothing to do with per-variable chunk sizes (which
have one component for each dimension of the variable).

> Can you identify anything particularly offensive about this method
> of opening or writing a NetCDF file?

Since you don't have any unlimited dimensions, the default chunking is
the full dimension size for each fixed dimension, which in your case
means a single chunk.  So reading a single value out of this variable
would require uncompressing all of the data.  Furthermore, the
default chunk cache size is smaller than this chunk, so the cache
does no good at all.  (In release 4.0.1 we're changing the default
chunk cache size to always hold at least one chunk.)
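
To make the cost concrete, here is a rough sketch of the arithmetic in
plain C++.  It does not call the netCDF library; the function names are
made up for illustration, and the grid dimensions and cache size in the
note below are assumptions, not values taken from your file.

```cpp
#include <cstddef>
#include <vector>

// Number of values in one chunk: the product of its per-dimension lengths.
std::size_t chunkValues(const std::vector<std::size_t>& chunkShape) {
    std::size_t n = 1;
    for (std::size_t len : chunkShape) n *= len;
    return n;
}

// Uncompressed bytes that must be produced to read ONE value from a
// compressed variable: the whole chunk that contains the value.
std::size_t bytesToReadOneValue(const std::vector<std::size_t>& chunkShape,
                                std::size_t valueSize) {
    return chunkValues(chunkShape) * valueSize;
}

// True if one whole uncompressed chunk fits in a cache of cacheBytes.
// (cacheBytes is a hypothetical figure supplied by the caller.)
bool chunkFitsInCache(const std::vector<std::size_t>& chunkShape,
                      std::size_t valueSize, std::size_t cacheBytes) {
    return bytesToReadOneValue(chunkShape, valueSize) <= cacheBytes;
}
```

For a hypothetical 24 x 500 x 500 variable of 4-byte floats stored as a
single chunk, bytesToReadOneValue({24, 500, 500}, 4) is 24,000,000
bytes.  With one grid per chunk, {1, 500, 500}, it drops to 1,000,000
bytes, which would fit in an assumed 4,000,000-byte cache while the
single big chunk would not.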

> For instance, if I set my chunk size to the dimension of one grid,
> should I also call 'put' once for each forecast grid? Perhaps that
> way, a reader would not be forced to read the entire block at once.

Right, the idea is to have each chunk big enough that read accesses by
chunks are efficient, but uncompressing a single chunk is not a
bottleneck.  You can also set chunks so that data can be read along any
dimension axis without favoring one order of reading over another.
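
One way to compare candidate chunk shapes is to count how many chunks
each kind of read would touch.  The following is only a sketch in plain
C++ (no netCDF calls); chunksTouched is a made-up helper, and the
shapes in the note below are assumed for illustration.

```cpp
#include <cstddef>
#include <vector>

// Count the chunks intersected by a read of count[d] values starting at
// index start[d] along each dimension d, for a variable chunked as
// chunkShape.  All three vectors need one entry per dimension, and every
// chunk length must be greater than zero.
std::size_t chunksTouched(const std::vector<std::size_t>& start,
                          const std::vector<std::size_t>& count,
                          const std::vector<std::size_t>& chunkShape) {
    std::size_t total = 1;
    for (std::size_t d = 0; d < chunkShape.size(); ++d) {
        if (count[d] == 0) return 0;  // an empty read touches no chunks
        std::size_t first = start[d] / chunkShape[d];
        std::size_t last = (start[d] + count[d] - 1) / chunkShape[d];
        total *= (last - first + 1);  // chunks spanned along dimension d
    }
    return total;
}
```

For an assumed 24 x 500 x 500 variable chunked one grid at a time,
{1, 500, 500}, reading a full grid touches 1 chunk but reading a
24-step time series at one grid point touches 24.  With chunks of
{6, 125, 125}, the same two reads touch 16 and 4 chunks respectively,
so neither access order is heavily penalized.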

> If there are any published guides that describe using the C++ API
> with chunking, please pass them along too.

There's a good explanation of how important chunking can be to
performance, with an instructive example, in this paper:

  http://www.hdfgroup.org/pubs/papers/2008-06_netcdf4_perf_report.pdf

--Russ


Russ Rew                                         UCAR Unidata Program
address@hidden                     http://www.unidata.ucar.edu



Ticket Details
===================
Ticket ID: VCQ-846449
Department: Support netCDF
Priority: Normal
Status: Closed