Re: HDF5 chunking questions...

Ed Hartnett wrote:

Mike Folk <mfolk@xxxxxxxxxxxxx> writes:

Ed,
There's a fairly extensive chapter on chunking and chunk caching at
http://hdf.ncsa.uiuc.edu/UG41r3_html/Perform.fm2.html#149138.  This
covers the material Quincey provided, and quite a bit more.

Thanks, I've read that...

   Unfortunately there aren't generic instructions for this sort of thing,
it's very application-I/O-pattern dependent.  A general heuristic is to pick
lower and upper bounds on the size of a chunk (in bytes) and try to make the
chunks "squarish" (in n-D).  One thing to keep in mind is that the default
chunk cache in HDF5 is 1MB, so it's probably worthwhile to keep chunks under
half of that.  A reasonable lower limit is a small multiple of the block size
of a disk (usually 4KB).
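
A rough sketch of that heuristic in Python, for illustration only. The function names and the 512 KB default (half of the 1MB default cache mentioned above) are assumptions made for this example, not anything HDF5 provides. It takes the n-th root of the element budget so every edge of the chunk is about the same length, then clips to the dataset dimensions.

```python
def chunk_nbytes(shape, elem_size):
    """Size of one chunk in bytes."""
    n = elem_size
    for edge in shape:
        n *= edge
    return n


def squarish_chunk_shape(dims, elem_size, upper_bytes=512 * 1024):
    """Pick a chunk shape with roughly equal edges that fits in upper_bytes.

    dims        -- dataset dimension sizes (all assumed >= 1)
    elem_size   -- bytes per element
    upper_bytes -- assumed upper bound (half of a 1 MB chunk cache)
    """
    budget = max(1, upper_bytes // elem_size)      # elements per chunk
    edge = max(1, int(budget ** (1.0 / len(dims))))  # n-th root, rounded down
    # Clip each edge so the chunk never exceeds the dataset itself.
    return tuple(min(d, edge) for d in dims)


if __name__ == "__main__":
    shape = squarish_chunk_shape((1000, 1000), elem_size=4)
    print(shape, chunk_nbytes(shape, 4), "bytes")
```

The result should still be checked against the lower bound (a small multiple of the disk block size); when the dataset is small in some dimension, clipping can leave the chunk well under the budget.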

Can the chunk cache size be increased programmatically?

1 MB seems low for scientific applications. Even cheap consumer PCs come
with about half a gig of RAM. Scientific machines much more
so. Wouldn't it be helpful to have 100 MB, for example?

   Generally, you are trying to avoid the situation below:

       Dataset with 10 chunks (dimension sizes don't really matter):
       +----+----+----+----+----+
       |    |    |    |    |    |
       |    |    |    |    |    |
       | A  | B  | C  | D  | E  |
       +----+----+----+----+----+
       |    |    |    |    |    |
       |    |    |    |    |    |
       | F  | G  | H  | I  | J  |
       +----+----+----+----+----+

       If you are writing hyperslabs to part of each chunk like this:
       (hyperslab 1 is in chunk A, hyperslab 2 is in chunk B, etc.)
       +----+----+----+----+----+
       |1111|2222|3333|4444|5555|
       |6666|7777|8888|9999|0000|
       | A  | B  | C  | D  | E  |
       +----+----+----+----+----+
       |    |    |    |    |    |
       |    |    |    |    |    |
       | F  | G  | H  | I  | J  |
       +----+----+----+----+----+

       If the chunk cache is only large enough to hold 4 chunks, then chunk
   A will be preempted from the cache for chunk E (when hyperslab 5 is
   written), but will immediately be re-loaded to write hyperslab
   6 out.
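
To put numbers on the picture above, here is a small simulation. It assumes a plain least-recently-used cache, which is a simplification made for this sketch and not a description of how HDF5 actually decides what to preempt; the function name is made up for the example.

```python
from collections import OrderedDict


def count_chunk_loads(accesses, cache_slots):
    """Count how often a chunk must be (re)loaded under a simple LRU cache."""
    cache = OrderedDict()
    loads = 0
    for chunk in accesses:
        if chunk in cache:
            cache.move_to_end(chunk)       # mark as most recently used
        else:
            loads += 1                     # chunk is not cached: load it
            if len(cache) >= cache_slots:
                cache.popitem(last=False)  # preempt the least recently used
            cache[chunk] = True
    return loads


# Hyperslabs 1-5 touch chunks A-E, then hyperslabs 6-10 touch A-E again.
row_by_row = list("ABCDE") * 2
# Same ten hyperslabs, but both slabs of a chunk are written back to back.
chunk_by_chunk = [c for c in "ABCDE" for _ in range(2)]

if __name__ == "__main__":
    print("4 slots, row by row:    ", count_chunk_loads(row_by_row, 4))
    print("5 slots, row by row:    ", count_chunk_loads(row_by_row, 5))
    print("4 slots, chunk by chunk:", count_chunk_loads(chunk_by_chunk, 4))
```

With 4 slots and the row-by-row order, every one of the 10 writes misses the cache. Either one more slot or a friendlier write order brings that down to 5 loads, one per chunk.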

OK, great. Let me see if I can start to come up with the rules by
which I can select chunk sizes:

1 - Min chunk size should be 4 KB.
2 - Max chunk size should allow n chunks to fit in the chunk cache,
where n is around the max number of chunks the user will access at
once in a hyper-slab.
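
The two rules can be written down as a few lines of arithmetic. This is only a sketch: the function name is invented for the example, and the 4 KB block size and the value of n are the guesses from the rules above.

```python
def chunk_size_bounds(cache_bytes, chunks_in_flight, disk_block_bytes=4096):
    """Return (lower, upper) chunk size bounds in bytes.

    cache_bytes      -- size of the chunk cache
    chunks_in_flight -- n, the most chunks one hyperslab access touches
    disk_block_bytes -- assumed disk block size (rule 1)
    """
    lower = disk_block_bytes
    upper = cache_bytes // chunks_in_flight  # rule 2: n chunks fit in cache
    if upper < lower:
        raise ValueError("cache too small to hold that many chunks")
    return lower, upper


if __name__ == "__main__":
    # 1 MB cache, hyperslabs that span 5 chunks (as in the A-E row above).
    print(chunk_size_bounds(1024 * 1024, 5))
```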

   Unfortunately, our general purpose software can't predict the I/O pattern
that users will access the data in, so it is a tough problem.  On the one hand,
you want to keep the chunks small enough that they will stick around in the
cache until they are finished being written/read, but you want the chunks to
be larger so that the I/O on them is more efficient. :-/

I think we can make some reasonable guesses for netcdf-3.x access
patterns, so that we can at least ensure the common tasks are working
fast enough.

Obviously any user can flummox our optimizations by doing some odd
things we don't expect. As my old engineering professors told me: you
can make it foolproof, but you can't make it damn-foolproof.

Ed
Perhaps we should have 3 modes of chunking that the user can choose:

1) preserve the record-oriented nature of our current unlimited dimension to
   optimize sequential reading of the array.
2) choose an optimal chunk size (some small multiple of disk block size: 8K,
   16K, 32K?) and subdivide the dimensions evenly to optimize over all types
   of subsetting.
3) full user spec of chunk size and chunk dimension size.
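
Something like the following could serve as a starting point for the three modes. Everything here is hypothetical: the mode names, the function, and the 16K default are made up for the sketch, and mode 2 approximates "subdivide evenly" by repeatedly halving the longest edge, which is only one of several ways to do it. It assumes the unlimited dimension comes first and that dims holds current extents, all at least 1.

```python
def choose_chunk_shape(dims, elem_size, mode,
                       target_bytes=16 * 1024, user_shape=None):
    """Pick a chunk shape for one of three (hypothetical) chunking modes."""
    if mode == "record":
        # Mode 1: one record per chunk along the leading (unlimited) dimension.
        return (1,) + tuple(dims[1:])

    if mode == "balanced":
        # Mode 2: halve the longest edge until the chunk fits in target_bytes.
        def nbytes(shape):
            n = elem_size
            for edge in shape:
                n *= edge
            return n

        chunk = list(dims)
        while nbytes(chunk) > target_bytes and max(chunk) > 1:
            i = chunk.index(max(chunk))
            chunk[i] = (chunk[i] + 1) // 2   # round up so edges stay >= 1
        return tuple(chunk)

    if mode == "user":
        # Mode 3: the caller supplies the full chunk shape.
        if user_shape is None or len(user_shape) != len(dims):
            raise ValueError("user mode needs one chunk edge per dimension")
        return tuple(user_shape)

    raise ValueError("unknown chunking mode: %r" % (mode,))


if __name__ == "__main__":
    dims = (100, 180, 360)   # e.g. time, lat, lon with 4-byte values
    for mode in ("record", "balanced"):
        print(mode, choose_chunk_shape(dims, 4, mode))
    print("user", choose_chunk_shape(dims, 4, "user", user_shape=(10, 18, 36)))
```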