Re: HDF5 chunking questions...

Hi Ed,

> > >     Unfortunately there aren't generic instructions for this sort of thing;
> > > it's very dependent on the application's I/O pattern.  A general heuristic
> > > is to pick lower and upper bounds on the size of a chunk (in bytes) and try
> > > to make the chunks "squarish" (in n-D).  One thing to keep in mind is that
> > > the default chunk cache in HDF5 is 1MB, so it's probably worthwhile to keep
> > > chunks under half of that.  A reasonable lower limit is a small multiple of
> > > the block size of a disk (usually 4KB).
> 
> 1 MB seems low for scientific applications. Even cheap consumer PCs come
> with about half a gig of RAM. Scientific machines much more
> so. Wouldn't it be helpful to have 100 MB, for example?
    Yes, we've kicked that around; we should bump it up to something more
reasonable in a future release.
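
    In the meantime, here is a rough sketch of the "squarish, under half the
cache" heuristic quoted above. It is plain Python, not an HDF5 call; the
function name and the halve-the-longest-edge strategy are assumptions made
for illustration, and 512KB is simply half of the 1MB default cache. The 4KB
lower limit is not enforced, because the loop stops above half of max_bytes
unless the whole dataset is smaller than that.

```python
def squarish_chunk_shape(dims, elem_size, max_bytes=512 * 1024):
    """Halve the longest edge until one chunk fits in max_bytes.

    dims      -- dataset dimensions, in elements
    elem_size -- size of one element, in bytes
    max_bytes -- upper bound on chunk size (assumed: half the 1MB cache)
    """
    shape = list(dims)

    def nbytes(s):
        total = elem_size
        for edge in s:
            total *= edge
        return total

    while nbytes(shape) > max_bytes and max(shape) > 1:
        # Always shrink the longest edge so the edges stay close in length.
        i = shape.index(max(shape))
        shape[i] = (shape[i] + 1) // 2  # ceiling division
    return tuple(shape)


print(squarish_chunk_shape((1000, 1000), 8))
```

For a 1000 x 1000 dataset of 8-byte values this gives (250, 250), which is
500,000 bytes per chunk.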

> > >     Generally, you are trying to avoid the situation below:
> > >
> > >         Dataset with 10 chunks (dimension sizes don't really matter):
> > >         +----+----+----+----+----+
> > >         |    |    |    |    |    |
> > >         |    |    |    |    |    |
> > >         | A  | B  | C  | D  | E  |
> > >         +----+----+----+----+----+
> > >         |    |    |    |    |    |
> > >         |    |    |    |    |    |
> > >         | F  | G  | H  | I  | J  |
> > >         +----+----+----+----+----+
> > >
> > >         If you are writing hyperslabs to part of each chunk like this:
> > >         (hyperslab 1 is in chunk A, hyperslab 2 is in chunk B, etc.)
> > >         +----+----+----+----+----+
> > >         |1111|2222|3333|4444|5555|
> > >         |6666|7777|8888|9999|0000|
> > >         | A  | B  | C  | D  | E  |
> > >         +----+----+----+----+----+
> > >         |    |    |    |    |    |
> > >         |    |    |    |    |    |
> > >         | F  | G  | H  | I  | J  |
> > >         +----+----+----+----+----+
> > >
> > >         If the chunk cache is only large enough to hold 4 chunks, then
> > >     chunk A will be preempted from the cache for chunk E (when hyperslab 5
> > >     is written), but will immediately be re-loaded to write hyperslab 6
> > >     out.
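
    The thrashing in the picture above can be reproduced with a toy model.
This sketch assumes a plain least-recently-used cache, which is a
simplification of whatever preemption policy HDF5 actually applies, and it
only counts how often a chunk has to be brought into the cache. The function
and variable names are made up for the example.

```python
from collections import OrderedDict


def count_chunk_loads(accesses, cache_slots):
    """Count cache misses for a sequence of chunk accesses (plain LRU)."""
    cache = OrderedDict()
    loads = 0
    for chunk in accesses:
        if chunk in cache:
            cache.move_to_end(chunk)       # mark as most recently used
        else:
            loads += 1                     # chunk must be (re)loaded
            if len(cache) >= cache_slots:
                cache.popitem(last=False)  # evict least recently used
            cache[chunk] = True
    return loads


# Hyperslabs 1-5 touch chunks A-E, then hyperslabs 6-10 touch A-E again.
row_by_row = list("ABCDE") * 2
# Alternative order: finish each chunk before moving to the next one.
chunk_by_chunk = [c for c in "ABCDE" for _ in range(2)]

print(count_chunk_loads(row_by_row, 4))      # cache one chunk too small
print(count_chunk_loads(row_by_row, 5))      # cache holds the whole row
print(count_chunk_loads(chunk_by_chunk, 4))  # same small cache, better order
```

With 4 slots the row-by-row order loads a chunk on every one of the 10
writes; with 5 slots, or with the chunk-by-chunk order, each chunk is loaded
only once (5 loads).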
> 
> OK, great. Let me see if I can start to come up with the rules by
> which I can select chunk sizes:
> 
> 1 - Min chunk size should be 4 KB.
> 2 - Max chunk size should allow n chunks to fit in the chunk cache,
> where n is around the max number of chunks the user will access at
> once in a hyper-slab.
    Generally, yes.
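
    Written down as arithmetic, the two rules look roughly like this. It is
only a sketch: the function name is hypothetical, the 4KB block size is the
usual value mentioned earlier rather than something queried from the disk,
and n is whatever you estimate for your own access pattern.

```python
def chunk_size_bounds(cache_bytes, chunks_in_flight, disk_block=4096):
    """Return (min_bytes, max_bytes) for one chunk.

    Rule 1: a chunk should be at least one disk block (assumed 4KB).
    Rule 2: chunks_in_flight chunks must fit in the chunk cache together.
    """
    min_bytes = disk_block
    max_bytes = cache_bytes // chunks_in_flight
    if max_bytes < min_bytes:
        raise ValueError("cache too small for that many chunks at once")
    return min_bytes, max_bytes


# Default 1MB cache, hyperslabs that span 5 chunks at a time.
print(chunk_size_bounds(1024 * 1024, 5))
```

With the default 1MB cache and 5 chunks touched at once, this gives an upper
bound of 209715 bytes per chunk, or roughly 200KB.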

> > >
> > >     Unfortunately, our general-purpose software can't predict the I/O
> > > pattern that users will access the data in, so it is a tough problem.  On
> > > the one hand, you want to keep the chunks small enough that they will stick
> > > around in the cache until they are finished being written/read; on the
> > > other, you want the chunks to be larger so that the I/O on them is more
> > > efficient. :-/
> 
> I think we can make some reasonable guesses for netcdf-3.x access
> patterns, so that we can at least ensure the common tasks are working
> fast enough.
    Cool.

> Obviously any user can flummox our optimizations by doing some odd
> things we don't expect. As my old engineering professors told me: you
> can make it foolproof, but you can't make it damn-foolproof.
    :-)

        Quincey