netcdf-hdf mailing list is no longer active. The list archives are made available for historical reasons.
Ed Hartnett wrote:
Mike Folk <mfolk@xxxxxxxxxxxxx> writes:

Ed,

There's a fairly extensive chapter on chunking and chunk caching at http://hdf.ncsa.uiuc.edu/UG41r3_html/Perform.fm2.html#149138. This covers the material Quincey provided, and quite a bit more.

Thanks, I've read that...

Unfortunately there aren't generic instructions for this sort of thing; it's very application-I/O-pattern dependent. A general heuristic is to pick lower and upper bounds on the size of a chunk (in bytes) and try to make the chunks "squarish" (in n-D). One thing to keep in mind is that the default chunk cache in HDF5 is 1MB, so it's probably worthwhile to keep chunks under half of that. A reasonable lower limit is a small multiple of the block size of a disk (usually 4KB).

Can the chunk cache size be increased programmatically? 1 MB seems low for scientific applications. Even cheap consumer PCs come with about half a gig of RAM. Scientific machines much more so. Wouldn't it be helpful to have 100 MB, for example?

Generally, you are trying to avoid the situation below:

Dataset with 10 chunks (dimension sizes don't really matter):

    +----+----+----+----+----+
    |    |    |    |    |    |
    |    |    |    |    |    |
    | A  | B  | C  | D  | E  |
    +----+----+----+----+----+
    |    |    |    |    |    |
    |    |    |    |    |    |
    | F  | G  | H  | I  | J  |
    +----+----+----+----+----+

If you are writing hyperslabs to part of each chunk like this (hyperslab 1 is in chunk A, hyperslab 2 is in chunk B, etc.):

    +----+----+----+----+----+
    |1111|2222|3333|4444|5555|
    |6666|7777|8888|9999|0000|
    | A  | B  | C  | D  | E  |
    +----+----+----+----+----+
    |    |    |    |    |    |
    |    |    |    |    |    |
    | F  | G  | H  | I  | J  |
    +----+----+----+----+----+

If the chunk cache is only large enough to hold 4 chunks, then chunk A will be preempted from the cache for chunk E (when hyperslab 5 is written), but will immediately be re-loaded to write hyperslab 6 out.

OK, great. Let me see if I can start to come up with the rules by which I can select chunk sizes:

1 - Min chunk size should be 4 KB.
2 - Max chunk size should allow n chunks to fit in the chunk cache, where n is around the max number of chunks the user will access at once in a hyperslab.

Unfortunately, our general purpose software can't predict the I/O pattern that users will access the data in, so it is a tough problem. On the one hand, you want to keep the chunks small enough that they will stick around in the cache until they are finished being written/read, but on the other hand you want the chunks to be larger so that the I/O on them is more efficient. :-/

I think we can make some reasonable guesses for netcdf-3.x access patterns, so that we can at least ensure the common tasks are working fast enough. Obviously any user can flummox our optimizations by doing some odd things we don't expect. As my old engineering professors told me: you can make it foolproof, but you can't make it damn-foolproof.

Ed
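
The cache-thrashing scenario in the diagram above can be checked with a small simulation. This is only a sketch: it assumes a plain least-recently-used eviction policy, which is a simplification of what HDF5 actually does, and the function name `count_chunk_loads` is made up for illustration. It counts how often a chunk has to be loaded when hyperslabs 1-5 and then 6-0 are written across chunks A-E.

```python
from collections import OrderedDict


def count_chunk_loads(access_order, cache_slots):
    """Count chunk loads for a simple LRU cache (an assumed policy, not HDF5's exact one)."""
    cache = OrderedDict()  # chunk id -> None, least recently used entry first
    loads = 0
    for chunk in access_order:
        if chunk in cache:
            cache.move_to_end(chunk)  # hit: mark as most recently used
            continue
        loads += 1  # miss: the chunk has to be read (or re-read) from disk
        cache[chunk] = None
        if len(cache) > cache_slots:
            cache.popitem(last=False)  # evict the least recently used chunk
    return loads


# Hyperslabs 1-5 touch chunks A-E, then hyperslabs 6-0 touch A-E again.
access_order = list("ABCDE") * 2

loads_small = count_chunk_loads(access_order, cache_slots=4)  # cache holds 4 chunks
loads_large = count_chunk_loads(access_order, cache_slots=5)  # cache holds 5 chunks
print(loads_small, loads_large)
```

Under this assumed policy, a cache that holds one chunk fewer than the row being written misses on every single access, while a cache that holds the whole row loads each chunk once. That is the motivation for rule 2 above. In the HDF5 C API, the raw data chunk cache size is set with `H5Pset_cache` on a file access property list.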
Perhaps we should have 3 modes of chunking that the user can choose:

1) Preserve the record-oriented nature of our current unlimited dimension to optimize sequential reading of the array.
2) Choose an optimal chunk size (some small multiple of the disk block size: 8K, 16K, 32K?) and subdivide the dimensions evenly to optimize over all types of subsetting.
3) Full user specification of chunk size and chunk dimension sizes.
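
As a rough illustration of how the three modes might translate into chunk shapes, here is a minimal sketch. Everything in it is an assumption made for the example: the function names, the mode names `"record"`, `"balanced"` and `"user"`, the 32 KB default target, and the reading of "subdivide the dimensions evenly" as "use the same edge length in every dimension". It is not the netCDF or HDF5 API.

```python
import math


def record_chunk_shape(dims):
    """Mode 1: one record per chunk, assuming the first dimension is the unlimited one."""
    return (1,) + tuple(dims[1:])


def squarish_chunk_shape(dims, elem_size, target_bytes=32 * 1024):
    """Mode 2: roughly equal edge lengths, keeping the chunk at or below target_bytes."""
    target_elems = max(1, target_bytes // elem_size)
    # Equal edge length in every dimension, rounded down so the chunk stays under target.
    edge = max(1, int(target_elems ** (1.0 / len(dims))))
    # Never make a chunk edge longer than the dimension itself.
    return tuple(min(d, edge) for d in dims)


def chunk_bytes(shape, elem_size):
    """Size of one chunk in bytes."""
    return math.prod(shape) * elem_size


def choose_chunk_shape(dims, elem_size, mode, user_shape=None):
    """Dispatch on the three (hypothetical) chunking modes."""
    if mode == "record":
        return record_chunk_shape(dims)
    if mode == "balanced":
        return squarish_chunk_shape(dims, elem_size)
    if mode == "user":
        # Mode 3: the caller supplies the full chunk shape.
        if user_shape is None or len(user_shape) != len(dims):
            raise ValueError("user mode needs one chunk length per dimension")
        return tuple(user_shape)
    raise ValueError("unknown mode: %r" % (mode,))


# A 2-D array of 4-byte values, 1000 x 1000.
balanced = choose_chunk_shape((1000, 1000), 4, "balanced")
print(balanced, chunk_bytes(balanced, 4))

# A record variable: unlimited time dimension first, then a 180 x 360 grid.
print(choose_chunk_shape((0, 180, 360), 4, "record"))
```

The record mode keeps each record contiguous, which suits sequential reads along the unlimited dimension, while the balanced mode trades that for more even performance across different subsetting directions.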