Re: [netcdf-hdf] a question about HDF5 and large file - why so long to write one value?


On Aug 20, 2007, at 5:02 PM, Ed Hartnett wrote:

Quincey Koziol <koziol@xxxxxxxxxxxx> writes:

        The problem is in your computation of the chunk size for the
dataset, in libsrc4/nc4hdf.c, around lines 1059-1084.  The current
computations end up with a chunk of size equal to the dimension size
(2147483644/4 in the code below), i.e. a single 4GB chunk for the
entire dataset.  This is not going to work well, since HDF5 always
reads an entire chunk into memory, updates it and then writes the
entire chunk back out to disk. ;-)

        That section of code looks like it has the beginning of some
heuristics for automatically tuning the chunk size, but it would
probably be better to let the application set a particular chunk
size, if possible.


Ah ha! Well, that's not going to work!

What would be a good chunksize for this (admittedly weird) test case:
writing one value at a time for a huge array. Would a chunksize of one
be crazy? Or the right size?

I do think it's better to force the user to give you a chunk size. Definitely _don't_ use a chunk size of one; the B-tree used to locate the chunks will be insanely huge. :-(

However, if you are going to attempt a heuristic for picking a chunk size, here are my best current thoughts on it: try to get a chunk of a reasonable size (1MB, say), but make certain that it contains at least one element, in the case of _really_ big compound datatypes :-). Then try to make the chunk as "square" as possible (i.e. try to make the chunk size equal in all dimensions). That should give you something reasonable, at least... ;-)

        Quincey
