Best practices for using parallel HDF5

First, I would like to thank the HDF5 developers for providing the
excellent library and documentation.

We use parallel HDF5 to store the results of our simulation. Our
simulation code is an MPI application and is routinely run on a few
hundred processors. The simulation is time-consuming and can take
weeks to finish. Currently we save the results in one file (i.e., every
processor writes to the same file), with one group and one dataset that
has an unlimited time dimension. Before saving each record, the time
dimension is extended.

Recently, we had a hardware problem on one of the compute nodes, and
the simulation crashed. As a result, our HDF5 file was corrupted, and we
lost all the results of that simulation. This led me to wonder what
the best practices are for using parallel HDF5. I hope the list can
provide some guidance.

In the event of a system crash, how can I prevent file corruption, and
how can I minimize the loss of data?
Should I flush the buffer after each output, or close the dataset after
each output, or save each record in a new group, or save each record
in a new file? How much data loss should I expect in the worst
case (e.g., the system crashes during disk I/O)?
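
To make the one-file-per-record option concrete, here is a minimal
sketch in plain Python. It is not HDF5 code: raw bytes stand in for the
HDF5 write, and the function names, the record_NNNNNN.bin naming scheme,
and the .tmp suffix are all hypothetical. The idea is to write each
record under a temporary name, force it to disk, and only then rename
it, so a crash can at most lose the record being written. It assumes the
rename is atomic, which holds on POSIX file systems but should be
checked on a given parallel file system.

```python
import os
import tempfile


def write_record_atomically(directory, step, payload):
    """Write one record to its own file; earlier records are never reopened."""
    final_path = os.path.join(directory, f"record_{step:06d}.bin")
    tmp_path = final_path + ".tmp"
    with open(tmp_path, "wb") as f:
        f.write(payload)
        f.flush()             # push Python's buffer to the OS
        os.fsync(f.fileno())  # ask the OS to push its cache to disk
    # Rename only after the data is on disk (assumed atomic, see above).
    os.replace(tmp_path, final_path)
    return final_path


def completed_steps(directory):
    """Return the steps whose files were fully written (leftover .tmp files are skipped)."""
    steps = []
    for name in os.listdir(directory):
        if name.startswith("record_") and name.endswith(".bin"):
            steps.append(int(name[len("record_"):-len(".bin")]))
    return sorted(steps)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as out_dir:
        for step in range(3):
            write_record_atomically(out_dir, step, b"data for one time step")
        print(completed_steps(out_dir))
```

After a crash, a restart could call completed_steps() to find the last
record that was saved in full and resume from there.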

Thanks,

-- 
Eh Tan
Staff Scientist
Computational Infrastructure for Geodynamics
2750 E. Washington Blvd. Suite 210
Pasadena, CA 91107
(626) 395-1693
http://www.geodynamics.org
