Hi Eh,

I'll pipe up here, since this is more HDF5-specific...

On Nov 27, 2006, at 8:24 PM, Eh Tan wrote:
First, I would like to thank the HDF5 developers for providing the excellent library and documentation.

We use parallel hdf5 to store the results of our simulation. Our simulation code is an MPI application and is routinely run with a few hundred processors. The simulation is time consuming and can take weeks to finish. Currently we save the results in one file (i.e., every processor writes to the same file), one datagroup, one dataset, with unlimited time dimension. Before saving each record, the time dimension is extended.

Recently, we had a hardware problem on one of the computation nodes, and the simulation crashed. As a result, our hdf5 file was corrupted, and we lost all the results of that simulation. This led me to wonder what the best practice is for using parallel hdf5. I hope the list can provide some guidance.

In the event of a system crash, how can I prevent file corruption and how can I minimize the loss of data?

Should I flush the buffer after each output, or close the dataset after each output, or save each record in a new datagroup, or save each record in a new file? How much data loss should I expect in the worst scenario (e.g., the system crashes during disk I/O)?
Generally, it's a good idea to call H5Fflush (or the equivalent netCDF API call) after each major "phase" of writing to the file. This will flush metadata changes out to the disk. However, a crash between flushes can still leave you with a "corrupt" file: as metadata is evicted from the HDF5 internal caches, incremental changes are written to the file, and if the rest of the changes don't make it into the file, what is on disk is inconsistent. Flushing too often may create additional I/O though, so you'll need to find a balance that's appropriate for your application.
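To make the "flush after each phase" idea concrete, here is a minimal serial sketch using h5py, whose File.flush() asks the HDF5 library to flush its buffers for that file. The function name run_simulation, the dataset name "results", and the flush_every parameter are made up for illustration and are not from the original code. It also does not use MPI; in a parallel run, the dataset extension and the flush would need to be coordinated across all processes.

```python
import os
import tempfile

import h5py
import numpy as np


def run_simulation(path, n_steps, n_values, flush_every=1):
    """Append one record per step; flush after every `flush_every` records.

    Hypothetical helper for illustration only (serial, no MPI).
    """
    with h5py.File(path, "w") as f:
        # Extendable dataset: the time dimension starts empty and is unlimited.
        dset = f.create_dataset(
            "results",
            shape=(0, n_values),
            maxshape=(None, n_values),
            dtype="f8",
            chunks=(1, n_values),
        )
        for step in range(n_steps):
            record = np.full(n_values, float(step))  # stand-in for real output
            dset.resize(step + 1, axis=0)  # extend the time dimension by one
            dset[step, :] = record
            if (step + 1) % flush_every == 0:
                # End of a write "phase": push buffered changes to the file.
                f.flush()


if __name__ == "__main__":
    path = os.path.join(tempfile.mkdtemp(), "sim.h5")
    run_simulation(path, n_steps=5, n_values=4, flush_every=2)
    with h5py.File(path, "r") as f:
        print(f["results"].shape)
```

Raising flush_every trades fewer flushes (less extra I/O) against a longer window in which a crash can catch the file between consistent states, which is the balance described above.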
We have some funding from Sandia National Lab to improve this situation by essentially "journaling" the changes to the file, which will always leave the file in a known "good" state, although possibly missing some of the last metadata changes. I expect that this will take 6-8 months to deliver, however, so it's not a short-term solution.
Thanks,
--
Eh Tan
Staff Scientist
Computational Infrastructure for Geodynamics
2750 E. Washington Blvd. Suite 210
Pasadena, CA 91107
(626) 395-1693
http://www.geodynamics.org