Re: best practice of using parallel hdf5

NOTE: The netcdf-hdf mailing list is no longer active. The list archives are made available for historical reasons.

To: netcdf-hdf@xxxxxxxxxxxxxxxx
Subject: Re: best practice of using parallel hdf5
From: Quincey Koziol <koziol@xxxxxxxxxxxx>
Date: Tue, 28 Nov 2006 15:16:25 -0600

Hi Eh,
        I'll pipe up here, since this is more HDF5 specific...

On Nov 27, 2006, at 8:24 PM, Eh Tan wrote:

First, I would like to thank the HDF5 developers for providing the
excellent library and documentation.

We use parallel HDF5 to store the results of our simulation. Our
simulation code is an MPI application and is routinely run on a few
hundred processors. The simulation is time consuming and can take
weeks to finish. Currently we save the results in one file (i.e., every
processor writes to the same file), one datagroup, one dataset, with an
unlimited time dimension. Before saving each record, the time dimension
is extended.

Recently, we had a hardware problem on one of the computation nodes, and
the simulation crashed. As a result, our HDF5 file was corrupted, and we
lost all the results of that simulation. This led me to wonder what
the best practice is for using parallel HDF5. I hope the list can provide
some guidance.

In the event of a system crash, how can I prevent file corruption, and
how can I minimize the loss of data?

Should I flush the buffer after each output, or close the dataset after
each output, or save each record in a new datagroup, or save each record
in a new file? How much data loss should I expect in the worst-case
scenario (e.g., the system crashes during disk I/O)?

Generally, it's a good idea to call H5Fflush (or the equivalent netCDF
API call) after each major "phase" of writing to the file. This will
flush metadata changes out to the disk. However, it is still possible
that incremental changes may be made to the file as metadata is evicted
from the HDF5 internal caches that would create a "corrupt" file if the
rest of the changes don't make it into the file. Flushing too often may
create additional I/O though, so you'll need to find a balance that's
appropriate for your application.
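
The balance between flushing too rarely and too often can be sketched as
a small count-based policy. This is only an illustration under stated
assumptions, not part of HDF5 or netCDF: the FlushPolicy class, its
every_n parameter, and the run_demo helper are hypothetical names, and
flush_fn stands in for whatever really flushes the file (for example a
wrapper around H5Fflush(file_id, H5F_SCOPE_GLOBAL) in C, or the flush()
method of an h5py File object). A record count is used rather than a
wall-clock timer so that every MPI rank reaches the same decision at the
same step, on the assumption that the flush has to be called by all
ranks together.

```python
class FlushPolicy:
    """Call flush_fn once every `every_n` written records (hypothetical helper)."""

    def __init__(self, flush_fn, every_n=10):
        if every_n < 1:
            raise ValueError("every_n must be at least 1")
        self.flush_fn = flush_fn  # assumption: a callable that flushes the file
        self.every_n = every_n
        self.pending = 0          # records written since the last flush

    def record_written(self):
        """Register one written record; flush when the threshold is reached."""
        self.pending += 1
        if self.pending >= self.every_n:
            self.flush_fn()
            self.pending = 0
            return True
        return False


def run_demo(n_records=25, every_n=10):
    """Simulate a run and report at which records a flush happened."""
    flushed_at = []
    current = [0]  # record index, visible to the stand-in flush function

    def fake_flush():
        # Stand-in for the real flush call; it only notes when it was invoked.
        flushed_at.append(current[0])

    policy = FlushPolicy(fake_flush, every_n=every_n)
    for i in range(1, n_records + 1):
        current[0] = i
        # ... extend the time dimension and write record i here ...
        policy.record_written()
    return flushed_at, policy.pending


if __name__ == "__main__":
    flushed_at, pending = run_demo()
    print("flushed after records:", flushed_at)
    print("records not yet flushed:", pending)
```

A smaller every_n narrows the window of records that have not been
flushed when a crash hits, at the cost of more I/O. As noted above, a
flush does not by itself guarantee that the file stays uncorrupted
between flushes.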

We have some funding from Sandia National Lab to improve this situation
by essentially "journaling" the changes to the file, which will always
leave the file in a known "good" state, although possibly missing some
of the last metadata changes. I expect that this will take 6-8 months to
deliver, however, so it's not a short-term solution.


        Quincey

Thanks,

--
Eh Tan
Staff Scientist
Computational Infrastructure for Geodynamics
2750 E. Washington Blvd. Suite 210
Pasadena, CA 91107
(626) 395-1693
http://www.geodynamics.org


