[netcdf-hdf] NetCDF: HDF error, and now what?

NetCDF gurus:

After successfully prototyping our parallel NetCDF code, we have rolled it
into a large community app (MFIX) and are now getting sporadic "NetCDF: HDF
error" failures during runs.  These, unsurprisingly, coincide with failures
to write portions of the related variable fields.

The errors happen during put_vars() and occur across all PEs at the same
random time, and also in the subsequent close() of only one associated PE.
In one of the smallest cases, we are writing ~100 files of about 600K each.
The problem strikes every 15 or 20 files, and varies in both the file and
the fields that are affected.  With larger files it occurs more frequently:
almost every other file with the 300MB files we need for production.  Again,
it occurs in different fields and files within a run and from run to run.
We are using netcdf 4.1.3 and HDF5 1.8.7.

My question is, how can I drill further into this problem?  I am at a loss
as to how to proceed.  It would be nice to force HDF to be more specific,
of course, but all debugging suggestions are most welcome.

Thanks,

John Urbanic