"John Urbanic" <urbanic@xxxxxxx> writes:

> NetCDF gurus:
> After successfully prototyping our parallel netcdf code, we have rolled it
> into a large community app (MFIX) and are now getting sporadic "NetCDF:
> HDF error" errors during runs.  This, unsurprisingly, coincides with
> failure to write portions of related variable fields.
> These happen during put_vars(), and occur across all PEs at that random
> time, and also in only one associated PE's subsequent close().  In
> one of the smallest cases, we are writing ~100, 600K files.  This problem
> will strike every 15 or 20 files, and will vary both in the file and the
> fields that are affected.  With larger files it occurs more frequently -
> almost every other file with the 300MB files we need for production. 
> Again, it occurs in different fields and files within runs and from run to
> run.  We are using netcdf 4.1.3 and hdf 1.8.7.
> My question is, how can I possibly drill further into this problem?  I am
> at a loss as to how to proceed.  It would be nice to force HDF to be more
> specific, of course, but all debugging suggestions are most welcome.

If you build netCDF with --enable-logging, then put the following in
your code:

    nc_set_log_level(3);

(There is also a Fortran version.)

You will then get a ton of output. Try changing the "3" to a "1" to
get less output, or to a "5" to get more.
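
If you expect to flip between levels a lot, here is a rough sketch of one way to do it without recompiling each time. It assumes the logging call is nc_set_log_level() from netcdf.h; everything else is made up for illustration: the helper names, the NC_LOG_LEVEL environment variable (netCDF itself does not read it), and the USE_NETCDF build flag, which only exists so the sketch compiles on a machine without netCDF installed.

```c
#include <stdio.h>
#include <stdlib.h>

#ifdef USE_NETCDF   /* hypothetical flag: build with -DUSE_NETCDF -lnetcdf */
#include <netcdf.h>
#endif

/* Hypothetical helper: turn a string into a log level. Returns `fallback`
   when the string is missing, empty, or not a whole number. */
int parse_log_level(const char *text, int fallback)
{
    char *end = NULL;
    long value;

    if (text == NULL || *text == '\0')
        return fallback;
    value = strtol(text, &end, 10);
    if (*end != '\0')           /* trailing junk such as "2x" */
        return fallback;
    return (int)value;
}

/* Hypothetical helper: call once early in the program, for example right
   after MPI_Init(). Returns the level that was requested. */
int enable_netcdf_logging(void)
{
    /* NC_LOG_LEVEL is an illustrative name, not something netCDF reads. */
    int level = parse_log_level(getenv("NC_LOG_LEVEL"), 3);

#ifdef USE_NETCDF
    /* Only produces output if netCDF was configured with --enable-logging. */
    nc_set_log_level(level);
#endif
    return level;
}
```

With something like this in place, setting NC_LOG_LEVEL=1 in the job script would give the quieter output and NC_LOG_LEVEL=5 the noisier one, and leaving it unset keeps the "3" from above.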

If this doesn't work, fire up the parallel debugger and see where HDF5
and netCDF are failing to get along...

Good luck,


Ed Hartnett  -- ed@xxxxxxxxxxxxxxxx