Re: [netcdf-hdf] [netcdfgroup] NetCDF: HDF error, and now what?

  • To: "John Urbanic" <urbanic@xxxxxxx>
  • Subject: Re: [netcdf-hdf] [netcdfgroup] NetCDF: HDF error, and now what?
  • From: Ed Hartnett <ed@xxxxxxxxxxxxxxxx>
  • Date: Mon, 24 Oct 2011 02:33:09 -0600
"John Urbanic" <urbanic@xxxxxxx> writes:

> NetCDF gurus:
>
> After successfully prototyping our parallel netcdf code, we have rolled it
> into a large community app (MFIX) and are now getting sporadic "NetCDF:
> HDF error" errors during runs.  This, unsurprisingly, coincides with
> failure to write portions of related variable fields.
>
> These happen during put_vars() and occur across all PEs at that random
> time, and also during the subsequent close() on only one associated PE.
> In one of the smallest cases, we are writing ~100 files of 600K each.  This problem
> will strike every 15 or 20 files, and will vary both in the file and the
> fields that are affected.  With larger files it occurs more frequently -
> almost every other file with the 300MB files we need for production. 
> Again, it occurs in different fields and files within runs and from run to
> run.  We are using netCDF 4.1.3 and HDF5 1.8.7.
>
> My question is, how can I possibly drill further into this problem?  I am
> at a loss as to how to proceed.  It would be nice to force HDF to be more
> specific, of course, but all debugging suggestions are most welcome.

If you build netCDF with --enable-logging, then put the following in
your code:

nc_set_log_level(3);

(There is also a fortran version.)

You will then get a ton of output. Try changing the "3" to a "1" to
get less output, or to a "5" to get more.

If this doesn't work, fire up the parallel debugger and see where HDF5
and netCDF are failing to get along...

Good luck,

Ed

-- 
Ed Hartnett  -- ed@xxxxxxxxxxxxxxxx