
[netCDF #XKU-673050]: Producing corrupt netcdf/hdf5 files with netcdf-4.2



Hi Timothy.

> I am attempting to run the PISM ice sheet model on the Ranger
> supercomputer, and I am getting corrupted output files.  I have
> investigated this issue for some time, but have been unable to find the
> problem in the application level code.
> 
> The problem is evident in the output files, which are written without
> error, but contain bad data.  It seems that part of the data that process-0
> is writing is not making it into the file, and when we look at process-0's
> region of the output file using ncdump, we see that the value at many
> points is '_', rather than a floating point value.  The data values written
> by other processes look fine.

Note that '_' is just ncdump's way of displaying a fill value.  That is
either the value of the "_FillValue" attribute for the variable you
are looking at or, if there is no such attribute, the default fill
value for the data type.  For floats the default is the value of
NC_FILL_FLOAT as defined in netcdf.h, around 9.97e+36.
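
To see how much of a region is still unwritten, you could count the
elements that equal the fill value after reading the data back.  Here is
a minimal sketch in C.  The helper name count_fill_values and the macro
SKETCH_FILL_FLOAT are hypothetical, and the constant is copied locally
only so the fragment compiles without netcdf.h; in real code you would
include netcdf.h and use NC_FILL_FLOAT, or the variable's "_FillValue"
attribute if it has one.

```c
#include <stddef.h>

/* Assumption: local copy of the default float fill value, so this sketch
 * compiles without netcdf.h. Real code should use NC_FILL_FLOAT. */
#define SKETCH_FILL_FLOAT (9.9692099683868690e+36f)

/* Hypothetical helper: count the elements of data[0..n-1] that are exactly
 * equal to the given fill value. Exact comparison is intended here, since
 * unwritten elements hold the fill value's exact bit pattern. */
size_t count_fill_values(const float *data, size_t n, float fill)
{
    size_t count = 0;
    for (size_t i = 0; i < n; i++) {
        if (data[i] == fill)
            count++;
    }
    return count;
}
```

Calling this on the buffer read back for each process's region of the
variable would show whether the fill values are confined to process-0's
region, as ncdump suggests.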

So ncdump is indicating that you're trying to display some values that
haven't been written yet.  One way this could happen is if you didn't
call the nc_close function (or the equivalent nf90_close or nf_close
for Fortran) to flush the memory buffers to disk before exiting (or
before looking at the file with ncdump).

> This problem only shows up if we open the file with nc_create/open_par and
> write the data in parallel.  If we do a serial write, funneling all data
> through process-0, the output is good.
> 
> The software stack we are using looks like this:
> netcdf-4.2
> hdf5-1.8.9
> mvapich2-1.2
> lustre file system
> linux kernel 2.6.18-238.19.1.el5.TACC (centos release 4.9)
> 
> I have built the netcdf and hdf5 packages myself, but below that are system
> provided modules.
> 
> I have not been able to reproduce the problem on any other systems at my
> disposal.
> 
> Any advice you can give will be appreciated.  Let me know if there is other
> information I can provide.

Presumably you configured the hdf5 library for parallel I/O, as
described here:

  http://www.unidata.ucar.edu/software/netcdf/docs/build_parallel.html

If you've done all this correctly, did you also configure netCDF-4
with "--enable-parallel-tests", and did you notice whether the
parallel tests succeeded when you ran "make check" on the resulting
netCDF build?

--Russ

Russ Rew                                         UCAR Unidata Program
address@hidden                      http://www.unidata.ucar.edu



Ticket Details
===================
Ticket ID: XKU-673050
Department: Support netCDF
Priority: Normal
Status: Closed