
Re: 20000306: netCDF 3.4 problem



>To: address@hidden
>cc: address@hidden
>From: Ruth Curry <address@hidden>
>Subject: netCDF 3.4 problem
>Organization: Woods Hole Oceanographic Institution
>Keywords: 200003061959.MAA03196

Hi Ruth,

> I use netCDF to store gridded data as part of a software package
> (HydroBase) that has been in use for a while now -- and pretty
> thoroughly tested.  Recently, I updated the software to reflect
> ANSI/POSIX standards and have come across an occasional problem with
> CDF files that are generated with the version 3.4 libraries.  This
> problem is absent from the earlier implementation v 2.4.3.
>
> First of all, I am using the straight C interface to the netCDF
> libraries.
>
> Under certain circumstances, the netcdf file that is generated by my
> software has a bunch of null values beginning at the start of the
> file to some offset in the file (e.g. 4096 bytes from the start of
> file).  Eventually some data gets written, but of course the file is
> not readable because the first three bytes are not the requisite hex
> bytes : 4344 4601 0000 0000 0000 000a that identifies it as a netcdf
> file.  There are never any error messages reported during the write
> operations.  In fact all goes rather smoothly until I try to read
> back one of these files and the nc_open() routine reports that it is
> not a netcdf file.
>
> The problem is dependent on the size of the arrays being used to
> write the data out.  My software module dynamically allocates memory
> according to computational need.  The netcdf system works fine for
> smaller grids (e.g. size < 3002 * sizeof (double)), but begins to
> have problems for an array size >= 3081 * sizeof (double).  The
> problem is completely repeatable -- i.e. in different runs with the
> same gridsize, the number of initial null bytes is the same.  The
> number of lead null values, however, does vary among runs using
> different gridsizes.
>
> I have watched the behavior of the code in a debugger and have a
> fairly good idea of how it is all working.  A difficulty arises
> because the output stream gets buffered and subsequently written out
> in blocks to the netcdf file.  I am fairly certain that none of the
> data arrays or other areas of the dynamic memory segment are being
> overrun during the computation phase.  There are no segment
> violations or bus errors.

Unfortunately it's relatively easy in C to have subtle pointer
problems result in bugs that don't trigger segmentation violations or
bus errors, but that just overwrite data elsewhere.  The best way to
check for this kind of problem is to use something like Purify or
Sun's dbx Run Time Checking features (maybe HPUX has something
similar).
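
If no such tool is at hand, a rough substitute is to pad each
dynamically allocated array with guard bytes and check them before
every write to the file.  The following is only a minimal sketch of
that idea, not a replacement for Purify; guarded_alloc, guard_intact,
GUARD_LEN and GUARD_BYTE are hypothetical names chosen for
illustration, and the sketch only catches overruns that land in the
bytes immediately after the array.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

#define GUARD_LEN  16     /* number of guard bytes after the array (assumption) */
#define GUARD_BYTE 0xA5   /* arbitrary pattern unlikely to occur by accident */

/* Allocate n usable bytes followed by GUARD_LEN bytes of a known pattern.
 * Returns NULL if the allocation fails. */
void *guarded_alloc(size_t n)
{
    unsigned char *p = malloc(n + GUARD_LEN);

    if (p != NULL)
        memset(p + n, GUARD_BYTE, GUARD_LEN);
    return p;
}

/* Return 1 if the guard bytes after the n usable bytes are intact, else 0. */
int guard_intact(const void *buf, size_t n)
{
    const unsigned char *g;
    size_t i;

    assert(buf != NULL);
    g = (const unsigned char *)buf + n;
    for (i = 0; i < GUARD_LEN; i++)
        if (g[i] != GUARD_BYTE)
            return 0;
    return 1;
}
```

Calling guard_intact() on each array just before it is handed to the
nc_put_vara_*() functions would show whether the computation phase has
written past the end of that array.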

> After the initialization of the cdf_file with nc_create(), the
> output file contains 32 bytes of null values.  (I am using the flag
> NC_CLOBBER to initialize the file.)  My program next does some
> computation then calls netcdf library functions to define
> dimensions, variables, and attributes, finally leaving cdf-define
> mode with a call to nc_enddef().  At this point, nothing further has
> been written out to the file (it is all being buffered by the system
> = HPUX 10.20 on a 735/125 workstation.)
>
> The software then begins to write out data values to the file with:
>
>  nc_put_vara_float()
>  nc_put_vara_int()
>  nc_put_vara_short()
>
> When the buffer becomes full, it is flushed and this is when data
> begins to appear in the output file. In my code this always occurs
> with the call to nc_put_vara_short().  (I'm not sure that there is
> any significance to that -- or if it just happens to be the point in
> my code where that file buffer gets flushed?)  Again, for runs where
> the gridsize is on the smaller side, the initial null values placed
> in the file by nc_create() are overwritten and the first 8 bytes are
> the requisite string.  When the problem occurs, the first few bytes
> following all the leading null values are sometimes the requisite
> string -- but sometimes not.
>
> I deduce from all this, that the buffered output is at times being
> corrupted -- and somehow this is related to the size of data arrays
> being written with the nc_put_vara_type() functions.  The same software,
> linked with netCDF version 2.4.3 libs, does not manifest this
> behavior.
>
> This is a long-winded explanation, I realize.  If this problem rings
> any bells with you, I'd appreciate your take on it.....  I'm not
> sure how to investigate (with a debugger, for instance) the status
> of the file buffer.  Any advice?

Three possible explanations come to mind:

 1.  You are not calling the nc_close function, which is necessary to
     flush any buffers before exiting.  It's not enough to just depend
     on the system to call close on the associated file descriptor,
     you must explicitly call nc_close to make sure all the data gets
     written.  Usually not calling nc_close causes problems at the end
     of the file rather than in the header, but there may be
     circumstances where what you describe could be a symptom of
     neglecting to properly close the netCDF file.

 2.  You've encountered a variation of the "redef bug", which is a
     fairly obscure bug fixed in the netCDF 3.5 beta release.  Here's
     the description from the 3.5 Release Notes at
     http://www.unidata.ucar.edu/packages/netcdf/release-notes-3.5beta2.html

        Fixed the "redef bug" that occurred when nc_enddef() or
        nf_enddef() is called after nc_redef() or nf_redef(), the file
        is growing such that the new beginning of a record variable is
        in the next "chunk", and the size of at least one record
        variable exceeds the chunk size (see netcdf.3 man page for a
        description of this tuning parameter and how to set it). This
        bug resulted in corruption of some values in other variables
        than the one being added.

 3.  You've found a new bug that we should try to duplicate and fix.
     I'm skeptical about this, since netCDF 3.4 is widely used and
     the only bugs that have been reported are the ones described in
     the 3.5 release notes, but I'd be interested in trying to
     duplicate the problem if it's possible to isolate it to a test
     case that's small enough to send.
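
When narrowing this down, it may help to check each output file
directly rather than waiting for nc_open() to reject it.  The sketch
below reads the start of a file, tests for the four magic bytes
'C' 'D' 'F' 0x01 (the 4344 4601 in your dump), and counts the leading
null bytes.  The names inspect_header and header_report are
hypothetical, not part of the netCDF library, and the check assumes
the classic file format written by netCDF 3.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* What was found at the start of the file. */
typedef struct {
    int  has_magic;      /* 1 if the file begins with 'C' 'D' 'F' 0x01 */
    long leading_nulls;  /* zero bytes before the first nonzero byte or EOF */
} header_report;

/* Inspect the start of the file at path.
 * Returns 0 on success, -1 if the file cannot be opened. */
int inspect_header(const char *path, header_report *out)
{
    static const unsigned char magic[4] = { 'C', 'D', 'F', 0x01 };
    unsigned char first[4];
    size_t n;
    int c;
    FILE *fp;

    assert(path != NULL && out != NULL);

    fp = fopen(path, "rb");
    if (fp == NULL)
        return -1;

    /* Compare the first four bytes against the magic number. */
    n = fread(first, 1, sizeof first, fp);
    out->has_magic = (n == sizeof first &&
                      memcmp(first, magic, sizeof magic) == 0);

    /* Count null bytes from the start of the file. */
    rewind(fp);
    out->leading_nulls = 0;
    while ((c = getc(fp)) == 0)
        out->leading_nulls++;

    fclose(fp);
    return 0;
}
```

Running this on the files from several grid sizes would record which
sizes produce a bad header and how many null bytes precede the data,
which is the kind of detail a small test case needs.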

Please check 1. first, by inserting an nc_close call before you exit
(if it's missing) and see if the problem goes away.  If that doesn't
work, please try the netCDF 3.5 beta release available from 

  ftp://ftp.unidata.ucar.edu/pub/netcdf/netcdf-3.5-beta2.tar.Z

(or in ZIP form from the same directory), to see if it's a bug that's
already been fixed.  If neither of these solves the problem, then we'll
try to duplicate it here ...

--Russ