
[netCDF #RPZ-106941]: Bug in netcdf (including 4.1.2-beta2)



Hi Jeorg,

> I have seen some problems with netcdf caused by a large blksize when
> writing a file on a lustre file system. On our systems (x86_64 based
> linux system, self compiled with intel compilers 11.1.046) fstat returns
> a recommended block size of 2097152, and while I haven't looked any
> further, a lvl field is all zeroed (which is incorrect ;) ).
> 
> If I change the return value of blksize (to 8192, but some larger values
> work, too), the field is written back correctly. Also if I open the
> netcdf file with disabled buffering:
> 
> NCID = NCCRE(NCFILE(1:LNCFILE),NCCLOB,IER)
> ! JH: This fixes the problem as well.
> !NCID = NCCRE(NCFILE(1:LNCFILE),NCCLOB+nf_share,IER)
> 
> the field is written back correctly. I have seen the error with netcdf
> versions 3.6.2, 3.6.3, 4.1.1, 4.1.2-beta1, 4.1.2-beta2. For now we are
> using a patched version of netcdf (as above), but would obviously be
> interested in a proper fix :)
> 
> Unfortunately I don't know the application (grib to netcdf converter),
> nor too much about netcdf (I am working as "application support" for the
> Australian Bureau of Meteorology, but haven't used much netcdf or grib).
> And it's rather complicated to package up the application (it has
> implicit dependencies on several shared data files, so it would need
> some time to create a test case for you, and also I have to find out if
> I can give you the data files in the first place).
> 
> Do you have any suggestion on what to do next? Any debugging features I
> could enable?

Make sure your netCDF library is built with assertions enabled.
Assertion checking is on by default, and it is only turned off if you
configure with something like CFLAGS="-DNDEBUG".  Ideally, an assertion
would fail while you run the test you described, in which case we
would be very interested in a gdb backtrace from that failure.
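
To illustrate why the build flag matters, here is a minimal sketch in
C.  It is not netCDF code: the function names and the span check are
hypothetical and only show how assert() behaves with and without
-DNDEBUG.

```c
#include <assert.h>
#include <stddef.h>

/* Report whether assert() is live in this translation unit.
 * Compiling with -DNDEBUG turns every assert() into a no-op. */
int assertions_enabled(void)
{
#ifdef NDEBUG
    return 0;
#else
    return 1;
#endif
}

/* Hypothetical internal check (an assumption for illustration, not
 * taken from the netCDF source): a requested span must lie inside a
 * buffer of blksz bytes.  With assertions on, a violation aborts the
 * process, and gdb can show a backtrace at that point.  With NDEBUG
 * defined, the checks vanish and the function simply returns. */
size_t span_end(size_t offset, size_t extent, size_t blksz)
{
    assert(extent <= blksz);
    assert(offset <= blksz - extent);
    return offset + extent;
}
```

If assertions_enabled() returns 0 in a build made with the same flags
as the library, assertion failures cannot show up in the test run.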

We don't have access to a Lustre file system on which to reproduce and
debug this problem, and haven't seen reports of it on other platforms.
When you build netCDF from source, does running "make check" on such a
platform (with assertion checking left on) result in any errors?

Otherwise, this sounds like a serious problem, if it just returns
zeros instead of crashing.  We would need some way to reproduce it
here on a relatively small test case.

For now, I'll check whether we can find a file system we can
configure with a much larger block size, to see if the bug shows up.
I'll let you know if I can reproduce the problem that way.
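
For comparing file systems, a small sketch like the following may help
confirm what block size the system recommends for a given path.  The
function name is hypothetical, and it calls stat() on a path rather
than fstat() on an open descriptor; both fill in the same st_blksize
field.

```c
#include <sys/types.h>
#include <sys/stat.h>

/* Return the preferred I/O block size that the file system reports
 * for path (the st_blksize field), or -1 if stat() fails. */
long preferred_blksize(const char *path)
{
    struct stat sb;

    if (stat(path, &sb) != 0)
        return -1;
    return (long)sb.st_blksize;
}
```

Calling it on a directory of the Lustre mount in question would be
expected to give the 2097152 mentioned in the report, while many local
file systems report a much smaller value.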

Thanks for reporting the problem.

--Russ

Russ Rew                                         UCAR Unidata Program
address@hidden                      http://www.unidata.ucar.edu



Ticket Details
===================
Ticket ID: RPZ-106941
Department: Support netCDF
Priority: Normal
Status: Closed