
[netCDF #RPZ-106941]: Bug in netcdf (including 4.1.2-beta2)



Hi Joerg,

> sorry for the long delay, but it took a while to find time to narrow the
> problem down and create an appropriate test case. This test case is
> based on the output of ncgen of the correct netcdf file, but the order
> of some operations was changed and fill mode was disabled (I also
> replaced most of the actual data with a simple '0').

Thanks very much for your assistance in verifying this bug and
providing us with a way to reproduce it without a Lustre file system.

I've reproduced the bug in our latest daily snapshot release, using a
little simpler configuration, --disable-netcdf-4 --disable-dap, and
using gcc.  I've also verified that all 3 of your workarounds fix the
bug in the example you have provided.  Finally, I've verified that
"make check" completes successfully when built with your modification
that returns a blksize of 2**21, emulating the Lustre file system, so
our existing tests don't detect this bug.
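
As an illustration of the block-size override described above, here is a minimal sketch in C. It is not the library's blksize() function: preferred_blksize() and block_start() are hypothetical helpers, and the idea that buffers are aligned to multiples of the block size is an assumption that this message does not confirm.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/types.h>
#include <sys/stat.h>

/* Hypothetical helper (not netCDF's blksize()): return the preferred I/O
 * block size the file system reports for fd, or `fallback` if fstat()
 * fails or reports a non-positive size. */
size_t preferred_blksize(int fd, size_t fallback)
{
    struct stat sb;

    if (fstat(fd, &sb) == 0 && sb.st_blksize > 0)
        return (size_t)sb.st_blksize;
    return fallback;
}

/* Start offset of the blksize-aligned block that contains `offset`.
 * Assumes block-aligned buffering, which is an illustration only. */
size_t block_start(size_t offset, size_t blksize)
{
    assert(blksize > 0);
    return offset - (offset % blksize);
}
```

With the 2 MB (2097152-byte) block size returned by the patch, block_start(4187152, 2097152) is 2097152, so the write at offset 4187152 mentioned in the quoted report below would sit 7152 bytes short of the next block boundary at 4194304.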

The area of code in which the bug apparently occurs was written and
optimized by a developer who's no longer with us.  Previously a couple
of us have tried to understand the optimized buffering design well
enough to refactor this code for easier maintenance, without success.
It looks like we now have a reason to try a bit harder, because I
think understanding the code will be necessary to fix this bug.  I'm
relating all this so that you'll understand that this probably won't
be fixed by tomorrow :-).

In the meantime, I'm adding a note about this to our "Known
Problems" page:

  http://www.unidata.ucar.edu/software/netcdf/docs/known_problems.html

(Please let me know if you think that description is inaccurate in any
way or you want to suggest changes.)

I'm also considering announcing the problem on the netcdfgroup mailing
list, as otherwise writing corrupt files may go undetected!

Understanding and finding a fix for this problem will
be a priority, even though it has apparently been a bug in the library
for years, at least since the release of netCDF-3.6.2.  I also
intend to determine when the bug first appeared in a netCDF release,
as that may help in fixing the problem.

--Russ

> To reproduce this issue:
> ========================
> 1) Start with a clean copy of netcdf-4.1.2-beta2.tar.gz
> 
> 2) Apply the following patch, which just returns a blksize of 2MB
> (independent of the actual file system. 2MB is what Lustre would return,
> but the problem can be reproduced on any other file system as well, as
> long as the buffer size is 2MB):
> 
> *** posixio.c.orig 2011-03-08 03:32:08.000000000 +0000
> --- posixio.c 2011-03-08 03:32:26.000000000 +0000
> ***************
> *** 110,115 ****
> --- 110,116 ----
> static size_t
> blksize(int fd)
> {
> + return 2097152;
> #if defined(HAVE_ST_BLKSIZE)
> struct stat sb;
> if (fstat(fd, &sb) > -1)
> 
> 
> 3) Configure with:
> ./configure --disable-netcdf-4
> 
> 4) Build (I am using Intel compiler 11.1.046):
> make -j 8
> 
> 5) Copy the attached file xx.c to the netcdf root directory, then
> compile and link with:
> 
> icc xx.c -I ./include/ ./liblib/.libs/libnetcdf.a\
> ./libsrc/.libs/libnetcdf3.a -lcurl
> 
> (or do a make install etc)
> 
> 6) Run 'a.out'. It will create a file 'test_correct_new.nc'
> -rw-r--r-- 1 joergh sun_staff 4303312 Mar 8 03:38 test_correct_new.nc
> 
> 7) Check the field 'lvl':
> 
> > ./ncdump/ncdump test_correct_new.nc | grep lvl
> > lvl = 31 ;
> > float lvl(lvl) ;
> > lvl:long_name = "vertical levels" ;
> > lvl:type = "pressure" ;
> > lvl:units = "hPa" ;
> > lvl:positive = "down" ;
> > float zonal_wnd(time, lvl, lat, lon) ;
> > lvl = 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
> The field is all 0, but the correct values are:
> 
> > lvl = 1000, 995, 990, 985, 975, 950, 925, 900, 875, 850, 800, 750, 700, 600,
> (note, some other fields are also incorrectly set to 0).
> 
> Workarounds:
> ============
> I know of three ways to avoid this problem and get the correct results:
> 
> a) Don't disable fill mode, i.e. comment out line 121: //stat =
> nc_set_fill(ncid, NC_NOFILL, &old_fill_mode); /* set nofill */
> 
> b) Force blksize to return a smaller value, i.e. replace the line
> return 2097152;
> of the patch in 2) above with:
> return 2097152/2;
> My original workaround for the customer was to use a block size
> of 2093576, which was the maximum size which would still produce
> correct results. Unfortunately a different data set triggered
> (what I assume was) the same bug later. So imho there is no guarantee
> that a certain buffer size really fixes the problem - at least till
> the reason for the corrupt file is understood.
> 
> c) Enabling share mode (NC_SHARE) in line 119
> stat = nc_create("test_correct_new.nc", NC_CLOBBER|NC_SHARE, &ncid);
> 
> From what I have seen during debugging:
> - the level data is correctly written to netcdf's memory buffers
> (in the 'lower half').
> - When the application jumps ahead to write at offset 4187152
> (i.e. level 30 of the last field), the lvl field is moved
> from the 'lower half' of the buffer to the 'upper half'.
> - But later on the upper page is not paged out/written back
> to disk, it is just overwritten.
> 
> At this stage I've decided to stop debugging, since I am sure it's much
> easier for you now to find the problem with this test case than it would
> be for me to debug netcdf :)
> 
> Let me know if you have any problems reproducing the test case, need
> additional information, or want me to test a patch.
> 
> Our customer here would just be very interested to get an official fix
> soon (one that doesn't have the negative performance impact my
> workarounds have).
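
The debugging observations quoted above (data reaches the in-memory buffer, but a page holding it is later overwritten without being written back) describe a general class of write-back cache failure. The following is a deliberately simplified toy model of that failure class, assuming a single cached page. It is not the two-half buffer in posixio.c, and every name in it (toy_cache, toy_seek, toy_put, toy_close, TOY_PAGE) is hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define TOY_PAGE 8               /* hypothetical page size in bytes */
#define TOY_FILE (4 * TOY_PAGE)  /* hypothetical file size in bytes */

/* Toy single-page write-back cache; NOT netCDF's buffering code. */
typedef struct {
    unsigned char disk[TOY_FILE]; /* stands in for the file on disk */
    unsigned char page[TOY_PAGE]; /* the one cached page */
    size_t base;                  /* file offset of the cached page */
    bool dirty;                   /* cached page has unwritten changes */
} toy_cache;

/* Start with an all-zero file and a clean cached page at offset 0. */
void toy_init(toy_cache *c)
{
    memset(c, 0, sizeof *c);
}

/* Make the page containing `offset` the cached page. When write_back is
 * false, a dirty page is discarded instead of being written to disk,
 * which mimics the reported symptom. */
void toy_seek(toy_cache *c, size_t offset, bool write_back)
{
    size_t base = offset - (offset % TOY_PAGE);

    assert(offset < TOY_FILE);
    if (base == c->base)
        return;
    if (c->dirty && write_back)
        memcpy(c->disk + c->base, c->page, TOY_PAGE);
    memcpy(c->page, c->disk + base, TOY_PAGE);
    c->base = base;
    c->dirty = false;
}

/* Write one byte through the cache. */
void toy_put(toy_cache *c, size_t offset, unsigned char v, bool write_back)
{
    toy_seek(c, offset, write_back);
    c->page[offset - c->base] = v;
    c->dirty = true;
}

/* Flush the cached page, as closing the file would. */
void toy_close(toy_cache *c)
{
    if (c->dirty)
        memcpy(c->disk + c->base, c->page, TOY_PAGE);
    c->dirty = false;
}
```

In this model, writing a byte at offset 0, jumping ahead to write at offset 3 * TOY_PAGE, and then closing leaves the byte at offset 0 as zero on "disk" when write_back is false, while the later write survives. With write_back set to true both values reach the disk. The model only shows why such a loss can go unnoticed until the file is read back; it says nothing about where the actual defect lies.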

Russ Rew                                         UCAR Unidata Program
address@hidden                      http://www.unidata.ucar.edu



Ticket Details
===================
Ticket ID: RPZ-106941
Department: Support netCDF
Priority: Critical
Status: Closed