
Re: Problem on SGI



>To: address@hidden
>From: Matthew Bettencourt <address@hidden>
>Subject: Re: 20030903: Problem on SGI 
>Organization: 
>Keywords: C++ sync bug

Matt,

> I am seeing a problem on an SGI O3K machine with netcdf under C++.  I 
> have a multithreaded program writing to many different netCDF files, and 
> intermittently this causes a netCDF error of "File exists" at many 
> different places.  Sometimes on a put_rec and sometimes on a sync, never 
> on a file create.  Here is part of the traceback:
> 
>     19 __exit(0x3, 0x4, 0xffff, 0x0, 0x0, 0x1018b9e8, 0x1200f948, 
> 0x42142f0) 
> ["/xlv55/patches/5194/work/irix/lib/libc/libc_n32_M4/gen/cuexit.c":60, 
> 0xfb05a9c]
>     20 nc_advise(0x15, 0x11, 0x10124b20, 0x1a, 0x0, 0x1018b9e8, 
> 0x1200f948, 0x42142f0) 
> ["/Work/mattchl/debug/netcdf-3.5.1-beta11/src/libsrc/v2i.c":130, 0x100e04d8]
>     21 ncsync(0x15, 0x4, 0xffff, 0x0, 0x0, 0x1018b9e8, 0x1200f948, 
> 0x42142f0) 
> ["/Work/mattchl/debug/netcdf-3.5.1-beta11/src/libsrc/v2i.c":257, 0x100e08bc]
>     22 NcFile::sync(void)(0x105d1df0, 0x4, 0xffff, 0x0, 0x0, 0x1018b9e8, 
> 0x1200f948, 0x42142f0) 
> ["/Work/mattchl/debug/netcdf-3.5.1-beta11/src/cxx/netcdf.cpp":225, 
> 0x100ceba8]
>     23 DataStorage::storeList(void)(this = 0x10164064) 
> ["/Work/mattchl/debug/MCELSystem/MCEL/DataStorage.cc":90, 0x100aa27c]
> 
> 
> Now, if I look at the file datastruct I see
> (dbx) p *(NcFile*)0x105d1df0
> class NcFile {
>      the_id = 26
>      in_define_mode = 0
>      the_fill_mode = Fill__6NcFile=0
>      dimensions = 0x10c654d8
>      variables = 0x10571b40
>      globalv = 0x10c4c7d8
> }
> 
> Here is the odd thing.  If you look at the traceback you will see that 
> NcFile::sync is called with the_id = 26.  However, once we go 
> into the call ncsync(the_id), the id is 21 (ncsync(0x15, ...)):
>      if (ncsync(the_id) == ncBad)
> 
> 
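The hexadecimal values in the traceback above can be decoded with a few lines of C++. This is only a sketch for reading the dump; hex_to_long is a hypothetical helper, not part of netCDF or dbx. The first value shown for ncsync, 0x15, is 21, while the fourth value shown for nc_advise, 0x1a, is 26, the same as the_id. That may mean the values dbx prints for the ncsync frame are not the real arguments, but that is an assumption, not something confirmed in this thread.

```cpp
#include <cstdlib>

// Hypothetical helper: convert a hexadecimal string such as "0x15"
// to its integer value. strtol with base 16 accepts the "0x" prefix.
long hex_to_long(const char* text) {
    return std::strtol(text, nullptr, 16);
}

// Values copied from the traceback and the dbx dump:
//   hex_to_long("0x15")  -> first value shown for ncsync
//   hex_to_long("0x1a")  -> fourth value shown for nc_advise
//   hex_to_long("0x11")  -> second value shown for nc_advise
```

Comparing hex_to_long("0x1a") with the_id = 26 from the dbx dump is a quick way to check which frame's values agree with the NcFile object.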
> Has anyone seen this before?  Any help would be great.  I am thinking 
> this is a bug with the SGI compiler, but I am not sure.  I have put locks 
> around all my netcdf calls so I am only calling one routine at a time, so 
> I don't think it is a thread-safety issue.
> 
> If anyone has seen this please help point me in the right direction.....
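For readers following along, the locking described above (one lock around every netCDF call) might look roughly like the sketch below. It assumes a C++11 compiler; netcdf_lock and with_netcdf_lock are hypothetical names invented for illustration, and nothing here is taken from the original program or from the netCDF library.

```cpp
#include <mutex>
#include <utility>

// Hypothetical process-wide lock shared by every netCDF call site.
inline std::mutex& netcdf_lock() {
    static std::mutex m;
    return m;
}

// Run any callable while holding the lock and return its result.
// The lock is released when the guard goes out of scope, even if
// the callable throws.
template <typename F>
auto with_netcdf_lock(F&& call) -> decltype(call()) {
    std::lock_guard<std::mutex> guard(netcdf_lock());
    return std::forward<F>(call)();
}
```

A call site would then wrap each library call in a lambda, for example with_netcdf_lock([&] { return file.sync(); }), where file is some NcFile object in the caller's code. Serializing calls this way only helps if every call into the library goes through the same lock.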

I think you may have uncovered a netCDF bug (or bugs) in the C++
interface.

I just looked at the NcVar::sync() method, which is called for each
variable in the file when NcFile::sync() is called, and it doesn't
make sense to me.  What I don't understand is why it resets the
variable cursor and deletes the cur_rec[] array without making a new
one.
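To illustrate why that pattern looks suspicious, here is a minimal sketch. It is NOT the NcVar source: CursorVar and its members are hypothetical stand-ins that only mirror the behavior described above (a sync that frees the cursor array without replacing it), next to a variant that allocates a fresh array.

```cpp
#include <cstddef>

// Hypothetical stand-in for a variable that keeps a per-dimension cursor.
class CursorVar {
public:
    explicit CursorVar(std::size_t ndims)
        : ndims_(ndims), cur_rec_(new long[ndims]()) {}
    ~CursorVar() { delete[] cur_rec_; }
    CursorVar(const CursorVar&) = delete;
    CursorVar& operator=(const CursorVar&) = delete;

    // Pattern in question: free the array and do not make a new one.
    // The pointer is nulled here so the missing array is detectable;
    // with a bare delete[] the pointer would dangle and later writes
    // would corrupt memory.
    void sync_without_realloc() {
        delete[] cur_rec_;
        cur_rec_ = nullptr;
    }

    // Alternative: allocate a zeroed array first, then release the old one.
    void sync_with_realloc() {
        long* fresh = new long[ndims_]();
        delete[] cur_rec_;
        cur_rec_ = fresh;
    }

    bool has_cursor() const { return cur_rec_ != nullptr; }

    // Set the record index for dimension 0; fails if no array exists.
    bool set_rec(long rec) {
        if (!cur_rec_ || ndims_ == 0) return false;
        cur_rec_[0] = rec;
        return true;
    }

    // Current record index for dimension 0, or -1 if no array exists.
    long rec() const { return (cur_rec_ && ndims_ > 0) ? cur_rec_[0] : -1; }

private:
    std::size_t ndims_;
    long* cur_rec_;
};
```

In this sketch, a set_rec after sync_without_realloc fails cleanly, whereas after sync_with_realloc the cursor is simply reset to zero and remains usable.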

There are currently no tests run from "make test" for the C++ sync(),
so you may be running up against something that isn't used much.

You are encountering the error on SGI, but I suspect from looking at
the code that you would get similar errors on any platform.
I know you previously wrote:

> Now, on the SGI, and only the SGI, data inside the netcdf lib gets 
> corrupted.  I send in a valid record id for put_rec(data,id) and inside 
> the lib this gets set to some goofy value.

but have you tried this code on any other platforms?

I'll have to look at this some more and maybe come up with a test case
that shows whether it could ever have worked.  Maybe this will also
lead to a fix ...

--Russ

_____________________________________________________________________

Russ Rew                                         UCAR Unidata Program
address@hidden                      http://my.unidata.ucar.edu