[netCDF #TLB-677315]: FW: Net CDF-4 Issues



Coy,

> I am running multiple threads/processes that are each writing different
> data to different NetCDF files at the same time. So chances are the two
> assertions are from two different threads. To make the non-thread-safe
> NetCDF dlls work for me, I have each thread taking turns. That is, only
> one thread can have a file open at a time. Thus, before opening a file,
> a thread has to take control. After it closes the file, it releases
> control. I wonder if there is some lag time after closing the file where
> the NetCDF dll is still actively doing something that can get corrupted
> if another thread tries to use the dll to open a different file. Is
> there a better way to handle multi-threading?

The netCDF-3 library maintains a single global linked list of open netCDF
files, declared in libsrc/nc.c:

  /* list of open netcdf's */
  static NC *NClist = NULL;

Whenever a file is opened, it is added to this list, if not already
there.  When a file is closed, it gets deleted from this list.  Even
if two threads are only dealing with different files, it seems
possible that they could corrupt this linked list by trying to add or
delete entries simultaneously.  For example, if one thread were opening
a file while another was closing a file adjacent to it on the list, the
links to the previous and next items on the list could be left
inconsistent.  I'm not sure this is what is happening, but it seems
like a possibility, from what you have described.
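
To make the concern concrete, here is a minimal sketch of a doubly linked
list of open files whose add and delete operations are serialized by one
lock.  It is not the library's code: struct open_file, open_list, list_add
and list_del are hypothetical names, and POSIX threads are assumed (on
Windows a CRITICAL_SECTION would play the same role).

```c
#include <pthread.h>
#include <stddef.h>

/* Hypothetical stand-in for an entry on the list of open files. */
struct open_file {
    struct open_file *next;
    struct open_file *prev;
    int id;
};

static struct open_file *open_list = NULL;  /* head of the list */
static pthread_mutex_t open_list_lock = PTHREAD_MUTEX_INITIALIZER;

/* Push an entry on the front of the list while holding the lock. */
static void list_add(struct open_file *f)
{
    pthread_mutex_lock(&open_list_lock);
    f->prev = NULL;
    f->next = open_list;
    if (open_list != NULL)
        open_list->prev = f;
    open_list = f;
    pthread_mutex_unlock(&open_list_lock);
}

/* Unlink an entry that is currently on the list while holding the lock. */
static void list_del(struct open_file *f)
{
    pthread_mutex_lock(&open_list_lock);
    if (f->prev != NULL)
        f->prev->next = f->next;
    else
        open_list = f->next;        /* f was the head */
    if (f->next != NULL)
        f->next->prev = f->prev;
    f->next = NULL;
    f->prev = NULL;
    pthread_mutex_unlock(&open_list_lock);
}
```

Without the lock, two threads running list_add and list_del at the same
time could each read the neighbor pointers before the other has finished
updating them, which is the kind of corruption described above.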

> Unfortunately, I cannot reproduce the assertion easily. I am running my
> code on 6 PCs and see this type of failure about once every 3 to 5 days
> on only one PC. I'll let you know if I come by some new information that
> might help.
> 
> The "Arg list too long" error is from Windows but is the same error
> message thrown when trying to write a NetCDF file larger than 2GB
> without setting the NC_64BIT_OFFSET flag. I'm perplexed because I am
> setting the NC_64BIT_OFFSET flag but I am also reading a 32 GB file at
> the same time. So I am wondering if somehow the NC_64BIT_OFFSET flag
> gets turned off by the thread that is reading in the data, thus
> keeping the thread writing out the data from being able to write more
> than 2GB. When I get some time, I will try to find a good way for you to
> reproduce this error.

If the multi-threading problem described above is occurring, the wrong
flags may be getting associated with an open file, so what you are seeing
might be another symptom of the same problem.

> I would not be surprised if both these issues and some others I have
> been seeing have to do with multi-threading. So any information you can
> give me in regards to safely using the NetCDF in a
> multi-threaded/process situation could be very helpful.

Maybe you could detect the possibility of a corrupted open file list
by adding a function that checks the list before and after every open
and close ...
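
A minimal sketch of such a check follows.  The struct list_node type and
the check_open_list function are hypothetical stand-ins, not library code;
the real list entries carry many more fields, and because NClist is
declared static in libsrc/nc.c, a real check would have to be added to
that file and the library rebuilt.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a list entry; only the links matter here. */
struct list_node {
    struct list_node *next;
    struct list_node *prev;
};

/*
 * Walk the list from head and verify that every back link agrees with
 * the forward walk.  Returns the number of entries if the list looks
 * consistent, or -1 if a broken link or a probable cycle is found.
 * max_nodes bounds the walk so a cycle cannot loop forever.
 */
static int check_open_list(const struct list_node *head, int max_nodes)
{
    const struct list_node *prev = NULL;
    const struct list_node *cur = head;
    int count = 0;

    while (cur != NULL) {
        if (cur->prev != prev)
            return -1;              /* back link disagrees with forward walk */
        if (++count > max_nodes)
            return -1;              /* more entries than expected */
        prev = cur;
        cur = cur->next;
    }
    return count;
}
```

Calling something like assert(check_open_list(head, limit) >= 0) just
before and just after each open and close would at least narrow down when
the list first becomes inconsistent.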

> 
> -----Original Message-----
> From: Unidata netCDF Support [mailto:address@hidden]
> Sent: Thursday, May 28, 2009 10:36 AM
> To: Chanders Coy CCH
> Cc: address@hidden; Chanders Coy CCH
> Subject: [netCDF #TLB-677315]: FW: Net CDF-4 Issues
> 
> Hi Coy,
> 
> > I switched over to using the NetCDF-4 dlls so that I could write
> > larger files using the NC_64BIT_OFFSET flag. I now have two problems
> > I hope you can help me with.
> >
> > First off, the NetCDF-4 dll crashed after running a couple of days.
> > The only change I made to my code, which has run weeks at a time
> > without issue, is using the NetCDF-4 library/dlls and using the
> > NC_64BIT_OFFSET flag. I have attached the screen shot. Do the error
> > messages mean anything to you?
> 
> Yes, although I'm not sure why you are getting two assertion violations.
> The program should exit the first time an assertion violation is detected.
> Were these generated by two different threads?  Do you have multiple
> threads or programs trying to write the same file concurrently?  The
> netCDF library is not thread-safe, so that might explain how these
> assertion violations came about.
> 
> These are occurring in the netCDF-3 library, in part of the code that hasn't
> been changed for about 10 years.  I can't find a previous report of seeing
> either of these assertion violations from any users, and I've never seen them
> before either.  An assertion violation is the result of a basic assumption in
> the library code being violated, and could come from a bug in the library, a
> memory or other hardware error, or a program error causing some data structure
> in the library to be inadvertently overwritten.
> 
> Diagnosis of the cause of the assertion violation typically requires a session
> in an interactive debugger to examine the values of internal pointers and data
> structures.  We would have to be able to reproduce the error here to figure out
> what is going on.
> 
> Is there any possibility that you could duplicate the problem in a small
> self-contained program that gets the same assertion violation every time it
> is run?  Otherwise this will be a difficult problem to diagnose and fix ...
> 
> > Second, I have written a C++ class that can write and read my NetCDF
> > files. (Visual Studio 2005, Windows XP, NetCDF-4 library, C interface,
> > NC_64BIT_OFFSET). I have a project that uses this class to write files
> > as large as 32GB. I have another project that uses this class to read in
> > those files and display the data. Both these projects appear to work
> > correctly. My problem is when I try to read in the large files, do a
> > little processing on the data, and then write the data out to a new
> > file. In this case, the reading, processing, and writing are all done in
> > separate threads using their own instantiation of the same class. When I
> > do this, I get the error "Arg list too long" when trying to write the
> > data to the new file when the new file has grown to 2GB in size. Since
> > the code that is failing is exactly the same code that wrote the
> > original 32GB file, it appears to me that somehow the NC_64BIT_OFFSET
> > flag is getting blocked when creating the new file. Thus I get the same
> > error I would get if I did not create the file with the NC_64BIT_OFFSET
> > flag. I don't think this is a single-thread issue as I am ensuring that
> > only one instance of the class can have the NetCDF file open at a time.
> > Do you have any ideas of what is going on?
> 
> The error "Arg list too long" wouldn't seem to have anything to do with
> netCDF, as that error message is not one that the library ever generates.
> You can see all the error messages that come from netCDF in the files
> libsrc/error.c and libsrc4/ncfunc.c in the code for the nc_strerror function.
> 
> You might check all the possible reasons writing a large file could fail
> that are listed in the answer to the FAQ "Why do I get an error message
> when I try to create a file larger than 2 GiB with the new library?":
> 
> 
> http://www.unidata.ucar.edu/netcdf/docs/faq.html#Large%20File%20Support12
> 
> Other than those, I don't have any good ideas about what the problem could
> be.  Again, if you can create a small self-contained program that we could
> run here to reproduce the problem, we might be able to diagnose and fix it.
> 
> --Russ
> 
> Russ Rew                                         UCAR Unidata Program
> address@hidden                     http://www.unidata.ucar.edu
> 
> 
> 
> Ticket Details
> ===================
> Ticket ID: TLB-677315
> Department: Support netCDF
> Priority: Normal
> Status: Closed
> 
> 

Russ Rew                                         UCAR Unidata Program
address@hidden                     http://www.unidata.ucar.edu