[netCDF #VZU-531672]: NetCDF error during model runtime



Hi Akshay,

> I wanted to follow up on the netCDF error, and was wondering if you had had 
> the chance to look at the log yet. Just to update: I re-set the program that 
> crashed, and now all 8 of them have been running without incident for close to two 
> weeks.

Sorry, but no log was sent with or attached to your original question, and
the question made no reference to one.  If you want me to look at a log,
you'll have to make it available.  If you originally attached it and the
attachment didn't get through, please send it again.

> Nevertheless, I want to try and track the problem down, or at least to 
> anticipate it in the future. Are there certain/specific conditions under 
> which an IO error -31 would occur? Perhaps checking for those conditions 
> could elucidate a solution...

NetCDF functions can return -31 for a "system error", which means an error
from a system call, such as when you try to open a file that doesn't exist, 
or try to write a file and there's no space left on the device.  When you 
get such an error, you should call nc_strerror(errno) (from C), 
nf90_strerror(errno) (from Fortran-90), or NF_STRERROR(errno) (from 
Fortran-77) to get a string describing the netCDF error or system error in 
more detail.
See the documentation for these functions in the appropriate language
reference manual:

  http://www.unidata.ucar.edu/netcdf/docs/

If you don't call the appropriate function right after the error is returned,
there is no way to tell which of many system errors occurred.  The string
returned by the function, when printed, will tell you what error the operating
system returned.
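
As a rough illustration of why the call has to come right after the failure,
here is a minimal sketch in plain C.  It deliberately uses only the standard
C library (fopen and strerror) so that it builds without netCDF installed;
try_open is a hypothetical helper, not a netCDF function, and the sketch
assumes a POSIX-like system where a failed fopen sets errno.  In a real
netCDF program the same step would use the netCDF call and nc_strerror as
described above.

```c
#include <errno.h>
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical helper, not part of netCDF: try to open `path` for reading
 * and, on failure, copy errno before any later call can overwrite it.
 * Returns 0 on success, otherwise the saved errno value, and writes a
 * human-readable description into `msg`.
 */
int try_open(const char *path, char *msg, size_t msg_len)
{
    errno = 0;
    FILE *fp = fopen(path, "rb");
    int saved_errno = errno;      /* capture immediately after the call */

    if (fp == NULL) {
        /* In a netCDF program, this is the point at which the error
           string would be requested, right after the failing call. */
        snprintf(msg, msg_len, "cannot open %s: %s",
                 path, strerror(saved_errno));
        return saved_errno;
    }

    fclose(fp);
    snprintf(msg, msg_len, "opened %s", path);
    return 0;
}
```

Calling try_open with a path that does not exist should return ENOENT and
fill msg with the operating system's description of that error.  Logging
that text at the moment of failure is what makes an intermittent crash
diagnosable later.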

I'll repeat the question I asked in response to your original request for 
help, because your answer is relevant to diagnosing the problem:

  Are you trying to write to the same file from multiple processes or threads
  concurrently?  NetCDF 3.6.3 is only designed to permit one writer and 
  multiple readers, not multiple writers.  There is no filesystem setting that 
  will make multiple concurrent writes safe or reliable with netCDF-3.

If your answer to the above question is "yes", then you should expect 
non-deterministic errors when using netCDF-3.  In this case, the solution
is to use one of the parallel I/O libraries for netCDF access.

--Russ

> -----Original Message-----
> From: Akshay Ashok
> Sent: Tuesday, July 06, 2010 4:45 PM
> To: 'address@hidden'
> Subject: RE: [netCDF #VZU-531672]: NetCDF error during model runtime
> 
> Hi Russ,
> 
> I upgraded to NetCDF 4.1.1, and re-ran all 8 programs simultaneously again. 
> This morning one of the programs crashed again, but this time I managed to 
> capture the screen output to a logfile. I've attached the relevant part of 
> the logfile, along with a normally-running comparison log for reference.
> 
> This time I receive an input/output error (-31). The thing that is surprising 
> is that this is the only program out of 8 that has crashed (they've been 
> running for 6 days now; of course, there's no telling what will happen 
> next...;) ) and it occurs after many successful reads/writes.
> 
> Akshay
> 
> -----Original Message-----
> From: Unidata netCDF Support [mailto:address@hidden]
> Sent: Friday, July 02, 2010 2:25 PM
> To: Akshay Ashok
> Cc: address@hidden
> Subject: [netCDF #VZU-531672]: NetCDF error during model runtime
> 
> Hi Akshay,
> 
> > I am running the CMAQ model v4.7, which uses the netCDF file format for 
> > data storage. I have netCDF version 3.6.3, and each computational job is 
> > running on an 8-core parallel processor configuration (with 8 such jobs 
> > running in parallel, reading from and writing to 26TB xfs RAID 6 arrays). 
> > Recently, there have been several netCDF errors which cause the CMAQ 
> > program to quit: error -43 (error processing attribute FTYPE), error -51 
> > (unknown file format) and sometimes error -37 (disk synch error, I think). 
> > These errors occur when opening a file for the first time (i.e., the CMAQ 
> > program checks for the existence of the file, and writes to a new file if 
> > the file is not found). Also, the errors seem to happen to any of the 8 
> > parallel jobs at different run times, but always when the file is being 
> > opened as new for the first computational timestep.
> >
> > I was wondering if you had any suggestions as to how to tackle this 
> > problem. The netCDF setup has worked fine for previous runs, and the only 
> > thing that has changed is the filesystem (we migrated to the 
> > above-mentioned new xfs filesystem recently). On this note, are there any 
> > specific filesystem settings that need to be configured in order for netCDF 
> > to perform correctly?
> 
> Are you trying to write to the same file from multiple processes or threads 
> concurrently?  NetCDF 3.6.3 is only designed to permit one writer and 
> multiple readers, not multiple writers.  There is no filesystem setting that 
> will make multiple concurrent writes safe or reliable with netCDF-3.
> 
> Perhaps you should consider using netCDF-4 or parallel netCDF, either of 
> which supports multiple concurrent writes on an underlying parallel file 
> system.
> 
> If you are not attempting multiple concurrent writes, then the problem you 
> are reporting sounds like a new problem we haven't seen before.  Is it 
> practical to isolate the problem to a small program we could use to reproduce 
> it here?
> 
> --Russ
> 
> 
> Russ Rew                                         UCAR Unidata Program
> address@hidden                      http://www.unidata.ucar.edu
> 
> 
> 
> Ticket Details
> ===================
> Ticket ID: VZU-531672
> Department: Support netCDF
> Priority: High
> Status: Closed
> 
> 