[netCDF #VZU-531672]: NetCDF error during model runtime



Hi Akshay

> Please find the log file attached. The log was not attached to the original 
> email, rather it was attached to the second email (right after an error 
> occurred).

OK, thanks, I didn't notice the attachment on the second email that
CC:ed me.  With the volume of support requests we get for netCDF, it
helps us to keep them organized in our support system, separated from
ordinary email.  If you Cc: address@hidden, that will
happen automatically and will make sure someone responds knowing the
complete history of the support question, even if I'm on vacation (or
Steve is, as was the case with the staff member to whom your email was
addressed).

The log shows:

     After NEXTIME: returned JDATE, JTIME 2005115 010000
  ncredef: ncid 27: Input/output error
     Error opening history for update
     netCDF error number  -31  processing file "CTM_CONC_1"
     Unknown Error

Because the error is from "ncredef" (a netCDF version 2 function)
rather than "nc_redef" (the name for the corresponding function in
netCDF-3 and netCDF-4), I assume CMAQ is using the netCDF-2
API (still supported in netCDF-3 and netCDF-4 releases), which prints
error messages and exits by default or continues if the application
has specified that behavior in an error handling option.  The netCDF-3
API instead returns error codes and continues by default, expecting
the application to examine and handle the error code appropriately.
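
To illustrate the return-code style in plain C, without netCDF itself,
here is a minimal sketch.  Note that try_open() and report_if_error()
are hypothetical names invented for this example, not netCDF functions;
try_open() just stands in for any library call that hands an error code
back to the caller instead of printing a message and exiting.

```c
#include <assert.h>
#include <errno.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical stand-in for a library call in the return-code style:
 * returns 0 on success, or the system errno value on failure.
 * It never prints or exits on its own. */
int try_open(const char *path)
{
    assert(path != NULL);
    errno = 0;
    FILE *fp = fopen(path, "r");
    if (fp == NULL)
        return errno ? errno : EIO;  /* fall back to EIO if errno was not set */
    fclose(fp);
    return 0;
}

/* Caller-side check: print a message for a nonzero status and tell the
 * caller whether an error occurred (1) or not (0).  What to do next
 * (retry, skip, abort) is left to the application. */
int report_if_error(int status, const char *what)
{
    if (status == 0)
        return 0;
    fprintf(stderr, "%s: %s\n", what, strerror(status));
    return 1;
}
```

An application would call something like
report_if_error(try_open(path), "open") after each call and decide
itself whether to continue, which is the responsibility the netCDF-3
API places on the caller.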

The string "Input/output error" is apparently the operating system's
message (the text that perror() or strerror() produces) for whatever
system error occurred when ncredef() was called.

That error message is not very helpful, and could indicate various errors on 
different file systems.  I'm guessing a system call such as open() or write() 
failed with errno set to EIO, which has the value 5 on a Linux system (is that 
what CMAQ is running on?) and is defined in the errno.h include file as

  #define       EIO              5      /* I/O error */

There may be something in the xfs documentation about possible causes
for EIO returned from an I/O call.
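
If you want to check what message text your own system associates with
EIO, a small standard-C helper like the following would do it.  This is
only a sketch: describe_errno() is a name made up for this example, and
it uses nothing beyond strerror() and snprintf() from the C library.

```c
#include <assert.h>
#include <errno.h>
#include <stdio.h>
#include <string.h>

/* Format "errno <number>: <system message>" into buf.
 * Returns what snprintf() returns: the number of characters the full
 * message needs, not counting the terminating null. */
int describe_errno(int err, char *buf, size_t buflen)
{
    assert(buf != NULL && buflen > 0);
    return snprintf(buf, buflen, "errno %d: %s", err, strerror(err));
}
```

Calling describe_errno(EIO, buf, sizeof buf) on the machine where CMAQ
runs and printing buf should show whether the text matches the
"Input/output error" in your log; the exact wording is up to the C
library.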

> In response to your question: I believe the CMAQ program has a parallel I/O 
> module that gathers the streams and manages reads and writes, but I am not 
> entirely sure if the write operations to a single file are serially 
> sequenced; I will have to check with the CMAQ developers.

OK, because if you have multiple processes writing to the same file
concurrently, netCDF-3 will not work reliably and could fail in
nondeterministic ways.  The netCDF-3 library has no way to determine
if it is being used this way, so you must make sure writes are
serialized.  Otherwise, you should be using something like parallel
netCDF:

  http://trac.mcs.anl.gov/projects/parallel-netcdf

which provides parallel I/O for netCDF-3 files with a slightly
different API, or netCDF-4, which supports two kinds of parallel I/O.

> In order to use the parallel I/O libraries of netCDF-4, is there any 
> modification that I need to make to the code? Or do I just have to build the 
> netCDF libraries appropriately?

netCDF-4 can make use of either HDF5's parallel I/O or pnetcdf's
parallel I/O.  In either case, you need to make some modifications to
the code, so consult the C or Fortran users guide for the appropriate
calls.  See this section in the language-independent Users Guide for
more information:

  http://www.unidata.ucar.edu/netcdf/docs/netcdf.html#Parallel-Access

--Russ

> -----Original Message-----
> From: Unidata netCDF Support [mailto:address@hidden]
> Sent: Wednesday, July 14, 2010 6:51 PM
> To: Akshay Ashok
> Cc: address@hidden; Akshay Ashok
> Subject: [netCDF #VZU-531672]: NetCDF error during model runtime
> 
> Hi Akshay,
> 
> > I wanted to follow up on the netCDF error, and was wondering if you had had 
> > the chance to look at the log yet. Just to update: I re-set the program 
> > that crashed, and now all 8 of them have been running without incident for close 
> > to two weeks.
> 
> Sorry but there was no log sent with or attached to your original question.
> If you want me to look at a log, you'll have to make it available.  If you
> originally attached it and the attachment didn't get through, please send
> it again.  There was no reference to a log in your original question.
> 
> > Nevertheless, I want to try and track the problem down, or at least to 
> > anticipate it in the future. Are there certain/specific conditions under 
> > which an IO error -31 would occur? Perhaps checking for those conditions 
> > could elucidate a solution...
> 
> NetCDF functions can return -31 for a "system error", which means an error
> from a system call, such as when you try to open a file that doesn't exist,
> or try to write a file and there's no space left on the device.  When you
> get such an error, you should call nc_strerror(errno) (from C) or 
> nf90_strerror(errno) (from Fortran-90) or NF_STRERROR(errno) from Fortran-77
> to get a string describing the netCDF error or system error in more detail.
> See the documentation for these functions in the appropriate language
> reference manual:
> 
> http://www.unidata.ucar.edu/netcdf/docs/
> 
> If you don't call the appropriate function right after the error is returned,
> there is no way to tell which of many system errors occurred.  The string
> returned by the function, when printed, will tell you what error the operating
> system returned.
> 
> I'll repeat the question I asked in response to your original request for
> help, because the answer is relevant to providing an answer:
> 
> Are you trying to write in the same file from multiple processes or threads
> concurrently?  NetCDF 3.6.3 is only designed to permit one writer and
> multiple readers, not multiple writers.  There is no filesystem setting that
> will make multiple concurrent writes safe or reliable with netCDF-3.
> 
> If your answer to the above question is "yes", then you should expect
> non-deterministic errors when using netCDF-3.  In this case, the solution
> is to use one of the parallel I/O libraries for netCDF access.
> 
> --Russ
> 
> > -----Original Message-----
> > From: Akshay Ashok
> > Sent: Tuesday, July 06, 2010 4:45 PM
> > To: 'address@hidden'
> > Subject: RE: [netCDF #VZU-531672]: NetCDF error during model runtime
> >
> > Hi Russ,
> >
> > I upgraded to NetCDF 4.1.1, and re-ran all 8 programs simultaneously again. 
> > This morning one of the programs crashed again, but this time I managed to 
> > capture the screen output to a logfile. I've attached the relevant part of 
> > the logfile, along with a normally-running comparison log for reference.
> >
> > This time I receive an input/output error (-31). The thing that is 
> > surprising is that this is the only program out of 8 that has crashed 
> > (they've been running for 6 days now; of course, there's no telling what 
> > will happen next...;) ) and it occurs after many successful reads/writes.
> >
> > Akshay
> >
> > -----Original Message-----
> > From: Unidata netCDF Support [mailto:address@hidden]
> > Sent: Friday, July 02, 2010 2:25 PM
> > To: Akshay Ashok
> > Cc: address@hidden
> > Subject: [netCDF #VZU-531672]: NetCDF error during model runtime
> >
> > Hi Akshay,
> >
> > > I am running the CMAQ model v4.7, which uses the netCDF file format for 
> > > data storage. I have netCDF version 3.6.3, and each computational job is 
> > > running on an 8-core parallel processor configuration (with 8 such jobs 
> > > running in parallel, reading from and writing to 26TB xfs RAID 6 arrays). 
> > > Recently, there have been several netCDF errors which cause the CMAQ 
> > > program to quit: error -43 (error processing attribute FTYPE), error -51 
> > > (unknown file format) and sometimes error -37 (disk synch error, I 
> > > think). These errors occur when opening a file for the first time (i.e., 
> > > the CMAQ program checks for the existence of the file, and writes to a 
> > > new file if the file is not found). Also, the errors seem to happen to 
> > > any of the 8 parallel jobs at different run times, but always when the 
> > > file is being opened as new for the first computational timestep.
> > >
> > > I was wondering if you had any suggestions as to how to tackle this 
> > > problem. The netCDF setup has worked fine for previous runs, and the only 
> > > thing that has changed is the filesystem (we migrated to the 
> > > above-mentioned new xfs filesystem recently). On this note, are there any 
> > > specific filesystem settings that need to be configured in order for 
> > > netCDF to perform correctly?
> >
> > Are you trying to write in the same file from multiple processes or threads 
> > concurrently?  NetCDF 3.6.3 is only designed to permit one writer and 
> > multiple readers, not multiple writers.  There is no filesystem setting 
> > that will make multiple concurrent writes safe or reliable with netCDF-3.
> >
> > Perhaps you should consider using netCDF-4 or parallel netCDF, either of 
> > which supports multiple concurrent writes on an underlying parallel file 
> > system.
> >
> > If you are not attempting multiple concurrent writes, then the problem you 
> > are reporting sounds like a new problem we haven't seen before.  Is it 
> > practical to isolate the problem to a small program we could use to 
> > reproduce it here?
> >
> > --Russ
> >
> >
> > Russ Rew                                         UCAR Unidata Program
> > address@hidden                      http://www.unidata.ucar.edu
> >
> >
> >
> > Ticket Details
> > ===================
> > Ticket ID: VZU-531672
> > Department: Support netCDF
> > Priority: High
> > Status: Closed
> >
> >
> 
> 
> Hi Russ
> 
> Please find the log file attached. The log was not attached to the original 
> email, rather it was attached to the second email (right after an error 
> occurred).
> 
> In response to your question: I believe the CMAQ program has a parallel I/O 
> module that gathers the streams and manages reads and writes, but I am not 
> entirely sure if the write operations to a single file are serially 
> sequenced; I will have to check with the CMAQ developers.
> 
> In order to use the parallel I/O libraries of netCDF-4, is there any 
> modification that I need to make to the code? Or do I just have to build the 
> netCDF libraries appropriately?
> 
> Thanks
> Akshay
> 

Russ Rew                                         UCAR Unidata Program
address@hidden                      http://www.unidata.ucar.edu
