
Re: 20041027: error writing to NFS netCDF file on Linux cluster



Hi Serban,

In reference to:
> We have a CFD Fortran MPI/netCDF parallel code which exhibits
> "Input/output error" (Error 5) upon calling nf_enddef().  The code runs
> with 24 MPI processes.  At the end of computation, the resulting data is
> written to disk via netCDF.  Each MPI process writes to its own file;
> there is no simultaneous access to any single file.  Each file's size is
> approximately 31 to 32 Megabytes when no error occurs.  When the error
> occurs, typically only the file's header is written, which is 409,600
> bytes;  occasionally a few megabytes of data are written.  We don't have
> a parallel file system, only NFS.  MPI is MPICH-1.2.5..12/Myrinet.
>  
> Observations:
>  
> -- Error only occurs while writing files to a directory of an NFS
> filesystem (desired).
> -- Error does not occur (works fine!) when writing to local /tmp (each
> process writes to its own local /tmp).  This is not desired, since the
>    result files are scattered across the cluster.
> -- We have tried 2 NFS filesystems: on one, about 23 out of 24
> processes report the error (one error per process);
>     on the other, about 15 out of 24 processes report the error.
>  
> Could you advise us as to the cause of the error and how we might fix it?
>  
> The compiler and library versions are:
>  
> bash-2.05$ ifc -V
> Intel(R) Fortran Compiler for 32-bit applications, Version 7.1   Build
> 20031225Z
> Copyright (C) 1985-2003 Intel Corporation.  All rights reserved.
> FOR NON-COMMERCIAL USE ONLY
>  
> GNU ld version 2.11.90.0.8 (with BFD 2.11.90.0.8)
>   Supported emulations:
>    elf_i386
>    i386linux
>    elf_i386_glibc21
>  
> netcdf is version 3.5.1
> mpich is version 1.2.5..12

You asked:

> I would like to inquire whether your specialist responsible for the
> netCDF package has any more hints on why that error is occurring.  To
> your knowledge, have these kinds of errors been encountered by other
> netCDF users?  Should NFS be tuned for netCDF?  If so, how can we do
> that?

There is no need to tune NFS specifically for netCDF.  Many users
access netCDF files via NFS, and we test over NFS as well.  There have
been a few reported problems with NFS, but all of them turned out to
be NFS configuration problems, for example:

 - an NFS system not configured to permit writing large files
 - user-specific disk quotas on an NFS-mounted disk
 - a very small value configured for the NFS write buffer size (the
   problem went away when the write buffer size was raised to 32768;
   see the example mount options below)
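
The write buffer size is controlled by the "wsize" mount option
("rsize" is the corresponding read buffer size).  The server name and
mount point below are only placeholders, but a mount along these lines
is one way to raise it:

    # hypothetical example; substitute your own server and mount point
    mount -t nfs -o rsize=32768,wsize=32768 fileserver:/export/data /mnt/data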

Are you setting the "NF_SHARE" flag when you open the file, to
indicate synchronization for shared access?  The NF_SHARE flag is
appropriate when one process may be writing the dataset while one or
more other processes are reading it concurrently; it means that
dataset accesses are not buffered and caching is limited.
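
In case it's relevant, here is roughly how that flag is passed through
the Fortran-77 interface when creating a file (just a sketch; the file
name is made up):

        include 'netcdf.inc'
        integer ncid, status
        ! create the file with NF_SHARE so accesses are not buffered
        status = nf_create('out.nc', ior(nf_clobber, nf_share), ncid)
        if (status .ne. nf_noerr) print *, nf_strerror(status)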

Since the symptoms you are seeing occur on a cluster, are you trying
to write to a file from multiple processes?  You can read a single
netCDF file with multiple concurrent reading processes, but the netCDF
library doesn't support multiple concurrent writers.  If you are
calling nf_enddef() because you have updated the file metadata (adding
another attribute or variable, for example), the reading programs
would still need to call nf_sync() to see the metadata changes.  But I
don't see how omitting that could cause an intermittent I/O error in
the writing program ...
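
For completeness, a reader would pick up such metadata changes with
something like the following (sketch only; 'out.nc' is a placeholder):

        include 'netcdf.inc'
        integer ncid, status
        status = nf_open('out.nc', nf_nowrite, ncid)
        ! ... later, after the writer may have added variables or
        ! attributes, synchronize to see the updated metadata
        status = nf_sync(ncid)
        if (status .ne. nf_noerr) print *, nf_strerror(status)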

Sorry I can't be of more help, but the generic "Input/output error"
doesn't give much of a clue about the source of the error in calling
nf_enddef().  To make any progress on this problem, we would have to
be able to reproduce it.
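
One thing that might help narrow it down is to check the status
returned by every netCDF call, not just nf_enddef(), and print the
corresponding message so you can tell exactly which call fails.  A
minimal sketch (the subroutine name is arbitrary):

        subroutine check(status)
        include 'netcdf.inc'
        integer status
        ! print the netCDF error message and stop on any failure
        if (status .ne. nf_noerr) then
           print *, nf_strerror(status)
           stop 'netCDF error'
        endif
        end

Then wrap each call, for example:  call check(nf_enddef(ncid))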

--Russ

_____________________________________________________________________

Russ Rew                                         UCAR Unidata Program
address@hidden          http://www.unidata.ucar.edu/staff/russ