
20041027: error writing to NFS netCDF file on Linux cluster



Dear Professor Constantinescu,

We might not be able to help you very much because the problem appears
to be due to the behavior of NFS on your Linux cluster rather than to
the netCDF library itself.  As you observed:

    -- Error only occurs while writing files to a directory of an NFS
       filesystem (desired).
    -- Error does not occur (works fine!) when writing to local /tmp.
       (each process writes to its local /tmp). (not desired, since
       result files are scattered across the cluster).

The primary person responsible for the netCDF package is attending a
conference at this time.  He did, however, have the following to say:

    ... the problem described will be difficult to debug because it
    appears to be dependent on an NFS problem with a Linux cluster
    that we probably can't reproduce here.  If he could supply a small
    complete example that failed, we could try to duplicate the problem,
    but if it depends on the details of the NFS implementation and
    running on a cluster, that may be difficult.

    Version 3.6.0-beta6 of netCDF is also now available, although I
    don't recognize any bugs we fixed from 3.5.1 that would be relevant
    to this problem.

    Professor Constantinescu may not know about the parallel netCDF
    package available from

      http://www-unix.mcs.anl.gov/parallel-netcdf/

    that may be a better solution to his problem.  It would require
    changes to his code, since the netCDF interface is a little
    different, but it is based on MPI and has been successfully used in
    several similar modeling projects.  The pnetcdf developers may also
    be more familiar with the symptoms he describes, since they have
    debugged many problems with parallel netCDF I/O, MPI, and clusters.
    There is a mailing list address@hidden for discussion
    of their parallel netCDF software that might be able to help.

    --Russ

Can you reduce the scope of the problem to a small example?
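
One possible way to narrow this down, offered as a sketch under stated
assumptions and not as anything supplied by the netCDF developers: a test
that makes no netCDF calls at all and simply writes a file of roughly the
same shape as the failing output (a 409,600-byte header followed by data,
about 32 MB in total) into the NFS directory. If this plain write also
fails with errno 5 (EIO, "Input/output error"), that would point at NFS
rather than at netCDF. The function name, the chunk size, and the use of
Python in place of the original Fortran are all hypothetical choices made
for illustration.

```python
import os
import tempfile

HEADER_BYTES = 409_600          # header size reported in the original message
TOTAL_BYTES = 32 * 1024 * 1024  # approximate size of one complete output file


def plain_write_test(directory, total_bytes=TOTAL_BYTES,
                     header_bytes=HEADER_BYTES, chunk_bytes=1024 * 1024):
    """Write a header-sized block, then data, using no netCDF calls.

    Returns None if every write, fsync, and close succeeds; otherwise
    returns the errno of the first OSError (EIO is "Input/output error").
    The probe file is removed afterwards.
    """
    # Hypothetical probe file name; mkstemp makes it unique per process.
    fd, path = tempfile.mkstemp(prefix="nfs_probe_", dir=directory)
    try:
        try:
            with os.fdopen(fd, "wb") as f:
                f.write(b"\0" * header_bytes)
                f.flush()
                os.fsync(f.fileno())   # push the header out to the server
                remaining = total_bytes - header_bytes
                while remaining > 0:
                    n = min(chunk_bytes, remaining)
                    f.write(b"\0" * n)
                    remaining -= n
                f.flush()
                os.fsync(f.fileno())   # NFS errors often surface here or at close
        except OSError as exc:
            return exc.errno
        return None
    finally:
        try:
            os.unlink(path)
        except OSError:
            pass  # cleanup failure is not the result being tested
```

Each of the 24 processes could call plain_write_test() on the NFS
directory at about the same time, and again on local /tmp for comparison.
A clean run would not rule out NFS, since netCDF's access pattern differs
from this one, but a failure here would be a much smaller example to
hand to the cluster administrators.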

Is the parallel netCDF package a possible solution for you?

Regards,
Steve Emmerson

--------Begin Original Message

From: Serban G Constantinescu <address@hidden>
To: address@hidden
Subject: e-mail about netcdf problems on a 32 bit PC cluster

I am contacting you about a SUPPORT REQUEST FORM that I filled out
yesterday about the problems we have when we try to write large amounts
of data in netCDF using a massively parallel Fortran 90 code.
 
Email was submitted from following website:
 http://my.unidata.ucar.edu/content/support/email_support.php
 
Could you please confirm you received it?
Do you know about how much time it takes to get an answer?
 
Thank you for your help.
Best regards
 
George Constantinescu
Assistant Professor
Dept. Civil and Environmental Engineering
The University of Iowa
 
Package:  netCDF Fortran (77 + 90)
Package version:  3.5.1
Operating System:  Redhat Linux 2.4.9-e.49smp #1 SMP
Hardware Information:   64-node, 128-CPU, Linux-based computing cluster
running MPICH -1.2.5..12 from Myrinet, Sun Grid Engine 5.3, and Sun
Control Station 2.0. Compute nodes (64) are x86-based Sun Fire V60x
servers (see: http://www.sun.com/servers/entry/v60x/). Head nodes (2)
are x86-based Sun Fire V65x servers  (see:
http://www.sun.com/servers/entry/v65x/). Compute nodes have two 36 GB
disk drives. Apple Storage Array for shared storage. SMC network for
transmitting data from the nodes to the Apple storage array (three SMC
TigerSwitch 10/100/1000 8624T 24-port switches). Myrinet switch for
internode communications.
Subject:  nf_enddef() Input/output error
 
Description:
 
Hello,
 
We have a CFD Fortran MPI/netCDF parallel code which exhibits
"Input/output error" (Error 5) upon calling nf_enddef().  The code runs
with 24 MPI processes.  At the end of computation, the resulting data is
written to disk via netCDF.  Each MPI process writes to its own file;
there is no simultaneous access to any single file.  Each file's size is
approximately 31 to 32 Megabytes when no error occurs.  When the error
occurs, typically only the file's header is written, which is 409,600
bytes;  occasionally a few megabytes of data are written.  We don't have
a parallel file system, only NFS.  MPI is MPICH -1.2.5..12/Myrinet.
 
Observations:
 
-- Error only occurs while writing files to a directory of an NFS
   filesystem (desired).
-- Error does not occur (works fine!) when writing to local /tmp (each
   process writes to its local /tmp).  Not desired, since result files
   are scattered across the cluster.
-- We have 2 NFS filesystems we've tried:  On one, about 23 out of 24
   processes report the error (one error per process); on the other,
   about 15 out of 24 processes report the error.
 
Could you advise us as to the cause of the error and how we might fix it?
 
The compiler and library versions are:
 
bash-2.05$ ifc -V
Intel(R) Fortran Compiler for 32-bit applications, Version 7.1   Build 20031225Z
Copyright (C) 1985-2003 Intel Corporation.  All rights reserved.
FOR NON-COMMERCIAL USE ONLY
 
GNU ld version 2.11.90.0.8 (with BFD 2.11.90.0.8)
  Supported emulations:
   elf_i386
   i386linux
   elf_i386_glibc21
 
netcdf is version 3.5.1
mpich is version 1.2.5..12

--------End Original Message