
Re: 970605: netcdf and ffio on Cray



Elizabeth:

I'm coming into this discussion late, so I'm wondering what the status
of this problem is. We received another problem report that may be related;
I have attached it below.

Improved performance on parallel machines is one of the goals of netcdf-3.
I would be more than happy to work with you or someone on your staff to enable
netcdf-3 to take full advantage of your machines. Working with your experts
might also decrease our support load from separate sites. Our access to CRAY
machines is limited to those at NCAR, plus a T90 and a T3E at GFDL. We
figure things out as best we can based on this sample, but really have no
sense of a recommended "forward compatibility" strategy.

A brief outline of the netcdf internal architecture follows.

There is an abstract i/o interface used internally which is called "ncio".
This interface was designed with multiprocessing in mind.

The primitive "read" operation looks like this:

int ncio_get(ncio *const nciop,
                        off_t offset, size_t extent,
                        int rflags,
                        void **const vpp);

Where
        'nciop'  is an opaque handle to the lower layer
        'offset' is the position in the file where the data begins
        'extent' is the amount of data to be read
        'rflags' indicates whether this is a read-only operation or
                read-modify-write operation
and
        'vpp'    is returned containing a pointer to memory
                which is 'extent' size, containing the data from the
                file.

The returned integer is an error code.

This operation can serve to lock a particular region of a file while
the data is examined or modified. Each call to ncio_get()
is followed by a call to ncio_rel() when the use of this region is
completed:

int ncio_rel(ncio *const nciop,
                 off_t offset, int rflags);

Where
        'nciop'  is an opaque handle to the lower layer
        'offset' is used to indicate which region to release
                (this must be the same as used for a previous, unreleased
                call to ncio_get())
and
        'rflags' indicates whether the region was actually modified (ignored if
                the ncio_get() was read-only).
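
The get/release pairing described above can be illustrated with a toy,
in-memory stand-in for the lower layer. This is only a sketch under stated
assumptions: the contents of the ncio struct, the DEMO_* flag values, and the
bump_byte() caller are hypothetical and are not taken from the netcdf
sources. The toy also allows only one region to be held at a time, which is
a simplification rather than a property of the interface.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/types.h>   /* off_t */

/* Hypothetical stand-in for the opaque handle: the whole "file" is in memory. */
typedef struct ncio {
    unsigned char *data;  /* file contents */
    size_t size;          /* file length in bytes */
    int held;             /* nonzero while a region is checked out */
    off_t held_offset;    /* offset of the checked-out region */
} ncio;

/* Hypothetical flag values, for illustration only. */
#define DEMO_WRITE    0x1  /* to ncio_get(): caller intends to modify */
#define DEMO_MODIFIED 0x2  /* to ncio_rel(): caller did modify */

/* Check out 'extent' bytes starting at 'offset'; 0 on success, -1 on error. */
int ncio_get(ncio *const nciop, off_t offset, size_t extent,
             int rflags, void **const vpp)
{
    assert(nciop != NULL && vpp != NULL);
    (void)rflags;  /* a real layer could use this to pick a read or write lock */
    if (nciop->held)
        return -1;                       /* toy rule: one region at a time */
    if (offset < 0 || (size_t)offset > nciop->size
            || extent > nciop->size - (size_t)offset)
        return -1;                       /* region falls outside the file */
    nciop->held = 1;
    nciop->held_offset = offset;
    *vpp = nciop->data + offset;
    return 0;
}

/* Release the region obtained at 'offset'; 0 on success, -1 on error. */
int ncio_rel(ncio *const nciop, off_t offset, int rflags)
{
    assert(nciop != NULL);
    (void)rflags;  /* a real layer would write the region back if modified */
    if (!nciop->held || nciop->held_offset != offset)
        return -1;                       /* no matching unreleased get */
    nciop->held = 0;
    return 0;
}

/* Hypothetical caller: read-modify-write of a single byte. */
int bump_byte(ncio *nciop, off_t offset)
{
    void *vp = NULL;
    int status = ncio_get(nciop, offset, 1, DEMO_WRITE, &vp);
    if (status != 0)
        return status;
    ++*(unsigned char *)vp;
    return ncio_rel(nciop, offset, DEMO_MODIFIED);
}
```

The point of the sketch is the shape of the caller: every successful
ncio_get() is matched by exactly one ncio_rel() with the same offset, and
the release flags report whether the memory was changed.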

We provide two implementations of the ncio interface with the
netcdf package. One makes POSIX read() and write() calls and
manages malloc()'ed space. The other uses CRAY ffio in a simple
way modeled after Jeff Kuehn's contribution to netcdf-2.

There is no reason that operations on distinct regions of a netcdf
file could not proceed in parallel. The above _interface_ and its use
by the higher netcdf layer are designed to provide proper barriers.
----> The _implementation_ of ncio we provide does not. <----

It should be straightforward to provide a specialized implementation
which would take advantage of parallel architecture.
We have implemented a similar interface which uses POSIX fcntl()
locking to provide safe sharing by separate programs on typical Unix systems.
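
For reference, byte-range locking with POSIX fcntl() looks roughly like the
sketch below. It is not the implementation mentioned above; the helper names
lock_region() and unlock_region() are hypothetical, and the sketch assumes
advisory locks that block until the region is free (F_SETLKW).

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <fcntl.h>
#include <sys/types.h>
#include <unistd.h>

/* Apply one lock operation (F_RDLCK, F_WRLCK or F_UNLCK) to a byte range. */
static int region_lock_op(int fd, short type, off_t offset, size_t extent)
{
    struct flock fl = {0};
    assert(offset >= 0 && extent > 0);
    fl.l_type = type;
    fl.l_whence = SEEK_SET;     /* l_start is measured from start of file */
    fl.l_start = offset;
    fl.l_len = (off_t)extent;
    /* F_SETLKW waits until any conflicting lock is released. */
    return fcntl(fd, F_SETLKW, &fl) == -1 ? -1 : 0;
}

/* Hypothetical helper: lock a region before examining or modifying it. */
int lock_region(int fd, off_t offset, size_t extent, int for_write)
{
    return region_lock_op(fd, for_write ? F_WRLCK : F_RDLCK, offset, extent);
}

/* Hypothetical helper: release a region locked by lock_region(). */
int unlock_region(int fd, off_t offset, size_t extent)
{
    return region_lock_op(fd, F_UNLCK, offset, extent);
}
```

A get-style operation would call lock_region() before reading the bytes, and
the matching release would call unlock_region() after any write-back. These
locks are advisory, so they only protect programs that all follow the same
protocol.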

-glenn
--- Begin Message ---
  • Subject: Problem reading NetCDF files
  • Date: Wed, 18 Jun 1997 13:55:04 -0400 (EDT) address@hidden
Matt-

I've narrowed the problem you were having reading netCDF files down to
what appears to be an error in the Cray PUTENV function.  This function
is called to temporarily reset the NETCDF_FFIOSPEC environment variable
and reopen the file to access the "time" coordinates efficiently. When
these coordinates have been read, the file is closed and
NETCDF_FFIOSPEC is reset to its original value, again by calling
PUTENV.
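
The save, reset, and restore pattern described above can be sketched in C
with the POSIX getenv(), setenv(), and unsetenv() calls. This is an
illustration of the pattern only, not the Cray PUTENV routine or the code in
readnc_1.F; the helper names env_swap() and env_restore() are hypothetical.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Set 'name' to 'value', handing back a copy of the old value in *saved
 * (NULL if the variable was unset). Returns 0 on success, -1 on error. */
int env_swap(const char *name, const char *value, char **saved)
{
    const char *old;
    assert(name != NULL && value != NULL && saved != NULL);
    old = getenv(name);
    *saved = NULL;
    if (old != NULL) {
        /* Copy it: the pointer from getenv() may not survive setenv(). */
        *saved = malloc(strlen(old) + 1);
        if (*saved == NULL)
            return -1;
        strcpy(*saved, old);
    }
    if (setenv(name, value, 1) != 0) {
        free(*saved);
        *saved = NULL;
        return -1;
    }
    return 0;
}

/* Put back what env_swap() saved, then free the copy. */
int env_restore(const char *name, char *saved)
{
    int status = (saved != NULL) ? setenv(name, saved, 1) : unsetenv(name);
    free(saved);
    return status;
}
```

In the scenario above, the file would be reopened and the "time" coordinates
read between the env_swap() and env_restore() calls.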

Ramesh is looking into the problem.  In the meantime, I can offer you a
temporary workaround in the form of a modified version of "readnc_1.F"
in
         /home/jps/codes/netcdf/v0.7/special/readnc_1.F

In this version, I've disabled the environment-variable-resetting.
While this will keep you from bombing, it may lead to longer run times
due to the inefficiency of retrieving time coordinate values using
larger FFIO cache pages.  You may want to temporarily use a larger
number of smaller pages (i.e., the same amount of memory) in your
specification of NETCDF_FFIOSPEC, say "cachea:32:16" instead of
"cachea:256:2".
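
The "same amount of memory" remark can be checked with a few lines of C.
The sketch assumes, as the wording above suggests, that the two numeric
fields are the page size and the number of pages; the function names are
hypothetical and no units are implied.

```c
#include <assert.h>
#include <stdio.h>

/* Split a "cachea:<page_size>:<num_pages>" string into its two numbers.
 * The field order is an assumption. Returns 0 on success, -1 otherwise. */
int cachea_fields(const char *spec, long *page_size, long *num_pages)
{
    assert(spec != NULL && page_size != NULL && num_pages != NULL);
    return sscanf(spec, "cachea:%ld:%ld", page_size, num_pages) == 2 ? 0 : -1;
}

/* Total cache size as page_size * num_pages, or -1 if the spec is malformed. */
long cachea_total(const char *spec)
{
    long page_size, num_pages;
    if (cachea_fields(spec, &page_size, &num_pages) != 0)
        return -1;
    return page_size * num_pages;
}
```

Both specifications give the same product: 32 * 16 = 512 and 256 * 2 = 512.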

Let me know how it goes...

John

--- End Message ---