Re: netCDF performance on T3E



> >To: address@hidden
> >From: Al Bourgeois <address@hidden>
> >Subject: netCDF performance on T3E
> >Organization: Lockheed Martin
> >Keywords: 199901192047.NAA24096
>
> I am involved in porting an EPA air quality model from the T3D to the
> T3E, and am having serious problems with netCDF performance on the T3E.
> (I'm working with David Wong, who I believe is corresponding with you
> regarding v3.4 installation on the T3D.)
>
> I am comparing netCDF v3.4 on the T3E with v1.? on the T3D. Because we
> have not been able to get v3.4 installed on the T3D, we have been using
> a very old version (vintage 1993) that was hacked (by some folks at
> Lawrence Livermore Lab).
>
> Here is what I am observing:
>
> On the T3E, netCDF seems to be grabbing huge amounts of memory, and
> running very very slowly. For example, NCOPN grabs 3MB of memory each
> time it is called to read existing files. Our model opens quite a few
> files, and this is costing a whopping 8MW of memory just in file opens!
> (Our T3E nodes have 16MW of memory, so this severely restricts the size
> of the problem we can fit on a given number of processors.) This doesn't
> happen on the T3D with the older netCDF.
>
> On the T3E, reading one model file variable with NCVGT takes an average
> of about 3 seconds, compared with an average of about .2 seconds on the
> T3D. (NCVGT is being called to do each file variable read, with DELTS =
> 75,  18,  21,  1.) Since the model reads hundreds of variables, this is
> really slowing things down.
>
>
> The netCDF library I'm using was installed by the support group for the
> T3E I'm using, so I don't know what options they used to compile it. I
> did try linking in with a version 3.3.1 library which was installed by
> an EPA employee, and I observe the same problems.
>
> David Wong just ran the same model on the T3D with the v3.4 library
> binary you gave him, and he hit the processor memory limit.
>
> If this would be easier to discuss on the phone, my number is (919)
> 541-0915, or I would be glad to call someone there.
>
> Thanks.
>
> -Al

Al:

It is very difficult for us to support every high-performance computing
configuration, since we have access to only a few machines, each with
specific versions of compilers and libraries.
So support for netCDF on the T3E and similar systems relies heavily on
community interaction and on the specialized support staff at the various
supercomputer centers.

We haven't received any complaints about performance or excessive
memory use from other T3E sites. I do seem to recall that there were
some problems with specific versions of the ffio global I/O,
which were addressed by Cray SPRs. See the attached message.
It may also be possible to control the amount of memory used in the
ffio layer through ffio options, and the default may be
too large for your application.
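
To put the memory and timing figures from the report in one unit, here is
a rough back-of-the-envelope sketch. It assumes one T3E word is 8 bytes
and that "M" means 2**20 in both MW and MB; the variable and function
names are made up for illustration and are not part of netCDF.

```python
import math

# Assumption: one Cray T3E word is 64 bits (8 bytes).
WORD_BYTES = 8

def megawords_to_megabytes(megawords):
    """Convert MW to MB, assuming the same 2**20 multiplier for both."""
    return megawords * WORD_BYTES

# Figures quoted in the report.
node_memory_mb = megawords_to_megabytes(16)   # memory per T3E node
file_open_mb = megawords_to_megabytes(8)      # memory lost to file opens
per_open_mb = 3                               # reported cost of one NCOPN call

# Share of a node's memory consumed by file opens alone.
fraction_of_node = file_open_mb / node_memory_mb

# Number of opens that would account for the total at 3 MB each.
implied_opens = file_open_mb / per_open_mb

# Size of one NCVGT read, from the reported edge lengths.
edge_lengths = (75, 18, 21, 1)
elements_per_read = math.prod(edge_lengths)

# Reported average read times in seconds.
t3e_seconds = 3.0
t3d_seconds = 0.2
slowdown = t3e_seconds / t3d_seconds

print(f"node memory:        {node_memory_mb} MB")
print(f"used by file opens: {file_open_mb} MB ({fraction_of_node:.0%} of node)")
print(f"implied opens:      about {implied_opens:.0f}")
print(f"elements per read:  {elements_per_read}")
print(f"T3E/T3D slowdown:   about {slowdown:.0f}x")
```

Under those assumptions, file opens alone take half of each node's memory,
which is why buffer sizes in the I/O layer are the first thing to check.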

A good place to start understanding the T3E netCDF work is
http://www.nersc.gov/~rkowen/netcdf/index.html
Robert Owen has led the effort to tailor netCDF to the T3E under a grant
to NERSC. We are working on a release that contains Robert's work.
If you are interested in using it, let us know and we can make it
available to you. Otherwise, Robert has made some patch files available
from the web site above.

Hope this helps.

-glenn
--- Begin Message ---
  • Subject: SPR information
  • Date: Tue, 4 Aug 1998 08:47:55 -0700 (PDT)
Hello Glenn,

Here is the information regarding the SPRs we talked 
about yesterday. 

Level    Number Title

Critical 712309 Global I/O hangs when ffflush() is used
Critical 712310 Data corruption problems with global I/O that don't
                exist with "par_io"

The two critical SPRs above have been fixed.
The fixes will be in Craylibs 3.0.2.3 (the next release).  

With warm regards,

Harsh


--- End Message ---