
Re: 20030618: NetCDF performance problems



>To: address@hidden, 
>From: Gottfried Necker <gottfried.necker@xxxxxxxxxxx>
>Subject: NetCDF performance problems.
>Organization: .
>Keywords: 200306180911.h5I9BvLd025090 netCDF 3.5.1-beta10 Fujitsu VPP

Gottfried,

I tried profiling netcdf-beta5 and netcdf-beta10 this morning on the
nc_test test program provided with the distribution, on a Solaris 8
platform, and you were right: a platform-independent performance
problem has been introduced that we need to fix before release.

Here's a comparison of the number of times px_pgin and px_pgout are
called using the netcdf-3.5.1 beta5 versus beta10 releases, using
gprof to profile and just capturing the number of calls with grep:

  test/gf/beta5-xpg/src/nc_test$ gprof nc_test | grep px_pg
                  0.00        0.07    2788/2788        px_pgin [14]     
                  0.00        0.00      39/2129        px_pgout [9]
   ...

  test/gf/beta10-xpg/src/nc_test$ gprof nc_test | grep px_pg
                  0.02        1.43  112164/112164      px_pgin [5]
                  0.00        0.00      43/30624       px_pgout [9]
   ...
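
To put a number on the difference, here is a small Python sketch that
pulls the call counts out of the four gprof lines above and prints the
beta10/beta5 ratio. It assumes the "N/M" column is gprof's usual
"calls from this caller / total calls" pair and compares the totals;
the regular expression and the helper name total_calls are only for
illustration and are not part of netCDF or gprof.

```python
import re

# gprof call-graph lines copied from the two runs above.
BETA5 = """\
                  0.00        0.07    2788/2788        px_pgin [14]
                  0.00        0.00      39/2129        px_pgout [9]
"""
BETA10 = """\
                  0.02        1.43  112164/112164      px_pgin [5]
                  0.00        0.00      43/30624       px_pgout [9]
"""

# Assumed layout: "<calls from caller>/<total calls>  <name> [<index>]".
LINE = re.compile(r"(\d+)/(\d+)\s+(\w+)\s+\[\d+\]")


def total_calls(text):
    """Map each routine name to its total call count (hypothetical helper)."""
    totals = {}
    for line in text.splitlines():
        match = LINE.search(line)
        if match:
            totals[match.group(3)] = int(match.group(2))
    return totals


b5 = total_calls(BETA5)
b10 = total_calls(BETA10)
ratios = {name: b10[name] / b5[name] for name in b5}

for name, ratio in ratios.items():
    print(f"{name}: {b5[name]} -> {b10[name]} calls ({ratio:.1f}x)")
```

Under that reading of the columns, px_pgin is called about 40 times
more often in beta10, and px_pgout about 14 times more often.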

Since you've given us enough information to reproduce the problem here
on a Solaris platform, we should be able to fix the problem here and
put it into the next release.  

I'll let you know when we have figured out what the problem is and have a
patch to test.  Thanks!

--Russ

On Fri, 20 Jun 2003 11:57:56 +020, you wrote:

 > Hi Gottfried,
 > 
 > > > 
 > > > Another possibility would be providing you with some versions between
 > > > beta3 and beta10 that would help isolate which changes caused the
 > > > problem.
 > > I tried with netcdf-3.5.1-beta5 and there's no problem. I went back to
 > > beta10 and got the problem again. I diffed the libsrc directory and
 > > the only substantial difference between these versions is in
 > > posixio.c, where the call to ftruncate is replaced by calls to seek. I
 > > will try to put the code with ftruncate into beta10 to see what
 > > happens. But I don't have the time to do it now. I will try this on
 > > Friday.
 > 
 > Thanks, just this information is a big help.  I'm also anxious to hear
 > what you find out when substituting ftruncate for the call to lseek.
 > The revision notice we have on that change was:
 > 
 >   ... eliminated unnecessary use of ftruncate(), because it fails on
 >   FAT32 file systems under Linux.
 > 
 > If this causes a performance problem on other systems, maybe we can
 > find a better fix for the Linux problem.
 > 
 > --Russ
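
For readers not familiar with the two approaches mentioned in the
quoted text, the following Python sketch shows two generic ways to
give a file a certain length: calling ftruncate, or seeking to the
last byte and writing it. This is not the code from posixio.c, only an
illustration of the general idea; the function names and the 4096-byte
length are made up for the example.

```python
import os
import tempfile


def extend_with_ftruncate(fd, length):
    """Set the file length directly (the call reported to fail on FAT32)."""
    os.ftruncate(fd, length)


def extend_with_seek(fd, length):
    """Grow the file by seeking to the last byte and writing a zero there."""
    if os.fstat(fd).st_size < length:
        os.lseek(fd, length - 1, os.SEEK_SET)
        os.write(fd, b"\0")


def demo(extend, length=4096):
    """Apply one strategy to a fresh temporary file and return its size."""
    fd, path = tempfile.mkstemp()
    try:
        extend(fd, length)
        return os.fstat(fd).st_size
    finally:
        os.close(fd)
        os.remove(path)


sizes = {f.__name__: demo(f) for f in (extend_with_ftruncate, extend_with_seek)}
print(sizes)
```

Both strategies end up with a file of the requested size. Note that
the seek variant can only grow a file, whereas ftruncate can also
shrink it.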

At first I thought there was a single problem, but now I think there
are two. On NFS, beta10 uses too much system time; on the local file
systems, it waits for I/O. If I put posixio.c (rev. 1.69) into the
beta10 source and recompile the library, the waiting-for-I/O problem
is gone, but the system time problem now also shows up on the local
file system. I did PC sampling on my program with beta10, compared it
with beta5, and found that some routines (px_pgin and px_pgout) are
called many times more often with beta10 than with beta5.

If these routines are really called that often, it would explain the
huge difference in system time usage, but I have no idea why this
happens.

To illustrate the problem, here's the output of timex for beta10 (with
posixio.c rev. 1.69):
real        11:19.21
user         7:41.07
sys          2:17.28
vu-user      6:03.32
vu-sys          0.00

For comparison, the output for beta5:
real        10:39.11
user         9:46.32
sys             4.26
vu-user      7:56.27
vu-sys          0.00

Actually, the problem is slightly worse than shown here, because the
beta10 calculation was stopped earlier.
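
The size of the difference is easier to see in seconds. Below is a
short Python sketch that converts the real, user and sys values from
the two timex outputs above and prints the beta10/beta5 ratio for
each. It assumes the fields are in minutes:seconds format; the helper
to_seconds is only for illustration.

```python
def to_seconds(stamp):
    """Convert a timex field such as '2:17.28' or '4.26' to seconds."""
    seconds = 0.0
    for part in stamp.split(":"):
        seconds = seconds * 60 + float(part)
    return seconds


# Values copied from the two timex outputs above.
beta10 = {"real": "11:19.21", "user": "7:41.07", "sys": "2:17.28"}
beta5 = {"real": "10:39.11", "user": "9:46.32", "sys": "4.26"}

for field in ("real", "user", "sys"):
    t10, t5 = to_seconds(beta10[field]), to_seconds(beta5[field])
    print(f"{field:5s} beta5 {t5:8.2f} s  beta10 {t10:8.2f} s  ratio {t10 / t5:6.2f}")

sys_ratio = to_seconds(beta10["sys"]) / to_seconds(beta5["sys"])
```

With these numbers the system time goes from 4.26 s to 137.28 s,
roughly a factor of 32, while the real time changes only by a few
percent.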

I don't know what could cause such a problem, but I suspect it is
also present on other platforms. Maybe on those platforms you
don't pay such a high price for calling the px_pg* routines too often.

Gottfried