
Re: 20020214: netcdf 3.5.0 ncvarput failure - Cray SV1 (fwd)



This email has been forwarded to the netCDF support email archive for
archiving.

------- Forwarded Message

Return-Path: address@hidden
Delivery-Date: Thu Feb 14 16:49:43 2002
Received: from arsc.edu (mcgrew.arsc.edu [199.165.84.136])
        by unidata.ucar.edu (UCAR/Unidata) with ESMTP id g1ENngx21288;
        Thu, 14 Feb 2002 16:49:43 -0700 (MST)
Organization: Arctic Region Supercomputing Center
Keywords: 200202122006.g1CK6Lx24308
Received: from tanana.arsc.edu (tanana.arsc.edu [199.165.84.149])
        by arsc.edu (2000-04-24.ARSC) with ESMTP id OAA18619;
        Thu, 14 Feb 2002 14:49:41 -0900 (AST)
Received: from localhost (jlm@localhost)
        by tanana.arsc.edu (2000-04-25.ARSC) with ESMTP id OAA13249;
        Thu, 14 Feb 2002 14:49:41 -0900 (AST)
X-Authentication-Warning: tanana.arsc.edu: jlm owned process doing -bs
Date: Thu, 14 Feb 2002 14:49:41 -0900
From: John Metzner <address@hidden>
To: Steve Emmerson <address@hidden>
cc: address@hidden
Subject: Re: 20020214: netcdf 3.5.0 ncvarput failure - Cray SV1 
In-Reply-To: <address@hidden>
Message-ID: <address@hidden>
MIME-Version: 1.0
Content-Type: TEXT/PLAIN; charset=US-ASCII

Steve,

        The macros.make diff does not show anything significant, just the 
$SRCDIR and $prefix differences I would have expected.
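
For anyone repeating this comparison, here is a minimal sketch (not part of the netCDF distribution) of how the expected differences could be filtered out of a macros.make diff so that only unexpected ones remain.  The function name and the `NAME = value` line format are assumptions for illustration; the variable names SRCDIR and prefix are the ones mentioned above.

```python
import difflib

# Variables whose values are expected to differ between build trees
# (assumption: plain "NAME = value" assignments, as in a typical makefile).
EXPECTED = ("SRCDIR", "prefix")


def significant_diff(good_text, bad_text, expected=EXPECTED):
    """Return changed lines that do not assign one of the expected variables."""
    def keep(line):
        name = line.split("=", 1)[0].strip()
        return name not in expected

    good = [l for l in good_text.splitlines() if keep(l)]
    bad = [l for l in bad_text.splitlines() if keep(l)]
    diff = difflib.unified_diff(good, bad, "good", "bad", n=0, lineterm="")
    # Keep only added/removed lines, dropping the file headers and hunk markers.
    return [l for l in diff
            if l.startswith(("+", "-")) and not l.startswith(("+++", "---"))]
```

An empty result would correspond to the situation described above: nothing differs apart from the path-related settings.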

        I see a number of *.o file size differences between the "good" and 
locally built files in src/nctest.  The "good":

chilkoot$ ls -l *.o
-rw-------   1 jlm      cray        6848 Feb 13 19:18 add.o
-rw-------   1 jlm      cray       79448 Feb 13 19:18 atttests.o
-rw-------   1 jlm      cray       35640 Feb 13 19:18 cdftests.o
-rw-------   1 jlm      cray       21544 Feb 13 19:18 dimtests.o
-rw-------   1 jlm      cray        3904 Feb 13 19:18 driver.o
-rw-------   1 jlm      cray        1768 Feb 13 19:18 emalloc.o
-rw-------   1 jlm      cray        1536 Feb 13 19:18 error.o
-rw-------   1 jlm      cray        2608 Feb 13 19:18 misctest.o
-rw-------   1 jlm      cray       27480 Feb 13 19:18 rec.o
-rw-------   1 jlm      cray       13552 Feb 13 19:18 slabs.o
-rw-------   1 jlm      cray        7912 Feb 13 19:18 val.o
-rw-------   1 jlm      cray       13368 Feb 13 19:18 vardef.o
-rw-------   1 jlm      cray        5840 Feb 13 19:18 varget.o
-rw-------   1 jlm      cray        6200 Feb 13 19:18 vargetg.o
-rw-------   1 jlm      cray        6040 Feb 13 19:18 varput.o
-rw-------   1 jlm      cray        6312 Feb 13 19:18 varputg.o
-rw-------   1 jlm      cray       30280 Feb 13 19:18 vartests.o
-rw-------   1 jlm      cray        5104 Feb 13 19:18 vputget.o
-rw-------   1 jlm      cray        7376 Feb 13 19:18 vputgetg.o

        The locally built "bad":  (differences flagged w/ !!!)

chilkoot$ ls -l *.o
-rw-------   1 jlm      software    7000 Feb 13 20:29 add.o        !!!
-rw-------   1 jlm      software   79448 Feb 13 20:29 atttests.o
-rw-------   1 jlm      software   35640 Feb 13 20:29 cdftests.o
-rw-------   1 jlm      software   21544 Feb 13 20:29 dimtests.o
-rw-------   1 jlm      software    3904 Feb 13 20:29 driver.o
-rw-------   1 jlm      software    1768 Feb 13 20:29 emalloc.o
-rw-------   1 jlm      software    1536 Feb 13 20:29 error.o
-rw-------   1 jlm      software    2608 Feb 13 20:29 misctest.o
-rw-------   1 jlm      software   27480 Feb 13 20:29 rec.o
-rw-------   1 jlm      software   13576 Feb 13 20:29 slabs.o      !!!
-rw-------   1 jlm      software    7560 Feb 13 20:29 val.o        !!!
-rw-------   1 jlm      software   13368 Feb 13 20:29 vardef.o
-rw-------   1 jlm      software    5840 Feb 13 20:29 varget.o
-rw-------   1 jlm      software    6200 Feb 13 20:29 vargetg.o
-rw-------   1 jlm      software    6040 Feb 13 20:29 varput.o
-rw-------   1 jlm      software    6312 Feb 13 20:29 varputg.o
-rw-------   1 jlm      software   30280 Feb 13 20:29 vartests.o
-rw-------   1 jlm      software    5160 Feb 13 20:29 vputget.o    !!!
-rw-------   1 jlm      software    7520 Feb 13 20:29 vputgetg.o   !!!
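
The size comparison between the two listings was done by eye; a small script could do the same job.  The following is only a sketch under the assumption that both build trees are reachable as ordinary directories; the function name and arguments are hypothetical and not part of any netCDF tooling.

```python
import os


def size_differences(good_dir, bad_dir, suffix=".o"):
    """Map file name -> (good size, bad size) for files whose sizes differ."""
    diffs = {}
    for name in sorted(os.listdir(good_dir)):
        if not name.endswith(suffix):
            continue
        bad_path = os.path.join(bad_dir, name)
        if not os.path.isfile(bad_path):
            continue  # only compare files present in both trees
        good_size = os.path.getsize(os.path.join(good_dir, name))
        bad_size = os.path.getsize(bad_path)
        if good_size != bad_size:
            diffs[name] = (good_size, bad_size)
    return diffs
```

Note that equal sizes do not prove two object files are identical; a byte-wise comparison would be needed for that.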

        The src/nctest/nctest binaries are different sizes, of course.  The
"good":

-rwx------   1 jlm      cray     1966216 Feb 14 12:28 nctest

The local "bad":

-rwx------   1 jlm      software 1965960 Feb 13 20:29 nctest


        Also, I found a core file in the bad directory from the 'make test'
run.  Thought it might mean something to you. 

chilkoot$ debugview core
CrayTools DebugView 3.0.0.35 (Cray version) Mar 12 2001  14:24:46
------------------------------------------------------------------
No symbols are available for debugging because the executable has
been stripped or is not accessible.  Source-level debugging is not
available, and in some cases, TotalView may fail when allocating
memory for the assembly-code listing.  If you are debugging a core
file, running totalview specifying only the core file may help.
------------------------------------------------------------------

 ***** START OF SYMBOLIC DUMP *****

 LIST OF PROCESS STATES

 PIDs 8610:  Signal SIGFPE <Floating point exception>

 DISPLAYING PIDs 8610:  Signal SIGFPE <Floating point exception>

 Signal SIGFPE in routine ncx_putn_float_float at address 0p113671d
 ncx_putn_float_float was called by putNCv_float at line 1913 (address 0p134147d)
 putNCv_float was called by nc_put_vara_float at line 5675 (address 0p177240d)
 nc_put_vara_float was called by nc_put_varm at line 11048 (address 0p251461a)
 nc_put_varm was called by ncvarputg at line 624 (address 0p275263c)
 ncvarputg was called by test_varputgetg at line 119 (address 0p12067b)
 test_varputgetg was called by $STKOFEN at line 52 (address 0p545453b)
 $STKOFEN was called by test_ncvarputg at line 52 (address 0p3273b)
 test_ncvarputg was called by main at line 66 (address 0p12644d)
 main was called by $START$ at line 350 (address 0p1121c)

***** END OF SYMBOLIC DUMP *****


        Any thoughts on where to go next to get a good 'make test' run?  I'm
thinking of building a 'chroot' environment where I can guarantee I've 
eliminated any /usr/local/lib libraries without affecting the real users on 
the system.  I can make any changes I want within it to isolate the cause of 
the failed 'make test'.

        Thanks for all your time and quick responses.  It is much appreciated.

Regards,
John Metzner - Cray, Inc                          address@hidden
Arctic Region Supercomputing Center               address@hidden
910 Yukon Drive Rm. 106E                          Phone: (907)474-5431
Fairbanks, AK 99775-6020                          FAX:   (907)474-1820

On Thu, 14 Feb 2002, Steve Emmerson wrote:

> Date: Thu, 14 Feb 2002 15:07:03 -0700
> From: Steve Emmerson <address@hidden>
> To: John Metzner <address@hidden>
> Cc: address@hidden
> Subject: 20020214: netcdf 3.5.0 ncvarput failure - Cray SV1 
> 
> John,
> 
> >Date: Thu, 14 Feb 2002 12:48:03 -0900
> >From: John Metzner <address@hidden>
> >Organization: Arctic Region Supercomputing Center
> >To: Steve Emmerson <address@hidden>
> >Subject: Re: 20020212: netcdf 3.5.0 ncvarput failure - Cray SV1
> >Keywords: 200202122006.g1CK6Lx24308
> 
> The above message contained the following:
> 
> >     I'm still working on trying to get netCDF 3.5.0 built and tested on
> > our Cray SV1ex.  I tried turning down the optimization level as you
> > suggested to no avail, same error during 'make test'.  This was done
> > after a 'make distclean', making sure there was no config.cache and
> > resetting the environment variables.  There is one (that I know of) local
> > change to the default library search path which causes /usr/local/lib to
> > be prepended to the library search path (even preempting -L on the
> > command line) which I pulled out.  I ran through the full build & test
> > sequence again and got the same error as below.
> >     I did pull the netCDF-3.5.0 package inside Cray Corporate, built and
> > tested the package there on a SV1ex.  It worked, so the problem is some
> > local system change which is getting in the way.
> >     I pulled the package from Cray Corporate back out to the site with
> > the "good" libraries and build products.  I reran the 'make test' on it,
> > again without error.
> >     Next I copied the locally built libsrc/libnetcdf.a and
> > cxx/libnetcdf_c++.a into the proper location for the "good" package from
> > Cray Corporate.  A 'make test' ran again without error.  I was trying to
> > determine if the problem was in the test code or the libraries built
> > locally.  Is that a valid test?
> 
> If your locally-built libnetcdf.a library, when copied into the Cray
> Corporate package, results in that package correctly executing a "make
> test", then it would seem that the problem lies in the building and/or
> execution of the netCDF-2 test program rather than with the netCDF
> library functions.
> 
> A good way to look at the differences in the build environments is to
> use the "diff" utility on the file "macros.make", which is located in
> the top-level source directory.  Does it show anything significant?
> 
> Another thing to check is whether or not the files in the netCDF-2 test
> directory, "nctest", are the same.
> 
> Regards,
> Steve Emmerson   <http://www.unidata.ucar.edu>
> 

>From address@hidden Fri Feb 15 12:09:40 2002
>Subject: Re: 20020214: netcdf 3.5.0 ncvarput failure - Cray SV1 

Steve,
        I did a bit more testing with the nctest code, comparing builds 
between here at ARSC and inside Cray Corporate.  I was able to get the 
nctest code to build and run successfully here when I changed the CFLAGS 
entry in the macros.make file from "-O3" to "-h inline3,scalar3,task1,vector0".
Also "-O0" would work, but not "-O1" (-O1 is equivalent to -h inline1,scalar1,
task1,vector1).
        I found that the C/C++ compiler versions were slightly different
between here and the Cray Corporate machine.  We are running version 3.5.0.1
and the corporate system was 3.5.0.3.  When I changed to the same 3.5.0.1
compiler on the corporate machine, I got the same failure.  The problem was
still there when I switched to 3.5.0.2 on the corporate system.  So, Cray 
made some change to the compiler at 3.5.0.3 which allows nctest to not error
out on a floating point exception.  
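
For sites stuck on an older compiler, the CFLAGS workaround above could be applied to macros.make with a small script rather than by hand.  This is a rough sketch only: the function name is hypothetical, and it assumes the file contains a single line of the form `CFLAGS = ...`, which should be checked against the actual macros.make before use.

```python
import re


def set_cflags(makefile_text, new_flags):
    """Replace the value of the first CFLAGS assignment, leaving other lines alone."""
    # Anchored at line start so that CXXFLAGS and similar names are not matched.
    pattern = re.compile(r"^(CFLAGS[ \t]*=[ \t]*).*$", re.MULTILINE)
    return pattern.sub(lambda m: m.group(1) + new_flags, makefile_text, count=1)
```

Calling it with the flag string quoted above, "-h inline3,scalar3,task1,vector0", would produce the modified file text; the rest of macros.make is left untouched.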
        You might want to enter this into your problem/fix database in case 
some other poor Cray soul gets bit by it.  Thanks for all your help and 
quick responses.  It's great to get this kind of support on an open source
package, pretty rare too.

Thanks again,
John Metzner - Cray, Inc                          address@hidden
Arctic Region Supercomputing Center               address@hidden
910 Yukon Drive Rm. 106E                          Phone: (907)474-5431
Fairbanks, AK 99775-6020                          FAX:   (907)474-1820
