
Re: netcdf install problems



> From address@hidden Mon Mar 28 14:40:29 1994
> Keywords: 199403232343.AA14882
> Date: Wed, 23 Mar 94 16:42:52 MST
> From: address@hidden (Dave Resch)
> To: address@hidden
> Cc: address@hidden, address@hidden
> Subject: netcdf install problems
> 
> I have fixed several environment, compiler, and loader problems that we were
> having installing netcdf on the Cray-3 machine (graywolf).
> 
> The following tests now all execute correctly:
>       xdr/test 
>       nctest/test 
>       fortran/test 
>       ncdump/test
>         ncgen/test
> 
> The formatted test for libsrc/test also passes.  The binary test for
> libsrc/test fails.  There is exactly 1 32-bit floating point value that
> differs from the reference file provided:
> 
> test.cdf test_cdf.sav differ: char 2060, line 5
> 
> If I do an "od" on each of the two files and then a diff on the od outputs, 
> there are just a couple of differing bits:
> 
> 115c115
> < 0000000000400 0000000000000000000000 0000000000470060703600
> ---
> > 0000000000400 0000000000000000000000 0000000000030060703600
> 
> 
> If I copy the binary files to a Sun machine (32-bit IEEE) and look at the 
> differences in floating point format, I see:
> 
> 115c115
> < 0004000   0.0000000e+00  0.0000000e+00  1.2611686e-44 -6.1102905e+00
> ---
> > 0004000   0.0000000e+00  0.0000000e+00  0.0000000e+00 -6.1102905e+00
> 
> 
> 
> Looking at the character of the data around the differing bits, I see
> that a sequence of FP values is being generated ...,53.0,54.0,55.0
> There is then a section of values (where the difference occurs) and 
> then the sequence picks up again 56.0,57.0,58.0,...
> 
> 
> 0003440   4.7000000e+01  4.8000000e+01  4.9000000e+01  5.0000000e+01
> 0003460   5.1000000e+01  5.2000000e+01  5.3000000e+01  5.4000000e+01
> 0003500   5.5000000e+01 -6.1102905e+00  0.0000000e+00 -6.1102905e+00
> 0003520   0.0000000e+00 -6.1102905e+00  0.0000000e+00 -6.1102905e+00
> *
> 0003620   0.0000000e+00 -6.1102905e+00  0.0000000e+00  2.1562500e+00
> 0003640   0.0000000e+00 -6.1102905e+00  0.0000000e+00 -6.1102905e+00
> *
> 0004000   0.0000000e+00  0.0000000e+00  1.2611686e-44 -6.1102905e+00
> 0004020   0.0000000e+00 -6.1102905e+00  0.0000000e+00 -6.1102905e+00
> *
> 0004400   0.0000000e+00  1.4257908e+01  6.6542599e+22 -1.3775465e-40
> 0004420  -1.3775465e-40 -1.3775465e-40 -1.3775465e-40  5.6000000e+01
> 0004440   5.7000000e+01  5.8000000e+01  5.9000000e+01  6.0000000e+01
> 0004460   6.1000000e+01  6.2000000e+01  6.3000000e+01  6.4000000e+01
> 
> 
> I attempted to look into what the test code was doing at that point,
> but had some trouble following what was going on in the test code
> and the sequence of calls being generated:
> 
> fill_seq() -> ncvarput1() -> NCvar1io -> xdr_NCv1data ...
> 
> 
> I then built the test and all of the sources in libsrc with the 
>  -DCDEBUG and -DVDEBUG flags since those looked relevant.  The following
> is the output generated around where the incorrect value is written to
> the binary file test.cdf:
> 
> ncvarput1: Float offset 1848
> shape    0 7 8
> coords   0 6 5
>         NCcoordck: coords 37894, count 3, ip 37896
>         NCcoordck: ip 112010, *ip 6, up 207607, *up 8
>         NCcoordck: ip 112007, *ip 6, up 207606, *up 7
> ncvarput1: Float offset 1852
> shape    0 7 8
> coords   0 6 6
>         NCcoordck: coords 37894, count 3, ip 37896
>         NCcoordck: ip 112010, *ip 7, up 207607, *up 8
>         NCcoordck: ip 112007, *ip 6, up 207606, *up 7
> ncvarput1: Float offset 1856
> shape    0 7 8
> coords   0 6 7
>         NCcoordck: coords 37894, count 3, ip 37896
>         NCcoordck: ip 112010, *ip 0, up 207607, *up 8
>         NCcoordck: ip 112007, *ip 0, up 207606, *up 7
> ncvarput1: Float offset 2332
> shape    0 7 8
> coords   1 0 0
>         NCcoordck: coords 37894, count 3, ip 37896
>         NCcoordck: ip 112010, *ip 1, up 207607, *up 8
>         NCcoordck: ip 112007, *ip 0, up 207606, *up 7
> ncvarput1: Float offset 2336
> 
> 
> Again, it seems that we are writing a single floating point value on each
> call until we get to offset 1856 (03500), i.e., the offsets are increasing 
> by 4 for each call.  The next offset is 2332 (04434).
> 
> 
> Now some questions:
> 
> 1.) Do you have any suggestions on how to proceed from here?  

On your Sun, compare the output of `ncdump` for the two files.

> 2.) Are there additional debugging switches/tools that would be of help?  

Not really, except ncdump. The -DCDEBUG and -DVDEBUG are probably not
necessary here.

> 3.) What is being written to the binary file around the location where
>     the difference occurs?
> 4.) Do you have any ideas as to why the differing bits are being written
>     incorrectly?
> 
> Thanks,
> 
> Dave Resch
> Cray Computer Corporation
> address@hidden
> (719) 540-4323

Sorry it took so long to get back to you. Somehow this question was languishing
in the inbox of a person who was on vacation.

First of all, the fact that the files do not match bit for bit when the
source machine is not IEEE should not necessarily be a concern.
One might well imagine that the mapping from Cray floating point to IEEE
floating point would have some fuzziness due to the differing precisions of
the representations. In the other tests (nctest, xdrtest, ...) the floating
point numbers we put into the files all have exact representations in the
floating point formats we know about. The purpose of 'cdftest' was more for
checking that libsrc still compiles and executes. If the 'ncdump' output for
your "test.cdf" and for "test_cdf.sav" is the same, then I wouldn't worry
about it. In fact, I wouldn't worry about it if the numbers came back "close",
that is, within the larger of the floating point epsilons of the two systems.
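
A minimal sketch of what such a "close" comparison could look like. The
helper name close_enough and its parameters are hypothetical, not part of
netcdf; rel_eps would be the larger epsilon of the two systems, and abs_floor
is an assumed extra allowance for values that should be exactly zero.

```c
#include <assert.h>

/* Hypothetical helper, not part of netcdf.
 * Returns 1 if a and b differ by no more than rel_eps times the larger
 * magnitude, or by no more than abs_floor; returns 0 otherwise. */
int close_enough(double a, double b, double rel_eps, double abs_floor)
{
    double diff, mag_a, mag_b, scale, tol;

    assert(rel_eps >= 0.0 && abs_floor >= 0.0);

    diff  = (a > b) ? (a - b) : (b - a);     /* |a - b| */
    mag_a = (a < 0.0) ? -a : a;              /* |a| */
    mag_b = (b < 0.0) ? -b : b;              /* |b| */
    scale = (mag_a > mag_b) ? mag_a : mag_b; /* larger magnitude */

    tol = rel_eps * scale;
    if (tol < abs_floor)
        tol = abs_floor;                     /* never tighter than the floor */

    return diff <= tol;
}
```

Note that with abs_floor set to 0.0 a purely relative test never accepts a
nonzero value against an exact zero, so whether to allow a floor at all is a
judgment call for the comparison at hand.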

The value you are questioning is in the double precision netcdf variable
"Double" at index {0, 3, 0}. It is set at line 453 in cdtest.c:

assert( ncvarput1(id, Double_id, indices[1],(ncvoid *)&zed) != -1 ) ;

This line "puts" a value of zero in that position. The surrounding
values are (double) 9999.

This may indicate that your xdr_double() routine is not working correctly for
the value
double zed = 0.0 ;
in converting from Cray floating point to IEEE.
You can test this in isolation by pruning xdrtest.c down to the point where
it just writes out and reads back a single double.
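
One hedged way to judge the result of such a pruned-down test without relying
on the host floating point format is to look at the encoded bytes themselves.
XDR encodes a double as 8 bytes in IEEE double format, most significant byte
first, so an encoded 0.0 should be all zero bits apart from possibly the sign
bit. The sketch below assumes the 8-byte buffer has already been filled by
xdr_double() on a memory stream; the helper name xdr_double_is_zero is
hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper, not part of netcdf or the XDR library.
 * buf holds one XDR-encoded double: 8 bytes, IEEE 754 double,
 * most significant byte first.
 * Returns 1 if the encoding represents +0.0 or -0.0, else 0. */
int xdr_double_is_zero(const unsigned char *buf)
{
    size_t i;

    assert(buf != NULL);

    /* Byte 0 carries the sign bit (0x80) and the top 7 exponent bits. */
    if ((buf[0] & 0x7f) != 0)
        return 0;

    /* The remaining exponent and fraction bits must all be clear. */
    for (i = 1; i < 8; i++) {
        if (buf[i] != 0)
            return 0;
    }
    return 1;
}
```

A buffer with any stray low-order fraction bit set fails this check even
though the value it encodes is far too small to notice when printed.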

(Don't be too ashamed: the "reference implementation"
from Sun Micro has problems converting VAX floating point numbers in
the IEEE subnormal range. It maps them to very large numbers!
Our xdrtest used to point this out; then we got more forgiving.)
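
For what it is worth, the stray value in the Sun "od" listing, 1.2611686e-44,
is itself in the IEEE single precision subnormal range. A small sketch for
inspecting such a value on an IEEE machine follows. The helper names are
hypothetical, and the code assumes float is a 32-bit IEEE single, which holds
on the Sun but not on the Cray.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: return the raw bit pattern of a float.
 * Assumes float is a 32-bit IEEE 754 single on this host. */
uint32_t float_bits(float f)
{
    uint32_t u;

    assert(sizeof u == sizeof f);
    memcpy(&u, &f, sizeof u);   /* copy the bytes, no value conversion */
    return u;
}

/* Hypothetical helper: 1 if the pattern is an IEEE single subnormal,
 * i.e. the exponent field is zero and the fraction is nonzero. */
int is_subnormal_bits(uint32_t bits)
{
    return (bits & 0x7f800000u) == 0 && (bits & 0x007fffffu) != 0;
}
```

Passing 1.2611686e-44f through float_bits() shows only a few low-order
fraction bits set, which is the kind of pattern to look for when a value that
should be an exact zero comes back slightly off.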

Otherwise, it might also indicate a compiler problem with the initializer
expression
double zed = 0.0 ;

Hope this helps.

-glenn