[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: [netCDF #KLB-596506]: apparent bug in netcdf-4.2



Wei-Keng,

Thanks for your insights!  In particular, your explanation for why I
couldn't reproduce the bug with a classic format copy of the
test_orig.nc 64-bit offset file is exactly right, as I used nccopy,
which produced a copy without the 512-byte alignments.  In fact, if I
use nccopy instead of cp to copy the original file, it preserves the
64-bit offset (CDF2) characteristic of the file, but also removes the
512-byte alignments, so the bug no longer occurs.  The nccopy utility
uses only the C netCDF API for I/O, so can't detect alignment
characteristics of the input file and doesn't copy them.

I propose trimming the Cc: addresses for continuing discussion of your
other points, just because I think we may get into details of little
interest to others.  I'll promise to post the resolution of the bug fix
to the original list of addresses when it's done, tested, and committed.

I'll also try to keep progress summaries in the Jira ticket so people
can follow what's going on if they're really interested:

  https://bugtracking.unidata.ucar.edu/browse/NCF-234  

--Russ

> Here is what I found. When entering nc_redef and exiting with nc_enddef,
> NC_begin in nc.c is called and it recalculates the file starting
> offsets for all variables, regardless the offsets in the existing file
> (i.e. ncp->begin_var and ncp->vars.value[*]->begin). Is it possible that
> the "gaps" between variables due to PnetCDF's 512-byte alignment are
> mistaken as part of the variables when the header size grows and all
> variables must be "shifted" down? In file nc.c, line 604, the assertion
>                 assert(gnu_off > old_off);
> check if the new begin of a variable is larger then the old one.
> I wonder why the assertion did not fail.
> 
> It is interesting that the CDF-1 format has no such problem, but
> PnetCDF does the same alignment for CDF-1 as well. Did you
> use nccopy to convert the CDF-2 to CDF-1 and run the test? In that
> case, I guess all alignments have been removed during the copy.
> 
> Now the question is whether netCDF would like to support this non-record
> variable alignment that PnetCDF is using. If not, PnetCDF can simply
> remove this feature to be compatible with netCDF (performance might
> suffers in some cases, though). If yes, what are the good and bad
> for other netCDF tools and backward compatibility?
> 
> One more thing, when you mentioned
> > the library has made the additional assumption of 4-byte alignment
> > within the fixed data section at least since version 3.6.0 in 2004
> > (I've tested that and intervening versions have the same bug).
> 
> Do you mean "within the the fixed data section" that each fixed =
> variable's
> begin must aligned with 4 bytes? Is netCDF doing this alignment? In file
> nc.c lines 209 and 210
>                 (*vpp)->begin =3D index;
>                 index +=3D (*vpp)->len;
> It does not appear to me the alignment is done.
> 
> PnetCDF's default alignment is 512 bytes which is already a multiple
> of 4, and fulfills the requirement of 4-byte alignment.
> I checked the file offsets of all variables in test_orig.nc and they
> all seem to be aligned with 512 byte boundaries. Did you find otherwise?
> Am I missing something here?
> 
> 
> Wei-keng
> 
> On Mar 5, 2013, at 10:33 AM, Unidata netCDF Support wrote:
> 
> > Hi Wei-keng,
> >=20
> >> This variable alignment is a PnetCDF behavior.
> >> The default alignment value for each non-record variable is 512 bytes =
> in PnetCDF.
> >>=20
> >> According to CDF-1 and CDF-2 file format specifications, each =
> variable has
> >> a field named "begin" which is the variable's file starting location.
> >> var :=3D name  nelems  [dimid ...]  vatt_array  nc_type  vsize  begin
> >>=20
> >> We believe PnetCDF's variable alignment does not violate the CDF =
> spec. and hence
> >> implemented this default alignment in hope to improve performance. =
> This alignment
> >> can be turned off by setting the two hints below.
> >> MPI_Info_set(info, "nc_header_align_size", "1");
> >> MPI_Info_set(info, "nc_var_align_size",    "1");
> >>=20
> >> I wonder if you can send us the file and program to reproduce the =
> corruption problem.
> >=20
> > Yes, I got it from Jim Edwards, and it's a 1.4GB file, along with a
> > Makefile and the original Fortran problem that demonstrated the bug:
> >=20
> >  ftp://ftp.cgd.ucar.edu/pub/eaton/nfbug.tar
> >=20
> > I converted to a C program demonstrating the bug, that's available
> > here:
> >=20
> >  https://bugtracking.unidata.ucar.edu/browse/NCF-234
> >=20
> > Now that we know what caused the observed symptoms, it would probably
> > be easy to create a smaller example by running ncgen linked with the
> > pnetcdf library on a CDL file that just has the first few of 288
> > variables.  It might even suffice to include just the first 1 or 2 of
> > the scalar integer variables that are stored with 512-byte alignment:
> >=20
> >  int timemgr_rst_nstep_rad_prev ;
> >  int timemgr_rst_type ;
> >=20
> > and the first vector variable=20
> >=20
> >  double grid1d_lon(gridcell) ;
> >=20
> > but I'm attaching a truncated version of the CDL file for the file =
> that
> > has the whole header, but datavalues for only the first 11 variables.
> >=20
> > The file can be read by our netCDF library, but when it's opened for
> > writing and a change is made to the schema, calling nc_redef(ncid) and
> > eventually nc_enddef(ncid), the variable offsets in the header are
> > rewritten assuming 4-byte alignment, so subsequent reads get bad data
> > values.  For the file in question, rewriting bad offsets only happens
> > when the file is a "CDF2" 64-bit offset format file.  The bug demo
> > program behaves correctly if run on a classic format version of the
> > original file.
> >=20
> > I think you're right that the file format specification permits the
> > way pnetcdf is making use of the variable offsets in the header, but
> > the library has made the additional assumption of 4-byte alignment
> > within the fixed data section at least since version 3.6.0 in 2004
> > (I've tested that and intervening versions have the same bug).
> >=20
> > So we should take responsibility for fixing this ...
> >=20
> > --Russ
> >=20
> >> Wei-keng
> >>=20
> >> On Mar 4, 2013, at 6:49 PM, Jim Edwards wrote:
> >>=20
> >>> Hi Russ,
> >>>=20
> >>> That turns out to have been the problem.   The original file was =
> created with pnetcdf.
> >>>=20
> >>> Jim
> >>>=20
> >>>=20
> >>>=20
> >>> On Mon, Mar 4, 2013 at 3:12 PM, Jim Edwards <address@hidden> =
> wrote:
> >>> Russ,
> >>>=20
> >>> We think that the original file may have been written with pnetcdf.  =
>  We are going to try to recreate the file with netcdf and again with =
> pnetcdf and see if that explains the issue.
> >>>=20
> >>> Jim
> >>>=20
> >>>=20
> >>> On Mon, Mar 4, 2013 at 2:31 PM, Samuel Levis <address@hidden> =
> wrote:
> >>> Not exactly. I tried 2-degree to 2-degree, 2-degree to 0.5, 2-degree =
> to 0.25, and others. All cases worked except the ones with the =
> 0.5-degree file as output.
> >>>=20
> >>> I also tried 0.5-degree to 0.5-degree (mapping the file into itself) =
> and that failed. When I say failed, I mean that the output file ends up =
> with junk in it.
> >>>=20
> >>> Sam
> >>>=20
> >>>=20
> >>> On 03/04/2013 02:26 PM, Jim Edwards wrote:
> >>>> Hi Russ,
> >>>>=20
> >>>> Another piece of information.   This program interpolates data from =
> a file of one resolution (2 degree in this case) to another.  When the =
> output file is low resolution, 1/2 degree or lower, the output file =
> looks fine, no corruption that we can detect.   It's only when the =
> output file is higher resolution (1/4 degree) that this problem comes =
> about.
> >>>>=20
> >>>> Jim
> >>>>=20
> >>>> On Mon, Mar 4, 2013 at 2:04 PM, Jim Edwards <address@hidden> =
> wrote:
> >>>> Hi Russ,
> >>>>=20
> >>>> It looks like that file was originally created on bluefire on =
> 11/21/11, I don't have any information about which netcdf library was =
> used, but I think that some adjustment may have been made inside netcdf =
> for performance on gpfs filesystems.
> >>>>=20
> >>>> But doesn't your own
> >>>> int nc__enddef(int ncid, size_t h_minfree, size_t v_align,
> >>>>                    size_t v_minfree, size_t r_align);
> >>>>=20
> >>>>=20
> >>>> allow for changing this alignment?   I don't know that that was =
> done for this file, but it would seem to suggest that there is no =
> assumption being violated about these alignments.  Or that one part of =
> netcdf is assuming something which another part is not.
> >>>>=20
> >>>>=20
> >>>>=20
> >>>> On Mon, Mar 4, 2013 at 12:53 PM, Unidata netCDF Support =
> <address@hidden> wrote:
> >>>> Hi Jim,
> >>>>=20
> >>>> I'm curious how the original file you provided was created and =
> perhaps
> >>>> modified.  It has a peculiar alignment characteristic that I =
> haven't
> >>>> seen before, and if there are more netCDF files being created the =
> same
> >>>> way, we may nned to adapt.
> >>>>=20
> >>>> Could you tell me the history of the file, what file system it was
> >>>> written on, and whether the netCDF library with which it was =
> written
> >>>> was modified in any way?
> >>>>=20
> >>>> The file has this characteristic, which would indicate a non-Posix
> >>>> file system: it is using 512-byte alignment of data values rather =
> than
> >>>> the 4-byte alignment assumed by netCDF. So, for example, the data
> >>>> block for fixed-size variables begins with 9 scalar integers that
> >>>> should take 4 bytes each. The offsets computed for these values =
> from
> >>>> the beginning of the fixed-size data block are 0, 4, 8, 12, 16, 20,
> >>>> 24, 28, 32, so there is no padding or wasted space. The offsets =
> from
> >>>> the beginning of the fixed-size data block that are actually stored =
> in the
> >>>> header for these variables are 0, 512, 1024, ... , 4096. If the =
> file
> >>>> system used to write the data originally could not write data on
> >>>> 4-byte boundaries, I think that violates the assumption of netCDF =
> and
> >>>> POSIX I/O. Nevertheless, if the nc_endef() call pays attention to =
> the
> >>>> file offsets for each variable that are stored in the header (as =
> the
> >>>> netCDF library does when reading the file), rather than computing =
> them
> >>>> from assuming 4-byte alignment, perhaps this file can be modified
> >>>> correctly.
> >>>>=20
> >>>> The function where we might be able to adapt to this is
> >>>> nc3internal.c:NC_begins(), which is called from
> >>>> nc3internal.c:NC_enddef().  In any case it's a netCDF bug to write
> >>>> something that can't be later read correctly, so if our unmodified
> >>>> library wrote that file and we can't adapt to it, then it was a bug
> >>>> to not emit an error message for trying to create a file on the =
> original
> >>>> non-POSIX file system.  Also, the data seems to all be there in the
> >>>> "corrupted" file, which can be fixed by just restoring the variable
> >>>> offsets in the file header to the peculiar values in the original =
> ...
> >>>>=20
> >>>> --Russ
> >>>>=20
> >>>> Russ Rew                                         UCAR Unidata =
> Program
> >>>> address@hidden                      =
> http://www.unidata.ucar.edu
> >>>>=20
> >>>>=20
> >>>>=20
> >>>> Ticket Details
> >>>> =3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D
> >>>> Ticket ID: KLB-596506
> >>>> Department: Support netCDF
> >>>> Priority: Normal
> >>>> Status: Closed
> >>>>=20
> >>>>=20
> >>>>=20
> >>>>=20
> >>>> --
> >>>> Jim Edwards
> >>>>=20
> >>>> CESM Software Engineering Group
> >>>> National Center for Atmospheric Research
> >>>> Boulder, CO
> >>>> 303-497-1842
> >>>>=20
> >>>>=20
> >>>>=20
> >>>> --
> >>>> Jim Edwards
> >>>>=20
> >>>> CESM Software Engineering Group
> >>>> National Center for Atmospheric Research
> >>>> Boulder, CO
> >>>> 303-497-1842
> >>>=20
> >>> --
> >>> Samuel Levis -
> >>> address@hidden
> >>>=20
> >>> National Center for Atmospheric Research
> >>> PO Box 3000, Boulder CO 80307-3000      <- use for mail
> >>> 3090 Center Green Dr., Boulder CO 80301 <- vs. shipping
> >>>=20
> >>> tel
> >>> 303 497-1627
> >>> ; fax -1348; skype: samuellevis2
> >>>=20
> >>> http://www.cgd.ucar.edu/tss
> >>>=20
> >>>=20
> >>> Terrestrial Sciences Section in the
> >>> Climate & Global Dynamics Division
> >>>=20
> >>>=20
> >>>=20
> >>>=20
> >>> --
> >>> Jim Edwards
> >>>=20
> >>> CESM Software Engineering Group
> >>> National Center for Atmospheric Research
> >>> Boulder, CO
> >>> 303-497-1842
> >>>=20
> >>>=20
> >>>=20
> >>> --
> >>> Jim Edwards
> >>>=20
> >>> CESM Software Engineering Group
> >>> National Center for Atmospheric Research
> >>> Boulder, CO
> >>> 303-497-1842
> >>=20
> >>=20
> > Russ Rew                                         UCAR Unidata Program
> > address@hidden                      http://www.unidata.ucar.edu
> >=20
> >=20
> >=20
> > Ticket Details
> > =3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D
> > Ticket ID: KLB-596506
> > Department: Support netCDF
> > Priority: Normal
> > Status: Closed<test_orig.cdl-trunc>