
[Fwd: Re: valid_min, valid_max, scaled, and missing values]




-------- Original Message --------
Subject: Re: valid_min, valid_max, scaled, and missing values
Date: Fri, 23 Feb 2001 17:02:09 -0700
From: John Caron <address@hidden>
Organization: UCAR/Unidata
To: address@hidden
References: <address@hidden>

One problem is that the statement that the "attribute type should
match the data type" does not specify whether the comparison is with
the packed or the unpacked type. This is a little more obvious when
seen through the Java API, which has only "Numbers" as attribute data
types. By contrast, "is in the units of the packed data" is
unambiguous.

One could try to check the attribute type and decide based on that,
but there's no way to find that information through the Java API, I
think. Perhaps we should change that?

Russ Rew wrote:
> 
> John,
> 
> First, the GDT conventions at
> 
>  http://www-pcmdi.llnl.gov/drach/GDT_convention.html
> 
> say:
> 
>   In cases where the data variable is packed via the scale_factor and
>   add_offset attributes (section 32), the missing_value attribute
>   matches the type of and should be compared with the data after
>   unpacking.
> 
> Whereas the CDC conventions at
> 
>  http://www.cdc.noaa.gov/cdc/conventions/cdc_netcdf_standard.shtml
> 
> say
> 
>  ... missing_value has the (possibly packed) data value data type.
> 
> Here's what Harvey had to say to netcdfgroup about valid_min and
> valid_max or valid_range applying to the external packed values rather
> than the internal unpacked values:
> 
>  http://www.unidata.ucar.edu/glimpse/netcdfgroup-list/1174
> 
> implying that the missing_value or _FillValue attributes should be in
> the units of the packed rather than the unpacked data.
> 
> And Harvey said (in http://www.unidata.ucar.edu/glimpse/netcdfgroup-list/1095)
> 
>   Yet I have encountered far too many netCDF files which contravene
>   Section 8.1 in some way.  For example, we are currently processing
>   the NCEP data set from NCAR.  An extract follows. It is obvious that
>   a great deal of effort has gone into preparing this data with lots
>   of metadata and (standard and non-standard) attributes, etc.  But it
>   is also obvious that there cannot be any valid data because the
>   valid minimum (87000) is greater than the maximum short (32767)!
>   And Section 8.1 states that the type of valid_range should match
>   that of the parent variable i.e. should be a short not a float.
>   Obviously the values given are unscaled external data values rather
>   than internal scaled values.
> 
>             short slp(time, lat, lon) ;
>                   slp:long_name = "4xDaily Sea Level Pressure" ;
>                   slp:valid_range = 87000.f, 115000.f ;
>                   slp:actual_range = 92860.f, 111360.f ;
>                   slp:units = "Pascals" ;
>                   slp:add_offset = 119765.f ;
>                   slp:scale_factor = 1.f ;
>                   slp:missing_value = 32766s ;
>                   slp:precision = 0s ;


It's clear that missing_value is packed and valid_range is unpacked.
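
To spell out the arithmetic: with scale_factor = 1 and add_offset =
119765, unpacked = packed * 1 + 119765, so the valid_range endpoints
correspond to packed shorts of

    ( 87000 - 119765) / 1 = -32765
    (115000 - 119765) / 1 =  -4765

both of which fit in a short, while missing_value = 32766 is a packed
short that unpacks to 32766 + 119765 = 152531, well outside the
valid_range.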


> 
>   It would be useful to have a utility which checked netCDF files for
>   conformance to these conventions.  It could also provide other data
>   for checking validity such as counting the number of valid and
>   invalid data elements.
> 
>   I guess I have to take some of the blame.  I was one of the authors
>   of NUGC and I was largely responsible for rewriting Section 8.1 last
>   year while I was working at Unidata.  I tried to make it clearer and
>   simpler.  In particular, I tried to simplify the relationship
>   between valid_range, valid_min, valid_max, _FillValue and
>   missing_value.  But it seems that we have failed to make the current
>   conventions sufficiently clear and simple.
> 
> In
> 
>  http://www.unidata.ucar.edu/glimpse/netcdfgroup-list/1079
> 
> here's what John Sheldon of GFDL had to say about whether the missing
> value should be in units of the packed or unpacked data:
> 
>   - Section 32: Missing values in a data variable
> 
>     I think that the data should be checked against the "missing_value"
>     *before* unpacking.  First, I think there is already a pretty strong
>     convention that "missing_value" be of the same type as the data.
>     Second, some packages simply display the packed values, and they
>     wouldn't be able to detect missing values. Third, I've been burned
>     and confused often enough by varying machine precision to be quite
>     shy of comparing computed values.
> 
>     However, handling missing values when unpacking packed data does
>     present a real problem!  Imagine a subroutine which unpacks, say,
>     SHORT values into a FLOAT array.  This routine will be able to
>     reliably detect missing values, but what value is it to put in the
>     FLOAT array?  We solve this by storing a global FLOAT attribute
>     which specifies this number.  If a file has no such attribute, we
>     stuff a default value in it.  In any case, we inform the user of
>     what was used.
> 
> but Jonathan Gregory replied
> 
>     > Section 32: Missing values in a data variable
>     >
>     > > I think that the data should be checked against the
>     > > "missing_value" *before* unpacking. [JS]
>     >
>     > Yes, you may well be correct. Thanks.
> 
>     The problem then becomes: what will you put in the array of
>     unpacked data if you find a missing value in the packed data?  We
>     store a global attribute to hold this value (say, -1.E30).  In the
>     absence of this global attribute, we simply stuff in a fill-value,
>     which is OK, but you lose the distinction between intentionally
>     and unintentionally missing data.  In any case, we tell the
>     calling routine what float values we used in both cases.
> 
> So there evidently was no consensus on this issue, only differing
> opinions.  Since we have to pick one, I think I favor having the
> missing value be in the packed units.
> 
> --Russ

I'm thinking VariableStandardized should look at the attribute type
and choose based on that: if it's an integral type, use packed; if
floating point, use unpacked. We should also promulgate another
convention, so that users can do either in a standard way. It's clear
some users prefer specifying in unpacked units.
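
Roughly like this, adding to the AttTypeCheck sketch above (isMissing
and unpack are illustrative names, not the actual VariableStandardized
interface):

    // Proposed rule as a sketch: an integral missing_value attribute
    // is compared in packed units; a floating-point one is compared
    // after unpacking.
    static double unpack(short packed, double scaleFactor, double addOffset) {
        return packed * scaleFactor + addOffset;
    }

    static boolean isMissing(short packed, Number missingValue,
                             double scaleFactor, double addOffset) {
        if (storedAsIntegral(missingValue)) {
            // integer attribute: compare in packed units
            return packed == missingValue.longValue();
        } else {
            // floating-point attribute: compare in unpacked units
            // (note: exact equality on floats is fragile, per John
            // Sheldon's precision caveat above)
            return unpack(packed, scaleFactor, addOffset)
                    == missingValue.doubleValue();
        }
    }

On the slp extract above, missing_value = 32766s comes back as a Short,
so the packed comparison applies; the float-valued valid_range would go
down the unpacked branch.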

PS: Do you mind if I send this to netcdf-support so we have a record?