
[no subject]



> Organization: Oklahoma Mesonet
> Keywords: 199403092201.AA25413

Hi Sridhar,

>       I am working with the Oklahoma Mesonet as a programmer.
> I am currently working on the archival of Mesonet data using NetCDF.
> Kindly suggest a compression routine that is best suited for NetCDF files.
> The routine should be a platform-independent one, as the compressed files will
> be shared with other users. We are currently using a VMS operating system.
> 
>       Is gzip from GNU a good routine? I would like your opinion.

gzip is a good general-purpose compression program, if nothing is known
about the data.  gzip does well for text files or images, but it only seems
to do well for binary files containing numeric values if there are a lot of
close or identical values.  I just tried gzip on a few large netCDF files of
floating point values from model outputs, and it compressed the files pretty
well, eliminating from 35% to 68% of the bytes needed for the data.  However,
if you try gzip on a file of random floating point data, you may get no
compression at all.
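
To make the contrast concrete, here is a minimal sketch in Python (my choice
of language for illustration, not something used in the tests above).  It
gzips two synthetic arrays of 32-bit floats: a smooth field rounded to 0.1,
and uniform random values.  The data, the array size, and the helper names
`float32_bytes' and `compressed_fraction' are all assumptions made up for
this example, and the fractions it prints will differ from the model-output
figures quoted above.

```python
import gzip
import math
import random
import struct

def float32_bytes(values):
    """Serialize values as big-endian 32-bit floats."""
    return struct.pack(f">{len(values)}f", *values)

def compressed_fraction(raw):
    """Size of the gzip-compressed data as a fraction of the original size."""
    return len(gzip.compress(raw)) / len(raw)

n = 20000
rng = random.Random(0)  # fixed seed so runs are repeatable

# Smooth synthetic field rounded to 0.1: many close or identical neighbours.
smooth = [round(1000.0 + 20.0 * math.sin(i / 2000.0), 1) for i in range(n)]
# Uniform random values: the mantissa bits contain no repeated patterns.
noisy = [rng.random() for _ in range(n)]

smooth_fraction = compressed_fraction(float32_bytes(smooth))
noisy_fraction = compressed_fraction(float32_bytes(noisy))
print(f"smooth field: compressed to {smooth_fraction:.0%} of original size")
print(f"random field: compressed to {noisy_fraction:.0%} of original size")
```

The smooth field should shrink to a small fraction of its original size,
while the random field should stay close to full size.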

Another approach to compression is to pack low-precision floating point values
into small integers, using 8 or 16 bits for what would otherwise require 32
bits as floating point.

I don't know of any better general-purpose compression programs than gzip.
In case you want to know more about the packing approach, I've appended a
reply I sent earlier to the netcdfgroup mailing list on this subject.

__________________________________________________________________________
                      
Russ Rew                                              UCAR Unidata Program
address@hidden                                        P.O. Box 3000
(303)497-8645                                 Boulder, Colorado 80307-3000


>       I am trying to port our weather model's output to netCDF format.
> The user guide mentions that though netCDF is not a good archiving format
> it's possible to pack data while using netCDF. Could you please elaborate
> on that? A small run typically generates about 25M of data, and so we
> are looking into machine independent packing, using byte for a few arrays
> and int for most of them.

One way to do this is to pack floating-point numbers into ncbyte or ncshort
values and use the conventional netCDF attributes `scale_factor' and
`add_offset' to store the packing parameters, as described in the User's
Guide:


    `scale_factor'
         If present for a variable, the data are to be multiplied by this
         factor after the data are read by the application that accesses
         the data.

    `add_offset'
         If present for a variable, this number is to be added to the data
         after it is read by the application that accesses the data.  If
         both `scale_factor' and `add_offset' attributes are present, the
         data are first scaled before the offset is added.  The attributes
         `scale_factor' and `add_offset' can be used together to provide
         simple data compression to store low-resolution floating-point
         data as small integers in a netCDF file.  When scaled data are
         written, the application should first subtract the offset and then
         divide by the scale factor.

         When `scale_factor' and `add_offset' are used for packing, the
         associated variable (containing the packed data) is typically of
         type byte or short, whereas the unpacked values are intended to
         be of type float or double.  The attributes `scale_factor' and
         `add_offset' should both be of the type intended for the unpacked
         data, e.g. float or double.

The netCDF library doesn't treat these attributes in any special way, so you
have to use their values for packing before you write values and unpacking
after you read values.  As an example, if you want to pack floating-point
values between 950 and 1050 into 8-bit bytes for a program variable named
`x' that is to be stored into a netCDF variable named `x_packed', the
structure of the netCDF file might include a data specification like the
following:

    variables:
        ...
        byte x_packed(n);
                x_packed:scale_factor = 0.3937;
                x_packed:add_offset = 950;
                x_packed:_FillValue = 255;
         ...

where we just use the minimum value, 950, for the offset to keep all packed
values positive, and we compute the scale factor by using

        scale_factor = (Max - Min)/(2^Nbits - 2)
                     = (1050 - 950) / (256-2)
                     = 0.39370079
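
The same arithmetic can be written as a small Python sketch.  The function
name `packing_params' is hypothetical, made up for this illustration; it is
not part of the netCDF library.  It only evaluates the formula above,
leaving the largest code free for a missing value.

```python
def packing_params(vmin, vmax, nbits):
    """Return (scale_factor, add_offset) for packing [vmin, vmax] into nbits.

    Codes 0 .. 2**nbits - 2 hold data; the largest code, 2**nbits - 1,
    is left free so it can mark a missing value.
    """
    scale_factor = (vmax - vmin) / (2**nbits - 2)
    add_offset = vmin  # the minimum maps to code 0, so all codes are positive
    return scale_factor, add_offset

for nbits in (8, 16):
    scale_factor, add_offset = packing_params(950.0, 1050.0, nbits)
    print(f"Nbits={nbits}: scale_factor={scale_factor:.8f}, add_offset={add_offset}")
```

The scale factor is also the spacing between representable values, so the
16-bit case gives a much finer resolution over the same range.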

Now, before you store the value x, you pack it with the formula below,
rounding the result to the nearest integer:

        x_packed = (x - add_offset) / scale_factor

and you store the byte value x_packed (which will be between 0 and 254)
instead.  You can use the byte value 255 for a missing value.

Similarly, when you read the data back in, you can unpack it using the
formula:

        x = x_packed*scale_factor + add_offset
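
As a rough check of the round trip, here is a sketch in plain Python, with
no netCDF calls.  The helpers `pack' and `unpack' and the sample values are
hypothetical.  It assumes unsigned bytes, rounding to the nearest code, the
byte value 255 for missing data, and unpacking as the exact inverse of the
packing formula (scale first, then add the offset).

```python
from array import array

FILL = 255  # byte value reserved for missing data, as in the example above

def pack(values, scale_factor, add_offset):
    """Pack floats (None marks a missing value) into unsigned bytes 0..254."""
    packed = array("B")
    for x in values:
        if x is None:
            packed.append(FILL)
        else:
            # Values outside [Min, Max] would raise OverflowError here.
            packed.append(round((x - add_offset) / scale_factor))
    return packed

def unpack(packed, scale_factor, add_offset):
    """Invert pack(): scale first, then add the offset."""
    return [None if p == FILL else p * scale_factor + add_offset for p in packed]

scale_factor = (1050.0 - 950.0) / (2**8 - 2)
add_offset = 950.0
x = [950.0, 987.3, None, 1013.25, 1050.0]
x_packed = pack(x, scale_factor, add_offset)
x_restored = unpack(x_packed, scale_factor, add_offset)
for original, restored in zip(x, x_restored):
    print(original, restored)
```

With rounding to the nearest code, each restored value should differ from
the original by no more than half the scale factor.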

If you need more than 8 bits of precision but you still want to store each
value as one netCDF value, you will have to use 16-bit shorts, and then the
formula above will use Nbits = 16 instead of Nbits = 8.

If you are using C, you may have to declare x_packed to be an `unsigned
char' to get these formulas to work out, or change the formulas to assume
signed values.  In Fortran there are no unsigned integers, so change the
formulas to use signed integers instead.
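
One possible signed variant, sketched in Python, is shown below.  Using the
midpoint of the range as the offset and reserving the most negative code
for missing values are assumptions of this sketch, not a netCDF convention,
and the names `pack_signed' and `unpack_signed' are hypothetical.

```python
from array import array

NBITS = 8
FILL = -(2 ** (NBITS - 1))        # -128, reserved here for missing values
vmin, vmax = 950.0, 1050.0
scale_factor = (vmax - vmin) / (2**NBITS - 2)
add_offset = (vmax + vmin) / 2.0  # midpoint, so packed codes span -127..127

def pack_signed(x):
    """Map x in [vmin, vmax] to a signed code in -127..127."""
    return round((x - add_offset) / scale_factor)

def unpack_signed(p):
    """Invert pack_signed(): scale first, then add the offset."""
    return p * scale_factor + add_offset

# array type "b" holds signed 8-bit integers, like a signed netCDF byte.
packed = array("b", [pack_signed(v) for v in (vmin, 1000.0, vmax)])
print(list(packed))
print([unpack_signed(p) for p in packed])
```

The scale factor is the same as in the unsigned case; only the offset and
the missing-value code change.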

There are other techniques for accessing packed netCDF data (using the units
attribute to encode packing information, packing values into a bland array
of bytes with some other packing technique and storing the technique name
as a variable attribute, etc.) but the one I've outlined above is probably
the simplest.

----------------------------------------------------------------------------
Russell K. Rew                                          UCAR Unidata Program
address@hidden                                          P.O. Box 3000
                                                      Boulder, CO 80307-3000
----------------------------------------------------------------------------