Re: znetcdf compression stuff

"blincoln" wrote:
>   I've recently been playing with the znetcdf compression
> patch under win95 with Watcom C.  The implementation
> (transparent with just a change of flags at create time)
> is quite nice, but I wonder if this development is a dead
> branch or if it will continue?
> 
> I had some trouble getting the znetcdf files to download
> properly (they are only available via http:// from a unix machine,
> as far as I can tell), but eventually was able to get them.
> 
> I followed the instructions and ran the install on a unix machine
> in order to get the directory and files set up properly.  I was unable
> to get the files to work on my win95 machine; it seems my version of
> 'patch' doesn't like something in the z.diff file.
> 
>   The compile of the zlib (the compression library this patch uses)
> went very smoothly.

It seems that win95/Internet Explorer has problems downloading files
that end in .tar.gz.  I have added a new link to the file, renamed with
a .tgz extension, which should solve the download problem.

If you would like to contribute your library, I will put a link to it
on the znetcdf page.  That would save other win95 users from duplicating
the effort you have made.

> 
>   After linking the new znetcdf.lib into my programs I found it to
> work, but I experienced a mysterious crash in one of my many test
> programs.  I can turn compression on and off and see differences
> in performance.  Here are my initial impressions:
> 
> 1) it runs much slower using compression.  Often half the
> speed or slower than with compression off.
> 

I would look at the order in which you are writing data into the files;
if you are not writing the data in increasing dimension order (i.e. from
the lowest major dimension to the highest), writes will take longer, as
the sketch below shows.  If you cannot change the write order, try
compressing the files after writing them.  Read access is generally
faster than write access.
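
To make the write-order point concrete, here is a sketch using the
standard netCDF-3 C calls.  The file name, dimension sizes, and variable
are made up, and I have left out znetcdf's create-time compression flag,
which would be added to the create mode:

    #include <stdio.h>
    #include <netcdf.h>

    #define NREC 10
    #define NLAT 64
    #define NLON 128

    int main(void)
    {
        int ncid, dimids[3], varid, status, rec;
        size_t start[3] = {0, 0, 0};
        size_t count[3] = {1, NLAT, NLON};   /* one full record per call */
        static float record[NLAT][NLON];

        /* With znetcdf the compression flag would be OR'd into the
         * create mode here. */
        status = nc_create("example.nc", NC_CLOBBER, &ncid);
        if (status != NC_NOERR) {
            fprintf(stderr, "%s\n", nc_strerror(status));
            return 1;
        }

        nc_def_dim(ncid, "time", NC_UNLIMITED, &dimids[0]);
        nc_def_dim(ncid, "lat", NLAT, &dimids[1]);
        nc_def_dim(ncid, "lon", NLON, &dimids[2]);
        nc_def_var(ncid, "temp", NC_FLOAT, 3, dimids, &varid);
        nc_enddef(ncid);

        /* Write whole records in increasing-record order so the file
         * fills front to back; writing out of order can force the
         * compression layer to re-read and re-pack blocks. */
        for (rec = 0; rec < NREC; rec++) {
            start[0] = rec;
            /* ... fill record[][] for this timestep ... */
            nc_put_vara_float(ncid, varid, start, count, &record[0][0]);
        }

        nc_close(ncid);
        return 0;
    }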

> 2) some of the files it produces are MUCH larger than the
> originals.  I think it may be more effective with certain types
> of data and with some data it loses its brain:
> 
>   2a) If I have a file with *only* attributes in it, no variables defined,
> the original file is 500 bytes or so and the 'compressed' file is 9K.
> I understand that this is a very rare case, but some of our statistics
> files can end up this way.

There is some overhead needed to build the data directory at the front
of the file.  By default it reserves space for 1024 blocks of data, 8k
each, which allows a file size of up to 8Megs.  The directory takes
2 * 1024 * 4 bytes (8k) by itself and is not compressed.  That explains
the increase from roughly 500 bytes to 9k.  In this case you can set the
buffer parameters to reserve space for a single block (or simply not
compress the file).  The nczip program can minimize the reserved space
automatically.
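
For concreteness, the directory overhead works out as below.  The
two-ints-per-entry layout is inferred from the 2 * 1024 * 4 figure
above, and the variable names are mine, not the patch's:

    #include <stdio.h>

    int main(void)
    {
        long nblocks    = 1024;          /* default directory entries */
        long block_size = 8 * 1024;      /* 8k of data per block */

        long max_file  = nblocks * block_size;   /* 8 Megs addressable */
        long dir_bytes = 2L * nblocks * 4;       /* 8k, never compressed */

        printf("max uncompressed file size: %ld bytes\n", max_file);
        printf("fixed directory overhead:   %ld bytes\n", dir_bytes);

        /* Reserving space for a single block drops the fixed overhead
         * to 8 bytes, which is why a 500-byte attributes-only file is
         * better off with minimal reserved space or no compression. */
        return 0;
    }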

> 
> 3) multiple records in a file seem to compress better than a single
> 'frame' or 'record' per file.  Some of our data is stored as single
> frames per file because the dimensions of the grid can sometimes change
> between frames; much of our data is not this way, but for the data
> which does not have multiple records for a variable the compression
> seems not very good.
> 

This is one case where compression really helps.  If you are changing
the dimensions of the grid only to save space, store all 'frames' with
the same dimensions instead and set the unused portions to 0 or
_FillValue; the repeated data will compress to near nothing.  The
programming for access is then greatly simplified, file management is
easier, and the compression overhead is also minimized.
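
A sketch of that padding approach with the netCDF-3 C API follows; the
grid sizes and names are illustrative, but _FillValue itself is the
standard netCDF attribute:

    #include <stddef.h>
    #include <netcdf.h>

    #define MAX_NLAT 256
    #define MAX_NLON 256

    /* Write one frame into a fixed-size grid, padding the unused cells
     * with the fill value so the long runs of identical values compress
     * to almost nothing. */
    int write_frame(int ncid, int varid, size_t rec,
                    const float *data, size_t nlat, size_t nlon)
    {
        static float grid[MAX_NLAT][MAX_NLON];
        size_t start[3] = {rec, 0, 0};
        size_t count[3] = {1, MAX_NLAT, MAX_NLON};
        size_t i, j;

        for (i = 0; i < MAX_NLAT; i++)
            for (j = 0; j < MAX_NLON; j++)
                grid[i][j] = (i < nlat && j < nlon)
                             ? data[i * nlon + j]
                             : NC_FILL_FLOAT;

        return nc_put_vara_float(ncid, varid, start, count, &grid[0][0]);
    }

In define mode you would also record the fill value as an attribute so
readers know what the padding means, e.g.

    float fill = NC_FILL_FLOAT;
    nc_put_att_float(ncid, varid, "_FillValue", NC_FLOAT, 1, &fill);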

> 4) for larger grids with repeated data and multiple timesteps (records
> along the NC_UNLIM dimension) it seems to work pretty good: some files
> are 1/10 their original size and many are half or less.

This is the type of usage we have, and we generally see 80-90%
compression.  We work with files 100k to 10Megs in size (uncompressed),
but basically I compress everything by default.

> 
>   I haven't played with setting the buffer parameters yet, but before
> I spend too much more time on this, I'd like to know who else is
> using this and whether there are any plans for further updates or 
> development.
> 

The Regional Climate Centers are using it extensively.  Maybe others will
chime in if they have found it useful (or not).

There are several improvements I would like to add:
        Integrate it with the 3.4 version of netcdf
        Post native binaries for different platforms
        Improve the backfill algorithm so adding data to the middle of a 
                compressed file is more efficient.
        Improve the utility routines.

> happy summer,
> 
> bcl
> blincoln@xxxxxxxxxx
> 

Bill Noon
Northeast Regional Climate Center
Cornell University

