Hi Brian,

> I am running the CESM model, and it is generating quite a lot of output
> data. So I used nccopy to compress the output files (deflate/compression
> level 1), which worked flawlessly and reduced the size to ~50%. However,
> the original format was "64-bit offset", and after compressing it is only
> possible to have "netCDF4 classic" or "netCDF4".
>
> I then realised that all my plotting routines slowed way down (factor
> 10+), and doing some testing I found out that it wasn't the compression
> (the slowdown occurred with the uncompressed netCDF4 format as well), but
> the change from "64-bit offset".
>
> So my questions are:
>
> 1: Are there any reasons that "64-bit offset" cannot be compressed?

Yes: compression requires chunking (also known as "multidimensional tiling"), which is supported only in the two netCDF-4 formats. The netCDF-4 formats use the HDF5 library and its storage layout to implement chunking and compression. Without chunking, it would be necessary to uncompress an entire file even to access only a small subset of the data. Each chunk is a separate unit of compression and uncompression, so only the chunks containing the data to be accessed need to be uncompressed.

> 2: Is there an inherent reason that "netCDF4" format is slower than
> "64-bit offset", or could it be something specific to my servers?

Accessing data in the HDF5-based netCDF-4 format can be either slower or faster than in the netCDF-3 classic or 64-bit offset formats, depending on the pattern of access and on how the data is stored in the HDF5 file. Specifically, if you use an unlimited (record) dimension or compression, the data must be chunked in the HDF5 file. Specifying chunk shapes and sizes that match the common patterns in which the data will be accessed is important for tuning the efficiency of access in HDF5 files.
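As a rough illustration of that balancing idea, here is a small sketch (my own illustration, not the algorithm from the blog posts or any netCDF API) of one way to pick a chunk shape for a 3D (time, lat, lon) variable so that neither pure time-series reads nor pure spatial-slice reads are heavily penalized:

```python
# Hypothetical helper (not part of netCDF): choose a chunk shape for an
# N-dimensional variable by chunking every dimension in the same
# proportion, so a read along any single dimension crosses a comparable
# share of the chunks.

def balanced_chunk_shape(dims, target_values_per_chunk):
    """dims: dimension lengths, e.g. (time, lat, lon).
    Returns a chunk shape whose product is close to target_values_per_chunk."""
    total = 1
    for n in dims:
        total *= n
    # Same fraction of each dimension's length goes into one chunk.
    fraction = (target_values_per_chunk / total) ** (1.0 / len(dims))
    return tuple(max(1, min(n, round(n * fraction))) for n in dims)

# A 1000 x 180 x 360 variable with ~65K values (~0.5 MB of doubles) per chunk:
print(balanced_chunk_shape((1000, 180, 360), 64800))  # (100, 18, 36)
```

Given a shape like that, nccopy can rewrite the file with explicit chunking, e.g. `nccopy -c time/100,lat/18,lon/36 in.nc out.nc` (the dimension names here are assumptions; substitute the actual names from your file).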
It's also important to provide a large enough chunk cache that chunks you access repeatedly stay in memory, avoiding an expensive disk read of a whole chunk when you need only a few values from it. If you aren't using an unlimited dimension in your output file, then each variable's data will be stored contiguously, which avoids any chunking-related overhead. If you're seeing slow access performance with contiguous data in netCDF-4 files, that's puzzling, and we'd like to know more details.

You can get information about storage (contiguous or chunked) and about chunk shapes from the output of "ncdump -h -s", which shows the special virtual attributes with names beginning with "_". Looking at that information may make it clear whether the problem you're seeing is caused by poor chunk shapes or sizes, which can happen with bad chunk shape defaults.

I've written two blog posts about chunking and its performance implications that might make these performance issues clearer:

http://www.unidata.ucar.edu/blogs/developer/en/entry/chunking_data_why_it_matters
http://www.unidata.ucar.edu/blogs/developer/en/entry/chunking_data_choosing_shapes

The first also has a reference to an HDF5 white paper providing some guidance on chunking and performance.

The nccopy utility can be used to "rechunk" data, or even to "dechunk" it by converting unlimited dimensions to fixed size and making the storage of every variable contiguous, but then you can't compress the data (compressed data must use chunking).

--Russ

Russ Rew
UCAR Unidata Program
address@hidden
http://www.unidata.ucar.edu

Ticket Details
===================
Ticket ID: LFV-605873
Department: Support netCDF
Priority: Normal
Status: Closed
NOTE: All email exchanges with Unidata User Support are recorded in the Unidata inquiry tracking system and then made publicly available through the web. If you do not want to have your interactions made available in this way, you must let us know in each email you send to us.