Hi Mike,

> Organization: adnet systems
> Package Version: 4.3
> Operating System: linux
> Hardware:
> Description of problem: Hi, I have a question about nccopy. I try to
> use it to convert a classic netCDF file to netCDF-4 to use version
> 4's compression feature.
> This is the command I use:
>
> nccopy -k 3 -d 6
> dmspf18_ssusi_edr-auroral_disk_2013315T234753-2013316T012838-REV20972_vA0106r001.nc
> dmspf18_ssusi_edr-auroral_disk_2013315T234753-2013316T012838-REV20972_vA0106r001_v4.nc
>
> But when I check the file sizes, the version 4 file is even larger
> than the original. Am I doing something wrong?

Not necessarily, but you may need to use another nccopy option to
convert the unlimited dimension in the netCDF classic file (if you
have one) to fixed size to get actual compression.  The "-u" option to
nccopy does this, and it can be used either in a separate nccopy step
or combined with -k 3 -d 6; which is faster may depend on your data.
(By the way, you can probably use -k 4 instead of -k 3 to get a
classic model netCDF-4 output file, or just leave off the -k option,
since nccopy can infer that the output has to be a netCDF-4 file from
your compression option.)

Also, if your variables are not large, it could be that the extra
netCDF-4 storage overhead is more than any savings from compression.

Another possibility is that the default chunk shapes used for your
data are interfering with good compression.  You can see the chunk
shapes with "ncdump -h -s", where the "-s" option shows the virtual
special attributes, such as chunk shapes (there is a sketch of what
that output looks like below).

Finally, if your data is very noisy or random, compression will not be
possible, because the compression algorithm used is lossless and
cannot compress data that appears to be random.

I'll write a Unidata Developers Blog entry on compression in the next
month or two that will show how this works with examples, but in the
meantime, you might look at this extract of two responses I sent to an
earlier question on compression.  Eventually I hope to turn this into
guidance in the documentation:

I just looked at the files quickly to see what's going on, and
verified the results you reported: gzip of the whole raw file provides
8.4 to 1 compression, and gzip of the whole proc file yields about 2.8
to 1 compression.  However, using nccopy with -d1 does poorly (and is
very slow), making the files larger by factors of about 7.0 and 6.7,
respectively.

I think use of the unlimited time dimension is the root of the
problem, because it means that each variable is divided into chunks
for compression, with one record per chunk, and what you are seeing is
the HDF5 space overhead for storing lots of tiny chunks, each
supposedly compressed.  Two solutions come to mind:

  1. If you don't need the unlimited dimension any more, perhaps
     because no more data will be appended to the files, then convert
     the unlimited dimension into a fixed-size dimension, resulting in
     all the values of each variable being stored contiguously, which
     should be more compressible.

  2. If you still need the unlimited dimension, then rechunk the data
     before compressing it, so the compression can work on larger
     chunks.

The nccopy utility can be used for both of these approaches.
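Before applying either one, it can help to look at the chunking a
netCDF-4 copy actually ended up with, using the "ncdump -h -s" command
mentioned above.  In this sketch the file name, variable name,
dimensions, and attribute values are purely illustrative, not taken
from your data:

  $ ncdump -h -s example.nc        # "example.nc" is a placeholder name
  ...
        float radiance(time, scan, pixel) ;  // hypothetical variable
                radiance:_Storage = "chunked" ;
                radiance:_ChunkSizes = 1, 42, 288 ;
                radiance:_DeflateLevel = 6 ;
  ...

A _ChunkSizes value of 1 along the unlimited dimension is the telltale
sign of the one-record-per-chunk layout described above.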
For approach 1:

  $ nccopy -u proc.nc proc-u.nc        # makes unlimited dimension fixed size
  $ nccopy -d9 proc-u.nc proc-u-d9.nc  # compresses result, 2.8 to 1

and similarly for the raw.nc file.

For approach 2:

  $ nccopy -c time/3157 proc.nc proc-c.nc  # chunks time dimension
  $ nccopy -d9 proc-c.nc proc-c-d9.nc      # compresses result, 2.8 to 1

Both of these achieve the same modest amount of compression, which
isn't as good as gzip because

  - each chunk of each variable is separately compressed, whereas gzip
    compresses the whole file as a single chunk of data
  - the file metadata in the header is not compressed, only the data
  - the HDF5 overhead is a larger portion of the file for relatively
    small files like these

If your variables were a lot larger, or you had fewer variables per
file, or the variables were multidimensional, nccopy might be able to
achieve better compression.  But the benefit of compressing each chunk
of each variable separately is that you can read a small amount of
data out of the file without uncompressing the whole file.  Only the
compressed chunks of the desired variable need to be uncompressed.

I duplicated the poor results you saw when I didn't use either the
"-u" flag to fix the size of the unlimited dimension or the
"-c time/3157" arguments to set the chunk length to something better
than the default of 1 used for unlimited dimensions.  So maybe there's
nothing wrong with your build, unless you've tried those arguments and
still get poor compression.

I also verified that you don't have to do each of these approaches in
two separate nccopy calls using an intermediate file, as in my
examples.  Each of them can be done with just one nccopy call using
the options from the separate calls, and get the same compression:

  $ nccopy -u -d9 proc.nc proc-u-d9.nc
  $ nccopy -c time/3157 -d9 proc.nc proc-c-d9.nc
  $ ls -l proc.nc proc-u-d9.nc proc-c-d9.nc
  -rw-rw-r-- 1 russ ustaff 3843892 Jul 15 15:31 proc.nc
  -rw-rw-r-- 1 russ ustaff 1355552 Jul 15 19:36 proc-u-d9.nc
  -rw-rw-r-- 1 russ ustaff 1355552 Jul 15 19:36 proc-c-d9.nc

Also, the -k4 is not needed, as nccopy can figure out the type of the
output file.

--Russ

> Thanks,
>
> Mike

Russ Rew
UCAR Unidata Program
address@hidden
http://www.unidata.ucar.edu


Ticket Details
===================
Ticket ID: IIW-512338
Department: Support netCDF
Priority: Normal
Status: Closed
NOTE: All email exchanges with Unidata User Support are recorded in the Unidata inquiry tracking system and then made publicly available through the web. If you do not want to have your interactions made available in this way, you must let us know in each email you send to us.