
[netCDF #TSI-527912]: nccopy advice - rechunking very large files

Hi Dan,

> I could use some advice.
> I am trying to rechunk about 30x or so 8-30 GB netcdf4 files
> for the North American Regional Reanalysis physical aggregations
> created from a wgrib2 convert process -- for eventual use on our
> THREDDS server.
> I am using source compiled binaries from netCDF
> The inputs are chunked as:
> chunkspec (t y x)
> 1, 277, 349
> Into a new file chunked to optimize read access to time series
> 98128,6,8
> These files are 1 parameter for 1 z-level, so z is excluded here.
> using the command:
> $ /san5102/netcdf4/nccopy -m 4000000000 -h 1000000000
> -ctime/98128,x/8,y/6
> /san5102/nexus/narr-physaggs/narr-TMP-850mb_221_yyyymmdd_hh00_000.grb.grb2.nc4
> /raid/nomads/testing/data/narraggs/narr-TMP-850mb_221_yyyymmdd_hh00_000.grb.grb2.nc4.ts
> Issue is, this is unreasonably slow.   At the beginning I will get a burst of
> about 350-500 KB/sec output (which is reasonable for the server hardware),
> then after a few minutes it falls to < 10 KB/sec
> ~ for a 10+ GB file, this will take more than 10 days
> just to rechunk one file.  Adjusting -m and -h options gives only
> a minor improvement, the initial write burst lasts longer, but still
> eventually floors to <10 KB/sec.
> Do you think this is the best way to optimize for a time
> series read access?  And what do you suggest to make
> the process finish in a reasonable time?    Are files of this
> size just too much?   The output format of the file doesn't
> matter to me as long as it's netcdf4 and max compression can be
> applied later.

How much memory can you dedicate to nccopy while it is rechunking the
data?  If you have enough memory, the -w option, which keeps the output
file in memory until it is closed, may speed things up significantly.
Since available memory can make a big difference in how long rechunking
takes, would it be possible to do the rechunking on a different system
with lots of memory, e.g. 64 GB?  Memory is pretty cheap compared to
programmer time these days, so I'm wondering if that's a possibility ...
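
To put rough numbers on the memory question, here is a minimal sketch
of the arithmetic.  It assumes 4-byte values and a variable shape of
98128 x 277 x 349 (time, y, x), both inferred from the chunk shapes in
the question rather than confirmed, and it assumes the -h value is the
chunk cache size in bytes.  The helper name chunk_stats is made up for
this illustration.

```python
import math

def chunk_stats(dims, chunks, value_bytes=4):
    """Return (bytes per chunk, number of chunks) for a chunked variable."""
    chunk_bytes = value_bytes * math.prod(chunks)
    # Edge chunks are counted whole, so round each axis up.
    n_chunks = math.prod(math.ceil(d / c) for d, c in zip(dims, chunks))
    return chunk_bytes, n_chunks

# Assumed variable shape (time, y, x); not confirmed by the question.
dims = (98128, 277, 349)
# Desired output chunk shape, taken from the question.
out_chunks = (98128, 6, 8)
# Value passed to -h in the question, assumed to be the chunk cache size.
cache_bytes = 1_000_000_000

chunk_bytes, n_chunks = chunk_stats(dims, out_chunks)
total_bytes = chunk_bytes * n_chunks
chunks_in_cache = cache_bytes // chunk_bytes

print(f"output chunk size: {chunk_bytes / 1e6:.1f} MB")
print(f"number of output chunks: {n_chunks}")
print(f"all output chunks together: {total_bytes / 1e9:.1f} GB")
print(f"output chunks that fit in the cache: {chunks_in_cache}")
```

Under these assumptions every input chunk (one full time step) touches
all 2068 output chunks, while a 1 GB cache holds only 53 of them, which
would be consistent with the slowdown described in the question.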

Another approach that might work is making more than one pass over the
data: first write an intermediate file whose chunk shape lies between
the current input chunking and the desired output chunking, then
rechunk that file to the final shape.
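
One possible way to pick such an intermediate chunk shape is to take
the geometric mean of the input and output chunk lengths along each
axis.  This is only an illustrative heuristic, not a recommendation
from the netCDF documentation, and the helper name intermediate_chunks
is hypothetical.

```python
import math

def intermediate_chunks(src, dst):
    """Per-axis geometric mean of two chunk shapes, rounded, minimum 1."""
    return tuple(max(1, round(math.sqrt(s * d))) for s, d in zip(src, dst))

# Chunk shapes (time, y, x) taken from the question.
src = (1, 277, 349)      # input chunking
dst = (98128, 6, 8)      # desired output chunking

mid = intermediate_chunks(src, dst)
mid_bytes = 4 * math.prod(mid)   # assumes 4-byte values

print("intermediate chunk shape (time, y, x):", mid)
print(f"intermediate chunk size: {mid_bytes / 1e6:.2f} MB")
```

The resulting shape could then be given to -c for the first pass, and
the original target shape for the second pass.  Whether two passes are
actually faster than one would need to be measured on the real files.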

This problem is very interesting to me, and I'd like to be able to
test approaches to optimizing access for time series using a real data
file rather than some artificial test data.  Could you either make
available one of those input files (but not as an email attachment!
:-) or tell me how to get one?  Especially when dealing with questions
that may ultimately involve compression as well as chunking, it's
important to deal with real-world data.

If that's not practical, I'd like to get the CDL from ncdump -h (or
-c) for the input netCDF file as well as CDL for the desired output,
so I know exactly what you're trying to do.


Russ Rew                                         UCAR Unidata Program
address@hidden                      http://www.unidata.ucar.edu

Ticket Details
Ticket ID: TSI-527912
Department: Support netCDF
Priority: Normal
Status: Closed
