Hi Dan,

> I could use some advice.
>
> I am trying to rechunk 30 or so 8-30 GB netCDF-4 files for the North
> American Regional Reanalysis physical aggregations, created by a wgrib2
> conversion process, for eventual use on our THREDDS server.
>
> I am using source-compiled binaries from netCDF 4.2.1.1.
>
> The inputs are chunked as:
>
>   chunkspec (t y x)
>   1, 277, 349
>
> and I am copying them into new files chunked to optimize read access to
> time series:
>
>   98128, 6, 8
>
> These files contain 1 parameter at 1 z-level, so z is excluded here.
>
> I am using the command:
>
> $ /san5102/netcdf4/nccopy -m 4000000000 -h 1000000000 \
>     -c time/98128,x/8,y/6 \
>     /san5102/nexus/narr-physaggs/narr-TMP-850mb_221_yyyymmdd_hh00_000.grb.grb2.nc4 \
>     /raid/nomads/testing/data/narraggs/narr-TMP-850mb_221_yyyymmdd_hh00_000.grb.grb2.nc4.ts
>
> The issue is that this is unreasonably slow. At the beginning I get a
> burst of about 350-500 KB/sec output (which is reasonable for the server
> hardware), but after a few minutes it falls to under 10 KB/sec. At that
> rate, a 10+ GB file will take more than 10 days to rechunk. Adjusting the
> -m and -h options gives only a minor improvement: the initial write burst
> lasts longer, but throughput still eventually drops below 10 KB/sec.
>
> Do you think this is the best way to optimize for time-series read
> access? What do you suggest to make the process finish in a reasonable
> time? Are files of this size just too much? The output format doesn't
> matter to me as long as it's netCDF-4 and maximum compression can be
> applied later.

How much memory do you have that you can dedicate to nccopy when it is
rechunking the data?  If you have enough memory, use of the -w option may
speed things up significantly (a rough sketch of the modified command is
appended after this message).

Since available memory can make a big difference in how long rechunking
takes, is a possible solution just doing the rechunking on a different
system with lots of memory, e.g. 64 GB?  Memory is pretty cheap compared
to programmer time these days, so I'm wondering if that's a possibility ...

Another approach that might work is to use more than one pass over the
data, writing an intermediate file whose chunking lies somewhere between
the current input and the desired output (also sketched after this
message).

This problem is very interesting to me, and I'd like to be able to test
approaches to optimizing access for time series using a real data file
rather than some artificial test data.  Could you either make one of those
input files available (but not as an email attachment! :-) or tell me how
to get one?  Especially when dealing with questions that may ultimately
involve compression as well as chunking, it's important to work with
real-world data.

If that's not practical, I'd like to get the CDL from ncdump -h (or -c)
for the input netCDF file as well as CDL for the desired output, so I know
exactly what you're trying to do.

--Russ

Russ Rew                                         UCAR Unidata Program
address@hidden                                   http://www.unidata.ucar.edu

Ticket Details
===================
Ticket ID: TSI-527912
Department: Support netCDF
Priority: Normal
Status: Closed
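As a rough sketch of the -w suggestion above (assuming the nccopy built
from netCDF 4.2.1.1 supports the -w option, which keeps the output file in
memory until it is closed), the command might become:

  $ /san5102/netcdf4/nccopy -w -c time/98128,x/8,y/6 \
      /san5102/nexus/narr-physaggs/narr-TMP-850mb_221_yyyymmdd_hh00_000.grb.grb2.nc4 \
      /raid/nomads/testing/data/narraggs/narr-TMP-850mb_221_yyyymmdd_hh00_000.grb.grb2.nc4.ts

Because -w assembles the whole output file in memory and only flushes it to
disk on close, the machine needs enough free memory to hold the entire
8-30 GB output, which is why a large-memory (e.g. 64 GB) system is
mentioned above.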
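A minimal sketch of the two-pass idea follows. The intermediate chunk shape
and the file names intermediate.nc4 and output.nc4.ts are illustrative
placeholders, not a tested recommendation; the intermediate shape is chosen
roughly midway (geometrically) between the input chunks (1, 277, 349) and
the target chunks (98128, 6, 8):

  # Pass 1: rechunk from (1, 277, 349) to an intermediate shape
  $ nccopy -c time/300,x/50,y/40 input.nc4 intermediate.nc4

  # Pass 2: rechunk the intermediate file to the final time-series shape
  $ nccopy -c time/98128,x/8,y/6 intermediate.nc4 output.nc4.ts

The idea is that each pass only has to gather data from chunks whose shape
is closer to what it is writing, so the chunk cache stays effective instead
of the same input chunks being reread many times for each output chunk.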