
[netCDF #UAU-670796]: Rechunking of a huge NetCDF file



Hi Henri,

> I have a 200GB uncompressed NetCDF file with 5 variables (+lat,lon,time) of 
> ECMWF ERA-Interim data like this:
> 
> dimensions(sizes): lon(480), lat(241), time(99351)
> 
> I need to access all time instants of the data, one gridpoint at a time. 
> Unfortunately the data is organized inefficiently for this, and retrieving 
> one slice takes 10 minutes or so. I have tried to rechunk the data with this 
> command:
> 
> nccopy -w -c time/99351,lat/1,lon/1 all5.nc all5_T.nc
> 
> but the processing has taken 9 days already (I have allocated 1 CPU and 250GB 
> of memory to it). Is there some way to estimate how it's doing and how long
> this will take? I ran the same command with a test file of only 9 grid 
> points, and estimated that if the process scaled perfectly, the full data 
> would be finished in 2 days.
> 
> Alternatively, is there some smarter way to do this? I suppose I should have 
> done this in smaller pieces, but I'd hate to kill the process now if it's
> close to finishing.

You might want to read these blog posts, if you haven't already:

  http://www.unidata.ucar.edu/blogs/developer/en/entry/chunking_data_why_it_matters
  http://www.unidata.ucar.edu/blogs/developer/en/entry/chunking_data_choosing_shapes

You haven't mentioned whether the 200 GB source file uses the netCDF-4
classic model format with compression.  That might make a difference, as
you may be spending an enormous amount of time uncompressing the same
source chunks over and over again because of a too-small chunk cache.
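
You can check the format variant and any existing per-variable chunking
and compression settings with ncdump, for example (using the all5.nc name
from your nccopy command):

  ncdump -k all5.nc       # reports the format, e.g. "netCDF-4 classic model"
  ncdump -s -h all5.nc    # header only, with the special _Storage, _ChunkSizes,
                          # and _DeflateLevel attributes shown for each variable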

Even if the source data is not compressed, you probably need to specify
a chunk cache large enough that the same source data doesn't have to be
reread from disk repeatedly for each of the 480x241 points.  And I would
advise using a different shape for the output chunks, something more like
time/10000,lat/10,lon/20, so that you can get the data for one point with
10 disk accesses instead of 1, probably still fast enough.  Also, such a
shape stores the data for 200 adjacent points together in one chunk, so
once a chunk is cached, nearby queries will be very fast after the first.
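
To put rough numbers on that suggestion: a time/10000,lat/10,lon/20 chunk
holds 10000 x 10 x 20 = 2,000,000 values, which is about 8 MB per chunk if
the variables are stored as 4-byte floats (an assumption on my part), and
one point's full time series of 99351 values spans ceil(99351/10000) = 10
such chunks.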

I would also advise just giving up on the current nccopy run, which may
well take a year to finish!  Spend a little time experimenting with some
of the advanced nccopy options, such as -w, -m, -h, and -e, which could
make a significant difference in rechunking time:

  http://www.unidata.ucar.edu/netcdf/docs/nccopy-man-1.html
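
For example, an illustrative combination might look something like this
(the -m, -h, and -e values are only guesses to be tuned for your system:
-m sets the copy buffer size, -h the chunk cache size in bytes, and -e the
number of chunks the cache can hold):

  nccopy -w -m 4G -h 16G -e 20000 -c time/10000,lat/10,lon/20 all5.nc all5_T.nc

Timing a command like this on your small test file first should give a
quick sense of which settings actually help.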

What works best is platform-specific, but you may be able to get
something close to optimum by timing with smaller examples.  I'd be
interested in knowing what turns out to be practical!

--Russ

Russ Rew                                         UCAR Unidata Program
address@hidden                      http://www.unidata.ucar.edu



Ticket Details
===================
Ticket ID: UAU-670796
Department: Support netCDF
Priority: Normal
Status: Closed


NOTE: All email exchanges with Unidata User Support are recorded in the Unidata inquiry tracking system and then made publicly available through the web. If you do not want to have your interactions made available in this way, you must let us know in each email you send to us.