
[netCDF #UAU-670796]: Rechunking of a huge NetCDF file



Hi Henri,

> I have a 200GB uncompressed NetCDF file with 5 variables (+lat,lon,time) of 
> ECMWF ERA-Interim data like this:
> 
> dimensions(sizes): lon(480), lat(241), time(99351)
> 
> I need to access all time instants of the data, one gridpoint at a time. 
> Unfortunately the data is organized inefficiently for this, and retrieving 
> one slice takes 10 minutes or so. I have tried to rechunk the data with this 
> command:
> 
> nccopy -w -c time/99351,lat/1,lon/1 all5.nc all5_T.nc
> 
> but the processing has taken 9 days already (I have allocated 1 CPU and 250GB 
> of memory to it). Is there some way to estimate how it’s doing and how long 
> this will take? I ran the same command with a test file of only 9 grid 
> points, and estimated that if the process scaled perfectly, the full data 
> would be finished in 2 days.
> 
> Alternatively, is there some smarter way to do this? I suppose I should have 
> done this in smaller pieces, but I’d hate to kill the process now if it’s 
> close to finishing.

You might want to read these blog posts, if you haven't already:

  http://www.unidata.ucar.edu/blogs/developer/en/entry/chunking_data_why_it_matters
  http://www.unidata.ucar.edu/blogs/developer/en/entry/chunking_data_choosing_shapes

You haven't mentioned whether the 200 GB source file is in netCDF-4
classic model format with compression.  That might make a difference,
as you may be spending an enormous amount of time uncompressing the
same source chunks over and over again, due to using a small chunk
cache.
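
If you're not sure, "ncdump -hs" will print just the header plus the
special virtual attributes (_Storage, _ChunkSizes, _DeflateLevel, and
so on) that show how each variable is stored:

  ncdump -hs all5.nc

If _DeflateLevel appears on your variables, the source is compressed,
and the chunk cache matters even more.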

Even if the source data is not compressed, you probably need to
specify use of a chunk cache to make sure the same source data doesn't
need to be reread from the disk repeatedly for each of the 480x241
points.  And I would advise using a different shape for the output
chunks, something more like time/10000,lat/10,lon/20 so that you can
get the data for one point with 10 disk accesses instead of 1,
probably still fast enough.  Also, such a shape would store data for
200 adjacent points together in 1 chunk, so if it's cached, nearby
queries will be very fast after the first.
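
For example, something along these lines (just a sketch: the output
name is a placeholder, and the 4 GB chunk cache is only an
illustrative value, well within your 250 GB allocation; if your
nccopy doesn't accept the "g" size suffix, give the size in bytes
instead):

  nccopy -h 4g -c time/10000,lat/10,lon/20 all5.nc all5_rechunked.nc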

I would also advise just giving up on the current nccopy run, which
may well take a year to finish!  Spend a little time experimenting
with some of the advanced nccopy options, such as -w, -m, -h, and -e,
which could make a significant difference in rechunking time:

  http://www.unidata.ucar.edu/netcdf/docs/nccopy-man-1.html

What works best is platform-specific, but you may be able to get
something close to optimum by timing with smaller examples.  I'd be
interested in knowing what turns out to be practical!
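
For instance, taking the 9-gridpoint test file you mentioned (call it
test9.nc here, a stand-in name), you could time a few option
combinations and compare, adjusting the -c shape so the lat/lon chunk
lengths don't exceed the test file's extents; the buffer and cache
sizes below are only illustrative:

  time nccopy -w -c time/10000,lat/1,lon/1 test9.nc out_w.nc
  time nccopy -m 1g -h 4g -e 10000 -c time/10000,lat/1,lon/1 test9.nc out_cache.nc

Whichever combination wins on the small file is a reasonable starting
point for the full 200 GB copy.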

--Russ

Russ Rew                                         UCAR Unidata Program
address@hidden                      http://www.unidata.ucar.edu



Ticket Details
===================
Ticket ID: UAU-670796
Department: Support netCDF
Priority: Normal
Status: Closed