[netCDF #TSI-527912]: nccopy advice - rechunking very large files



Hi Dan,

This is just a short followup on using nccopy to rechunk files.

I'm assuming the goal is to allow fast access to all the data for a point
or small region for all 98128 times (each originally stored in a separate
chunk) without having to access 98128 distinct disk blocks.  This goal can
certainly be achieved by rechunking with data for all times in each chunk,
but that can require a lot of memory, because all the output chunks must be
kept in memory throughout the rechunking.

If you can accept a few disk accesses, rather than exactly one, to get data
for all the times for a point or small region, then the rechunking can be
done faster and with a lot less memory.  For example, if you measure and
conclude that 4 disk accesses instead of 98128 suffice for the use case you
have in mind, then rechunking to chunks of length 98128/4 = 24532 along the
time axis means you only need enough memory for 1/4 of the output file, and
the rechunking can still be done in about 30 minutes on a desktop machine.
Here's what it took on my Linux desktop, reserving only 10 GB of memory for
the chunk cache:

  $ /usr/bin/time nccopy -ctime/24532,x/16,y/12 -e 102000 -m 40M -h 10G -d0 tmp.nc4 tmp-rechunked.nc4
  1264.99user 175.39system 31:34.06elapsed 76%CPU (0avgtext+0avgdata 12299388maxresident)k
  18554864inputs+77738408outputs (22856major+12001463minor)pagefaults 0swaps

Interactive access with 4 disk reads per query would probably seem just as
fast as with one disk access per query.  Similarly, accepting a number larger
than 4 might be a good compromise between access time and the processing time
needed to rechunk the data ...
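
To compare a few candidate values before committing to a long nccopy run,
the arithmetic above can be sketched in a few lines of Python.  This is only
a rough illustration: the function names are made up for this example, the
only number taken from your file is the 98128 time steps, and the "share of
the output in memory" column just extends the 1/4 estimate above to other
read counts, which is an assumption rather than something I measured.

```python
import math

N_TIMES = 98128  # length of the time dimension, from the message above


def time_chunk_length(n_times, max_reads):
    """Smallest time-chunk length that covers n_times in at most max_reads chunks."""
    return math.ceil(n_times / max_reads)


def reads_per_point(n_times, chunk_len):
    """Number of chunks (disk accesses) touched when reading every time at one point."""
    return math.ceil(n_times / chunk_len)


# Hypothetical candidates; 4 is the case worked through in the message.
for max_reads in (1, 4, 8, 16):
    chunk_len = time_chunk_length(N_TIMES, max_reads)
    # Assumption: memory needed scales with chunk_len / N_TIMES, as in the 1/4 estimate.
    fraction = chunk_len / N_TIMES
    print(f"{max_reads:2d} reads: time chunk = {chunk_len:5d}, "
          f"about {fraction:.1%} of the output in memory")
```

The chunk length printed for a given row is the value that would go after
"time/" in the -c option; actual run time and memory use still need to be
measured on your own machine.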

--Russ

Russ Rew                                         UCAR Unidata Program
address@hidden                      http://www.unidata.ucar.edu



Ticket Details
===================
Ticket ID: TSI-527912
Department: Support netCDF
Priority: Normal
Status: Closed