
[netCDF #TSI-527912]: nccopy advice - rechunking very large files



> Thanks Russ...
> I just finished running your suggested command in the previous
> message... the /usr/bin/time output was:
> 
> 1119.76user 221.73system 36:46.87elapsed 60%CPU (0avgtext+0avgdata
> 152676640maxresident)k
> 20743616inputs+76115608outputs (2290major+13402231minor)pagefaults 0swaps
> 
> 
> Not bad.. not bad at all...  compared to the 4+ days my previous
> attempt was taking;
> the machine had to be rebooted for maintenance and it never finished.
> 
> Our most powerful server has 48 GB of RAM, which is what we used here.

That's good to hear.  Originally I tried specifying the number of chunk cache
entries with "-e 102K" instead of "-e 102000", thinking nccopy would interpret
the K suffix as it does for the -m and -h options, but instead it just used
102.  It turns out it would have taken 545 days to complete the rechunking with
that number of cache entries, because it would have had to re-read and re-
uncompress the input chunks, as well as re-read and re-write the output chunks!
 
It was taking over 8 minutes to process each input record, with 98128 records 
...

I'll fix support for suffixes with -e in the next version, but be warned that
for now you have to give -e an explicit number.
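
For example, spelled out (the file names here are placeholders; the chunking
and cache options are the ones from the command in the earlier message):

  $ nccopy -c time/24532,x/16,y/12 -e 102000 -m 40M -h 10G -d0 in.nc4 out.nc4
  $ nccopy -c time/24532,x/16,y/12 -e 102K -m 40M -h 10G -d0 in.nc4 out.nc4

The second form looks equivalent, but at the moment the K suffix on -e is not
parsed, so the cache gets only 102 entries; the byte-size suffixes on -m and
-h work as expected.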

> Nearly all of my questions are satisfied; the note about how to figure out
> the [-e X] option was insightful, as I didn't understand how to do this.
> 
> My remaining question... This worked great for the 10 GB input, which
> produced a 39 GB output at -d0... how will our 48 GB RAM machine fare
> if I try a ~25 GB input?  Will I need to reduce the time chunk to 1/4th
> or more to work with these in a reasonable time?  I have no problem
> doing this, as you suggested.

Yes, although you may have to experiment with whether you can get by with
dividing the record dimension by 3, 4, 5, or 6.  It depends on how good the
compression is and how much memory you're really using.  It might be helpful
to use "top" or some such memory monitor to see how much memory nccopy
actually uses in that case, and adjust as needed so you aren't using all the
memory, which would probably mean ejecting chunk cache entries and slowing
things down.  Also, I haven't played much with the -m setting, so I'm not
sure whether setting it much larger or smaller would make any significant
difference.
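
For instance, a first try on the ~25 GB input might look something like the
following; the chunk length (98128 divided by 6, rounded up), the 30G cache,
and the file names are just placeholder guesses to adjust against what top
reports, not tested values:

  $ /usr/bin/time nccopy -c time/16355,x/16,y/12 -e 102000 -m 40M -h 30G -d0 \
      big-input.nc4 big-rechunked.nc4 &
  $ top -p $(pgrep -n nccopy)     # watch the RES column; keep it well under 48 GB

If the resident size climbs toward physical memory, reduce the time chunk
length (and the -h cache size with it) and try again.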

> An extraneous note: apparently the -d4 input file compression works less well
> with certain fields, resulting in anywhere from 7 to 25 GB at -d4.
> 
> The ultimate goal is access through THREDDS in a stressful environment,
> where it would certainly help to reduce disk accesses, and probably file
> size as well.  Now that they are rechunked, we will be testing them at
> various compression levels (-d0, -d5, -d9) to see what gives the best
> access performance.

Yes, the zlib compression is not necessarily the best thing to use for 
floating-point data, but the better szip compression has licensing problems,
so we don't include it.  I've also noticed that the higher zlib compression
levels don't necessarily yield noticeably smaller files.
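
If it helps, one straightforward way to run that comparison is to make one
copy per deflation level from the already-rechunked, uncompressed file and
compare sizes and timings (file names here are placeholders); since no -c
option is given, the chunking should carry over unchanged, so the differences
mostly reflect deflate cost at each level:

  $ for lev in 1 5 9; do /usr/bin/time nccopy -d$lev rechunked-d0.nc4 rechunked-d$lev.nc4; done
  $ ls -l rechunked-d*.nc4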

I'm planning a talk on this subject at the AMS meeting in Austin, and would
appreciate hearing about any experience you gain by turning the knobs.  Here's
the abstract of my talk:

  Making earth science data more accessible: experience with compression and 
chunking
  https://ams.confex.com/ams/93Annual/webprogram/Paper220240.html

--Russ

> On Tue, Oct 9, 2012 at 11:04 AM, Unidata netCDF Support
> <address@hidden> wrote:
> > Hi Dan,
> >
> > This is just a short followup on using nccopy to rechunk files.
> >
> > I'm assuming the goal is to allow fast access to all the data for a point
> > or small region for all 98128 times (each originally stored in a separate
> > chunk) without having to access 98128 distinct disk blocks.  This goal can
> > certainly be achieved by rechunking with data for all times in each chunk,
> > but that can require a lot of memory, because all the output chunks must be
> > kept in memory throughout the rechunking.
> >
> > If you can accept making only a few disk accesses instead of only one to get
> > data for all the times for a point or small region, then the rechunking can
> > be done faster and using a lot less memory.  For example, if you measure and
> > conclude that using only 4 disk accesses instead of 98128 suffices for the
> > use case you have in mind, then rechunking to chunks with length 98128/4 =
> > 24532 along the time axis means you only have to have enough memory for
> > 1/4 of the output file, and the rechunking can still be done in about 30 
> > minutes
> > on a desktop machine. For example, here's what it took on my Linux desktop,
> > reserving only 10 GB of memory for the chunk cache:
> >
> >   $ /usr/bin/time nccopy -ctime/24532,x/16,y/12 -e 102000 -m 40M -h 10G -d0 
> > tmp.nc4 tmp-rechunked.nc4
> >   1264.99user 175.39system 31:34.06elapsed 76%CPU (0avgtext+0avgdata 
> > 12299388maxresident)k
> > 18554864inputs+77738408outputs (22856major+12001463minor)pagefaults 0swaps
> >
> > Interactive access with 4 disk reads per query would probably seem just as
> > fast as with one disk access per query.  Similarly, accepting a number
> > larger than 4 might be a good compromise between access time and processing
> > time to rechunk the data ...
> >
> > --Russ
> >
> > Russ Rew                                         UCAR Unidata Program
> > address@hidden                      http://www.unidata.ucar.edu
> >
> >
> >
> > Ticket Details
> > ===================
> > Ticket ID: TSI-527912
> > Department: Support netCDF
> > Priority: Normal
> > Status: Closed
> >
> 
> 
> 
> --
> =======================================
> Dan Swank
> STG, Incorporated - Government Contractor
> NCDC-NOMADS Project:  Software & Data Management
> Data Access Branch
> National Climatic Data Center
> Veach-Baley Federal Building
> 151 Patton Avenue
> Asheville, NC 28801-5001
> Email: address@hidden
> Phone: 828-271-4007
> =======================================
> Any opinions expressed in this message are mine personally and do not
> necessarily reflect any position of STG Inc or NOAA.
> =======================================
> 
> 
Russ Rew                                         UCAR Unidata Program
address@hidden                      http://www.unidata.ucar.edu



Ticket Details
===================
Ticket ID: TSI-527912
Department: Support netCDF
Priority: Normal
Status: Closed