[netCDF #UAU-670796]: Rechunking of a huge NetCDF file



> Hi Russ,
> 
> >> I did make some interesting observations. I had previously overlooked the 
> >> “-u” flag (its documentation is somewhat confusing…?). The time 
> >> dimension has been unlimited in my files. On my MacBook Air:
> >>
> >> nccopy -w -c time/99351,lat/1,lon/1 small.nc test1.nc  11.59s user 0.07s 
> >> system 99% cpu 11.723 total
> >>
> >> nccopy -u small.nc small_u.nc
> >>
> >> nccopy -w -c time/99351,lon/1,lat/1 small_u.nc test2.nc  0.07s user 0.04s 
> >> system 84% cpu 0.127 total
> >>
> >> That’s amazing!
> >
> > It's because we use the same default chunk length of 1 as HDF5
> > does for unlimited dimensions.  But when you use -u, it makes
> > all dimensions fixed, and then the default chunk length is larger.
> 
> But both small.nc and small_u.nc are classic netCDF files. So HDF5 
> shouldn't be involved at all…?

Oops, you're right, it has nothing to do with HDF5, but everything to do 
with the format for record variables in netCDF classic format files:

  http://www.unidata.ucar.edu/netcdf/workshops/2012/performance/ClassicPerf.html

Accessing all the data from a record variable along the unlimited 
dimension can require one disk access per record, because the records 
of all record variables are interleaved on disk.  Using the contiguous 
storage of fixed-size variables instead accesses the same data very 
efficiently.  On the other hand, if you need to access data from 
multiple variables one record at a time, record variables can be the 
best layout.
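
For concreteness, here's a rough sketch of the access pattern that 
hits the worst case, a full time series at a single grid point.  It 
assumes the netCDF4-python module, and the variable name "tas" is a 
placeholder, since the thread doesn't name your variable:

    import time
    from netCDF4 import Dataset

    def time_series_seconds(path, var="tas"):
        # Read every record at one grid point: with a record variable
        # this touches one interleaved record per value; with the
        # fixed-size layout it is close to one sequential read.
        with Dataset(path) as ds:
            t0 = time.perf_counter()
            ds.variables[var][:, 0, 0]
            return time.perf_counter() - t0

    print("record layout:     %.3f s" % time_series_seconds("small.nc"))
    print("fixed-size layout: %.3f s" % time_series_seconds("small_u.nc"))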

> >> However, when I ran a similar test with a bigger (11GB) subset of my 
> >> actual data, this time on a cluster (under SLURM), there was no difference 
> >> between the two files. Maybe my small.nc is simply too small to reveal 
> >> actual differences and everything is hidden behind overheads?
> >
> > That's possible, but you also need to take cache effects into account.
> > Sometimes when you run a timing test, a small file is read into memory
> > buffers, and subsequent timings are faster because the data is just
> > read from memory instead of disk, and similarly for writing.  With 11GB
> > files, you might not see any in-memory caching, because the system disk
> > caches aren't large enough to hold the file, or even consecutive chunks
> > of a variable.
> 
> My non-Python timings were naively from the “time” command, which runs 
> the command just once. So I don’t think there can be any cache effects here.

Hmmm, I'm not sure how to explain that.  Note, though, that even a 
single run under "time" can benefit from caching, if an earlier 
command has left the file in the operating system's page cache.
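
One way to check is to repeat the identical read a few times in one 
process and see whether the later passes get faster.  A rough sketch, 
again assuming netCDF4-python and a placeholder variable name:

    import time
    from netCDF4 import Dataset

    # If passes 2 and 3 are much faster than pass 1, the file is being
    # served from the OS page cache rather than from disk.
    for i in range(3):
        t0 = time.perf_counter()
        with Dataset("small.nc") as ds:
            ds.variables["tas"][:]      # "tas" is a placeholder name
        print("pass %d: %.3f s" % (i + 1, time.perf_counter() - t0))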

> I’m not sure what I did differently previously with the 11GB test file (maybe 
> a cluster with hundreds of users is not the best for performance comparison). 
> Anyways, I do think that the -u flag solved my problem. I got fed up with 
> queuing for resources on the cluster and decided to go with a normal desktop 
> machine with 16GB of memory. So I stripped a single variable from the huge 
> file and did the -u operation on the resulting 43GB file, and then ran this:
> 
> nccopy -m 10G -c time/10000,lat/10,lon/10 shortwave_u.nc shortwave_u_T10.nc
> 
> It took only 15 minutes! Without the -u operation the command processed only 
> a few GB in 1 hour (after which I cancelled it).
> 
> 2/5 variables done now. If no further technical problems arise, I should have 
> the data ready for their actual purpose tomorrow. :)

Excellent, and your experience may be useful to other users.  I'll add use
of "-u" to my future performance advice.

> Thank you for your help! I will acknowledge you/Unidata in my paper (any 
> preference?).

Feel free to acknowledge me.  Thanks!

--Russ

> - Henri
> 
Russ Rew                                         UCAR Unidata Program
address@hidden                      http://www.unidata.ucar.edu



Ticket Details
===================
Ticket ID: UAU-670796
Department: Support netCDF
Priority: High
Status: Closed