
[netCDF #UAU-670796]: Rechunking of a huge NetCDF file



Hi Henri,

> Thanks, your blog posts were quite enlightening. However, I didn’t
> understand what was wrong with my command
> 
> nccopy -w -c time/99351,lat/1,lon/1 all5.nc all5_T.nc
> 
> I did have the -w flag, which, according to the manual, makes
> playing with the -m, -h and -e options futile. I allocated 250GB of
> memory, and I assume nccopy would have crashed if memory ran out.

Sorry, I overlooked the "-w" flag on your command.  You're right: if
you have enough memory, using -w is better than adjusting the -m, -h,
and -e options, at least in my experience.

However, your example may be a good test case for the "-r" flag as
well as the "-w" flag, if you have enough memory for both an input and
an output copy of the file.  That would read the entire file into
memory, build the transposed version in memory, and then write it out
when the file is closed.
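
For example (just an illustration, not something I've timed on a file
this size), combining the two flags with your original chunking spec
would look like

  nccopy -r -w -c time/99351,lat/1,lon/1 all5.nc all5_T.nc

or the same thing with the time/10000,lat/10,lon/20 shape I suggested
earlier.  With -r the input is read into memory up front, and with -w
the output stays in memory until the file is closed, so the rechunking
itself shouldn't involve any disk seeks.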

Your example is an important use case for nccopy to do well, so I'd
like to see if we can help make it work in a practical amount of time.
For that, nccopy needs one more feature (we're running out of flags!)
to provide verbose output that lets you monitor progress and see how
long it is taking.  I've just added this to our issue-tracking system,
but I'm afraid it won't get implemented right away:

  https://bugtracking.unidata.ucar.edu/browse/NCF-285

For now, I suggest just adding a print statement to the
copy_var_data() function in the ncdump/nccopy.c source.

The print statement in copy_var_data() could look something like this,
in the context of the surrounding code in that function (start[0] is a
size_t, hence the cast for printing):

        NC_CHECK(nc_get_vara(igrp, varid, start, count, buf));
        NC_CHECK(nc_put_vara(ogrp, ovarid, start, count, buf));
        printf("%s start[0]=%lu\n", varname,
               (unsigned long)start[0]);   /* for logging progress */

That lets you watch progress by the first index of each variable.
Just noting how much time elapses between printed lines would make it
possible to estimate how long the whole copy will take to finish.
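
If you'd rather not sit and watch the terminal, one possible variant
(just a sketch on my part, using only the standard C time() and
ctime() functions, so it needs an extra #include <time.h>) adds a
wall-clock timestamp to the same print statement:

        NC_CHECK(nc_get_vara(igrp, varid, start, count, buf));
        NC_CHECK(nc_put_vara(ogrp, ovarid, start, count, buf));
        {
            time_t now = time(NULL);     /* wall-clock time of this step */
            /* ctime() already ends its string with a newline */
            printf("%s start[0]=%lu at %s", varname,
                   (unsigned long)start[0], ctime(&now));
        }

The difference between consecutive timestamps, multiplied by the number
of steps remaining for that variable, gives a rough estimate of the
time left.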

If the input file uses the unlimited (record) dimension and the record
variables are also big variables, you will have to add a similar
printf() call to the copy_record_data() function in nccopy.c:

            NC_CHECK(copy_rec_var_data(ncid, ogrp, irec, varid, ovarid,
                                       start[ivar], count[ivar], buf[ivar]));
            NC_CHECK(nc_inq_varname(ncid, varid, varname)); /* for logging progress */
            printf("%s start[0]=%lu\n", varname,
                   (unsigned long)start[ivar][0]);          /* for logging progress */

You might have to add #include <stdio.h> up at the top of
ncdump/nccopy.c for the printf() function to be available.  After
those modifications, running "make all" should create a new nccopy
executable, and "make install" should install it.

> The original data is in classic NetCDF3 format.
> 
> As per your suggestion, I tried different chunk sizes as well, but
> it seems nccopy crashes whenever either of the lat/lon chunk lengths
> is bigger than 3:
> 
> nccopy -c time/10000,lat/2,lon/4 input.nc output.nc
> NetCDF: HDF error
> Location: file nccopy.c; line 1437
> 
> netcdf library version 4.3.0 of Jan 17 2014 13:23:48
> 
> The same happens on 3 different platforms (same NetCDF version), and
> with netcdf library version 4.2.1.1.

That looks like a new bug; we'll need a way to reproduce it before we
can fix it.  The fact that it says "HDF error" is puzzling; maybe
there's an HDF rule about chunk lengths that I'm not aware of.  Again,
this is an important use case, so we should do whatever is required to
fix it.

> I just came across the ncpdq utility, but apparently it’s unrelated
> to chunking and doesn’t provide huge performance benefits?

It's also unrelated to Unidata: it's a utility that's part of the NCO
(netCDF Operators) package developed and maintained by Charlie Zender
and his group at the University of California, Irvine.  It could
provide the same performance benefit, and it might be a better way to
go, as it only has to fit each variable in memory rather than all the
variables at once.  If you choose to try it for this problem, I'd be
very interested in knowing the results!
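
For what it's worth, an ncpdq invocation for this sort of reordering
might look something like the following (untested on my part, and the
output file name is just for illustration); its -a option permutes the
dimensions so that time varies fastest, which is the layout your
one-point time-series access wants:

  ncpdq -a lat,lon,time all5.nc all5_reordered.nc

You'd want to check the NCO documentation for how it handles the
unlimited time dimension once it's no longer the first dimension.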

--Russ

> Thanks!
> 
> - Henri
> 
> 
> On 28 Jan 2014, at 19:02, Unidata netCDF Support <address@hidden> wrote:
> 
> > Hi Henri,
> >
> >> I have a 200GB uncompressed NetCDF file with 5 variables (+lat,lon,time) 
> >> of ECMWF ERA-Interim data like this:
> >>
> >> dimensions(sizes): lon(480), lat(241), time(99351)
> >>
> >> I need to access all time instants of the data, one gridpoint at a time. 
> >> Unfortunately the data is organized inefficiently for this, and retrieving 
> >> one slice takes 10 minutes or so. I have tried to rechunk the data with 
> >> this command:
> >>
> >> nccopy -w -c time/99351,lat/1,lon/1 all5.nc all5_T.nc
> >>
> >> but the processing has taken 9 days already (I have allocated 1 CPU and 
> >> 250GB of memory to it). Is there some way to estimate how it’s doing and 
> >> how long this will take? I ran the same command with a test file of only 9 
> >> grid points, and estimated that if the process scaled perfectly, the full 
> >> data would be finished in 2 days.
> >>
> >> Alternatively, is there some smarter way to do this? I suppose I should 
> >> have done this in smaller pieces, but I’d hate to kill the process now if 
> >> it’s close to finishing.
> >
> > You might want to read these blog posts, if you haven't already:
> >
> >  http://www.unidata.ucar.edu/blogs/developer/en/entry/chunking_data_why_it_matters
> >  http://www.unidata.ucar.edu/blogs/developer/en/entry/chunking_data_choosing_shapes
> >
> > You haven't mentioned whether the 200 GB source file is a netCDF-4
> > classic model format using compression.  That might make a difference,
> > as you may be spending an enormous amount of time uncompressing the
> > same source chunks over and over again, due to using a small chunk
> > cache.
> >
> > Even if the source data is not compressed, you probably need to
> > specify use of a chunk cache to make sure the same source data doesn't
> > need to be reread from the disk repeatedly for each of the 480x241
> > points.  And I would advise using a different shape for the output
> > chunks, something more like time/10000,lat/10,lon/20 so that you can
> > get the data for one point with 10 disk accesses instead of 1,
> > probably still fast enough.  Also, such a shape would store data for
> > 200 adjacent points together in 1 chunk, so if it's cached, nearby
> > queries will be very fast after the first.
> >
> > I would also advise just giving up on the current nccopy, which may
> > well take a year to finish!  Spend a little time experimenting with
> > using some of the advanced nccopy options, such as -w, -m, -h, and -e,
> > which could make a significant difference in rechunking time:
> >
> >  http://www.unidata.ucar.edu/netcdf/docs/nccopy-man-1.html
> >
> > What works best is platform-specific, but you may be able to get
> > something close to optimum by timing with smaller examples.  I'd be
> > interested in knowing what turns out to be practical!
> >
> > --Russ
> >
> > Russ Rew                                         UCAR Unidata Program
> > address@hidden                      http://www.unidata.ucar.edu
> >
> >
> >
> > Ticket Details
> > ===================
> > Ticket ID: UAU-670796
> > Department: Support netCDF
> > Priority: Normal
> > Status: Closed
> >
> 
> 
Russ Rew                                         UCAR Unidata Program
address@hidden                      http://www.unidata.ucar.edu



Ticket Details
===================
Ticket ID: UAU-670796
Department: Support netCDF
Priority: Normal
Status: Closed