
[python #HBO-649201]: IO bounds with large netCDF



Nathan,

You're generally correct about chunking. It *is* possible, though, to use 
nccopy to rewrite your file with different chunking options.
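For example, something along these lines rewrites the file with explicit 
chunk sizes (the dimension names "time", "y", and "x" here are assumptions; 
substitute whatever ncdump reports for your variable):

nccopy -c time/64,y/64,x/64 my_netcdf_file.nc rechunked.nc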

While your two access patterns differ, the blog posts I linked discuss how 
naive contiguous chunking along one dimension (say, time) can produce 
pathological, worst-case performance for access along the other dimensions. 
You can instead use a chunking strategy that, while not optimal for any 
single access pattern, does not penalize either of yours excessively. Think 
of it as making sure that, no matter how you're striding through the data, 
each read pulls in more than one useful grid point.
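As a rough sketch (assuming 4-byte floats and, again, hypothetical dimension 
names), near-cubic chunks split the cost between both of your patterns:

# 101x101x101 chunks of 4-byte floats are ~4 MB each. A horizontal
# slice touches ceil(2502/101) * ceil(5852/101) = 25 * 58 = 1450 chunks,
# and a full time series at one point touches ceil(8784/101) = 87 chunks,
# so neither pattern degenerates to one useful value per read.
nccopy -c time/101,y/101,x/101 my_netcdf_file.nc balanced.nc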

If you run:

ncdump -sh my_netcdf_file.nc

you can see what the _ChunkSizes attribute says is being used for your data.
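In the output, the relevant lines look something like this (the variable 
name and chunk sizes below are purely illustrative):

    float precip(time, y, x) ;
        precip:_Storage = "chunked" ;
        precip:_ChunkSizes = 1, 2502, 5852 ;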

Ryan


> I had not yet used the C library directly, though I started down that
> route earlier this week. As for chunking, I have not done many tests with
> it. My understanding is that you should chunk to match your file access
> pattern. In my case, however, I have two data processing steps with two
> different access patterns (one over horizontal slices and the other along
> the time dimension at each point). My understanding is that once you set
> the file up with a particular chunk pattern, you cannot change it to
> access the data more efficiently in a different way. I'll take another
> look at the chunking documentation. If I am misunderstanding something,
> let me know.
> 
> Thanks,
> 
> address@hidden wrote:
> 
> > Nathan,
> >
> > Are you only doing this using Python? Or have you tried the netCDF-C
> > library as well?
> >
> > I'm wondering if there are some issues with how the data are chunked at
> > play:
> >
> > https://www.unidata.ucar.edu/blogs/developer/entry/chunking_data_why_it_matters
> > https://www.unidata.ucar.edu/blogs/developer/entry/chunking_data_choosing_shapes
> > https://www.unidata.ucar.edu/software/netcdf/docs/netcdf_perf_chunking.html
> >
> > Ryan
> >
> > > Ryan,
> > >
> > > Patrick Marsh here at SPC has pointed me in your direction to ask
> > > about an issue I am having. I have a netCDF file with a variable that
> > > is 8784 x 2502 x 5852 in size. I have been using packages like
> > > h5netcdf to take advantage of parallel IO. The calculations I do
> > > basically take 2D arrays from each level of the first dimension and
> > > also take 1D arrays at each point for all time (again, the first
> > > axis). What I am running into is that my analysis is IO bound. I have
> > > attempted to alleviate some of this by making temporary arrays that
> > > store a chunk of data before writing the results back to disk. This
> > > helps some, but it would still take a day or more to process. Are you
> > > aware of anything I could look into to try and make this more
> > > efficient, or is there someone else I should ask about this? I
> > > appreciate whatever advice you might have.
> 
> 
> --
> Nathan Wendt
> Meteorologist, CIMMS/SPC Research Associate
> 2330 National Weather Center
> 120 David L Boren Blvd.
> Norman, OK 73072
> address@hidden
> 
> 

Ticket Details
===================
Ticket ID: HBO-649201
Department: Support Python
Priority: Low
Status: Closed
===================
NOTE: All email exchanges with Unidata User Support are recorded in the Unidata 
inquiry tracking system and then made publicly available through the web.  If 
you do not want to have your interactions made available in this way, you must 
let us know in each email you send to us.