
[netCDF #IDT-559068]: Efficiency of reading HDF with netcdf 4



Benno,

> MODIS is kept in 317 tiles per timestep, so to make 1 global image, I need
> to read 1 tile each from 317 different files, a common structure for GIS
> community data (there are 8 or so tiles per file corresponding to different
> variables).
> netcdf4 is reading all the metadata on open, which means in this case that
> all the disk blocks are touched (my guess is that at least some of the
> metadata is spread amongst the data blocks).  I am not using the metadata at
> all (they are tiles, when I read the tiles I already know the structure from
> a previous analysis/read of the metadata), so this is a bit of a waste, and
> very slow if the metadata is spread throughout the file, as it seems to be.

I'm surprised that every disk block is read.  That's not typical for netCDF-4
files, which use HDF5 as their storage layer.  Usually there is only a small
amount of metadata, comprising
 
 - the name and size of each dimension
 - the name, type, and values for each attribute
 - the name, type, and shape for each variable
 - the association information linking variables and attributes
 - other variable properties, such as compression level and layout
 - information about group names and group links
 - information about definitions of user-defined types
 - the B-trees of chunks for each chunked variable

Although this information is scattered around the file, it doesn't involve every
disk block.  Each variable chunk is typically one or more disk blocks that are
entirely data, perhaps compressed.  Compressed data is not uncompressed until
it is read, and reading a single tile should not access any of the other tiles,
as long as a tile corresponds to one or more whole chunks.
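
For illustration, here is a minimal sketch of a single-tile read through the
netCDF C API.  The file name, variable name, and tile dimensions are just
placeholders for your actual layout; the point is that reading one variable
should only touch the chunks belonging to that variable, not the other tiles
stored in the same file.

/* Minimal sketch (hypothetical names): read one tile, stored here as a
 * single 2-D variable, from one netCDF-4/HDF5 file.  Only the chunks
 * belonging to "tile_var" should be read and, if compressed, uncompressed. */
#include <stdio.h>
#include <stdlib.h>
#include <netcdf.h>

#define CHECK(e) do { int _s = (e); if (_s != NC_NOERR) { \
    fprintf(stderr, "netCDF error: %s\n", nc_strerror(_s)); exit(1); } } while (0)

int main(void)
{
    int ncid, varid;
    const size_t ny = 1200, nx = 1200;   /* assumed tile size */
    float *tile = malloc(ny * nx * sizeof *tile);

    CHECK(nc_open("MODIS_tile_h08v05.nc", NC_NOWRITE, &ncid)); /* hypothetical file */
    CHECK(nc_inq_varid(ncid, "tile_var", &varid));             /* hypothetical variable */
    CHECK(nc_get_var_float(ncid, varid, tile));   /* reads only this variable's chunks */

    printf("first value: %g\n", tile[0]);
    free(tile);
    CHECK(nc_close(ncid));
    return 0;
}

Note that the cost you are seeing is in nc_open itself, which reads all the
file metadata up front; the per-variable read above is already limited to the
chunks it needs.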

> Since I am reading the entire tile once for a given variable (just not all
> the different variables), chunking within the file does not really matter,
> and reuse does not happen.

Right, chunking is irrelevant in that case.

> Aggregation like this is a common use case for netcdf -- not necessarily
> common yet for tiles, but certainly common in time.  So would you consider
> improving the performance in this case where the metadata is not read for
> use along with the data?

I've discussed this previously with Ed Hartnett, who implemented the
HDF5 reading code.  Apparently the change would require fairly
extensive modifications to the current code.  However, I've created a
Jira issue for it, so it's on our list to investigate, and you can
follow or comment on the issue here:

  https://www.unidata.ucar.edu/jira/browse/NCF-132

For now, you would probably be better off reading the MODIS data
through the HDF5 library than through the netCDF API.  If you try
that, I'd be interested in how much the performance improves.
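
In case it helps, here is a rough sketch of reading the same tile directly
with the HDF5 C API (again, the file and dataset names are placeholders).
H5Fopen does not need to scan all the file metadata the way nc_open does;
HDF5 reads the object header and chunk index only for the dataset you
actually open.

/* Minimal sketch (hypothetical names): read one tile directly through the
 * HDF5 C API, skipping the netCDF layer's full metadata read on open. */
#include <stdio.h>
#include <stdlib.h>
#include <hdf5.h>

int main(void)
{
    const size_t ny = 1200, nx = 1200;   /* assumed tile size */
    float *tile = malloc(ny * nx * sizeof *tile);

    hid_t file = H5Fopen("MODIS_tile_h08v05.nc", H5F_ACC_RDONLY, H5P_DEFAULT);
    if (file < 0) { fprintf(stderr, "H5Fopen failed\n"); return 1; }

    /* Open just the one dataset; only its object header and chunk
     * B-tree need to be read. */
    hid_t dset = H5Dopen2(file, "/tile_var", H5P_DEFAULT);   /* hypothetical dataset */
    if (dset < 0) { fprintf(stderr, "H5Dopen2 failed\n"); return 1; }

    if (H5Dread(dset, H5T_NATIVE_FLOAT, H5S_ALL, H5S_ALL, H5P_DEFAULT, tile) < 0) {
        fprintf(stderr, "H5Dread failed\n");
        return 1;
    }

    printf("first value: %g\n", tile[0]);
    free(tile);
    H5Dclose(dset);
    H5Fclose(file);
    return 0;
}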

--Russ

Russ Rew                                         UCAR Unidata Program
address@hidden                      http://www.unidata.ucar.edu



Ticket Details
===================
Ticket ID: IDT-559068
Department: Support netCDF
Priority: Normal
Status: Closed