
Re: netCDF internal storage



Hi Keith,

> > We don't have an explanation for this.  I just checked with Glenn Davis,
> > who implemented it, and he agrees that it should take essentially the
> > same amount of time to access a value at the beginning of each record as
> > at the end.  If you have a small test case that demonstrates otherwise,
> > we'd be interested, because it would either indicate a bug or a behavior
> > of lseek() we don't understand.  The offset for each record variable
> > within a record is calculated once and stored in the header with other
> > information about that variable, so the only difference between the two
> > situations should be adding a zero offset vs. adding a nonzero offset to
> > the record offset before the seek.
> 
> We've been running more tests on a variety of file sizes and can no
> longer duplicate this phenomenon.  We're definitely at a loss to
> explain this.  We thought a passage in the netCDF manual might be
> referring to the ordering effect we saw (on p. 138):
> 
> "The order in which the data in the fixed-size data part and in each
> record appears is the same as the order in which the variables were
> defined, in increasing numerical order by netCDF variable ID.  This
> knowledge can sometimes be used to enhance data access performance,
> since the best data access is currently achieved by reading or writing
> the data in sequential order."
>
> But, now we're not sure why we saw an ordering performance difference
> before, but none now.  What does this passage really mean?

It's referring to the fact that if you read the data in the same order it's
written on the disk, you can take advantage of the read-ahead done by
systems like UNIX: when you read a disk block, the system reads the next
block into a memory buffer for you, so it can be accessed quickly if you
are reading sequentially.  Also, the netCDF library doesn't make
unnecessary lseek() calls if it notices it's already at the right offset in
the file for a requested read, which is the case when you read the data in
the same order in which it was written.

On the other hand, if you were to read the variables in the reverse of the
order in which they're written, you would end up doing an lseek() call
before each read, flushing the read-ahead cache, and not getting any benefit
from the read-ahead buffers provided by the operating system.

Whether these differences are actually significant depends on lots of
things, including the record size and the data type.

> > Our understanding is that there should be no need for creating smaller
> > files, and that caching the header should be enough to get the performance
> > you want.  We'd be interested in hearing about your progress in diagnosing
> > this problem, especially if it indicates a problem with netCDF performance
> > that we can't currently explain.
> 
> We prefer using big files as it reduces the number to open and keep
> track of.  We'll keep you posted if we figure out any more on what is
> happening with the juke box.  Thanks.

One other thought I had was that you might be linking to the HDF/netCDF
software (-lmfhdf) from NCSA rather than the Unidata XDR-based netCDF
library (-lnetcdf).  The mfhdf library uses a completely different I/O
implementation, and might explain the performance differences you are
seeing.
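One way to see which library a program was built against is to look at the
link line, or to scan an existing executable for characteristic symbols (a
hypothetical sketch; "model" and "model.o" stand in for your own program):

```shell
# Hypothetical link lines for the two libraries:
cc -o model model.o -lnetcdf          # Unidata XDR-based netCDF
cc -o model model.o -lmfhdf -ldf      # NCSA HDF/netCDF (mfhdf)

# On an already-built executable, nm can hint at which one you got:
nm model | grep -i -e xdr -e hdf
```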

--Russ