
Re: netCDF internal storage



> Organization: NOAA/CDC
> Keywords: 199503172126.AA22872

Hi Keith,

> I wanted to check with you about an issue pertaining to the "internal
> workings" of netCDF.  I've been wondering about how truly
> "random-access" netCDF is behind its API.  I'll give you some
> background on what we're doing here to explain why I'm asking you
> about this.
> 
> At CDC we're creating a set of large (some over 500 MB) netCDF files
> on an optical jukebox.  When a user accesses a file, it is copied to a
> cache on magnetic disk, where it remains until displaced by other
> files.  If the user accesses the same file while it is in cache, the
> optical platter doesn't have to be accessed, which is much faster.  The
> software we bought for caching supposedly caches 'records' instead of
> whole files.  That way it isn't necessary for entire files to get
> copied to the cache to access a portion of them.  Clearly whole files
> take longer to copy and displace more files already in the cache.
> 
> Our tests indicate that entire files are indeed being copied to cache,
> even when only a subset is requested.  We are investigating the
> possibility that the jukebox software is not working correctly and is
> the cause of this problem.  But, we were thinking that perhaps the
> netCDF internal code works such that trying to access a subset of
> records is really accessing everything there.

No, the implementation is supposed to seek only to the offset in the file
where the data access occurs.  The header information at the beginning of
the file contains the number of records (the size of the unlimited
dimension), so the header also has to be updated when a new record is
written, but no intervening data should be accessed.
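
As a rough illustration of the access pattern described above, here is a
minimal Python sketch.  It is not the netCDF library code and does not use
the real netCDF file layout: the toy header (a single record count), the
one-value records, and the names write_toy_file, read_record and demo are
all assumptions made for the example.  It only shows that reading one
record touches the header and the requested bytes, with a seek in between.

```python
import os
import struct
import tempfile

# Hypothetical toy layout (NOT the netCDF format): a header holding only
# the record count, followed by fixed-size records of one 8-byte float.
HEADER_FMT = ">i"
HEADER_SIZE = struct.calcsize(HEADER_FMT)
RECORD_FMT = ">d"
RECORD_SIZE = struct.calcsize(RECORD_FMT)


def write_toy_file(path, values):
    """Write the record count, then one record per value."""
    with open(path, "wb") as f:
        f.write(struct.pack(HEADER_FMT, len(values)))
        for v in values:
            f.write(struct.pack(RECORD_FMT, v))


def read_record(path, index):
    """Read one record: the header first, then a seek straight to the data."""
    with open(path, "rb") as f:
        (numrecs,) = struct.unpack(HEADER_FMT, f.read(HEADER_SIZE))
        if not 0 <= index < numrecs:
            raise IndexError(index)
        # Jump over all earlier records without reading them.
        f.seek(HEADER_SIZE + index * RECORD_SIZE)
        (value,) = struct.unpack(RECORD_FMT, f.read(RECORD_SIZE))
        return value


def demo():
    """Write three records to a temporary file and read back the last one."""
    fd, path = tempfile.mkstemp()
    os.close(fd)
    try:
        write_toy_file(path, [1.5, 2.5, 3.5])
        return read_record(path, 2)
    finally:
        os.remove(path)
```

Whether a caching layer underneath fetches only those bytes or the whole
file is up to that layer; the sketch says nothing about it.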

> One reason we thought of this possibility is that we've found that
> putting a coordinate variable using the unlimited dimension (like
> time) after a large data variable in a netCDF file greatly slows
> access to the coordinate values as compared to putting it *before* the
> data variable in the file.  This makes it seem like netCDF is not
> behaving in a random-access manner, i.e. it takes a while for it to
> 'scan' past the data variable before it locates the time coordinate
> variable.  We're not sure how netCDF is doing the 'seek' to the data.

We don't have an explanation for this.  I just checked with Glenn Davis, who
implemented it, and he agrees that it should take essentially the same
amount of time to access a value at the beginning of each record as at the
end.  If you have a small test case that demonstrates otherwise, we'd be
interested, because it would indicate either a bug or a behavior of seek()
we don't understand.  The offset for each record variable within a record is
calculated once and stored in the header with other information about that
variable, so the only difference between the two situations should be adding
a zero offset vs. adding a nonzero offset to the record offset before the
seek.
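
The arithmetic described above can be sketched in a few lines of Python.
This is a simplified model, not the library's actual data structures: the
names build_layout and seek_offset, the data_start parameter, and the
example sizes are assumptions made for illustration.

```python
def build_layout(var_sizes):
    """Compute each variable's offset within a record, once.

    var_sizes is a list of (name, size_in_bytes) pairs in file order.
    Returns (offsets_by_name, record_size).
    """
    offsets = {}
    position = 0
    for name, size in var_sizes:
        offsets[name] = position
        position += size
    return offsets, position


def seek_offset(data_start, record_size, offset_in_record, rec_index):
    """File offset of a variable's data in record rec_index.

    The cost is one multiplication and two additions, whether
    offset_in_record is zero or not.
    """
    return data_start + rec_index * record_size + offset_in_record


# Hypothetical sizes: an 8-byte time value and a 4000-byte data variable.
time_first, recsize_a = build_layout([("time", 8), ("temp", 4000)])
time_last, recsize_b = build_layout([("temp", 4000), ("time", 8)])

# Offset of "time" in record 3, with record data starting at byte 100.
offset_a = seek_offset(100, recsize_a, time_first["time"], 3)
offset_b = seek_offset(100, recsize_b, time_last["time"], 3)
```

In this model the two orderings differ only in the constant added before
the seek, so neither should require scanning past the data variable.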

> So, what is your opinion on our situation?  Do you think that part of
> a netCDF file can be cached, i.e. just the "header" info and some of
> the data records?  If not, it appears that we might be better off
> creating smaller files on the optical platters.  Thanks for your
> advice.

Our understanding is that there should be no need for creating smaller
files, and that caching the header should be enough to get the performance
you want.  We'd be interested in hearing about your progress in diagnosing
this problem, especially if it indicates a problem with netCDF performance
that we can't currently explain.

--Russ

______________________________________________________________________________

Russ Rew                                                UCAR Unidata Program
address@hidden                                          P.O. Box 3000
http://www.unidata.ucar.edu/                          Boulder, CO 80307-3000
______________________________________________________________________________