
[netCDF #GSH-862796]: netCDF and very large datasets



Hi Dennis (and John),

> Recently, several of us [eg, John Dennis, CISL] have been working
> with output from a very high resolution ocean model. This model
> produces an enormous amount of data. One netCDF monthly mean file
> is ~18GB. Ultimately, the model will produce 100s of TBs.
> 
> Background:
> There are ~33 'science' variables [eg, UVEL, VVEL, TEMP, ...].
> All are type "float" and, of course, vary with time:
> 
> float TEMP ( time, z_t, nlat, nlon )   => (1, 42, 2400, 3600)
> float SSH  ( time, nlat, nlon )        => (1, 2400, 3600)
> 
> There are also numerous time invariant variables.
> Some are standard coordinate variables, eg, z_t(z_t).
> The other 21 time invariant variables are large two
> dimensional arrays: 7 of type float or integer and 14 of type double.
> 
> float TLONG ( nlat, nlon )     => 34.5MB
> float ULAT  ( nlat, nlon )
> 
> double UAREA ( nlat, nlon )    => 69MB each
> double TAREA ( nlat, nlon )
> 
> Hence, ~1.2GB (14*69MB + 7*34.5MB) of each file consists of
> time invariant fields, about 6.6% (1.2/18) of each 18GB file.
> 
> This is tolerable even though, summed over all the files
> (ie, time steps), the total disk space taken by the time invariant
> variables runs to TBs.  From a user perspective, the file sizes can
> cause problems.  For example, if the files are on CISL's Mass Storage
> System, obtaining multiple files requires significant time.
> 
> 
> Generally, users work with a small subset of the variables.
> One idea was to create one file per scalar variable (eg, TEMP)
> and one file per vector pair (eg, TAUX and TAUY).  Each such file
> would also contain the time invariant variables associated with
> the variable(s) in the file.  This would still result in
> considerable storage requirements for the invariant variables.
> So far, nothing would violate the netCDF [CF] mantra that each
> file should be self describing (self contained).
> 
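
For concreteness, a per-variable file along those lines might look
something like the CDL sketch below.  The variable names come from
your example; the file name, the time coordinate, and the choice of
which invariant fields travel with TEMP (here TLONG and TAREA) are
just illustrative:

  netcdf TEMP_199001 {
  dimensions:
          time = UNLIMITED ; // (1 currently)
          z_t = 42 ;
          nlat = 2400 ;
          nlon = 3600 ;
  variables:
          double time(time) ;
          double z_t(z_t) ;
          float TEMP(time, z_t, nlat, nlon) ;
          // time invariant fields repeated in every such file:
          float TLONG(nlat, nlon) ;
          double TAREA(nlat, nlon) ;
  }
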
> Long preface to a short question/comment:
> 
> To your knowledge, has anybody thought about a 'mechanism' for
> specifying where multi-dimensional time invariant information
> would be stored/accessible?
> 
> My guess is "no" but I do think this will be an issue for large
> multifile datasets.

Yes, there has been some thought and discussion about this issue, which
came up in the context of V. Balaji's GridSpec, a specification for
structured grids that can be nested, mosaicked, staggered, and refined.
The idea is that a time-invariant grid specification should be shared
among model outputs, since the grid specification can take a great deal
of space by itself.  Some discussion of the issues and problems with
multi-file datasets has taken place in the arena of CF metadata
standards, because GridSpec is moving forward as a proposed CF standard.

The current proposal is in the section "Linkages between files" in
GridSpec:

  http://www.gfdl.noaa.gov/~vb/gridstd/gridstdse3.html#x5-210003.1

I think Balaji has had some more recent ideas about this, perhaps
involving storing a file signature (for example an MD5 digest)
along with a reference to an external file, so that any change in
the external file "breaks the link", providing more confidence in
the integrity of a multifile dataset.
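
Just as a sketch of the kind of linkage being discussed (the attribute
names here are invented, not part of GridSpec or any CF convention),
a data file might carry a pointer to a shared grid file together with
a digest of that file:

  netcdf TEMP_199001 {
  dimensions:
          time = UNLIMITED ;
          z_t = 42 ;
          nlat = 2400 ;
          nlon = 3600 ;
  variables:
          float TEMP(time, z_t, nlat, nlon) ;

  // global attributes:
          :grid_file = "ocean_grid.nc" ;  // holds TLONG, TAREA, ...
          :grid_file_md5 = "<MD5 digest of ocean_grid.nc>" ;
  }

A reader would recompute the digest of ocean_grid.nc before using its
TLONG, TAREA, etc., and treat a mismatch as a broken link.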

> URLs change, so that is not an option.  I have heard that
> DOIs [Digital Object Identifiers] could be used to
> specify a location, but this is down the road.

The issues around stewardship, preservation, and data durability are
also a hot topic among participants in the evolution of CF standards.
For example, the CF standard names table must survive changes in
domain names, institutional funding, and naming technologies.  There
is enough interest in this topic that I think some good solutions will
come out of the competition among concepts like PURLs, URNs, and DOIs.
In the meantime, using a relative URL so that multifile datasets are
within a movable directory structure may be adequate ...
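
For example, with the same invented attribute as in the sketch above,
the link could be written relative to the data file's own directory,
so the whole tree can be moved as a unit:

          :grid_file = "../grid/ocean_grid.nc" ;  // resolved relative to this data file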

--Russ

Russ Rew                                         UCAR Unidata Program
address@hidden                     http://www.unidata.ucar.edu



Ticket Details
===================
Ticket ID: GSH-862796
Department: Support netCDF
Priority: Normal
Status: Closed