
Re: Proposal for Piped NetCDF



>To: address@hidden
>From: Frank Toussaint <address@hidden>
>Subject: Re: 20050614: Proposal for Piped NetCDF
>Organization: World Data Center for Climate, Max-Planck-Institute for 
>Meteorology
>Keywords: streaming data, concatenation, pipes

Hi Frank,

> Regarding our talk at yesterday's conference lunch, we make the
> following proposal for implementation in future NetCDF
> versions. The aim is to make the work on nc files as easy
> as with GRIB format without losing the specific advantages of 
> NetCDF.
> 
> We would be glad if you could consider this for future releases
> of NetCDF.
> 
> 
> Best regards...  frank
> 
> 
> 
> -----------------------------------
> 
> 1. Work with GRIB
> 2. Situation at our Institute
> 3. Proposal
> 4. Technical Remarks
> 
> 
> 1. Work with GRIB
> -----------------
> In GRIB1 data is organized in records, each consisting of some 
> hundred header bytes followed by a global coverage of data 
> (e.g., LONxLAT=320x160).
> 
> Extensions in several dimensions are allowed: Before reaching the 
> end of the first record the reading application does not know,
> 1) whether there follows another record,
> 2) if so, whether it contains the next level, the next parameter
> or the next time step
> 
> So for any n GRIB files  gf1.grb,.. gfn.grb  the concatenation by
> the unix cat command results in a valid GRIB file:
>   cat gf?.grb  > newGribFile.grb
> This allows for piping and piecewise consumption of files of 
> virtually any size.
> 
> However, the full amount of flexibility is not necessarily needed 
> for a pipable NetCDF. Concatenation can be done by more specific 
> routines as well.
> 
> 
> 2. Situation at our Institute
> -----------------------------
> The World Data Center for Climate has most of its data stored in a DB
> as individually accessible records of one time step each. Downloading
> a period in GRIB format is very convenient, as the records are
> simply concatenated at the system (Unix) level. This allows for
> piping the stream of records coming from the DB directly into the
> stream of the data request (mostly http); simple functions (cuts in
> any dimension, simple arithmetic) can be done on the fly. Data volume
> is not an issue for datasets of any size.
> 
> However, in NetCDF format the situation requires disk buffering 
> followed by concatenation on disk and final piping of a possibly 
> very big dataset.
> 
> 
> 3. Proposal
> -----------
> To allow for a simple form of concatenation and for piping (without
> buffering) data files, we propose the following change.
> 
> Any valid netCDF structure (header+data portion) that is followed
> by another valid NetCDF structure within the same file should be 
> a new valid netCDF structure. The constraints may be the same
> as they are presently for concatenation along the "current dimension"
> by CDO operators. However, this might not necessarily be the case.
> 
> More generally:
> Perhaps it can even be arranged that concatenated NetCDF files
> consist of NetCDF structures (header+data) that extend the data
> domain of the first structure in any of its dimensions as long as
> the extension in all other dimensions stays the same.
> 
> 
> 4. Technical Remarks
> --------------------
> Simple calculations on similarly structured files can be done with
> small handy routines. E.g., for a simple addition of two files they
> can be piped into the program and the result streamed out. 
> The program then works on the NetCDF structures one by one.
> 
> In C, the seek/flush commands will probably not work the same way
> they do now. Perhaps they have to be forbidden for this kind of
> "coupled" netCDF files.
> 
> Perhaps the headers of the various NetCDF structures within one
> coupled file need one or more of the following flags:
> 1) at the end of this data portion another similar nc file follows
> 2) at the end of this data portion another similar nc file may follow
> 3) at the end of this data portion no other nc file follows (last 
>    data portion)
> 
> However, as the end of the data portion of a file presently can be 
> determined by the header entries, and, in addition, by the EOF mark 
> this might not be necessary.
> 
> Perhaps the concatenation of files needs to be restricted to special
> netCDF tools (not the simple Unix "cat .." command) to guarantee
> consistency. These routines, however, can use piping for IO.
> 
> 
> -----------------------------------

Thanks for this proposal.  You've explained the benefits of the
approach very clearly.  Previously we've assumed something like this
wasn't feasible with netCDF, which makes use of lseek() calls for
data access, because "you can't seek on a pipe".  But the OPeNDAP
protocol provides for streaming netCDF data across a network.
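
To illustrate the "you can't seek on a pipe" point, here is a minimal
Python sketch, assuming a POSIX system. The helper name can_seek is
made up for this example and is not part of any netCDF interface. It
calls lseek() on the read end of a pipe and on a regular temporary
file and records whether the call is accepted.

```python
import errno
import os
import tempfile


def can_seek(fd):
    """Return True if lseek() works on fd, False if the OS reports ESPIPE."""
    try:
        os.lseek(fd, 0, os.SEEK_CUR)
        return True
    except OSError as exc:
        if exc.errno == errno.ESPIPE:  # "Illegal seek": fd is a pipe, FIFO, or socket
            return False
        raise


# A pipe: the kind of stream the proposal wants to read netCDF data from.
read_end, write_end = os.pipe()
pipe_seekable = can_seek(read_end)
os.close(read_end)
os.close(write_end)

# A regular file on disk: what the library normally works with.
with tempfile.TemporaryFile() as tmp:
    file_seekable = can_seek(tmp.fileno())

print("pipe seekable:", pipe_seekable)
print("file seekable:", file_seekable)
```

Any reader that has to work on a pipe therefore needs to consume the
bytes strictly in order, without jumping back to the header.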

An obstacle to implementing the concatenation proposal is the way the
unlimited dimension is used in netCDF files, to permit appending to
one or more variables along the unlimited dimension.  If Unix file
systems supported forks, that might permit appending to variables in
the middle of a multi-dataset file, but with current file systems,
this would restrict the suggested approach to read-only access.

Currently if anything is concatenated on the end of a netCDF file, it
cannot be detected in the interface, which just accesses data with the
information in the initial header.  This opens the possibility of
concatenating multiple read-only netCDF files together to send them
through a pipe, then splitting them into multiple netCDF datasets on
the other side of the pipe.
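
A rough sketch of the send-then-split idea, in Python. Note the
assumption: rather than parsing each netCDF header to find where one
dataset ends, this sketch puts an 8-byte length in front of every
dataset. That length prefix is an invented framing for illustration
only and is not part of the netCDF format; the function names are
likewise hypothetical.

```python
import io
import struct

# Assumed framing: 8-byte big-endian byte count before each dataset.
HEADER = struct.Struct(">Q")


def write_framed(stream, datasets):
    """Write each dataset (bytes) to stream, preceded by its length."""
    for blob in datasets:
        stream.write(HEADER.pack(len(blob)))
        stream.write(blob)


def read_exact(stream, n):
    """Read exactly n bytes; pipes may return fewer bytes per read() call."""
    chunks = []
    remaining = n
    while remaining:
        chunk = stream.read(remaining)
        if not chunk:
            raise EOFError("stream ended inside a frame")
        chunks.append(chunk)
        remaining -= len(chunk)
    return b"".join(chunks)


def read_framed(stream):
    """Yield the datasets back one by one, reading strictly forward."""
    while True:
        first = stream.read(1)
        if not first:  # clean end of stream between frames
            return
        header = first + read_exact(stream, HEADER.size - 1)
        (length,) = HEADER.unpack(header)
        yield read_exact(stream, length)


# Stand-in payloads; real use would pass the bytes of whole netCDF files.
payloads = [b"CDF\x01 first dataset", b"CDF\x01 second, longer dataset"]
pipe_like = io.BytesIO()
write_framed(pipe_like, payloads)
pipe_like.seek(0)  # rewind the in-memory buffer only to simulate the receiving side
received = list(read_framed(pipe_like))
```

The receiver never seeks, so the same reader would work on
sys.stdin.buffer; each recovered dataset could then be written to
disk and opened as an ordinary read-only netCDF file.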

Another consideration with the suggested approach is that HDF5 does
not support concatenating HDF5 files into a single HDF5 file, so our
new netCDF-4 format that uses HDF5 could not be implemented this way.

Nevertheless, we'll have to consider the other ways you've suggested
to get the benefits of this sort of access.  Thanks for the ideas!

--Russ