
[netCDF #AMR-714212]: netcdf file size for limited vs unlimited



Oscar,

> This settles this issue for me, I have no further questions. Thanks a
> lot for your clear explanation.

I just want to let you know first that netcdf-4 is not a solution to this
problem, in that it won't save you the wasted space, at least for a small
example I just tried.  The underlying HDF5 format uses a B-tree structure
for unlimited variables to keep track of the resulting chunks, and that
uses a lot of space.

I'll post a table I'm constructing of the space used by all the cases you
are considering, and I think it will be clear from that that netcdf-4 
would not be suitable to save space in this case, unless you used
compression.

--Russ


> I posted this question on the MathWorks-forum as well (no reply so far);
> do you mind if I reply to my own message copy-pasting your answer? I'll
> leave the credit to the unidata-support of course.
> 
> cheers,.................Oscar
> 
> 
> 
> -----Original Message-----
> From: Unidata netCDF Support [mailto:address@hidden]
> Sent: Wednesday, 24 March, 2010 19:38
> To: Hartogensis, Oscar
> Cc: address@hidden
> Subject: [netCDF #AMR-714212]: netcdf file size for limited vs unlimited
> 
> Hi Oscar,
> 
> > Writing multiple 1-dimensional variables (a time-series) to a
> > netcdf-file formatted as nc_type "short", I noticed that the file
> > becomes twice as large when using an unlimited versus a limited
> > dimension definition.
> >
> > However:
> > 1. Writing one variable of nc_type 'short' only, both the limited and
> > unlimited dimension files are of the same size...
> > 2. Writing all data as floats the limited and unlimited dimension
> > nc-files are of equal size (double the size of the limited dimension
> > file of type short; as expected). It seems that using multiple
> > variables of unlimited dimension means that the data is always written
> > as a float?, or am I doing something wrong?
> 
> Dennis's answer was close, in that you need to know something about the
> underlying netCDF-classic format to explain this.  The reason is that
> the space for each variable's data in a record is padded up to the next
> multiple of 4 bytes, so a 2-byte short takes 4 bytes per record.  This
> makes sure each variable's data starts on a 4-byte boundary, which is an
> optimization for disk seeks on some platforms.
> 
> There is a special case if there is only one record variable, in which
> case no padding is used for byte or short variables.  These padding
> rules are documented in the format specification:
> 
> 
> http://www.unidata.ucar.edu/netcdf/docs/netcdf.html#NetCDF-Classic-Format
> 
> and specifically in the description of the "varslab", which is a
> record's worth of data for a single variable, along with the special
> note at the end of the specification on padding:
> 
> Note on padding: In the special case of only a single record variable
> of character, byte, or short type, no padding is used between data
> values.
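For what it's worth, the padding arithmetic can be checked with a rough Python sketch. It is only an illustration under stated assumptions: it counts data bytes only (the file header and any padding at the very end of the file are ignored), and the helper names are made up for this example rather than taken from any netCDF API.

```python
# Sketch of the classic-format padding rules for 1-D variables.
# All names here are hypothetical helpers, not netCDF library calls.
TYPE_SIZE = {"byte": 1, "char": 1, "short": 2, "int": 4, "float": 4, "double": 8}


def padded(nbytes, align=4):
    """Round nbytes up to the next multiple of align."""
    return (nbytes + align - 1) // align * align


def fixed_data_bytes(var_types, n):
    """Data bytes for variables over a fixed dimension of length n."""
    # Each variable's whole data block is padded to a 4-byte boundary.
    return sum(padded(TYPE_SIZE[t] * n) for t in var_types)


def record_data_bytes(var_types, n_records, per_record=1):
    """Data bytes for record variables holding per_record values per record."""
    if len(var_types) == 1:
        # Special case: a single record variable gets no padding between records.
        return TYPE_SIZE[var_types[0]] * per_record * n_records
    # Otherwise each variable's slab within a record is padded to 4 bytes.
    record_size = sum(padded(TYPE_SIZE[t] * per_record) for t in var_types)
    return record_size * n_records


N = 80000
print(fixed_data_bytes(["short", "short"], N))    # 320000
print(record_data_bytes(["short", "short"], N))   # 640000
print(record_data_bytes(["short"], N))            # 160000
print(record_data_bytes(["short", "short"], N // 2, per_record=2))  # 320000
```

320000 and 640000 bytes are 312.5 and 625 KiB, in line with the 312kB and 625kB file sizes reported. The last line packs two shorts into each record of each variable, so every slab is already 4 bytes and nothing is wasted.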
> 
> As for a way to get around this problem, all I can think of is to use an
> extra artificial dimension to make the short variables 2-dimensional,
> such as:
> 
> netcdf unlim2 {
> dimensions:
> time = unlimited;
> two = 2;
> variables:
> short var1(time, two);
> short var2(time, two);
> data:
> var1 =
> 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19;
> var2 =
> 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19; }
> 
> You can still read these values all at once in a contiguous block, and a
> small layer of software would let you write the values two-at-a-time,
> using a function you would call for each value that would save odd
> values and write to the file when it had 2 values.
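The "small layer of software" might look something like the Python sketch below. This is an assumption-laden illustration, not anything from the netCDF library: `PairWriter` and the `write_record` callback are hypothetical names, and padding a leftover odd value with a fill value on close is a choice made for this example.

```python
class PairWriter:
    """Collect single values and pass them on two at a time."""

    def __init__(self, write_record, fill_value=0):
        # write_record(record_index, (first, second)) is a hypothetical
        # callback standing in for whatever call writes one record.
        self._write_record = write_record
        self._fill_value = fill_value
        self._pending = []
        self.records_written = 0

    def put(self, value):
        """Hold an odd-numbered value; write once its partner arrives."""
        self._pending.append(value)
        if len(self._pending) == 2:
            self._flush()

    def close(self):
        """Write a leftover value, padded with fill_value, if there is one."""
        if self._pending:
            self._pending.append(self._fill_value)
            self._flush()

    def _flush(self):
        self._write_record(self.records_written, tuple(self._pending))
        self.records_written += 1
        self._pending = []


# Usage with an in-memory stand-in for the file.
records = []
writer = PairWriter(lambda index, pair: records.append((index, pair)))
for v in range(5):
    writer.put(v)
writer.close()
print(records)  # [(0, (0, 1)), (1, (2, 3)), (2, (4, 0))]
```

In real use the callback would write one (time, two) record at the given index; the reader then has to know how many of the stored values are genuine if the count was odd.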
> 
> --Russ
> 
> 
> > The files I write are quite large and I need to use an unlimited
> > dimension as I don't know the record length in advance (I join
> > multiple files into one netcdf file) but I don't like to waste double
> > the disk-space on my nc-files.
> 
> > I use Matlab to write nc-files and I tried the Matlab-native netcdf
> > commands (example below), but also the mexcdf-toolbox and snctools.
> > All give the same result. This seems to be more a netcdf than a Matlab
> > issue. Any help is much appreciated though.
> >
> >
> > EXAMPLE1 to illustrate this issue (Matlab native commands):
> > %%%%%%%%%%%%%%%%%%%%%%%%
> > N=80000;
> >
> > % LIMITED dimension
> > % creating a netcdf file
> > nc = netcdf.create('testfile_lim.nc', 'NC_CLOBBER');
> > % define dimension
> > time_dim = netcdf.defDim(nc, 'time', N);
> > % define variables
> > var1_id = netcdf.defVar(nc, 'var1', 'short', time_dim);
> > var2_id = netcdf.defVar(nc, 'var2', 'short', time_dim);
> > netcdf.endDef(nc);
> > % write data
> > netcdf.putVar(nc, var1_id,int16([1:N]));
> > netcdf.putVar(nc, var2_id,int16([1:N]));
> > % close nc-file
> > netcdf.close(nc)
> >
> > % UNLIMITED dimension
> > % creating a netcdf file
> > nc = netcdf.create('testfile_unlim.nc', 'NC_CLOBBER');
> > % define dimension
> > time_dim = netcdf.defDim(nc, 'time', netcdf.getConstant('NC_UNLIMITED'));
> > % define variables
> > var1_id = netcdf.defVar(nc, 'var1', 'short', time_dim);
> > var2_id = netcdf.defVar(nc, 'var2', 'short', time_dim);
> > netcdf.endDef(nc);
> > % write data
> > netcdf.putVar(nc, var1_id,0,N,int16([1:N]));
> > netcdf.putVar(nc, var2_id,0,N,int16([1:N]));
> > % close nc-file
> > netcdf.close(nc)
> > %%%%%%%%%%%%%%%%%%%%%%%%
> >
> > testfile_lim.nc => 312kB
> > testfile_unlim.nc => 625kB
> >
> >
> >
> > EXAMPLE2 to illustrate this issue (mexcdf commands):
> > %%%%%%%%%%%%%%%%%%%%%%%%
> > N=80000;
> >
> > nc_lim = netcdf( 'test_lim.nc' , 'clobber');
> > nc_unlim = netcdf( 'test_unlim.nc' , 'clobber');
> >
> > nc_lim('time') = N;
> > nc_unlim('time') = 0;
> >
> > nc_lim{'var1'} = ncshort('time');
> > nc_lim{'var2'} = ncshort('time');
> > nc_unlim{'var1'} = ncshort('time');
> > nc_unlim{'var2'} = ncshort('time');
> >
> >
> > nc_unlim{'var1'}([1:N]) = int16([1:N]);  % Store data
> > nc_unlim{'var2'}([1:N]) = int16([1:N]);  % Store data
> >
> > nc_lim{'var1'}(:) = int16([1:N]);  % Store data
> > nc_lim{'var2'}(:) = int16([1:N]);  % Store data
> >
> > close(nc_lim);
> > close(nc_unlim);%%%%%%%%%%%%%%%%%%%%%%%%
> >
> > test_lim.nc => 312kB
> > test_unlim.nc => 625kB
> >
> >
> > thanks,............................Oscar Hartogensis
> >
> >
> > ---------------------------------------------------------
> > Oscar K Hartogensis
> > Meteorology and Air Quality Group
> > Wageningen University
> > mail: PO Box 47, 6700 AA Wageningen, the Netherlands
> > visit: Atlas, building 104, Droevendaalsesteeg 4,
> > 6708 PB Wageningen, the Netherlands
> > tel: +31 (0)317 482109
> > fax: +31 (0)317 419000
> > email: address@hidden
> > url: www.met.wau.nl
> > ---------------------------------------------------------
> >
> >
> 
> Russ Rew                                         UCAR Unidata Program
> address@hidden                      http://www.unidata.ucar.edu
> 
> 
> 
> Ticket Details
> ===================
> Ticket ID: AMR-714212
> Department: Support netCDF
> Priority: Normal
> Status: Closed
> 
> 
> 
> 

Russ Rew                                         UCAR Unidata Program
address@hidden                      http://www.unidata.ucar.edu


