Re: 4GigB variable size limit


Katie Antypas <kantypas@xxxxxxx> wrote:
> I'm jumping into the discussion late here, but coming from a perspective 
> of trying to find and develop an IO strategy which will work at the 
> petascale level, the 4 GigB variable size limitation is a major 
> barrier.  Already a 1000^3 grid variable can not fit into a single 
> netcdf variable.  Users at NERSC and other supercomputing centers 
> regularly run problems of this size or greater and IO demands are only 
> going to get bigger.  We don't believe chopping up data structures into 
> pieces is a good long term solution or strategy.  There isn't a natural 
> way to break up the data and chunking eliminates the elegance, ease and 
> purpose of a parallel IO library.  Besides the direct code changes, 
> analytics and visualization tools become more complicated as datafiles 
> from the same simulation but of different sizes would not have the same 
> number of variables.  Restarting a simulation from a checkpoint file on a 
> different number of processors would also become more convoluted.
> The view from NERSC is that if Parallel-NetCDF is to be a viable option 
> for users running large parallel simulations, this is a limitation that 
> must be lifted...

First a minor correction: a 1000^3 grid variable *can* fit into a
single netCDF variable if it's of type float, int, or a smaller type.
In fact a 1023^3 grid variable is still within the limits of 4GiB for
a single variable size.  For a record variable, the size could be up
to numrecs*1023^3, since the limit of 4 GiB is only on each record's
worth of data for a record variable, and you could have a large number
of variables of this size in the same netCDF file.
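
As a rough check on the arithmetic above, here is a small Python sketch.
It assumes the per-variable limit is exactly 2**32 bytes (the real format
limit may be a few bytes smaller), and the helper names are only
illustrative, not part of any netCDF API:

```python
# Assumption: treat the per-variable limit as 2**32 bytes (4 GiB), as stated
# in this message; the exact limit in the format may be slightly smaller.
LIMIT_BYTES = 2**32

def variable_bytes(shape, elem_bytes):
    """Total bytes needed for a fixed-size variable of the given shape."""
    total = elem_bytes
    for n in shape:
        total *= n
    return total

def fits(shape, elem_bytes, limit=LIMIT_BYTES):
    """True if the variable is strictly smaller than the assumed limit."""
    return variable_bytes(shape, elem_bytes) < limit

if __name__ == "__main__":
    cases = [(1000, 4, "float"), (1023, 4, "float"),
             (1024, 4, "float"), (1000, 8, "double")]
    for n, size, label in cases:
        nbytes = variable_bytes((n, n, n), size)
        print(f"{n}^3 {label}: {nbytes} bytes, fits={fits((n, n, n), size)}")
```

Under that assumption the 1000^3 and 1023^3 float grids fit, while a
1024^3 float grid or a 1000^3 double grid does not.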

However, we're very sympathetic with the intent of the above request,
to remove current 4GiB variable size limitations in the netCDF format,
and we're discussing the possibility of this with the parallel netCDF
developers.  It may be possible to remove such restrictions without
changing the CDF2 format.  To do this without changing the format
would also require that variables larger than 4GiB have more than one
dimension, since removing the current dimension size restriction of
2^32-1 *would* require a format change.  Also, netCDF files created
with large variables (> 4GiB) might not be portable to 32-bit
platforms.

Of course another option is using netCDF-4 when it's out of beta,
because it has no 4GiB limit on variable size.
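
To illustrate the distinction drawn above between a large total variable
size and a large single dimension, here is a Python sketch.  The
thresholds come from this message (2^32-1 per dimension, 4 GiB per
variable); the function name and category labels are hypothetical:

```python
MAX_DIM_LEN = 2**32 - 1      # dimension size limit cited in this message
VAR_LIMIT_BYTES = 2**32      # assumed 4 GiB per-variable limit

def classify(shape, elem_bytes):
    """Sort a variable into one of three illustrative categories."""
    if any(n > MAX_DIM_LEN for n in shape):
        # A dimension this long cannot be described without a new format.
        return "needs new format"
    total = elem_bytes
    for n in shape:
        total *= n
    if total >= VAR_LIMIT_BYTES:
        # Over 4 GiB overall, but every dimension is still representable.
        return "large variable, format possibly unchanged"
    return "fits today"

if __name__ == "__main__":
    print(classify((1000, 1000, 1000), 4))   # float grid under 4 GiB
    print(classify((2048, 2048, 2048), 4))   # 32 GiB, multi-dimensional
    print(classify((2**33,), 1))             # one very long dimension
```

Only the last case runs into the dimension size restriction; the middle
case is the kind of variable that might be supportable without a format
change.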

Are there important use cases for needing a single dimension length
greater than 2^32?  An example of this much data would be a time
series of a single measurement taken every 0.01 seconds for more than
500 days.  As I mentioned above and as John Caron pointed out,
supporting dimensions larger than 2^32 would be a much bigger deal,
requiring a new format, making data inaccessible on 32-bit platforms,
and even causing problems in other language interfaces such as the
Java interface.
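
For the time series example, a quick Python calculation of how long it
takes to collect 2^32 samples at one sample per 0.01 seconds (the helper
name is just for illustration):

```python
SECONDS_PER_DAY = 86400

def days_to_collect(n_samples, interval_s):
    """Days needed to record n_samples at a fixed sampling interval."""
    return n_samples * interval_s / SECONDS_PER_DAY

if __name__ == "__main__":
    days = days_to_collect(2**32, 0.01)
    print(f"2^32 samples at 0.01 s spacing: {days:.1f} days")
```

That works out to roughly 497 days, so a series running longer than
about 500 days would exceed 2^32 values along the time dimension.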



Russ Rew                                         UCAR Unidata Program
