
[netCDFJava #BNA-191717]: chunking in Java



Jeff,

> According to this page, the default chunk size for an unlimited
> dimension is 1. That still doesn't explain why the file size didn't
> change when I changed the chunk size.
> 
> https://www.unidata.ucar.edu/software/netcdf/docs/netcdf/Default-Chunking.html
> 4.9.2 The Default Chunking Scheme in version 4.1 (and 4.1.1)
> 
> When the data writer does not specify chunk sizes for a variable, the
> netCDF library has to come up with some default values.
> 
> The C code below determines the default chunk sizes.
> 
> For unlimited dimensions, a chunk size of one is always used. Users are
> advised to set chunk sizes for large data sets with one or more unlimited
> dimensions, since a chunk size of one is quite inefficient.
> 
> For fixed dimensions, the algorithm below finds a size for the chunk sizes
> in each dimension which results in chunks of DEFAULT_CHUNK_SIZE (which can
> be modified in the netCDF configure script).
> 
> /* Unlimited dim always gets chunksize of 1. */
> if (dim->unlimited)
>     chunksize[d] = 1;
> else
>     chunksize[d] = pow((double)DEFAULT_CHUNK_SIZE/type_size,
>                        1/(double)(var->ndims - unlimdim));

You're right: that documentation describes the way chunk sizes
were determined in netCDF-C version 4.1.1, but we need to update
it to reflect the current behavior for the special case of
1-dimensional record variables.  In netCDF-C version 4.3.2 and
later, the default chunk size for such variables is
DEFAULT_CHUNK_SIZE/type_size elements (that is, DEFAULT_CHUNK_SIZE
bytes), where DEFAULT_CHUNK_SIZE is a configure-time constant and
type_size is the size in bytes of the variable's type.  For a
double variable, for example, that's 4194304/8 = 524288 elements.

I've just verified that's what happens with your case:

  dimensions:
     time = UNLIMITED; // currently 10000
  variables:
     double time(time);

in netCDF-C version 4.3.2.  Leaving chunking unspecified in this
case results in

       time:_ChunkSizes = 524288 ;

if you haven't changed DEFAULT_CHUNK_SIZE from its default of 4 MB
when the library was built.
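
If the 4 MB default is too large for a file you expect to stay
small, you can set explicit chunk sizes when you define the
variable.  Here's a minimal sketch using the netCDF-C API (the
file name and the 8192-element chunk length are just illustrative,
and error checking is omitted):

  #include <netcdf.h>

  int ncid, dimid, varid;
  size_t chunksize[1] = {8192};  /* chunk length in elements, not bytes */

  nc_create("example.nc", NC_NETCDF4 | NC_CLOBBER, &ncid);
  nc_def_dim(ncid, "time", NC_UNLIMITED, &dimid);
  nc_def_var(ncid, "time", NC_DOUBLE, 1, &dimid, &varid);
  nc_def_var_chunking(ncid, varid, NC_CHUNKED, chunksize);
  nc_enddef(ncid);
  /* ... write data, then ... */
  nc_close(ncid);

You can verify the chunk sizes actually used with "ncdump -s -h",
which shows them in the _ChunkSizes special attribute.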

The 4 MB default chunk size is appropriate in HPC contexts, where
the physical disk block size is often megabytes, but it's too
large as a default for desktop machines, which often use 4 KB or
8 KB disk blocks.  There's no good way to determine the right
disk-block size to use at configure time, because the library is
often used to create files on different file systems with
different disk block sizes.

But it looks like we should make the default chunk size for 1D
record variables smaller, e.g. 8KB.  Otherwise, what should be a
small file, e.g. about 80KB of data as in your case, instead gets
a whole 4 MB chunk allocated for the variable, which results in a
file size of about 4 MB.  The reason we can't base the default on
the size of the unlimited dimension is that there is no way to
know at variable-creation time how many records will be written ...
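
For illustration only, here's roughly what such a change might
look like in the default-chunking logic quoted above (the
SMALL_CHUNK_BYTES constant and its placement here are
hypothetical, not actual netCDF-C code):

  /* Hypothetical: cap default chunks for 1-D record variables at 8 KB. */
  #define SMALL_CHUNK_BYTES 8192

  if (var->ndims == 1 && dim->unlimited)
      chunksize[0] = SMALL_CHUNK_BYTES / type_size;  /* 1024 doubles */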

--Russ

> address@hidden> wrote:
> 
> > Jeff,
> >
> > > Thanks for the additional info. I will be using release 4.3.21 (or
> > > later) regardless of what file format we ultimately end up using.
> > > You mentioned that 4.3.2 should improve the default chunking, but
> > > the results I sent were already using a release newer than that, so
> > > it sounds like I shouldn't expect any improvements in the NC4 file
> > > size at this point, correct?
> >
> > The improvement that netCDF C version 4.3.2 made was to change the default
> > chunk size for 1-dimensional record variables to DEFAULT_CHUNK_SIZE bytes,
> > where DEFAULT_CHUNK_SIZE is a configure-time constant with default value
> > 4194304.  I'm surprised that using different chunk sizes made no difference
> > in the file size, so I may try to duplicate your results to understand how
> > that happened.
> >
> > --Russ
> >
> > > address@hidden> wrote:
> > >
> > > > Hi Jeff,
> > > >
> > > > > From those articles the purpose of chunking is to improve
> > > > > performance for large multi-dimensional data sets. It seems like
> > > > > it won't really provide any benefit in our situation since we
> > > > > only have one dimension. I know that NetCDF4 added chunking, but
> > > > > are all NetCDF4 files chunked, i.e., is there such a thing as a
> > > > > non-chunked NetCDF4 file? Or is that a contradiction in terms
> > > > > somehow?
> > > >
> > > > No, not all netCDF-4 files are chunked.  The simpler alternative,
> > > > contiguous layout, is better if you don't need compression,
> > > > unlimited dimensions, or support for multiple patterns of access
> > > > that chunking makes possible in netCDF-4 files.
> > > >
> > > > A netCDF-4 variable can use contiguous layout if it doesn't use an
> > > > unlimited dimension or any sort of filter such as compression or
> > > > checksums.
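> > > >
> > > > For example, here's a minimal sketch using the netCDF-C API to
> > > > request contiguous layout explicitly (the names are illustrative,
> > > > and error checking is omitted):
> > > >
> > > >   int ncid, dimid, varid;
> > > >   nc_create("example.nc", NC_NETCDF4 | NC_CLOBBER, &ncid);
> > > >   nc_def_dim(ncid, "time", 10000, &dimid);  /* fixed, not unlimited */
> > > >   nc_def_var(ncid, "time", NC_DOUBLE, 1, &dimid, &varid);
> > > >   /* contiguous storage: no chunking, so no chunk sizes needed */
> > > >   nc_def_var_chunking(ncid, varid, NC_CONTIGUOUS, NULL);
> > > >   nc_enddef(ncid);
> > > >   nc_close(ncid);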
> > > >
> > > > > Given that NetCDF4 readers are backwards-compatible with NetCDF3
> > > > > files, is there any reason not to use a NetCDF3 file from your
> > > > > perspective? My suspicion is that our requirement is just being
> > > > > driven by "use the latest version" rather than any technical
> > > > > reasons.
> > > >
> > > > I think I agree with you.  With only one unlimited dimension, and
> > > > if you don't need the transparent compression that netCDF-4 makes
> > > > possible, there's no reason not to just use the default contiguous
> > > > layout that a netCDF-3 format file provides.  However, you should
> > > > still use the netCDF-4 library; just don't specify the netCDF-4
> > > > format when you create the file.  That's because the netCDF-4
> > > > software includes bug fixes, performance enhancements, portability
> > > > improvements, and remote access capabilities not available in the
> > > > old netCDF-3.6.3 version software.
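> > > >
> > > > For instance (a minimal sketch with the netCDF-C library; the file
> > > > name is illustrative and error checking is omitted), omitting the
> > > > NC_NETCDF4 flag makes nc_create use the default classic format:
> > > >
> > > >   int ncid;
> > > >   /* no NC_NETCDF4 flag, so the default classic format is used */
> > > >   nc_create("example.nc", NC_CLOBBER, &ncid);
> > > >   /* ... define dimensions and variables, write data ... */
> > > >   nc_close(ncid);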
> > > >
> > > > The reason you were seeing a 7-fold increase in size is exactly as
> > > > Ethan pointed out: it's due to the way the HDF5 storage layer
> > > > implements unlimited dimensions, using chunking implemented with
> > > > B-tree data structures and indices, rather than the simpler
> > > > contiguous storage used in the classic netCDF format.  The recent
> > > > netcdf-4.3.2 version improves the default chunking for
> > > > 1-dimensional variables with an unlimited dimension, as in your
> > > > case, so it may be sufficient to provide both smaller files and
> > > > the benefits of netCDF-4 chunking, but without testing I can't
> > > > predict how close it comes to the simpler netCDF classic format in
> > > > this case.  Maybe I can get time later today to try it ...
> > > >
> > > > > I couldn't find anything on the NetCDF website regarding
> > > > > "choosing the right format for you". I was hoping there'd be
> > > > > something along those lines in the FAQ, but no luck.
> > > >
> > > > The FAQ section on "Formats, Data Models, and Software Releases"
> > > >
> > > >    http://www.unidata.ucar.edu/netcdf/docs/faq.html
> > > >
> > > > is intended to clarify the somewhat complex situation with
> > > > multiple versions of netCDF data models, software, and formats,
> > > > but evidently doesn't help much in your case of choosing whether
> > > > to use the default classic netCDF format, the netCDF-4 classic
> > > > model format, or the netCDF-4 format.
> > > >
> > > > Thanks for pointing out the need for improving this section, and
> > > > in particular the answer to the FAQ "Should I get netCDF-3 or
> > > > netCDF-4?", which should really address the question "When should
> > > > I use the netCDF classic format?".
> > > >
> > > > --Russ
> > > >
> > > > > address@hidden> wrote:
> > > > >
> > > > > > Hi Jeff,
> > > > > >
> > > > > > How chunking and compression affect file size and read/write
> > > > > > performance is a complex issue. I'm going to pass this along
> > > > > > to our chunking expert (Russ Rew) who, I believe, is back in
> > > > > > the office on Monday and should be able to provide you with
> > > > > > some better advice than I can give.
> > > > > >
> > > > > > In the meantime, here's an email he wrote in response to a
> > > > > > conversation on the effect of chunking on performance that
> > > > > > might be useful:
> > > > > >
> > > > > > http://www.unidata.ucar.edu/mailing_lists/archives/netcdfgroup/2013/msg00498.html
> > > > > >
> > > > > > Sorry I don't have a better answer for you.
> > > > > >
> > > > > > Ethan
> > > > > >
> > > > > > Jeff Johnson wrote:
> > > > > > > Ethan-
> > > > > > >
> > > > > > > I made the changes you suggested with the following result:
> > > > > > >
> > > > > > > 10000 records, 8 bytes / record = 80000 bytes raw data
> > > > > > >
> > > > > > > original program (NetCDF4, no chunking): 537880 bytes (6.7x)
> > > > > > > file size with chunk size of 2000 = 457852 bytes (5.7x)
> > > > > > >
> > > > > > > So a little better, but still not good. I then tried
> > > > > > > different chunk sizes of 10000, 5000, 200, and even 1, which
> > > > > > > I would've thought would give me the original size, but all
> > > > > > > gave the same resulting file size of 457852.
> > > > > > >
> > > > > > > Finally, I tried writing more records to see if it's just a
> > > > > > > symptom of a small data set. With 1M records:
> > > > > > >
> > > > > > > 8MB raw data, chunk size = 2000
> > > > > > > 45.4MB file (5.7x)
> > > > > > >
> > > > > > > This is starting to seem like a lost cause given our small
> > > > > > > data records. I'm wondering if you have information I could
> > > > > > > use to go back to the archive group and try to convince them
> > > > > > > to use NetCDF3 instead.
> > > > > > >
> > > > > > > jeff
> > > > > >
> > > > > >
> > > > >
> > > > >
> > > > > --
> > > > > Jeff Johnson
> > > > > DSCOVR Ground System Development
> > > > > Space Weather Prediction Center
> > > > > address@hidden
> > > > > 303-497-6260
> > > > >
> > > > >
> > > > Russ Rew                                         UCAR Unidata Program
> > > > address@hidden                      http://www.unidata.ucar.edu
> > > >
> > > >
> > > >
> > >
> > >
> > > --
> > > Jeff Johnson
> > > DSCOVR Ground System Development
> > > Space Weather Prediction Center
> > > address@hidden
> > > 303-497-6260
> > >
> > >
> > Russ Rew                                         UCAR Unidata Program
> > address@hidden                      http://www.unidata.ucar.edu
> >
> >
> >
> 
> 
> --
> Jeff Johnson
> DSCOVR Ground System Development
> Space Weather Prediction Center
> address@hidden
> 303-497-6260
> 
> 
Russ Rew                                         UCAR Unidata Program
address@hidden                      http://www.unidata.ucar.edu



Ticket Details
===================
Ticket ID: BNA-191717
Department: Support netCDF
Priority: Normal
Status: Closed