Chunking Data: Choosing Shapes

In part 1, we explained what data chunking is about in the context of scientific data access libraries such as netCDF-4 and HDF5, presented a 38 GB 3-dimensional dataset as a motivating example, discussed benefits of chunking, and showed with some benchmarks what a huge difference chunk shapes can make in balancing read times for data that will be accessed in multiple ways.

In this post, I'll continue looking at that example dataset to see how we derived good chunk shapes, look at how long it can take to rechunk a multidimensional dataset, and consider the use of Solid State Disk (SSD) for both accessing and rechunking data.

Chunk shapes and sizes: You can't always get what you want

Tiling of two-dimensional data is often used to explain chunking, because it's easy to understand and to illustrate with simple figures. Common patterns of subgrid access for a 2D array include:

  1. accessing data by rows
  2. accessing data by columns
  3. accessing a rectangular subgrid of data from somewhere in the middle of the array

If data is stored contiguously by rows, then accessing by rows is very fast, accessing by columns is slow, and the speed of accessing subgrids depends on details of how many rows and columns are involved and the shape of the subgrid.  That's all still true if you interchange the words "row" and "column".

For 2-dimensional data, if you want to support equally frequent access by either rows or columns, then a natural solution is chunking the data into rectangular chunks (also known as tiles) so that reading a row requires the same number of disk accesses as reading a column. You can treat each chunk as if it were a single disk block that must be read completely to access any of its data values. An optimum solution is to make the chunks similar in shape to the entire array, so that the same number of chunks are required to read an entire row or an entire column.

For example, if you have a 277 x 349 array of values (the shape of a horizontal slice in the example dataset in part 1), you could use chunks of size 28 x 35, so that 10 chunks would be adequate to read all the values in any row or any column.  Notice there's a little overhang in the last column and last row of chunks, which really cover 280 x 350 values, but that's typically handled in the library and not something you have to worry about.  To get all the values in a row, you're really reading more data than you want, since each chunk has values from 28 rows, but that's OK because it's typically the number of disk accesses that make I/O slow, not the number of values read.  Also, if you're reading successive rows, caching the 35 chunks for a row in memory means you only have to read each chunk once.

If your chunks are smaller than a disk block you will still have to read a whole disk block, so it doesn't make much sense to choose chunks significantly smaller than a disk block. The number of bytes in a 28 x 35 block of floats (4 bytes each), is 3920 bytes, which is actually pretty close to the 4096 byte blocks used in many desktop file systems, so 28 x 35 is a pretty good shape and size for the parameters of this problem (we're ignoring compressed chunks for now).

Two-dimensional examples are often where the discussion of chunk shapes and sizes ends. However, the 2D case is too simple to generalize to 3D or higher, because equalizing access time along each axis is not necessarily the most common access pattern to be optimized in higher dimensions.  The real-world example presented in part 1 involved a (time, y, x) floating-point variable of shape 98128 x 277 x 349, in which we wanted to balance access to time series of shape 98128 x 1 x 1 and to horizontal slices of shape 1 x 277 x 349 so that neither type of access was ridiculously slow. If we just use chunk shapes of similar shape to the 3D variable, we might try 9813 x 28 x 35 chunks.  For that chunk shape, a time series will need 10 chunks but a horizontal slice will need 100 chunks, and take 10 times longer if the number of disk accesses are the predominant time (as they usually are).

A little algebra can be applied to this 3D case, setting the number of chunks accessed for a 1D time series equal to the number of chunks accessed for a 2D horizontal slice, leading to chunks of shape

98128/N2  by  277/N  by  349/N

where N4 is the total number of chunks used to partition the array. It turns out not to matter how the chunks are distributed along the x and y axes, as long as their product, the number of chunks in a horizontal slice, is the same as the number of chunks in a time series. So the most general formula for optimal chunk shapes for this access pattern has an arbitrary positive constant C, with chunk shapes

98128/N2  by  C*277/N  by  (1/C)*349/N  

Here's source code for a little Python function, chunk_shape_3D, that computes a good shape for this access pattern of equally frequent 1D and 2D accesses of a 3D variable. You provide a variable's shape as a list of dimension sizes, a desired chunk size that the resulting chunks should be close to without exceeding, and the external size in bytes of each element of the variable.  The function returns a "good" chunk shape. The function handles some details not described here, such as converting ideal shapes with fractional dimensions to practical shapes with whole-number dimensions. 

Examples of chunk shapes returned by the function are given in the following table for our 98128 x 277 x 349 variable, for various power-of-two chunk sizes that include the common cases of size 4 KB, 8KB, 1MB, and 4 MB, and assuming the variable values are 4 bytes each.


Desired chunk size

(bytes)

Actual chunk size

(bytes)

Chunk shape

(values)

Number of chunks per access

(time series, spatial slice) 

4096 3960 33 x 5 x 6 2974,  3304
8192 7728 46 x 6 x 7 2134,  2850
16384 16384 64 x 8 x 8 1533,  1540
1048576 1032000 516 x 20 x 25 191,  196
4194304 4189920 1032 x 29 x 35 96,  100

 

Rechunking? How Long Will That Take?

If you have some important datasets that are heavily accessed in complementary ways, such as the 1D and (n-1)D pattern presented here, and you're convinced that rechunking the data might be a good idea, the good news is that tools for rechunking are available: nccopy for netCDF-4 data and h5repack, which works for both HDF5 and netCDF-4 data. Note that you can specify chunking (and compression) for classic-model netCDF data, so you don't have to make use of features of the netCDF-4 data model to get the benefits of chunking.

The bad news is that rechunking BigData, or even only PrettyBigData, can take a lot of time. The problem is analogous to transposing a matrix that's too big to fit in memory all at once. But don't despair. The time it takes to rechunk a big dataset is often only a small multiple of time to copy the data from one disk file to another, especially when you have a lot of memory. Maybe some day rechunking datasets will be a cloud service, which would be especially convenient if your big datasets are already in the cloud.

Another issue to consider is compression. If your data is compressed in netCDF-4 or HDF5, it has to be chunked, because a chunk is the atomic unit of compression as well as disk access. Rechunking compressed data means reading each compressed chunk, uncompressing it, rechunking the uncompressed data, recompressing the new chunks, and writing them out to disk. Believe it or not, that can be faster than doing the same rechunking with uncompressed data, due to savings in disk I/O for data that is very compressible.

Here are some benchmarks for rechunking our example dataset, using nccopy or h5repack: 

Source chunks Destination chunks 

 nccopy: disk, SSD (minutes)

 h5repack: disk, SSD (minutes)

1032 x 29 x 35 516 x 20 x 25  7,  499,  38
1032 x 29 x 35 64 x 8 x 8 10,  10134, 43 
1032 x 29 x 35 46 x 6 x 7  11,  10144, 46
1032 x 29 x 35 33 x 5 x 6 12,  14 147, 49

Though nccopy is faster than h5repack for netCDF files, it could probably still be sped up significantly. However, the 4-minute time for rechunking to 1 MB chunks using SSD is about the same as copying the 38 GB file using SSD, so it's unlikely that could be sped up much.  But optimizing rechunking to smaller chunks might be a good project for a summer intern ...

If you're going to be a frequent rechunker, you'd be wise to get a machine with lots of memory and maybe lots of SSD instead of spinning disk. But there are some surprises with SSD, as we'll see in the next section.

Running these tests have helped to provide some tips on rechunking that weren't obvious to me.  First, if you want to rechunk data quickly, what form of source is best, in terms of chunking and compression?  Possibilities include unchunked contiguous data, a few large chunks, or lots of small chunks, and in each case, is it better to use compressed or uncompressed data as the source?  Benchmarks with the 38 GB example dataset suggest a few answers.  

First, rechunking from a contiguous layout is slow unless you can read the entire input file into memory.  Typically that's not practical, but you may have to start with contiguous data to get a chunked dataset with a few large chunks that you can use as source for experimenting with creating files with better chunk shapes.

It doesn't seem to matter much whether the input or output data is compressed, as I/O time will dominate the rechunking, as long as you have enough memory dedicated to chunk cache to hold both the compressed input and compressed output in memory.  If you lack that much memory, use of chunk cache may have to be tuned for optimum rechunking.  This is still somewhat of a dark art, but nccopy lets you specify how much memory to use for chunk caches as a command-line option.

Even with SSD, rechunking data takes a relatively long time.  Is it ever worth it?  I think it is, for important datasets that will be written once but read many times.  It's similar to justification for developing the new zlib-compatible zopfli compression algorithm, which is 100x slower than zlib for compression, but compresses 5% better, so it saves time on every access and is a win after about 20 accesses.  With the huge bias in access times discussed in part 1, rechunking is a win if it replaces only a few accesses in the "slow" order.

If Memory Serves: What's Going On with SSD?

If you do I/O intensive manipulations such as rechunking and if you can afford it, equip your machines with SSD in addition to spinning disk (but make sure your SSDs are designed to deal with power faults). How does using SSD compare with using conventional spinning disks? We've tried them for:

  • accessing contiguous data
  • accessing chunked data
  • rechunking data

Serial access with SSD can easily be 4 or 5 times faster than spinning disks, but that's not the only speedup SSD provides. Using SSD with traditional contiguous storage can make chunking the data unnecessary, because random access is so much faster in SSD than spinning disk. Here's an example of average time to access time series and horizontal slabs on our example dataset:


Storage layout,
chunk shapes
Read time
series (sec)
Read spatial
slice (sec)
Performance bias
(slowest / fastest)
Contiguous favoring time range 0.000030.00004 1.3
Contiguous favoring spatial slice 53 0.003 18000
1032 x 29 x 35 1.2 1.0 1.2
64 x 8 x 8 0.5 0.3 1.5
46 x 6 x 7 0.6 0.2 2.4
33 x 5 x 6 0.6 0.3 2.4

There's a bit of a mystery here. Why does using SSD make only one form of access (in red) to unchunked data slow, when spinning disk is slow for 2 forms of access to contiguous layout. And how can SSD access possibly make accessing a spatial slice as fast as a time range for data organized to favor time series access (also in red).  Stay tuned, to see if this is just a bug or if there's some logical explanation.

And remember to keep your APIs separated from your IPAs ...

Comments:

Hello, great blog.
I'm curious how the formula would look like for a 4D dataset ?

I have 25256 x 37 x 256 x 512 Array that needs an optimal chunking. What would you recommend?

Thanks appreciate it.

Posted by Michael on April 28, 2016 at 07:01 AM MDT #

Different chunk shapes are optimal for different patterns of access. For example, if you know that the data will primarily be accessed by fixed slices of the first dimension, optimal chunks would be in the shape 1 x 37 x 256 x 512 or a similar shape that fits within one physical disk block, such as 1 x 5 x 32 x 64, which for 4-byte values fits within a 64K disk block.

If there is no most common access pattern, let D be the number of values you want in a chunk (preferably <= what one disk access can read), then let c = (D/(25256 * 37 * 256 * 512)) ^ (1/4), and use a chunk shape resulting from multiplying each dimension size by c and truncating to an integer.

Posted by Russ on April 28, 2016 at 03:57 PM MDT #

Interesting post. Unfortunately I have a different situation and I don't know how to optimize data I/O. I have a huge array, 10 dimensions like this: 3x72x73x42x32x23x46x26x83x73
I have to write millions of values at a time on the array, but these data are not contiguous in any dimension! Furthermore, the way I have to write data is quite wired because first I have to read the value of the variable stored in the file, then update it (it's just a sum)and finally I write the new value at the same position on the array. I know the coordinates of all values to update before I had to write them. So at the moment I have this for loop where for each point in the array, I read value, update it and then write it on file. And this for millions of time. Is there any technique to optimize this ?

Posted by Massimiliano on September 16, 2016 at 03:38 AM MDT #

Would I be tagged as "lazy" if I ask you to update (or upload again) the link for the Python function "chunk<em>shape</em>3D" for optimal chunking?

Posted by feliperiosg on May 05, 2017 at 03:49 PM MDT #

Hi Russ,

thanks a lot for clarifying chunking in such a clear way!

At the moment, the chunk_shape_3D script is missing. Could you please link it again?

Best,
Hella

Posted by Hella on May 20, 2017 at 10:57 AM MDT #

The chunk<em>shape</em>3D.py script is now linked to this post again. Sorry for the interruption...

Posted by unidatanews on May 22, 2017 at 08:45 AM MDT #

Thanks for the updated link. In the meantime I updated the script to be used in R.
It works for 3D, 2D and even 1D chunking.

<h1>!-Source: https://bitbucket.csiro.au/projects/CMAR_RS/repos/netcdf-tools/browse/chunking/chunk_shape_3D.py?at=97c17c836371b07d9d08c9ffa05b3a8db623e0f1</h1></p>

<h1>!-Returns a 'good shape' for a 3D var, assuming balanced 1D, 2D access</h1>

<h2>'GOOD SHAPE' for chunks means that the number of chunks accessed</h2>

<h2>to read either kind of 1D or 2D subset is approximately equal, and</h2>

<h2>the size of each chunk (uncompressed) is no more than chunkSize, which is often a Disk Block Size.</h2>

<h1>!-Variable should be in the shape (T, X, Y), where:</h1>

<h2>1D subsets are of the form VAR[:,x,y], and</h2>

<h2>2D slices are of the form var[t,:,:], typically 1D time series and 2D spatial slices.</h2>

numVals = function(shape){if(length(shape)==0){1}else{last(cumprod(shape))}}
binlist = function(n,width){rev(as.integer(intToBits(n)[1:width]))}
perturbShape = function(shape,onbits){unlist(shape) + binlist(onbits,length(shape))}

CHUNK.3D <- function(diMa){

<h1>diMa=list(143,166,1081)</h1>

<h1>!-run: 'C:\Windows\system32>fsutil fsinfo ntfsinfo c:' as admin in Windows</h1>

<h1>!- look for the value in 'Bytes Per Physical Sector'... that's your CHUNKsIZE.max capability</h1>

chunkSize <- 4096 #-maximum chunksize desired [in bytes]
valSize <- 4 #-size of each data value [in bytes]

if(last(cumprod(diMa)) <= chunkSize/valSize){return(diMa)}else{
if(length(diMa)<=2){
NN = ifelse(length(diMa)==1,last(cumprod(diMa))/chunkSize*valSize
,sqrt(last(cumprod(diMa))/(chunkSize)*valSize))
cBest = as.integer(unlist(diMa)/NN)
return(cBest)
}else{

<h1>!-4096/4=1024 -> Bytes Per FileRecord Segment</h1>

varShape <- c(last(diMa),diMa[1:(length(diMa)-1)]) #-length 3 list of variable dimension sizes

rank = 3 # this is a special case of n-dimensional function chunk_shape
chunkVals = chunkSize/valSize # ideal number of values in a chunk
numChunks = last(cumprod(varShape))/chunkVals # ideal number of chunks
axisChunks = numChunks**(1/valSize) # ideal number of chunks along each 2D axis
#axisChunks = numChunks**0.25 # ideal number of chunks along each 2D axis

cFloor = list() # will be first estimate of good chunk shape
# cFloor = [varShape[0] // axisChunks**2, varShape[1] // axisChunks, varShape[2] // axisChunks]
# except that each chunk shape dimension must be at least 1
# chunkDim = max(1.0, varShape[0] // axisChunks**2)

if(varShape[[1]]/(axisChunks**2) < 1){ #-varShape[[1]] : [[1]] is your TIME.dimension
chunkDim = 1
axisChunks = axisChunks/sqrt(varShape[[1]]/(axisChunks**2))
}else{
chunkDim = floor(varShape[[1]]/(axisChunks**2))
cFloor = c(cFloor,chunkDim)
prod = 1 # factor to increase other dims if some must be increased to 1

<code>  for(i in 2:length(varShape)){
if(varShape[[i]]/axisChunks < 1){prod = prod * axisChunks/varShape[[i]]}
}
for(i in 2:length(varShape)){
if(varShape[[i]]/axisChunks < 1){chunkDim = 1
}else{chunkDim = floor((prod*varShape[[i]])/axisChunks)}
cFloor = c(cFloor,chunkDim)
}
</code>

}
# cFloor is typically too small, (numVals(cFloor) < chunkSize).
# Adding 1 to each shape dim results in chunks that are too large, (numVals(cCeil) > chunkSize).
# Want to just add 1 to some of the axes to get as close as possible to chunkSize without exceeding it.

# Here we use brute force, compute numVals(cCand) for all 2**rank candidates and return the one closest to chunkSize without exceeding it.

bestChunkSize = 0
cBest = cFloor

for(i in 0:7){
cCand = perturbShape(cFloor,i)
thisChunkSize = valSize * numVals(cCand)
if(bestChunkSize<thisChunkSize &amp; thisChunkSize<=chunkSize){
bestChunkSize = thisChunkSize
cBest = cCand # make a copy of best candidate so far
}
}

return(rev(cBest))
}
}
}

Posted by feliperiosg on May 22, 2017 at 12:31 PM MDT #

I made available a blunt adaptation for R, of the above py.function at:

<script src="https://gist.github.com/feliperiosg/30e5dc84f6c14e5e11221c63977df032.js&quot;></script></p>

It also valid for 3D, 2D, and 1D chunking.

Posted by feliperiosg on May 22, 2017 at 01:40 PM MDT #

Post a Comment:
  • HTML Syntax: Allowed
Unidata Developer's Blog
A weblog about software development by Unidata developers*
Unidata Developer's Blog
A weblog about software development by Unidata developers*

Welcome

FAQs

News@Unidata blog

Recent Entries:
Take a poll!

What if we had an ongoing user poll in here?

Browse By Topic
Browse by Topic
« July 2017
SunMonTueWedThuFriSat
      
1
2
4
5
6
7
8
9
11
12
13
14
15
16
18
19
20
21
22
23
25
26
27
28
29
30
31
     
Today