In part 1, we explained what data chunking is about in the context of scientific data access libraries such as netCDF-4 and HDF5, presented a 38 GB 3-dimensional dataset as a motivating example, discussed benefits of chunking, and showed with some benchmarks what a huge difference chunk shapes can make in balancing read times for data that will be accessed in multiple ways.
In this post, I'll continue looking at that example dataset to see how we derived good chunk shapes, look at how long it can take to rechunk a multidimensional dataset, and consider the use of Solid State Disk (SSD) for both accessing and rechunking data.
Chunk shapes and sizes: You can't always get what you want
Tiling of two-dimensional data is often used to explain chunking, because it's easy to understand and to illustrate with simple figures. Common patterns of subgrid access for a 2D array include:
- accessing data by rows
- accessing data by columns
- accessing a rectangular subgrid of data from somewhere in the middle of the array
If data is stored contiguously by rows, then accessing by rows is very fast, accessing by columns is slow, and the speed of accessing subgrids depends on details of how many rows and columns are involved and the shape of the subgrid. That's all still true if you interchange the words "row" and "column".
For 2-dimensional data, if you want to support equally frequent access by either rows or columns, then a natural solution is chunking the data into rectangular chunks (also known as tiles) so that reading a row requires the same number of disk accesses as reading a column. You can treat each chunk as if it were a single disk block that must be read completely to access any of its data values. An optimum solution is to make the chunks similar in shape to the entire array, so that the same number of chunks are required to read an entire row or an entire column.
For example, if you have a 277 x 349 array of values (the shape of a horizontal slice in the example dataset in part 1), you could use chunks of size 28 x 35, so that 10 chunks would be adequate to read all the values in any row or any column. Notice there's a little overhang in the last column and last row of chunks, which really cover 280 x 350 values, but that's typically handled in the library and not something you have to worry about. To get all the values in a row, you're really reading more data than you want, since each chunk has values from 28 rows, but that's OK because it's typically the number of disk accesses that make I/O slow, not the number of values read. Also, if you're reading successive rows, caching the 35 chunks for a row in memory means you only have to read each chunk once.
If your chunks are smaller than a disk block you will still have to read a whole disk block, so it doesn't make much sense to choose chunks significantly smaller than a disk block. The number of bytes in a 28 x 35 block of floats (4 bytes each), is 3920 bytes, which is actually pretty close to the 4096 byte blocks used in many desktop file systems, so 28 x 35 is a pretty good shape and size for the parameters of this problem (we're ignoring compressed chunks for now).
Two-dimensional examples are often where the discussion of chunk shapes and sizes ends. However, the 2D case is too simple to generalize to 3D or higher, because equalizing access time along each axis is not necessarily the most common access pattern to be optimized in higher dimensions. The real-world example presented in part 1 involved a (time, y, x) floating-point variable of shape 98128 x 277 x 349, in which we wanted to balance access to time series of shape 98128 x 1 x 1 and to horizontal slices of shape 1 x 277 x 349 so that neither type of access was ridiculously slow. If we just use chunk shapes of similar shape to the 3D variable, we might try 9813 x 28 x 35 chunks. For that chunk shape, a time series will need 10 chunks but a horizontal slice will need 100 chunks, and take 10 times longer if the number of disk accesses are the predominant time (as they usually are).
A little algebra can be applied to this 3D case, setting the number of chunks accessed for a 1D time series equal to the number of chunks accessed for a 2D horizontal slice, leading to chunks of shape
98128/N2 by 277/N by 349/N
where N4 is the total number of chunks used to partition the array. It turns out not to matter how the chunks are distributed along the x and y axes, as long as their product, the number of chunks in a horizontal slice, is the same as the number of chunks in a time series. So the most general formula for optimal chunk shapes for this access pattern has an arbitrary positive constant C, with chunk shapes
98128/N2 by C*277/N by (1/C)*349/N
Here's source code for a little Python function, chunk_shape_3D, that computes a good shape for this access pattern of equally frequent 1D and 2D accesses of a 3D variable. You provide a variable's shape as a list of dimension sizes, a desired chunk size that the resulting chunks should be close to without exceeding, and the external size in bytes of each element of the variable. The function returns a "good" chunk shape. The function handles some details not described here, such as converting ideal shapes with fractional dimensions to practical shapes with whole-number dimensions.
Examples of chunk shapes returned by the function are given in the following table for our 98128 x 277 x 349 variable, for various power-of-two chunk sizes that include the common cases of size 4 KB, 8KB, 1MB, and 4 MB, and assuming the variable values are 4 bytes each.
Desired chunk size
Actual chunk size
Number of chunks per access
(time series, spatial slice)
|4096||3960||33 x 5 x 6||2974, 3304|
|8192||7728||46 x 6 x 7||2134, 2850|
|16384||16384||64 x 8 x 8||1533, 1540|
|1048576||1032000||516 x 20 x 25||191, 196|
|4194304||4189920||1032 x 29 x 35||96, 100|
Rechunking? How Long Will That Take?
If you have some important datasets that are heavily accessed in complementary ways, such as the 1D and (n-1)D pattern presented here, and you're convinced that rechunking the data might be a good idea, the good news is that tools for rechunking are available: nccopy for netCDF-4 data and h5repack, which works for both HDF5 and netCDF-4 data. Note that you can specify chunking (and compression) for classic-model netCDF data, so you don't have to make use of features of the netCDF-4 data model to get the benefits of chunking.
The bad news is that rechunking BigData, or even only PrettyBigData, can take a lot of time. The problem is analogous to transposing a matrix that's too big to fit in memory all at once. But don't despair. The time it takes to rechunk a big dataset is often only a small multiple of time to copy the data from one disk file to another, especially when you have a lot of memory. Maybe some day rechunking datasets will be a cloud service, which would be especially convenient if your big datasets are already in the cloud.
Another issue to consider is compression. If your data is compressed in netCDF-4 or HDF5, it has to be chunked, because a chunk is the atomic unit of compression as well as disk access. Rechunking compressed data means reading each compressed chunk, uncompressing it, rechunking the uncompressed data, recompressing the new chunks, and writing them out to disk. Believe it or not, that can be faster than doing the same rechunking with uncompressed data, due to savings in disk I/O for data that is very compressible.
Here are some benchmarks for rechunking our example dataset, using nccopy or h5repack:
|Source chunks||Destination chunks|
nccopy: disk, SSD (minutes)
h5repack: disk, SSD (minutes)
|1032 x 29 x 35||516 x 20 x 25||7, 4||99, 38|
|1032 x 29 x 35||64 x 8 x 8||10, 10||134, 43|
|1032 x 29 x 35||46 x 6 x 7||11, 10||144, 46|
|1032 x 29 x 35||33 x 5 x 6||12, 14||147, 49|
Though nccopy is faster than h5repack for netCDF files, it could probably still be sped up significantly. However, the 4-minute time for rechunking to 1 MB chunks using SSD is about the same as copying the 38 GB file using SSD, so it's unlikely that could be sped up much. But optimizing rechunking to smaller chunks might be a good project for a summer intern ...
If you're going to be a frequent rechunker, you'd be wise to get a machine with lots of memory and maybe lots of SSD instead of spinning disk. But there are some surprises with SSD, as we'll see in the next section.
Running these tests have helped to provide some tips on rechunking that weren't obvious to me. First, if you want to rechunk data quickly, what form of source is best, in terms of chunking and compression? Possibilities include unchunked contiguous data, a few large chunks, or lots of small chunks, and in each case, is it better to use compressed or uncompressed data as the source? Benchmarks with the 38 GB example dataset suggest a few answers.
First, rechunking from a contiguous layout is slow unless you can read the entire input file into memory. Typically that's not practical, but you may have to start with contiguous data to get a chunked dataset with a few large chunks that you can use as source for experimenting with creating files with better chunk shapes.
It doesn't seem to matter much whether the input or output data is compressed, as I/O time will dominate the rechunking, as long as you have enough memory dedicated to chunk cache to hold both the compressed input and compressed output in memory. If you lack that much memory, use of chunk cache may have to be tuned for optimum rechunking. This is still somewhat of a dark art, but nccopy lets you specify how much memory to use for chunk caches as a command-line option.
Even with SSD, rechunking data takes a relatively long time. Is it ever worth it? I think it is, for important datasets that will be written once but read many times. It's similar to justification for developing the new zlib-compatible zopfli compression algorithm, which is 100x slower than zlib for compression, but compresses 5% better, so it saves time on every access and is a win after about 20 accesses. With the huge bias in access times discussed in part 1, rechunking is a win if it replaces only a few accesses in the "slow" order.
If Memory Serves: What's Going On with SSD?
If you do I/O intensive manipulations such as rechunking and if you can afford it, equip your machines with SSD in addition to spinning disk (but make sure your SSDs are designed to deal with power faults). How does using SSD compare with using conventional spinning disks? We've tried them for:
- accessing contiguous data
- accessing chunked data
- rechunking data
Serial access with SSD can easily be 4 or 5 times faster than spinning disks, but that's not the only speedup SSD provides. Using SSD with traditional contiguous storage can make chunking the data unnecessary, because random access is so much faster in SSD than spinning disk. Here's an example of average time to access time series and horizontal slabs on our example dataset:
(slowest / fastest)
|Contiguous favoring time range||0.00003||0.00004||1.3|
|Contiguous favoring spatial slice||53||0.003||18000|
|1032 x 29 x 35||1.2||1.0||1.2|
|64 x 8 x 8||0.5||0.3||1.5|
|46 x 6 x 7||0.6||0.2||2.4|
|33 x 5 x 6||0.6||0.3||2.4|
There's a bit of a mystery here. Why does using SSD make only one form of access (in red) to unchunked data slow, when spinning disk is slow for 2 forms of access to contiguous layout. And how can SSD access possibly make accessing a spatial slice as fast as a time range for data organized to favor time series access (also in red). Stay tuned, to see if this is just a bug or if there's some logical explanation.
And remember to keep your APIs separated from your IPAs ...