What is data chunking? How can chunking help to organize large multidimensional datasets for both fast and flexible data access? How should chunk shapes and sizes be chosen? Can software such as netCDF-4 or HDF5 provide better defaults for chunking? If you're interested in those questions and some of the issues they raise, read on ...
Is anyone still there? OK, let's start with a real-world example of the improvements possible with chunking in netCDF-4. You may be surprised, as I was, by the results. Maybe looking at examples in this post will help provide some guidance for effective use of chunking in other similar cases.
First let's consider a single large 3D variable from the NCEP North American Regional Reanalysis representing air temperature (if you must know, it's at the 200 millibars level, at every 3 hours, on a 32.463 km resolution grid, over 33 years from 1979-01-01 through 2012-07-31):
dimensions: y = 277 ; x = 349 ; time = 98128 ; variables: float T(time, y, x);
Of course the file has lots of other metadata specifying units, coordinate system, and data provenance, but in terms of size it's mostly just that one big variable: 9.5 billion values comprising 38 GB of data.
This file is an example of PrettyBigData (PBD). Even though you can store it on a relatively cheap flash drive, it's too big to deal with quickly. Just copying it to a 7200 rpm spinning disk takes close to 20 minutes. Even copying to fast solid state disk (SSD) takes over 4 minutes. For a human-scale comparison, its close to the storage used for a blu-ray version of a typical movie, about 50 GB. (As an example of ReallyBigData (RBD), a data volume beyond the comprehension of ordinary humans, consider the 3D, 48 frame per second version of "The Hobbit, Director's Cut".)
Access Patterns make a difference
For now, let's ignore issues of compression, and just consider putting that file on a server and permitting remote access to small subsets of the data. Two common access patterns are:
- Get a 1D time-series of all the data from a specific spatial grid point.
- Get a 2D spatial cross section of all the data at a specific time.
The first kind of access is asking for the 1D array of values on one of the red lines, pictured on the left, below; the second is asking for the 2D array of values on one of the green planes pictured on the right.
With a conventional contiguous (index-order) storage layout, the time dimension varies most slowly, y varies faster, and x varies fastest. In this case, the spatial access is fast (0.013 sec) and the time series access is slow (180 sec, which is 14,000 times slower). If we instead want the time series to be quick, we can reorganize the data so x or y is the most slowly varying dimension and time varies fastest, resulting in fast time-series access (0.012 sec) and slow spatial access (200 sec, 17,000 times slower). In either case, the slow access is so slow that it makes the data essentially inaccessible for all practical purposes, e.g. in analysis or visualization.
But what if we want both kinds of access to be relatively fast? Well, we could punt and make two versions of the file, each organized appropriately for the kind of access desired. But that solution doesn't scale well. For N-dimensional data, you would need N copies of the data to support optimal access along any axis, and N! copies to support optimal access to any cross-section defined by some subset of the N dimensions.
A better solution, known for at least 30 years, is the use of chunking, storing multidimensional data in multi-dimensional rectangular chunks to speed up slow accesses at the cost of slowing down fast accesses. Programs that access chunked data can be oblivious to whether or how chunking is used. Chunking is supported in the HDF5 layer of netCDF-4 files, and is one of the features, along with per-chunk compression, that led to a proposal to use HDF5 as a storage layer for netCDF-4 in 2002.
Benefits of Chunking
I think the benefits of chunking are under-appreciated.
Large performance gains are possible with good choices of chunk shapes and sizes. Chunking also supports efficiently extending multidimensional data along multiple axes (in netCDF-4, this is called "multiple unlimited dimensions") as well as efficient per-chunk compression, so reading a subset of a compressed variable doesn't require uncompressing the whole variable.
So why isn't chunking more widely used? I think reasons include at least the following:
- Advice for how to choose chunk shapes and sizes for specific patterns of access is lacking.
- Default chunk shapes and sizes for libraries such as netCDF-4 and HDF5 work poorly in some common cases.
- It's costly to rewrite big datasets that use conventional contiguous layouts to use chunking instead. For example, even if you can fit the whole variable, uncompressed, in memory, chunking a 38GB variable can take 20 or 30 minutes.
This series of posts and better guidance in software documentation will begin to address the first problem. HDF5 already has a start with a white paper on chunking.
The second reason for under-use of chunking is not so easily addressed. Unfortunately, there are no general-purpose chunking defaults that are optimal for all uses. Different patterns of access lead to different chunk shapes and sizes for optimum access. Optimizing for a single specific pattern of access can degrade performance for other access patterns.
Finally, the cost of chunking data means that you either need to get it right when the data is created, or the data must be important enough that the cost of rechunking for many read accesses is justified. In the latter case, you may want to consider acquiring a computing platform with lots of memory and SSD, just for the purpose of rechunking important datasets.
What a difference the shape makes
We have claimed that good choices of chunk shapes and sizes can make large datasets useful for access in multiple ways. For the specific example we've chosen, how well do the netCDF-4 library defaults work, and what's the best we can do by tailoring the chunking to the specific access patterns we've chosen, 1D time series at a point and 2D spatial access at a specific time?
Here's a table of timings for various shapes and sizes of chunks, using conventional local 7200 rpm spinning disk with 4096-byte physical disk blocks, the kind of storage that's still prevalent on desk-top and departmental scale platforms:
(slowest / fastest)
|Contiguous favoring time range||0.013||180||14000|
|Contiguous favoring spatial slice||200||0.012||17000|
|Default (all axes equal) chunks, 4673 x 12 x 16||1.4||34||24|
|36 KB chunks, 92 x 9 x 11||2.4||1.7||1.4|
|8 KB chunks, 46 x 6 x 8||1.4||1.1||1.2|
We've already seen the timings in the first two rows of this table, showing huge performance bias when using contiguous layouts. The third row shows the current netCDF-4 default for chunking this data, choosing chunk sizes close to 4 MB and trying to equalize the number of chunks along any axis. This turns out not to be particularly good for trying to balance 1D and 2D accesses. The fourth row shows results of smaller chunk sizes, using shapes that provide a better balance between time series and spatial slice accesses for this dataset.
I think the last row of this table supports the main point to be made in this first posting on chunking data. By creating or rewriting important large multidimensional datasets using appropriate chunking, you can tailor their access characteristics to make them more useful. Proper use of chunking can support more than one common query pattern.
That's enough for now. In part 2, we'll discuss how to determine good chunk shapes, present a general way to balance access times for 1D and 2D accesses in 3D variables, say something about generalizations to higher dimension variables, and provide examples of rechunking times using the nccopy and h5repack utilities.
In later posts, we plan to offer opinions on related issues, possibly including
- chunk size tradeoffs: small chunks vs. large chunks
- chunking support in tools
- chunking and compression
- complexity of the general rechunking problem
- problems with big disk blocks
- worst case performance
- space-filling curves for improving access locality
In the meantime, Keep On Chunkin' ...
A note about timings
The times we quoted above are averaged from multiple consecutive runs on a desktop Linux system (2.27GHz Xeon CPU, 4 cores, 24 GB of memory, 7200 rpm disk on a SATA-3 bus). Each timing uses a different set of chunks, so we are not exploiting chunk caching. Before reading a subset of data from a file, we run a command to flush and clear all the disk caches in memory, so running the timing repeatedly yields nearly the same time. The command to clear disk caches varies from system to system. On Linux, it requires privileges to set the SUID bit on a shell script:
#!/bin/sh # Flush and clear the disk caches. sync sudo sh -c "echo 3 > /proc/sys/vm/drop_caches"On OSX, the only privilege required is knowledge of the "purge" command, which we wrap in a shell script to make our benchmarks work the same on OSX as on Linux:
#!/bin/sh # Flush and clear the disk caches. purge