Should array indices start at 0 or 1? My compromise of 0.5 was rejected without, I thought, proper consideration. --Stan Kelly-Bootle
Library software like netCDF or HDF5 provides access to multidimensional data by array indices, but we would often rather access data by coordinates, such as points on the Earth's surface or space-time bounding boxes.
The Climate and Forecast (CF) Metadata Conventions provide agreed-upon ways to represent coordinate systems, but do not address making access by coordinates practical and efficient. In particular, use of coordinate systems such as those described in the CF Conventions section "Two-Dimensional Latitude, Longitude, Coordinate Variables" may not provide a direct way to find array indices from latitude and longitude coordinates for various reasons. Examples of such data include:
- use of curvilinear grids that follow coastlines
- satellite swaths georeferenced to lat-lon grids
- coordinate systems with no CF grid_mapping_name attribute
- ungridded "station" data, such as point observations and soundings
Using a real-world (well, OK, an idealized spherical Earth) example, we'll see several ways to access data by coordinates, as well as how the use of appropriate software and data structures can greatly improve the efficiency of such access.
An example dataset
The example dataset we use is from the HYbrid Coordinate Ocean Model (HYCOM).
This dataset makes a good example for a couple of reasons:
- It's typical of many ocean, climate, and forecast model data products, with multiple variables on the same grid, and conforming to the CF Conventions for representing a coordinate system.
- It has a non-trivial geospatial coordinate system, represented by 2D latitude and longitude arrays that are auxiliary coordinates for other variables. The grid appears to be a subset of a tri-polar grid, but none of the parameters needed to derive a formula for which elements correspond to a specified coordinate are in the file.
Though not very relevant for the subject of this blog, the example dataset also makes good use of compression, with fill values over land for ocean data. The netCDF classic file is 102MB, but the equivalent netCDF-4 classic model file with compression is only 30MB. Nothing in this post has anything to do with the netCDF-4 data model; it all applies to classic model netCDF-3 or netCDF-4 data.
Here's some output, relevant to the problem we'll present, from "ncdump -h" applied to the example file:
dimensions:MT = UNLIMITED ; // (1 currently)
Y = 850 ;
X = 712 ;
Depth = 10 ;
double MT(MT) ;
MT:units = "days since 1900-12-31 00:00:00" ;
double Date(MT) ;
Date:units = "day as %Y%m%d.%f" ;
float Depth(Depth) ;
Depth:units = "m" ;
Depth:positive = "down" ;
int Y(Y) ;
Y:axis = "Y" ;
int X(X) ;
X:axis = "X" ;
float Latitude(Y, X) ;
Latitude:units = "degrees_north" ;
float Longitude(Y, X) ;
Longitude:units = "degrees_east" ;
float temperature(MT, Depth, Y, X) ;
temperature:coordinates = "Longitude Latitude Date" ;
temperature:standard_name = "sea_water_potential_temperature" ;
temperature:units = "degC" ;
float salinity(MT, Depth, Y, X) ;
salinity:coordinates = "Longitude Latitude Date" ;
salinity:standard_name = "sea_water_salinity" ;
salinity:units = "psu" ;
Plotting all of the latitude-longitude grid would result in a big blob of 600,000 pixels, but here's a sparser representation of the grid, showing every 25th line parallel to the X and Y axes:
Examples of coordinate queries
The temperature variable has auxiliary Longitude and Latitude coordinates, and we want to access data corresponding to their values, rather than X and Y array indices. Can we efficiently determine values on netCDF variables such as temperature or salinity at, for example, the grid point closest to 50 degrees north latitude and -140 degrees east longitude?
More generally, how can we efficiently
- determine the value of a variable nearest a specified location and time?
- access all the data on a subgrid defined within a specified space-time coordinate bounding box?
Why use an iPython notebook for the example?
For this example, the code for reading netCDF data and mapping from coordinate space to array index space is presented in an iPython notebook. Why not C, Java, Fortran, or some other language supported directly by Unidata?
The point is clarity. This is a pretty long blog, which has probably already collected several "TLDR" comments. The accompanying example code has to be short, clear, and easy to read.
The iPython notebook, which grew out of a session on reading netCDF data in the Unidata 2013 TDS-Python Workshop, is available from the 2013 Unidata TDS-Python Workshop, with which you can run examples and do timings on your own machine and even with your own data!
If you choose the second method, you can see the results of changing the examples or substituting your own data. To use the notebook you'll need to have Python and some additional libraries installed, as listed at the bottom of the README page on the referenced workshop site.
Just looking at the HTML version should be sufficient to see what's going on.
In the notebook, we implement and progressively improve several ways to access netCDF data based on coordinates instead of array indices:
- slow, simple, and sometimes wrong
- fast, simple, and sometimes wrong
- not quite as fast, but correct
- fast, correct, and scalable for many queries on the same set of data points
Naive, slow approach
Since our concrete problem is to find which of over 600,000 points is closest to the (lat, lon) point (50, -140), we should first say what is meant by "close".
A naive metric for distance squared between (lat,lon) and (lat0, lon0) just uses the "Euclidean norm", because it's easy to compute:
dist_squared = (lat - lat0)2 + (lon - lon0)2
So we need to find indices x and y such that the point (Latitude[y, x], Longitude[y, x]) is close to the desired point (lat0, lon0). The naive way to do that is to use two nested loops, checking distance squared for all 600,000+ pairs of y and x values, one point at a time.
The code for this version is presented as the function naive_slow and a few subsequent lines that call the function on our explicit example.
Note that the code cell examples all initially open the example file, read the data needed to find the desired values, and close the file after printing enough information to verify that we got the right answer. The work of opening the file, reading the data, and closing the file is repeated for each function so that the cells are independent of each other and can be run in any order or re-run without getting errors such as trying to read from closed files or closing the same file twice.
Faster with NumPy
The first approach uses conventional explicit loops, which makes it resemble Fortran or C code. It can be sped up by a factor of over 700 just by using NumPy, a Python library supporting multidimensional arrays and array-at-a-time operations and functions that often replace explicit loops with whole array statements. We call this version of the function naive_fast.
Like naive_slow, it suffers from flaws in the use of a "flat" measure of closeness.
Tunnel distance: still fast, but more correct
A problem with both of the previous versions, is that they treat the Earth as flat, with a degree of longitude near the poles just as large as a degree of longitude on the equator. It also treats the distance between points on the edges, such as (0.0, -179.99) and (0.0, +179.99) as large, even though these points are just across the International Date Line from each other.
One way to solve both these problems is to use a better metric, for example the length of a tunnel through the Earth between two points as distance, which happens to be easy to compute by just using a little trigonometry.
This results in a more correct solution that works even if longitudes are given in a different range, such as between 0 degrees and 360 degrees instead of -180 to 180, because we just use trigonometric functions of angles to compute tunnel distance. Wikipedia has the simple trig formulas we can use. This also gives us the same answer as the naive approach for our example point, but avoids erroneous answers we would get by treating the Earth as flat in the "naive" approaches.
Timing this approach, with the function we call tunnel_fast, because we still use fast NumPy arrays and functions instead of explicit loops, is still about 200 times as fast as the loopy naive approach.
Using KD-trees for More Scalable Speedups
Finally, we use a data structure specifically designed for quickly finding the closest point in a large set of points to a query point: the KD-tree (also called a "k-d tree"). Using a KD-tree is a two-step process.
First you load the KD-tree with the set of points within which you want to search for a closest point, typically the grid points or locations of a set of ungridded data. How long this takes depends on the number of points as well as their dimensionality, but is similar to the time it takes to sort N numbers, namely O(N log(N)). For the example dataset with over 600,000 points in the grid, this takes under 3 seconds on our laptop test platform, but it's noticeably longer than the setup time for the minimum tunnel-distance search.
The second step provides a query point and returns the closest point or points in the KD-tree to the query point, where how "closest" is defined can be varied. Here, we use 3D points on spherical Earth with unit radius. The combination of setup and query are initially implemented in a kdtree_fast function.
Although you can't tell by running the setup and query together, the kdtree_fast query is significantly faster than in the tunnel_fast search. Thus the kdtree_fast approach scales much better when we have one set, S, of points to search and lots of query points for which we want the closest point (or points) in S.
For example, we may want variable values on a 100 by 100 (lat,lon) subgrid of domain, and as we'll see, the KD-tree provides a much faster solution than either the tunnel_fast or naive_fast methods. If the locations don't change with data from multiple files, times, or layers, the setup time for the KD-tree can be reused for many point queries, providing significant efficiency benefits. For popular spatial datasets on the server side, the same KD-tree could be reused for multiple datasets that use the same coordinates.
The KD-tree data structure can also be used more flexibly to provide fast queries for the M closest points to a specified query point, which is useful for interpolating values instead of just using the value of a variable at a single closest point.
The KD-tree data structure is like a Swiss-army knife for coordinate data, analogous in some ways to regular expressions for strings. We have shown use of KD-trees for 2D lat/lon data and for 3D coordinates of points on a sphere with a tunnel distance metric. KD-trees can also be used with vertical coordinates or time, if you specify a reasonable way to define "close" between points in space-time.
Implementations of KD-trees are freely available for C, Java, Fortran, C++, and other languages than Python. A Java KD-tree implementation is used in ncWMS, the Web Map Service included in TDS servers but developed at the University of Reading. A C/C++ implementation of KD-trees is used in Fimex, a library from the Norwegian Meteorological Institute that supports interpolation and subsetting of geospatial data built around the Unidata Common Data Model. The module we used in the iPython notebook example, scipy.spatial.cKDTree, is a C implementation callable from Python, significantly faster than the pure Python implementation in scipy.spatial.KDTree.
The last section of the notebook re-implements our four functions as four object-oriented Python classes, with the setup code in a class constructor and a single query method. This makes it easier to time the initialization and query parts of each approach separately, which leads to simple formulas for the time required for N points and a prediction for when the extra setup time for kdtree_fast pays off.
And the Winner Is ...
The envelope, please. And the winner is ... YMMV!
The reasons Your Mileage May Vary include:
- number of points in search set
- dimensionality of search set
- complexity of distance metric
- number of query points
- number of closest neighbors needed, e.g. for interpolation
- I/O factors, including caching, compression, and remote access effects
- accounting for non-spherical Earth shape
We've summarized our timing results for the four approaches to the sample problem on the example dataset at the end of the iPython notebook. Perhaps we can look at some of the factors listed above in future blogs.
In case you didn't want to open the iPython notebook to see the results, here's some times on a fast laptop platform:
|Method||Setup (ms)||Query (ms)||Setup + 10000|
Determining when it's worth it to use a KD-tree for this example resulted in:
- kd_tree_fast outperforms naive_fast above 1050 queries
- kd_tree_fast outperforms tunnel_fast above 490 queries
The advantage of using KD-trees is much greater for more search set points, as KD-tree query complexity is O(log(N)), but the other algorithms are O(N), the same as the difference between using binary search versus linear search.
If you're interested in more on this subject, a good short paper on various ways to regrid data using KD-trees and other methods is Fast regridding of large, complex geospatial datasets by Jon Blower and A. Clegg.