Indexes R'nt us

Gray's 3rd requirement for Scientific Data Management Systems:

3. Intelligent indexes: for efficiently subsetting and filtering the data.

The word index is overloaded here: so far I've been using it to describe accessing array subsets by array indices. A search index, by contrast, is a way to efficiently find subsets of data based on the data's values.

Scientific file formats like netCDF/HDF lack search indexes, but databases invariably provide them, typically implemented with B-trees and (to a lesser extent) hash tables. The DBA (database administrator) can add and delete search indexes to optimize selected queries, without breaking existing applications. That's only approximately true, since some applications require fast access or else they are in fact broken (if your ATM took an hour to decide whether you can withdraw money, you wouldn't use it). Nonetheless, the separation of search indexes from the logical view of datasets is a powerful abstraction that's partly responsible for the success of relational databases.
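To make that separation concrete, here's a tiny illustration using Python's built-in sqlite3 (the table and column names are invented for this example): the application's query text stays the same whether or not an index exists; the DBA's index changes only how fast it runs.

```python
# Adding or dropping a search index does not change the query, only its speed.
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE obs (time REAL, lat REAL, lon REAL, temp REAL)")
con.executemany("INSERT INTO obs VALUES (?, ?, ?, ?)",
                [(t, 40.0, -105.0, 280.0 + t % 10) for t in range(100_000)])

query = "SELECT count(*) FROM obs WHERE temp > 288"
print(con.execute(query).fetchone())      # answered by a full table scan

con.execute("CREATE INDEX idx_temp ON obs (temp)")
print(con.execute(query).fetchone())      # same query, now answered via the index
```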

There's nothing in the netCDF or HDF file libraries that creates "intelligent indexes" for you, but there's nothing that prevents other libraries from building search indexes on top of netCDF/HDF files, either. One could use an external index, or put the index in the same netCDF/HDF file as the data. The PyTables library apparently builds B-trees on top of HDF5 files, and seems to be a very competent and widely used implementation (I haven't actually used it, although I have looked inside some of the files). The fundamental data structure for PyTables is the "table", just as in relational databases. I don't think it supports relational joins or SQL, but it provides very fast indexed lookup in a single-file implementation, that is, the data and the index live in the same file.
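Since I haven't used PyTables myself, treat the following as a sketch based on its documentation rather than a tested recipe; the file name, column layout, and data are invented. The point is that the index is built on a column's values, and queries are expressed in terms of those values rather than array indices:

```python
# A sketch of a PyTables value-based query (untested; names and data invented).
import tables as tb

class Obs(tb.IsDescription):
    time = tb.Float64Col()
    lat  = tb.Float32Col()
    lon  = tb.Float32Col()
    temp = tb.Float32Col()

with tb.open_file("obs.h5", mode="w") as f:
    table = f.create_table("/", "obs", Obs)
    row = table.row
    for t in range(100_000):                  # some synthetic observations
        row["time"], row["lat"], row["lon"] = t, 40.0, -105.0
        row["temp"] = 280.0 + t % 10
        row.append()
    table.flush()

    table.cols.temp.create_index()            # build a search index on the column's values
    hot = table.read_where("temp > 288")      # value-based query, answered via the index
    print(len(hot))
```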

When would it be useful to build search indexes on top of netCDF files? A simple answer: when the cost of building the index can be amortized over enough queries to justify the extra work and space. If the index is only used once, it's not worth building. So it depends on what the expected queries are, and under what circumstances they can share the same index.

A more fundamental limit comes from the physics of spinning disks, i.e. the relative costs of sequential vs. random access. When using an index, one uses random access to skip around the file to the next desired subset of data. When is that faster than brute-force sequential reading of the entire file? If you can read a file sequentially at, say, 50 MB/sec, then it takes 10 seconds to read a 500 MB file. If a random access takes 10 msec, then one could do 1000 random accesses in those 10 seconds. If your objects are 10 KB in size and randomly distributed throughout the 500 MB file, then there are 50K objects in the file, and your index won't help you unless you want fewer than 1000 of them. One can say that (in this case) your index needs to discriminate down to 2% (1K / 50K) of the file or less to be useful.
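Here is the same break-even arithmetic written out; plug in your own disk's numbers.

```python
# Break-even between a sequential scan and indexed random access (numbers from the text).
seq_rate    = 50e6     # sequential read rate, bytes/sec
file_size   = 500e6    # file size, bytes
seek_time   = 0.010    # cost of one random access, seconds
object_size = 10e3     # size of one object, bytes

scan_time   = file_size / seq_rate      # 10 s to read the whole file
max_seeks   = scan_time / seek_time     # 1000 random accesses in that same time
n_objects   = file_size / object_size   # 50,000 objects in the file
selectivity = max_seeks / n_objects     # index must select <= 2% of objects to win

print(f"sequential scan: {scan_time:.0f} s; "
      f"index wins only below {selectivity:.0%} selectivity ({max_seeks:.0f} objects)")
```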

So what's the chance that, for some query, an index exists and has enough discrimination to do better than a sequential scan? When I think about this, questions start to fly around like pigeons: 
  • Are there indices that would be used by many researchers?
  • Are there indices that would be repeatedly used by a small group of researchers?
  • How much data do they want once they find what they are looking for?
  • Do they need all the data (e.g. for visualization), or do they really want to run some computation over it? 
  • Could the computation be sent to the data?
  • Is it a simple computation that can be expressed in some elegant algebra?
  • Is there a high-level query language or are we trying to satisfy file-at-a-time processing using array indexes?
  • How is the data subset organized? Is there data locality on disk?
  • What does the query actually return? 
  • How would a DBA recognize that an index would be justified?

And on and on. In the database community, lots of research has been done to answer these kinds of questions, but very little that I know of has been done for scientific data access.

In ignorance, all of us are experts, and so undaunted, I offer the following ways to proceed. 

First, I'm interested in the case when these files are big: hundreds of megabytes at least, and more typically collections of many files of a few gigabytes each, i.e. a few terabytes of data in all. I'd like to build servers that can eventually handle petabytes of data. Some numbers that I gathered for a talk at XLDB in October 2010: 

  •  NCAR Mass Store has 8 PB, adding 2.5 PB / year.
  •  NCAR Research Archive holds 55 TB in 600 datasets.
  •  NASA ESDIS has 4.2 PB / 4000 datasets / 175M files.
  •  NOAA CLASS has 30 PB now, 100 PB by 2015.
  •  IPCC AR4 has 35 TB / 75K files.
  •  IPCC AR5 will be ~2.5 PB.

The data in these archives is not going to be stuck into a relational database. Some hybrid file/database system may come along that dominates the problem, but there's nothing proven on the immediate horizon. So principle 1: canonical datasets will be kept in scientific file formats like netCDF/HDF. By canonical I mean that, no matter what else archival centers do, the data stored in scientific file formats will be considered correct, and will be used to validate other access methods: web services, query systems, cloud computation, etc.

Second, I'm interested in Earth science data, meaning data that is usually associated with time and spatial location (aka georeferenced data). The problems of accessing general scientific data will have to wait; for now let's build a system specialized to our community. How should one, in general, store such data, not knowing in advance how it might be used? So principle 2: partition the data by time. This means dividing your data into separate files by some period of time, perhaps days, months, or years. Make the files fairly large, to minimize the overhead of managing very large numbers of files. For current technology, I'd say something like 100 MB to 1 GB per file.
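As a sketch of what principle 2 might look like in practice (the directory layout and file-naming scheme below are invented for illustration, not a standard): one file per variable per month, plus a helper that maps a time range onto the files that cover it.

```python
# A sketch of time-partitioned storage (layout and naming are illustrative only):
# one netCDF file per variable per month, aiming for the 100 MB - 1 GB size range.
from datetime import date

def partition_path(root: str, var: str, year: int, month: int) -> str:
    # e.g. /archive/temperature/2010/temperature_2010-11.nc
    return f"{root}/{var}/{year}/{var}_{year}-{month:02d}.nc"

def files_for_range(root: str, var: str, start: date, end: date) -> list:
    """List the monthly files that cover the time range [start, end]."""
    paths, d = [], date(start.year, start.month, 1)
    while d <= end:
        paths.append(partition_path(root, var, d.year, d.month))
        d = date(d.year + (d.month == 12), d.month % 12 + 1, 1)   # advance one month
    return paths

# Files needed for a query spanning mid-Nov 2010 through early Feb 2011:
print(files_for_range("/archive", "temperature", date(2010, 11, 15), date(2011, 2, 3)))
```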

Third, abstractions for storing scientific data are needed at a higher level than multidimensional arrays. If you want to find things in a large collection of stuff, you need a way to identify a thing and to say what it is once you have it. In computer science, we have converged on the Object as the way to do this. In the abstract data modeling going on in the scientific community, we are calling these things Features. Features provide the higher-level abstraction needed to do intelligent indexing, since now it's clear what a query should return: feature objects. So principle 3: the data stored in scientific data files are collections of features. Multidimensional arrays might be the way a feature is handed to your program, but your program needs to be rewritten to understand features if you want it to scale to very large datasets.
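To make principle 3 concrete, here is a rough sketch of what a feature-level query interface might look like, as opposed to raw array indexing; the class and method names are invented, not an existing API. The essential point is that whatever search indexes exist hide behind the query, and the program only ever sees feature objects.

```python
# A rough sketch of a feature-level query interface (names invented, not an existing API).
from dataclasses import dataclass

@dataclass
class PointFeature:
    time: float                 # e.g. hours since some reference epoch
    lat: float
    lon: float
    values: dict                # variable name -> value at this point

class PointFeatureCollection:
    def __init__(self, features):
        # A real implementation would sit on netCDF/HDF files plus whatever
        # search indexes exist; this toy version just holds features in memory.
        self._features = list(features)

    def subset(self, lat_range, lon_range, time_range):
        """Return the features inside a lat/lon box and time range.

        The caller never sees array indices or index structures, only features."""
        return [f for f in self._features
                if lat_range[0] <= f.lat <= lat_range[1]
                and lon_range[0] <= f.lon <= lon_range[1]
                and time_range[0] <= f.time <= time_range[1]]
```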

Amazingly, storing data in scientific file formats, partitioning them by time, and using metadata conventions to describe the semantics of features is what data providers are (mostly) already doing. Wow, is that a coincidence or what? So, we will just have to build our most excellent next generation Earth Science Data Management System (ESDMS) on top of these best practices in this best of all possible worlds.

Comments:

I loved this blog post right up to the point where you reach principle 2: "partition the data by time". This is fine sometimes but completely wrong at other times, depending on the end use of the data. I have a blog post describing the problem, titled Data producers vs. data consumers:

http://mazamascience.com/WorkingWithData/?p=84

Right now, for instance, I am in the middle of reformatting 2 TB of HYSPLIT model output FROM a format that partitions by time TO a collection of NetCDF files organized by location and then Julian day. (The goal is to interactively extract location-specific climatologies.) Access times have gone from hours to under a second.

As you pointed out, the best way to store data depends on the questions we will ask of that data. Often, I find myself advocating duplicating the data so that we have both synoptic and location-specific organization.

Extra disk space is way cheaper than the human costs associated with excessive data access times.

Posted by Jonathan Callahan on June 14, 2011 at 12:33 PM MDT #
