
Re: questions about data formats



Dejan Vucinic wrote:

Dear John,

On the wiki.opendap.org site there is a web page you wrote that
has a comparison of HDF5, Netcdf-3, Netcdf-Java-2.2, and
OpenDAP-2.0 data formats.

I am writing a brain imaging application, and am at a point where
I need to move away from the naive data formats I've been using
so far to something more robust and future-proof.  After spending
several days trying to figure out the ins and outs of the
various standardized data formats, I feel like my head is about to
explode.  Since you seem to have given the topic a great deal of
thought, I wondered if you'd be willing to give advice on which
format might be the best choice.

If I understand things correctly, formats such as netCDF were
developed for a particular application, such as geoscience data,
and then expanded to accommodate a greater variety of data types
and structure.  HDF5, on the other hand, seems to be more like a
meta-format, in that just about any other data format can be
encoded within it.  The latter, therefore, seemed like the best
choice until I realized that the only way to read and write it
is from C, i.e. the Java code uses a native library to get at it!
Call me old-fashioned, but I don't like the idea of a
language-bound data format; to me, the whole point of a
standardized data format is portability, first and foremost.
Then I came across the NJ22 library, which purports to handle
HDF5 in pure Java.  Now I'm truly confused.

I would be very grateful if you could take the time to summarize
what all this is about, and to answer this question simply: if
you had a clean-slate design today, a brand new application for
a brand new way of data acquisition, which data format would you
adopt?


Hi Dejan:

NetCDF is an array-oriented file format, quite simple and robust, and suitable for any scientific data that needs multidimensional arrays.
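
To give a feel for the API, here is a minimal read sketch against the NJ22 library; the file name "scan.nc" and the variable name "intensity" are placeholders, not anything from a real dataset:

    import java.io.IOException;
    import ucar.ma2.Array;
    import ucar.nc2.NetcdfFile;
    import ucar.nc2.Variable;

    public class ReadNetcdf {
        public static void main(String[] args) throws IOException {
            // "scan.nc" and "intensity" are placeholder names
            NetcdfFile ncfile = NetcdfFile.open("scan.nc");
            try {
                Variable v = ncfile.findVariable("intensity");
                if (v == null) {
                    System.err.println("no such variable");
                    return;
                }
                Array data = v.read();  // reads the whole multidimensional array
                System.out.println("rank " + v.getRank() + ", " + data.getSize() + " values");
            } finally {
                ncfile.close();
            }
        }
    }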

HDF5 has a similar purpose, but adds a number of other features, e.g. structures, along with storage optimizations such as packing and chunking. HDF5 can thus be more efficient, but its APIs are more complex. Unfortunately, the file format has also grown complex, and until recently it could only be accessed through the HDF5 C library. It is a typical case where, in the heat of development, the implementation library gets confused with the standard.

We are working on merging netCDF and HDF5; the result will be called netCDF-4. It will use HDF5 as the file format, with an extended netCDF API, and our C library will use the HDF5 C library underneath. We also need a 100% pure Java netCDF-4 library, so we are creating a pure Java reader for HDF5. Since netCDF-4 uses only a subset of HDF5, we really only need to read that subset; still, we are trying to read as much as possible, given resource constraints. We expect to read most HDF5 files, perhaps all, but we are unsure as of now.
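
One consequence, if the pure Java reader works out: an HDF5 file should open through the same NetcdfFile call as a netCDF file. A sketch under that assumption, with "brain.h5" as a placeholder name, and assuming the file only uses features the reader covers:

    import java.io.IOException;
    import ucar.nc2.NetcdfFile;

    public class ReadHdf5 {
        public static void main(String[] args) throws IOException {
            // "brain.h5" is a placeholder; whether it opens depends on how
            // much of the HDF5 format the pure Java reader ends up covering
            NetcdfFile ncfile = NetcdfFile.open("brain.h5");
            try {
                System.out.println(ncfile);  // prints the file structure in CDL form
            } finally {
                ncfile.close();
            }
        }
    }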

To answer your question, I would use netCDF-3 if it's adequate for your needs. That probably boils down to whether you really need packing to save space, and what your read access patterns will be like. Once you understand how the data is laid out on disk, it's not hard to see which read patterns will be efficient. HDF5 is much less intuitive, but if you use chunking, it can do better on average for arbitrary accesses. Testing is the only sure way to tell.
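
As a starting point for that testing, here is a crude timing sketch using the NJ22 API. The file "scan.nc" and variable "intensity" are again placeholders, and the variable is assumed to be 3-dimensional; comparing one full read against repeated slice reads gives a first feel for how the layout interacts with your access pattern:

    import java.io.IOException;
    import ucar.ma2.Array;
    import ucar.ma2.InvalidRangeException;
    import ucar.nc2.NetcdfFile;
    import ucar.nc2.Variable;

    public class AccessPatternTest {
        public static void main(String[] args) throws IOException, InvalidRangeException {
            NetcdfFile ncfile = NetcdfFile.open("scan.nc");      // placeholder file
            try {
                Variable v = ncfile.findVariable("intensity");   // placeholder, assumed 3D
                int[] shape = v.getShape();

                long t0 = System.currentTimeMillis();
                Array all = v.read();                            // one big sequential read
                System.out.println("full read: " + (System.currentTimeMillis() - t0)
                        + " ms, " + all.getSize() + " values");

                long t1 = System.currentTimeMillis();
                for (int i = 0; i < shape[0]; i++) {
                    // one 2D slice per iteration: origin {i,0,0}, shape {1,ny,nz}
                    v.read(new int[] {i, 0, 0}, new int[] {1, shape[1], shape[2]});
                }
                System.out.println(shape[0] + " slice reads: "
                        + (System.currentTimeMillis() - t1) + " ms");
            } finally {
                ncfile.close();
            }
        }
    }

Slicing along the first (slowest-varying) dimension reads contiguous bytes in a netCDF-3 file, so it should be cheap; slicing along the last dimension forces strided reads, and that is where chunked HDF5 can win.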

If you do need HDF5, I would use netCDF-4, although you should know that it's still alpha and not yet usable for production.

Hope that's enough info to get you started. Good luck!

John


Best regards,

Dejan Vucinic
address@hidden