Re: [bufrtables] Summary report on the suitability of GRIB/BUFR for archiving data

Hi Enrico:

I have a few clarifying comments, below.

On 3/29/2011 4:26 AM, Enrico Zini wrote:
Apologies for the long hiatus on this list.

I have written a brief report about BUFR/GRIB with a (possibly
controversial) recommendation. Feel free to forward to anyone who
might be interested.

http://www.unidata.ucar.edu/staff/caron/bufr/Summary.html
Hello,

from the experience[1][2][3] I have with BUFR messages, I see a few
problems with your proposal:

  1. it would imply that BUFR decoding can only happen when/where there
     is network connectivity and the central server is working. I am not
     comfortable tying a long-lived archive to the existence of a 3rd
     party server;

I think that a software library or met center would want to cache its own copies of the tables that it uses. The central server would only be consulted for tables that have never been seen before.
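A minimal sketch of that cache-first scheme, assuming a hypothetical fetch callback that performs the actual request to the central table server (the function and parameter names here are illustrative, not an existing API):

```python
import os

def load_table(table_hash, fetch, cache_dir):
    """Return table contents for table_hash, consulting the local cache first.

    fetch is a callback that retrieves a table from the central server;
    it is only invoked for hashes never seen before.
    """
    path = os.path.join(cache_dir, table_hash)
    if os.path.exists(path):
        with open(path) as f:
            return f.read()           # cache hit: no network needed
    data = fetch(table_hash)          # cache miss: ask the central server
    os.makedirs(cache_dir, exist_ok=True)
    with open(path, "w") as f:
        f.write(data)                 # store for future offline decoding
    return data
```

After the first lookup the table is local, so later decoding of the same messages works without connectivity.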

  2. alternatively, the archive needs to store, and keep up to date, an
     entire mirror of all the tables mentioned by all the BUFRs it
     contains, and that is more or less what we already have, barring the
     proposal to standardise a file format for storing tables.
     But if you retrofit the system that we have now with a standard file
     format for tables and a working central repository, you basically
     fix it without the need for hash codes;

There is no central archive that I know of that handles local tables. Now that the WMO is publishing machine-readable tables, and as those tables come into actual use, the current chaos around the WMO standard tables will be greatly reduced.

However, there is no way to know for sure, now or in the future, whether the writer actually used the correct table. This happens a lot more often than you might expect, especially in conjunction with local table use. I run across GRIB and BUFR records where the writers no longer know which tables were used when the files were written; they only know which tables are currently used in an operational setting.

  3. 16 bits (0-65535) are IMO not that big a hash space: when you allow
     everyone to create new tables at will, things may degenerate
     quickly.

The MD5 checksum is 16 bytes (128 bits), not 16 bits.
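The size difference is easy to check directly; MD5 yields a 128-bit digest, so the hash space is 2**128 rather than 2**16:

```python
import hashlib

# MD5 of some (arbitrary) table file contents
digest = hashlib.md5(b"example BUFR table contents").digest()
print(len(digest))   # -> 16 bytes, i.e. 128 bits
```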

But the biggest problem I have is this: you do need to maximise reuse of
BUFR table codes, otherwise the problem of making sense of the decoded
data is not machine computable anymore.

I am maintaining software that not only decodes BUFR bulletins, but also
tries to make sense of them: for example, it can understand that a given
decoded value is a temperature, that it is sampled at a given vertical
level and that it went through a given kind of statistical processing.
That is, it can decode a bulletin and say:

   "There is a temperature reading at 2 meters above ground, maximum over
   12 hours."

This interpreted information can be used by meteorologists without
having to be aware that temperatures can come as B12001, B12101, B12111,
B12112, B12114..B12119 or whatever else. Where I work, the ability to do
this is considered very valuable, as it allows readings from different
sources to be compared uniformly.
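The normalisation step being described can be sketched roughly as follows. This is an illustrative toy, not Enrico's actual software; the B codes in the mapping are the temperature descriptors mentioned above, and everything else is assumed:

```python
# Map several equivalent BUFR descriptors onto one canonical quantity,
# so downstream users never see the raw B code numbers.
CANONICAL_QUANTITY = {
    "B12001": "temperature",
    "B12101": "temperature",
    "B12111": "temperature",
    "B12112": "temperature",
}

def interpret(descriptor, level, stat):
    """Render a decoded value as a human-readable, descriptor-independent string."""
    name = CANONICAL_QUANTITY.get(descriptor, descriptor)
    return f"{name} reading at {level}, {stat}"

print(interpret("B12111", "2 meters above ground", "maximum over 12 hours"))
# -> temperature reading at 2 meters above ground, maximum over 12 hours
```

The whole approach stands or falls on the mapping table, which is only possible when the B codes in circulation are shared and stable; that is the point of the paragraphs that follow.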

Yes, I sympathize with this, we are trying to do the same with our software.

If you have a process where data sharing across centers has to use
well-standardised, well-known tables (as well as some reasonable
standards, or even just practices, for laying out BUFR templates), you
can code (I have coded) that sort of interpretation in software. If
instead anyone can at any point start distributing BUFRs that use
any B code they want to represent temperature, then the only way to make
sense of a decoded bulletin is to have it personally read by an
experienced meteorologist.

Even if you don't want machine interpretation of the bulletins, if the
lifetime of the archive is long enough then its data can potentially
outlive the availability of experienced meteorologists who can remember
how to make sense of them.

To have a long lived archive, IMO what is needed are pervasive
standards, stable over time. Instead of designing for chaos, I'd rather
see how to make coordination work: propose a standard file format for
distributing tables;

do you mean local tables?

propose the creation of a repository from which to
download the WMO standard tables;

At this point the WMO has this; there are some ongoing efforts to validate the new XML and CSV formats. These are pretty good now, AFAICT.

propose a process for submission of new
table entries, akin to what happens with submissions of new code points
to Unicode, or new locales to ISO. My feeling is that something like
Unicode is the kind of thing to model BUFR tables on.

Of course chaos should still be supported, because scientists must have
full freedom of experimentation. But there are already local table
numbers that can be used for that, and once the experiments are
successful the new entries can be submitted for a new version of the
shared tables, so that the shared language can grow.


[1] http://www.arpa.emr.it/dettaglio_documento.asp?id=2927&idlivello=64
[2] http://www.arpa.emr.it/dettaglio_documento.asp?id=514&idlivello=64
[3] http://www.arpa.emr.it/dettaglio_documento.asp?id=1172&idlivello=64

Ciao,

Enrico

Thanks for your comments, your software looks like a nice contribution.

The question I would ask you to consider is: how do you ensure that the reader and writer are using the same tables? What would you do if some congressional committee investigating "fraud science" asked you to prove that?
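One hedged way to make that provable, sketched under the assumption that the archive can store a small provenance record alongside each message (the record layout and function names here are hypothetical):

```python
import hashlib
import json

def table_fingerprint(table_bytes):
    """MD5 hex digest of the exact table file the writer used."""
    return hashlib.md5(table_bytes).hexdigest()

def provenance_record(message_id, tables):
    """Build a provenance record for an archived message.

    tables: mapping of table name -> file contents (bytes).
    Storing this next to the message lets a future reader verify,
    byte for byte, which tables were in effect at write time.
    """
    return json.dumps({
        "message": message_id,
        "tables": {name: table_fingerprint(data)
                   for name, data in tables.items()},
    }, sort_keys=True)
```

A reader (or an auditor) recomputes the fingerprints of its candidate tables and compares them against the stored record; a mismatch proves the wrong table is being used.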