Re: when will HDF5 support Unicode?

Bob,

> I want to double check to confirm that the proposed (pretty limited)
> changes are actually what you need.  This is a pretty messy change, 
> and I'm concerned that it's not really what you need.
> 
> With the long discussion, I've lost track of the trail, so I want
> to restate what we propose, and ask you guys to let us know if this
> is what you need.
> 
> 
> 1. flag the character encoding for links
> 
> Basically, the change means that there will be an information field
> in the HDF5 metadata that suggests that a link name should be interpreted
> as ASCII or UTF-8.  (It is already possible to store UTF-8 as the
> name of objects, but there is no standard way to know that was done.)
> 
> This will cover the names of datasets and groups (essentially path names),
> but not attributes.  I'm not sure about object references and the target
> of a soft link, etc., other places a path name might be used to create
> an object.

I think number 1. could also just be done as a documentation change,
where instead of saying that link names are represented (encoded) as
ASCII strings, you say link names are represented using UTF-8.  This
is backward compatible, because all current ASCII-encoded names are
already UTF-8.  
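
As a quick illustration of the compatibility claim, here is a minimal
Python sketch.  The path names and the helper is_backward_compatible()
are made up for the example and have nothing to do with the HDF5 API.

```python
def is_backward_compatible(name_bytes: bytes) -> bool:
    """Return True if an ASCII-encoded name decodes identically as UTF-8."""
    return name_bytes.decode("ascii") == name_bytes.decode("utf-8")


# Hypothetical link names, used only for illustration.
ascii_name = "/group1/dataset_A".encode("ascii")
utf8_name = "/donn\u00e9es/temp\u00e9rature".encode("utf-8")  # accented e's

# An existing ASCII name reads the same under a UTF-8 interpretation.
ascii_ok = is_backward_compatible(ascii_name)

# The reverse does not hold: an ASCII-only reader rejects non-ASCII bytes.
try:
    utf8_name.decode("ascii")
    ascii_reader_ok = True
except UnicodeDecodeError:
    ascii_reader_ok = False
```

So old files stay valid under the new wording, while names written
with non-ASCII characters need a reader that expects UTF-8.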

I don't know why you would want to support more than one encoding for
names, so you might as well just pick one, and UTF-8 would be the best
for this purpose because ASCII strings are UTF-8 strings.  Is there a
use case where more than one encoding for the names of datasets and
groups would be desirable or necessary?

At least one library change is needed to support UTF-8 encoded names,
specifically for iterating through dataset names in a Group in
"alphabetical order".  For names with non-ASCII characters, this order
should follow the Unicode collation algorithm.
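
To show why the ordering matters, here is a small Python sketch with
made-up names.  Sorting the raw UTF-8 bytes gives code point order,
which puts accented and uppercase letters in places a reader would not
expect.  The rough_collation_key() function is only a crude stand-in
(it strips accents and case); it is NOT the Unicode collation
algorithm, which would need a real implementation such as ICU.

```python
import unicodedata

# Hypothetical dataset names; "\u00e9clair" is "eclair" with an acute accent.
names = ["zebra", "apple", "\u00e9clair", "Banana"]

# Sorting UTF-8 bytes yields the same order as sorting by code point.
by_bytes = sorted(names, key=lambda s: s.encode("utf-8"))
by_code_point = sorted(names)


def rough_collation_key(s: str) -> str:
    """Crude approximation only: drop combining marks and fold case."""
    decomposed = unicodedata.normalize("NFKD", s)
    stripped = "".join(c for c in decomposed if not unicodedata.combining(c))
    return stripped.casefold()


roughly_collated = sorted(names, key=rough_collation_key)

print(by_bytes)          # uppercase first, accented name last
print(roughly_collated)  # closer to dictionary order
```

The point is only that byte order and "alphabetical order" diverge as
soon as names contain non-ASCII characters, so the library has to
decide which one it promises.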

> 2. add UTF-8 character encoding for character/string datatype
> 
> Again, it is already possible to store this data, but the change
> makes the datatype document the encoding. 
> 
> I think this doesn't cover the strings for enumerations, or the field
> names in a compound data type. I think they can have UTF-8 in them,
> but there won't be a flag to indicate the fact.
> 
> This change covers the content of datasets and attributes. 
> 
> We can't handle other encodings at this time.

This seems fine, since there are good reasons to support more than one
encoding for character data, unlike link names.  However, there
has to be a default encoding when one is not supplied.  Currently the
default encoding is ASCII, right?  You might consider whether it's
practical to make the default encoding UTF-8 instead.  I'm not sure
whether that would introduce any incompatibilities or not.  If it
does, then explicit specification for UTF-8 encoded data would be
required.
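
The trade-off can be sketched in a few lines of Python.  Everything
here is hypothetical: read_string() and DEFAULT_ENCODING are invented
for the example and do not correspond to any HDF5 function or
property, and the ASCII default is the assumption stated above.

```python
from typing import Optional

# Assumed default when no encoding is recorded with the datatype.
DEFAULT_ENCODING = "ascii"


def read_string(raw: bytes, encoding: Optional[str] = None) -> str:
    """Decode stored bytes using the recorded encoding, else the default."""
    return raw.decode(encoding or DEFAULT_ENCODING)


plain = b"temperature"
accented = "temp\u00e9rature".encode("utf-8")  # contains a non-ASCII character

# ASCII data reads the same whether the default is ASCII or UTF-8.
same_either_way = read_string(plain) == read_string(plain, "utf-8")

# UTF-8 data with no recorded encoding fails under an ASCII default...
try:
    read_string(accented)
    untagged_ok = True
except UnicodeDecodeError:
    untagged_ok = False

# ...but reads correctly once the encoding is specified explicitly.
tagged = read_string(accented, "utf-8")
```

With a UTF-8 default the middle case would succeed without the explicit
tag, which is the attraction of changing the default if it turns out to
be compatible.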

--Russ