Re: when will HDF5 support Unicode?


On 2005.05.10 10:22 Russ Rew wrote:
[...]all current ASCII-encoded names are
already UTF-8.

Unfortunately, this is not quite true.  People have been putting anything
they want in path names, including extended ASCII.  Bytes > 127 are not
necessarily legal UTF-8, so we can't just declare that all existing files
are UTF-8. (This doesn't harm the file or the library, but tools will
have problems if we tell them a name is UTF-8 and it isn't.)
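
To make the problem concrete, here is a small Python sketch (not HDF5
code; the helper name is_valid_utf8 and the example path names are made
up for illustration). It shows that a pure-ASCII name decodes as UTF-8,
while the same accented name stored in an extended-ASCII encoding such
as Latin-1 does not:

```python
def is_valid_utf8(name: bytes) -> bool:
    """Return True if the raw bytes of a name decode cleanly as UTF-8."""
    try:
        name.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


# Hypothetical path names, as raw bytes the way they would sit in a file.
ascii_name = b"/group/dataset"
latin1_name = "/group/temp\u00e9rature".encode("latin-1")  # e-acute as byte 0xE9
utf8_name = "/group/temp\u00e9rature".encode("utf-8")      # e-acute as 0xC3 0xA9

print(is_valid_utf8(ascii_name))   # pure ASCII is always valid UTF-8
print(is_valid_utf8(latin1_name))  # a lone 0xE9 is not a legal UTF-8 sequence
print(is_valid_utf8(utf8_name))
```

A tool that is told "this name is UTF-8" and then meets the Latin-1
bytes would hit exactly the decode error caught above.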


I don't know why you would want to support more than one encoding for
names,

We have many requests for non-English character sets, so it would
be nice to support them in the future. Between the above gotcha and the
desire to someday support other character sets, the idea is to make ASCII
and UTF-8 the first of possibly many encodings.
Since existing files may well contain non-UTF-8 names, ASCII must be the
default for backward compatibility.


At least one library change is needed to support UTF-8 encoded names,
specifically for iterating through dataset names in a Group in
"alphabetical order".  For names with non-ASCII characters, this order
should follow the Unicode collation algorithm.

My understanding is that the current proposal will sort objects by the
numeric value of the bytes in their names in all cases. I don't know
whether UTF-8 calls for a different collating order than this; if it
does, it won't be implemented at this time.
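
For reference, a small Python sketch of what byte-value ordering means
in practice (the names are invented for illustration, and this is not
how the library itself does it). Sorting UTF-8 names by their raw bytes
gives the same result as sorting by Unicode code point, which is a
property of the UTF-8 encoding. It is not the same as a dictionary-style
order, which is what the Unicode collation algorithm is for:

```python
# Hypothetical dataset names in one group.
names = ["Zeta", "alpha", "\u00e9clair", "beta"]

# Order by the numeric value of the UTF-8 bytes of each name.
by_bytes = sorted(names, key=lambda s: s.encode("utf-8"))

# Python compares str values by code point, so this is code point order.
by_code_points = sorted(names)

print(by_bytes)
print(by_bytes == by_code_points)
```

Here "Zeta" comes first because uppercase letters have smaller byte
values than lowercase ones, and the accented name comes last because
its first byte is > 127. A reader expecting dictionary order would
probably want alpha, beta, the accented name, then Zeta.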


I'm trying to determine if the proposed changes address your requirements
well enough to be worth doing.