Re: when will HDF5 support Unicode?

"Robert E. McGrath" <mcgrath@xxxxxxxxxxxxx> wrote:
> On 2005.05.10 10:22 Russ Rew wrote:
> > [...]all current ASCII-encoded names are
> > already UTF-8.
> 
> Unfortunately this is not quite true.  People have been putting anything
> they want in path names including extended ASCII.  The bytes > 127
> are not necessarily legal UTF-8, so we can't just say all existing files
> are UTF-8, unfortunately. (This doesn't harm the file or library, but
> tools will have problems if we tell them it's UTF-8 and it isn't.)

OK, you're right, that's a good reason to require that UTF-8 be
specified explicitly rather than making it the default encoding.
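
To make the problem concrete, here is a minimal Python sketch (not HDF5
code; the helper name is_valid_utf8 is made up for illustration) that
checks whether the raw bytes of a name decode as strict UTF-8.  Pure
ASCII always passes, but a Latin-1 style "extended ASCII" byte such as
0xE9 on its own does not:

```python
def is_valid_utf8(raw: bytes) -> bool:
    """Return True if the byte string decodes as strict UTF-8.

    Hypothetical helper for illustration only; not part of any HDF5 API.
    """
    try:
        raw.decode("utf-8", errors="strict")
        return True
    except UnicodeDecodeError:
        return False


# A plain 7-bit ASCII name: every byte is < 128, so it is also valid UTF-8.
ascii_name = b"temperature"

# The same word with an accented letter, stored in Latin-1 ("extended ASCII").
# The single byte 0xE9 is not a complete UTF-8 sequence.
latin1_name = "temp\u00e9rature".encode("latin-1")

# The same word stored as UTF-8: the accented letter becomes two bytes.
utf8_name = "temp\u00e9rature".encode("utf-8")

for label, raw in [("ascii", ascii_name),
                   ("latin-1", latin1_name),
                   ("utf-8", utf8_name)]:
    print(label, raw, is_valid_utf8(raw))
```

A tool that is told "this name is UTF-8" and then meets the Latin-1
bytes above would hit exactly this decode failure, which is why an
explicit per-name encoding flag seems safer than a blanket assumption.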

> > I don't know why you would want to support more than one encoding for
> > names,
> 
> We have many requests for non-English character sets, so it would
> be nice to support them in the future. 
> Between the above gotcha and the desire to someday support other
> char sets, the idea is to make ASCII and UTF-8 be the first of possibly 
> many.
> Since existing files may well have non-UTF8 in them, ASCII must be the
> default for backward compatibility.

Right, but has anyone requested any character sets not supported by
Unicode (ISO 10646)?  UTF-8 is a complete encoding for Unicode, so it
supports all Unicode characters.  I would be surprised if there is a
need for any non-Unicode characters in names, so UTF-8 should be
sufficient.  So I now see why you would need to support both ASCII and
UTF-8 encodings for names, but I'm not convinced you need any other
encodings just for names (although you will for *data*).
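
As a small illustration of why UTF-8 alone should cover names, the
Python sketch below round-trips a few sample names in different scripts
through UTF-8 and checks that a pure ASCII name has identical bytes in
both encodings.  The sample names are my own, chosen only for the
example:

```python
# Sample names in several scripts (illustrative only).
samples = [
    "temperature",                     # ASCII
    "temp\u00e9rature",                # Latin with an accent
    "\u6e29\u5ea6",                    # CJK
    "\u0442\u0435\u043c\u043f\u0435\u0440\u0430\u0442\u0443\u0440\u0430",  # Cyrillic
]

# Every name survives an encode/decode round trip through UTF-8.
round_trips = all(name.encode("utf-8").decode("utf-8") == name
                  for name in samples)

# For pure ASCII text, the UTF-8 bytes are identical to the ASCII bytes.
ascii_same = "temperature".encode("utf-8") == "temperature".encode("ascii")

# The highest Unicode code point, U+10FFFF, still fits in four UTF-8 bytes.
max_len = len(chr(0x10FFFF).encode("utf-8"))

for name in samples:
    print(name, len(name), "characters,", len(name.encode("utf-8")), "bytes")
print(round_trips, ascii_same, max_len)
```

Note that the byte length of a name can exceed its character count,
which matters for any fixed-size name buffers.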

> > At least one library change is needed to support UTF-8 encoded names,
> > specifically for iterating through dataset names in a Group in
> > "alphabetical order".  For names with non-ASCII characters, this order
> > should follow the Unicode collation algorithm.
> 
> My understanding is that the current proposal will sort the objects
> by numeric value of the bytes in the names for all cases.  I don't
> know if UTF-8 has a different collating order than this, if so, it
> won't be implemented at this time.

OK, but portable open-source software is available for sorting using
the Unicode collation algorithm in case you change your mind.
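
On whether the two orders differ: sorting UTF-8 names by the numeric
value of their bytes gives the same result as sorting by Unicode code
point, and that is generally not the order the Unicode collation
algorithm would produce.  The Python sketch below shows the difference
on a few made-up names.  Note that rough_key is only a crude stand-in
(strip accents, fold case) and is NOT the Unicode collation algorithm;
real collation would need a dedicated library such as ICU:

```python
import unicodedata

# Made-up names for illustration.
names = ["zeta", "Alpha", "beta", "\u00e9clair", "Zulu"]

# Order by the raw UTF-8 bytes, as in the current proposal.
byte_order = sorted(names, key=lambda s: s.encode("utf-8"))

# Python compares str values by code point, so this is code point order.
codepoint_order = sorted(names)


def rough_key(s):
    """Crude approximation of a human-friendly order.

    ASSUMPTION: this is not the Unicode collation algorithm.  It only
    strips combining marks and folds case, then breaks ties on the
    original string.
    """
    decomposed = unicodedata.normalize("NFD", s)
    stripped = "".join(c for c in decomposed if not unicodedata.combining(c))
    return (stripped.casefold(), s)


rough_order = sorted(names, key=rough_key)

print(byte_order)
print(codepoint_order)
print(rough_order)
```

In byte order all the uppercase names come first and the accented name
comes last, whereas the rough approximation interleaves them the way a
reader would probably expect.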

> I'm trying to determine if the proposed changes address your
> requirements well enough to be worth doing.

I think what you have proposed would be excellent, from our point of
view.  We don't actually expose sorting by name in the netcdf4
interface, so the choice of collation won't affect it.  But
being able to use Unicode in names would offer a great improvement in
the ability of data providers to convey the meaning in the data.

--Russ