Re: when will HDF5 support Unicode?

Russ and others,

I want to double-check that the proposed (fairly limited) changes
are actually what you need.  This is a messy change, and I'm
concerned that it may not be what you are after.

With the long discussion, I've lost track of the trail, so I want
to restate what we propose, and ask you guys to let us know if this
is what you need.


1. flag the character encoding for links

Basically, the change means that there will be an information field
in the HDF5 metadata that indicates whether a link name should be
interpreted as ASCII or UTF-8.  (It is already possible to store UTF-8
as the name of an object, but there is no standard way to know that
this was done.)

This will cover the names of datasets and groups (essentially path names),
but not attributes.  I'm not sure about object references, the target
of a soft link, or other places where a path name might be used to
create an object.
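
To make point 1 concrete, here is a rough sketch in plain Python (not
HDF5 code) of how a reader might use such a flag.  The names Cset and
decode_link_name are made up for illustration, and the handling of the
UNKNOWN case is only one possible choice, not something the proposal
specifies.  Since the library would not validate names, a strict decode
can still fail on a name that is flagged but malformed.

```python
from enum import Enum


class Cset(Enum):
    """Hypothetical stand-in for the proposed per-link encoding flag."""
    UNKNOWN = 0   # e.g. a file written before the flag existed
    ASCII = 1
    UTF8 = 2


def decode_link_name(raw: bytes, cset: Cset) -> str:
    """Turn the stored bytes of a link name into text, guided by the flag."""
    if cset is Cset.ASCII:
        # Strict decode: raises UnicodeDecodeError on any byte >= 0x80.
        return raw.decode("ascii")
    if cset is Cset.UTF8:
        # Strict decode: raises UnicodeDecodeError on malformed UTF-8,
        # which can happen because the name was never checked on write.
        return raw.decode("utf-8")
    # UNKNOWN (assumption): keep the ASCII part readable and show every
    # other byte as an escape such as \xe9, so no information is lost.
    return raw.decode("ascii", errors="backslashreplace")
```

With this sketch, a name flagged UTF-8 decodes to the intended text,
while the same bytes in an older (UNKNOWN) file come back with visible
escapes rather than a guess at the encoding.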

2. add UTF-8 character encoding for character/string datatype

Again, it is already possible to store this data, but the change
makes the datatype document the encoding. 

I think this doesn't cover the strings for enumerations, or the field
names in a compound datatype.  I think they can have UTF-8 in them,
but there won't be a flag to indicate the fact.

This change covers the content of datasets and attributes. 
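
One practical consequence for point 2: UTF-8 is variable-width, so a
fixed-length string field has to be sized in bytes, not characters.
The sketch below is plain Python, not HDF5 code; the helper names
utf8_field_size and pad_fixed are hypothetical, and the null padding
is an assumption about how the field would be filled.

```python
def utf8_field_size(values):
    """Smallest fixed field size, in bytes, that holds every value as UTF-8."""
    return max(len(v.encode("utf-8")) for v in values)


def pad_fixed(value, nbytes):
    """Encode value as UTF-8 and null-pad it to exactly nbytes bytes."""
    raw = value.encode("utf-8")
    if len(raw) > nbytes:
        # Refuse rather than truncate: cutting at a byte boundary could
        # split a multi-byte character and leave invalid UTF-8 behind.
        raise ValueError(f"{value!r} needs {len(raw)} bytes, field has {nbytes}")
    return raw.ljust(nbytes, b"\x00")
```

For example, "Zürich" is six characters but seven bytes in UTF-8, so
a field sized by character count would be one byte too small.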

We can't handle other encodings at this time.



On Fri, 6 May 2005, Robert E. McGrath wrote:

> Russ,
> 
> Here is the current plan to provide limited support for unicode for 
> HDF5-1.8.0.
> 
> Specifically:
> 
>   1. one new character encoding, UTF-8, will be added for user data
>      types, i.e., datasets with string data.  This is a straightforward
>      extension of the current string data types.
> 
>   2. a new property when creating links (i.e., creating objects or
>      adding to groups), to specify either ASCII or UTF-8.
>     - default will be 'ASCII' (for backward compatibility)
>     - query will tell the encoding, one of (UNKNOWN, ASCII, UTF-8).
>       Older files will return UNKNOWN
>     - the link names will not be checked, i.e., we won't check that it
>       is legal UTF-8.
> 
> Other unicode support will be considered at a later date.
> 
> On 2005.05.05 16:17 Russ Rew wrote:
> > > We've had several discussions of UTF-8 support.  The current ideas
> > > are incorporated in a RFC at:
> > >
> > >    http://hdf.ncsa.uiuc.edu/RFC/Unicode/Unicode.html
> > >
> > > Close reading of this RFC will indicate that we know how to support
> > > UTF-8 for user data, but support for UTF-8 for names is still TBD.
> > 
> > I would consider supporting only UTF-8 for names but permit users to
> > specify other encodings as well for user data, for two reasons:
> > 
> >  - fixed-width encodings (like UCS2) permit quick access to the nth
> >    character in a string
> > 
> >  - other encodings may permit more compact representation than UTF-8
> >    for strings that contain a lot of non-ASCII characters
> > 
> > Joel Spolsky's column is a good introduction to some Unicode issues,
> > but I recommend this article for developers:
> > 
> >   http://www.w3.org/TR/charmod/
> > 
> > For example, the above gives examples of some of the complications in
> > sorting datasets alphabetically in a Group if you support Unicode
> > names.  You might need to use the "Unicode Collation Algorithm" in
> > that case.  Fortunately, there are open source implementations for
> > such things in ICU (International Components For Unicode):
> > 
> >   http://icu.sourceforge.net/
> > 
> > --Russ
> > 
> 

-- 
Robert E. McGrath
National Center for Supercomputing Applications
University of Illinois, Urbana-Champaign
Champaign, Illinois 61820
(217)-333-6549

mcgrath@xxxxxxxxxxxxx