Re: when will HDF5 support Unicode?

Hi Russ,

> I'd like to reconsider the Unicode issue, and specifically ask about
> the feasibility of what we hope is a small addition to HDF5 to allow
> netCDF to support UTF-8 encoded names for variables, dimensions, and
> attributes without HDF5 having to support such encoded names.
> 
> We would like to just declare in netCDF documentation that the names
> for netCDF variables, dimensions, and attributes are UTF-8 encoded
> when provided to or returned from netCDF interfaces.  This is
> backwards compatible, because we currently only support ASCII strings
> (with some restrictions), and what we're proposing would just remove
> the restrictions and allow non-ASCII bytes (with the upper bit set),
> to allow for UTF-8 encoding of other Unicode characters.
> 
> What we would need from HDF5 is a way to request that names for
> Datasets and Attributes allow an arbitrary byte array, so we can use
> UTF-8 encoding for non-ASCII characters.
> 
> Is this feasible?

    After rooting through the group API as much as I have recently, I think
it's probably quite feasible for the names of objects & attributes to use UTF-8
encoding for their strings.  There are only two hangups I can see:
    - The names will be sorted in byte-value order, since there's no locale
        information embedded in the file, which may disconcert international
        users.
    - The strings are nul-terminated and I'm not certain if part of a UTF-8
        string can be nul.
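
    A minimal sketch of both hangups, in Python for brevity and without
touching HDF5 itself.  The names are hypothetical, and byte_order() only
assumes a plain byte-wise comparison; it is not taken from the library's
sorting code.  The scan at the end checks every code point other than U+0000
(surrogates excluded, since they cannot be encoded) for a zero byte in its
UTF-8 form.

```python
# Hypothetical object names, chosen only for illustration.
# "\u00c4rger" starts with A-umlaut, "\u00e9clair" with e-acute.
names = ["zeta", "\u00c4rger", "apple", "\u00e9clair", "Zebra"]


def byte_order(items):
    """Sort names by their UTF-8 bytes, as a locale-unaware comparison would."""
    return sorted(items, key=lambda n: n.encode("utf-8"))


def has_nul_byte(text):
    """Return True if the UTF-8 encoding of text contains a zero byte."""
    return b"\x00" in text.encode("utf-8")


# Uppercase ASCII sorts before lowercase ASCII, and every non-ASCII
# character sorts after all ASCII, regardless of the user's locale.
print(byte_order(names))

# Collect any code point (other than U+0000) whose encoding contains a nul.
code_points = list(range(1, 0xD800)) + list(range(0xE000, 0x110000))
nul_offenders = [cp for cp in code_points if has_nul_byte(chr(cp))]
print(len(nul_offenders))
```

    In standard UTF-8 the scan finds nothing: only U+0000 itself encodes to a
zero byte, so nul termination is a problem only for names that contain an
actual nul character.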

    I'll write some tests that check for proper insertion of non-ASCII strings
as object & attribute names and let you know what I find out.
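
    As a rough idea of what such a test might look like, here is a sketch
that round-trips names through a nul-terminated C buffer using Python's
ctypes.  It exercises only the C string convention, not the HDF5 library, and
the helper name and sample names are made up for illustration.

```python
import ctypes


def roundtrip_c_string(name):
    """Store name as a nul-terminated UTF-8 C string and read it back."""
    encoded = name.encode("utf-8")
    # create_string_buffer appends a terminating nul after the bytes.
    buf = ctypes.create_string_buffer(encoded)
    # .value reads up to (not including) the first nul byte.
    return buf.value.decode("utf-8")


# Hypothetical names: ASCII, Latin-1 range, CJK, and Greek.
samples = ["temperature", "temp\u00e9rature", "\u6e29\u5ea6", "\u03b8_e"]

for name in samples:
    assert roundtrip_c_string(name) == name

# A name with an embedded nul character is silently truncated.
truncated = roundtrip_c_string("a\x00b")
print(repr(truncated))
```

    The last case is the one to watch: a name containing U+0000 comes back
cut off at that point, which is why names with embedded nul characters would
still have to be rejected.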

    Note that Unicode strings as elements of a dataset are harder and probably
won't work correctly at the moment.

    Quincey

> Otherwise there are no library changes in netCDF that we would need to
> support UTF-8 encoding for Unicode names.  Some applications such as
> ncdump and ncgen will have to know how to handle encoded names, but we
> are willing to deal with that.
> 
> Note that we're not requesting that you drop restrictions on all
> names, just that you provide a way for netCDF-4 to be able to use
> names with non-ASCII bytes, for example a call to a function that says
> checking on new names will subsequently be lenient (e.g. you could still
> disallow empty names, names with embedded null characters, or names
> that are too long).  Existing code that didn't invoke this call would
> still have to abide by the current name restrictions.
> 
> Also I notice that the documentation for H5Acreate and H5Dcreate at
> 
>   http://hdf.ncsa.uiuc.edu/HDF5/doc/RM_H5A.html#Annot-Create
>   http://hdf.ncsa.uiuc.edu/HDF5/doc/RM_H5D.html#Dataset-Create
> 
> currently list no restrictions on names to use only ASCII characters,
> but the Introduction to HDF5 says
> 
>   A dataset name is a sequence of alphanumeric ASCII characters.
> 
> --Russ
>