Re: when will HDF5 support Unicode?

NOTE: The netcdf-hdf mailing list is no longer active. The list archives are made available for historical reasons.

Hi Quincey,

I'd like to reconsider the Unicode issue, and specifically ask about
the feasibility of what we hope is a small addition to HDF5 to allow
netCDF to support UTF-8 encoded names for variables, dimensions, and
attributes without HDF5 having to support such encoded names.

We would like to just declare in netCDF documentation that the names
for netCDF variables, dimensions, and attributes are UTF-8 encoded
when provided to or returned from netCDF interfaces.  This is
backwards compatible, because we currently only support ASCII strings
(with some restrictions), and what we're proposing would just remove
the restrictions and allow non-ASCII bytes (with the upper bit set),
to allow for UTF-8 encoding of other Unicode characters.
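
The backward-compatibility argument above can be illustrated with a minimal sketch. It is not netCDF or HDF5 code; the helper `has_high_bit` and the two sample names are hypothetical, chosen only to show that ASCII names are unchanged under UTF-8 while non-ASCII characters produce bytes with the upper bit set.

```python
def has_high_bit(data: bytes) -> bool:
    """Return True if any byte has its upper (0x80) bit set."""
    return any(b & 0x80 for b in data)


# Hypothetical variable names, used only for illustration.
ascii_name = "temperature"
unicode_name = "temp\u00e9rature"  # contains U+00E9 (e with acute accent)

ascii_bytes = ascii_name.encode("utf-8")
unicode_bytes = unicode_name.encode("utf-8")

# An ASCII-only name has the same bytes under ASCII and UTF-8,
# so existing names keep working unchanged.
same_as_ascii = ascii_bytes == ascii_name.encode("ascii")

# U+00E9 is encoded as two bytes (0xC3 0xA9), both with the upper bit set.
extra_bytes = len(unicode_bytes) - len(unicode_name)
```

Because only the non-ASCII characters introduce bytes at or above 0x80, a reader that handles today's ASCII names would see the same bytes for them after such a change.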

What we would need from HDF5 is a way to request that names for
Datasets and Attributes allow an arbitrary byte array, so we can use
UTF-8 encoding for non-ASCII characters.

Is this feasible?

Otherwise there are no library changes in netCDF that we would need to
support UTF-8 encoding for Unicode names.  Some applications such as
ncdump and ncgen will have to know how to handle encoded names, but we
are willing to deal with that.

Note that we're not requesting that you drop restrictions on all
names, just that you provide a way for netCDF-4 to be able to use
names with non-ASCII bytes, for example a call to a function that says
checking on new names will subsequently be lenient (e.g. you could still
disallow empty names, names with embedded null characters, or names
that are too long).  Existing code that didn't invoke this call would
still have to abide by the current name restrictions.
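
As a rough sketch of the kind of opt-in check described above (not an HDF5 API): the function `name_is_acceptable`, its `lenient` flag, and the `MAX_NAME_LEN` value are all hypothetical, and the strict rule shown (reject any byte with the upper bit set) is only an assumed stand-in for whatever the current restrictions actually are.

```python
MAX_NAME_LEN = 1024  # hypothetical limit, not taken from HDF5


def name_is_acceptable(name: bytes, lenient: bool = False) -> bool:
    """Check a candidate object name given as raw bytes.

    In both modes, empty names, names with embedded null bytes, and
    overly long names are rejected. Only in strict mode (the default)
    are bytes with the upper bit set rejected as well.
    """
    if len(name) == 0:
        return False
    if b"\x00" in name:
        return False
    if len(name) > MAX_NAME_LEN:
        return False
    if not lenient and any(b & 0x80 for b in name):
        # Assumed strict rule: ASCII bytes only.
        return False
    return True


# Hypothetical UTF-8 encoded name containing a non-ASCII character.
utf8_name = "temp\u00e9rature".encode("utf-8")
```

In this sketch, code that never passes `lenient=True` sees the same behavior as before, which mirrors the point that existing callers would remain bound by the current restrictions.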

Also I notice that the documentation for H5Acreate and H5Dcreate at

  http://hdf.ncsa.uiuc.edu/HDF5/doc/RM_H5A.html#Annot-Create
  http://hdf.ncsa.uiuc.edu/HDF5/doc/RM_H5D.html#Dataset-Create

currently lists no restriction of names to ASCII characters,
but the Introduction to HDF5 says

  A dataset name is a sequence of alphanumeric ASCII characters.

--Russ
