
[netCDF #KZJ-320086]: Short read are not managed?



Hello David,

Thank you for the comprehensive description of the issue, and the proposed 
solution!  After consulting with Russ, I have created a ticket for this in our 
JIRA system, https://bugtracking.unidata.ucar.edu/browse/NCF-337, and am going 
to try to integrate the fix before the next netCDF release.  We're currently 
preparing for our annual Python workshop, being held next week, but I will be 
able to turn my attention to this shortly thereafter.  

The fix seems pretty straightforward.  The only complication will be how to 
test for it, since the issue seems difficult to trigger, but I'm sure I can 
come up with something.  We also don't have access to Cray hardware or a Lustre 
filesystem, but as you point out this is not limited to that environment. 
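
One way to provoke short reads without special hardware is to route read
through a function pointer and substitute a reader that deliberately returns
fewer bytes than requested.  The sketch below only illustrates that idea and
is not a netCDF test; read_fn, short_read, read_all, and check_read_all are
hypothetical names that do not exist in the netCDF source.

```c
#include <assert.h>
#include <errno.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical hook type: same signature as POSIX read(). */
typedef ssize_t (*read_fn)(int fd, void *buf, size_t count);

/* Fault-injecting reader: never transfers more than 3 bytes per call. */
static ssize_t short_read(int fd, void *buf, size_t count)
{
    if (count > 3)
        count = 3;
    return read(fd, buf, count);
}

/*
 * Retry loop under test: calls rd until count bytes have arrived, end of
 * file is reached, or a real error occurs.  Returns the number of bytes
 * transferred, or -1 on error.
 */
static ssize_t read_all(read_fn rd, int fd, void *buf, size_t count)
{
    char *p = buf;
    size_t done = 0;

    assert(rd != NULL && buf != NULL);

    while (done < count) {
        ssize_t n = rd(fd, p + done, count - done);
        if (n > 0)
            done += (size_t)n;      /* partial progress: keep going */
        else if (n == 0)
            break;                  /* end of file */
        else if (errno != EINTR)
            return -1;              /* real error */
    }
    return (ssize_t)done;
}

/*
 * Self-check: push 10 bytes through a pipe and read them back at most
 * 3 bytes at a time.  Returns 0 if every byte arrived intact.
 */
static int check_read_all(void)
{
    static const char in[] = "0123456789";
    char out[10];
    int fds[2];
    ssize_t n;

    if (pipe(fds) != 0)
        return -1;
    if (write(fds[1], in, 10) != 10)
        return -1;
    close(fds[1]);

    n = read_all(short_read, fds[0], out, sizeof out);
    close(fds[0]);

    return (n == 10 && memcmp(in, out, 10) == 0) ? 0 : -1;
}
```

Because the short reads are injected rather than awaited, a check like this
is deterministic and does not depend on file system load.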

Thanks again for the comprehensive information!  Have a great day,

-Ward


> Full Name: David Knaak
> Email Address: address@hidden
> Organization: Cray Inc.
> Package Version: 4.3.3.1
> Operating System:
> Hardware:
> Description of problem: This ticket is directly related to these tickets:
> 
> 08 Apr 2015
> [netCDF #KZJ-320086]: Short read are not managed?
> http://www.unidata.ucar.edu/support/help/MailArchives/netcdf/msg13072.html
> 
> 23 Mar 2015
> [netCDF #PDZ-683250]: Short read are not managed?
> http://www.unidata.ucar.edu/support/help/MailArchives/netcdf/msg13053.html
> 
> Cray has analyzed the short read situation.  We believe we understand
> the problem and have a proposed fix for NetCDF.
> 
> Due to a combination of factors, this short read issue has shown up on
> Cray systems with Lustre file systems.  But this issue is not limited to Cray
> systems nor is it limited to Lustre file systems.  I will start with the
> specifics for Cray and Lustre but will then generalize it.
> 
> The first factor is that a major change introduced with Lustre 2.5 has
> caused behavior that is, by POSIX standards, legal, but is not the intended
> Lustre behavior.  The behavior is that a race condition can occur in Lustre
> that sometimes causes a read request to be only partially satisfied with
> a single read call.  This race condition is more likely to occur on large
> and very busy file systems but could occur on any Lustre 2.5 file system.
> Technically speaking, this is not a bug because POSIX semantics allows
> this (see below).  But this is not the intended behavior of Lustre and
> Lustre will be modified in a future release so that this does not happen.
> 
> The second factor is that not all programs and libraries properly handle
> the case of a short POSIX read or POSIX write.  This is the case with UCAR
> NetCDF when the creation mode is NC_CLASSIC_MODEL.  It may also be the
> case in other libraries and many user programs that are not properly coded.
> 
> In general, if a program calls read or write without checking for the
> number of bytes actually transferred and reading again if necessary
> then the program is exposed to the issue.  POSIX does not guarantee
> that a single read call will read all of the bytes requested or that a
> single write call will write all of the bytes requested.  Quoting from
> opengroup.org:
> 
> http://pubs.opengroup.org/onlinepubs/009695399/functions/read.html
> 
> Upon successful completion, where nbyte is greater than 0, read() shall
> mark for update the st_atime field of the file, and shall return the
> number of bytes read. This number shall never be greater than nbyte. The
> value returned may be less than nbyte if the number of bytes left in
> the file is less than nbyte, if the read() request was interrupted by a
> signal, or if the file is a pipe or FIFO or special file and has fewer
> than nbyte bytes immediately available for reading. For example, a read()
> from a file associated with a terminal may return one typed line of data.
> 
> If a read() is interrupted by a signal before it reads any data, it
> shall return -1 with errno set to [EINTR].
> 
> If a read() is interrupted by a signal after it has successfully read
> some data, it shall return the number of bytes read.
> 
> The issue for POSIX write is essentially the same.  See:
> 
> http://pubs.opengroup.org/onlinepubs/009695399/functions/write.html
> 
> So if a read returns some but not all bytes, read should be called again.
> The code in libsrc/posixio.c shows that for the NC_CLASSIC_MODEL path,
> read is not called again if there is a short read:
> 
> errno = 0;
> nread = read(nciop->fd, vp, extent);
> if(nread != (ssize_t) extent)
> {
>     status = errno;
>     if(nread == -1 || status != ENOERR)
>         return status;
>     /* else it's okay we read less than asked for */
>     (void) memset((char *)vp + nread, 0, (ssize_t)extent - nread);
> }
> *nreadp = nread;
> *posp += nread;
> 
> return ENOERR;
> 
> With this code, if the POSIX read does not read the full number of bytes,
> the read is not retried but rather "memset" zeroes out the rest of the
> user's buffer even though there may still be more bytes in the file to read.
> This is the exact behavior that some of our users have experienced when
> using NetCDF.
> 
> With some local modifications to the NetCDF library and some test cases,
> Cray verified that the NC_CLASSIC_MODEL path does in fact pass through the
> above code.  But for creation mode NC_NETCDF4 it does not.  For this mode,
> HDF5 I/O is called and HDF5 I/O properly handles short reads.
> 
> Since a short read can potentially happen on any POSIX compliant file
> system, code calling read should handle this possibility with code
> something like this:
> 
> /* fd is the file descriptor */
> /* buf is the initial address of the user buffer */
> /* request_count is the initial number of bytes requested (size_t) */
> /* requires <errno.h> and <unistd.h> */
> char *p = buf;
> size_t read_count;
> ssize_t nread;             /* signed: read() returns -1 on error */
> size_t bytes_xfered = 0;
> 
> do {
>     read_count = request_count - bytes_xfered;
>     nread = read(fd, p, read_count);
>     if (nread > 0) {
>         bytes_xfered += (size_t)nread;
>         p += nread;
>     }
>     /* loop on partial progress or EINTR; stop on EOF (0) or a real error */
> } while ((nread > 0 && bytes_xfered < request_count) ||
>          (nread == -1 && errno == EINTR));
> 
> Other examples of this method of reading again can be seen for HDF5 I/O
> in HDF5 source and for MPI I/O in ANL MPICH2 source.
> 
> After analyzing the issues, we provided one of our users who was seeing
> the issue with a wrapper routine for the POSIX read call.  This wrapper
> reads again when necessary as shown above.  With the wrapper, the user
> no longer had any failures, verifying both the path and the fix.
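
A wrapper along these lines might look like the sketch below.  It keeps the
zero-fill behavior of the posixio.c excerpt, but applies it only after read
has returned 0 (end of file) instead of after the first short read.  The name
read_extent and its signature are hypothetical; this is neither the Cray
wrapper nor the code that went into netCDF.

```c
#include <assert.h>
#include <errno.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper: read up to extent bytes from fd into vp, calling
 * read() again after a short read or EINTR.  Only when read() reports end
 * of file is the remainder of vp zero-filled.  Stores the number of bytes
 * actually read in *nreadp.  Returns 0 on success or an errno value on
 * failure.
 */
static int read_extent(int fd, void *vp, size_t extent, size_t *nreadp)
{
    char *p = vp;
    size_t total = 0;

    assert(vp != NULL && nreadp != NULL);

    while (total < extent) {
        ssize_t n = read(fd, p + total, extent - total);
        if (n > 0) {
            total += (size_t)n;     /* partial progress: keep going */
        } else if (n == 0) {
            break;                  /* true end of file */
        } else if (errno != EINTR) {
            return errno;           /* real I/O error */
        }
        /* n == -1 with errno == EINTR: fall through and retry */
    }

    /* Zero-fill only the part that lies beyond end of file. */
    memset(p + total, 0, extent - total);
    *nreadp = total;
    return 0;
}
```

A caller would pass the same descriptor, buffer, and extent it previously
handed to read, and treat a nonzero return value as the error status.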
> 
> As stated at the beginning, this issue is not unique to Cray systems or
> to Lustre file systems.  Lustre will eventually be modified so that it
> behaves as Lustre is intended to.  That is, Lustre will eventually do the
> additional reads such that POSIX read and POSIX write of a Lustre file will
> never return a short read.  But that doesn't remove the responsibility
> of program developers and library developers to handle the short read
> and short write cases.  Other file systems may exhibit the short read or
> write behavior.
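
For the write side, the same retry idea could be sketched as follows.  This is
an illustration under stated assumptions, not code from netCDF, HDF5, or
MPICH; write_all is a hypothetical name.

```c
#include <assert.h>
#include <errno.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper: write exactly count bytes from buf to fd, calling
 * write() again after a short write or EINTR.  Returns 0 on success and
 * -1 on failure (errno is set by write(), except in the unusual case
 * where write() returns 0 without making progress).
 */
static int write_all(int fd, const void *buf, size_t count)
{
    const char *p = buf;
    size_t done = 0;

    assert(buf != NULL || count == 0);

    while (done < count) {
        ssize_t n = write(fd, p + done, count - done);
        if (n > 0) {
            done += (size_t)n;      /* short write: continue with the rest */
        } else if (n < 0 && errno == EINTR) {
            continue;               /* interrupted before any data: retry */
        } else {
            return -1;              /* real error, or no progress at all */
        }
    }
    return 0;
}
```

Giving up when write returns 0 is a design choice made here to avoid
spinning forever; other policies are possible.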
> 
> We are informing our customers of the issue and encouraging them to
> correct their own calls to POSIX read and write if necessary.  Cray is not
> intending to provide our customers with a locally modified NetCDF library.
> We leave it to UCAR to provide the appropriate fixes for NetCDF.  When UCAR
> applies an appropriate fix and releases the new version, Cray will build
> it for our systems and release it to our customers.
> 
> Please contact me with any questions, comments, or concerns.
> 
> David Knaak
> 
> 
> 

Ticket Details
===================
Ticket ID: KZJ-320086
Department: Support netCDF
Priority: High
Status: Closed