Re: [netcdfgroup] random read failures with large CF-2 files (on Lustre?)



On 08/18/2015 02:31 PM, Ward Fisher wrote:
Hello all,

I just wanted to jump in and comment that this issue, recently reported
to us by David Knaak at Cray, is now handled in the netCDF-C development
branch on GitHub. This fix will be in the upcoming release candidate and
eventual final release of netCDF-C 4.4.0.

Regarding the question of short reads providing more warning: netCDF
was already checking for short reads when "paging in" data from a
file, but it treated every short read as an error (based on a non-zero
errno value). The fix shouldn't incur any performance penalty. The new
thing I learned about short reads is that they can occur without being
the result of an error; they can instead be the result of an
interrupt.

I have seen these short reads happen in ROMIO when trying to read 2 GiB of data in one shot: Linux would give me back only (2 GiB - 4 KiB) worth of data.

Today, most MPI-IO libraries should detect and retry this case. Cray's MPI-IO library is closed source, so I don't know what they do.

In general, since short reads are technically allowed, I think
developers are going to have to accommodate the possibility of short
reads in their software, one way or another. Developers should already
be checking the return value of read(), and when it is short, the fix
is essentially:

 1. Check whether errno is EINTR.
 2. If so, advance the buffer pointer and remaining count, and resume the read.

While that's strictly correct, I worry about short reads that, for whatever reason, don't set EINTR. So I would also check how much data was actually read. If it is less than requested, continue the read to fetch the missing data. If that continued read returns 0, you have hit EOF and you are done.
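The loop described above can be sketched in C roughly as follows. This is not the netCDF-C patch itself, just a minimal illustration of the technique; the helper name read_full is my own invention. It retries on EINTR, continues after short reads regardless of errno, and treats a zero return as EOF:

```c
#include <errno.h>
#include <unistd.h>

/* Hypothetical helper: read up to `count` bytes into `buf`, continuing
 * after interrupts and short reads. Returns the total bytes read
 * (less than `count` only at EOF), or -1 on a real error. */
ssize_t read_full(int fd, void *buf, size_t count)
{
    size_t total = 0;
    while (total < count) {
        ssize_t n = read(fd, (char *)buf + total, count - total);
        if (n < 0) {
            if (errno == EINTR)
                continue;   /* interrupted before any data: just retry */
            return -1;      /* genuine I/O error */
        }
        if (n == 0)
            break;          /* EOF: return whatever we have so far */
        total += n;         /* short read: advance and keep reading */
    }
    return (ssize_t)total;
}
```

A caller then only needs to compare the return value against the requested count to distinguish a full read from EOF, rather than reasoning about errno after every partial read.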

==rob

--
Rob Latham
Mathematics and Computer Science Division
Argonne National Lab, IL USA


