
[netCDF #IYL-401919]: HDF5/NetCDF4 error with lustre NO_FLOCK



We do not have good direct insight into HDF5 errors; however,
there is a way to get HDF5 to report more error information.
Assuming you are using Linux with bash, run this command:
    export NETCDF_LOG_LEVEL=0
This should cause HDF5 to generate more error information.
However, I think you also need to ask the HDF5 group about
this problem and see if they have any insights.


> On 3/16/19 8:03 PM, Unidata netCDF Support wrote:
> > Am I correct summarizing that because flock is off in hdf5, that
> > some netcdf tests are failing?
> > Also, the directory netcdf-c/h5_test has some HDF5 only tests.
> > Is it possible to explore what is happening to them in this situation?
> [...]
> 
> Make check tests ALL fail if HDF5_USE_FILE_LOCKING is TRUE (the default);
> SOME fail for both HDF5 and netCDF if HDF5_USE_FILE_LOCKING is set to
> FALSE, but only if the tests are executed on the Lustre partition.
> 
> The h5 tests for netcdf were all successful with
> HDF5_USE_FILE_LOCKING=FALSE on Lustre (logs attached). The failed tests
> are listed in my previous e-mail.
> 
> On the same system all tests PASS if executed on EXT4/NFS4.
> 
> My idea is that file locking is still required by HDF5, and that the
> supposed capability to run without it on some parallel file systems is
> incompletely implemented or broken, so the runtime flag is useless.
> Worse, it can lead to silent and unreproducible corruption of data that
> was supposedly generated correctly. Note that if I run a short-term
> simulation, sometimes the netCDF4 files are correct (even using MPI-IO),
> sometimes not. On a long-term simulation (generating 100+ GB), the data
> are invariably corrupted.
> 
> I do not find any flock call in the netcdf library itself, so I suppose
> the problem is confined to HDF5 itself. BUT the netcdf library should
> report this error better, because it was not easy to work out that the
> "permission denied" error was actually an ENOSYS error in HDF5 caused by
> the missing flock. Moreover, given that HDF5 is SUPPOSED to be able to
> run without flock, no netcdf test should fail outright just because
> flock is unavailable.
> 
> The message:
> 
> If the file system is LUSTRE, ensure that the directory is mounted with
> the 'flock' option.
> 
> should read, once HDF5_USE_FILE_LOCKING is correctly implemented,
> 
> If the file system is LUSTRE, ensure that the directory is mounted with
> the 'flock' option or that HDF5_USE_FILE_LOCKING is set to FALSE.
> 
> OR the HDF5_USE_FILE_LOCKING environment variable should be removed from
> HDF5 altogether, and flock support in the underlying file system made a
> mandatory requirement.
> 
> In HDF5, grepping for flock matches:
> 
> Binary file src/.libs/H5FDcore.o matches
> Binary file src/.libs/H5FDdirect.o matches
> Binary file src/.libs/H5FDsec2.o matches
> Binary file src/.libs/H5FDstdio.o matches
> Binary file src/.libs/H5system.o matches
> Binary file src/.libs/H5FDlog.o matches
> 
> [...]
> >> Sorry for cross-posting to the two support groups, but the problem I
> >> found links both.
> [...]
> --
> Graziano Giuliani - Earth System Physics Section
> The Abdus Salam International Centre for Theoretical Physics
> Strada Costiera 11 - I - 34151 Trieste Italy
> 
> 

=Dennis Heimbigner
  Unidata


Ticket Details
===================
Ticket ID: IYL-401919
Department: Support netCDF
Priority: High
Status: Open
===================
NOTE: All email exchanges with Unidata User Support are recorded in the Unidata 
inquiry tracking system and then made publicly available through the web.  If 
you do not want to have your interactions made available in this way, you must 
let us know in each email you send to us.