
[netCDF #IYL-401919]: HDF5/NetCDF4 error with lustre NO_FLOCK



Am I correct in summarizing that some netCDF tests are failing
because flock is off in HDF5?
Also, the directory netcdf-c/h5_test has some HDF5-only tests.
Could you check how they behave in this situation?
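
One way to separate the file system's behaviour from HDF5's is to try
flock() directly in the directory where the tests run. Below is a minimal
Python sketch, not a definitive diagnostic: the function name probe_flock
and its return strings are invented for this illustration, and it assumes
a POSIX system where Python's fcntl.flock ends up in the same flock(2) call
that fails with errno 38 (ENOSYS on Linux) in the quoted HDF5 trace.

```python
import errno
import fcntl
import os
import tempfile


def probe_flock(directory):
    """Try to take and release an exclusive flock() on a scratch file.

    Returns "ok" if locking works, "unsupported" if the file system
    answers ENOSYS or EOPNOTSUPP, or "error: <errno name>" otherwise.
    (Hypothetical helper, written only for this illustration.)
    """
    # Create the scratch file inside the directory under test so the
    # lock request is served by that file system.
    fd, path = tempfile.mkstemp(dir=directory, prefix="flock_probe_")
    try:
        try:
            fcntl.flock(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
            fcntl.flock(fd, fcntl.LOCK_UN)
            return "ok"
        except OSError as exc:
            if exc.errno in (errno.ENOSYS, errno.EOPNOTSUPP):
                return "unsupported"
            return "error: " + errno.errorcode.get(exc.errno, str(exc.errno))
    finally:
        # Always clean up the scratch file, whatever the outcome.
        os.close(fd)
        os.unlink(path)
```

Running it once on the Lustre scratch space and once on the NFS home
should show whether the two file systems differ at the flock() level.
An "unsupported" result on Lustre would be consistent with the
'Function not implemented' message in the trace.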


> Sorry for cross-posting to the two support groups, but the problem I
> found links both.
> 
> VERSION:
> HDF5-1.10.5 / NetCDF-C 4.6.3 / NetCDF-fortran 4.4.4
> 
> USER:
> Graziano Giuliani <address@hidden>
> 
> SYNOPSIS:
> The HDF5_USE_FILE_LOCKING=FALSE environment variable does not allow
> HDF5+NetCDF4 operation on lustre mounted without the flock option.
> 
> MACHINE / OPERATING SYSTEM:
> Found this situation on the wombele BULLX cluster in Cote d'Ivoire,
> running RedHat7.
> 
> Linux wombele17.bullx 3.10.0-693.17.1.el7.x86_64 #1 SMP Sun Jan 14
> 10:36:03 EST 2018 x86_64 x86_64 x86_64 GNU/Linux
> 
> Wombele has a scratch space mounted as a 1PT lustre partition,
> missing the flock option for mount. Apparently this is the default
> for lustre, as the option is not listed on mount output:
> 
> XXX.XXX.XXX.XXX@o2ib:XXX.XXX.XXX.XXX@o2ib:/cncci101 on /SCRATCH type
> lustre (rw,lazystatfs)
> 
> The FLOCK option has a big impact on the file system:
> 
> http://wiki.lustre.org/Mounting_a_Lustre_File_System_on_Client_Nodes
> 
> and some applications requiring very high throughput on the cluster
> prevent me from asking for the flock option to be enabled cluster-wide
> without entering political/diplomatic mode.
> 
> COMPILER:
> 
> Compiler can be GNU or Intel, same result. Test output here from
> GNU. Using here MPI OpenMPI-4.0.0 for parallel tests. Results the
> same with Intel MPI.
> 
> DESCRIPTION:
> 
> The libraries were installed on the system after running make check on
> my NFS-mounted home, where all tests PASSED, both serial and parallel.
> 
> The application (RegCM, Fortran2003) works correctly writing netcdf3,
> both with the classic interface and with parallel NetCDF3 via pnetcdf, on
> the lustre scratch space or anywhere else. The HDF5/NetCDF4 interface works
> without problems on all disk partitions on the system EXCEPT the lustre one.
> 
> The error reported through nf90_strerror is a permission error from
> nf90_create, with both the NETCDF4 and NETCDF4+MPIIO interfaces.
> 
> A file of zero size is created as output, so it is tricky to understand
> what is happening.
> 
> I found in the HDF5 support forum a reference to an error of this kind
> on HDF5+lustre and on other parallel file systems as well:
> 
> http://hdf-forum.184993.n3.nabble.com/h5fcreate-1-10-unable-to-lock-td4028902.html
> 
> and the suggested workaround is to set the environment variable:
> 
> export HDF5_USE_FILE_LOCKING=FALSE
> 
> Compiling HDF5 outside lustre, make check works without any problems,
> but on lustre make check for HDF5 fails:
> 
> Tests fail on:
> Testing multiple--single process access for latest format
> *FAILED*
> HDF5-DIAG: Error detected in HDF5 (1.10.5) thread 0:
> #000: H5F.c line 1354 in H5Fstart_swmr_write(): unable to convert
> file format
> major: File accessibilty
> minor: Can't convert datatypes
> #001: H5Fint.c line 3410 in H5F__start_swmr_write(): unable to unlock
> the file
> major: File accessibilty
> minor: Unable to open file
> #002: H5FD.c line 1698 in H5FD_unlock(): driver unlock request failed
> major: Virtual File Layer
> minor: Can't update object
> #003: H5FDsec2.c line 990 in H5FD_sec2_unlock(): file locking
> disabled on this file system (use HDF5_USE_FILE_LOCKING environment
> variable to override), errno = 38, error message = 'Function not
> implemented'
> major: File accessibilty
> minor: Bad file ID accessed
> *FAILED*
> at swmr.c:6886 in test_multiple_same()...
> HDF5-DIAG: Error detected in HDF5 (1.10.5) thread 0:
> #000: H5F.c line 1354 in H5Fstart_swmr_write(): unable to convert
> file format
> major: File accessibilty
> minor: Can't convert datatypes
> #001: H5Fint.c line 3410 in H5F__start_swmr_write(): unable to unlock
> the file
> major: File accessibilty
> minor: Unable to open file
> #002: H5FD.c line 1698 in H5FD_unlock(): driver unlock request failed
> major: Virtual File Layer
> minor: Can't update object
> #003: H5FDsec2.c line 990 in H5FD_sec2_unlock(): file locking
> disabled on this file system (use HDF5_USE_FILE_LOCKING environment
> variable to override), errno = 38, error message = 'Function not
> implemented'
> major: File accessibilty
> minor: Bad file ID accessed
> Testing multiple--single process access for non-latest-format
> *FAILED*
> HDF5-DIAG: Error detected in HDF5 (1.10.5) thread 0:
> #000: H5F.c line 1354 in H5Fstart_swmr_write(): unable to convert
> file format
> major: File accessibilty
> minor: Can't convert datatypes
> #001: H5Fint.c line 3410 in H5F__start_swmr_write(): unable to unlock
> the file
> major: File accessibilty
> minor: Unable to open file
> #002: H5FD.c line 1698 in H5FD_unlock(): driver unlock request failed
> major: Virtual File Layer
> minor: Can't update object
> #003: H5FDsec2.c line 990 in H5FD_sec2_unlock(): file locking
> disabled on this file system (use HDF5_USE_FILE_LOCKING environment
> variable to override), errno = 38, error message = 'Function not
> implemented'
> major: File accessibilty
> minor: Bad file ID accessed
> *FAILED*
> at swmr.c:6886 in test_multiple_same()...
> HDF5-DIAG: Error detected in HDF5 (1.10.5) thread 0:
> #000: H5F.c line 1354 in H5Fstart_swmr_write(): unable to convert
> file format
> major: File accessibilty
> minor: Can't convert datatypes
> #001: H5Fint.c line 3410 in H5F__start_swmr_write(): unable to unlock
> the file
> major: File accessibilty
> minor: Unable to open file
> #002: H5FD.c line 1698 in H5FD_unlock(): driver unlock request failed
> major: Virtual File Layer
> minor: Can't update object
> #003: H5FDsec2.c line 990 in H5FD_sec2_unlock(): file locking
> disabled on this file system (use HDF5_USE_FILE_LOCKING environment
> variable to override), errno = 38, error message = 'Function not
> implemented'
> major: File accessibilty
> minor: Bad file ID accessed
> Testing H5Pget/set_metadata_read_attempts()
> PASSED
> Testing H5Fset_metadata_read_retry_info()
> PASSED
> Testing H5Fstart_swmr_write() when creating/opening a file with latest
> format*FAILED*
> at swmr.c:1658 in test_start_swmr_write()...
> Testing H5Fstart_swmr_write() when creating/opening a file without
> latest format*FAILED*
> HDF5-DIAG: Error detected in HDF5 (1.10.5) thread 0:
> #000: H5F.c line 1354 in H5Fstart_swmr_write(): unable to convert
> file format
> major: File accessibilty
> minor: Can't convert datatypes
> #001: H5Fint.c line 3410 in H5F__start_swmr_write(): unable to unlock
> the file
> major: File accessibilty
> minor: Unable to open file
> #002: H5FD.c line 1698 in H5FD_unlock(): driver unlock request failed
> major: Virtual File Layer
> minor: Can't update object
> #003: H5FDsec2.c line 990 in H5FD_sec2_unlock(): file locking
> disabled on this file system (use HDF5_USE_FILE_LOCKING environment
> variable to override), errno = 38, error message = 'Function not
> implemented'
> major: File accessibilty
> minor: Bad file ID accessed
> *FAILED*
> at swmr.c:1761 in test_start_swmr_write()...
> HDF5-DIAG: Error detected in HDF5 (1.10.5) thread 0:
> #000: H5F.c line 1354 in H5Fstart_swmr_write(): unable to convert
> file format
> major: File accessibilty
> minor: Can't convert datatypes
> #001: H5Fint.c line 3410 in H5F__start_swmr_write(): unable to unlock
> the file
> major: File accessibilty
> minor: Unable to open file
> #002: H5FD.c line 1698 in H5FD_unlock(): driver unlock request failed
> major: Virtual File Layer
> minor: Can't update object
> #003: H5FDsec2.c line 990 in H5FD_sec2_unlock(): file locking
> disabled on this file system (use HDF5_USE_FILE_LOCKING environment
> variable to override), errno = 38, error message = 'Function not
> implemented'
> major: File accessibilty
> minor: Bad file ID accessed
> Testing H5Fstart_swmr_write() on failure conditions for latest format
> *FAILED*
> HDF5-DIAG: Error detected in HDF5 (1.10.5) thread 0:
> #000: H5F.c line 1354 in H5Fstart_swmr_write(): unable to convert
> file format
> major: File accessibilty
> minor: Can't convert datatypes
> #001: H5Fint.c line 3410 in H5F__start_swmr_write(): unable to unlock
> the file
> major: File accessibilty
> minor: Unable to open file
> #002: H5FD.c line 1698 in H5FD_unlock(): driver unlock request failed
> major: Virtual File Layer
> minor: Can't update object
> #003: H5FDsec2.c line 990 in H5FD_sec2_unlock(): file locking
> disabled on this file system (use HDF5_USE_FILE_LOCKING environment
> variable to override), errno = 38, error message = 'Function not
> implemented'
> major: File accessibilty
> minor: Bad file ID accessed
> *FAILED*
> at swmr.c:2012 in test_err_start_swmr_write()...
> Testing H5Fstart_swmr_write() on failure conditions for without latest
> format*FAILED*
> HDF5-DIAG: Error detected in HDF5 (1.10.5) thread 0:
> #000: H5F.c line 1354 in H5Fstart_swmr_write(): unable to convert
> file format
> major: File accessibilty
> minor: Can't convert datatypes
> #001: H5Fint.c line 3410 in H5F__start_swmr_write(): unable to unlock
> the file
> major: File accessibilty
> minor: Unable to open file
> #002: H5FD.c line 1698 in H5FD_unlock(): driver unlock request failed
> major: Virtual File Layer
> minor: Can't update object
> #003: H5FDsec2.c line 990 in H5FD_sec2_unlock(): file locking
> disabled on this file system (use HDF5_USE_FILE_LOCKING environment
> variable to override), errno = 38, error message = 'Function not
> implemented'
> major: File accessibilty
> minor: Bad file ID accessed
> *FAILED*
> at swmr.c:2140 in test_err_start_swmr_write()...
> Testing H5Fstart_swmr_write()--concurrent access for latest format
> *FAILED*
> HDF5-DIAG: Error detected in HDF5 (1.10.5) thread 0:
> #000: H5F.c line 1354 in H5Fstart_swmr_write(): unable to convert
> file format
> major: File accessibilty
> minor: Can't convert datatypes
> #001: H5Fint.c line 3410 in H5F__start_swmr_write(): unable to unlock
> the file
> major: File accessibilty
> minor: Unable to open file
> #002: H5FD.c line 1698 in H5FD_unlock(): driver unlock request failed
> major: Virtual File Layer
> minor: Can't update object
> #003: H5FDsec2.c line 990 in H5FD_sec2_unlock(): file locking
> disabled on this file system (use HDF5_USE_FILE_LOCKING environment
> variable to override), errno = 38, error message = 'Function not
> implemented'
> major: File accessibilty
> minor: Bad file ID accessed
> *FAILED*
> at swmr.c:2649 in test_start_swmr_write_concur()...
> Testing H5Fstart_swmr_write()--concurrent access for
> non-latest-format*FAILED*
> HDF5-DIAG: Error detected in HDF5 (1.10.5) thread 0:
> #000: H5F.c line 444 in H5Fcreate(): unable to create file
> major: File accessibilty
> minor: Unable to open file
> #001: H5Fint.c line 1531 in H5F_open(): unable to truncate a file
> which is already open
> major: File accessibilty
> minor: Unable to open file
> *FAILED*
> at swmr.c:2416 in test_start_swmr_write_concur()...
> HDF5-DIAG: Error detected in HDF5 (1.10.5) thread 0:
> #000: H5F.c line 444 in H5Fcreate(): unable to create file
> major: File accessibilty
> minor: Unable to open file
> #001: H5Fint.c line 1531 in H5F_open(): unable to truncate a file
> which is already open
> major: File accessibilty
> minor: Unable to open file
> Testing H5Fstart_swmr_write()--stress object header messages
> *FAILED*
> HDF5-DIAG: Error detected in HDF5 (1.10.5) thread 0:
> #000: H5F.c line 444 in H5Fcreate(): unable to create file
> major: File accessibilty
> minor: Unable to open file
> #001: H5Fint.c line 1531 in H5F_open(): unable to truncate a file
> which is already open
> major: File accessibilty
> minor: Unable to open file
> *FAILED*
> at swmr.c:2998 in test_start_swmr_write_stress_ohdr()...
> HDF5-DIAG: Error detected in HDF5 (1.10.5) thread 0:
> #000: H5F.c line 444 in H5Fcreate(): unable to create file
> major: File accessibilty
> minor: Unable to open file
> #001: H5Fint.c line 1531 in H5F_open(): unable to truncate a file
> which is already open
> major: File accessibilty
> minor: Unable to open file
> Testing H5Pget/set_obj_flush_cb()
> *FAILED*
> HDF5-DIAG: Error detected in HDF5 (1.10.5) thread 0:
> #000: H5F.c line 444 in H5Fcreate(): unable to create file
> major: File accessibilty
> minor: Unable to open file
> #001: H5Fint.c line 1531 in H5F_open(): unable to truncate a file
> which is already open
> major: File accessibilty
> minor: Unable to open file
> *FAILED*
> at swmr.c:3240 in test_object_flush_cb()...
> HDF5-DIAG: Error detected in HDF5 (1.10.5) thread 0:
> #000: H5F.c line 444 in H5Fcreate(): unable to create file
> major: File accessibilty
> minor: Unable to open file
> #001: H5Fint.c line 1531 in H5F_open(): unable to truncate a file
> which is already open
> major: File accessibilty
> minor: Unable to open file
> Testing H5Fget/set_append_flush() for a generic dataset access property
> list PASSED
> Testing H5Fget/set_append_flush() for a chunked dataset's access
> property list*FAILED*
> HDF5-DIAG: Error detected in HDF5 (1.10.5) thread 0:
> #000: H5F.c line 444 in H5Fcreate(): unable to create file
> major: File accessibilty
> minor: Unable to open file
> #001: H5Fint.c line 1531 in H5F_open(): unable to truncate a file
> which is already open
> major: File accessibilty
> minor: Unable to open file
> *FAILED*
> at swmr.c:3632 in test_append_flush_dataset_chunked()...
> HDF5-DIAG: Error detected in HDF5 (1.10.5) thread 0:
> #000: H5F.c line 444 in H5Fcreate(): unable to create file
> major: File accessibilty
> minor: Unable to open file
> #001: H5Fint.c line 1531 in H5F_open(): unable to truncate a file
> which is already open
> major: File accessibilty
> minor: Unable to open file
> Testing H5Fget/set_append_flush() for a non-chunked dataset's access
> property list*FAILED*
> HDF5-DIAG: Error detected in HDF5 (1.10.5) thread 0:
> #000: H5F.c line 444 in H5Fcreate(): unable to create file
> major: File accessibilty
> minor: Unable to open file
> #001: H5Fint.c line 1531 in H5F_open(): unable to truncate a file
> which is already open
> major: File accessibilty
> minor: Unable to open file
> *FAILED*
> at swmr.c:3845 in test_append_flush_dataset_fixed()...
> HDF5-DIAG: Error detected in HDF5 (1.10.5) thread 0:
> #000: H5F.c line 444 in H5Fcreate(): unable to create file
> major: File accessibilty
> minor: Unable to open file
> #001: H5Fint.c line 1531 in H5F_open(): unable to truncate a file
> which is already open
> major: File accessibilty
> minor: Unable to open file
> Testing H5Fget/set_append_flush() for multiple opens of a chunked
> dataset*FAILED*
> HDF5-DIAG: Error detected in HDF5 (1.10.5) thread 0:
> #000: H5F.c line 444 in H5Fcreate(): unable to create file
> major: File accessibilty
> minor: Unable to open file
> #001: H5Fint.c line 1531 in H5F_open(): unable to truncate a file
> which is already open
> major: File accessibilty
> minor: Unable to open file
> *FAILED*
> at swmr.c:4086 in test_append_flush_dataset_multiple()...
> HDF5-DIAG: Error detected in HDF5 (1.10.5) thread 0:
> #000: H5F.c line 444 in H5Fcreate(): unable to create file
> major: File accessibilty
> minor: Unable to open file
> #001: H5Fint.c line 1531 in H5F_open(): unable to truncate a file
> which is already open
> major: File accessibilty
> minor: Unable to open file
> Testing SWMR-enabled VFD flag functionality
> *FAILED*
> HDF5-DIAG: Error detected in HDF5 (1.10.5) thread 0:
> #000: H5F.c line 444 in H5Fcreate(): unable to create file
> major: File accessibilty
> minor: Unable to open file
> #001: H5Fint.c line 1531 in H5F_open(): unable to truncate a file
> which is already open
> major: File accessibilty
> minor: Unable to open file
> *FAILED*
> at swmr.c:6218 in test_swmr_vfd_flag()...
> HDF5-DIAG: Error detected in HDF5 (1.10.5) thread 0:
> #000: H5F.c line 444 in H5Fcreate(): unable to create file
> major: File accessibilty
> minor: Unable to open file
> #001: H5Fint.c line 1531 in H5F_open(): unable to truncate a file
> which is already open
> major: File accessibilty
> minor: Unable to open file
> Testing File locking environment variable
> PASSED
> ***** 1 SWMR TEST FAILED! *****
> 
> So the environment variable workaround does NOT allow the user to get
> by without flock, at least for some features of the library.
> 
> Let us now turn to NetCDF.
> 
> # NetCDF C Configuration Summary
> ==============================
> 
> # General
> -------
> NetCDF Version:               4.6.3
> Configured On:                Sat Mar 16 08:58:37 CET 2019
> Host System:          x86_64-pc-linux-gnu
> Build Directory:      /SCRATCH/ictp/testh5/netcdf-c-4.6.3
> Install Prefix:         /home/ictp/SMR3243
> 
> # Compiling Options
> -----------------
> C Compiler:           /home/ictp/SMR3243/bin/mpicc
> CFLAGS:
> CPPFLAGS:             -I/home/ictp/SMR3243/include
> LDFLAGS:              -L/home/ictp/SMR3243/lib
> AM_CFLAGS:
> AM_CPPFLAGS:
> AM_LDFLAGS:
> Shared Library:               yes
> Static Library:               yes
> Extra libraries:      -lpnetcdf -lhdf5_hl -lhdf5 -lm -ldl -lsz -lz -lcurl
> 
> # Features
> --------
> NetCDF-2 API:         yes
> HDF4 Support:         no
> HDF5 Support:         yes
> NetCDF-4 API:         yes
> NC-4 Parallel Support:        yes
> PnetCDF Support:      yes
> DAP2 Support:         yes
> DAP4 Support:         yes
> Diskless Support:     yes
> MMap Support:         no
> JNA Support:          no
> CDF5 Support:         yes
> ERANGE Fill Support:  yes
> Relaxed Boundary Check:       yes
> 
> Configuring and making netcdf4 on lustre is ok, but make check fails.
> 
> make[2]: Entering directory `/SCRATCH/ictp/testh5/netcdf-c-4.6.3/nc_test'
> make[3]: Entering directory `/SCRATCH/ictp/testh5/netcdf-c-4.6.3/nc_test'
> FAIL: t_nc
> PASS: tst_small
> FAIL: nc_test
> PASS: tst_misc
> PASS: tst_norm
> PASS: tst_names
> PASS: tst_nofill
> PASS: tst_nofill2
> PASS: tst_nofill3
> PASS: tst_atts3
> PASS: tst_meta
> PASS: tst_inq_type
> PASS: tst_utf8_validate
> PASS: tst_utf8_phrases
> PASS: tst_global_fillval
> PASS: tst_max_var_dims
> PASS: tst_formats
> PASS: tst_def_var_fill
> PASS: tst_err_enddef
> PASS: tst_default_format
> PASS: tst_formatx_pnetcdf
> PASS: tst_default_format_pnetcdf
> PASS: tst_cdf5format
> PASS: tst_diskless6
> PASS: run_diskless.sh
> PASS: run_diskless5.sh
> PASS: run_inmemory.sh
> PASS: run_pnetcdf_test.sh
> PASS: run_cdf5.sh
> ============================================================================
> Testsuite summary for netCDF 4.6.3
> ============================================================================
> # TOTAL: 29
> # PASS:  27
> # SKIP:  0
> # XFAIL: 0
> # FAIL:  2
> # XPASS: 0
> # ERROR: 0
> ============================================================================
> See nc_test/test-suite.log
> Please report to address@hidden
> ============================================================================
> make[3]: *** [test-suite.log] Error 1
> 
> Attached the error report.
> 
> Remember that if I compile the code on ANY other partition mounted on
> the same system (NFS4 and ext4), I get NO ERROR at make check.
> To be clear, on this system the check fails only if the compilation and
> the checks are performed on lustre, which is tricky in itself to spot.
> 
> The application RegCM, with the supposed workaround from the HDF5 library,
> produces output files that can be randomly corrupted. The larger the
> output file (the longer the simulation), the higher the probability of
> an HDF error in the file. Note that the library does not report ANY
> error while running and producing corrupted files.
> 
> It looks like HDF5/netcdf are counting on the HDF5 file lock even when
> HDF5 is not locking the file. The test-suite log reports that the failure
> may be due to the missing flock attribute on the lustre mount, and HDF5
> with NO_FLOCK fails at make check.
> 
> The netcdf3, pnetcdf and HDF5 (with the FILE_LOCKING disabled in
> environment) interfaces work on the lustre file system.
> 
> The HDF5/netcdf4 interfaces, both serial and parallel, fail.
> 
> Note that I am using netCDF4 because some of my output variables do not
> fit in netCDF3.
> 
> G
> 
> 
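
The mount line quoted in the report can also be checked mechanically on
each client node. A minimal Python sketch, assuming the mount-style format
"<device> on <mountpoint> type <fstype> (<options>)" shown in the report
(joined onto one line); the function name lustre_flock_mode is hypothetical,
and the option names flock, localflock and noflock are the Lustre client
mount options as I understand them from the Lustre documentation.

```python
import re

# Matches lines such as:
#   host@o2ib:/fsname on /SCRATCH type lustre (rw,lazystatfs)
# Assumes device and mount point contain no spaces.
MOUNT_LINE = re.compile(
    r"^(?P<device>\S+) on (?P<mountpoint>\S+) type (?P<fstype>\S+) "
    r"\((?P<options>[^)]*)\)\s*$"
)


def lustre_flock_mode(mount_line):
    """Return the flock-related option of a Lustre mount line.

    Returns "flock", "localflock" or "noflock" when one is listed
    explicitly, "not listed" when a Lustre mount names none of them,
    and None when the line is not a Lustre mount or cannot be parsed.
    (Hypothetical helper, written only for this illustration.)
    """
    match = MOUNT_LINE.match(mount_line.strip())
    if match is None or match.group("fstype") != "lustre":
        return None
    options = match.group("options").split(",")
    for name in ("flock", "localflock", "noflock"):
        if name in options:
            return name
    return "not listed"
```

For the line in the report, with options (rw,lazystatfs), this returns
"not listed", which matches the observation that the flock option does
not appear in the mount output.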

=Dennis Heimbigner
  Unidata


Ticket Details
===================
Ticket ID: IYL-401919
Department: Support netCDF
Priority: Normal
Status: Open
===================
NOTE: All email exchanges with Unidata User Support are recorded in the Unidata 
inquiry tracking system and then made publicly available through the web.  If 
you do not want to have your interactions made available in this way, you must 
let us know in each email you send to us.