
[netCDF #NXL-777870]: NetCDF Fortran - "Abort trap signal" error on nf03_test



Hello,

This appears to be related to a known issue with OpenMPI and tests that attempt
to spawn more processes than there are processors available (as you correctly
identified). I believe it is safe to ignore this error, because it does not
reflect a problem with netCDF but rather with how the test is invoked.
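The mismatch comes down to comparing the number of ranks a test requests with the cores on the node. A minimal sketch, assuming Open MPI's mpiexec (whose --oversubscribe option permits more ranks than cores) and GNU coreutils' nproc; mpi_launch_cmd is a hypothetical helper that only prints the command it would use and does not modify the netCDF test scripts:

```shell
#!/bin/sh
# Hypothetical helper (not part of netCDF or Open MPI): print the launch
# command for NPROCS ranks, adding Open MPI's --oversubscribe option only
# when NPROCS exceeds CORES, the number of cores available on this node.
mpi_launch_cmd() {
    nprocs=$1
    cores=$2
    shift 2                     # remaining arguments: program and its args
    if [ "$nprocs" -gt "$cores" ]; then
        echo "mpiexec --oversubscribe -n $nprocs $*"
    else
        echo "mpiexec -n $nprocs $*"
    fi
}

# The 16-rank tst_parallel3 case from the report, sized against this node;
# nproc (GNU coreutils) reports the number of available cores.
mpi_launch_cmd 16 "$(nproc)" ./tst_parallel3
```

Oversubscribed runs are slow, so running the 16-process test on a compute node that actually has 16 or more cores is the cleaner way to exercise it.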

As for the CESM/netCDF-Fortran failure, I will move this discussion to the
GitHub issue you opened. I am setting up an environment to try to reproduce
this issue and will report back there shortly.

Thank you!

-Ward

> Dear Unidata’s netCDF support,
> 
> I am attempting to build the netCDF libraries on my department cluster
> using OpenMPI compiled with the Intel compilers. The libraries are intended
> to be used with the CESM model I am trying to run.
> 
> However, I am getting an error in one NetCDF Fortran test. It does not
> prevent the libraries from building, but it seems important anyway: it
> is the same error message that appears in the CESM logs after an unsuccessful
> simulation (the model runs but does not produce any output).
> 
> My software versions are:
> Intel C and Fortran compilers 17
> OpenMPI 3.0.0
> HDF5 1.10.2
> netCDF C 4.6.1
> netCDF Fortran 4.4.4
> 
> First of all, this is how I load the Intel compiler and Open MPI modules:
> 
> module load intel/17.0.1
> module load openmpi/3.0.0/intel/17.0.1
> 
> And here is some (hopefully) useful information on the modules:
> 
> module show intel/17.0.1
> -----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
> /sw/arcts/centos7/modulefiles/intel/17.0.1.lua:
> -----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
> help([[
> The Intel module enables the Intel family of compilers (C/C++
> and Fortran) and updates the $PATH, $LD_LIBRARY_PATH,
> $INCLUDE, and $MANPATH environment variables to access the
> compiler binaries, libraries, include files, and available man
> pages, respectively.
> 
> The following additional environment variables are also defined:
> 
> $ICC_BIN                (path to icc/icpc compilers          )
> $ICC_LIB                (path to C/C++  libraries            )
> $IFC_BIN                (path to ifort compiler              )
> $IFC_LIB                (path to Fortran libraries           )
> 
> See the man pages for icc, icpc, and ifort for detailed information
> on available compiler options and command-line syntax.
> 
> ]])
> whatis("Name: Intel")
> whatis("Description: Intel compiler suite.")
> whatis("License information: None provided")
> whatis("Category: Library, Development, Core")
> whatis("Package documentation: None provided")
> whatis("Version: 17.0.1")
> setenv("ICC_BIN","/sw/arcts/centos7/intel/17.0.1-1/compilers_and_libraries_2017.1.132/linux/bin/intel64")
> setenv("IFC_BIN","/sw/arcts/centos7/intel/17.0.1-1/compilers_and_libraries_2017.1.132/linux/bin/intel64")
> setenv("ICC_LIB","/sw/arcts/centos7/intel/17.0.1-1/compilers_and_libraries_2017.1.132/linux/compiler/lib/intel64")
> setenv("IFC_LIB","/sw/arcts/centos7/intel/17.0.1-1/compilers_and_libraries_2017.1.132/linux/compiler/lib/intel64")
> prepend_path("INTEL_LICENSE_FILE","/sw/arcts/centos7/intel/licenses/network.lic")
> prepend_path("PATH","/sw/arcts/centos7/intel/17.0.1-1/compilers_and_libraries_2017.1.132/linux/bin/intel64")
> prepend_path("LD_LIBRARY_PATH","/sw/arcts/centos7/intel/17.0.1-1/compilers_and_libraries_2017.1.132/linux/tbb/lib/intel64_lin/gcc4.4")
> prepend_path("LD_LIBRARY_PATH","/sw/arcts/centos7/intel/17.0.1-1/compilers_and_libraries_2017.1.132/linux/daal/lib/intel64_lin")
> prepend_path("LD_LIBRARY_PATH","/sw/arcts/centos7/intel/17.0.1-1/debugger_2017/libipt/intel64/lib")
> prepend_path("LD_LIBRARY_PATH","/sw/arcts/centos7/intel/17.0.1-1/debugger_2017/iga/lib")
> prepend_path("LD_LIBRARY_PATH","/sw/arcts/centos7/intel/17.0.1-1/compilers_and_libraries_2017.1.132/linux/tbb/lib/intel64/gcc4.7")
> prepend_path("LD_LIBRARY_PATH","/sw/arcts/centos7/intel/17.0.1-1/compilers_and_libraries_2017.1.132/linux/mkl/lib/intel64_lin")
> prepend_path("LD_LIBRARY_PATH","/sw/arcts/centos7/intel/17.0.1-1/compilers_and_libraries_2017.1.132/linux/ipp/lib/intel64")
> prepend_path("LD_LIBRARY_PATH","/sw/arcts/centos7/intel/17.0.1-1/compilers_and_libraries_2017.1.132/linux/compiler/lib/intel64_lin")
> prepend_path("LD_LIBRARY_PATH","/sw/arcts/centos7/intel/17.0.1-1/compilers_and_libraries_2017.1.132/linux/compiler/lib/intel64")
> prepend_path("LIBRARY_PATH","/sw/arcts/centos7/intel/17.0.1-1/compilers_and_libraries_2017.1.132/linux/tbb/lib/intel64_lin/gcc4.4")
> prepend_path("LIBRARY_PATH","/sw/arcts/centos7/intel/17.0.1-1/compilers_and_libraries_2017.1.132/linux/daal/lib/intel64_lin")
> prepend_path("LIBRARY_PATH","/sw/arcts/centos7/intel/17.0.1-1/compilers_and_libraries_2017.1.132/linux/tbb/lib/intel64/gcc4.7")
> prepend_path("LIBRARY_PATH","/sw/arcts/centos7/intel/17.0.1-1/compilers_and_libraries_2017.1.132/linux/mkl/lib/intel64_lin")
> prepend_path("LIBRARY_PATH","/sw/arcts/centos7/intel/17.0.1-1/compilers_and_libraries_2017.1.132/linux/compiler/lib/intel64_lin")
> prepend_path("LIBRARY_PATH","/sw/arcts/centos7/intel/17.0.1-1/compilers_and_libraries_2017.1.132/linux/ipp/lib/intel64")
> prepend_path("MANPATH","/sw/arcts/centos7/intel/17.0.1-1/man/common")
> prepend_path("NLSPATH","/sw/arcts/centos7/intel/17.0.1-1/debugger_2017/gdb/intel64/share/locale/%l_%t/%N")
> prepend_path("NLSPATH","/sw/arcts/centos7/intel/17.0.1-1/compilers_and_libraries_2017.1.132/linux/mkl/lib/intel64_lin/locale/%l_%t/%N")
> prepend_path("NLSPATH","/sw/arcts/centos7/intel/17.0.1-1/compilers_and_libraries_2017.1.132/linux/compiler/lib/intel64/locale/%l_%t/%N")
> prepend_path("MKLROOT","/sw/arcts/centos7/intel/17.0.1-1/compilers_and_libraries_2017.1.132/linux/mkl")
> prepend_path("CPATH","/sw/arcts/centos7/intel/17.0.1-1/compilers_and_libraries_2017.1.132/linux/daal/include")
> prepend_path("CPATH","/sw/arcts/centos7/intel/17.0.1-1/compilers_and_libraries_2017.1.132/linux/tbb/include")
> prepend_path("CPATH","/sw/arcts/centos7/intel/17.0.1-1/compilers_and_libraries_2017.1.132/linux/mkl/include")
> prepend_path("CPATH","/sw/arcts/centos7/intel/17.0.1-1/compilers_and_libraries_2017.1.132/linux/ipp/include")
> setenv("IPPROOT","/sw/arcts/centos7/intel/17.0.1-1/compilers_and_libraries_2017.1.132/linux/ipp")
> setenv("TBBROOT","/sw/arcts/centos7/intel/17.0.1-1/compilers_and_libraries_2017.1.132/linux/tbb")
> setenv("DAALROOT","/sw/arcts/centos7/intel/17.0.1-1/compilers_and_libraries_2017.1.132/linux/daal")
> 
> module show openmpi/3.0.0/intel/17.0.1
> -----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
> /sw/arcts/centos7/modulefiles/openmpi/3.0.0/intel/17.0.1.lua:
> -----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
> help([[
> OpenMPI consists of a set of compiler 'wrappers' that include the appropriate
> settings for compiling MPI programs on the cluster.  The most commonly used
> of these are
> 
> mpicc
> mpic++
> mpif90
> 
> Those are used in the same way as the regular compiler program, for example,
> 
> $ mpicc -o hello hello.c
> 
> will produce an executable program file, hello, from C source code in hello.c.
> 
> In addition to adding the OpenMPI executables to your path, the following
> environment variables set by the openmpi module.
> 
> $MPI_HOME
> 
> ]])
> whatis("Name: openmpi")
> whatis("Description: OpenMPI implementation of the MPI protocol")
> whatis("License information: https://www.open-mpi.org/community/license.php")
> whatis("Category: Utility, Development, Core")
> whatis("Package documentation: https://www.open-mpi.org/doc/")
> whatis("ARC examples: /scratch/data/examples/openmpi")
> whatis("Version: 3.0.0")
> prereq("intel/17.0.1")
> prepend_path("PATH","/sw/arcts/centos7/openmpi/3.0.0-intel-17.0.1-1/bin")
> prepend_path("MANPATH","/sw/arcts/centos7/openmpi/3.0.0-intel-17.0.1-1/share/man")
> prepend_path("LD_LIBRARY_PATH","/sw/arcts/centos7/openmpi/3.0.0-intel-17.0.1-1/lib")
> setenv("MPI_HOME","/sw/arcts/centos7/openmpi/3.0.0-intel-17.0.1-1")
> setenv("OMPI_MCA_btl_openib_warn_no_device_params_found","0")
> 
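Because the whole stack is compiled through the Open MPI wrappers, it can help to confirm that mpicc and mpif90 really invoke icc and ifort once these modules are loaded, which rules out a mixed-compiler build. A minimal sketch, assuming Open MPI's --showme:command wrapper option (which prints the underlying compiler command); check_wrapper is a hypothetical helper name:

```shell
#!/bin/sh
# Sketch: confirm that an Open MPI compiler wrapper calls the expected
# underlying compiler, using the wrapper's --showme:command option.
check_wrapper() {
    wrapper=$1
    expected=$2                      # e.g. icc for mpicc, ifort for mpif90
    actual=$("$wrapper" --showme:command 2>/dev/null) || {
        echo "$wrapper: not found or not an Open MPI wrapper" >&2
        return 2
    }
    actual=${actual%% *}             # keep the first token (the compiler)
    actual=$(basename "$actual")     # strip any directory prefix
    if [ "$actual" = "$expected" ]; then
        echo "$wrapper -> $actual (ok)"
    else
        echo "$wrapper -> $actual (expected $expected)" >&2
        return 1
    fi
}
```

Usage after the module loads: check_wrapper mpicc icc && check_wrapper mpif90 ifort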
> Then, this is the command I used to build HDF5:
>
> FC=mpif90 CC=mpicc CXX=mpicxx ./configure --with-zlib=${ZDIR} \
>     --with-szlib=${SZDIR} --prefix=${H5DIR} --enable-parallel
> make
> make install
> make check-install
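Before building netCDF on top of this HDF5, one way to confirm that --enable-parallel took effect is to inspect the generated configuration header. A minimal sketch, assuming the standard install layout in which ${H5DIR}/include/H5pubconf.h defines H5_HAVE_PARALLEL to 1 for parallel builds; hdf5_is_parallel is a hypothetical helper name:

```shell
#!/bin/sh
# Sketch: succeed (exit 0) if the HDF5 install under PREFIX was built with
# parallel support, judged by the H5_HAVE_PARALLEL macro in H5pubconf.h.
hdf5_is_parallel() {
    conf="$1/include/H5pubconf.h"
    if [ ! -f "$conf" ]; then
        echo "no H5pubconf.h under $1" >&2
        return 2
    fi
    grep -q '^#define H5_HAVE_PARALLEL 1' "$conf"
}
```

Usage: hdf5_is_parallel "$H5DIR" && echo "parallel HDF5"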
> 
> and this is how I am building the NetCDF C and Fortran libraries
> respectively:
> 
> CPPFLAGS=-I${H5DIR}/include LDFLAGS=-L${H5DIR}/lib CC=mpicc \
>     ./configure --prefix=${NCDIR} --enable-shared --disable-dap \
>     --enable-parallel-tests
> make
> make install
> make check
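A similar check can be made on the installed netCDF-C library. A minimal sketch, assuming ${NCDIR}/include/netcdf_meta.h defines an NC_HAS_PARALLEL macro (1 when parallel I/O is enabled); if the installed header lacks that macro, nc-config --all prints a feature summary instead. netcdf_is_parallel is a hypothetical helper name:

```shell
#!/bin/sh
# Sketch: succeed if the netCDF-C install under PREFIX advertises parallel
# I/O via NC_HAS_PARALLEL in netcdf_meta.h (assumed macro, see lead-in).
netcdf_is_parallel() {
    meta="$1/include/netcdf_meta.h"
    if [ ! -f "$meta" ]; then
        echo "no netcdf_meta.h under $1" >&2
        return 2
    fi
    # Match NC_HAS_PARALLEL exactly (not e.g. NC_HAS_PARALLEL4) with value 1.
    grep -Eq '^#define[[:space:]]+NC_HAS_PARALLEL[[:space:]]+1([^0-9]|$)' "$meta"
}
```

Usage: netcdf_is_parallel "$NCDIR" && echo "parallel netCDF-C"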
> 
> and
> 
> CPPFLAGS=-I${NCDIR}/include LDFLAGS=-L${NCDIR}/lib CC=mpicc F77=mpif77 \
>     FC=mpif90 ./configure --prefix=${NCDIR}
> make
> make install
> make check
> 
> where all the paths ${ZDIR}, ${SZDIR}, ${H5DIR} and ${NCDIR} are exported
> to my LD_LIBRARY_PATH environment variable.
>
> All the builds finish successfully, but I do get errors
> during make check:
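The dynamic loader searches the directories that contain the shared libraries, so it is the lib/ subdirectory of each install prefix, not the prefix itself, that belongs on LD_LIBRARY_PATH. A minimal sketch, assuming each of ZDIR, SZDIR, H5DIR and NCDIR uses the usual <prefix>/lib layout created by make install; prepend_lib_dirs is a hypothetical helper name:

```shell
#!/bin/sh
# Sketch: prepend <prefix>/lib for each argument to LD_LIBRARY_PATH.
# Each argument is prepended in turn, so later arguments end up earlier
# in the search path; empty (unset) prefixes are skipped.
prepend_lib_dirs() {
    for prefix in "$@"; do
        [ -n "$prefix" ] || continue
        LD_LIBRARY_PATH="$prefix/lib${LD_LIBRARY_PATH:+:$LD_LIBRARY_PATH}"
    done
    export LD_LIBRARY_PATH
}
```

Calling prepend_lib_dirs "$ZDIR" "$SZDIR" "$H5DIR" "$NCDIR" puts the netCDF libraries first. Afterwards, running ldd on ${NCDIR}/lib/libnetcdff.so and looking at the netcdf and hdf5 entries shows which copies the loader actually resolves, which helps rule out a different libnetcdf being picked up at run time.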
> 
> a) on NetCDF C, saying:
> 
> ===========================================
> netCDF 4.6.1: nc_test4/test-suite.log
> ===========================================
> 
> # TOTAL: 68
> # PASS:  67
> # SKIP:  0
> # XFAIL: 0
> # FAIL:  1
> # XPASS: 0
> # ERROR: 0
> 
> .. contents:: :depth: 2
> 
> FAIL: run_par_test
> ==================
> 
> Testing MPI parallel I/O with various other mode flags...
> 
> *** Testing illegal mode combinations
> *** Testing create + MPIO + fletcher32
> *** Testing create + MPIO + deflation
> ok.
> *** Tests successful!
> 
> Testing MPI parallel I/O without netCDF...
> 
> *** Testing basic MPI file I/O.
> *** testing file create with parallel I/O with MPI...ok.
> *** Tests successful!
> 
> Testing very simple parallel I/O with 4 processors...
> 
> *** tst_parallel testing very basic parallel access.
> *** tst_parallel testing whether we can create file for parallel access and write to it...ok.
> *** Tests successful!
> 
> Testing simple parallel I/O with 16 processors...
> --------------------------------------------------------------------------
> There are not enough slots available in the system to satisfy the 16 slots
> that were requested by the application:
> ./tst_parallel3
> 
> Either request fewer slots for your application, or make more slots available
> for use.
> --------------------------------------------------------------------------
> FAIL run_par_test.sh (exit status: 1)
> 
> which is expected since the login nodes on my cluster have only 12 cores.
> As you can see, the test with only four cores finished successfully.
> 
> b) on NetCDF Fortran, saying:
> 
> *** testing nf_copy_att ...
> bad var id: NetCDF: Attribute not found
> nf_copy_att: NetCDF: Attribute not found
> bad var id: NetCDF: Attribute not found
> nf_copy_att: NetCDF: Attribute not found
> forrtl: error (76): Abort trap signal
> Image              PC                Routine            Line        Source
> nf03_test          00000000004F54A1  tbk_trace_stack_i     Unknown  Unknown
> nf03_test          00000000004F35DB  tbk_string_stack_     Unknown  Unknown
> nf03_test          00000000004AB2F4  Unknown               Unknown  Unknown
> nf03_test          00000000004AB106  tbk_stack_trace       Unknown  Unknown
> nf03_test          00000000004742F9  for__issue_diagno     Unknown  Unknown
> nf03_test          0000000000477B04  for__signal_handl     Unknown  Unknown
> libpthread-2.17.s  00002AF3675F45E0  Unknown               Unknown  Unknown
> libc-2.17.so       00002AF3678361F7  gsignal               Unknown  Unknown
> libc-2.17.so       00002AF3678378E8  abort                 Unknown  Unknown
> nf03_test          000000000040B362  Unknown               Unknown  Unknown
> nf03_test          0000000000463BA6  Unknown               Unknown  Unknown
> nf03_test          0000000000466E75  Unknown               Unknown  Unknown
> nf03_test          000000000045F4BB  Unknown               Unknown  Unknown
> nf03_test          0000000000455BCE  Unknown               Unknown  Unknown
> nf03_test          0000000000456C55  Unknown               Unknown  Unknown
> nf03_test          000000000040B31E  Unknown               Unknown  Unknown
> libc-2.17.so       00002AF367822C05  __libc_start_main     Unknown  Unknown
> nf03_test          000000000040B229  Unknown               Unknown  Unknown
> 
> Error b) above is the one that really concerns me. As I mentioned before,
> I see several *Attribute not found* errors in CESM's logs, and no output
> from the model is written to disk (although the job is not killed). I
> suspect there is some connection between the two errors.
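The traceback in b) reports "Unknown" for every Line and Source entry, most likely because nf03_test was built without ifort's -traceback option. A minimal sketch of rebuilding netCDF-Fortran with -g -traceback (plus -O0, an optional assumption that keeps reported line numbers faithful) so the same failure can be located in the test source; the compiler settings mirror the configure line used above, and rebuild_fortran_debug and the DRY_RUN switch are hypothetical conveniences:

```shell
#!/bin/sh
# Set DRY_RUN=1 to print each command instead of executing it.
run() {
    if [ -n "${DRY_RUN:-}" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Sketch: reconfigure and rebuild netCDF-Fortran (run from its source tree)
# with debugging flags, then rerun the test suite.
rebuild_fortran_debug() {
    prefix=$1                        # same prefix as the original build
    debug_flags="-g -traceback -O0"
    run env CPPFLAGS="-I$prefix/include" LDFLAGS="-L$prefix/lib" \
        CC=mpicc F77=mpif77 FC=mpif90 \
        FFLAGS="$debug_flags" FCFLAGS="$debug_flags" \
        ./configure --prefix="$prefix" || return 1
    run make clean || return 1       # drop objects built with the old flags
    run make || return 1
    run make check                   # traceback should now name file and line
}
```

Usage from the netCDF-Fortran source directory: rebuild_fortran_debug "$NCDIR" (or DRY_RUN=1 rebuild_fortran_debug "$NCDIR" to preview the commands).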
> 
> Please find attached both the config.log and test-suite.log files from the
> NetCDF Fortran build that produced this error.
> 
> For the record, I have also filed an issue on GitHub at
> https://github.com/Unidata/netcdf-fortran/issues/81.
> 
> Do you have any ideas about what could be causing this error? I would really
> appreciate any help you can provide on how to fix it!
> 
> Regards,
> --
> Thiago V. dos Santos
> Postdoctoral research fellow
> Department of Climate and Space Sciences and Engineering
> University of Michigan
> 
> 


Ticket Details
===================
Ticket ID: NXL-777870
Department: Support netCDF
Priority: Normal
Status: Closed
===================
NOTE: All email exchanges with Unidata User Support are recorded in the Unidata 
inquiry tracking system and then made publicly available through the web.  If 
you do not want to have your interactions made available in this way, you must 
let us know in each email you send to us.