
[netCDF #PMD-881681]: Segmentation fault in netcdf-fortran parallel test



Looking more closely at the HDF5-1.8.10-patch1 release notes, I cut and pasted
the wrong entry, as that was just the addition of a couple of new functions and
unlikely to fix anything.  So, never mind ...

--Russ

> Hi Orion,
> 

> I note that HDF5-1.8.10-patch1, the current and latest release, has fixes
> for parallel I/O problems, described in the RELEASE_NOTES as:
> 
> Parallel Library
> ----------------
> - Added the H5Pget_mpio_no_collective_cause() function that retrieves
> reasons why the collective I/O was broken during read/write IO access.
> (JKM - 2012/08/30 HDFFV-8143)
> 
> - Added H5Pget_mpio_actual_io_mode_f (MSB - 2012/09/27)
> 
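> For what it's worth, here's a rough, untested sketch of how that first new
> call could be used after a write to see why collective I/O was not done.
> It assumes HDF5 1.8.10 or later built with parallel support; the transfer
> property list handle is a placeholder, not anything from your test:
> 
> #include <hdf5.h>
> #include <stdint.h>
> #include <stdio.h>
> 
> /* After an H5Dwrite() that used a collective transfer property list
>  * (dxpl), ask HDF5 why collective I/O was or was not performed.
>  * Returns 0 if collective I/O actually happened on all ranks. */
> static int check_collective_cause(hid_t dxpl)
> {
>     uint32_t local_cause = 0, global_cause = 0;
> 
>     if (H5Pget_mpio_no_collective_cause(dxpl, &local_cause,
>                                         &global_cause) < 0)
>         return -1;
> 
>     if (global_cause != H5D_MPIO_COLLECTIVE)
>         fprintf(stderr,
>                 "collective I/O broken: local cause 0x%x, global cause 0x%x\n",
>                 (unsigned)local_cause, (unsigned)global_cause);
> 
>     return (global_cause == H5D_MPIO_COLLECTIVE) ? 0 : 1;
> }
> 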
> Would it be practical/convenient for you to rerun your test using
> HDF5-1.8.10-patch1?  I can't tell from the above if the latest fixes are
> relevant to the problem you're reporting, but it seems like a time-saving
> possibility.
> 
> --Russ
> 
> > Since I'm trying to track down various netcdf mpi issues I'm seeing on
> > Fedora, here is another.
> >
> > I started seeing this on Dec 3 trying to rebuild netcdf 4.2.1.1 for hdf5
> > 1.8.10 with:
> > mpich2 1.5
> > gcc 4.7.2-8.
> >
> > A previous build on Nov 1 succeeded with:
> > hdf5 1.8.9
> > gcc 4.7.2-6.
> > mpich2 1.5
> >
> > So I suspect a change in hdf5 between 1.8.9 and 1.8.10.
> >
> >
> > I'm currently testing with netcdf 4.3.0-rc1, gcc 4.8.0-0.14, hdf5-1.8.10,
> > mpich2 1.5.
> >
> > The test hangs here:
> >
> > Testing very simple parallel I/O with 4 processors...
> >
> > *** tst_parallel testing very basic parallel access.
> > *** tst_parallel testing whether we can create file for parallel access and
> > write to it...
> >
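> > Schematically, what the test is doing at this point is just a parallel
> > create, a define-mode exit, and a small per-rank write -- roughly the
> > following, paraphrased rather than the exact tst_parallel.c source
> > (names and sizes are placeholders, and it assumes 4 ranks as above):
> >
> > #include <stdio.h>
> > #include <mpi.h>
> > #include <netcdf.h>
> > #include <netcdf_par.h>
> >
> > #define DIMLEN 16               /* total length of the one dimension */
> > #define SLICE (DIMLEN / 4)      /* each of the 4 ranks writes a slice */
> >
> > int main(int argc, char **argv)
> > {
> >     int rank, nprocs, ncid, dimid, varid, i;
> >     int data[SLICE];
> >     size_t start[1], count[1];
> >
> >     MPI_Init(&argc, &argv);
> >     MPI_Comm_rank(MPI_COMM_WORLD, &rank);
> >     MPI_Comm_size(MPI_COMM_WORLD, &nprocs);
> >     if (nprocs != 4 && rank == 0)
> >         fprintf(stderr, "sketch assumes 4 ranks\n");
> >
> >     /* Create a netCDF-4/HDF5 file for parallel access (all ranks). */
> >     if (nc_create_par("sketch.nc", NC_NETCDF4 | NC_MPIIO,
> >                       MPI_COMM_WORLD, MPI_INFO_NULL, &ncid)) return 1;
> >     if (nc_def_dim(ncid, "d", DIMLEN, &dimid)) return 1;
> >     if (nc_def_var(ncid, "v", NC_INT, 1, &dimid, &varid)) return 1;
> >
> >     /* Leaving define mode flushes metadata collectively; this is the
> >      * nc_enddef() where three ranks sit in MPI_Barrier below. */
> >     if (nc_enddef(ncid)) return 1;
> >
> >     /* Each rank writes its own slice; this is the nc_put_vara_int()
> >      * where the fourth rank sits in MPI_Allreduce below. */
> >     start[0] = (size_t)rank * SLICE;
> >     count[0] = SLICE;
> >     for (i = 0; i < SLICE; i++)
> >         data[i] = rank;
> >     if (nc_put_vara_int(ncid, varid, start, count, data)) return 1;
> >
> >     if (nc_close(ncid)) return 1;
> >     MPI_Finalize();
> >     return 0;
> > }
> >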
> > The traces of three of the four processes, attached with gdb, look like:
> > (gdb) bt
> > #0  0x0000003819ab86b1 in MPID_nem_tcp_connpoll () from
> > /usr/lib64/mpich2/lib/libmpich.so.8
> > #1  0x0000003819aa5fd5 in MPIDI_CH3I_Progress () from
> > /usr/lib64/mpich2/lib/libmpich.so.8
> > #2  0x0000003819a601ad in MPIC_Wait () from 
> > /usr/lib64/mpich2/lib/libmpich.so.8
> > #3  0x0000003819a60852 in MPIC_Sendrecv () from
> > /usr/lib64/mpich2/lib/libmpich.so.8
> > #4  0x0000003819a60cb4 in MPIC_Sendrecv_ft () from
> > /usr/lib64/mpich2/lib/libmpich.so.8
> > #5  0x0000003819adb172 in MPIR_Barrier_intra () from
> > /usr/lib64/mpich2/lib/libmpich.so.8
> > #6  0x0000003819adb26d in MPIR_Barrier_or_coll_fn () from
> > /usr/lib64/mpich2/lib/libmpich.so.8
> > #7  0x0000003819adb711 in MPIR_Barrier_impl () from
> > /usr/lib64/mpich2/lib/libmpich.so.8
> > #8  0x0000003819adba58 in PMPI_Barrier () from 
> > /usr/lib64/mpich2/lib/libmpich.so.8
> > #9  0x0000003818a642a9 in H5AC_rsp__dist_md_write__flush (f=0x20d2780,
> > dxpl_id=167772175,
> > cache_ptr=0x7f3d0bc2b010) at ../../src/H5AC.c:4424
> > #10 0x0000003818a650c2 in H5AC_run_sync_point (f=0x20d2780, 
> > dxpl_id=167772175,
> > sync_point_op=1)
> > at ../../src/H5AC.c:4870
> > #11 0x0000003818a65532 in H5AC_flush_entries (f=0x20d2780) at
> > ../../src/H5AC.c:5050
> > #12 0x0000003818a5c7d5 in H5AC_flush (f=0x20d2780, dxpl_id=167772174) at
> > ../../src/H5AC.c:838
> > #13 0x0000003818ae490d in H5F_flush (f=0x20d2780, dxpl_id=167772174, 
> > closing=0)
> > at ../../src/H5F.c:1758
> > #14 0x0000003818af0fba in H5F_flush_mounts_recurse (f=0x20d2780,
> > dxpl_id=167772174)
> > at ../../src/H5Fmount.c:659
> > #15 0x0000003818af1175 in H5F_flush_mounts (f=0x20d2780, dxpl_id=167772174)
> > at ../../src/H5Fmount.c:698
> > #16 0x0000003818ae4648 in H5Fflush (object_id=16777216, 
> > scope=H5F_SCOPE_GLOBAL)
> > at ../../src/H5F.c:1704
> > #17 0x00007f3d0e24199c in sync_netcdf4_file (h5=0x20d1270) at
> > ../../libsrc4/nc4file.c:2964
> > #18 0x00007f3d0e242862 in NC4_enddef (ncid=<optimized out>) at
> > ../../libsrc4/nc4file.c:2922
> > #19 0x00007f3d0e1f44d2 in nc_enddef (ncid=65536) at 
> > ../../libdispatch/dfile.c:786
> > #20 0x0000000000400f59 in main (argc=1, argv=0x7fffece50d88)
> > at ../../nc_test4/tst_parallel.c:111
> >
> >
> > The other looks like:
> > (gdb) bt
> > #0  0x0000003819aa6005 in MPIDI_CH3I_Progress () from
> > /usr/lib64/mpich2/lib/libmpich.so.8
> > #1  0x0000003819a601ad in MPIC_Wait () from 
> > /usr/lib64/mpich2/lib/libmpich.so.8
> > #2  0x0000003819a60436 in MPIC_Recv () from 
> > /usr/lib64/mpich2/lib/libmpich.so.8
> > #3  0x0000003819a60af9 in MPIC_Recv_ft () from 
> > /usr/lib64/mpich2/lib/libmpich.so.8
> > #4  0x0000003819addab2 in MPIR_Bcast_binomial.isra.1 ()
> > from /usr/lib64/mpich2/lib/libmpich.so.8
> > #5  0x0000003819addef3 in MPIR_Bcast_intra () from
> > /usr/lib64/mpich2/lib/libmpich.so.8
> > #6  0x0000003819adeb7d in MPIR_Bcast_impl () from
> > /usr/lib64/mpich2/lib/libmpich.so.8
> > #7  0x0000003819ad90c7 in MPIR_Allreduce_intra () from
> > /usr/lib64/mpich2/lib/libmpich.so.8
> > #8  0x0000003819ada6f2 in MPIR_Allreduce_impl () from
> > /usr/lib64/mpich2/lib/libmpich.so.8
> > #9  0x0000003819adacde in PMPI_Allreduce () from
> > /usr/lib64/mpich2/lib/libmpich.so.8
> > #10 0x0000003818ac1f7b in H5D__mpio_opt_possible (io_info=0x7ffffa47e2a0,
> > file_space=0x84e150,
> > mem_space=0x867950, type_info=0x7ffffa47e220, fm=0x7ffffa47e380,
> > dx_plist=0x854540)
> > at ../../src/H5Dmpio.c:241
> > #11 0x0000003818ac0050 in H5D__ioinfo_adjust (io_info=0x7ffffa47e2a0,
> > dset=0x854900,
> > dxpl_id=167772189, file_space=0x84e150, mem_space=0x867950,
> > type_info=0x7ffffa47e220,
> > fm=0x7ffffa47e380) at ../../src/H5Dio.c:999
> > #12 0x0000003818abf1bc in H5D__write (dataset=0x854900, 
> > mem_type_id=50331660,
> > mem_space=0x867950, file_space=0x84e150, dxpl_id=167772189,
> > buf=0x7ffffa489170)
> > at ../../src/H5Dio.c:667
> > #13 0x0000003818abd8e9 in H5Dwrite (dset_id=83886083, mem_type_id=50331660,
> > mem_space_id=67108867, file_space_id=67108866, dxpl_id=167772189,
> > buf=0x7ffffa489170)
> > at ../../src/H5Dio.c:265
> > #14 0x00007f407ab6992a in nc4_put_vara (nc=<optimized out>,
> > ncid=ncid@entry=65536,
> > varid=varid@entry=0, startp=startp@entry=0x7ffffa489130,
> > countp=countp@entry=0x7ffffa489150, mem_nc_type=mem_nc_type@entry=4,
> > is_long=is_long@entry=0, data=data@entry=0x7ffffa489170) at
> > ../../libsrc4/nc4hdf.c:795
> > #15 0x00007f407ab6418b in nc4_put_vara_tc (mem_type_is_long=0, 
> > op=0x7ffffa489170,
> > countp=0x7ffffa489150, startp=0x7ffffa489130, mem_type=4, varid=0,
> > ncid=65536)
> > at ../../libsrc4/nc4var.c:1350
> > #16 NC4_put_vara (ncid=65536, varid=0, startp=0x7ffffa489130,
> > countp=0x7ffffa489150,
> > op=0x7ffffa489170, memtype=4) at ../../libsrc4/nc4var.c:1484
> > #17 0x00007f407ab17075 in NC_put_vara (ncid=ncid@entry=65536,
> > varid=varid@entry=0,
> > start=start@entry=0x7ffffa489130, edges=edges@entry=0x7ffffa489150,
> > value=value@entry=0x7ffffa489170, memtype=memtype@entry=4)
> > at ../../libdispatch/dvarput.c:79
> > #18 0x00007f407ab17f0f in nc_put_vara_int (ncid=65536, varid=0,
> > startp=startp@entry=0x7ffffa489130, countp=countp@entry=0x7ffffa489150,
> > op=op@entry=0x7ffffa489170) at ../../libdispatch/dvarput.c:628
> > #19 0x0000000000401010 in main (argc=1, argv=0x7ffffa489648)
> > at ../../nc_test4/tst_parallel.c:138
> >
> >
> > With openmpi 1.6.3, it appears to hang at the previous test:
> >
> > Testing very simple parallel I/O with 4 processors...
> >
> > *** tst_parallel testing very basic parallel access.
> >
> >
> > Similar backtraces:
> >
> > Three with:
> > (gdb) bt
> > #0  0x00000037a6cda4c7 in sched_yield () at 
> > ../sysdeps/unix/syscall-template.S:81
> > #1  0x000000381a317a5d in opal_progress () from 
> > /usr/lib64/openmpi/lib/libmpi.so.1
> > #2  0x000000381a261acd in ompi_request_default_wait_all ()
> > from /usr/lib64/openmpi/lib/libmpi.so.1
> > #3  0x00007f44f1d3a6e7 in ompi_coll_tuned_sendrecv_actual ()
> > from /usr/lib64/openmpi/lib/openmpi/mca_coll_tuned.so
> > #4  0x00007f44f1d423ae in ompi_coll_tuned_barrier_intra_recursivedoubling ()
> > from /usr/lib64/openmpi/lib/openmpi/mca_coll_tuned.so
> > #5  0x000000381a26fc86 in PMPI_Barrier () from 
> > /usr/lib64/openmpi/lib/libmpi.so.1
> > #6  0x00007f44f7cc85c4 in H5AC_rsp__dist_md_write__flush (f=0x1924a30,
> > dxpl_id=167772175,
> > cache_ptr=0x19256d0) at ../../src/H5AC.c:4424
> > #7  0x00007f44f7cc93e1 in H5AC_run_sync_point (f=0x1924a30, 
> > dxpl_id=167772175,
> > sync_point_op=1)
> > at ../../src/H5AC.c:4870
> > #8  0x00007f44f7cc9851 in H5AC_flush_entries (f=0x1924a30) at
> > ../../src/H5AC.c:5050
> > #9  0x00007f44f7cc0ad0 in H5AC_flush (f=0x1924a30, dxpl_id=167772174) at
> > ../../src/H5AC.c:838
> > #10 0x00007f44f7d48d15 in H5F_flush (f=0x1924a30, dxpl_id=167772174, 
> > closing=0)
> > at ../../src/H5F.c:1758
> > #11 0x00007f44f7d553c2 in H5F_flush_mounts_recurse (f=0x1924a30,
> > dxpl_id=167772174)
> > at ../../src/H5Fmount.c:659
> > #12 0x00007f44f7d5557d in H5F_flush_mounts (f=0x1924a30, dxpl_id=167772174)
> > at ../../src/H5Fmount.c:698
> > #13 0x00007f44f7d48a50 in H5Fflush (object_id=16777216, 
> > scope=H5F_SCOPE_GLOBAL)
> > at ../../src/H5F.c:1704
> > #14 0x00007f44f8537adc in sync_netcdf4_file (h5=0x191cc50) at
> > ../../libsrc4/nc4file.c:2964
> > #15 0x00007f44f85389a2 in NC4_enddef (ncid=<optimized out>) at
> > ../../libsrc4/nc4file.c:2922
> > #16 0x00007f44f84ea612 in nc_enddef (ncid=65536) at 
> > ../../libdispatch/dfile.c:786
> > #17 0x0000000000400f88 in main (argc=1, argv=0x7fff476e1958)
> > at ../../nc_test4/tst_parallel.c:111
> >
> > One with:
> > (gdb) bt
> > #0  0x00000037a6cda4c7 in sched_yield () at 
> > ../sysdeps/unix/syscall-template.S:81
> > #1  0x000000381a317a5d in opal_progress () from 
> > /usr/lib64/openmpi/lib/libmpi.so.1
> > #2  0x000000381a261acd in ompi_request_default_wait_all ()
> > from /usr/lib64/openmpi/lib/libmpi.so.1
> > #3  0x00007f26db7d4c99 in ompi_coll_tuned_allreduce_intra_recursivedoubling 
> > ()
> > from /usr/lib64/openmpi/lib/openmpi/mca_coll_tuned.so
> > #4  0x000000381a26e66b in PMPI_Allreduce () from
> > /usr/lib64/openmpi/lib/libmpi.so.1
> > #5  0x00007f26e17be321 in H5D__mpio_opt_possible (io_info=0x7fffbcbb0110,
> > file_space=0x210d3b0, mem_space=0x21afc20, type_info=0x7fffbcbb0090,
> > fm=0x7fffbcbb0200,
> > dx_plist=0x219c7e0) at ../../src/H5Dmpio.c:241
> > #6  0x00007f26e17bc3ee in H5D__ioinfo_adjust (io_info=0x7fffbcbb0110,
> > dset=0x219cba0,
> > dxpl_id=167772189, file_space=0x210d3b0, mem_space=0x21afc20,
> > type_info=0x7fffbcbb0090,
> > fm=0x7fffbcbb0200) at ../../src/H5Dio.c:999
> > #7  0x00007f26e17bb550 in H5D__write (dataset=0x219cba0, 
> > mem_type_id=50331660,
> > mem_space=0x21afc20, file_space=0x210d3b0, dxpl_id=167772189,
> > buf=0x7fffbcbbaff0)
> > at ../../src/H5Dio.c:667
> > #8  0x00007f26e17b9c7d in H5Dwrite (dset_id=83886083, mem_type_id=50331660,
> > mem_space_id=67108867, file_space_id=67108866, dxpl_id=167772189,
> > buf=0x7fffbcbbaff0)
> > at ../../src/H5Dio.c:265
> > #9  0x00007f26e1fd9a6a in nc4_put_vara (nc=<optimized out>,
> > ncid=ncid@entry=65536,
> > varid=varid@entry=0, startp=startp@entry=0x7fffbcbbafb0,
> > countp=countp@entry=0x7fffbcbbafd0, mem_nc_type=mem_nc_type@entry=4,
> > is_long=is_long@entry=0, data=data@entry=0x7fffbcbbaff0) at
> > ../../libsrc4/nc4hdf.c:795
> > #10 0x00007f26e1fd42cb in nc4_put_vara_tc (mem_type_is_long=0, 
> > op=0x7fffbcbbaff0,
> > countp=0x7fffbcbbafd0, startp=0x7fffbcbbafb0, mem_type=4, varid=0,
> > ncid=65536)
> > at ../../libsrc4/nc4var.c:1350
> > #11 NC4_put_vara (ncid=65536, varid=0, startp=0x7fffbcbbafb0,
> > countp=0x7fffbcbbafd0,
> > op=0x7fffbcbbaff0, memtype=4) at ../../libsrc4/nc4var.c:1484
> > #12 0x00007f26e1f871b5 in NC_put_vara (ncid=ncid@entry=65536,
> > varid=varid@entry=0,
> > start=start@entry=0x7fffbcbbafb0, edges=edges@entry=0x7fffbcbbafd0,
> > value=value@entry=0x7fffbcbbaff0, memtype=memtype@entry=4)
> > at ../../libdispatch/dvarput.c:79
> > #13 0x00007f26e1f8804f in nc_put_vara_int (ncid=65536, varid=0,
> > startp=startp@entry=0x7fffbcbbafb0, countp=countp@entry=0x7fffbcbbafd0,
> > op=op@entry=0x7fffbcbbaff0) at ../../libdispatch/dvarput.c:628
> > #14 0x000000000040103b in main (argc=1, argv=0x7fffbcbbb528)
> > at ../../nc_test4/tst_parallel.c:138
> >
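> > In case it helps narrow things down: the H5D__mpio_opt_possible() /
> > MPI_Allreduce frames above are where HDF5 decides whether collective
> > I/O can be used for the write, and netcdf lets you set the variable's
> > parallel access mode explicitly before writing.  A small untested
> > sketch, reusing the placeholder ncid/varid from the sketch above:
> >
> > #include <netcdf.h>
> > #include <netcdf_par.h>
> >
> > /* Switch one variable between independent and collective parallel
> >  * access before writing to it; ncid/varid are assumed to come from
> >  * nc_create_par()/nc_def_var() as in the earlier sketch. */
> > static int set_par_access(int ncid, int varid, int collective)
> > {
> >     return nc_var_par_access(ncid, varid,
> >                              collective ? NC_COLLECTIVE : NC_INDEPENDENT);
> > }
> >
> > It might be interesting whether the hang changes when the write is
> > forced collective vs. independent.
> >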
> > --
> > Orion Poplawski
> > Technical Manager                     303-415-9701 x222
> > NWRA, Boulder Office                  FAX: 303-415-9702
> > 3380 Mitchell Lane                       address@hidden
> > Boulder, CO 80301                   http://www.nwra.com
> >
> >
> 
> Russ Rew                                         UCAR Unidata Program
> address@hidden                      http://www.unidata.ucar.edu
> 
> 
Russ Rew                                         UCAR Unidata Program
address@hidden                      http://www.unidata.ucar.edu



Ticket Details
===================
Ticket ID: PMD-881681
Department: Support netCDF
Priority: Normal
Status: Closed