
[netCDF #RQB-854711]: MPI/IO with unlimited dimensions



Sebastian,

> did you try running it with 2 ranks:
> mpiexec -np 2 ./test
> 
> You may also need to switch to the "wrong result" version:
> //size_t count[] = {1, rank}; // Deadlock
> size_t count[] = {1, rank+1}; // Only one written for all ranks
> Otherwise you have ranks that do not write any data. I'm not sure if
> this is supported by netCDF/HDF5.
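
For reference, here is a minimal sketch of the pattern under discussion, with
per-rank counts following the quoted lines above. This is not the attached
test.cpp; file, dimension, and variable names are illustrative, and error
checking is omitted for brevity:

    #include <mpi.h>
    #include <netcdf.h>
    #include <netcdf_par.h>

    int main(int argc, char **argv)
    {
        int rank, ncid, dimids[2], varid, data[2];

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        /* Create a file for parallel access; the variable's first
         * dimension is unlimited, which triggers the extend logic. */
        nc_create_par("test.nc", NC_NETCDF4 | NC_MPIIO, MPI_COMM_WORLD,
                      MPI_INFO_NULL, &ncid);
        nc_def_dim(ncid, "time", NC_UNLIMITED, &dimids[0]);
        nc_def_dim(ncid, "x", 10, &dimids[1]);
        nc_def_var(ncid, "var", NC_INT, 2, dimids, &varid);
        nc_var_par_access(ncid, varid, NC_COLLECTIVE);
        nc_enddef(ncid);

        /* With 2 ranks, rank 0 writes zero elements here (the "deadlock"
         * case); the commented-out line is the "wrong result" variant. */
        size_t start[] = {0, (size_t)rank};
        size_t count[] = {1, (size_t)rank};        /* Deadlock */
        /* size_t count[] = {1, (size_t)rank + 1};    Only one written */
        data[0] = data[1] = rank;

        nc_put_vara_int(ncid, varid, start, count, data);

        nc_close(ncid);
        MPI_Finalize();
        return 0;
    }

Built with an MPI compiler wrapper against a parallel netCDF/HDF5 and run
as "mpiexec -np 2 ./test", this is the shape of program that deadlocks or
produces a wrong file, depending on which "count" line is active.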

Here's a possible fix from Quincey Koziol:

        I think it's failing on his system because he has an older version of 
OpenMPI, and we're using newer versions on our Macs.  Can he try this patch and 
see if it fixes the problem:

--- a/libsrc4/nc4hdf.c
+++ b/libsrc4/nc4hdf.c
@@ -798,11 +798,11 @@ nc4_put_vara(NC *nc, int ncid, int varid, const size_t *st
                 BAIL(NC_ECANTEXTEND);
 
             /* Reach consensus about dimension sizes to extend to */
-            /* (Note: Somewhat hackish, with the use of MPI_BYTE, but MPI_MAX is
+            /* (Note: Somewhat hackish, with the use of MPI_INTEGER, but MPI_MAX is
              *        correct with this usage, as long as it's not executed on
              *        heterogenous systems)
              */
-            if(MPI_SUCCESS != MPI_Allreduce(MPI_IN_PLACE, &xtend_size, (var->ndims * sizeof(hsize_t)), MPI_BYTE, MPI_MAX, h5->comm))
+            if(MPI_SUCCESS != MPI_Allreduce(MPI_IN_PLACE, &xtend_size, (var->ndims * (sizeof(hsize_t) / sizeof(int))), MPI_INTEGER, MPI_MAX, h5->comm))
                 BAIL(NC_EMPI);
          }
 #endif /* USE_PARALLEL */
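
To try it, save the patch to a file (the name below is just for
illustration) and apply it from the top of the netcdf-c source tree,
then rebuild and reinstall the library:

    patch -p1 < nc4hdf-allreduce.patch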


        Still not a great solution, but since hsize_t is defined by HDF5 and 
doesn't map easily to a predefined MPI datatype, it's going to be a little bit 
of a hack no matter what.  (We use the MPI_BYTE type for this purpose most of 
the time in the HDF5 library, but we aren't calling MPI_Allreduce, so that's 
probably what's tripping this code up)
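
For background, the error from OpenMPI 1.4.3 is legitimate: the MPI
standard defines MPI_MAX only for integer and floating-point types, not
for MPI_BYTE, and newer MPI implementations are apparently just more
lenient about it. Note also that MPI_INTEGER is MPI's Fortran integer
type; the usual C counterpart is MPI_INT. Another possibility, sketched
here assuming that xtend_size is an array of var->ndims hsize_t values
(as the byte count in the patch suggests) and that hsize_t fits in an
unsigned long long, would be to stage the reduction through a
fixed-width buffer:

    /* Sketch only, not a tested fix: copy the extend sizes into a buffer
     * whose element type MPI_MAX is defined on, reduce, and copy back. */
    unsigned long long sizes[NC_MAX_VAR_DIMS];
    int d;

    for (d = 0; d < var->ndims; d++)
        sizes[d] = (unsigned long long)xtend_size[d];
    if (MPI_SUCCESS != MPI_Allreduce(MPI_IN_PLACE, sizes, var->ndims,
                                     MPI_UNSIGNED_LONG_LONG, MPI_MAX, h5->comm))
        BAIL(NC_EMPI);
    for (d = 0; d < var->ndims; d++)
        xtend_size[d] = (hsize_t)sizes[d];

This avoids both the MPI_BYTE reduction and the patch's assumption that
sizeof(hsize_t) is a multiple of sizeof(int).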

--Russ

> On 11.09.2013 23:55, Unidata netCDF Support wrote:
> > Hi Sebastian,
> >
> >> I tried to compile with netCDF 4.3.1-rc2, but now my program
> >> crashes because of an MPI error:
> >>
> >> *** An error occurred in MPI_Allreduce: the reduction operation MPI_MAX
> >> is not defined on the MPI_BYTE datatype
> >> *** on communicator MPI COMMUNICATOR 4 DUP FROM 0
> >> *** MPI_ERR_OP: invalid reduce operation
> >> *** MPI_ERRORS_ARE_FATAL (your MPI job will now abort)
> >>
> >> I'm using OpenMPI 1.4.3.
> >
> > I'm assuming the program that crashes is the test.cpp you attached in
> > your original support question.  I tried to duplicate the problem using
> > OpenMPI 1.7.2_1 on an OSX platform, and got a different error:
> >
> >    $ mpicxx test.cpp -o test -I${NCDIR}/include -I${H5DIR}/include -L${NCDIR}/lib -L${H5DIR}/lib -lnetcdf -lhdf5_hl -lhdf5 -ldl -lm -lz -lcurl
> >    $ ./test
> >    Start on rank 0: 0 0
> >    Count on rank 0: 1 0
> >    Assertion failed: (size), function H5MM_calloc, file ../../src/H5MM.c, line 95.
> >    [mort:71677] *** Process received signal ***
> >    [mort:71677] Signal: Abort trap: 6 (6)
> >    [mort:71677] Signal code:  (0)
> >    [mort:71677] [ 0] 2   libsystem_c.dylib   0x00007fff939b994a _sigtramp + 26
> >    [mort:71677] [ 1] 3   ???                 0x0000000000000000 0x0 + 0
> >    [mort:71677] [ 2] 4   libsystem_c.dylib   0x00007fff93a11e2a __assert_rtn + 146
> >    [mort:71677] [ 3] 5   test                0x0000000108eeea10 H5MM_calloc + 256
> >    [mort:71677] [ 4] 6   test                0x0000000108d4ca3e H5D__chunk_io_init + 1534
> >    [mort:71677] [ 5] 7   test                0x0000000108d8a45c H5D__write + 4028
> >    [mort:71677] [ 6] 8   test                0x0000000108d87460 H5D__pre_write + 3552
> >    [mort:71677] [ 7] 9   test                0x0000000108d8658c H5Dwrite + 732
> >    [mort:71677] [ 8] 10  test                0x0000000108c8ac27 nc4_put_vara + 3991
> >    [mort:71677] [ 9] 11  test                0x0000000108ca0564 nc4_put_vara_tc + 164
> >    [mort:71677] [10] 12  test                0x0000000108ca04ab NC4_put_vara + 75
> >    [mort:71677] [11] 13  test                0x0000000108c08240 NC_put_vara + 288
> >    [mort:71677] [12] 14  test                0x0000000108c092d4 nc_put_vara_int + 100
> >    [mort:71677] [13] 15  test                0x0000000108bf2e56 main + 630
> >    [mort:71677] [14] 16  libdyld.dylib       0x00007fff886fd7e1 start + 0
> >    [mort:71677] [15] 17  ???                 0x0000000000000001 0x0 + 1
> >    [mort:71677] *** End of error message ***
> >    Abort
> >
> >> I think the bug was introduced in this commit:
> >> https://github.com/Unidata/netcdf-c/pull/4
> >
> > We're looking at the problem, thanks for reporting it.
> >
> > --Russ
> >
> >> Best regards,
> >> Sebastian
> >>
> >> On 22.08.2013 18:28, Unidata netCDF Support wrote:
> >>> Hi Sebastian,
> >>>
> >>>> my problem sounds similar to the bug, but it is different. My program
> >>>> also hangs when using collective MPI I/O.
> >>>>
> >>>> According to the bug report, only an issue with independent I/O was 
> >>>> fixed.
> >>>
> >>> You're right, but we think we have a fix for the collective I/O hang now,
> >>> available in the netCDF-C 4.3.1-rc2 version (a release candidate):
> >>>
> >>>     https://github.com/Unidata/netcdf-c/releases/tag/v4.3.1-rc2
> >>>
> >>> At your convenience, please let us know if it fixes the problem.
> >>>
> >>> --Russ
> >>>
> >>>> On 06.08.2013 00:09, Unidata netCDF Support wrote:
> >>>>> Hi Sebastian,
> >>>>>
> >>>>> Could you tell us if this recently fixed bug sounds like what you
> >>>>> found?
> >>>>>
> >>>>>      https://bugtracking.unidata.ucar.edu/browse/NCF-250
> >>>>>
> >>>>> If so, the fix will be in netCDF release 4.3.1, a release candidate
> >>>>> for which will soon be announced.
> >>>>>
> >>>>> --Russ
> >>>>>
> >>>>>> Hi everybody,
> >>>>>>
> >>>>>> I just figured out that using collective MPI I/O on variables with
> >>>>>> unlimited dimensions can lead to deadlocks or incorrect output files.
> >>>>>>
> >>>>>> I have attached a small example program that can reproduce the
> >>>>>> deadlock (or incorrect output files, depending on the variable
> >>>>>> "count").
> >>>>>>
> >>>>>> Did I do anything wrong or is this a known bug?
> >>>>>>
> >>>>>> My configuration:
> >>>>>> hdf5 1.8.11
> >>>>>> netcdf 4.3
> >>>>>> openmpi (default ubuntu installation)
> >>>>>>
> >>>>>> Compile command:
> >>>>>> mpicxx test.cpp -I/usr/local/include -L/usr/local/lib -lnetcdf -lhdf5_hl -lhdf5 -lz
> >>>>>> (netcdf and hdf5 are installed in /usr/local)
> >>>>>>
> >>>>>> Best regards,
> >>>>>> Sebastian
> >>>>>>
> >>>>>> --
> >>>>>> Sebastian Rettenberger, M.Sc.
> >>>>>> Technische Universität München
> >>>>>> Department of Informatics
> >>>>>> Chair of Scientific Computing
> >>>>>> Boltzmannstrasse 3, 85748 Garching, Germany
> >>>>>> http://www5.in.tum.de/
> >>>>>>
> >>>>>>
> 
> --
> Sebastian Rettenberger, M.Sc.
> Technische Universität München
> Department of Informatics
> Chair of Scientific Computing
> Boltzmannstrasse 3, 85748 Garching, Germany
> http://www5.in.tum.de/
> 
> 
> 

Russ Rew                                         UCAR Unidata Program
address@hidden                      http://www.unidata.ucar.edu



Ticket Details
===================
Ticket ID: RQB-854711
Department: Support netCDF
Priority: Normal
Status: Closed