
[netCDF #TIR-820282]: NetCDF-4 Parallel independent access with unlimited dimension (Fortran 90)



Reto,

I've finally created a Jira ticket for this issue, in case you want to follow
its status:

  https://bugtracking.unidata.ucar.edu/browse/NCF-250

--Russ

> Russ,
> 
> So, I've now also recompiled the whole NetCDF/HDF5 suite with MPICH 3.0.3 
> instead of Openmpi. Same story.
> 
> I've traced the blocking statement down to the HDF5 library, called from the 
> netcdf library during nc_put_vara_int:
> 
> In nc4hdf.c (around line 770) it calls the H5D.c routine H5Dset_extent:
> 
>    if (H5Dset_extent(var->hdf_datasetid, xtend_size) < 0)
>       BAIL(NC_EHDFERR);
> 
> This is where the writing processes wait during an independent write operation 
> involving one unlimited dimension (where the dataset extent needs to be 
> extended) when not all processes take part in the write operation.
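> 
> To make the pattern concrete, here is a self-contained sketch of the kind of 
> program that hangs for me (illustrative names and sizes, not my actual test 
> code): every rank creates the file and defines the metadata collectively, the 
> variable is set to independent access, and rank 0 is then excluded from the 
> write of record 1, which is the write that has to extend the unlimited 
> dimension.
> 
>   program hang_sketch
>     use mpi
>     use netcdf
>     implicit none
>     integer, parameter :: nx = 4            ! points per rank along "x"
>     integer :: ncid, x_dimid, rec_dimid, varid
>     integer :: my_rank, nprocs, ierr
>     integer :: values(nx)
> 
>     call mpi_init(ierr)
>     call mpi_comm_rank(MPI_COMM_WORLD, my_rank, ierr)
>     call mpi_comm_size(MPI_COMM_WORLD, nprocs, ierr)
>     values = my_rank
> 
>     ! All ranks create the file and define the metadata collectively.
>     call check( nf90_create("hang_sketch.nc", ior(NF90_NETCDF4, NF90_MPIIO), &
>                             ncid, comm = MPI_COMM_WORLD, info = MPI_INFO_NULL) )
>     call check( nf90_def_dim(ncid, "x", nx * nprocs, x_dimid) )
>     call check( nf90_def_dim(ncid, "rec", NF90_UNLIMITED, rec_dimid) )
>     call check( nf90_def_var(ncid, "data", NF90_INT, &
>                              (/ x_dimid, rec_dimid /), varid) )
>     call check( nf90_var_par_access(ncid, varid, NF90_INDEPENDENT) )
>     call check( nf90_enddef(ncid) )
> 
>     ! Only ranks > 0 write record 1.  Writing record 1 has to grow the
>     ! unlimited dimension, which is where the H5Dset_extent call above is
>     ! reached; with rank 0 excluded, the writing ranks hang there.
>     if (my_rank > 0) then
>        call check( nf90_put_var(ncid, varid, values, &
>                                 start = (/ my_rank*nx + 1, 1 /), &
>                                 count = (/ nx, 1 /)) )
>     end if
> 
>     call check( nf90_close(ncid) )
>     call mpi_finalize(ierr)
> 
>   contains
>     subroutine check(status)
>       integer, intent(in) :: status
>       if (status /= nf90_noerr) then
>         print *, trim(nf90_strerror(status))
>         call mpi_abort(MPI_COMM_WORLD, 2, ierr)
>       end if
>     end subroutine check
>   end program hang_sketch
> 
> If the "if (my_rank > 0)" guard is removed so that all ranks issue the put, 
> the same program completes normally here.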
> 
> Reto
> 
> 
> On Apr 12, 2013, at 7:32 PM, Unidata netCDF Support wrote:
> 
> > Reto,
> >
> >> Yes, the POSIX parallel I/O tests fail on OSX with OpenMPI, but that is 
> >> fine: OSX and OpenMPI use MPIIO. So to my understanding the parallel 
> >> tests are OK if either POSIX or MPIIO works and the other one fails.
> >>
> >> I am actually not using a parallel file system on OSX. I use the regular 
> >> file system (basic OSX installation) and I think that the parallel I/O has 
> >> to work in collective and independent mode even when using a regular file 
> >> system.
> >
> > I'm curious how you installed parallel HDF5, because my "make check" fails 
> > before finishing
> > the tests.  Did you build HDF5 without --enable-parallel, or without using 
> > CC=mpicc?  Or did
> > you build it with parallel I/O, but run "make install" even though "make 
> > check" failed as a
> > result of not having a parallel file system?
> >
> > --Russ
> >
> >> I will test the same installation on Linux and then start debugging on 
> >> OSX, and maybe we find out something.
> >>
> >> Btw. the netcdf-fortran 4.4 beta failed to compile altogether on OSX, so 
> >> I'm still using netcdf-fortran 4.2.
> >>
> >> Have a great weekend,
> >>
> >> Reto
> >>
> >>
> >> On Apr 12, 2013, at 5:59 PM, Unidata netCDF Support wrote:
> >>
> >>> Reto,
> >>>
> >>>> I've tried the following configuration
> >>>> - hdf5 1.8.11-snap16
> >>>> - netcdf-4.3.0-rc4
> >>>> - netcdf-fortran-4.2
> >>>> - openmpi-1.6.3
> >>>> - gcc/gfortran 4.6.3
> >>>>
> >>>> Same issue. If I let all processes do the write, then it works fine. If 
> >>>> I, for instance, exclude process #0, 1, 2, or 3 from the writing, then the 
> >>>> write hangs (all metadata/open/close is collective, only the write is 
> >>>> independent). It seems to me that somehow on my system all writes are 
> >>>> collective by default and thus the write operation is not executed as 
> >>>> independent.
> >>>>
> >>>> Do you have a configuration with openmpi on OSX somewhere around?
> >>>
> >>> Yes, I had to deactivate my mpich configuration first, but now have 
> >>> openmpi 1.6.4 on
> >>> OSX 10.8.3.  However, when I try to build hdf5 1.8.11-pre1 with it, using
> >>>
> >>> CC=/opt/local/lib/openmpi/bin/mpicc ./configure
> >>> make
> >>> make check
> >>>
> >>> some tests fail in "make check" (for example, testing "ph5diff 
> >>> h5diff_basiccl.h5"), which may be due to not having a POSIX-compliant 
> >>> parallel file system installed.  Also I just noticed that the earlier 
> >>> t_posix_compliant test for allwrite_allread_blocks with POSIX I/O failed, 
> >>> though it returned 0 so as not to stop the hdf5 testing.
> >>>
> >>>
> >>> Are you using a parallel file system?  Do you set the environment variable
> >>> HDF5_PARAPREFIX to a directory in a parallel file system?  What file 
> >>> system are you
> >>> using for your parallel I/O tests?
> >>>
> >>> I'm afraid I don't know much about parallel I/O, and the netCDF parallel 
> >>> I/O expert
> >>> got lured away to a different job some time ago, so we may need some help 
> >>> or pointers
> >>> where to look to install a parallel file system on our OS X platform for 
> >>> this kind of
> >>> testing and debugging.
> >>>
> >>>> I will start putting some debugging commands into the netcdf-fortran 
> >>>> library and see where the process really hangs and whether the 
> >>>> collective/independent write is executed correctly.
> >>>
> >>> Thanks, that would be helpful ...
> >>>
> >>> --Russ
> >>>
> >>>> Reto
> >>>>
> >>>>
> >>>> On Apr 9, 2013, at 11:01 PM, Unidata netCDF Support wrote:
> >>>>
> >>>>> Hi Reto,
> >>>>>
> >>>>> Sorry to have taken so long to respond to your question.
> >>>>>> I have been using NetCDF-4 Parallel I/O with the Fortran 90 interface 
> >>>>>> for some time with success. Thank you for this great tool!
> >>>>>>
> >>>>>> However, I now have an issue with independent access:
> >>>>>>
> >>>>>> - NetCDF F90 Parallel access (NetCDF-4, MPIIO)
> >>>>>> - 3 fixed and 1 unlimited dimension
> >>>>>> - all processes open/close the file and write metadata
> >>>>>> - only a few processes write to the file (-> independent access)
> >>>>>> - the write hangs. It works fine if all processes take part.
> >>>>>>
> >>>>>> I've changed your example F90 parallel I/O file simple_xy_par_wr.f90 
> >>>>>> to include an unlimited dimension and independent access by only a 
> >>>>>> subset of processes. Same issue, even if I explicitly set the access 
> >>>>>> type to independent for the variable. Can you reproduce the issue on 
> >>>>>> your side?
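> >>>>>> 
> >>>>>> In outline, the changes amount to something like this (a simplified 
> >>>>>> sketch with illustrative names, not the attached file itself):
> >>>>>> 
> >>>>>>   ! Added: a record dimension, and independent access for the variable.
> >>>>>>   call check( nf90_def_dim(ncid, "rec", NF90_UNLIMITED, rec_dimid) )
> >>>>>>   call check( nf90_def_var(ncid, "data", NF90_INT, &
> >>>>>>                            (/ x_dimid, rec_dimid /), varid) )
> >>>>>>   call check( nf90_var_par_access(ncid, varid, NF90_INDEPENDENT) )
> >>>>>>   call check( nf90_enddef(ncid) )
> >>>>>> 
> >>>>>>   ! Changed: only a subset of the ranks writes record 1 (independently).
> >>>>>>   if (my_rank > 0) then
> >>>>>>      call check( nf90_put_var(ncid, varid, values, &
> >>>>>>                               start = (/ my_rank*nx + 1, 1 /), &
> >>>>>>                               count = (/ nx, 1 /)) )
> >>>>>>   end if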
> >>>>>>
> >>>>>> The following system configuration on my side:
> >>>>>> - NetCDF 4.2.1.1 and F90 interface 4.2
> >>>>>> - hdf5 1.8.9
> >>>>>> - Openmpi 1.
> >>>>>> - OSX, gcc 4.6.3
> >>>>>
> >>>>> No, I haven't been able to reproduce the issue, but I can't exactly 
> >>>>> duplicate
> >>>>> your configuration easily, and there have been some updates and bug 
> >>>>> fixes that
> >>>>> may have made a difference.
> >>>>>
> >>>>> First I tried this configuration, which worked fine on your attached 
> >>>>> example:
> >>>>>
> >>>>> - NetCDF 4.3.0-rc4 and F90 interface 4.2
> >>>>> - hdf5 1.8.11 (release candidate from svn repository)
> >>>>> - mpich2-1.3.1
> >>>>> - Linux Fedora, mpicc, mpif90 wrapping gcc, gfortran 4.5.1
> >>>>>
> >>>>> So if you can build those versions, it should work for you.  I'm not 
> >>>>> sure whether
> >>>>> the fix is in netCDF-4.3.0 or in hdf5-1.8.11, but both have a fix for 
> >>>>> at least one
> >>>>> parallel I/O hanging process issue:
> >>>>>
> >>>>> https://bugtracking.unidata.ucar.edu/browse/NCF-214  (fix in 
> >>>>> netCDF-4.3.0)
> >>>>> https://bugtracking.unidata.ucar.edu/browse/NCF-240  (fix in 
> >>>>> HDF5-1.8.11)
> >>>>>
> >>>>> --Russ
> >>>>>
> >>>>
> >>>>
> >>>
> >>>
> >>
> >>
> >
> >
> 
> 
Russ Rew                                         UCAR Unidata Program
address@hidden                      http://www.unidata.ucar.edu



Ticket Details
===================
Ticket ID: TIR-820282
Department: Support netCDF
Priority: Emergency
Status: Closed