
[netCDF #TIR-820282]: NetCDF-4 Parallel independent access with unlimited dimension (Fortran 90)



> Thank you very much.
> 
> I have one last question about why my example never caused problems on 
> your side: did you actually run my example with 4 processes, i.e. 
> mpirun -n 4?

No, sorry, I never saw any mention of using mpirun in your report on the 
problem.  That would probably explain
why it worked for me.  I guess we should be using mpirun in our tests!

--Russ

> I had the same hang also on the NCAR yellowstone supercomputer with the 
> standard NetCDF 4.2 / HDF5 1.8.9 install they have there.
> 
> Reto
> 
> On Apr 26, 2013, at 11:34 PM, Unidata netCDF Support wrote:
> 
> > Reto,
> >
> > I've finally created a Jira ticket for this issue, in case you want to 
> > follow
> > its status:
> >
> >  https://bugtracking.unidata.ucar.edu/browse/NCF-250
> >
> > --Russ
> >
> >> Russ,
> >>
> >> So, I've now also recompiled the whole NetCDF/HDF5 suite with MPICH 3.0.3 
> >> instead of OpenMPI.  Same story.
> >>
> >> I've traced the blocking statement down to the HDF5 routine that the 
> >> netCDF library calls during nc_put_vara_int:
> >>
> >> In nc4hdf.c (around line 770) it is calling the H5D.c routine 
> >> H5Dset_extent:
> >> if (H5Dset_extent(var->hdf_datasetid, xtend_size) < 0)
> >> BAIL(NC_EHDFERR);
> >>
> >> This is where writing processes block during an independent write operation 
> >> involving 1 unlimited dimension (where the dataset extent needs to be 
> >> extended) when not all processes take part in the write operation.
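> >>
> >> To make the pattern concrete, here is a minimal sketch (hypothetical file 
> >> and variable names, not my actual test program) of the sequence that 
> >> hangs, assuming H5Dset_extent is collective in parallel HDF5 so every 
> >> rank must reach it:
> >>
> >>   program subset_write
> >>     use mpi
> >>     use netcdf
> >>     implicit none
> >>     integer :: ierr, rank, nprocs, ncid, dimids(2), varid
> >>
> >>     call MPI_Init(ierr)
> >>     call MPI_Comm_rank(MPI_COMM_WORLD, rank, ierr)
> >>     call MPI_Comm_size(MPI_COMM_WORLD, nprocs, ierr)
> >>
> >>     ! Collective: all ranks create the file and define metadata.
> >>     call check( nf90_create('subset.nc', ior(nf90_netcdf4, nf90_mpiio), &
> >>                             ncid, comm=MPI_COMM_WORLD, info=MPI_INFO_NULL) )
> >>     call check( nf90_def_dim(ncid, 'x', nprocs, dimids(1)) )
> >>     call check( nf90_def_dim(ncid, 'time', nf90_unlimited, dimids(2)) )
> >>     call check( nf90_def_var(ncid, 'v', nf90_int, dimids, varid) )
> >>     call check( nf90_enddef(ncid) )
> >>     call check( nf90_var_par_access(ncid, varid, nf90_independent) )
> >>
> >>     ! Independent: only ranks > 0 write the first record.  The write has
> >>     ! to grow the unlimited dimension, which triggers H5Dset_extent in
> >>     ! nc4hdf.c; rank 0 never makes the matching call, so writers block.
> >>     if (rank > 0) then
> >>        call check( nf90_put_var(ncid, varid, (/ rank /), &
> >>                                 start=(/ rank+1, 1 /), count=(/ 1, 1 /)) )
> >>     end if
> >>
> >>     call check( nf90_close(ncid) )   ! collective
> >>     call MPI_Finalize(ierr)
> >>   contains
> >>     subroutine check(status)
> >>       integer, intent(in) :: status
> >>       if (status /= nf90_noerr) stop 'netCDF error'
> >>     end subroutine check
> >>   end program subset_write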
> >>
> >> Reto
> >>
> >>
> >> On Apr 12, 2013, at 7:32 PM, Unidata netCDF Support wrote:
> >>
> >>> Reto,
> >>>
> >>>> Yes, the POSIX parallel I/O tests fail on OSX with OpenMPI, but that is 
> >>>> fine: OSX and OpenMPI use MPI-IO.  So to my understanding the parallel 
> >>>> tests are OK if either POSIX or MPI-IO works and the other one fails.
> >>>>
> >>>> I am actually not using a parallel file system on OSX. I use the regular 
> >>>> file system (basic OSX installation) and I think that the parallel I/O 
> >>>> has to work in collective and independent mode even when using a regular 
> >>>> file system.
> >>>
> >>> I'm curious how you installed parallel HDF5, because my "make check" 
> >>> fails before finishing
> >>> the tests.  Did you build HDF5 without --enable-parallel, or without 
> >>> using CC=mpicc?  Or did
> >>> you build it with parallel I/O, but run "make install" even though "make 
> >>> check" failed as a
> >>> result of not having a parallel file system?
> >>>
> >>> --Russ
> >>>
> >>>> I will test the same installation on Linux and then start debugging on 
> >>>> OSX, and maybe we find out something.
> >>>>
> >>>> Btw. the netcdf-fortran 4.4 beta failed to compile altogether on OSX, 
> >>>> so I'm still using netcdf-fortran 4.2.
> >>>>
> >>>> Have a great weekend,
> >>>>
> >>>> Reto
> >>>>
> >>>>
> >>>> On Apr 12, 2013, at 5:59 PM, Unidata netCDF Support wrote:
> >>>>
> >>>>> Reto,
> >>>>>
> >>>>>> I've tried the following configuration
> >>>>>> - hdf5 1.8.11-snap16
> >>>>>> - netcdf-4.3.0-rc4
> >>>>>> - netcdf-fortran-4.2
> >>>>>> - openmpi-1.6.3
> >>>>>> - gcc/gfortran 4.6.3
> >>>>>>
> >>>>>> Same issue.  If I let all processes do the write, then it works fine. 
> >>>>>> If I, for instance, exclude process #0, 1, 2, or 3 from the writing, 
> >>>>>> then the write hangs (all metadata/open/close is collective; only the 
> >>>>>> write is independent).  It seems to me that somehow on my system all 
> >>>>>> writes are collective by default, and thus the write operation is not 
> >>>>>> executed as independent.
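> >>>>>>
> >>>>>> A commonly suggested workaround (a sketch only, not verified on this 
> >>>>>> configuration): make record writes that grow the unlimited dimension 
> >>>>>> collective, with non-writing ranks passing a zero-sized count so 
> >>>>>> every rank participates in the call:
> >>>>>>
> >>>>>>   ! Switch the variable to collective access ...
> >>>>>>   call check( nf90_var_par_access(ncid, varid, nf90_collective) )
> >>>>>>   ! ... then have every rank call nf90_put_var; ranks with nothing
> >>>>>>   ! to write pass count = 0 so the collective call still matches up.
> >>>>>>   if (rank > 0) then
> >>>>>>      call check( nf90_put_var(ncid, varid, (/ rank /), &
> >>>>>>                               start=(/ rank+1, 1 /), count=(/ 1, 1 /)) )
> >>>>>>   else
> >>>>>>      call check( nf90_put_var(ncid, varid, (/ 0 /), &
> >>>>>>                               start=(/ 1, 1 /), count=(/ 0, 1 /)) )
> >>>>>>   end if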
> >>>>>>
> >>>>>> Do you have a configuration with openmpi on OSX somewhere around?
> >>>>>
> >>>>> Yes, I had to deactivate my mpich configuration first, but I now have 
> >>>>> openmpi 1.6.4 on OSX 10.8.3.  However, when I try to build hdf5 
> >>>>> 1.8.11-pre1 with it, using
> >>>>>
> >>>>> CC=/opt/local/lib/openmpi/bin/mpicc ./configure
> >>>>> make
> >>>>> make check
> >>>>>
> >>>>> some tests fail in "make check", for example testing "ph5diff 
> >>>>> h5diff_basiccl.h5"; that may be due to not having a POSIX-compliant 
> >>>>> parallel file system installed.  Also I just noticed that the earlier 
> >>>>> t_posix_compliant test for allwrite_allread_blocks with POSIX I/O 
> >>>>> failed, though it returned 0 so as not to stop the hdf5 testing.
> >>>>>
> >>>>>
> >>>>> Are you using a parallel file system?  Do you set the environment 
> >>>>> variable
> >>>>> HDF5_PARAPREFIX to a directory in a parallel file system?  What file 
> >>>>> system are you
> >>>>> using for your parallel I/O tests?
> >>>>>
> >>>>> I'm afraid I don't know much about parallel I/O, and the netCDF 
> >>>>> parallel I/O expert
> >>>>> got lured away to a different job some time ago, so we may need some 
> >>>>> help or pointers
> >>>>> where to look to install a parallel file system on our OS X platform 
> >>>>> for this kind of
> >>>>> testing and debugging.
> >>>>>
> >>>>>> I will start putting some debugging commands into the netcdf-fortran 
> >>>>>> library and see where the process really hangs and whether the 
> >>>>>> collective/independent write is executed correctly.
> >>>>>
> >>>>> Thanks, that would be helpful ...
> >>>>>
> >>>>> --Russ
> >>>>>
> >>>>>> Reto
> >>>>>>
> >>>>>>
> >>>>>> On Apr 9, 2013, at 11:01 PM, Unidata netCDF Support wrote:
> >>>>>>
> >>>>>>> Hi Reto,
> >>>>>>>
> >>>>>>> Sorry to have taken so long to respond to your question.
> >>>>>>>> I have been using NetCDF-4 Parallel I/O with the Fortran 90 
> >>>>>>>> interface for some time with success. Thank you for this great tool!
> >>>>>>>>
> >>>>>>>> However, I now have an issue with independent access:
> >>>>>>>>
> >>>>>>>> - NetCDF F90 Parallel access (NetCDF-4, MPIIO)
> >>>>>>>> - 3 fixed and 1 unlimited dimension
> >>>>>>>> - all processes open/close the file and write metadata
> >>>>>>>> - only a few processes write to the file (-> independent access)
> >>>>>>>> - the write hangs; it works fine if all processes take part.
> >>>>>>>>
> >>>>>>>> I've changed your example F90 parallel I/O file simple_xy_par_wr.f90 
> >>>>>>>> to include an unlimited dimension and independent access by only a 
> >>>>>>>> subset of processes.  Same issue, even if I explicitly set the access 
> >>>>>>>> type to independent for the variable.  Can you reproduce the issue on 
> >>>>>>>> your side?
> >>>>>>>>
> >>>>>>>> My system configuration is as follows:
> >>>>>>>> - NetCDF 4.2.1.1 and F90 interface 4.2
> >>>>>>>> - hdf5 1.8.9
> >>>>>>>> - Openmpi 1.
> >>>>>>>> - OSX, gcc 4.6.3
> >>>>>>>
> >>>>>>> No, I haven't been able to reproduce the issue, but I can't exactly 
> >>>>>>> duplicate
> >>>>>>> your configuration easily, and there have been some updates and bug 
> >>>>>>> fixes that
> >>>>>>> may have made a difference.
> >>>>>>>
> >>>>>>> First I tried this configuration, which worked fine on your attached 
> >>>>>>> example:
> >>>>>>>
> >>>>>>> - NetCDF 4.3.0-rc4 and F90 interface 4.2
> >>>>>>> - hdf5 1.8.11 (release candidate from svn repository)
> >>>>>>> - mpich2-1.3.1
> >>>>>>> - Linux Fedora, mpicc, mpif90 wrapping gcc, gfortran 4.5.1
> >>>>>>>
> >>>>>>> So if you can build those versions, it should work for you.  I'm not 
> >>>>>>> sure whether
> >>>>>>> the fix is in netCDF-4.3.0 or in hdf5-1.8.11, but both have a fix for 
> >>>>>>> at least one
> >>>>>>> parallel I/O hanging process issue:
> >>>>>>>
> >>>>>>> https://bugtracking.unidata.ucar.edu/browse/NCF-214  (fix in 
> >>>>>>> netCDF-4.3.0)
> >>>>>>> https://bugtracking.unidata.ucar.edu/browse/NCF-240  (fix in 
> >>>>>>> HDF5-1.8.11)
> >>>>>>>
> >>>>>>> --Russ
> >>>>>>>
> >>>>>>>
> >>>>>>
> >>>>>>
> >>>>>
> >>>>>
> >>>>
> >>>>
> >>>
> >>>
> >>
> >>
> >
> 
> 

Russ Rew                                         UCAR Unidata Program
address@hidden                      http://www.unidata.ucar.edu



Ticket Details
===================
Ticket ID: TIR-820282
Department: Support netCDF
Priority: Emergency
Status: Closed