
[netCDF #TIR-820282]: NetCDF-4 Parallel independent access with unlimited dimension (Fortran 90)



Reto,

> Yes, the POSIX parallel I/O tests fail on OSX with OpenMPI, but that is fine.
> OSX and OpenMPI use MPIIO, so to my understanding the parallel tests are OK
> if either POSIX or MPIIO works and the other one fails.
>
> I am actually not using a parallel file system on OSX. I use the regular file
> system (basic OSX installation), and I think that parallel I/O has to work
> in collective and independent mode even when using a regular file system.

I'm curious how you installed parallel HDF5, because my "make check" fails
before finishing the tests.  Did you build HDF5 without --enable-parallel, or
without using CC=mpicc?  Or did you build it with parallel I/O, but run
"make install" even though "make check" failed as a result of not having a
parallel file system?

--Russ

> I will test the same installation on Linux and then start debugging on OSX, 
> and maybe we find out something.
> 
> Btw. the netcdf-fortran 4.4 beta failed to compile altogether on OSX, so I'm
> still using netcdf-fortran 4.2.
> 
> Have a great weekend,
> 
> Reto
> 
> 
> On Apr 12, 2013, at 5:59 PM, Unidata netCDF Support wrote:
> 
> > Reto,
> >
> >> I've tried the following configuration
> >> - hdf5 1.8.11-snap16
> >> - netcdf-4.3.0-rc4
> >> - netcdf-fortran-4.2
> >> - openmpi-1.6.3
> >> - gcc/gfortran 4.6.3
> >>
> >> Same issue. If I let all processes do the write, then it works fine. If I
> >> exclude, for instance, process #0, 1, 2 or 3 from the writing, then the
> >> write hangs (all metadata/open/close is collective, only the write is
> >> independent). It seems to me that somehow on my system all writes are
> >> collective by default and thus the write operation is not executed as
> >> independent.
> >>
> >> Do you have a configuration with openmpi on OSX somewhere around?
> >
> > Yes, I had to deactivate my mpich configuration first, but now have openmpi
> > 1.6.4 on OSX 10.8.3.  However, when I try to build hdf5 1.8.11-pre1 with it,
> > using
> >
> >  CC=/opt/local/lib/openmpi/bin/mpicc ./configure
> >  make
> >  make check
> >
> > some tests fail in "make check", for example testing "ph5diff
> > h5diff_basiccl.h5".  That may be due to not having a POSIX-compliant
> > parallel file system installed.  Also, I just noticed that the earlier
> > t_posix_compliant test for allwrite_allread_blocks with POSIX IO failed,
> > though it returned 0 so as not to stop the hdf5 testing.
> >
> >
> > Are you using a parallel file system?  Do you set the environment variable
> > HDF5_PARAPREFIX to a directory in a parallel file system?  What file system
> > are you using for your parallel I/O tests?
> >
> > I'm afraid I don't know much about parallel I/O, and the netCDF parallel
> > I/O expert got lured away to a different job some time ago, so we may need
> > some help or pointers on where to look to install a parallel file system on
> > our OS X platform for this kind of testing and debugging.
> >
> >> I will start putting some debugging commands into the netcdf-fortran 
> >> library and see where the process really hangs and whether the 
> >> collective/independent write is executed correctly.
> >
> > Thanks, that would be helpful ...
> >
> > --Russ
> >
> >> Reto
> >>
> >>
> >> On Apr 9, 2013, at 11:01 PM, Unidata netCDF Support wrote:
> >>
> >>> Hi Reto,
> >>>
> >>> Sorry to have taken so long to respond to your question.
> >>>
> >>>> I have been using NetCDF-4 Parallel I/O with the Fortran 90 interface 
> >>>> for some time with success. Thank you for this great tool!
> >>>>
> >>>> However, I now have an issue with independent access:
> >>>>
> >>>> - NetCDF F90 Parallel access (NetCDF-4, MPIIO)
> >>>> - 3 fixed and 1 unlimited dimension
> >>>> - all processes open/close the file and write metadata
> >>>> - only a few processes write to the file (-> independent access)
> >>>> - the write hangs. It works fine if all processes take part.
> >>>>
> >>>> I've changed your example F90 parallel I/O file simple_xy_par_wr.f90 to
> >>>> include an unlimited dimension and independent access by only a subset
> >>>> of processes. Same issue, even if I explicitly set the access type to
> >>>> independent for the variable. Can you reproduce the issue on your side?
> >>>>
> >>>> The following system configuration on my side:
> >>>> - NetCDF 4.2.1.1 and F90 interface 4.2
> >>>> - hdf5 1.8.9
> >>>> - Openmpi 1.
> >>>> - OSX, gcc 4.6.3
> >>>
> >>> No, I haven't been able to reproduce the issue, but I can't exactly
> >>> duplicate your configuration easily, and there have been some updates
> >>> and bug fixes that may have made a difference.
> >>>
> >>> First I tried this configuration, which worked fine on your attached
> >>> example:
> >>>
> >>> - NetCDF 4.3.0-rc4 and F90 interface 4.2
> >>> - hdf5 1.8.11 (release candidate from svn repository)
> >>> - mpich2-1.3.1
> >>> - Linux Fedora, mpicc, mpif90 wrapping gcc, gfortran 4.5.1
> >>>
> >>> So if you can build those versions, it should work for you.  I'm not sure
> >>> whether the fix is in netCDF-4.3.0 or in hdf5-1.8.11, but both have a fix
> >>> for at least one parallel I/O hanging process issue:
> >>>
> >>> https://bugtracking.unidata.ucar.edu/browse/NCF-214  (fix in netCDF-4.3.0)
> >>> https://bugtracking.unidata.ucar.edu/browse/NCF-240  (fix in HDF5-1.8.11)
> >>>
> >>> --Russ
> >>>
> >>> Russ Rew                                         UCAR Unidata Program
> >>> address@hidden                      http://www.unidata.ucar.edu
> >>>
> >>>
> >>>
> >>> Ticket Details
> >>> ===================
> >>> Ticket ID: TIR-820282
> >>> Department: Support netCDF
> >>> Priority: High
> >>> Status: Closed
> >>>
> >>
> >>
> 
> 
