
[netCDF #PDL-125161]: Writing parallel files with zero-size chunks



Hi Thomas,

> I have an MPI-parallel application with a decomposition such that each array
> is completely handled by one process for I/O purposes and the arrays are
> distributed in a round-robin fashion, i.e. task 0 holds all of array A, task 1
> holds all of array B and so forth.
> 
> My expectation was that I could write this with netcdf4 parallel I/O, so I
> compiled netcdf 4.2.1.1 for OpenMPI 1.4.2 and hdf5 1.8.9 on Debian GNU/Linux
> x86_64 and started testing.
> 
> Unfortunately, when I issue nc_create_par with NC_MPIPOSIX and
> nc_var_par_access with flag NC_INDEPENDENT, I only get invalid output; when I
> change the nc_create_par option to NC_MPIIO, the program hangs on nc_close.
> 
> I've reduced my use-case to a small test mostly resembling one of the
> demonstration programs. I think the most relevant part is that the processes
> not having any elements from the array each use start and count values of 0
> for every dimension.
> 
> Please see the attached files for more information.
> 
> When running the attached program with
> 
> $ mpirun -n 5 ./nc4partest
> mpi_name: taifun size: 5 rank: 0, isDataWriter=0
> mpi_name: taifun size: 5 rank: 1, isDataWriter=0
> mpi_name: taifun size: 5 rank: 2, isDataWriter=1
> mpi_name: taifun size: 5 rank: 4, isDataWriter=0
> mpi_name: taifun size: 5 rank: 3, isDataWriter=0
> mpi_rank=1 start[0]=0 start[1]=0 count[0]=0 count[1]=0
> mpi_rank=2 start[0]=0 start[1]=0 count[0]=24 count[1]=24
> mpi_rank=0 start[0]=0 start[1]=0 count[0]=0 count[1]=0
> mpi_rank=3 start[0]=0 start[1]=0 count[0]=0 count[1]=0
> mpi_rank=4 start[0]=0 start[1]=0 count[0]=0 count[1]=0
> 
> and from this point on the program hangs.
> 
> I've tried to locate a hint how to use the nc_put_vara_int call for this case
> but found nothing.
> 
> Do I have to redistribute the data before writing? Are there other values for
> start/count I could use?

I just succeeded in running a test case that used count[0] = 0 on an MPI
parallel file system using the netCDF-4 parallel I/O inherited from HDF5, and
it ran fine.

The test I ran just inserted the following code in a loop after line 136 in
nc_test4/tst_parallel.c:

       /* Check whether a zero-length count in dimension 0 returns an error,
          then restore the original count */
       count_save = count[0];
       count[0] = 0;
       if (nc_put_vara_int(ncid, v1id, start, count, slab_data)) ERR;
       count[0] = count_save;

I discussed this with CISL consultants, and it appears the problem may be
platform-specific.

--Russ

Russ Rew                                         UCAR Unidata Program
address@hidden                      http://www.unidata.ucar.edu



Ticket Details
===================
Ticket ID: PDL-125161
Department: Support netCDF
Priority: Normal
Status: Closed

