
[netCDF #TDA-721004]: NetCDF-C: failed regression tests inside a SLURM/Docker container



Hello Carl,

We use Docker for our regression testing as well, for serial and MPI-based 
builds, and I'm not currently seeing the same issue that you are.  This makes 
me suspect it is something specific to SLURM/Pyxis (neither of which I am 
terribly familiar with).  

You can run our mpich-based parallel tests with the following docker command:

    $ docker run --rm -it -e TESTPROC=16 -e USEAC=TRUE unidata/nctests:mpich

You can tweak the value of TESTPROC to tell the docker container how many 
processors to use.  
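
For example, to run the same image with a smaller processor count (the value 4 
here is just an illustration; USEAC=TRUE is carried over from the command 
above):

    $ docker run --rm -it -e TESTPROC=4 -e USEAC=TRUE unidata/nctests:mpich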

I'm at a bit of a loss due to my SLURM/Pyxis blind spot.  I do know that 
errors similar to the ones you reported can show up when using the openmpi 
package (instead of mpich2), but that is apparently a known issue.  If nothing 
else, perhaps your
trained eye can compare what the unidata/nctests:mpich docker container is 
doing against your local docker containers.  If you run the container 
interactively (e.g. docker run --rm -it unidata/nctests:mpich bash), you will 
find the config file used to build the image, Dockerfile.mpich, as well as the 
shell script used to run the tests, run_par_tests.sh.
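
If it helps, one way to pull those two files out for a side-by-side comparison 
with your own containers (I'm not sure of their exact paths inside the image, 
so the find locates them first):

    $ docker run --rm -it unidata/nctests:mpich bash
    # then, inside the container:
    $ cat $(find / -name Dockerfile.mpich 2>/dev/null)
    $ cat $(find / -name run_par_tests.sh 2>/dev/null)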

Are your docker images on a public repo that I can pull them down from? I would 
be happy to take a look at them hands-on.

Thanks, have a great day,

-Ward

> I'm starting to build NetCDF & other libraries inside Docker containers,
> and have been running the make check regression-tests to validate the
> installs.
> 
> I'm running them under SLURM/Pyxis so the usual mpirun & mpiexec might
> not be working the way I'm used to.
> 
> In particular, all the PNetCDF regressions were failing until I added
> these extra settings
> 
> make TESTSEQRUN="mpirun -n 1" TESTMPIRUN="mpiexec -n NP" -i -k check
> 
> because (I think) the test harness wasn't recognizing the parallel
> environment and so wasn't invoking mpiexec by default.
> 
> With NetCDF-C I'm seeing the make check use this command, for example
> 
> exec /usr/local/src/netcdf-c-4.7.4/nc_test/.libs/t_nc
> 
> which gives an error:
> 
> [circe-n001:06587] OPAL ERROR: Not initialized in file
> pmix3x_client.c at line 112
> --------------------------------------------------------------------------
> The application appears to have been direct launched using "srun",
> but OMPI was not built with SLURM's PMI support and therefore cannot
> execute. There are several options for building PMI support under
> SLURM, depending upon the SLURM version you are using:
> 
>   version 16.05 or later: you can use SLURM's PMIx support. This
>   requires that you configure and build SLURM --with-pmix.
> 
>   Versions earlier than 16.05: you must use either SLURM's PMI-1 or
>   PMI-2 support. SLURM builds PMI-1 by default, or you can manually
>   install PMI-2. You must then build Open MPI using --with-pmi pointing
>   to the SLURM PMI library location.
> 
> Please configure as appropriate and try again.
> --------------------------------------------------------------------------
> *** An error occurred in MPI_Init
> *** on a NULL communicator
> *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
> ***    and potentially your MPI job)
> [circe-n001:06587] Local abort before MPI_INIT completed completed
> successfully, but am not able to aggregate error messages, and not
> able to guarantee that all other processes were killed!
> 
> If I (manually) run it with an explicit mpirun instead
> 
> mpirun -N 1 /usr/local/src/netcdf-c-4.7.4/nc_test/.libs/t_nc
> 
> then it looks like it's working correctly.
> 
> Are there some extra settings I should be using with the NetCDF-C &
> NetCDF-F regression-tests?
> 
> Most of the tests are passing already, but I get these failures that I
> wouldn't see outside the container environment:
> 
> tst_netcdf4.sh
> tst_nccopy4.sh
> t_nc
> tst_small
> nc_test
> tst_misc
> tst_norm
> tst_nofill
> tst_atts3
> tst_formatx_pnetcdf
> tst_default_format_pnetcdf
> tst_cdf5format
> run_pnetcdf_test.sh
> tst_compounds
> tst_compounds3
> tst_atts3
> 
> I'm not sure they all share the root cause I'm showing above, but I'm hoping
> that one is the easiest to fix, for starters.
> Thanks,
> 
> Carl Ponder, Ph.D.
> Senior Engineer, NVIDIA Developer Technology


Ticket Details
===================
Ticket ID: TDA-721004
Department: Support netCDF
Priority: Normal
Status: Closed
===================
NOTE: All email exchanges with Unidata User Support are recorded in the Unidata 
inquiry tracking system and then made publicly available through the web.  If 
you do not want to have your interactions made available in this way, you must 
let us know in each email you send to us.