[netCDF #TDA-721004]: NetCDF-C: failed regression tests inside a SLURM/Docker container

Hello Carl,

We use Docker for our regression testing as well, for serial and MPI-based 
builds, and I'm not currently seeing the same issue that you are.  This makes 
me suspect it is something specific to SLURM/Pyxis (neither of which I am 
terribly familiar with).  

You can run our mpich-based parallel tests with the following docker command:

    $ docker run --rm -it -e TESTPROC=16 -e USEAC=TRUE unidata/nctests:mpich

You can tweak the value of TESTPROC to tell the docker container how many 
processors to use.  
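
For example, if you want to see whether the failures change with the process 
count, a small wrapper along these lines might save some typing.  This is only 
a sketch: run_nctests is a made-up name, not something shipped with the image, 
and the default of 16 simply mirrors the command above.

```shell
# Hypothetical helper (not part of unidata/nctests): run the mpich test
# container with a chosen TESTPROC value.
run_nctests() {
    # $1 = number of processors passed to the container; defaults to 16
    docker run --rm -it -e TESTPROC="${1:-16}" -e USEAC=TRUE unidata/nctests:mpich
}
```

With that defined in your shell, "run_nctests 4" would run the same tests with 
TESTPROC=4.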

I'm at a bit of a loss due to my SLURM/Pyxis blind spot.  I do know that errors 
similar to the one you reported show up when using the openmpi package (instead 
of mpich2), and this is apparently a known issue.  If nothing else, perhaps your 
trained eye can compare what the unidata/nctests:mpich docker container is 
doing against your local docker containers.  If you run the container 
interactively (e.g. docker run --rm -it unidata/nctests:mpich bash), you will 
find the config file used to build the image, Dockerfile.mpich, as well as the 
shell script used to run the tests, run_par_tests.sh.
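
The exact location of those two files inside the image is not stated here, so 
the following is a rough sketch that searches for them rather than assuming a 
path.  find_nctests_files is a hypothetical name, and this assumes the find 
utility is available inside the container.

```shell
# Hypothetical helper: locate the build and test files named above.
# No path is assumed; the search starts from "/" unless told otherwise.
find_nctests_files() {
    # $1 = directory to search from (defaults to the filesystem root)
    find "${1:-/}" \( -name Dockerfile.mpich -o -name run_par_tests.sh \) \
        -type f 2>/dev/null
}
```

Pasted into the bash session started by the interactive docker run command, 
calling find_nctests_files with no arguments should print the full path of each 
file it finds.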

Are your docker images in a public repo that I can pull from? I would be happy 
to take a hands-on look at them.

Thanks, have a great day,


> I'm starting to build NetCDF & other libraries inside Docker containers,
> and have been running the make check regression-tests to validate the
> installs.
> I'm running them under SLURM/Pyxis so the usual mpirun & mpiexec might
> not be working the way I'm used to.
> In particular, all the PNetCDF regressions were failing until I added
> these extra settings
> make TESTSEQRUN="mpirun -n 1" TESTMPIRUN="mpiexec -n NP" -i -k check
> because (I think) the test harness wasn't recognizing the parallel
> environment and not invoking mpiexec by default.
> With NetCDF-C I'm seeing the make check use this command, for example
> exec /usr/local/src/netcdf-c-4.7.4/nc_test/.libs/t_nc
> which gives an error:
> [circe-n001:06587] OPAL ERROR: Not initialized in file
> pmix3x_client.c at line 112
> --------------------------------------------------------------------------
> The application appears to have been direct launched using "srun",
> but OMPI was not built with SLURM's PMI support and therefore cannot
> execute. There are several options for building PMI support under
> SLURM, depending upon the SLURM version you are using:
>   version 16.05 or later: you can use SLURM's PMIx support. This
>   requires that you configure and build SLURM --with-pmix.
>   Versions earlier than 16.05: you must use either SLURM's PMI-1 or
>   PMI-2 support. SLURM builds PMI-1 by default, or you can manually
>   install PMI-2. You must then build Open MPI using --with-pmi pointing
>   to the SLURM PMI library location.
> Please configure as appropriate and try again.
> --------------------------------------------------------------------------
> *** An error occurred in MPI_Init
> *** on a NULL communicator
> *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
> ***    and potentially your MPI job)
> [circe-n001:06587] Local abort before MPI_INIT completed completed
> successfully, but am not able to aggregate error messages, and not
> able to guarantee that all other processes were killed!
> If I (manually) run it with an explicit mpirun instead
> mpirun -N 1 /usr/local/src/netcdf-c-4.7.4/nc_test/.libs/t_nc
> then it looks like it's working correctly.
> Are there some extra settings I should be using with the NetCDF-C &
> NetCDF-F regression-tests?
> Most of the tests are passing already, but I get these failures that I
> wouldn't see outside the container environment:
> tst_netcdf4.sh
> tst_nccopy4.sh
> t_nc
> tst_small
> nc_test
> tst_misc
> tst_norm
> tst_nofill
> tst_atts3
> tst_formatx_pnetcdf
> tst_default_format_pnetcdf
> tst_cdf5format
> run_pnetcdf_test.sh
> tst_compounds
> tst_compounds3
> tst_atts3
> I'm not sure that they're all from the same root-cause that I'm showing
> above, but I'm hoping it'd be the easiest one to fix for starters.
> Thanks,
> Carl Ponder, Ph.D.
> Senior Engineer, NVIDIA Developer Technology

