[gpaw-users] MPI Error, Fatal Error in PMPI_Comm_dup

Ali Malik malik at mm.tu-darmstadt.de
Thu Aug 1 10:38:04 CEST 2019


Dear gpaw-users,

I have been doing calculations on slabs of many systems, including 
Cr2AlC, Cr2GaC, etc. Recently I have been facing an MPI error that 
seems to occur randomly during the execution of a job. Sometimes the 
jobs complete, but most of the time this MPI error terminates the job. 
I have been unable to identify its root cause.

I am running the calculations on an HPC cluster with gpaw-1.5.2, 
intelmpi-2018.4, python-3.7.2 and scalapack-2.0.2. The HPC support desk 
told me the error is probably due to a bug in the GPAW logger 
(gpaw/io/logger.py, lines 32-46). Their response:

   "when you call calc.set(txt = "...") the old logfile is not closed 
properly, only a new one is created. I suspect that you reach the limit 
of concurrently open files".
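
Roughly, the pattern they describe would look like the loop below. This 
is only a hypothetical illustration, not the attached relax-slab.py; the 
test system and the GPAW parameters are placeholders. Each 
calc.set(txt=...) makes GPAW create a new logger for a new log file, and 
if the resources behind the previous one are never released (the open 
file the support desk mentions, and possibly also the duplicated MPI 
communicators behind the "Too many communicators" message), a long run 
could eventually hit a per-process limit:

# Hypothetical illustration only -- not the attached relax-slab.py.
# The Atoms object and GPAW parameters are placeholders.
from ase.build import bulk
from gpaw import GPAW

atoms = bulk('Al', 'fcc', a=4.05)
calc = GPAW(mode='pw', kpts=(2, 2, 2), txt='step-000.txt')
atoms.set_calculator(calc)

for step in range(1, 500):
    calc.set(txt='step-%03d.txt' % step)  # a new logger/logfile every iteration
    atoms.get_potential_energy()          # the previous one may never be closed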

The input script and the error output file are attached. If you need 
anything else, please feel free to ask. Any help debugging the issue 
would be highly appreciated. Or should I report this in the bug 
tracker? Thanks.

Here is the function *gpaw_optimize* used in the input script, which is 
just a wrapper:

from ase.parallel import parprint
from ase.constraints import UnitCellFilter, StrainFilter
from ase.optimize import QuasiNewton, BFGS
# CG, ScBFGS, BFGSLS and RelaxationTypeException are defined/imported
# elsewhere in the attached script and are not shown here.


def gpaw_optimize(atoms, calc, relax='', fmax=0.01, relaxalgorithm="BFGS",
                  mask=None, attach=False, gpawwrite="", verbose=True,
                  **alargs):
    """Wrapper function for relaxation.

    :param atoms: ASE Atoms object
    :param calc: calculator object
    :param relax: str ('full', 'cell' or 'ions'), type of relaxation
    :param fmax: number, force criterion
    :param relaxalgorithm: name of the relaxation algorithm
    :param attach: bool, default False
    :param verbose: bool, default True
    :return: Atoms object
    """
    if not attach:
        atoms.set_calculator(calc)
        if verbose:
            parprint("attaching the calculator", flush=True)

    if atoms.get_calculator() is None:  # recheck
        if verbose:
            parprint("The calculator is not attached", flush=True)
        atoms.set_calculator(calc)
        if verbose:
            parprint("It has been attached", flush=True)
        attach = True

    # available relaxation algorithms
    optimizer_algorithms = {"QuasiNewton": QuasiNewton, "BFGS": BFGS,
                            "CG": CG, "ScBFGS": ScBFGS, "BFGSLS": BFGSLS}

    if relaxalgorithm not in optimizer_algorithms:
        raise KeyError("%s is invalid or not found.\n"
                       "The available algorithms are: %s"
                       % (relaxalgorithm, list(optimizer_algorithms)))

    # TODO: single run statement outside the if/elif chain.
    if relax == 'full':
        uf = UnitCellFilter(atoms, mask=mask)
        opt = optimizer_algorithms[relaxalgorithm](uf, logfile="rel-all.log",
                                                   **alargs)
        if verbose:
            parprint("Full relaxation", flush=True)

    elif relax == 'cell':
        cf = StrainFilter(atoms, mask=mask)
        opt = optimizer_algorithms[relaxalgorithm](cf, logfile="rel-cell.log",
                                                   **alargs)
        if verbose:
            parprint("Cell relaxation only", flush=True)

    elif relax == 'ions':  # ionic relaxation
        opt = optimizer_algorithms[relaxalgorithm](atoms,
                                                   logfile="rel-ionic.log",
                                                   **alargs)
        if verbose:
            parprint("Ions relaxation only", flush=True)

    else:
        raise RelaxationTypeException("The entered relaxation string is "
                                      "incorrect")

    opt.run(fmax=fmax)

    if gpawwrite:  # write the final state only
        calc.write(gpawwrite, mode="all")

    return atoms
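
For reference, a call to this wrapper looks roughly like the following. 
The slab, the GPAW parameters and the file names here are only 
placeholders; the actual setup is in the attached relax-slab.py:

from ase.build import fcc111
from gpaw import GPAW, PW

slab = fcc111('Al', size=(2, 2, 4), vacuum=10.0)
calc = GPAW(mode=PW(500), kpts=(4, 4, 1), xc='PBE', txt='slab.txt')

slab = gpaw_optimize(slab, calc, relax='ions', fmax=0.01,
                     relaxalgorithm='BFGS', gpawwrite='slab.gpw')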


Best Regards,

Ali Muhammad Malik







-------------- next part --------------
A non-text attachment was scrubbed...
Name: relax-slab.py
Type: text/x-python
Size: 1793 bytes
Desc: not available
URL: <http://listserv.fysik.dtu.dk/pipermail/gpaw-users/attachments/20190801/2ef3932a/attachment-0001.py>
-------------- next part --------------
Lmod: unloading intel 2018.4 
Lmod: unloading intelmpi 2018.4 
Lmod: unloading python 3.7.2 
Lmod: loading intel 2018.4 
Lmod: loading intelmpi 2018.4 
Lmod: loading python 3.7.2 
Fatal error in PMPI_Comm_dup: Other MPI error, error stack:
PMPI_Comm_dup(192)..................: MPI_Comm_dup(comm=0xc4013833, new_comm=0x7ffc10d15770) failed
PMPI_Comm_dup(171)..................: fail failed
MPIR_Comm_dup_impl(57)..............: fail failed
MPIR_Comm_copy(838).................: fail failed
srun: Job step aborted: Waiting up to 32 seconds for job step to finish.
MPIR_Get_contextid_sparse_group(672): Too many communicators (7271/16384 free on this process; ignore_id=0)
[The same "Fatal error in PMPI_Comm_dup" / "Too many communicators" error stack is repeated by the remaining MPI ranks; only the communicator handles and the free-communicator counts (between 0 and 7272 of 16384) differ.]
slurmstepd: error: *** STEP 12727838.0 ON hpb0190 CANCELLED AT 2019-08-01T01:26:33 ***
srun: error: hpb0190: tasks 0-15: Killed
srun: error: hpb0212: tasks 16-31: Killed

