[gpaw-users] MPI Error, Fatal Error in PMPI_Comm_dup
Ali Malik
malik at mm.tu-darmstadt.de
Thu Aug 1 10:38:04 CEST 2019
Dear gpaw-users,
I have been running calculations on slabs of several systems, including
Cr2AlC, Cr2GaC, etc. Recently I have been facing an MPI error that seems
to occur at random during the execution of a job. Sometimes the job
completes, but most of the time this MPI error terminates it. I have
been unable to identify the root cause of this error.
I am running the calculations on an HPC cluster with gpaw-1.5.2,
intelmpi-2018.4, python-3.7.2 and scalapack-2.0.2. The HPC support desk
told me the error is probably due to a bug in the GPAW logger
(gpaw/io/logger.py, lines 32-46). Their response:
"when you call calc.set(txt = "...") the old logfile is not closed
properly, only a new one is created. I suspect that you reach the limit
of concurrently open files".
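To test their theory, here is a minimal sketch (assuming a Linux node
with /proc and an already created GPAW calculator object calc; the log
file name is just a placeholder) that counts the open file descriptors
of the process around a calc.set(txt=...) call:

import os

def count_open_fds():
    # number of file descriptors currently open in this process (Linux only)
    return len(os.listdir('/proc/self/fd'))

before = count_open_fds()
calc.set(txt='new-log.txt')   # placeholder log file name
after = count_open_fds()
print('open file descriptors: %d -> %d' % (before, after))

If the count keeps growing with every calc.set(txt=...) call, that would
support the open-files explanation.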
The input script and the error output file are attached. If you need
anything else, please feel free to ask. Any help debugging this issue
would be highly appreciated. Or should I report it in the bug tracker?
Thanks.
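For what it is worth, the "Too many communicators" lines in the attached
output are what Intel MPI / MPICH print when the per-process pool of
16384 communicator context ids is exhausted, i.e. communicators keep
being duplicated without being freed. A minimal mpi4py sketch (my own
test, assuming mpi4py is built against the same Intel MPI; it is not
part of the attached script) triggers the same failure:

from mpi4py import MPI

comms = []
for i in range(20000):                   # more than the ~16384 context ids per process
    comms.append(MPI.COMM_WORLD.Dup())   # duplicates are kept alive and never freed
    if i % 1000 == 0 and MPI.COMM_WORLD.rank == 0:
        print('duplicated %d communicators' % i, flush=True)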
Here is the function *gpaw_optimize* used in the input script; it is
just a wrapper:
from ase.constraints import StrainFilter, UnitCellFilter
from ase.optimize import BFGS, QuasiNewton
from ase.parallel import parprint
# CG, ScBFGS, BFGSLS and RelaxationTypeException are imported/defined
# elsewhere in the input script.


def gpaw_optimize(atoms, calc, relax='', fmax=0.01, relaxalgorithm="BFGS",
                  mask=None, attach=False, gpawwrite="", verbose=True,
                  **alargs):
    """Wrapper function for relaxation.

    :param atoms: ASE Atoms object
    :param calc: calculator object
    :param relax: str, type of relaxation ('full', 'cell' or 'ions')
    :param fmax: float, force convergence criterion
    :param relaxalgorithm: str, name of the relaxation algorithm
    :param mask: mask passed to the unit-cell or strain filter
    :param attach: bool, default False
    :param gpawwrite: str, if given, write the final state to this .gpw file
    :param verbose: bool, default True
    :param alargs: extra keyword arguments passed to the optimizer
    :return: relaxed Atoms object
    """
    if not attach:
        atoms.set_calculator(calc)
        if verbose:
            parprint("attaching the calculator", flush=True)
    if atoms.get_calculator() is None:  # recheck
        if verbose:
            parprint("The calculator is not attached", flush=True)
        atoms.set_calculator(calc)
        if verbose:
            parprint("It has been attached", flush=True)
        attach = True
    # available relaxation algorithms
    optimizer_algorithms = {"QuasiNewton": QuasiNewton, "BFGS": BFGS,
                            "CG": CG, "ScBFGS": ScBFGS, "BFGSLS": BFGSLS}
    if relaxalgorithm not in optimizer_algorithms:
        raise KeyError("%s is invalid or not found.\n"
                       "The available algorithms are: %s"
                       % (relaxalgorithm, list(optimizer_algorithms)))
    # TODO: single opt.run() statement outside the if/elif chain.
    if relax == 'full':
        uf = UnitCellFilter(atoms, mask=mask)
        opt = optimizer_algorithms[relaxalgorithm](uf, logfile="rel-all.log",
                                                   **alargs)
        if verbose:
            parprint("Full relaxation", flush=True)
    elif relax == 'cell':
        cf = StrainFilter(atoms, mask=mask)
        opt = optimizer_algorithms[relaxalgorithm](cf, logfile="rel-cell.log",
                                                   **alargs)
        if verbose:
            parprint("Cell relaxation only", flush=True)
    elif relax == 'ions':  # ionic relaxation
        opt = optimizer_algorithms[relaxalgorithm](atoms,
                                                   logfile="rel-ionic.log",
                                                   **alargs)
        if verbose:
            parprint("Ions relaxation only", flush=True)
    else:
        raise RelaxationTypeException("The entered relaxation string is "
                                      "incorrect")
    opt.run(fmax=fmax)
    if gpawwrite:  # last state only
        calc.write(gpawwrite, mode="all")
    return atoms
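For completeness, this is roughly how the wrapper is called; a
hypothetical usage sketch, not the actual attached relax-slab.py (the
slab, plane-wave cutoff and k-points below are placeholders):

from ase.build import fcc111
from gpaw import GPAW, PW

atoms = fcc111('Al', size=(2, 2, 4), vacuum=10.0)   # placeholder slab
calc = GPAW(mode=PW(500), xc='PBE', kpts=(4, 4, 1), txt='slab.txt')
atoms = gpaw_optimize(atoms, calc, relax='ions', fmax=0.01,
                      relaxalgorithm='BFGS', gpawwrite='slab.gpw')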
Best Regards,
Ali Muhammad Malik
-------------- next part --------------
A non-text attachment was scrubbed...
Name: relax-slab.py
Type: text/x-python
Size: 1793 bytes
Desc: not available
URL: <http://listserv.fysik.dtu.dk/pipermail/gpaw-users/attachments/20190801/2ef3932a/attachment-0001.py>
-------------- next part --------------
Lmod: unloading intel 2018.4
Lmod: unloading intelmpi 2018.4
Lmod: unloading python 3.7.2
Lmod: loading intel 2018.4
Lmod: loading intelmpi 2018.4
Lmod: loading python 3.7.2
Fatal error in PMPI_Comm_dup: Other MPI error, error stack:
PMPI_Comm_dup(192)..................: MPI_Comm_dup(comm=0xc4013833, new_comm=0x7ffc10d15770) failed
PMPI_Comm_dup(171)..................: fail failed
MPIR_Comm_dup_impl(57)..............: fail failed
MPIR_Comm_copy(838).................: fail failed
srun: Job step aborted: Waiting up to 32 seconds for job step to finish.
MPIR_Get_contextid_sparse_group(672): Too many communicators (7271/16384 free on this process; ignore_id=0)
[the same "Fatal error in PMPI_Comm_dup ... Too many communicators" error stack is repeated by the remaining MPI ranks; the number of free communicator context ids reported ranges from 7272/16384 down to 0/16384]
slurmstepd: error: *** STEP 12727838.0 ON hpb0190 CANCELLED AT 2019-08-01T01:26:33 ***
srun: error: hpb0190: tasks 0-15: Killed
srun: error: hpb0212: tasks 16-31: Killed