[gpaw-users] Problem running parallel GPAW

Jesper Rude Selknæs jesperrude at gmail.com
Mon Apr 30 16:53:32 CEST 2012


Hi list

I did a bit more investigation. From
https://wiki.fysik.dtu.dk/gpaw/install/installationguide.html#running-tests
 I ran the rank example:

[gpaw]$ mpirun -np 2 gpaw-python -c "import gpaw.mpi as mpi; print mpi.rank"

which ran flawlessly for up to 10 nodes (120 cores total).
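
For reference, a slightly longer sanity check along the same lines can
be run the same way. This is just a sketch of my own: the file name
check_mpi.py and the tolerance are my choices, and it only uses
gpaw.mpi.world, which the rank example above already relies on:

# check_mpi.py - cross-node sanity check (my own sketch, not part of the GPAW tests)
# Run with e.g.:  mpirun -np 24 gpaw-python check_mpi.py
from gpaw.mpi import world

# Every rank reports who it is.
print('rank %d of %d says hello' % (world.rank, world.size))

# Exercise a collective operation across the nodes: each rank
# contributes 1.0, so the global sum must equal the number of ranks.
total = world.sum(1.0)
assert abs(total - world.size) < 1e-12, 'collective sum failed'

world.barrier()
if world.rank == 0:
    print('collective sum OK on %d ranks' % world.size)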

I then ran

[gpaw]$ mpirun -np 2 gpaw-python gpaw/test/spinpol.py

which fails as soon as I go beyond one node. It runs fine on 12
cores within the same node. I know there is no speedup (actually a
heavy slowdown) with this many cores, but it executes correctly, which
is all I want for the moment. Part of the output from spinpol.py on 2
nodes is given below, as the full output is quite large.
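
In case it helps to reproduce the problem outside the test suite, the
failing calculation is presumably a small spin-polarized one (judging
from the test's name). A rough sketch of that kind of run, with my own
cell size, grid spacing and magnetic moment rather than the actual
spinpol.py parameters, would be:

from ase import Atoms
from gpaw import GPAW

# Small spin-polarized run, roughly the kind of thing spinpol.py does
# (parameters here are guesses, not the values from the test).
atom = Atoms('H', magmoms=[1.0], cell=(5.0, 5.0, 5.0))
atom.center()
atom.set_calculator(GPAW(h=0.25, spinpol=True, txt='spinpol_sketch.txt'))
e = atom.get_potential_energy()

# Run with, e.g.:  mpirun -np 24 gpaw-python spinpol_sketch.py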

Any ideas?

R.

Jesper



File "/usr/lib/python2.6/site-packages/ase/atoms.py", line 548, in
get_potential_energy
    self.initialize(atoms)
  File "/usr/lib64/python2.6/site-packages/gpaw/paw.py", line 591, in initialize
    return self._calc.get_potential_energy(self)
  File "/usr/lib64/python2.6/site-packages/gpaw/aseinterface.py", line
37, in get_potential_energy
    self.calculate(atoms, converge=True)
  File "/usr/lib64/python2.6/site-packages/gpaw/paw.py", line 221, in calculate
    self.initialize(atoms)
  File "/usr/lib64/python2.6/site-packages/gpaw/paw.py", line 591, in initialize
    magmom_a, par.hund)
    return self._calc.get_potential_energy(self)
  File "/usr/lib64/python2.6/site-packages/gpaw/aseinterface.py", line
37, in get_potential_energy
    self.calculate(atoms, converge=True)
  File "/usr/lib64/python2.6/site-packages/gpaw/paw.py", line 221, in calculate
  File "/usr/lib64/python2.6/site-packages/gpaw/density.py", line 74,
in initialize
    allocate=False)
  File "/usr/lib64/python2.6/site-packages/gpaw/transformers.py", line
144, in Transformer
    self.initialize(atoms)
  File "/usr/lib64/python2.6/site-packages/gpaw/paw.py", line 591, in initialize
    t = _Transformer(gdin, gdout, nn, dtype, allocate)
  File "/usr/lib64/python2.6/site-packages/gpaw/transformers.py", line
49, in __init__
    gdout.n_c[0] * gdout.n_c[1])
AssertionError
    magmom_a, par.hund)
  File "/usr/lib64/python2.6/site-packages/gpaw/density.py", line 74,
in initialize
    allocate=False)
  File "/usr/lib64/python2.6/site-packages/gpaw/transformers.py", line
144, in Transformer
    t = _Transformer(gdin, gdout, nn, dtype, allocate)
  File "/usr/lib64/python2.6/site-packages/gpaw/transformers.py", line
49, in __init__
    gdout.n_c[0] * gdout.n_c[1])
    magmom_a, par.hund)
  File "/usr/lib64/python2.6/site-packages/gpaw/density.py", line 74,
in initialize
GPAW CLEANUP (node 1): <type 'exceptions.AssertionError'> occurred.
Calling MPI_Abort!
    allocate=False)
  File "/usr/lib64/python2.6/site-packages/gpaw/transformers.py", line
144, in Transformer
AssertionError
    t = _Transformer(gdin, gdout, nn, dtype, allocate)
  File "/usr/lib64/python2.6/site-packages/gpaw/transformers.py", line
49, in __init__
GPAW CLEANUP (node 17): <type 'exceptions.AssertionError'> occurred.
Calling MPI_Abort!
    gdout.n_c[0] * gdout.n_c[1])
AssertionError
GPAW CLEANUP (node 2): <type 'exceptions.AssertionError'> occurred.
Calling MPI_Abort!
--------------------------------------------------------------------------
MPI_ABORT was invoked on rank 4 in communicator MPI_COMM_WORLD
with errorcode 42.

NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes.
You may or may not see output from other processes, depending on
exactly when Open MPI kills them.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun has exited due to process rank 0 with PID 5881 on
node DFT030 exiting improperly. There are two reasons this could occur:

1. this process did not call "init" before exiting, but others in
the job did. This can cause a job to hang indefinitely while it waits
for all processes to call "init". By rule, if one process calls "init",
then ALL processes must call "init" prior to termination.

2. this process called "init", but exited without calling "finalize".
By rule, all processes that call "init" MUST call "finalize" prior to
exiting or it will be considered an "abnormal termination"

This may have caused other processes in the application to be
terminated by signals sent by mpirun (as reported here).
--------------------------------------------------------------------------
[DFT030:05880] 10 more processes have sent help message
help-mpi-api.txt / mpi-abort
[DFT030:05880] Set MCA parameter "orte_base_help_aggregate" to 0 to
see all help / error messages




2012/4/24 Jesper Rude Selknæs <jesperrude at gmail.com>:
> Hi list
>
> I have recently installed GPAW, ASE3 and DACAPO on a cluster,
> consisting of 42 nodes of 12 cores each.
>
> I am using Red Hat Enterprise Linux 6.0 as my OS, OpenMPI 1.5.3 and TORQUE.
>
> I have absolutely no problems running parallel GPAW applications as
> long as all the processes are on the same node. However, as soon as I
> spread the processes over more than one node, I get into trouble. The
> stderr file from the TORQUE job gives me:
>
>
> "
> [DFT046][[35378,1],5][btl_tcp_frag.c:215:mca_btl_tcp_frag_recv]
> mca_btl_tcp_frag_recv: readv failed: Connection reset by peer (104)
> [DFT046][[35378,1],10][btl_tcp_frag.c:215:mca_btl_tcp_frag_recv]
> mca_btl_tcp_frag_recv: readv failed: Connection reset by peer (104)
> [DFT046][[35378,1],4][btl_tcp_frag.c:215:mca_btl_tcp_frag_recv]
> mca_btl_tcp_frag_recv: readv failed: Connection reset by peer (104)
> [DFT046][[35378,1],7][btl_tcp_frag.c:215:mca_btl_tcp_frag_recv]
> mca_btl_tcp_frag_recv: readv failed: Connection reset by peer (104)
> [DFT046][[35378,1],1][btl_tcp_frag.c:215:mca_btl_tcp_frag_recv]
> mca_btl_tcp_frag_recv: readv failed: Connection reset by peer (104)
> [DFT046][[35378,1],6][btl_tcp_frag.c:215:mca_btl_tcp_frag_recv]
> mca_btl_tcp_frag_recv: readv failed: Connection reset by peer (104)
> =>> PBS: job killed: node 1 (DFT045) requested job terminate, 'EOF'
> (code 1099) - received SISTER_EOF attempting to communicate with
> sister MOM's
> mpirun: killing job...
>
> "
>
>
>
> And from the MOM logs, I simply get a "sister process died" kind of message.
>
> I have tried to run a simple "MPI hello world" test on as many as 10
> nodes, utilizing all cores, and I have no trouble doing that.
>
> Does anybody have experiences like this?
>
> Regards
>
> Jesper Selknæs


