[gpaw-users] Problem running parallel GPAW

Jesper Rude Selknæs jesperrude at gmail.com
Mon Apr 30 14:37:32 CEST 2012


Hi list

Running the script interactively on the cluster, I managed to get a
bit more information about the problem. When the problem occurs,
something like this is printed (I have only pasted in part of the
traceback lines, but they are all the same):

  File "/usr/lib64/python2.6/site-packages/gpaw/test/__init__.py",
line 25, in equal
  File "/usr/lib64/python2.6/site-packages/gpaw/test/__init__.py",
line 25, in equal
  File "/usr/lib64/python2.6/site-packages/gpaw/test/__init__.py",
line 25, in equal
  File "/usr/lib64/python2.6/site-packages/gpaw/test/__init__.py",
line 25, in equal
  File "/usr/lib64/python2.6/site-packages/gpaw/test/__init__.py",
line 25, in equal
  File "/usr/lib64/python2.6/site-packages/gpaw/test/__init__.py",
line 25, in equal
    raise AssertionError(msg)
AssertionError: 22 != 24 (error: |-2| > 0)
GPAW CLEANUP (node 2): <type 'exceptions.AssertionError'> occurred.
Calling MPI_Abort!
    raise AssertionError(msg)
AssertionError: 22 != 24 (error: |-2| > 0)
GPAW CLEANUP (node 32): <type 'exceptions.AssertionError'> occurred.
Calling MPI_Abort!
    raise AssertionError(msg)
AssertionError: 22 != 24 (error: |-2| > 0)
GPAW CLEANUP (node 27): <type 'exceptions.AssertionError'> occurred.
Calling MPI_Abort!
    raise AssertionError(msg)
AssertionError: 22 != 24 (error: |-2| > 0)
    raise AssertionError(msg)
AssertionError: 22 != 24 (error: |-2| > 0)
    raise AssertionError(msg)
AssertionError: 22 != 24 (error: |-2| > 0)
GPAW CLEANUP (node 10): <type 'exceptions.AssertionError'> occurred.
Calling MPI_Abort!
GPAW CLEANUP (node 13): <type 'exceptions.AssertionError'> occurred.
Calling MPI_Abort!
GPAW CLEANUP (node 0): <type 'exceptions.AssertionError'> occurred.
Calling MPI_Abort!
    raise AssertionError(msg)
AssertionError: 22 != 24 (error: |-2| > 0)
GPAW CLEANUP (node 3): <type 'exceptions.AssertionError'> occurred.
Calling MPI_Abort!
    raise AssertionError(msg)
AssertionError: 22 != 24 (error: |-2| > 0)
    raise AssertionError(msg)
AssertionError: 22 != 24 (error: |-2| > 0)
:


I tried googling the assertion error, but it only yielded results
concerning conditions other than the 22 != 24 seen in my case.
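
For what it's worth, the message format suggests it comes from a
tolerance check in GPAW's test helpers. A minimal sketch of what such
an equal() helper roughly does (an illustration only, not the actual
GPAW source; the names and defaults are assumptions):

    def equal(x, y, tolerance=0):
        # Raise if the two values differ by more than the allowed tolerance.
        error = x - y
        if abs(error) > tolerance:
            raise AssertionError('%s != %s (error: |%s| > %s)'
                                 % (x, y, error, tolerance))

    equal(22, 24)  # -> AssertionError: 22 != 24 (error: |-2| > 0)

So the failure would mean that two quantities expected to agree exactly
(tolerance 0) differ by 2 on the ranks that abort.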

Thanks in advance for any new input

R.

Jesper


2012/4/24 Jesper Rude Selknæs <jesperrude at gmail.com>:
> Hi list
>
> I have recently installed GPAW, ASE3 and DACAPO on a cluster,
> consisting of 42 nodes of 12 cores each.
>
> I am using Red Hat Enterprise Linux 6.0 as my OS, OpenMPI 1.5.3 and TORQUE.
>
> I have absolutely no problems running parallel GPAW applications as
> long as all the processes are on the same node. However, as soon as I
> spread the processes over more than one node, I get into trouble. The
> stderr file from the TORQUE job gives me:
>
>
> "
> [DFT046][[35378,1],5][btl_tcp_frag.c:215:mca_btl_tcp_frag_recv]
> mca_btl_tcp_frag_recv: readv failed: Connection reset by peer (104)
> [DFT046][[35378,1],10][btl_tcp_frag.c:215:mca_btl_tcp_frag_recv]
> mca_btl_tcp_frag_recv: readv failed: Connection reset by peer (104)
> [DFT046][[35378,1],4][btl_tcp_frag.c:215:mca_btl_tcp_frag_recv]
> mca_btl_tcp_frag_recv: readv failed: Connection reset by peer (104)
> [DFT046][[35378,1],7][btl_tcp_frag.c:215:mca_btl_tcp_frag_recv]
> mca_btl_tcp_frag_recv: readv failed: Connection reset by peer (104)
> [DFT046][[35378,1],1][btl_tcp_frag.c:215:mca_btl_tcp_frag_recv]
> mca_btl_tcp_frag_recv: readv failed: Connection reset by peer (104)
> [DFT046][[35378,1],6][btl_tcp_frag.c:215:mca_btl_tcp_frag_recv]
> mca_btl_tcp_frag_recv: readv failed: Connection reset by peer (104)
> =>> PBS: job killed: node 1 (DFT045) requested job terminate, 'EOF'
> (code 1099) - received SISTER_EOF attempting to communicate with
> sister MOM's
> mpirun: killing job...
>
> "
>
>
>
> And from the mom logs, I simply get a "sister process died" kind of message.
>
> I have tried to run a simple "MPI hello world" test on as many as 10
> nodes, utilizing all cores, and I have no trouble doing that.
>
> Does anybody have experience with problems like this?
>
> Regards
>
> Jesper Selknæs
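
(For reference, the kind of plain-MPI hello-world check mentioned in
the quoted message can be written in a few lines of Python with
mpi4py; this is just a sketch of such a test, assuming mpi4py is
available, and the original test may well have been a C program:)

    from mpi4py import MPI

    comm = MPI.COMM_WORLD
    # Each rank reports where it is running, so cross-node startup and
    # basic communication setup can be verified.
    print('Hello from rank %d of %d on %s'
          % (comm.Get_rank(), comm.Get_size(), MPI.Get_processor_name()))

Running the same kind of check through GPAW's own parallel interpreter
(gpaw-python, printing gpaw.mpi.world.rank and world.size) would show
whether the difference lies in GPAW's compiled MPI interface rather
than in the MPI installation itself.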


