[gpaw-users] Problem running parallel GPAW

Tue Apr 24 11:28:08 CEST 2012

Hi list,

I have recently installed GPAW, ASE3 and DACAPO on a cluster,
consisting of 42 nodes of 12 cores each.

I am using Red Hat Enterprise Linux 6.0 as my OS, with OpenMPI 1.5.3 and TORQUE.

I have absolutely no problems running parallel GPAW applications, as
long as all the processes are on the same node. However, as soon as I
spread the processes over more than one node, I get into trouble. The
stderr file from the TORQUE job gives me:

"
[DFT046][[35378,1],5][btl_tcp_frag.c:215:mca_btl_tcp_frag_recv]
mca_btl_tcp_frag_recv: readv failed: Connection reset by peer (104)
[DFT046][[35378,1],10][btl_tcp_frag.c:215:mca_btl_tcp_frag_recv]
mca_btl_tcp_frag_recv: readv failed: Connection reset by peer (104)
[DFT046][[35378,1],4][btl_tcp_frag.c:215:mca_btl_tcp_frag_recv]
mca_btl_tcp_frag_recv: readv failed: Connection reset by peer (104)
[DFT046][[35378,1],7][btl_tcp_frag.c:215:mca_btl_tcp_frag_recv]
mca_btl_tcp_frag_recv: readv failed: Connection reset by peer (104)
[DFT046][[35378,1],1][btl_tcp_frag.c:215:mca_btl_tcp_frag_recv]
mca_btl_tcp_frag_recv: readv failed: Connection reset by peer (104)
[DFT046][[35378,1],6][btl_tcp_frag.c:215:mca_btl_tcp_frag_recv]
mca_btl_tcp_frag_recv: readv failed: Connection reset by peer (104)
=>> PBS: job killed: node 1 (DFT045) requested job terminate, 'EOF'
(code 1099) - received SISTER_EOF attempting to communicate with
sister MOM's
mpirun: killing job...

"

And from the MOM logs, I simply get a "sister process died" kind of message.
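
To make it easier to see which host and ranks report the readv failure, the
stderr excerpt above can be tallied with a few lines of Python. This is only a
sketch: it assumes every error line starts with the bracketed prefix seen
above, and that the last number in the [[...],N] group is the MPI rank. The
names PREFIX, failing_ranks and EXCERPT are made up for this example.

```python
import re
from collections import defaultdict

# Assumed prefix layout, taken from the excerpt: [host][[jobid,n],rank][...]
PREFIX = re.compile(
    r"^\[(?P<host>[^\]]+)\]\[\[(?P<job>\d+),(?P<sub>\d+)\],(?P<rank>\d+)\]"
)


def failing_ranks(stderr_text):
    """Return {host: sorted ranks} for every line carrying the prefix."""
    ranks = defaultdict(set)
    for line in stderr_text.splitlines():
        match = PREFIX.match(line.strip())
        if match:
            ranks[match.group("host")].add(int(match.group("rank")))
    return {host: sorted(found) for host, found in ranks.items()}


# Two prefix lines copied from the stderr excerpt, plus the PBS line,
# which has no prefix and is therefore ignored.
EXCERPT = """\
[DFT046][[35378,1],5][btl_tcp_frag.c:215:mca_btl_tcp_frag_recv]
mca_btl_tcp_frag_recv: readv failed: Connection reset by peer (104)
[DFT046][[35378,1],10][btl_tcp_frag.c:215:mca_btl_tcp_frag_recv]
mca_btl_tcp_frag_recv: readv failed: Connection reset by peer (104)
=>> PBS: job killed: node 1 (DFT045) requested job terminate, 'EOF'
"""

if __name__ == "__main__":
    print(failing_ranks(EXCERPT))
```

Applied to the full excerpt, every readv failure is reported by DFT046
(ranks 1, 4, 5, 6, 7 and 10), while the PBS message names DFT045 as the node
that requested the job termination.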

I have tried running a simple "MPI HELLO WORLD" test on as many as 10
nodes, utilizing all cores, and I have no trouble doing that.

Has anybody experienced something like this?

Regards

Jesper Selknæs