[gpaw-users] Problem running parallel GPAW

Marcin Dulak Marcin.Dulak at fysik.dtu.dk
Tue Apr 24 14:20:08 CEST 2012


Hi,

On 04/24/12 11:28, Jesper Rude Selknæs wrote:
> Hi List
>
> I have recently installed GPAW, ASE3 and DACAPO on a cluster,
> consisting of 42 nodes of 12 cores each.
>
> I am using Red Hat Enterprise Linux 6.0 as my OS, OpenMPI 1.5.3 and TORQUE.
>
> I have absolutely no problems running parallel GPAW applications, as
> long as all the processes are on the same node. However, as soon as I
> spread the processes over more than one node, I get into trouble. The
> stderr file from the TORQUE job gives me:
Is OpenMPI compiled with Torque support?
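(One way to check, assuming ompi_info comes from the same OpenMPI 1.5.3 
installation: ompi_info | grep tm should list the tm components, e.g. ras 
and plm, if Torque/PBS support was built in.)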
How do you run parallel GPAW (exact command line)?
Does this work across the nodes?:
mpiexec gpaw-python -c "from gpaw.mpi import rank; print rank"
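A slightly longer variant (just a sketch; the hostname lookup is only there 
to show where each rank ends up) could be saved as, say, check_ranks.py and 
launched the same way with mpiexec gpaw-python check_ranks.py:

import socket
from gpaw.mpi import rank, size, world

# report which host each MPI rank runs on
print('rank %d of %d on %s' % (rank, size, socket.gethostname()))
# if the ranks cannot reach each other across nodes, this is where it hangs or fails
world.barrier()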
Then, could you run this example:
https://svn.fysik.dtu.dk/projects/gpaw/trunk/gpaw/test/2Al.py
Do other programs run correctly?
For example ring_c.c or connectivity_c.c from 
http://svn.open-mpi.org/svn/ompi/trunk/examples/
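(For instance, assuming mpicc is the compiler wrapper of the same OpenMPI 
installation:
mpicc ring_c.c -o ring_c
mpiexec ./ring_c
submitted through Torque so that the ranks land on at least two nodes.)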

Marcin
>
> "
> [DFT046][[35378,1],5][btl_tcp_frag.c:215:mca_btl_tcp_frag_recv]
> mca_btl_tcp_frag_recv: readv failed: Connection reset by peer (104)
> [DFT046][[35378,1],10][btl_tcp_frag.c:215:mca_btl_tcp_frag_recv]
> mca_btl_tcp_frag_recv: readv failed: Connection reset by peer (104)
> [DFT046][[35378,1],4][btl_tcp_frag.c:215:mca_btl_tcp_frag_recv]
> mca_btl_tcp_frag_recv: readv failed: Connection reset by peer (104)
> [DFT046][[35378,1],7][btl_tcp_frag.c:215:mca_btl_tcp_frag_recv]
> mca_btl_tcp_frag_recv: readv failed: Connection reset by peer (104)
> [DFT046][[35378,1],1][btl_tcp_frag.c:215:mca_btl_tcp_frag_recv]
> mca_btl_tcp_frag_recv: readv failed: Connection reset by peer (104)
> [DFT046][[35378,1],6][btl_tcp_frag.c:215:mca_btl_tcp_frag_recv]
> mca_btl_tcp_frag_recv: readv failed: Connection reset by peer (104)
> =>>  PBS: job killed: node 1 (DFT045) requested job terminate, 'EOF'
> (code 1099) - received SISTER_EOF attempting to communicate with
> sister MOM's
> mpirun: killing job...
>
> "
>
>
>
> And from the mom logs, I simply get a "sister process died" kind of message.
>
> I have tried to run a simple "MPI HELLO WORLD" test on as many as 10
> nodes, utilizing all cores, and I have no trouble doing that.
>
> Does anybody have experiences like this?
>
> Regards
>
> Jesper Selknæs
>
> _______________________________________________
> gpaw-users mailing list
> gpaw-users at listserv.fysik.dtu.dk
> https://listserv.fysik.dtu.dk/mailman/listinfo/gpaw-users
>



