[gpaw-users] Problem running parallel GPAW

Marcin Dulak Marcin.Dulak at fysik.dtu.dk
Wed Apr 25 13:17:43 CEST 2012


On 04/25/12 10:29, Jesper Rude Selknæs wrote:
> Hi
>
> Thanks for your quick response, it is indeed appreciated.
>
> To answer your questions:
>
> " is openmpi compiled with torque support?"
> Yes, and it is working with a simple MPI application, outputting "Hello
> world" on all processes.
>
>
> "How do you run parallel gpaw (exact command line)?"
> I created a gpaw_qsub script, as in the "Simple submit" section of
> https://wiki.fysik.dtu.dk/gpaw/documentation/parallel_runs/parallel_runs.html
>   My exact script looks like this:
>
> #!/usr/bin/env python
> from sys import argv
> import os
> options = ' '.join(argv[1:-1])
> job = argv[-1]
> dir = os.getcwd()
> f = open('script.sh', 'w')
> f.write("""\
> #PBS -N %s
> NP=`wc -l < $PBS_NODEFILE`
> cd %s
> mpirun -np $NP -machinefile $PBS_NODEFILE gpaw-python %s
> """ % (job, dir, job))
> f.close()
> os.system('qsub ' + options + ' script.sh')
as a side comment:
if OpenMPI is built with Torque support one can drop
"-np $NP -machinefile $PBS_NODEFILE"; mpirun then takes the node list
and the number of processes directly from Torque. I also remember that
some OpenMPI versions (around 1.2) had problems when both the -np and
-machinefile options were specified.
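For illustration, a sketch of how the submit script above could then
look with a Torque-aware OpenMPI (same logic as the wiki script, only
the mpirun line simplified; untested on your setup):

#!/usr/bin/env python
# sketch: gpaw_qsub variant assuming OpenMPI was built with Torque
# support, so mpirun needs no -np/-machinefile options
from sys import argv
import os

options = ' '.join(argv[1:-1])  # options passed on to qsub
job = argv[-1]                  # the GPAW python script to run
dir = os.getcwd()
f = open('script.sh', 'w')
f.write("""\
#PBS -N %s
cd %s
mpirun gpaw-python %s
""" % (job, dir, job))
f.close()
os.system('qsub ' + options + ' script.sh')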

>
> And I execute the script like this:
> gpaw_qsub -q default_new -l nodes=1:ppn=12:new,walltime=72:00:00 -o
> $HOME/test_mpi/2al_smp.out -e $HOME/test_mpi/2al_smp.err 2AL.py
>
>
> The reference to "new" has to do with the fact that my cluster is
> divided into a range of old nodes and a range of new nodes.
>
does printing the ranks work?:

mpiexec gpaw-python -c "from gpaw.mpi import rank; print rank"

This is equivalent to having
from gpaw.mpi import rank; print rank
in the python script (2Al.py).
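The same check can also go through the queue: a minimal test script
(ranktest.py is just a name for the example; I assume gpaw.mpi exposes
size next to rank) could contain

from gpaw.mpi import rank, size
# every MPI process prints its own rank; with nodes=2:ppn=12 the output
# should show ranks 0..23 coming from both nodes
print 'rank %d of %d' % (rank, size)

and be submitted with gpaw_qsub in the same way as 2AL.py.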


> "Then, could you run this example
>> https://svn.fysik.dtu.dk/projects/gpaw/trunk/gpaw/test/2Al.py"
> It gives me the same problem; however, stderr yields a bit more
> information this time. I've attached the stderr file, as it is quite
> large. If you prefer that I paste it in here, please let me know and I
> will do so.
>
> " Do other programs run correctly?
>> For example ring_c.c or connectivity_c.c from
>> http://svn.open-mpi.org/svn/ompi/trunk/examples/"
> Works like a charm.
>
>
>
> Thanks for your effort so far.
>
> Regards.
>
> Jesper
>
>
>
>
> On 24 Apr 2012 14:20, Marcin Dulak <Marcin.Dulak at fysik.dtu.dk> wrote:
>> Hi,
>>
>>
>> On 04/24/12 11:28, Jesper Rude Selknæs wrote:
>>> Hi List
>>>
>>> I have recently installed GPAW, ASE3 and DACAPO on a cluster,
>>> consisting of 42 nodes of 12 cores each.
>>>
>>> I am using Red Hat Enterprise Linux 6.0 as my OS, OpenMPI 1.5.3 and TORQUE.
>>>
>>> I have absolutely no problems running parallel GPAW applications, as
>>> long as all the processes are on the same node. However, as soon as I
>>> spread the processes over more than one node, I get into trouble. The
>>> stderr file from the TORQUE job gives me:
>> is openmpi compiled with torque support?
>> How do you run parallel gpaw (exact command line)?
>> Does this work across the nodes?:
>> mpiexec gpaw-python -c "from gpaw.mpi import rank; print rank"
>> Then, could you run this example
>> https://svn.fysik.dtu.dk/projects/gpaw/trunk/gpaw/test/2Al.py
>> Do other programs run correctly?
>> For example ring_c.c or connectivity_c.c from
>> http://svn.open-mpi.org/svn/ompi/trunk/examples/
>>
>> Marcin
>>
>>> "
>>> [DFT046][[35378,1],5][btl_tcp_frag.c:215:mca_btl_tcp_frag_recv]
>>> mca_btl_tcp_frag_recv: readv failed: Connection reset by peer (104)
>>> [DFT046][[35378,1],10][btl_tcp_frag.c:215:mca_btl_tcp_frag_recv]
>>> mca_btl_tcp_frag_recv: readv failed: Connection reset by peer (104)
>>> [DFT046][[35378,1],4][btl_tcp_frag.c:215:mca_btl_tcp_frag_recv]
>>> mca_btl_tcp_frag_recv: readv failed: Connection reset by peer (104)
>>> [DFT046][[35378,1],7][btl_tcp_frag.c:215:mca_btl_tcp_frag_recv]
>>> mca_btl_tcp_frag_recv: readv failed: Connection reset by peer (104)
>>> [DFT046][[35378,1],1][btl_tcp_frag.c:215:mca_btl_tcp_frag_recv]
>>> mca_btl_tcp_frag_recv: readv failed: Connection reset by peer (104)
>>> [DFT046][[35378,1],6][btl_tcp_frag.c:215:mca_btl_tcp_frag_recv]
>>> mca_btl_tcp_frag_recv: readv failed: Connection reset by peer (104)
>>> =>>    PBS: job killed: node 1 (DFT045) requested job terminate, 'EOF'
>>> (code 1099) - received SISTER_EOF attempting to communicate with
>>> sister MOM's
>>> mpirun: killing job...
>>>
>>> "
>>>
>>>
>>>
>>> And from the MOM logs, I simply get a "sister process died" kind of
>>> message.
>>>
>>> I have tried to run a simple "MPI Hello World" test on as many as 10
>>> nodes, utilizing all cores, and I have no trouble doing that.
>>>
>>> Does anybody have experiences like this?
>>>
>>> Regards
>>>
>>> Jesper Selknæs
>>>
>>> _______________________________________________
>>> gpaw-users mailing list
>>> gpaw-users at listserv.fysik.dtu.dk
>>> https://listserv.fysik.dtu.dk/mailman/listinfo/gpaw-users
>>>
>>