[gpaw-users] Problem running parallel GPAW

Jesper Rude Selknæs jesperrude at gmail.com
Wed Apr 25 14:31:32 CEST 2012


Hi

Printing the ranks worked fine, even on as many as ten nodes,
utilizing all cores (I could not lay my hands on more nodes at the
moment).
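
(For the record, the rank check was nothing more than the one-liner
below, submitted through the same gpaw_qsub wrapper, e.g. with
-l nodes=10:ppn=12:new; the file name is up to you.)

# print the MPI rank of each process started by gpaw-python
from gpaw.mpi import rank
print rank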

I removed the -np/-machinefile options from the submission script (the
modified wrapper is sketched below) and reran the 2Al.py example, but
the result is the same: as soon as I go beyond one node, it stops
working.
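
For reference, the wrapper now boils down to the sketch below (I leave
it to the torque-aware openmpi build to pick up the host list; the
comments are mine):

#!/usr/bin/env python
# gpaw_qsub with the "-np $NP -machinefile $PBS_NODEFILE" part dropped
from sys import argv
import os
options = ' '.join(argv[1:-1])   # everything but the last argument goes to qsub
job = argv[-1]                   # the python script to run
dir = os.getcwd()
f = open('script.sh', 'w')
f.write("""\
#PBS -N %s
cd %s
mpirun gpaw-python %s
""" % (job, dir, job))
f.close()
os.system('qsub ' + options + ' script.sh')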

R.

Jesper

On 25 Apr 2012 at 13:17, Marcin Dulak <Marcin.Dulak at fysik.dtu.dk> wrote:
> On 04/25/12 10:29, Jesper Rude Selknæs wrote:
>>
>> Hi
>>
>> Thanks for your quick response, it is indeed appreciated.
>>
>> To answer your questions:
>>
>> " is openmpi compiled with torque support?"
>> Yes, and it is working with a simple MPI application, outputting "Hello
>> world" on all processes.
>>
>>
>> "How do you run parallel gpaw (exact command line)?"
>> I created a gpaw_qsub script, as in the "Simple submit" section of
>>
>> https://wiki.fysik.dtu.dk/gpaw/documentation/parallel_runs/parallel_runs.html
>> My exact script looks like this:
>>
>> #!/usr/bin/env python
>> from sys import argv
>> import os
>> options = ' '.join(argv[1:-1])
>> job = argv[-1]
>> dir = os.getcwd()
>> f = open('script.sh', 'w')
>> f.write("""\
>> #PBS -N %s
>> NP=`wc -l < $PBS_NODEFILE`
>> cd %s
>> mpirun -np $NP -machinefile $PBS_NODEFILE gpaw-python %s
>> """ % (job, dir, job))
>> f.close()
>> os.system('qsub ' + options + ' script.sh')
>
> as a side comment:
> if openmpi is built with torque support one can drop
> "-np $NP -machinefile $PBS_NODEFILE".
> I also remember that some openmpi versions (around 1.2) had problems when
> both the -np and -machinefile options were specified.
>
>
>>
>> And I execute the script like this:
>> gpaw_qsub -q default_new -l nodes=1:ppn=12:new,walltime=72:00:00 -o
>> $HOME/test_mpi/2al_smp.out -e $HOME/test_mpi/2al_smp.err 2AL.py
>>
>>
>> The reference to "new" has to do with the fact that my cluster is
>> divided into a range of old nodes and a range of new nodes.
>>
> does printing the ranks work?:
>
>
> mpiexec gpaw-python -c "from gpaw.mpi import rank; print rank"
>
> This is equivalent to having
>
> from gpaw.mpi import rank; print rank
> in the python script (2Al.py).
>
>
>
>> "Then, could you run this example
>>>
>>> https://svn.fysik.dtu.dk/projects/gpaw/trunk/gpaw/test/2Al.py"
>>
>> It gives me the same problem; however, stderr yields a bit more
>> information this time. I've attached the stderr file, as it is quite
>> large. If you prefer that I paste it in here, please let me know and I
>> will do so.
>>
>> " Do other programs run correctly?
>>>
>>> For example ring_c.c or connectivity_c.c from
>>> http://svn.open-mpi.org/svn/ompi/trunk/examples/"
>>
>> Works like a charm.
>>
>>
>>
>> Thanks for your effort so far.
>>
>> Regards.
>>
>> Jesper
>>
>>
>>
>>
>> On 24 Apr 2012 at 14:20, Marcin Dulak <Marcin.Dulak at fysik.dtu.dk> wrote:
>>>
>>> Hi,
>>>
>>>
>>> On 04/24/12 11:28, Jesper Rude Selknæs wrote:
>>>>
>>>> Hi List
>>>>
>>>> I have recently installed GPAW, ASE3 and DACAPO on a cluster,
>>>> consisting of 42 nodes of 12 cores each.
>>>>
>>>> I am using Red Hat Enterprise Linux 6.0 as my OS, OpenMPI 1.5.3 and TORQUE.
>>>>
>>>> I have absolutely no problems running parallel GPAW applications, as
>>>> long as all the processes are on the same node. However, as soon as I
>>>> spread the processes over more than one node, I get into trouble. The
>>>> stderr file from the TORQUE job gives me:
>>>
>>> is openmpi compiled with torque support?
>>> How do you run parallel gpaw (exact command line)?
>>> Does this work across the nodes?:
>>> mpiexec gpaw-python -c "from gpaw.mpi import rank; print rank"
>>> Then, could you run this example
>>> https://svn.fysik.dtu.dk/projects/gpaw/trunk/gpaw/test/2Al.py
>>> Do other programs run correctly?
>>> For example ring_c.c or connectivity_c.c from
>>> http://svn.open-mpi.org/svn/ompi/trunk/examples/
>>>
>>> Marcin
>>>
>>>> "
>>>> [DFT046][[35378,1],5][btl_tcp_frag.c:215:mca_btl_tcp_frag_recv]
>>>> mca_btl_tcp_frag_recv: readv failed: Connection reset by peer (104)
>>>> [DFT046][[35378,1],10][btl_tcp_frag.c:215:mca_btl_tcp_frag_recv]
>>>> mca_btl_tcp_frag_recv: readv failed: Connection reset by peer (104)
>>>> [DFT046][[35378,1],4][btl_tcp_frag.c:215:mca_btl_tcp_frag_recv]
>>>> mca_btl_tcp_frag_recv: readv failed: Connection reset by peer (104)
>>>> [DFT046][[35378,1],7][btl_tcp_frag.c:215:mca_btl_tcp_frag_recv]
>>>> mca_btl_tcp_frag_recv: readv failed: Connection reset by peer (104)
>>>> [DFT046][[35378,1],1][btl_tcp_frag.c:215:mca_btl_tcp_frag_recv]
>>>> mca_btl_tcp_frag_recv: readv failed: Connection reset by peer (104)
>>>> [DFT046][[35378,1],6][btl_tcp_frag.c:215:mca_btl_tcp_frag_recv]
>>>> mca_btl_tcp_frag_recv: readv failed: Connection reset by peer (104)
>>>> =>>    PBS: job killed: node 1 (DFT045) requested job terminate, 'EOF'
>>>> (code 1099) - received SISTER_EOF attempting to communicate with
>>>> sister MOM's
>>>> mpirun: killing job...
>>>>
>>>> "
>>>>
>>>>
>>>>
>>>> And from the MOM logs, I simply get a "sister process died" kind of
>>>> message.
>>>>
>>>> I have tried to run a simple "MPI HELLO WORLD" test on as many as 10
>>>> nodes, utilizing all cores, and I have no trouble doing that.
>>>>
>>>> Does anybody have experiences like this?
>>>
>>>
>>>> Regards
>>>>
>>>> Jesper Selknæs
>>>>
>>>> _______________________________________________
>>>> gpaw-users mailing list
>>>> gpaw-users at listserv.fysik.dtu.dk
>>>> https://listserv.fysik.dtu.dk/mailman/listinfo/gpaw-users
>>>>
>>>
>


