[gpaw-users] Problem running parallel GPAW

Lars Grabow grabow at uh.edu
Wed Apr 25 16:02:21 CEST 2012


Hi Jesper,

I experienced the exact same problem, and after days of debugging we came up with this solution:

The cluster nodes as currently configured have an extra ethernet device called virbr0.
This is a virtual bridge device meant for communications within the XEN virtualization environment.
It gets in the way of MPI programs that span compute nodes. If you run on more than one compute node you will need to tell MPI not to use this interface.
Add the following option to your mpirun or orterun command line:

--mca btl_tcp_if_include eth0,lo
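
For example, the mpirun line in your submission script would then look something
like this (assuming eth0 is the interface your compute nodes actually talk over;
you can check with ifconfig or "ip addr" on a node):

mpirun --mca btl_tcp_if_include eth0,lo -np $NP -machinefile $PBS_NODEFILE gpaw-python 2Al.py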

Maybe your cluster setup has the same problem?

Cheers,
Lars

-------------------------------------------------------------------------------------------------------------
Lars C. Grabow
Assistant Professor of Chemical and Biomolecular Engineering
University of Houston
S335 Engineering Building 1
Houston, TX 77204-4004
email: grabow at uh.edu			web: http://www.chee.uh.edu/faculty/grabow
phone: (+1) 713-743-4326		fax: (+1) 713-743-4323





On Apr 25, 2012, at 7:31 AM, Jesper Rude Selknæs wrote:

> Hi
> 
> Printing the ranks worked fine, even on as many as ten nodes,
> utilizing all cores (I could not lay my hands on more nodes at the
> moment).
> 
> I removed the -np lines from the submission script and reran the
> 2Al.py example, but the results are the same: as soon as I go beyond
> one node, it stops working.
> 
> R.
> 
> Jesper
> 
> On 25 Apr 2012 at 13:17, Marcin Dulak <Marcin.Dulak at fysik.dtu.dk> wrote:
>> On 04/25/12 10:29, Jesper Rude Selknæs wrote:
>>> 
>>> Hi
>>> 
>>> Thanks for your quick response, it is indeed appreciated.
>>> 
>>> To answer your questions:
>>> 
>>> " is openmpi compiled with torque support?"
>>> Yes, and it is working with a simple MPI application, outputting "Hello
>>> world" on all processes.
>>> 
>>> 
>>> "How do you run parallel gpaw (exact command line)?"
>>> I created a gpaw_qsub script, as in the "Simple submit" section of
>>> 
>>> https://wiki.fysik.dtu.dk/gpaw/documentation/parallel_runs/parallel_runs.html
>>> My exact script looks like this:
>>> 
>>> #!/usr/bin/env python
>>> from sys import argv
>>> import os
>>> options = ' '.join(argv[1:-1])
>>> job = argv[-1]
>>> dir = os.getcwd()
>>> f = open('script.sh', 'w')
>>> f.write("""\
>>> #PBS -N %s
>>> NP=`wc -l < $PBS_NODEFILE`
>>> cd %s
>>> mpirun -np $NP -machinefile $PBS_NODEFILE gpaw-python %s
>>> """ % (job, dir, job))
>>> f.close()
>>> os.system('qsub ' + options + ' script.sh')
>> 
>> as a side comment:
>> if openmpi is built with torque support one can drop
>> "-np $NP -machinefile $PBS_NODEFILE".
>> I also remember that some openmpi versions (around 1.2) had problems when
>> both the -np and -machinefile options were specified.
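>> 
>> For instance, the mpirun line written into script.sh could then presumably be
>> shortened to just
>> 
>> mpirun gpaw-python %s
>> 
>> since a torque-aware openmpi picks up the node list from the batch environment.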
>> 
>> 
>>> 
>>> And I execute the script like this:
>>> gpaw_qsub -q default_new -l nodes=1:ppn=12:new,walltime=72:00:00 -o
>>> $HOME/test_mpi/2al_smp.out -e $HOME/test_mpi/2al_smp.err 2AL.py
>>> 
>>> 
>>> The reference to "new" has to do with the fact that my cluster is
>>> divided into a range of old nodes and a range of new nodes.
>>> 
>> does printing the ranks work?:
>> 
>> 
>> mpiexec gpaw-python -c "from gpaw.mpi import rank; print rank"
>> 
>> This is equivalent to having
>> 
>> from gpaw.mpi import rank; print rank
>> in the python script (2Al.py).
>> 
>> 
>> 
>>> "Then, could you run this example
>>>> 
>>>> https://svn.fysik.dtu.dk/projects/gpaw/trunk/gpaw/test/2Al.py"
>>> 
>>> It gives me the same problem; however, stderr yields a bit more
>>> information this time. I've attached the stderr file, as it is quite
>>> large. If you prefer that I paste it in here, please let me know and I
>>> will do so.
>>> 
>>> " Do other programs run correctly?
>>>> 
>>>> For example ring_c.c or connectivity_c.c from
>>>> http://svn.open-mpi.org/svn/ompi/trunk/examples/"
>>> 
>>> Works like a charm.
>>> 
>>> 
>>> 
>>> Thanks for your effort so far.
>>> 
>>> Regards.
>>> 
>>> Jesper
>>> 
>>> 
>>> 
>>> 
>>> On 24 Apr 2012 at 14:20, Marcin Dulak <Marcin.Dulak at fysik.dtu.dk> wrote:
>>>> 
>>>> Hi,
>>>> 
>>>> 
>>>> On 04/24/12 11:28, Jesper Rude Selknæs wrote:
>>>>> 
>>>>> Hi List
>>>>> 
>>>>> I have recently installed GPAW, ASE3 and DACAPO on a cluster,
>>>>> consisting of 42 nodes of 12 cores each.
>>>>> 
>>>>> I am using Red Hat Enterprise Linux 6.0 as my OS, OpenMPI 1.5.3 and TORQUE.
>>>>> 
>>>>> I have absolutely no problems running parallel GPAW applications, as
>>>>> long as all the processes are on the same node. However, as soon as I
>>>>> spread the processes over more than one node, I get into trouble. The
>>>>> stderr file from the TORQUE job gives me:
>>>> 
>>>> is openmpi compiled with torque support?
>>>> How do you run parallel gpaw (exact command line)?
>>>> Does this work across the nodes?:
>>>> mpiexec gpaw-python -c "from gpaw.mpi import rank; print rank"
>>>> Then, could you run this example
>>>> https://svn.fysik.dtu.dk/projects/gpaw/trunk/gpaw/test/2Al.py
>>>> Do other programs run correctly?
>>>> For example ring_c.c or connectivity_c.c from
>>>> http://svn.open-mpi.org/svn/ompi/trunk/examples/
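>>>> 
>>>> Something along these lines should work (assuming the example sources are
>>>> copied to the working directory and compiled with the same openmpi):
>>>> 
>>>> mpicc ring_c.c -o ring_c
>>>> mpirun -np $NP -machinefile $PBS_NODEFILE ./ring_c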
>>>> 
>>>> Marcin
>>>> 
>>>>> "
>>>>> [DFT046][[35378,1],5][btl_tcp_frag.c:215:mca_btl_tcp_frag_recv]
>>>>> mca_btl_tcp_frag_recv: readv failed: Connection reset by peer (104)
>>>>> [DFT046][[35378,1],10][btl_tcp_frag.c:215:mca_btl_tcp_frag_recv]
>>>>> mca_btl_tcp_frag_recv: readv failed: Connection reset by peer (104)
>>>>> [DFT046][[35378,1],4][btl_tcp_frag.c:215:mca_btl_tcp_frag_recv]
>>>>> mca_btl_tcp_frag_recv: readv failed: Connection reset by peer (104)
>>>>> [DFT046][[35378,1],7][btl_tcp_frag.c:215:mca_btl_tcp_frag_recv]
>>>>> mca_btl_tcp_frag_recv: readv failed: Connection reset by peer (104)
>>>>> [DFT046][[35378,1],1][btl_tcp_frag.c:215:mca_btl_tcp_frag_recv]
>>>>> mca_btl_tcp_frag_recv: readv failed: Connection reset by peer (104)
>>>>> [DFT046][[35378,1],6][btl_tcp_frag.c:215:mca_btl_tcp_frag_recv]
>>>>> mca_btl_tcp_frag_recv: readv failed: Connection reset by peer (104)
>>>>> =>>    PBS: job killed: node 1 (DFT045) requested job terminate, 'EOF'
>>>>> (code 1099) - received SISTER_EOF attempting to communicate with
>>>>> sister MOM's
>>>>> mpirun: killing job...
>>>>> 
>>>>> "
>>>>> 
>>>>> 
>>>>> 
>>>>> And from the MOM logs, I simply get a "sister process died" kind of
>>>>> message.
>>>>> 
>>>>> I have tried to run a simple "MPI HELLO WORLD" test on as many as 10
>>>>> nodes, utilizing all cores, and I have no trouble doing that.
>>>>> 
>>>>> Does anybody have experiences like this?
>>>> 
>>>> 
>>>>> Regards
>>>>> 
>>>>> Jesper Selknæs
>>>>> 
>>>>> _______________________________________________
>>>>> gpaw-users mailing list
>>>>> gpaw-users at listserv.fysik.dtu.dk
>>>>> https://listserv.fysik.dtu.dk/mailman/listinfo/gpaw-users
>>>>> 
>>>> 
>> 
> 
> _______________________________________________
> gpaw-users mailing list
> gpaw-users at listserv.fysik.dtu.dk
> https://listserv.fysik.dtu.dk/mailman/listinfo/gpaw-users
