[gpaw-users] Problem running parallel GPAW

Jesper Rude Selknæs jesperrude at gmail.com
Wed Apr 25 10:29:16 CEST 2012


Hi

Thanks for your quick response, it is indeed appreciated.

To answer your questions:

" is openmpi compiled with torque support?"
Yes, and it works with a simple MPI application that prints "Hello
world" from every process.


"How do you run parallel gpaw (exact command line)?"
I created a gpaw_qsub script, as in the "Simple submit" section of
https://wiki.fysik.dtu.dk/gpaw/documentation/parallel_runs/parallel_runs.html
My exact script looks like this:

#!/usr/bin/env python
# gpaw_qsub: wrap a GPAW input script in a small PBS job script and submit it.
from sys import argv
import os
options = ' '.join(argv[1:-1])  # everything except the last argument is passed on to qsub
job = argv[-1]                  # the last argument is the GPAW script to run
dir = os.getcwd()
f = open('script.sh', 'w')
f.write("""\
#PBS -N %s
NP=`wc -l < $PBS_NODEFILE`
cd %s
mpirun -np $NP -machinefile $PBS_NODEFILE gpaw-python %s
""" % (job, dir, job))
f.close()
os.system('qsub ' + options + ' script.sh')



And I execute the script like this:
gpaw_qsub -q default_new -l nodes=1:ppn=12:new,walltime=72:00:00 -o
$HOME/test_mpi/2al_smp.out -e $HOME/test_mpi/2al_smp.err 2AL.py
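
For this particular submission the generated script.sh ends up looking roughly
like the following (the cd line in the real file contains the absolute path
returned by os.getcwd(); I am assuming here it is the $HOME/test_mpi directory
that the output files point at):

#PBS -N 2AL.py
NP=`wc -l < $PBS_NODEFILE`
cd $HOME/test_mpi    # in the real file: the absolute path from os.getcwd()
mpirun -np $NP -machinefile $PBS_NODEFILE gpaw-python 2AL.py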


The reference to "new" has to do with the fact that my cluster is
divided into a range of old node nodes, and a range of new nodes.


"Then, could you run this example
> https://svn.fysik.dtu.dk/projects/gpaw/trunk/gpaw/test/2Al.py"

It gives me the same problem; however, stderr yields a bit more
information this time. I've attached the stderr file, as it is quite
large. If you prefer that I paste it in here, please let me know and I
will do so.


" Do other programs run correctly?
> For example ring_c.c or connectivity_c.c from
> http://svn.open-mpi.org/svn/ompi/trunk/examples/"

Works like a charm.
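
(For the record, they were built and run along these lines, with the same kind
of multi-node request as above; the exact invocation is reconstructed from
memory and may have differed slightly:)

mpicc ring_c.c -o ring_c                              # likewise for connectivity_c.c
mpirun -np $NP -machinefile $PBS_NODEFILE ./ring_c    # inside the PBS job; flags from memory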



Thanks for your effort so far.

Regards.

Jesper




On 24 Apr 2012 14:20, Marcin Dulak <Marcin.Dulak at fysik.dtu.dk> wrote:
> Hi,
>
>
> On 04/24/12 11:28, Jesper Rude Selknæs wrote:
>>
>> Hi List
>>
>> I have recently installed GPAW, ASE3 and DACAPO on a cluster,
>> consisting of 42 nodes of 12 cores each.
>>
>> I am using Red Hat Enterprise Linux 6.0 as my OS, OpenMPI 1.5.3 and TORQUE.
>>
>> I have absolutely no problems running parallel GPAW applications, as
>> long as all the processes are on the same node. However, as soon as I
>> spread the processes over more than one node, I get into trouble. The
>> stderr file from the TORQUE job gives me:
>
> is openmpi compiled with torque support?
> How do you run parallel gpaw (exact command line)?
> Does this work across the nodes?:
> mpiexec gpaw-python -c "from gpaw.mpi import rank; print rank"
> Then, could you run this example
> https://svn.fysik.dtu.dk/projects/gpaw/trunk/gpaw/test/2Al.py
> Do other programs run correctly?
> For example ring_c.c or connectivity_c.c from
> http://svn.open-mpi.org/svn/ompi/trunk/examples/
>
> Marcin
>
>>
>> "
>> [DFT046][[35378,1],5][btl_tcp_frag.c:215:mca_btl_tcp_frag_recv]
>> mca_btl_tcp_frag_recv: readv failed: Connection reset by peer (104)
>> [DFT046][[35378,1],10][btl_tcp_frag.c:215:mca_btl_tcp_frag_recv]
>> mca_btl_tcp_frag_recv: readv failed: Connection reset by peer (104)
>> [DFT046][[35378,1],4][btl_tcp_frag.c:215:mca_btl_tcp_frag_recv]
>> mca_btl_tcp_frag_recv: readv failed: Connection reset by peer (104)
>> [DFT046][[35378,1],7][btl_tcp_frag.c:215:mca_btl_tcp_frag_recv]
>> mca_btl_tcp_frag_recv: readv failed: Connection reset by peer (104)
>> [DFT046][[35378,1],1][btl_tcp_frag.c:215:mca_btl_tcp_frag_recv]
>> mca_btl_tcp_frag_recv: readv failed: Connection reset by peer (104)
>> [DFT046][[35378,1],6][btl_tcp_frag.c:215:mca_btl_tcp_frag_recv]
>> mca_btl_tcp_frag_recv: readv failed: Connection reset by peer (104)
>> =>>  PBS: job killed: node 1 (DFT045) requested job terminate, 'EOF'
>> (code 1099) - received SISTER_EOF attempting to communicate with
>> sister MOM's
>> mpirun: killing job...
>>
>> "
>>
>>
>>
>> And from the MOM logs, I simply get a "sister process died" kind of
>> message.
>>
>> I have tried to run a simple "MPI hello world" test on as many as 10
>> nodes, utilizing all cores, and I have no trouble doing that.
>>
>> Does anybody have experiences like this?
>
>
>>
>> Regards
>>
>> Jesper Selknæs
>>
>> _______________________________________________
>> gpaw-users mailing list
>> gpaw-users at listserv.fysik.dtu.dk
>> https://listserv.fysik.dtu.dk/mailman/listinfo/gpaw-users
>>
>
>


