[gpaw-users] Problem running parallel GPAW

Marcin Dulak Marcin.Dulak at fysik.dtu.dk
Tue May 1 10:48:44 CEST 2012


Hi,


On 04/30/12 16:53, Jesper Rude Selknæs wrote:
> Hi list
>
> I did a bit more investigation. From
> https://wiki.fysik.dtu.dk/gpaw/install/installationguide.html#running-tests
>   I ran the rank example:
>
> [gpaw]$ mpirun -np 2 gpaw-python -c "import gpaw.mpi as mpi; print mpi.rank"
>
> which ran flawlessly for up to 10 nodes (120  cores total).
>
> I then ran [gpaw]$ mpirun -np 2 gpaw-python gpaw/test/spinpol.py
>
> which fails as soon as I go beyond one node. It runs fine for 12
> cores within the same node. I know that there is no speedup (actually
> there is a heavy slowdown) using this many cores, but it executes
> correctly, which is all I want for the moment. Part of the output from
> spinpol.py on 2 nodes is given below, as it is quite large.
>
> Any ideas?

I think the failures you see are simply due to running small test 
examples on many cores, like the case of spinpol.py on 24 cores. Can 
you try to run spinpol.py across two nodes, but limiting the number 
of cores to 12 (mpiexec -np 12 --bynode)?
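For example, reusing the test file path from your earlier command 
(adjust the path to your setup):

mpiexec -np 12 --bynode gpaw-python gpaw/test/spinpol.py
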
After that you may try some larger examples (those should work across 
a few nodes), like:
https://wiki.fysik.dtu.dk/gpaw/devel/benchmarks.html#medium-size-system
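
For example, assuming you save the medium-size benchmark script from 
that page locally (the file name medium.py below is just a placeholder):

mpiexec -np 24 --bynode gpaw-python medium.py
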
It looks like the standard tests:

mpiexec gpaw-python `which gpaw-test`

(https://wiki.fysik.dtu.dk/gpaw/install/installationguide.html#run-the-tests) 
should not be run in parallel on more than X cores,
but there is currently nothing that prevents one from doing so. We don't 
run the standard tests on more than 8 cores.
You can still run the tests on a 12-core node using the -np N option, 
with N between 1 and 8. Do all tests pass on 12 cores?
If anything fails, please report it.
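
For example, to run the full test suite on 8 cores of a single node:

mpiexec -np 8 gpaw-python `which gpaw-test`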

Best regards,

Marcin

>
> R.
>
> Jesper
>
>
>
> File "/usr/lib/python2.6/site-packages/ase/atoms.py", line 548, in
> get_potential_energy
>      self.initialize(atoms)
>    File "/usr/lib64/python2.6/site-packages/gpaw/paw.py", line 591, in initialize
>      return self._calc.get_potential_energy(self)
>    File "/usr/lib64/python2.6/site-packages/gpaw/aseinterface.py", line
> 37, in get_potential_energy
>      self.calculate(atoms, converge=True)
>    File "/usr/lib64/python2.6/site-packages/gpaw/paw.py", line 221, in calculate
>      self.initialize(atoms)
>    File "/usr/lib64/python2.6/site-packages/gpaw/paw.py", line 591, in initialize
>      magmom_a, par.hund)
>      return self._calc.get_potential_energy(self)
>    File "/usr/lib64/python2.6/site-packages/gpaw/aseinterface.py", line
> 37, in get_potential_energy
>      self.calculate(atoms, converge=True)
>    File "/usr/lib64/python2.6/site-packages/gpaw/paw.py", line 221, in calculate
>    File "/usr/lib64/python2.6/site-packages/gpaw/density.py", line 74,
> in initialize
>      allocate=False)
>    File "/usr/lib64/python2.6/site-packages/gpaw/transformers.py", line
> 144, in Transformer
>      self.initialize(atoms)
>    File "/usr/lib64/python2.6/site-packages/gpaw/paw.py", line 591, in initialize
>      t = _Transformer(gdin, gdout, nn, dtype, allocate)
>    File "/usr/lib64/python2.6/site-packages/gpaw/transformers.py", line
> 49, in __init__
>      gdout.n_c[0] * gdout.n_c[1])
> AssertionError
>      magmom_a, par.hund)
>    File "/usr/lib64/python2.6/site-packages/gpaw/density.py", line 74,
> in initialize
>      allocate=False)
>    File "/usr/lib64/python2.6/site-packages/gpaw/transformers.py", line
> 144, in Transformer
>      t = _Transformer(gdin, gdout, nn, dtype, allocate)
>    File "/usr/lib64/python2.6/site-packages/gpaw/transformers.py", line
> 49, in __init__
>      gdout.n_c[0] * gdout.n_c[1])
>      magmom_a, par.hund)
>    File "/usr/lib64/python2.6/site-packages/gpaw/density.py", line 74,
> in initialize
> GPAW CLEANUP (node 1):<type 'exceptions.AssertionError'>  occurred.
> Calling MPI_Abort!
>      allocate=False)
>    File "/usr/lib64/python2.6/site-packages/gpaw/transformers.py", line
> 144, in Transformer
> AssertionError
>      t = _Transformer(gdin, gdout, nn, dtype, allocate)
>    File "/usr/lib64/python2.6/site-packages/gpaw/transformers.py", line
> 49, in __init__
> GPAW CLEANUP (node 17):<type 'exceptions.AssertionError'>  occurred.
> Calling MPI_Abort!
>      gdout.n_c[0] * gdout.n_c[1])
> AssertionError
> GPAW CLEANUP (node 2):<type 'exceptions.AssertionError'>  occurred.
> Calling MPI_Abort!
> --------------------------------------------------------------------------
> MPI_ABORT was invoked on rank 4 in communicator MPI_COMM_WORLD
> with errorcode 42.
>
> NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes.
> You may or may not see output from other processes, depending on
> exactly when Open MPI kills them.
> --------------------------------------------------------------------------
> --------------------------------------------------------------------------
> mpirun has exited due to process rank 0 with PID 5881 on
> node DFT030 exiting improperly. There are two reasons this could occur:
>
> 1. this process did not call "init" before exiting, but others in
> the job did. This can cause a job to hang indefinitely while it waits
> for all processes to call "init". By rule, if one process calls "init",
> then ALL processes must call "init" prior to termination.
>
> 2. this process called "init", but exited without calling "finalize".
> By rule, all processes that call "init" MUST call "finalize" prior to
> exiting or it will be considered an "abnormal termination"
>
> This may have caused other processes in the application to be
> terminated by signals sent by mpirun (as reported here).
> --------------------------------------------------------------------------
> [DFT030:05880] 10 more processes have sent help message
> help-mpi-api.txt / mpi-abort
> [DFT030:05880] Set MCA parameter "orte_base_help_aggregate" to 0 to
> see all help / error messages
>
>
>
>
> 2012/4/24 Jesper Rude Selknæs<jesperrude at gmail.com>:
>> Hi List
>>
>> I have recently installed GPAW, ASE3 and DACAPO on a cluster,
>> consisting of 42 nodes of 12 cores each.
>>
>> I am using Red Hat Enterprise Linux 6.0 as my OS, OpenMPI 1.5.3 and TORQUE.
>>
>> I have absolutely no problems running parallel GPAW applications, as
>> long as all the processes are on the same node. However, as soon as I
>> spread the processes over more than one node, I get into trouble. The
>> stderr file from the TORQUE job gives me:
>>
>>
>> "
>> [DFT046][[35378,1],5][btl_tcp_frag.c:215:mca_btl_tcp_frag_recv]
>> mca_btl_tcp_frag_recv: readv failed: Connection reset by peer (104)
>> [DFT046][[35378,1],10][btl_tcp_frag.c:215:mca_btl_tcp_frag_recv]
>> mca_btl_tcp_frag_recv: readv failed: Connection reset by peer (104)
>> [DFT046][[35378,1],4][btl_tcp_frag.c:215:mca_btl_tcp_frag_recv]
>> mca_btl_tcp_frag_recv: readv failed: Connection reset by peer (104)
>> [DFT046][[35378,1],7][btl_tcp_frag.c:215:mca_btl_tcp_frag_recv]
>> mca_btl_tcp_frag_recv: readv failed: Connection reset by peer (104)
>> [DFT046][[35378,1],1][btl_tcp_frag.c:215:mca_btl_tcp_frag_recv]
>> mca_btl_tcp_frag_recv: readv failed: Connection reset by peer (104)
>> [DFT046][[35378,1],6][btl_tcp_frag.c:215:mca_btl_tcp_frag_recv]
>> mca_btl_tcp_frag_recv: readv failed: Connection reset by peer (104)
>> =>>  PBS: job killed: node 1 (DFT045) requested job terminate, 'EOF'
>> (code 1099) - received SISTER_EOF attempting to communicate with
>> sister MOM's
>> mpirun: killing job...
>>
>> "
>>
>>
>>
>> And from the mom logs, I simply get a "sister process died" kind of message.
>>
>> I have tried to run a simple "MPI HELLO WORLD" test on as many as 10
>> nodes, utilizing all cores, and I have no trouble doing that.
>>
>> Does anybody have experiences like this?
>>
>> Regards
>>
>> Jesper Selknæs
> _______________________________________________
> gpaw-users mailing list
> gpaw-users at listserv.fysik.dtu.dk
> https://listserv.fysik.dtu.dk/mailman/listinfo/gpaw-users
>


-- 
***********************************

Marcin Dulak
Technical University of Denmark
Department of Physics
Building 307, Room 229
DK-2800 Kongens Lyngby
Denmark
Tel.: (+45) 4525 3157
Fax.: (+45) 4593 2399
email: Marcin.Dulak at fysik.dtu.dk

***********************************
