[gpaw-users] Problem running parallel GPAW
Marcin Dulak
Marcin.Dulak at fysik.dtu.dk
Mon Apr 30 14:44:09 CEST 2012
Hi,
that looks like an assertion about the number of SCF iterations in one
of the tests.
Maybe line 34 in
https://trac.fysik.dtu.dk/projects/gpaw/browser/trunk/gpaw/test/2Al.py ?
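
For reference, the failing check is the equal() helper in
gpaw/test/__init__.py (line 25 in the traceback below). A minimal
sketch of what such a helper does, reconstructed from the message
format in your output (the exact code may differ between GPAW
versions):

    # Sketch of equal() from gpaw/test/__init__.py, reconstructed
    # from the error message format; not the exact implementation.
    def equal(x, y, tolerance=0, msg=''):
        error = x - y
        if abs(error) > tolerance:
            msg += '%s != %s (error: |%s| > %s)' % (x, y, error,
                                                    tolerance)
            raise AssertionError(msg)

So "22 != 24 (error: |-2| > 0)" means the calculation converged in 22
SCF iterations where the test expects exactly 24, compared with zero
tolerance.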
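Regarding the "Connection reset by peer" errors in your first mail
below: since a plain MPI hello world runs fine across nodes, it may be
worth checking whether GPAW's own MPI layer also survives a cross-node
run. A minimal sketch (the file name is just an example; run it with
gpaw-python under mpirun, e.g. "mpirun -np 24 gpaw-python check_mpi.py",
letting TORQUE/OpenMPI supply the host list):

    # check_mpi.py: trivial cross-node check of GPAW's MPI layer.
    from gpaw.mpi import world

    # Each rank reports in; the barrier forces actual communication
    # over the same transport GPAW uses.
    print('rank %d of %d is alive' % (world.rank, world.size))
    world.barrier()

If this already dies when spread over more than one node, the problem
is in the MPI/network setup rather than in the 2Al.py test itself.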
Best regards,
Marcin
On 04/30/12 14:37, Jesper Rude Selknæs wrote:
> Hi list
>
> Running the script interactively on the cluster, I managed to get a
> bit more information about the problem. When the problem occurs,
> something like this is output (I have only pasted in part of the
> traceback lines, but they are all the same):
>
> File "/usr/lib64/python2.6/site-packages/gpaw/test/__init__.py",
> line 25, in equal
> File "/usr/lib64/python2.6/site-packages/gpaw/test/__init__.py",
> line 25, in equal
> File "/usr/lib64/python2.6/site-packages/gpaw/test/__init__.py",
> line 25, in equal
> File "/usr/lib64/python2.6/site-packages/gpaw/test/__init__.py",
> line 25, in equal
> File "/usr/lib64/python2.6/site-packages/gpaw/test/__init__.py",
> line 25, in equal
> File "/usr/lib64/python2.6/site-packages/gpaw/test/__init__.py",
> line 25, in equal
> raise AssertionError(msg)
> AssertionError: 22 != 24 (error: |-2|> 0)
> GPAW CLEANUP (node 2):<type 'exceptions.AssertionError'> occurred.
> Calling MPI_Abort!
> raise AssertionError(msg)
> AssertionError: 22 != 24 (error: |-2|> 0)
> GPAW CLEANUP (node 32):<type 'exceptions.AssertionError'> occurred.
> Calling MPI_Abort!
> raise AssertionError(msg)
> AssertionError: 22 != 24 (error: |-2|> 0)
> GPAW CLEANUP (node 27):<type 'exceptions.AssertionError'> occurred.
> Calling MPI_Abort!
> raise AssertionError(msg)
> AssertionError: 22 != 24 (error: |-2|> 0)
> raise AssertionError(msg)
> AssertionError: 22 != 24 (error: |-2|> 0)
> raise AssertionError(msg)
> AssertionError: 22 != 24 (error: |-2|> 0)
> GPAW CLEANUP (node 10):<type 'exceptions.AssertionError'> occurred.
> Calling MPI_Abort!
> GPAW CLEANUP (node 13):<type 'exceptions.AssertionError'> occurred.
> Calling MPI_Abort!
> GPAW CLEANUP (node 0):<type 'exceptions.AssertionError'> occurred.
> Calling MPI_Abort!
> raise AssertionError(msg)
> AssertionError: 22 != 24 (error: |-2|> 0)
> GPAW CLEANUP (node 3):<type 'exceptions.AssertionError'> occurred.
> Calling MPI_Abort!
> raise AssertionError(msg)
> AssertionError: 22 != 24 (error: |-2|> 0)
> raise AssertionError(msg)
> AssertionError: 22 != 24 (error: |-2|> 0)
> :
>
>
> I tried googling the assertion error, but it only yielded results
> concerning conditions other than the 22 != 24 seen in my case.
>
> Thanks in advance for any new input
>
> R.
>
> Jesper
>
>
> 2012/4/24 Jesper Rude Selknæs <jesperrude at gmail.com>:
>> Hi list
>>
>> I have recently installed GPAW, ASE3 and DACAPO on a cluster,
>> consisting of 42 nodes of 12 cores each.
>>
>> I am using Red Hat Enterprise Linux 6.0 as my OS, OpenMPI 1.5.3 and TORQUE.
>>
>> I have absolutely no problems running parallel GPAW applications, as
>> long as all the processes are on the same node. However, as soon as I
>> spread the processes over more than one node, I get into trouble. The
>> stderr file from the TORQUE job gives me:
>>
>>
>> "
>> [DFT046][[35378,1],5][btl_tcp_frag.c:215:mca_btl_tcp_frag_recv]
>> mca_btl_tcp_frag_recv: readv failed: Connection reset by peer (104)
>> [DFT046][[35378,1],10][btl_tcp_frag.c:215:mca_btl_tcp_frag_recv]
>> mca_btl_tcp_frag_recv: readv failed: Connection reset by peer (104)
>> [DFT046][[35378,1],4][btl_tcp_frag.c:215:mca_btl_tcp_frag_recv]
>> mca_btl_tcp_frag_recv: readv failed: Connection reset by peer (104)
>> [DFT046][[35378,1],7][btl_tcp_frag.c:215:mca_btl_tcp_frag_recv]
>> mca_btl_tcp_frag_recv: readv failed: Connection reset by peer (104)
>> [DFT046][[35378,1],1][btl_tcp_frag.c:215:mca_btl_tcp_frag_recv]
>> mca_btl_tcp_frag_recv: readv failed: Connection reset by peer (104)
>> [DFT046][[35378,1],6][btl_tcp_frag.c:215:mca_btl_tcp_frag_recv]
>> mca_btl_tcp_frag_recv: readv failed: Connection reset by peer (104)
>> =>> PBS: job killed: node 1 (DFT045) requested job terminate, 'EOF'
>> (code 1099) - received SISTER_EOF attempting to communicate with
>> sister MOM's
>> mpirun: killing job...
>>
>> "
>>
>>
>>
>> And from the MOM logs, I simply get a "sister process died" kind of message.
>>
>> I have tried to run a simple "MPI HELLO WORLD" test on as many as 10
>> nodes, utilizing all cores, and I have no trouble doing that.
>>
>> Does anybody have experience with problems like this?
>>
>> Regards
>>
>> Jesper Selknæs
> _______________________________________________
> gpaw-users mailing list
> gpaw-users at listserv.fysik.dtu.dk
> https://listserv.fysik.dtu.dk/mailman/listinfo/gpaw-users
>