[gpaw-users] Error when relaxing atoms

jingzhe Chen jingzhe.chen at gmail.com
Sun Feb 8 01:42:43 CET 2015


Hi all,
      On one cluster, where gpaw is compiled with blas/lapack, this error
occurs again and again, and even the forces are not the same on all ranks
after broadcasting.  The error disappears when I run the same script on
another cluster (also blas/lapack).

      Best.
      Jingzhe

On Fri, Feb 6, 2015 at 12:41 PM, jingzhe <jingzhe.chen at gmail.com> wrote:

> Dear all,
>
>              I ran again in debug mode. The atomic positions on different
> ranks can differ by about 0.01 Å, and even the forces on different ranks
> differ by about 1 eV/Å, while each time only one rank behaves oddly.
> I have now swapped the two lines (broadcast and symmetry correction)
> in the force calculator to see what happens.
>
>             Best.
>
>              Jingzhe
>
>
> On 2015-02-05 15:53, Jens Jørgen Mortensen wrote:
>
>> On 02/04/2015 05:12 PM, Ask Hjorth Larsen wrote:
>>
>>> I committed something in r12401 which should make the check more
>>> reliable.  It does not use hashing because the atoms object is sent
>>> anyway.
>>>
>>
>> Thanks a lot for fixing this!  Should there also be some tolerance for
>> the unit cell?
>>
>> Jens Jørgen
>>
>>> Best regards
>>> Ask
>>>
>>> 2015-02-04 14:47 GMT+01:00 Ask Hjorth Larsen <asklarsen at gmail.com>:
>>>
>>>> Well, to clarify a bit.
>>>>
>>>> The hashing is useful if we don't want to send stuff around.
>>>>
>>>> If we are actually sending the positions now (by broadcast; I am only
>>>> strictly aware that the forces are broadcast), then each core can
>>>> compare locally without the need for hashing, to see if it wants to
>>>> raise an error.  (Raising errors on some cores but not all is
>>>> sometimes annoying though.)
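The local comparison described above could look roughly like the following. This is a minimal sketch and not GPAW's code: `check_positions`, `max_deviation` and the tolerance value are hypothetical, and a plain list of per-rank coordinate lists stands in for the MPI broadcast.

```python
def max_deviation(local, reference):
    """Largest absolute componentwise difference between two flat lists."""
    return max(abs(x - y) for x, y in zip(local, reference))


def check_positions(positions_by_rank, tol=1e-8):
    """Compare every rank's positions with rank 0's copy.

    positions_by_rank[r] is the flat coordinate list held by rank r.  In a
    real MPI run each rank would hold only its own list plus the copy
    broadcast from rank 0; the list of lists is an assumption made here
    so the sketch runs without MPI.
    """
    reference = positions_by_rank[0]  # stands in for the broadcast
    bad_ranks = [rank for rank, local in enumerate(positions_by_rank)
                 if max_deviation(local, reference) > tol]
    # Collecting the flags from all ranks (an all-reduce in real MPI) lets
    # every rank raise the same error instead of only the odd one out.
    if bad_ranks:
        raise RuntimeError(
            f"Positions differ from rank 0 on ranks {bad_ranks}")
```

Sharing the mismatch flags before raising is one way to avoid the situation where only some cores raise an error.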
>>>>
>>>> Best regards
>>>> Ask
>>>>
>>>> 2015-02-04 12:57 GMT+01:00 Ask Hjorth Larsen <asklarsen at gmail.com>:
>>>>
>>>>> Hello
>>>>>
>>>>> 2015-02-04 10:21 GMT+01:00 Torsten Hahn <torstenhahn at fastmail.fm>:
>>>>>
>>>>>> We could probably do this, but my feeling is that it would only
>>>>>> cure the symptoms, not the real origin of this annoying bug.
>>>>>>
>>>>>>
>>>>>> In fact there is code in
>>>>>>
>>>>>> mpi/__init__.py
>>>>>>
>>>>>> that says:
>>>>>>
>>>>>> # Construct fingerprint:
>>>>>> # ASE may return slightly different atomic positions (e.g. due
>>>>>> # to MKL) so compare only first 8 decimals of positions
>>>>>>
>>>>>>
>>>>>> The comment says that only the first 8 decimals of the positions are
>>>>>> used to generate the atomic "fingerprints". This code relies on numpy
>>>>>> and therefore on lapack/blas functions. However, I have no idea what
>>>>>> that md5_array etc. stuff really does. But there is some debug code
>>>>>> which should at least tell you which atom(s) cause the problem.
>>>>>>
>>>>> md5_array calculates the md5 sum of the data of an array. It is a
>>>>> kind of checksum.
>>>>>
>>>>> Rounding unfortunately does not solve the problem.  For any epsilon,
>>>>> however small, there exist numbers that differ by epsilon but round
>>>>> to different numbers.  So the check will not work the way it is
>>>>> implemented at the moment: Positions that are "close enough" can
>>>>> currently generate an error.  In other words if you get this error,
>>>>> maybe there was no problem at all.  Given the vast thousands of DFT
>>>>> calculations that are done, this may not be so unlikely.
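The rounding problem can be reproduced in a few lines. This is only an illustration, not the code in mpi/__init__.py: `rounded_fingerprint` is a hypothetical helper that rounds to 8 decimals and takes an MD5 sum of the result.

```python
import hashlib


def rounded_fingerprint(values, decimals=8):
    """MD5 of the values after rounding to `decimals` decimal places."""
    text = ",".join(f"{v:.{decimals}f}" for v in values)
    return hashlib.md5(text.encode()).hexdigest()


# Two coordinates that differ by only 2e-12 but sit on opposite sides of
# a rounding boundary at the 8th decimal.
a = [0.123456785 - 1e-12]
b = [0.123456785 + 1e-12]

same = rounded_fingerprint(a) == rounded_fingerprint(b)
print(abs(a[0] - b[0]), same)
```

The two inputs agree to about 11 decimals, yet their fingerprints differ, so a check of this kind can fail even though the positions are close enough.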
>>>>>
>>>>>> However, that error is *very* strange because mpi.broadcast(...)
>>>>>> should result in *exactly* the same objects on all cores. No idea why there
>>>>>> should be any difference at all and what was the intention behind the fancy
>>>>>> fingerprint-generation stuff in the compare_atoms(atoms, comm=world) method.
>>>>>>
>>>>> The check was introduced because there were (infrequent) situations
>>>>> where different cores had different positions, due e.g. to the finicky
>>>>> numerics elsewhere discussed.  Later, I guess we have accepted the
>>>>> numerical issues and relaxed the check so it is no longer exact,
>>>>> preferring instead to broadcast.  Evidently something else is
>>>>> happening aside from the broadcast, which allows things to go wrong.
>>>>> Perhaps the error in the rounding scheme mentioned above.
>>>>>
>>>>> To explain the hashing: We want to check that numbers on two different
>>>>> CPUs are equal.  Either we have to send all the numbers, or hash them
>>>>> and send the hash.  Hence hashing is much nicer.  But maybe it would
>>>>> be better to hash them with a continuous function.  For example adding
>>>>> all numbers with different (pseudorandom?) complex phase factors.
>>>>> Then one can compare the complex hashes and see if they are close
>>>>> enough to each other.  There are probably better ways.
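A continuous hash of the kind suggested above might be sketched as follows. The names `phase_hash` and `hashes_agree`, the seed and the tolerance are all assumptions made for illustration; every rank would have to use the same seed so that the phase factors match.

```python
import cmath
import random


def phase_hash(values, seed=42):
    """Sum of the values weighted by pseudorandom unit-modulus phases."""
    rng = random.Random(seed)  # same seed everywhere gives the same phases
    total = 0j
    for v in values:
        phase = cmath.exp(2j * cmath.pi * rng.random())
        total += v * phase
    return total


def hashes_agree(h1, h2, tol=1e-8):
    """True if two complex hashes are within `tol` of each other."""
    return abs(h1 - h2) <= tol
```

Because the hash is linear in the inputs, the two hashes cannot differ by more than the sum of the absolute differences of the inputs, so nearly equal positions always give nearly equal hashes. The converse is not guaranteed: large differences can in principle cancel, which is one reason there are probably better ways.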
>>>>>
>>>>> Best regards
>>>>> Ask
>>>>>
>>>>>> Best,
>>>>>> Torsten.
>>>>>>
>>>>>>  On 04.02.2015 at 10:00, jingzhe <jingzhe.chen at gmail.com> wrote:
>>>>>>>
>>>>>>> Hi Torsten,
>>>>>>>
>>>>>>>               Thanks for the quick reply, but I use gcc and
>>>>>>> lapack/blas. What I mean is: if the positions of the atoms differ
>>>>>>> slightly between ranks because of compiler/library issues, can we
>>>>>>> just set a tolerance in check_atoms and skip the error?
>>>>>>>
>>>>>>>               Best.
>>>>>>>
>>>>>>>               Jingzhe
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> On 2015-02-04 14:32, Torsten Hahn wrote:
>>>>>>>
>>>>>>>> Dear Jingzhe,
>>>>>>>>
>>>>>>>> we have often seen this error when using GPAW together with Intel
>>>>>>>> MKL <= 11.x on Intel CPUs. I never tracked down the cause because the
>>>>>>>> error was gone after a compiler/library upgrade.
>>>>>>>>
>>>>>>>> Best,
>>>>>>>> Torsten.
>>>>>>>>
>>>>>>>>
>>>>>>>> --
>>>>>>>> Dr. Torsten Hahn
>>>>>>>> torstenhahn at fastmail.fm
>>>>>>>>
>>>>>>>>  On 04.02.2015 at 07:27, jingzhe Chen <
>>>>>>>>> jingzhe.chen at gmail.com> wrote:
>>>>>>>>>
>>>>>>>>> Dear GPAW guys,
>>>>>>>>>
>>>>>>>>>          I used the latest gpaw to run a relaxation job and got
>>>>>>>>> the error message below.
>>>>>>>>>
>>>>>>>>>       RuntimeError: Atoms objects on different processors are not identical!
>>>>>>>>>
>>>>>>>>>          I found the line 'wfs.world.broadcast(self.F_av, 0)' in
>>>>>>>>> the force calculator, so the forces on all ranks should be the
>>>>>>>>> same. This confuses me; I cannot think of any other reason that
>>>>>>>>> could lead to this error.
>>>>>>>>>
>>>>>>>>>         Could anyone take a look at it?
>>>>>>>>>
>>>>>>>>>         I attached the structure file and the run script here; I
>>>>>>>>> used 24 cores.
>>>>>>>>>
>>>>>>>>>         Thanks in advance.
>>>>>>>>>
>>>>>>>>>           Jingzhe
>>>>>>>>>
>>>>>>>>> <main.py> <model.traj>
>>>>>>>>> _______________________________________________
>>>>>>>>>
>>>>>>>>> gpaw-users mailing list
>>>>>>>>> gpaw-users at listserv.fysik.dtu.dk
>>>>>>>>> https://listserv.fysik.dtu.dk/mailman/listinfo/gpaw-users
>>>>>>>>>
>>>>>>>>