[gpaw-users] Error when relaxing atoms

Thu Feb 5 08:53:08 CET 2015

On 02/04/2015 05:12 PM, Ask Hjorth Larsen wrote:
> I committed something in r12401 which should make the check more
> reliable.  It does not use hashing because the atoms object is sent
> anyway.

Thanks a lot for fixing this!  Should there also be some tolerance for 
the unit cell?

Jens Jørgen

> Best regards
> Ask
>
> 2015-02-04 14:47 GMT+01:00 Ask Hjorth Larsen <asklarsen at gmail.com>:
>> Well, to clarify a bit.
>>
>> The hashing is useful if we don't want to send stuff around.
>>
>> If we are actually sending the positions now (by broadcast; I am only
>> strictly aware that the forces are broadcast), then each core can
>> compare locally without the need for hashing, to see if it wants to
>> raise an error.  (Raising errors on some cores but not all is
>> sometimes annoying though.)
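
A minimal sketch of the local comparison described above, assuming the positions from rank 0 have already been broadcast to every rank. The helper names positions_match and all_ranks_agree and the 1e-8 tolerance are illustrative assumptions, not GPAW's API; the ranks are simulated with a plain list so the example runs without MPI.

```python
import numpy as np

ATOL = 1e-8  # assumed tolerance, not a value taken from GPAW


def positions_match(reference, local, atol=ATOL):
    """Return True if `local` agrees with `reference` to within `atol`."""
    reference = np.asarray(reference, dtype=float)
    local = np.asarray(local, dtype=float)
    if reference.shape != local.shape:
        return False
    return bool(np.allclose(reference, local, rtol=0.0, atol=atol))


def all_ranks_agree(flags):
    """Stand-in for a logical-AND reduction over all ranks (hypothetical)."""
    return all(flags)


# Simulate three ranks; `reference` plays the role of the broadcast positions.
reference = np.array([[0.0, 0.0, 0.0], [0.0, 0.0, 0.74]])
per_rank = [
    reference.copy(),
    reference + 1e-12,  # numerical noise only
    reference + 1e-3,   # a genuinely different geometry
]
flags = [positions_match(reference, p) for p in per_rank]
consistent = all_ranks_agree(flags)
print(flags, consistent)
```

Combining the per-rank flags before deciding whether to raise means every rank takes the same branch, which avoids the case where only some cores raise the error.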
>>
>> Best regards
>> Ask
>>
>> 2015-02-04 12:57 GMT+01:00 Ask Hjorth Larsen <asklarsen at gmail.com>:
>>> Hello
>>>
>>> 2015-02-04 10:21 GMT+01:00 Torsten Hahn <torstenhahn at fastmail.fm>:
>>>> We could probably do this, but my feeling is that it would only cure the symptoms, not the real origin of this annoying bug.
>>>>
>>>>
>>>> In fact there is code in
>>>>
>>>> mpi/__init__.py
>>>>
>>>> that says:
>>>>
>>>> # Construct fingerprint:
>>>> # ASE may return slightly different atomic positions (e.g. due
>>>> # to MKL) so compare only first 8 decimals of positions
>>>>
>>>>
>>>> The comment says that only 8 decimal places are used to generate the atomic "fingerprints". This code relies on numpy and therefore on lapack/blas functions. However, I have no idea what the md5_array etc. stuff really does. But there is some debug code which should at least tell you which atom(s) cause the problem.
>>> md5_array calculates the md5 sum of the data of an array.  It is a
>>> kind of checksum.
>>>
>>> Rounding unfortunately does not solve the problem.  For any epsilon,
>>> however small, there exist numbers that differ by epsilon but round
>>> to different numbers.  So the check will not work the way it is
>>> implemented at the moment: Positions that are "close enough" can
>>> currently generate an error.  In other words if you get this error,
>>> maybe there was no problem at all.  Given the vast thousands of DFT
>>> calculations that are done, this may not be so unlikely.
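
The rounding problem above can be reproduced in a few lines. This is a sketch under stated assumptions: fingerprint is a hypothetical helper that rounds to 8 decimals and takes the md5 sum of the array bytes; it is not the actual GPAW code.

```python
import hashlib

import numpy as np


def fingerprint(positions, decimals=8):
    """md5 hex digest of positions rounded to `decimals` places (hypothetical)."""
    rounded = np.round(np.asarray(positions, dtype=float), decimals)
    return hashlib.md5(rounded.tobytes()).hexdigest()


# Two values only 2e-12 apart that straddle a rounding boundary.
a = np.array([0.123456785 - 1e-12])
b = np.array([0.123456785 + 1e-12])

# Two values 4e-9 apart that happen to round to the same number.
c = np.array([0.123456780])
d = np.array([0.123456784])

print(fingerprint(a) == fingerprint(b))  # the closer pair
print(fingerprint(c) == fingerprint(d))  # the more distant pair
```

The pair that is closer together gets different fingerprints, while the pair that is further apart gets identical ones, so the rounded checksum is not a tolerance test.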
>>>
>>>> However, that error is *very* strange because mpi.broadcast(...) should result in *exactly* the same objects on all cores. No idea why there should be any difference at all and what was the intention behind the fancy fingerprint-generation stuff in the compare_atoms(atoms, comm=world) method.
>>> The check was introduced because there were (infrequent) situations
>>> where different cores had different positions, due e.g. to the finicky
>>> numerics elsewhere discussed.  Later, I guess we have accepted the
>>> numerical issues and relaxed the check so it is no longer exact,
>>> preferring instead to broadcast.  Evidently something else is
>>> happening aside from the broadcast, which allows things to go wrong.
>>> Perhaps the error in the rounding scheme mentioned above.
>>>
>>> To explain the hashing: We want to check that numbers on two different
>>> CPUs are equal.  Either we have to send all the numbers, or hash them
>>> and send the hash.  Hence hashing is much nicer.  But maybe it would
>>> be better to hash them with a continuous function.  For example adding
>>> all numbers with different (pseudorandom?) complex phase factors.
>>> Then one can compare the complex hashes and see if they are close
>>> enough to each other.  There are probably better ways.
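
A rough sketch of the continuous hash suggested above, assuming every rank builds the same pseudorandom phase factors from a shared seed. The names phase_hash and hashes_agree, the seed, and the tolerance are illustrative assumptions. Because each phase factor has unit modulus, the hashes of two arrays differ by at most the sum of the absolute differences of their entries. The converse does not hold: different arrays can collide, so agreement is only a necessary condition.

```python
import numpy as np


def phase_hash(values, seed=0):
    """Sum of the values weighted by fixed pseudorandom unit-modulus phases."""
    values = np.asarray(values, dtype=float).ravel()
    rng = np.random.default_rng(seed)  # same seed on every rank -> same phases
    phases = np.exp(2j * np.pi * rng.random(values.size))
    return complex(np.sum(values * phases))


def hashes_agree(h1, h2, tol):
    """Compare two complex hashes with an absolute tolerance."""
    return abs(h1 - h2) <= tol


x = np.linspace(0.0, 5.0, 12).reshape(4, 3)  # stand-in for atomic positions
y = x + 1e-12                                # numerical noise only
z = x.copy()
z[0, 0] += 0.1                               # one coordinate really moved

print(hashes_agree(phase_hash(x), phase_hash(y), tol=1e-9))
print(hashes_agree(phase_hash(x), phase_hash(z), tol=1e-9))
```

Only one complex number per rank has to be communicated, and a small perturbation of the positions can only produce a small change in the hash.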
>>>
>>> Best regards
>>> Ask
>>>
>>>> Best,
>>>> Torsten.
>>>>
>>>>> On 04.02.2015 at 10:00, jingzhe <jingzhe.chen at gmail.com> wrote:
>>>>>
>>>>> Hi Torsten,
>>>>>
>>>>>               Thanks for the quick reply, but I use gcc and lapack/blas. What I mean is: if the
>>>>> positions of the atoms differ slightly between ranks because of compiler/library issues,
>>>>> can we just set a tolerance in check_atoms and skip the error?
>>>>>
>>>>>               Best.
>>>>>
>>>>>               Jingzhe
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> On 2015-02-04 at 14:32, Torsten Hahn wrote:
>>>>>> Dear Jingzhe,
>>>>>>
>>>>>> We have often seen this error when using GPAW together with Intel MKL <= 11.x on Intel CPUs. I never tracked down the cause because the error was gone after a compiler/library upgrade.
>>>>>>
>>>>>> Best,
>>>>>> Torsten.
>>>>>>
>>>>>>
>>>>>> --
>>>>>> Dr. Torsten Hahn
>>>>>> torstenhahn at fastmail.fm
>>>>>>
>>>>>>> On 04.02.2015 at 07:27, jingzhe Chen <jingzhe.chen at gmail.com> wrote:
>>>>>>>
>>>>>>> Dear GPAW guys,
>>>>>>>
>>>>>>>          I used the latest GPAW to run a relaxation job and got the
>>>>>>> error message below.
>>>>>>>
>>>>>>>       RuntimeError: Atoms objects on different processors are not identical!
>>>>>>>
>>>>>>>          I found the line 'wfs.world.broadcast(self.F_av, 0)' in the force calculator,
>>>>>>> so the forces on all ranks should be the same. This confuses me: I cannot
>>>>>>> think of any other reason that could lead to this error.
>>>>>>>
>>>>>>>         Could anyone take a look at it?
>>>>>>>
>>>>>>>         I attached the structure file and running script here, I used 24 cores.
>>>>>>>
>>>>>>>         Thanks in advance.
>>>>>>>
>>>>>>>           Jingzhe
>>>>>>>
>>>>>>> <main.py><model.traj>
>>>>>>> _______________________________________________
>>>>>>> gpaw-users mailing list
>>>>>>> gpaw-users at listserv.fysik.dtu.dk
>>>>>>> https://listserv.fysik.dtu.dk/mailman/listinfo/gpaw-users