[gpaw-users] Error when relaxing atoms
Ask Hjorth Larsen
asklarsen at gmail.com
Wed Feb 4 17:12:14 CET 2015
I committed something in r12401 which should make the check more
reliable. It does not use hashing because the atoms object is sent
anyway.
Best regards
Ask
2015-02-04 14:47 GMT+01:00 Ask Hjorth Larsen <asklarsen at gmail.com>:
> Well, to clarify a bit.
>
> The hashing is useful if we don't want to send stuff around.
>
> If we are actually sending the positions now (by broadcast; I am only
> strictly aware that the forces are broadcast), then each core can
> compare locally without the need for hashing, to see if it wants to
> raise an error. (Raising errors on some cores but not all is
> sometimes annoying though.)
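
A minimal sketch of the local comparison described above, assuming the reference positions have already been broadcast from rank 0. It does not use MPI: the ranks are simulated with a plain list, and the names `max_deviation`, `check_positions_on_all_ranks` and `tol` are hypothetical, not GPAW functions. Taking the largest deviation over all ranks before deciding stands in for a max-reduction over the communicator, so that either every rank raises or none does.

```python
import numpy as np


def max_deviation(local_positions, reference_positions):
    """Largest absolute difference between local and reference positions."""
    diff = np.asarray(local_positions) - np.asarray(reference_positions)
    return float(np.abs(diff).max())


def check_positions_on_all_ranks(positions_per_rank, tol=1e-8):
    """Simulated check: positions_per_rank[0] plays the broadcast reference.

    In a real parallel run the maximum below would be a max-reduction over
    the communicator, so that all ranks see the same number and agree on
    whether to raise.
    """
    reference = positions_per_rank[0]
    worst = max(max_deviation(p, reference) for p in positions_per_rank)
    if worst > tol:
        raise RuntimeError(
            'Positions differ between ranks by %.3e (tol=%.1e)' % (worst, tol))
    return worst


reference = np.zeros((3, 3))
nearly_equal = reference.copy()
nearly_equal[0, 0] = 1e-12  # numerical noise, well below the tolerance
print(check_positions_on_all_ranks([reference, nearly_equal]))
```

A tolerance-based comparison like this accepts positions that are merely close, which an exact comparison of hashes cannot do.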
>
> Best regards
> Ask
>
> 2015-02-04 12:57 GMT+01:00 Ask Hjorth Larsen <asklarsen at gmail.com>:
>> Hello
>>
>> 2015-02-04 10:21 GMT+01:00 Torsten Hahn <torstenhahn at fastmail.fm>:
>>> We could probably do this, but my feeling is that it would only cure the symptoms, not the real origin of this annoying bug.
>>>
>>>
>>> In fact there is code in
>>>
>>> mpi/__init__.py
>>>
>>> that says:
>>>
>>> # Construct fingerprint:
>>> # ASE may return slightly different atomic positions (e.g. due
>>> # to MKL) so compare only first 8 decimals of positions
>>>
>>>
>>> The code says that only the first 8 decimals of the positions are used to generate the atomic „fingerprints“. This code relies on numpy and therefore on lapack/blas functions. However, I have no idea what that md5_array etc. stuff really does. But there is some debug code which should at least tell you which atom(s) cause the problems.
>>
>> md5_array calculates the md5 sum of the data of an array. It is a
>> kind of checksum.
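
As an illustration of what "the md5 sum of the data of an array" means, here is a small sketch using only `hashlib` and numpy. The helper name `md5_of_array` is hypothetical and this is not GPAW's actual md5_array implementation; it only shows the idea that the digest depends on every bit of the array.

```python
import hashlib

import numpy as np


def md5_of_array(a):
    """Hex MD5 digest of the raw bytes of an array (illustrative helper)."""
    a = np.ascontiguousarray(a)
    return hashlib.md5(a.tobytes()).hexdigest()


positions = np.array([[0.0, 0.0, 0.0],
                      [0.0, 0.0, 1.1]])
identical = positions.copy()

# Change one coordinate to the next representable float above 1.1.
perturbed = positions.copy()
perturbed[1, 2] = np.nextafter(1.1, 2.0)

print(md5_of_array(positions) == md5_of_array(identical))  # same bytes
print(md5_of_array(positions) == md5_of_array(perturbed))  # one bit differs
```

Because the digest changes completely for any change in the bytes, two digests can only be tested for equality, never for closeness.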
>>
>> Rounding unfortunately does not solve the problem. For any epsilon,
>> however small, there exist numbers that differ by epsilon but round
>> to different numbers. So the check will not work the way it is
>> implemented at the moment: positions that are "close enough" can
>> currently generate an error. In other words, if you get this error,
>> maybe there was no problem at all. Given the many thousands of DFT
>> calculations that are done, this may not be so unlikely.
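
The rounding argument can be checked with a small numerical example. The values below are made up for illustration: two numbers that differ by about 2e-12 but sit on opposite sides of a rounding boundary at 8 decimals, so any fingerprint built from the rounded values would differ.

```python
import numpy as np

decimals = 8
boundary = 0.123456785  # halfway between two neighbouring 8-decimal values
eps = 1e-12

a = boundary - eps
b = boundary + eps

rounded_a = np.round(a, decimals)
rounded_b = np.round(b, decimals)

print(abs(a - b))              # tiny difference between the inputs
print(rounded_a, rounded_b)    # the rounded values
print(rounded_a == rounded_b)  # False: they land on different sides
```

Making the number of decimals smaller only moves the boundaries; it does not remove them.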
>>
>>>
>>> However, that error is *very* strange because mpi.broadcast(...) should result in *exactly* the same objects on all cores. I have no idea why there should be any difference at all, or what the intention was behind the fancy fingerprint-generation stuff in the compare_atoms(atoms, comm=world) method.
>>
>> The check was introduced because there were (infrequent) situations
>> where different cores had different positions, due e.g. to the finicky
>> numerics elsewhere discussed. Later, I guess we have accepted the
>> numerical issues and relaxed the check so it is no longer exact,
>> preferring instead to broadcast. Evidently something else is
>> happening aside from the broadcast, which allows things to go wrong.
>> Perhaps the error in the rounding scheme mentioned above.
>>
>> To explain the hashing: We want to check that numbers on two different
>> CPUs are equal. Either we have to send all the numbers, or hash them
>> and send the hash. Hence hashing is much nicer. But maybe it would
>> be better to hash them with a continuous function. For example adding
>> all numbers with different (pseudorandom?) complex phase factors.
>> Then one can compare the complex hashes and see if they are close
>> enough to each other. There are probably better ways.
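
One possible reading of the "continuous hash" idea above, as a sketch only: every number is multiplied by a fixed pseudorandom complex phase and the products are summed. The names `phase_checksum`, `checksums_agree`, `seed` and `tol` are hypothetical. The assumption is that all ranks use the same seed, so they generate the same phases. Since each phase has modulus 1, the checksum changes by at most the sum of the absolute changes in the inputs, which is what makes it continuous.

```python
import numpy as np


def phase_checksum(values, seed=42):
    """Sum of the values weighted by fixed pseudorandom complex phases."""
    values = np.asarray(values, dtype=float).ravel()
    rng = np.random.default_rng(seed)  # same seed on all ranks -> same phases
    phases = np.exp(2j * np.pi * rng.random(values.size))
    return complex(np.sum(values * phases))


def checksums_agree(c1, c2, tol=1e-8):
    """Compare two complex checksums up to an absolute tolerance."""
    return abs(c1 - c2) <= tol


positions = 0.37 * np.arange(12, dtype=float).reshape(4, 3)

close = positions.copy()
close[2, 1] += 1e-12  # numerical noise

far = positions.copy()
far[2, 1] += 1e-3     # a genuinely different position

print(checksums_agree(phase_checksum(positions), phase_checksum(close)))
print(checksums_agree(phase_checksum(positions), phase_checksum(far)))
```

Such a checksum is not collision-free: different sets of positions can give nearly the same sum, so agreement is only evidence of equality, not proof.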
>>
>> Best regards
>> Ask
>>
>>>
>>> Best,
>>> Torsten.
>>>
>>>> On 04.02.2015 at 10:00, jingzhe <jingzhe.chen at gmail.com> wrote:
>>>>
>>>> Hi Torsten,
>>>>
>>>> Thanks for the quick reply, but I use gcc and lapack/blas. What I mean is: if the positions
>>>> of the atoms are slightly different on different ranks because of compiler/library issues,
>>>> can we just set a tolerance in check_atoms and skip the error?
>>>>
>>>> Best.
>>>>
>>>> Jingzhe
>>>>
>>>>
>>>>
>>>>
>>>>
>>>> On 2015-02-04 14:32, Torsten Hahn wrote:
>>>>> Dear Jingzhe,
>>>>>
>>>>> we often saw this error when using GPAW together with Intel MKL <= 11.x on Intel CPUs. I never tracked down the cause because the error was gone after a compiler/library upgrade.
>>>>>
>>>>> Best,
>>>>> Torsten.
>>>>>
>>>>>
>>>>> --
>>>>> Dr. Torsten Hahn
>>>>> torstenhahn at fastmail.fm
>>>>>
>>>>>> On 04.02.2015 at 07:27, jingzhe Chen <jingzhe.chen at gmail.com> wrote:
>>>>>>
>>>>>> Dear GPAW guys,
>>>>>>
>>>>>> I used the latest gpaw to run a relaxation job and got the error
>>>>>> message below.
>>>>>>
>>>>>> RuntimeError: Atoms objects on different processors are not identical!
>>>>>>
>>>>>> I found the line 'wfs.world.broadcast(self.F_av, 0)' in the force calculator,
>>>>>> so the forces on all ranks should be the same, which
>>>>>> confuses me. I cannot think of any other reason that could lead to this error.
>>>>>>
>>>>>> Could anyone take a look at it?
>>>>>>
>>>>>> I have attached the structure file and the run script; I used 24 cores.
>>>>>>
>>>>>> Thanks in advance.
>>>>>>
>>>>>> Jingzhe
>>>>>>
>>>>>> <main.py><model.traj>_______________________________________________
>>>>>> gpaw-users mailing list
>>>>>> gpaw-users at listserv.fysik.dtu.dk
>>>>>> https://listserv.fysik.dtu.dk/mailman/listinfo/gpaw-users
>>>>>
>>>>
>>>
>>>