[gpaw-users] Error when relaxing atoms

Ask Hjorth Larsen asklarsen at gmail.com
Wed Feb 4 12:57:49 CET 2015


Hello

2015-02-04 10:21 GMT+01:00 Torsten Hahn <torstenhahn at fastmail.fm>:
> We could probably do this, but my feeling is that it would only cure the symptoms, not the real origin of this annoying bug.
>
>
> In fact there is code in
>
> mpi/__init__.py
>
> that says:
>
> # Construct fingerprint:
> # ASE may return slightly different atomic positions (e.g. due
> # to MKL) so compare only first 8 decimals of positions
>
>
> The code says that only 8 decimal places are used to generate the atomic "fingerprints". This code relies on numpy and therefore on lapack/blas functions. However, I have no idea what the md5_array etc. stuff really does. But there is some debug code which should at least tell you which atom(s) cause the problem.

md5_array calculates the md5 sum of the data of an array.  It is a
kind of checksum.

Rounding unfortunately does not solve the problem.  For any epsilon,
however small, there exist numbers that differ by epsilon but round
to different values.  So the check does not work the way it is
implemented at the moment: positions that are "close enough" can
still generate an error.  In other words, if you get this error,
maybe there was no problem at all.  Given the many thousands of DFT
calculations that are done, this may not be so unlikely.
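
A minimal sketch of the rounding problem, in plain NumPy.  The
fingerprint() helper below is hypothetical and only mimics the idea of
"round to 8 decimals, then take the md5 sum"; it is not the actual
GPAW code.  The two coordinates are chosen by hand to sit on either
side of a rounding boundary.

```python
import hashlib

import numpy as np


def fingerprint(positions, decimals=8):
    """MD5 checksum of positions rounded to `decimals` places.

    Hypothetical helper for illustration, not GPAW's implementation.
    """
    rounded = np.round(np.asarray(positions, dtype=float), decimals)
    return hashlib.md5(rounded.tobytes()).hexdigest()


# Two coordinates that straddle a rounding boundary at the 8th decimal.
a = np.array([0.123456785 - 1e-12])
b = np.array([0.123456785 + 1e-12])

# The coordinates differ by only about 2e-12 ...
print(abs(a - b).max())
# ... but they round to different values, so the checksums disagree.
print(fingerprint(a) == fingerprint(b))
# A direct comparison with an absolute tolerance accepts them.
print(np.allclose(a, b, rtol=0.0, atol=1e-8))
```

The checksum comparison reports a mismatch here even though the two
coordinates agree to far better than 8 decimals.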

>
> However, that error is *very* strange because mpi.broadcast(...) should result in *exactly* the same objects on all cores. No idea why there should be any difference at all and what was the intention behind the fancy fingerprint-generation stuff in the compare_atoms(atoms, comm=world) method.

The check was introduced because there were (infrequent) situations
where different cores had different positions, due e.g. to the finicky
numerics discussed elsewhere.  Later, I guess, we accepted the
numerical issues and relaxed the check so it is no longer exact,
preferring instead to broadcast.  Evidently something other than the
broadcast is happening, which allows things to go wrong.  Perhaps it
is the flaw in the rounding scheme mentioned above.

To explain the hashing: We want to check that numbers on two different
CPUs are equal.  Either we have to send all the numbers, or hash them
and send the hash.  Hashing is much nicer, since only the hash has to
be communicated.  But maybe it would be better to hash them with a
continuous function, for example by adding all the numbers with
different (pseudorandom?) complex phase factors.  Then one can compare
the complex hashes and see if they are close enough to each other.
There are probably better ways.
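
A rough sketch of such a continuous hash, again in plain NumPy.  The
names continuous_hash and hashes_close, the fixed seed and the
tolerance are all assumptions made for illustration; nothing like this
exists in GPAW as far as this thread goes.  Every rank would have to
use the same seed so that the phase factors agree.

```python
import numpy as np


def continuous_hash(values, seed=42):
    """Sum of the values weighted by pseudorandom unit phase factors.

    Hypothetical helper.  The same seed gives the same phases, so a
    small change in the input gives a small change in the hash.
    """
    x = np.asarray(values, dtype=float).ravel()
    rng = np.random.default_rng(seed)
    phases = np.exp(2j * np.pi * rng.random(x.size))
    return complex(np.sum(x * phases))


def hashes_close(h1, h2, tol):
    """True if two complex hashes differ by at most `tol`."""
    return abs(h1 - h2) <= tol


# Made-up positions for 10 atoms, for illustration only.
pos = np.random.default_rng(0).random((10, 3)) * 10.0

# Numerical noise of 1e-12 on every coordinate: hashes stay close.
noisy = pos + 1e-12
print(hashes_close(continuous_hash(pos), continuous_hash(noisy), tol=1e-9))

# One coordinate moved by 0.1: hashes differ by about 0.1.
moved = pos.copy()
moved[3, 1] += 0.1
print(hashes_close(continuous_hash(pos), continuous_hash(moved), tol=1e-9))
```

Since every phase factor has modulus one, the difference between two
hashes is at most the sum of the absolute differences of the inputs,
so noise of size delta on N numbers changes the hash by at most
N*delta.  The converse does not hold: two different sets of positions
can in principle give nearly the same hash, so this is a cheap sanity
check and not a proof of equality.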

Best regards
Ask

>
> Best,
> Torsten.
>
>> On 04.02.2015 at 10:00, jingzhe <jingzhe.chen at gmail.com> wrote:
>>
>> Hi Torsten,
>>
>>              Thanks for the quick reply, but I use gcc and lapack/blas. I mean, if the positions
>> of the atoms are slightly different on different ranks because of compiler/lib stuff,
>> can we just set a tolerance in the check_atoms and skip the error?
>>
>>              Best.
>>
>>              Jingzhe
>>
>>
>>
>>
>>
>> On 2015-02-04 14:32, Torsten Hahn wrote:
>>> Dear Jingzhe,
>>>
>>> We often saw this error when using GPAW together with Intel MKL <= 11.x on Intel CPUs. I never tracked down the cause because the error was gone after a compiler/library upgrade.
>>>
>>> Best,
>>> Torsten.
>>>
>>>
>>> --
>>> Dr. Torsten Hahn
>>> torstenhahn at fastmail.fm
>>>
>>>> On 04.02.2015 at 07:27, jingzhe Chen <jingzhe.chen at gmail.com> wrote:
>>>>
>>>> Dear GPAW guys,
>>>>
>>>>         I used the latest gpaw to run a relaxation job and got the error
>>>> message below.
>>>>
>>>>      RuntimeError: Atoms objects on different processors are not identical!
>>>>
>>>>         I found the line 'wfs.world.broadcast(self.F_av, 0)' in the force calculator,
>>>> so the forces on all ranks should be the same. This confuses me: I cannot
>>>> think of any other reason that could lead to this error.
>>>>
>>>>        Could anyone take a look at it?
>>>>
>>>>        I have attached the structure file and the run script; I used 24 cores.
>>>>
>>>>        Thanks in advance.
>>>>
>>>>          Jingzhe
>>>>
>>>> <main.py><model.traj>_______________________________________________
>>>> gpaw-users mailing list
>>>> gpaw-users at listserv.fysik.dtu.dk
>>>> https://listserv.fysik.dtu.dk/mailman/listinfo/gpaw-users
>>>
>>
>
>