[gpaw-users] Error when relaxing atoms

Marcin Dulak Marcin.Dulak at fysik.dtu.dk
Sun Feb 8 11:33:06 CET 2015


On 02/08/2015 01:42 AM, jingzhe Chen wrote:
> Hi all,
>       On one cluster this error occurs again and again (gpaw is
> compiled with blas/lapack); even the forces are not the same after
> broadcasting.  Yet the error disappears when I run the same script on
> another cluster (also blas/lapack).
does the full gpaw-test suite pass in parallel?
How was numpy compiled on those clusters?
To provide the full information, run:

ldd `which gpaw-python`

python -c "import numpy; print numpy.__config__.show(); print  numpy.__version__"

In addition, check the libraries linked by numpy's _dotblas.so
(_dotblas.so is most often the source of problems) with:
ldd `python -c "from numpy.core import _dotblas; print _dotblas.__file__"`

Best regards,

Marcin
>
>       Best.
>       Jingzhe
>
> On Fri, Feb 6, 2015 at 12:41 PM, jingzhe <jingzhe.chen at gmail.com 
> <mailto:jingzhe.chen at gmail.com>> wrote:
>
>     Dear all,
>
>                  I ran it again in debug mode; the atomic positions
>     on different ranks can differ on the order of 0.01 A, and the
>     forces on different ranks can even differ on the order of
>     1 eV/A, while each time only one rank behaves oddly.  I have now
>     exchanged the two lines (broadcast and symmetry correction) in
>     the force calculator to see what happens.
>
>                 Best.
>
>                  Jingzhe
>
>
>     On 2015-02-05 15:53, Jens Jørgen Mortensen wrote:
>
>         On 02/04/2015 05:12 PM, Ask Hjorth Larsen wrote:
>
>             I committed something in r12401 which should make the
>             check more
>             reliable.  It does not use hashing because the atoms
>             object is sent
>             anyway.
>
>
>         Thanks a lot for fixing this!  Should there also be some
>         tolerance for the unit cell?
>
>         Jens Jørgen
>
>             Best regards
>             Ask
>
>             2015-02-04 14:47 GMT+01:00 Ask Hjorth Larsen
>             <asklarsen at gmail.com <mailto:asklarsen at gmail.com>>:
>
>                 Well, to clarify a bit.
>
>                 The hashing is useful if we don't want to send stuff
>                 around.
>
>                 If we are actually sending the positions now (by
>                 broadcast; I am only
>                 strictly aware that the forces are broadcast), then
>                 each core can
>                 compare locally without the need for hashing, to see
>                 if it wants to
>                 raise an error.  (Raising errors on some cores but not
>                 all is
>                 sometimes annoying though.)
>
>                 Best regards
>                 Ask
>
>                 2015-02-04 12:57 GMT+01:00 Ask Hjorth Larsen
>                 <asklarsen at gmail.com <mailto:asklarsen at gmail.com>>:
>
>                     Hello
>
>                     2015-02-04 10:21 GMT+01:00 Torsten Hahn
>                     <torstenhahn at fastmail.fm
>                     <mailto:torstenhahn at fastmail.fm>>:
>
>                         Probably we could do this, but my feeling is
>                         that this would only cure the symptoms, not
>                         the real origin of this annoying bug.
>
>
>                         In fact there is code in
>
>                         mpi/__init__.py
>
>                         that says:
>
>                         # Construct fingerprint:
>                         # ASE may return slightly different atomic
>                         positions (e.g. due
>                         # to MKL) so compare only first 8 decimals of
>                         positions
>
>
>                         The comment says that only the first 8
>                         decimals of the positions are used to
>                         generate the atomic "fingerprints".  This
>                         code relies on numpy and therefore on
>                         lapack/blas functions.  However, I have no
>                         idea what that md5_array etc. stuff really
>                         does.  But there is some debug code which
>                         should at least tell you which Atom(s)
>                         cause the problems.
>
>                     md5_array calculates the md5 sum of the data of an
>                     array. It is a
>                     kind of checksum.
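For readers unfamiliar with the idea, an array checksum of this kind can be sketched in a few lines. This is my own illustration, not GPAW's actual md5_array; the variable names are invented:

```python
import hashlib

import numpy as np

def md5_array(a):
    # Hash the raw bytes of the array; any bit-level difference
    # in the data yields a completely different digest.
    return hashlib.md5(np.ascontiguousarray(a).tobytes()).hexdigest()

positions = np.array([[0.0, 0.0, 0.0], [1.1, 0.0, 0.0]])
perturbed = positions.copy()
perturbed[1, 0] += 1e-12  # a tiny numerical difference changes the checksum
```

Such a digest can only test exact equality of the data, which is why the positions are rounded before hashing.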
>
>                     Rounding unfortunately does not solve the
>                     problem.  For any epsilon, however small, there
>                     exist numbers that differ by epsilon but round
>                     to different values.  So the check will not work
>                     the way it is implemented at the moment:
>                     positions that are "close enough" can currently
>                     still generate an error.  In other words, if you
>                     get this error, maybe there was no problem at
>                     all.  Given the vast thousands of DFT
>                     calculations that are done, this may not be so
>                     unlikely.
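The boundary effect Ask describes can be reproduced with plain Python rounding; the numbers below are illustrative, not taken from an actual GPAW run:

```python
# Two numbers that differ by only 2e-12 ...
a = 1.4999e-8
b = 1.5001e-8
# ... yet straddle the rounding boundary at the 8th decimal,
# so they round to different values (1e-08 vs 2e-08) and the
# rounded "fingerprints" disagree even though a and b are close.
ra = round(a, 8)
rb = round(b, 8)
```

So two ranks whose positions agree to 12 decimals could still produce different 8-decimal fingerprints and trigger the error.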
>
>                         However, that error is *very* strange because
>                         mpi.broadcast(...) should result in *exactly*
>                         the same objects on all cores. No idea why
>                         there should be any difference at all and what
>                         was the intention behind the fancy
>                         fingerprint-generation stuff in the
>                         compare_atoms(atoms, comm=world) method.
>
>                     The check was introduced because there were
>                     (infrequent) situations
>                     where different cores had different positions, due
>                     e.g. to the finicky
>                     numerics elsewhere discussed.  Later, I guess we
>                     have accepted the
>                     numerical issues and relaxed the check so it is no
>                     longer exact,
>                     preferring instead to broadcast.  Evidently
>                     something else is
>                     happening aside from the broadcast, which allows
>                     things to go wrong.
>                     Perhaps the error in the rounding scheme mentioned
>                     above.
>
>                     To explain the hashing: We want to check that
>                     numbers on two different
>                     CPUs are equal.  Either we have to send all the
>                     numbers, or hash them
>                     and send the hash.  Hence hashing is much nicer. 
>                     But maybe it would
>                     be better to hash them with a continuous
>                     function.  For example adding
>                     all numbers with different (pseudorandom?) complex
>                     phase factors.
>                     Then one can compare the complex hashes and see if
>                     they are close
>                     enough to each other.  There are probably better ways.
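A toy version of this continuous-hash suggestion might look like the following (my own sketch; phase_hash and the tolerance are invented for illustration):

```python
import numpy as np

def phase_hash(a, seed=42):
    # Project the flattened array onto fixed pseudorandom complex
    # phase factors.  Unlike an md5 digest, this map is continuous:
    # nearby arrays give nearby hashes, so two ranks could compare
    # hashes with a tolerance instead of requiring exact equality.
    rng = np.random.default_rng(seed)
    phases = np.exp(1j * rng.uniform(0.0, 2.0 * np.pi, a.size))
    return complex(np.dot(np.ravel(a), phases))

positions = np.linspace(0.0, 1.0, 100)
noisy = positions + 1e-10  # the same data after slightly different numerics
# |phase_hash(positions) - phase_hash(noisy)| <= 100 * 1e-10,
# so a tolerance of, say, 1e-7 accepts the two as equal.
```

The price is that, unlike md5, such a hash is not collision-resistant: very different arrays can in principle produce nearby hashes, which is acceptable here since we only guard against numerical drift.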
>
>                     Best regards
>                     Ask
>
>                         Best,
>                         Torsten.
>
>                             On 04.02.2015 at 10:00, jingzhe
>                             <jingzhe.chen at gmail.com
>                             <mailto:jingzhe.chen at gmail.com>> wrote:
>
>                             Hi Torsten,
>
>                                           Thanks for the quick
>                             reply, but I use gcc and lapack/blas.
>                             I mean: if the positions of the atoms
>                             are slightly different on different
>                             ranks because of compiler/library
>                             issues, can we just set a tolerance in
>                             check_atoms and skip the error?
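A tolerance-based comparison of the kind Jingzhe suggests could be sketched like this (illustrative names and tolerance; not GPAW's actual check_atoms):

```python
import numpy as np

def check_atoms(local_positions, master_positions, tol=1e-8):
    # Compare this rank's positions against the broadcast master copy.
    # Unlike rounding-based fingerprints, np.allclose has no boundary
    # problem: any difference below tol passes.  (tol in Angstrom,
    # an illustrative value.)
    if not np.allclose(local_positions, master_positions, rtol=0.0, atol=tol):
        raise RuntimeError(
            'Atoms objects on different processors are not identical!')

local = np.array([[0.0, 0.0, 0.0], [1.1, 0.0, 0.0]])
master = local + 1e-12  # tiny numerical noise: accepted
check_atoms(local, master)
```

The drawback, as noted elsewhere in the thread, is that raising on some ranks but not others can itself cause trouble, so the master copy should be broadcast first and the comparison done identically everywhere.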
>
>                                           Best.
>
>                                           Jingzhe
>
>
>
>
>
>                             On 2015-02-04 14:32, Torsten Hahn wrote:
>
>                                 Dear Jingzhe,
>
>                                 we often saw this error when using
>                                 GPAW together with Intel MKL <=
>                                 11.x on Intel CPUs.  I never
>                                 tracked down the error because it
>                                 was gone after a compiler/library
>                                 upgrade.
>
>                                 Best,
>                                 Torsten.
>
>
>                                 -- 
>                                 Dr. Torsten Hahn
>                                 torstenhahn at fastmail.fm
>                                 <mailto:torstenhahn at fastmail.fm>
>
>                                     On 04.02.2015 at 07:27,
>                                     jingzhe Chen
>                                     <jingzhe.chen at gmail.com
>                                     <mailto:jingzhe.chen at gmail.com>> wrote:
>
>                                     Dear GPAW guys,
>
>                                              I used the latest gpaw
>                                     to run a relaxation job and got
>                                     the error message below:
>
>                                           RuntimeError: Atoms objects
>                                     on different processors are not
>                                     identical!
>
>                                              I found a line in the
>                                     force calculator,
>                                     'wfs.world.broadcast(self.F_av, 0)',
>                                     so the forces on all ranks
>                                     should be the same, which
>                                     confuses me; I cannot think of
>                                     any other reason that could
>                                     lead to this error.
>
>                                             Could anyone take a look
>                                     at it?
>
>                                             I attached the structure
>                                     file and running script here, I
>                                     used 24 cores.
>
>                                             Thanks in advance.
>
>                                               Jingzhe
>
>                                     <main.py><model.traj>
>                                     _______________________________________________
>
>                                     gpaw-users mailing list
>                                     gpaw-users at listserv.fysik.dtu.dk
>                                     <mailto:gpaw-users at listserv.fysik.dtu.dk>
>                                     https://listserv.fysik.dtu.dk/mailman/listinfo/gpaw-users
>
>
>
>
>
>
>
>
>
>
