[gpaw-users] Error when relaxing atoms
Marcin Dulak
Marcin.Dulak at fysik.dtu.dk
Sun Feb 8 11:33:06 CET 2015
On 02/08/2015 01:42 AM, jingzhe Chen wrote:
> Hi all,
> On one cluster this error is repeated again and again, with gpaw
> compiled against blas/lapack; even the forces are not the same after
> broadcasting. It disappeared when I tried the same script on another
> cluster (also blas/lapack).
Does the full gpaw-test pass in parallel?
How was numpy compiled on those clusters?
To provide the full information, include the output of:
ldd `which gpaw-python`
python -c "import numpy; print numpy.__config__.show(); print numpy.__version__"
In addition, check the libraries linked against numpy's _dotblas.so
(_dotblas.so is most often the source of problems) with:
ldd `python -c "from numpy.core import _dotblas; print _dotblas.__file__"`
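
As a quick parallel sanity check (just a sketch, not part of the test
suite), you can also compare a BLAS-backed dot product across ranks;
with identical input every rank should print exactly the same sum:

# check_dot.py -- run with e.g.: mpirun -np 4 gpaw-python check_dot.py
import numpy as np
from gpaw.mpi import world
np.random.seed(0)                      # identical input on every rank
a = np.random.rand(200, 200)
print world.rank, np.dot(a, a).sum()   # sums should agree to the last digit
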
Best regards,
Marcin
>
> Best.
> Jingzhe
>
> On Fri, Feb 6, 2015 at 12:41 PM, jingzhe <jingzhe.chen at gmail.com> wrote:
>
> Dear all,
>
> I ran again in debug mode. The atomic positions on different ranks
> can differ on the order of 0.01 A, and the forces on different ranks
> can even differ on the order of 1 eV/A, while each time only one
> rank behaves oddly. I have now exchanged the two lines (broadcast
> and symmetry correction) in the force calculator to see what
> happens.
>
> Best.
>
> Jingzhe
>
>
> On 2015-02-05 15:53, Jens Jørgen Mortensen wrote:
>
> On 02/04/2015 05:12 PM, Ask Hjorth Larsen wrote:
>
> I committed something in r12401 which should make the check more
> reliable. It does not use hashing because the atoms object is sent
> anyway.
>
>
> Thanks a lot for fixing this! Should there also be some
> tolerance for the unit cell?
>
> Jens Jørgen
>
> Best regards
> Ask
>
> 2015-02-04 14:47 GMT+01:00 Ask Hjorth Larsen <asklarsen at gmail.com>:
>
> Well, to clarify a bit.
>
> The hashing is useful if we don't want to send stuff around.
>
> If we are actually sending the positions now (by broadcast; I am only
> strictly aware that the forces are broadcast), then each core can
> compare locally without the need for hashing, to see if it wants to
> raise an error. (Raising errors on some cores but not all is
> sometimes annoying though.)
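>
> Concretely, each core could do something like this (a rough sketch;
> the atoms object, tolerance and names are made up):
>
> from gpaw.mpi import world
> import numpy as np
>
> pos = atoms.get_positions()          # this rank's own positions
> ref = pos.copy()
> world.broadcast(ref, 0)              # everyone receives rank 0's copy
> ok = float(np.allclose(pos, ref, atol=1e-8))
> if world.sum(ok) < world.size:       # collective, so all ranks raise together
>     raise RuntimeError('Positions differ between ranks')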
>
> Best regards
> Ask
>
> 2015-02-04 12:57 GMT+01:00 Ask Hjorth Larsen <asklarsen at gmail.com>:
>
> Hello
>
> 2015-02-04 10:21 GMT+01:00 Torsten Hahn <torstenhahn at fastmail.fm>:
>
> Probably we could do this, but my feeling is that this would only
> cure the symptoms, not the real origin of this annoying bug.
>
>
> In fact there is code in
>
> mpi/__init__.py
>
> that says:
>
> # Construct fingerprint:
> # ASE may return slightly different atomic positions (e.g. due
> # to MKL) so compare only first 8 decimals of positions
>
>
> The code says that only the first 8 decimals of the positions are
> used to generate the atomic "fingerprints". This code relies on
> numpy and therefore on lapack/blas functions. However, I have no
> idea what that md5_array etc. stuff really does. But there is some
> debug code which should at least tell you which Atom(s) cause the
> problems.
>
> md5_array calculates the md5 sum of the data of an array. It is a
> kind of checksum.
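>
> In spirit it does something like this (just an illustration, not the
> actual GPAW code):
>
> import hashlib
> import numpy as np
>
> def fingerprint(positions, decimals=8):
>     # Round to 8 decimals, then hash the raw bytes; any change in the
>     # rounded values gives a completely different digest.
>     rounded = np.round(np.asarray(positions), decimals)
>     return hashlib.md5(rounded.tobytes()).hexdigest()
>
> Each rank computes the digest of its own positions and compares it
> with the digest from rank 0; a mismatch triggers the "not identical"
> error.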
>
> Rounding unfortunately does not solve the problem. For any epsilon,
> however small, there exist numbers that differ by less than epsilon
> but still round to different values. So the check will not work the
> way it is implemented at the moment: positions that are "close
> enough" can currently generate an error. In other words, if you get
> this error, maybe there was no problem at all. Given the vast
> thousands of DFT calculations that are done, this may not be so
> unlikely.
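>
> A tiny illustration of the boundary problem (the numbers are made
> up):
>
> import numpy as np
> x_rank0 = 0.123456784999    # two coordinates that differ by only 2e-12 ...
> x_rank1 = 0.123456785001
> print np.round(x_rank0, 8)  # 0.12345678
> print np.round(x_rank1, 8)  # 0.12345679 -> different fingerprints, spurious error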
>
> However, that error is *very* strange because mpi.broadcast(...)
> should result in *exactly* the same objects on all cores. I have no
> idea why there should be any difference at all, or what the
> intention was behind the fancy fingerprint-generation stuff in the
> compare_atoms(atoms, comm=world) method.
>
> The check was introduced because there were
> (infrequent) situations
> where different cores had different positions, due
> e.g. to the finicky
> numerics elsewhere discussed. Later, I guess we
> have accepted the
> numerical issues and relaxed the check so it is no
> longer exact,
> preferring instead to broadcast. Evidently
> something else is
> happening aside from the broadcast, which allows
> things to go wrong.
> Perhaps the error in the rounding scheme mentioned
> above.
>
> To explain the hashing: We want to check that
> numbers on two different
> CPUs are equal. Either we have to send all the
> numbers, or hash them
> and send the hash. Hence hashing is much nicer.
> But maybe it would
> be better to hash them with a continuous
> function. For example adding
> all numbers with different (pseudorandom?) complex
> phase factors.
> Then one can compare the complex hashes and see if
> they are close
> enough to each other. There are probably better ways.
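>
> As a sketch of that idea (names, seed and tolerance are made up):
>
> import numpy as np
>
> def continuous_hash(positions, seed=42):
>     # One fixed pseudorandom phase per coordinate; nearby inputs give
>     # nearby complex hashes, so ranks can compare with a tolerance
>     # instead of demanding an exact md5 match.
>     x = np.asarray(positions).ravel()
>     phases = np.exp(2j * np.pi * np.random.RandomState(seed).rand(x.size))
>     return np.dot(x, phases)
>
> Two ranks would then agree if abs(h_local - h_rank0) stays below a
> small tolerance, instead of requiring bit-identical positions.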
>
> Best regards
> Ask
>
> Best,
> Torsten.
>
> On 04.02.2015 at 10:00, jingzhe <jingzhe.chen at gmail.com> wrote:
>
> Hi Torsten,
>
> Thanks for the quick reply, but I use gcc and lapack/blas. What I
> mean is: if the positions of the atoms are slightly different on
> different ranks because of compiler/library issues, can we just set
> a tolerance in check_atoms and skip the error?
>
> Best.
>
> Jingzhe
>
> On 2015-02-04 14:32, Torsten Hahn wrote:
>
> Dear Jingzhe,
>
> we often saw this error when using GPAW together with Intel MKL
> <= 11.x on Intel CPUs. I never tracked down the error because it
> was gone after a compiler/library upgrade.
>
> Best,
> Torsten.
>
>
> --
> Dr. Torsten Hahn
> torstenhahn at fastmail.fm
>
> On 04.02.2015 at 07:27, jingzhe Chen <jingzhe.chen at gmail.com> wrote:
>
> Dear GPAW guys,
>
> I used the latest gpaw to run a relaxation job, and got the error
> message below:
>
> RuntimeError: Atoms objects on different processors are not
> identical!
>
> I found a line in the force calculator,
> 'wfs.world.broadcast(self.F_av, 0)', so all the forces on different
> ranks should be the same, which confuses me; I cannot think of any
> other reason that could lead to this error.
>
> Could anyone take a look at it?
>
> I attached the structure file and the running script here; I used
> 24 cores.
>
> Thanks in advance.
>
> Jingzhe
>
> _______________________________________________
> gpaw-users mailing list
> gpaw-users at listserv.fysik.dtu.dk
> https://listserv.fysik.dtu.dk/mailman/listinfo/gpaw-users