[gpaw-users] Error when relaxing atoms

jingzhe jingzhe.chen at gmail.com
Mon Feb 9 05:29:59 CET 2015


Hi Marcin,

              My bad, I did not try gpaw-test in parallel, only relax.py and
transport.py.  The parallel gpaw-test failed with the following message:


--------------------------------------------------------------------------
An MPI process has executed an operation involving a call to the
"fork()" system call to create a child process.  Open MPI is currently
operating in a condition that could result in memory corruption or
other system errors; your MPI job may hang, crash, or produce silent
data corruption.  The use of fork() (or system() or other calls that
create child processes) is strongly discouraged.

The process that invoked fork was:

   Local host:          ip03 (PID 3374)
   MPI_COMM_WORLD rank: 0

If you are *absolutely sure* that your application will successfully
and correctly survive a call to fork(), you may disable this warning
by setting the mpi_warn_on_fork MCA parameter to 0.
--------------------------------------------------------------------------
python 2.6.6 GCC 4.4.7 20120313 (Red Hat 4.4.7-4) 64bit ELF on Linux 
x86_64 centos 6.5 Final
Running tests in /ltmp/chenjing/gpaw-test-L_VbEM
Jobs: 1, Cores: 4, debug-mode: False
=============================================================================
gemm_complex.py                         0.027  OK
ase3k_version.py                        0.022  OK
kpt.py                                  0.030  OK
mpicomm.py                              0.022  OK
numpy_core_multiarray_dot.py            0.021  OK
maxrss.py                               0.000  SKIPPED
fileio/hdf5_noncontiguous.py            0.002  SKIPPED
cg2.py                                  0.024  OK
laplace.py                              0.023  OK
lapack.py                               0.023  OK
eigh.py                                 0.022  OK
parallel/submatrix_redist.py            0.000  SKIPPED
second_derivative.py                    0.035  OK
parallel/parallel_eigh.py               0.022  OK
gp2.py                                  0.023  OK
blas.py                                 0.164  OK
Gauss.py                                0.045  OK
nabla.py                                0.140  OK
dot.py                                  0.030  OK
mmm.py                                  0.028  OK
lxc_fxc.py                              0.030  OK
pbe_pw91.py                             0.029  OK
gradient.py                             0.033  OK
erf.py                                  0.028  OK
lf.py                                   0.033  OK
fsbt.py                                 0.034  OK
parallel/compare.py                     0.031  OK
integral4.py                            0.069  OK
zher.py                                 0.149  OK
gd.py                                   0.032  OK
pw/interpol.py                          0.025  OK
screened_poisson.py                     0.461  OK
xc.py                                   0.064  OK
XC2.py                                  2.548  OK
yukawa_radial.py                        0.024  OK
dump_chi0.py                            0.045  OK
vdw/potential.py                        0.026  OK
lebedev.py                              0.053  OK
fileio/hdf5_simple.py                   0.002  SKIPPED
occupations.py                          0.080  OK
derivatives.py                          0.034  OK
parallel/realspace_blacs.py             0.027  OK
pw/reallfc.py                      [ip03:03367] 3 more processes have 
sent help message help-mpi-runtime.txt / mpi_init:warn-fork
[ip03:03367] Set MCA parameter "orte_base_help_aggregate" to 0 to see 
all help / error messages
      0.357  OK
parallel/pblas.py                       0.048  OK
non_periodic.py                         0.064  OK
spectrum.py                             0.019  SKIPPED
pw/lfc.py                               0.273  OK
gauss_func.py                           1.032  OK
multipoletest.py                        0.516  OK
noncollinear/xcgrid3d.py                6.207  OK
cluster.py                              0.228  OK
poisson.py                              0.095  OK
parallel/overlap.py                     2.293  OK
parallel/scalapack.py                   0.036  OK
gauss_wave.py                           0.650  OK
transformations.py                      0.047  OK
parallel/blacsdist.py                   0.033  OK
ut_rsh.py                               2.098  OK
pbc.py                                  0.822  OK
noncollinear/xccorr.py                  0.587  OK
atoms_too_close.py                      1.043  OK
harmonic.py                            40.344  OK
proton.py                               5.189  OK
atoms_mismatch.py                       0.051  OK
timing.py                               0.935  OK
parallel/ut_parallel.py                 1.098  OK
ut_csh.py                          Test failed. Check ut_csh.log for 
details.
Test failed. Check ut_csh.log for details.
Test failed. Check ut_csh.log for details.
Test failed. Check ut_csh.log for details.
--------------------------------------------------------------------------
mpirun has exited due to process rank 2 with PID 3376 on
node ip03 exiting without calling "finalize". This may
have caused other processes in the application to be
terminated by signals sent by mpirun (as reported here).
--------------------------------------------------------------------------

           Numpy was compiled on the cluster; I did not build it myself.

for

ldd `which gpaw-python`

I got
    
  	linux-vdso.so.1 =>  (0x00007fff2edff000)
	libgfortran.so.3 => /home/chenjing/programs/MatlabR2011A/sys/os/glnxa64/libgfortran.so.3 (0x00002b41947c7000)
	libxc.so.1 => /home/chenjing/Installation/libxc/lib/libxc.so.1 (0x00002b4194a9f000)
	libpython2.6.so.1.0 => /usr/lib64/libpython2.6.so.1.0 (0x000000343ee00000)
	libpthread.so.0 => /lib64/libpthread.so.0 (0x000000343e600000)
	libdl.so.2 => /lib64/libdl.so.2 (0x000000343e200000)
	libutil.so.1 => /lib64/libutil.so.1 (0x0000003440e00000)
	libm.so.6 => /lib64/libm.so.6 (0x000000343ea00000)
	libmpi.so.0 => /home/chenjing/openmpi-1.4.5/lib/libmpi.so.0 (0x00002b4194d52000)
	libopen-rte.so.0 => /home/chenjing/openmpi-1.4.5/lib/libopen-rte.so.0 (0x00002b4195165000)
	libopen-pal.so.0 => /home/chenjing/openmpi-1.4.5/lib/libopen-pal.so.0 (0x00002b41953ed000)
	librdmacm.so.1 => /usr/lib64/librdmacm.so.1 (0x000000343f600000)
	libibverbs.so.1 => /usr/lib64/libibverbs.so.1 (0x000000343f200000)
	libtorque.so.2 => /opt/torque/lib/libtorque.so.2 (0x00002b419564f000)
	libnuma.so.1 => /usr/lib64/libnuma.so.1 (0x00002b4195953000)
	libnsl.so.1 => /lib64/libnsl.so.1 (0x00002b4195b5c000)
	libc.so.6 => /lib64/libc.so.6 (0x000000343de00000)
	libgcc_s.so.1 => /home/chenjing/programs/MatlabR2011A/sys/os/glnxa64/libgcc_s.so.1 (0x00002b4195d76000)
	/lib64/ld-linux-x86-64.so.2 (0x000000343da00000)
	libnl.so.1 => /lib64/libnl.so.1 (0x00002b4195f8c000)

for
python -c "import numpy; print numpy.__config__.show(); print  numpy.__version__"

I got

atlas_threads_info:
     libraries = ['lapack', 'ptf77blas', 'ptcblas', 'atlas']
     library_dirs = ['/usr/lib64/atlas']
     language = f77
     include_dirs = ['/usr/include']

blas_opt_info:
     libraries = ['ptf77blas', 'ptcblas', 'atlas']
     library_dirs = ['/usr/lib64/atlas']
     define_macros = [('ATLAS_INFO', '"\\"3.8.4\\""')]
     language = c
     include_dirs = ['/usr/include']

atlas_blas_threads_info:
     libraries = ['ptf77blas', 'ptcblas', 'atlas']
     library_dirs = ['/usr/lib64/atlas']
     language = c
     include_dirs = ['/usr/include']

lapack_opt_info:
     libraries = ['lapack', 'ptf77blas', 'ptcblas', 'atlas']
     library_dirs = ['/usr/lib64/atlas']
     define_macros = [('ATLAS_INFO', '"\\"3.8.4\\""')]
     language = f77
     include_dirs = ['/usr/include']

lapack_mkl_info:
   NOT AVAILABLE

blas_mkl_info:
   NOT AVAILABLE

mkl_info:
   NOT AVAILABLE

None
1.4.1

and for ldd `python -c "from numpy.core import _dotblas; print 
_dotblas.__file__"`

I got
     linux-vdso.so.1 =>  (0x00007fff2afff000)
     libptf77blas.so.3 => /usr/lib64/atlas/libptf77blas.so.3 
(0x00002b104af64000)
     libptcblas.so.3 => /usr/lib64/atlas/libptcblas.so.3 
(0x00002b104b184000)
     libatlas.so.3 => /usr/lib64/atlas/libatlas.so.3 (0x00002b104b3a4000)
     libpython2.6.so.1.0 => /usr/lib64/libpython2.6.so.1.0 
(0x00002b104ba00000)
     libpthread.so.0 => /lib64/libpthread.so.0 (0x00002b104bda7000)
     libc.so.6 => /lib64/libc.so.6 (0x00002b104bfc4000)
     libgfortran.so.3 => 
/home/chenjing/programs/MatlabR2011A/sys/os/glnxa64/libgfortran.so.3 
(0x00002b104c358000)
     libm.so.6 => /lib64/libm.so.6 (0x00002b104c631000)
     libdl.so.2 => /lib64/libdl.so.2 (0x00002b104c8b5000)
     libutil.so.1 => /lib64/libutil.so.1 (0x00002b104cab9000)
     /lib64/ld-linux-x86-64.so.2 (0x000000343da00000)


Best.
Jingzhe



On 2015-02-08 18:33, Marcin Dulak wrote:
> On 02/08/2015 01:42 AM, jingzhe Chen wrote:
>> Hi all,
>>       On one cluster this error occurs again and again, where gpaw is
>> compiled with blas/lapack; even the forces are not the same after
>> broadcasting.  The error disappears when I run the same script on another
>> cluster (also blas/lapack).
> does the full gpaw-test pass in parallel?
> How was numpy compiled on those clusters?
> To have the full information, provide:
> ldd `which gpaw-python`
> python -c "import numpy; print numpy.__config__.show(); print  numpy.__version__"
> In addition, check the libraries linked by numpy's _dotblas.so 
> (_dotblas.so is most often the source of problems) with:
> ldd `python -c "from numpy.core import _dotblas; print _dotblas.__file__"`
>
> Best regards,
>
> Marcin
>>
>>       Best.
>>       Jingzhe
>>
>> On Fri, Feb 6, 2015 at 12:41 PM, jingzhe <jingzhe.chen at gmail.com 
>> <mailto:jingzhe.chen at gmail.com>> wrote:
>>
>>     Dear all,
>>
>>                  I ran again in debug mode; the atomic positions on
>>     different ranks can differ on the order of 0.01 Å, and even the
>>     forces on different ranks can differ on the order of 1 eV/Å, while
>>     every time only one rank behaves oddly.  I have now exchanged the
>>     two lines (broadcast and symmetry correction) in the force
>>     calculator to see what happens.
>>
>>                 Best.
>>
>>                  Jingzhe
>>
>>
>>         On 2015-02-05 15:53, Jens Jørgen Mortensen wrote:
>>
>>         On 02/04/2015 05:12 PM, Ask Hjorth Larsen wrote:
>>
>>             I committed something in r12401 which should make the
>>             check more
>>             reliable.  It does not use hashing because the atoms
>>             object is sent
>>             anyway.
>>
>>
>>         Thanks a lot for fixing this!  Should there also be some
>>         tolerance for the unit cell?
>>
>>         Jens Jørgen
>>
>>             Best regards
>>             Ask
>>
>>             2015-02-04 14:47 GMT+01:00 Ask Hjorth Larsen
>>             <asklarsen at gmail.com <mailto:asklarsen at gmail.com>>:
>>
>>                 Well, to clarify a bit.
>>
>>                 The hashing is useful if we don't want to send stuff
>>                 around.
>>
>>                 If we are actually sending the positions now (by
>>                 broadcast; I am only
>>                 strictly aware that the forces are broadcast), then
>>                 each core can
>>                 compare locally without the need for hashing, to see
>>                 if it wants to
>>                 raise an error.  (Raising errors on some cores but
>>                 not all is
>>                 sometimes annoying though.)
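[Editor's note: the broadcast-then-compare-locally idea above can be sketched in a few lines of numpy. This is illustrative only — the function name, tolerance, and the absence of actual MPI calls are my assumptions, not GPAW's real API.]

```python
import numpy as np

def positions_agree(local_positions, master_positions, tol=1e-8):
    # Tolerance-based comparison: unlike rounding + hashing, positions
    # within `tol` of each other can never be flagged as different.
    return np.allclose(local_positions, master_positions, rtol=0.0, atol=tol)

# Positions as broadcast from the "master" rank, and a local copy with
# sub-tolerance noise, as might arise from differing BLAS/LAPACK numerics.
master = np.array([[0.0, 0.0, 0.0], [1.9, 0.0, 0.0]])
print(positions_agree(master + 5e-9, master))   # noise below tol: agree
print(positions_agree(master + 0.01, master))   # a real 0.01 Å shift: disagree
```

In a real implementation each rank would run this check against the broadcast reference and then agree collectively on whether to raise, avoiding the "error on some cores but not all" annoyance mentioned above.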
>>
>>                 Best regards
>>                 Ask
>>
>>                 2015-02-04 12:57 GMT+01:00 Ask Hjorth Larsen
>>                 <asklarsen at gmail.com <mailto:asklarsen at gmail.com>>:
>>
>>                     Hello
>>
>>                     2015-02-04 10:21 GMT+01:00 Torsten Hahn
>>                     <torstenhahn at fastmail.fm
>>                     <mailto:torstenhahn at fastmail.fm>>:
>>
>>                         Probably we could do this, but my feeling is
>>                         that this would only cure the symptoms, not
>>                         the real origin of this annoying bug.
>>
>>
>>                         In fact there is code in
>>
>>                         mpi/__init__.py
>>
>>                         that says:
>>
>>                         # Construct fingerprint:
>>                         # ASE may return slightly different atomic
>>                         positions (e.g. due
>>                         # to MKL) so compare only first 8 decimals of
>>                         positions
>>
>>
>>                         The code says that only the first 8 decimals
>>                         of the positions are used to generate the
>>                         atomic „fingerprints“.  This code relies on
>>                         numpy and therefore on lapack/blas functions.
>>                         However, I have no idea what the md5_array
>>                         etc. stuff really does, but there is some
>>                         debug code which should at least tell you
>>                         which atom(s) cause the problems.
>>
>>                     md5_array calculates the md5 sum of the data of
>>                     an array. It is a
>>                     kind of checksum.
>>
>>                     Rounding unfortunately does not solve the
>>                     problem.  For any epsilon, however small, there
>>                     exist numbers that differ by epsilon but round
>>                     to different values.  So the check will not work
>>                     the way it is implemented at the moment:
>>                     positions that are "close enough" can currently
>>                     generate an error.  In other words, if you get
>>                     this error, maybe there was no problem at all.
>>                     Given the vast thousands of DFT calculations
>>                     that are done, this may not be so unlikely.
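[Editor's note: the rounding failure mode described above is easy to demonstrate. The fingerprint function below is a loose imitation of a round-then-hash check, not GPAW's actual md5_array implementation.]

```python
import hashlib
import numpy as np

def fingerprint(positions, decimals=8):
    # Round to `decimals` places, then md5 the raw bytes -- a sketch of
    # the rounding-plus-hashing scheme discussed in the thread.
    rounded = np.round(np.asarray(positions, dtype=float), decimals)
    return hashlib.md5(rounded.tobytes()).hexdigest()

# Two values that differ by ~2e-15 but straddle an 8-decimal rounding
# boundary: they round to 0.12345678 and 0.12345679 respectively, so
# their fingerprints disagree even though the positions are "equal" by
# any physical standard.
a = np.array([0.123456784999999])
b = np.array([0.123456785000001])
print(abs(a[0] - b[0]) < 1e-12)          # essentially equal
print(fingerprint(a) == fingerprint(b))  # yet the hashes differ
```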
>>
>>                         However, that error is *very* strange because
>>                         mpi.broadcast(...) should result in *exactly*
>>                         the same objects on all cores. No idea why
>>                         there should be any difference at all and
>>                         what was the intention behind the fancy
>>                         fingerprint-generation stuff in the
>>                         compare_atoms(atoms, comm=world) method.
>>
>>                     The check was introduced because there were
>>                     (infrequent) situations
>>                     where different cores had different positions,
>>                     due e.g. to the finicky
>>                     numerics elsewhere discussed.  Later, I guess we
>>                     have accepted the
>>                     numerical issues and relaxed the check so it is
>>                     no longer exact,
>>                     preferring instead to broadcast.  Evidently
>>                     something else is
>>                     happening aside from the broadcast, which allows
>>                     things to go wrong.
>>                     Perhaps the error in the rounding scheme
>>                     mentioned above.
>>
>>                     To explain the hashing: We want to check that
>>                     numbers on two different
>>                     CPUs are equal.  Either we have to send all the
>>                     numbers, or hash them
>>                     and send the hash.  Hence hashing is much nicer. 
>>                     But maybe it would
>>                     be better to hash them with a continuous
>>                     function.  For example adding
>>                     all numbers with different (pseudorandom?)
>>                     complex phase factors.
>>                     Then one can compare the complex hashes and see
>>                     if they are close
>>                     enough to each other.  There are probably better
>>                     ways.
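[Editor's note: the continuous-hash idea above might look like the sketch below. The seeded-phase construction and the comparison tolerance are my own choices for illustration, not existing GPAW code.]

```python
import numpy as np

def continuous_hash(positions, seed=42):
    # Project the flattened positions onto fixed pseudorandom complex
    # phase factors.  Nearby inputs give nearby complex "hashes", so two
    # ranks can compare hashes with a tolerance rather than exactly.
    x = np.asarray(positions, dtype=float).ravel()
    phases = np.exp(2j * np.pi * np.random.default_rng(seed).random(x.size))
    return np.dot(x, phases)

rng = np.random.default_rng(0)
a = rng.random((4, 3))   # some atomic positions
b = a + 1e-12            # tiny per-rank numerical noise
print(abs(continuous_hash(a) - continuous_hash(b)) < 1e-9)  # hashes stay close
```

The trade-off versus md5 is that a continuous hash can collide (different configurations summing to nearby values), so it detects disagreement probabilistically rather than certifying equality.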
>>
>>                     Best regards
>>                     Ask
>>
>>                         Best,
>>                         Torsten.
>>
>>                             On 04.02.2015 at 10:00, jingzhe
>>                             <jingzhe.chen at gmail.com
>>                             <mailto:jingzhe.chen at gmail.com>> wrote:
>>
>>                             Hi Torsten,
>>
>>                                           Thanks for the quick reply,
>>                             but I use gcc and lapack/blas.  If the
>>                             positions of the atoms are slightly
>>                             different on different ranks because of
>>                             compiler/library differences, can we just
>>                             set a tolerance in check_atoms and skip
>>                             the error?
>>
>>                                           Best.
>>
>>                                           Jingzhe
>>
>>
>>
>>
>>
>>                                 On 2015-02-04 14:32, Torsten Hahn wrote:
>>
>>                                 Dear Jingzhe,
>>
>>                                 we often saw this error when using
>>                                 GPAW together with Intel MKL <= 11.x
>>                                 on Intel CPUs.  I never tracked down
>>                                 the error because it was gone after a
>>                                 compiler/library upgrade.
>>
>>                                 Best,
>>                                 Torsten.
>>
>>
>>                                 -- 
>>                                 Dr. Torsten Hahn
>>                                 torstenhahn at fastmail.fm
>>                                 <mailto:torstenhahn at fastmail.fm>
>>
>>                                     On 04.02.2015 at 07:27,
>>                                     jingzhe Chen
>>                                     <jingzhe.chen at gmail.com
>>                                     <mailto:jingzhe.chen at gmail.com>> wrote:
>>
>>                                     Dear GPAW guys,
>>
>>                                              I used the latest gpaw
>>                                     to run a relaxation job and got
>>                                     the error message below.
>>
>>                                           RuntimeError: Atoms objects
>>                                     on different processors are not
>>                                     identical!
>>
>>                                              I found a line in the
>>                                     force calculator,
>>                                     'wfs.world.broadcast(self.F_av, 0)',
>>                                     so the forces on all ranks should
>>                                     be the same, which confuses me; I
>>                                     cannot think of any other reason
>>                                     that could lead to this error.
>>
>>                                             Could anyone take a look
>>                                     at it?
>>
>>                                             I attached the structure
>>                                     file and running script here, I
>>                                     used 24 cores.
>>
>>                                             Thanks in advance.
>>
>>                                               Jingzhe
>>
>>                                     <main.py><model.traj>
>>                                     _______________________________________________
>>
>>                                     gpaw-users mailing list
>>                                     gpaw-users at listserv.fysik.dtu.dk
>>                                     <mailto:gpaw-users at listserv.fysik.dtu.dk>
>>                                     https://listserv.fysik.dtu.dk/mailman/listinfo/gpaw-users
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>


