[gpaw-users] Error when relaxing atoms
Marcin Dulak
Marcin.Dulak at fysik.dtu.dk
Mon Feb 9 10:26:49 CET 2015
Hi,
On 02/09/2015 05:29 AM, jingzhe wrote:
> Hi Marcin,
>
> My bad, I did not try the gpaw-test in parallel, just relax.py and
> transport.py. The gpaw-test in parallel failed with the following message:
>
>
> --------------------------------------------------------------------------
> An MPI process has executed an operation involving a call to the
> "fork()" system call to create a child process. Open MPI is currently
> operating in a condition that could result in memory corruption or
> other system errors; your MPI job may hang, crash, or produce silent
> data corruption. The use of fork() (or system() or other calls that
> create child processes) is strongly discouraged.
>
> The process that invoked fork was:
>
> Local host: ip03 (PID 3374)
> MPI_COMM_WORLD rank: 0
>
> If you are *absolutely sure* that your application will successfully
> and correctly survive a call to fork(), you may disable this warning
> by setting the mpi_warn_on_fork MCA parameter to 0.
> --------------------------------------------------------------------------
> python 2.6.6 GCC 4.4.7 20120313 (Red Hat 4.4.7-4) 64bit ELF on Linux
> x86_64 centos 6.5 Final
> Running tests in /ltmp/chenjing/gpaw-test-L_VbEM
> Jobs: 1, Cores: 4, debug-mode: False
> =============================================================================
> gemm_complex.py 0.027 OK
> ase3k_version.py 0.022 OK
> kpt.py 0.030 OK
> mpicomm.py 0.022 OK
> numpy_core_multiarray_dot.py 0.021 OK
> maxrss.py 0.000 SKIPPED
> fileio/hdf5_noncontiguous.py 0.002 SKIPPED
> cg2.py 0.024 OK
> laplace.py 0.023 OK
> lapack.py 0.023 OK
> eigh.py 0.022 OK
> parallel/submatrix_redist.py 0.000 SKIPPED
> second_derivative.py 0.035 OK
> parallel/parallel_eigh.py 0.022 OK
> gp2.py 0.023 OK
> blas.py 0.164 OK
> Gauss.py 0.045 OK
> nabla.py 0.140 OK
> dot.py 0.030 OK
> mmm.py 0.028 OK
> lxc_fxc.py 0.030 OK
> pbe_pw91.py 0.029 OK
> gradient.py 0.033 OK
> erf.py 0.028 OK
> lf.py 0.033 OK
> fsbt.py 0.034 OK
> parallel/compare.py 0.031 OK
> integral4.py 0.069 OK
> zher.py 0.149 OK
> gd.py 0.032 OK
> pw/interpol.py 0.025 OK
> screened_poisson.py 0.461 OK
> xc.py 0.064 OK
> XC2.py 2.548 OK
> yukawa_radial.py 0.024 OK
> dump_chi0.py 0.045 OK
> vdw/potential.py 0.026 OK
> lebedev.py 0.053 OK
> fileio/hdf5_simple.py 0.002 SKIPPED
> occupations.py 0.080 OK
> derivatives.py 0.034 OK
> parallel/realspace_blacs.py 0.027 OK
> pw/reallfc.py [ip03:03367] 3 more processes have
> sent help message help-mpi-runtime.txt / mpi_init:warn-fork
> [ip03:03367] Set MCA parameter "orte_base_help_aggregate" to 0 to see
> all help / error messages
> 0.357 OK
> parallel/pblas.py 0.048 OK
> non_periodic.py 0.064 OK
> spectrum.py 0.019 SKIPPED
> pw/lfc.py 0.273 OK
> gauss_func.py 1.032 OK
> multipoletest.py 0.516 OK
> noncollinear/xcgrid3d.py 6.207 OK
> cluster.py 0.228 OK
> poisson.py 0.095 OK
> parallel/overlap.py 2.293 OK
> parallel/scalapack.py 0.036 OK
> gauss_wave.py 0.650 OK
> transformations.py 0.047 OK
> parallel/blacsdist.py 0.033 OK
> ut_rsh.py 2.098 OK
> pbc.py 0.822 OK
> noncollinear/xccorr.py 0.587 OK
> atoms_too_close.py 1.043 OK
> harmonic.py 40.344 OK
> proton.py 5.189 OK
> atoms_mismatch.py 0.051 OK
> timing.py 0.935 OK
> parallel/ut_parallel.py 1.098 OK
> ut_csh.py Test failed. Check ut_csh.log for details.
> Test failed. Check ut_csh.log for details.
> Test failed. Check ut_csh.log for details.
> Test failed. Check ut_csh.log for details.
> --------------------------------------------------------------------------
> mpirun has exited due to process rank 2 with PID 3376 on
> node ip03 exiting without calling "finalize". This may
> have caused other processes in the application to be
> terminated by signals sent by mpirun (as reported here).
> --------------------------------------------------------------------------
>
> Numpy was compiled on the cluster; I did not build it myself.
>
> for
> ldd `which gpaw-python`
> I got
>
> linux-vdso.so.1 => (0x00007fff2edff000)
> libgfortran.so.3 => /home/chenjing/programs/MatlabR2011A/sys/os/glnxa64/libgfortran.so.3 (0x00002b41947c7000)
This is risky: you are using libgfortran and libgcc_s distributed with Matlab
instead of the default system ones.
I don't see any blas library linked - do you use static linking?
> libxc.so.1 => /home/chenjing/Installation/libxc/lib/libxc.so.1 (0x00002b4194a9f000)
> libpython2.6.so.1.0 => /usr/lib64/libpython2.6.so.1.0 (0x000000343ee00000)
> libpthread.so.0 => /lib64/libpthread.so.0 (0x000000343e600000)
> libdl.so.2 => /lib64/libdl.so.2 (0x000000343e200000)
> libutil.so.1 => /lib64/libutil.so.1 (0x0000003440e00000)
> libm.so.6 => /lib64/libm.so.6 (0x000000343ea00000)
> libmpi.so.0 => /home/chenjing/openmpi-1.4.5/lib/libmpi.so.0 (0x00002b4194d52000)
> libopen-rte.so.0 => /home/chenjing/openmpi-1.4.5/lib/libopen-rte.so.0 (0x00002b4195165000)
> libopen-pal.so.0 => /home/chenjing/openmpi-1.4.5/lib/libopen-pal.so.0 (0x00002b41953ed000)
> librdmacm.so.1 => /usr/lib64/librdmacm.so.1 (0x000000343f600000)
> libibverbs.so.1 => /usr/lib64/libibverbs.so.1 (0x000000343f200000)
> libtorque.so.2 => /opt/torque/lib/libtorque.so.2 (0x00002b419564f000)
> libnuma.so.1 => /usr/lib64/libnuma.so.1 (0x00002b4195953000)
> libnsl.so.1 => /lib64/libnsl.so.1 (0x00002b4195b5c000)
> libc.so.6 => /lib64/libc.so.6 (0x000000343de00000)
> libgcc_s.so.1 => /home/chenjing/programs/MatlabR2011A/sys/os/glnxa64/libgcc_s.so.1 (0x00002b4195d76000)
> /lib64/ld-linux-x86-64.so.2 (0x000000343da00000)
> libnl.so.1 => /lib64/libnl.so.1 (0x00002b4195f8c000)
>
> for
> python -c "import numpy; print numpy.__config__.show(); print numpy.__version__"
>
> I got
>
> atlas_threads_info:
> libraries = ['lapack', 'ptf77blas', 'ptcblas', 'atlas']
> library_dirs = ['/usr/lib64/atlas']
> language = f77
> include_dirs = ['/usr/include']
>
> blas_opt_info:
> libraries = ['ptf77blas', 'ptcblas', 'atlas']
Here is the problem: this is a multithreaded ATLAS - it won't work with
gpaw. You need to build another numpy. See
https://wiki.fysik.dtu.dk/gpaw/install/Linux/r410_psmn.ens-lyon.html
for how to disable the use of the system blas/lapack and use numpy's
internally distributed sources instead.
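For example, something along these lines (a rough, untested sketch - the
numpy version and the installation prefix below are just placeholders,
adjust them to your setup) builds numpy against its internal single-threaded
lapack_lite instead of the system ATLAS:

  # disable detection of the system blas/lapack/atlas
  cd numpy-1.4.1
  BLAS=None LAPACK=None ATLAS=None python setup.py install --prefix=$HOME/numpy-serial
  # make sure this numpy is found first, then rebuild gpaw-python against it
  export PYTHONPATH=$HOME/numpy-serial/lib/python2.6/site-packages:$PYTHONPATH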
Best regards,
Marcin
> library_dirs = ['/usr/lib64/atlas']
> define_macros = [('ATLAS_INFO', '"\\"3.8.4\\""')]
> language = c
> include_dirs = ['/usr/include']
>
> atlas_blas_threads_info:
> libraries = ['ptf77blas', 'ptcblas', 'atlas']
> library_dirs = ['/usr/lib64/atlas']
> language = c
> include_dirs = ['/usr/include']
>
> lapack_opt_info:
> libraries = ['lapack', 'ptf77blas', 'ptcblas', 'atlas']
> library_dirs = ['/usr/lib64/atlas']
> define_macros = [('ATLAS_INFO', '"\\"3.8.4\\""')]
> language = f77
> include_dirs = ['/usr/include']
>
> lapack_mkl_info:
> NOT AVAILABLE
>
> blas_mkl_info:
> NOT AVAILABLE
>
> mkl_info:
> NOT AVAILABLE
>
> None
> 1.4.1
>
> and for ldd `python -c "from numpy.core import _dotblas; print
> _dotblas.__file__"`
>
> I got
> linux-vdso.so.1 => (0x00007fff2afff000)
> libptf77blas.so.3 => /usr/lib64/atlas/libptf77blas.so.3
> (0x00002b104af64000)
> libptcblas.so.3 => /usr/lib64/atlas/libptcblas.so.3
> (0x00002b104b184000)
> libatlas.so.3 => /usr/lib64/atlas/libatlas.so.3 (0x00002b104b3a4000)
> libpython2.6.so.1.0 => /usr/lib64/libpython2.6.so.1.0
> (0x00002b104ba00000)
> libpthread.so.0 => /lib64/libpthread.so.0 (0x00002b104bda7000)
> libc.so.6 => /lib64/libc.so.6 (0x00002b104bfc4000)
> libgfortran.so.3 =>
> /home/chenjing/programs/MatlabR2011A/sys/os/glnxa64/libgfortran.so.3
> (0x00002b104c358000)
> libm.so.6 => /lib64/libm.so.6 (0x00002b104c631000)
> libdl.so.2 => /lib64/libdl.so.2 (0x00002b104c8b5000)
> libutil.so.1 => /lib64/libutil.so.1 (0x00002b104cab9000)
> /lib64/ld-linux-x86-64.so.2 (0x000000343da00000)
>
>
> Best.
> Jingzhe
>
>
>
> On 02/08/2015 18:33, Marcin Dulak wrote:
>> On 02/08/2015 01:42 AM, jingzhe Chen wrote:
>>> Hi all,
>>> On one cluster this error appears again and again, where gpaw is
>>> compiled with blas/lapack; even the forces are not the same after
>>> broadcasting. It disappears when I run the same script on another
>>> cluster (also blas/lapack).
>> Does the full gpaw-test pass in parallel?
>> How was numpy compiled on those clusters?
>> To get the full information, provide:
>> ldd `which gpaw-python`
>> python -c "import numpy; print numpy.__config__.show(); print numpy.__version__"
>> In addition, check the libraries linked by numpy's _dotblas.so
>> (_dotblas.so is most often the source of problems) with:
>> ldd `python -c "from numpy.core import _dotblas; print
>> _dotblas.__file__"`
>>
>> Best regards,
>>
>> Marcin
>>>
>>> Best.
>>> Jingzhe
>>>
>>> On Fri, Feb 6, 2015 at 12:41 PM, jingzhe <jingzhe.chen at gmail.com> wrote:
>>>
>>> Dear all,
>>>
>>> I ran again in debug mode; the atomic positions on different ranks can
>>> differ on the order of 0.01 A, and even the forces on different ranks can
>>> differ on the order of 1 eV/A, while each time only one rank behaves
>>> oddly. I have now swapped the two lines (broadcast and symmetric
>>> correction) in the force calculator to see what happens.
>>>
>>> Best.
>>>
>>> Jingzhe
>>>
>>>
>>> On 02/05/2015 15:53, Jens Jørgen Mortensen wrote:
>>>
>>> On 02/04/2015 05:12 PM, Ask Hjorth Larsen wrote:
>>>
>>> I committed something in r12401 which should make the
>>> check more
>>> reliable. It does not use hashing because the atoms
>>> object is sent
>>> anyway.
>>>
>>>
>>> Thanks a lot for fixing this! Should there also be some
>>> tolerance for the unit cell?
>>>
>>> Jens Jørgen
>>>
>>> Best regards
>>> Ask
>>>
>>> 2015-02-04 14:47 GMT+01:00 Ask Hjorth Larsen <asklarsen at gmail.com>:
>>>
>>> Well, to clarify a bit.
>>>
>>> The hashing is useful if we don't want to send stuff
>>> around.
>>>
>>> If we are actually sending the positions now (by
>>> broadcast; I am only
>>> strictly aware that the forces are broadcast), then
>>> each core can
>>> compare locally without the need for hashing, to see
>>> if it wants to
>>> raise an error. (Raising errors on some cores but
>>> not all is
>>> sometimes annoying though.)
>>>
>>> Best regards
>>> Ask
>>>
>>> 2015-02-04 12:57 GMT+01:00 Ask Hjorth Larsen <asklarsen at gmail.com>:
>>>
>>> Hello
>>>
>>> 2015-02-04 10:21 GMT+01:00 Torsten Hahn <torstenhahn at fastmail.fm>:
>>>
>>> Probably we could do this, but my feeling is that this would only cure
>>> the symptoms, not the real origin of this annoying bug.
>>>
>>>
>>> In fact there is code in
>>>
>>> mpi/__init__.py
>>>
>>> that says:
>>>
>>> # Construct fingerprint:
>>> # ASE may return slightly different atomic
>>> positions (e.g. due
>>> # to MKL) so compare only first 8 decimals
>>> of positions
>>>
>>>
>>> The code says that only 8 decimal places of the positions are used for
>>> the generation of atomic "fingerprints". This code relies on numpy and
>>> therefore on lapack/blas functions. However, I have no idea what that
>>> md5_array etc. stuff really does. But there is some debug code which
>>> should at least tell you which Atom(s) cause the problems.
>>>
>>> md5_array calculates the md5 sum of the data of
>>> an array. It is a
>>> kind of checksum.
>>>
>>> Rounding unfortunately does not solve the problem. For any epsilon,
>>> however small, there exist numbers that differ by epsilon but round to
>>> different values. So the check will not work the way it is implemented
>>> at the moment: positions that are "close enough" can currently generate
>>> an error. In other words, if you get this error, maybe there was no
>>> problem at all. Given the vast thousands of DFT calculations that are
>>> done, this may not be so unlikely.
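A made-up illustration of that boundary effect (the numbers are not from any
actual run): two positions differing by only 2e-10 can round to different
8-decimal values, and therefore to different md5 fingerprints:

  python -c "print round(1.0000000049, 8), round(1.0000000051, 8)"
  1.0 1.00000001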
>>>
>>> However, that error is *very* strange
>>> because mpi.broadcast(...) should result in
>>> *exactly* the same objects on all cores. No
>>> idea why there should be any difference at
>>> all and what was the intention behind the
>>> fancy fingerprint-generation stuff in the
>>> compare_atoms(atoms, comm=world) method.
>>>
>>> The check was introduced because there were
>>> (infrequent) situations
>>> where different cores had different positions,
>>> due e.g. to the finicky
>>> numerics elsewhere discussed. Later, I guess we
>>> have accepted the
>>> numerical issues and relaxed the check so it is
>>> no longer exact,
>>> preferring instead to broadcast. Evidently
>>> something else is
>>> happening aside from the broadcast, which allows
>>> things to go wrong.
>>> Perhaps the error in the rounding scheme
>>> mentioned above.
>>>
>>> To explain the hashing: We want to check that
>>> numbers on two different
>>> CPUs are equal. Either we have to send all the
>>> numbers, or hash them
>>> and send the hash. Hence hashing is much
>>> nicer. But maybe it would
>>> be better to hash them with a continuous
>>> function. For example adding
>>> all numbers with different (pseudorandom?)
>>> complex phase factors.
>>> Then one can compare the complex hashes and see
>>> if they are close
>>> enough to each other. There are probably better
>>> ways.
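A rough sketch of such a continuous "hash", just to illustrate the idea (the
function name, seed and tolerance are made up; this is not code from gpaw):

  import numpy as np

  def soft_hash(pos_av, seed=42):
      # project the flattened positions onto fixed pseudorandom complex
      # phases; nearby inputs then give nearby complex hash values
      rng = np.random.RandomState(seed)
      x = np.asarray(pos_av).ravel()
      phases = np.exp(2j * np.pi * rng.random_sample(x.size))
      return np.dot(x, phases)

  a = np.zeros((4, 3)) + 1.0000000049   # positions as seen on one rank
  b = a + 2e-10                         # slightly different copy on another rank
  print abs(soft_hash(a) - soft_hash(b)) < 1e-8   # True: close enough, no error

Only the scalar hash would have to be exchanged between cores and compared
with a tolerance, instead of requiring an exact md5 match.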
>>>
>>> Best regards
>>> Ask
>>>
>>> Best,
>>> Torsten.
>>>
>>> On 02/04/2015 10:00, jingzhe <jingzhe.chen at gmail.com> wrote:
>>>
>>> Hi Torsten,
>>>
>>> Thanks for the quick reply, but I use gcc and lapack/blas. I mean, if
>>> the positions of the atoms are slightly different on different ranks
>>> because of compiler/library stuff, can we just set a tolerance in
>>> check_atoms and skip the error?
>>>
>>> Best.
>>>
>>> Jingzhe
>>>
>>>
>>>
>>>
>>>
>>> On 02/04/2015 14:32, Torsten Hahn wrote:
>>>
>>> Dear Jingzhe,
>>>
>>> we have often seen this error when using GPAW together with Intel
>>> MKL <= 11.x on Intel CPUs. I never tracked down the error because it
>>> was gone after a compiler/library upgrade.
>>>
>>> Best,
>>> Torsten.
>>>
>>>
>>> --
>>> Dr. Torsten Hahn
>>> torstenhahn at fastmail.fm
>>>
>>> On 02/04/2015 07:27, jingzhe Chen <jingzhe.chen at gmail.com> wrote:
>>>
>>> Dear GPAW guys,
>>>
>>> I used the latest gpaw to run a relaxation job, and got the error
>>> message below.
>>>
>>> RuntimeError: Atoms objects on different processors are not identical!
>>>
>>> I found a line in the force calculator,
>>> 'wfs.world.broadcast(self.F_av, 0)', so all the forces on different
>>> ranks should be the same, which confuses me; I cannot think of any other
>>> reason that could lead to this error.
>>>
>>> Could anyone take a look at it?
>>>
>>> I attached the structure file and the running script here; I used 24 cores.
>>>
>>> Thanks in advance.
>>>
>>> Jingzhe
>>>
>>> <main.py><model.traj>