[gpaw-users] Error when relaxing atoms
Marcin Dulak
Marcin.Dulak at fysik.dtu.dk
Mon Feb 9 10:26:49 CET 2015
Hi,
On 02/09/2015 05:29 AM, jingzhe wrote:
> Hi Marcin,
>
> My bad, I did not try the gpaw-test in parallel, just relax.py and
> transport.py. The gpaw-test in parallel failed with the following message:
>
>
> --------------------------------------------------------------------------
> An MPI process has executed an operation involving a call to the
> "fork()" system call to create a child process. Open MPI is currently
> operating in a condition that could result in memory corruption or
> other system errors; your MPI job may hang, crash, or produce silent
> data corruption. The use of fork() (or system() or other calls that
> create child processes) is strongly discouraged.
>
> The process that invoked fork was:
>
> Local host: ip03 (PID 3374)
> MPI_COMM_WORLD rank: 0
>
> If you are *absolutely sure* that your application will successfully
> and correctly survive a call to fork(), you may disable this warning
> by setting the mpi_warn_on_fork MCA parameter to 0.
> --------------------------------------------------------------------------
> python 2.6.6 GCC 4.4.7 20120313 (Red Hat 4.4.7-4) 64bit ELF on Linux
> x86_64 centos 6.5 Final
> Running tests in /ltmp/chenjing/gpaw-test-L_VbEM
> Jobs: 1, Cores: 4, debug-mode: False
> =============================================================================
> gemm_complex.py 0.027 OK
> ase3k_version.py 0.022 OK
> kpt.py 0.030 OK
> mpicomm.py 0.022 OK
> numpy_core_multiarray_dot.py 0.021 OK
> maxrss.py 0.000 SKIPPED
> fileio/hdf5_noncontiguous.py 0.002 SKIPPED
> cg2.py 0.024 OK
> laplace.py 0.023 OK
> lapack.py 0.023 OK
> eigh.py 0.022 OK
> parallel/submatrix_redist.py 0.000 SKIPPED
> second_derivative.py 0.035 OK
> parallel/parallel_eigh.py 0.022 OK
> gp2.py 0.023 OK
> blas.py 0.164 OK
> Gauss.py 0.045 OK
> nabla.py 0.140 OK
> dot.py 0.030 OK
> mmm.py 0.028 OK
> lxc_fxc.py 0.030 OK
> pbe_pw91.py 0.029 OK
> gradient.py 0.033 OK
> erf.py 0.028 OK
> lf.py 0.033 OK
> fsbt.py 0.034 OK
> parallel/compare.py 0.031 OK
> integral4.py 0.069 OK
> zher.py 0.149 OK
> gd.py 0.032 OK
> pw/interpol.py 0.025 OK
> screened_poisson.py 0.461 OK
> xc.py 0.064 OK
> XC2.py 2.548 OK
> yukawa_radial.py 0.024 OK
> dump_chi0.py 0.045 OK
> vdw/potential.py 0.026 OK
> lebedev.py 0.053 OK
> fileio/hdf5_simple.py 0.002 SKIPPED
> occupations.py 0.080 OK
> derivatives.py 0.034 OK
> parallel/realspace_blacs.py 0.027 OK
> pw/reallfc.py [ip03:03367] 3 more processes have
> sent help message help-mpi-runtime.txt / mpi_init:warn-fork
> [ip03:03367] Set MCA parameter "orte_base_help_aggregate" to 0 to see
> all help / error messages
> 0.357 OK
> parallel/pblas.py 0.048 OK
> non_periodic.py 0.064 OK
> spectrum.py 0.019 SKIPPED
> pw/lfc.py 0.273 OK
> gauss_func.py 1.032 OK
> multipoletest.py 0.516 OK
> noncollinear/xcgrid3d.py 6.207 OK
> cluster.py 0.228 OK
> poisson.py 0.095 OK
> parallel/overlap.py 2.293 OK
> parallel/scalapack.py 0.036 OK
> gauss_wave.py 0.650 OK
> transformations.py 0.047 OK
> parallel/blacsdist.py 0.033 OK
> ut_rsh.py 2.098 OK
> pbc.py 0.822 OK
> noncollinear/xccorr.py 0.587 OK
> atoms_too_close.py 1.043 OK
> harmonic.py 40.344 OK
> proton.py 5.189 OK
> atoms_mismatch.py 0.051 OK
> timing.py 0.935 OK
> parallel/ut_parallel.py 1.098 OK
> ut_csh.py Test failed. Check ut_csh.log for details.
> Test failed. Check ut_csh.log for details.
> Test failed. Check ut_csh.log for details.
> Test failed. Check ut_csh.log for details.
> --------------------------------------------------------------------------
> mpirun has exited due to process rank 2 with PID 3376 on
> node ip03 exiting without calling "finalize". This may
> have caused other processes in the application to be
> terminated by signals sent by mpirun (as reported here).
> --------------------------------------------------------------------------
>
> Numpy was compiled on the cluster; I did not build it myself.
>
> for
> ldd `which gpaw-python`
> I got
>
> linux-vdso.so.1 => (0x00007fff2edff000)
> libgfortran.so.3 => /home/chenjing/programs/MatlabR2011A/sys/os/glnxa64/libgfortran.so.3 (0x00002b41947c7000)
This is risky: you are using libgfortran and libgcc_s distributed with Matlab
instead of the default system ones.
I don't see any blas library linked - do you use static linking?
> libxc.so.1 => /home/chenjing/Installation/libxc/lib/libxc.so.1 (0x00002b4194a9f000)
> libpython2.6.so.1.0 => /usr/lib64/libpython2.6.so.1.0 (0x000000343ee00000)
> libpthread.so.0 => /lib64/libpthread.so.0 (0x000000343e600000)
> libdl.so.2 => /lib64/libdl.so.2 (0x000000343e200000)
> libutil.so.1 => /lib64/libutil.so.1 (0x0000003440e00000)
> libm.so.6 => /lib64/libm.so.6 (0x000000343ea00000)
> libmpi.so.0 => /home/chenjing/openmpi-1.4.5/lib/libmpi.so.0 (0x00002b4194d52000)
> libopen-rte.so.0 => /home/chenjing/openmpi-1.4.5/lib/libopen-rte.so.0 (0x00002b4195165000)
> libopen-pal.so.0 => /home/chenjing/openmpi-1.4.5/lib/libopen-pal.so.0 (0x00002b41953ed000)
> librdmacm.so.1 => /usr/lib64/librdmacm.so.1 (0x000000343f600000)
> libibverbs.so.1 => /usr/lib64/libibverbs.so.1 (0x000000343f200000)
> libtorque.so.2 => /opt/torque/lib/libtorque.so.2 (0x00002b419564f000)
> libnuma.so.1 => /usr/lib64/libnuma.so.1 (0x00002b4195953000)
> libnsl.so.1 => /lib64/libnsl.so.1 (0x00002b4195b5c000)
> libc.so.6 => /lib64/libc.so.6 (0x000000343de00000)
> libgcc_s.so.1 => /home/chenjing/programs/MatlabR2011A/sys/os/glnxa64/libgcc_s.so.1 (0x00002b4195d76000)
> /lib64/ld-linux-x86-64.so.2 (0x000000343da00000)
> libnl.so.1 => /lib64/libnl.so.1 (0x00002b4195f8c000)
>
> for
> python -c "import numpy; print numpy.__config__.show(); print numpy.__version__"
>
> I got
>
> atlas_threads_info:
> libraries = ['lapack', 'ptf77blas', 'ptcblas', 'atlas']
> library_dirs = ['/usr/lib64/atlas']
> language = f77
> include_dirs = ['/usr/include']
>
> blas_opt_info:
> libraries = ['ptf77blas', 'ptcblas', 'atlas']
Here is the problem: this is a multithreaded ATLAS - it won't work with
gpaw. You need to build another numpy. See
https://wiki.fysik.dtu.dk/gpaw/install/Linux/r410_psmn.ens-lyon.html
for how to disable the use of the system blas/lapack and use numpy's
internally distributed sources instead.
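For example, something along these lines (a rough, untested sketch - the
numpy version and the installation prefix below are just placeholders,
adjust them to your setup) builds numpy against its internal single-threaded
lapack_lite instead of the system ATLAS:

  # disable detection of the system blas/lapack/atlas
  cd numpy-1.4.1
  BLAS=None LAPACK=None ATLAS=None python setup.py install --prefix=$HOME/numpy-serial
  # make sure this numpy is found first, then rebuild gpaw-python against it
  export PYTHONPATH=$HOME/numpy-serial/lib/python2.6/site-packages:$PYTHONPATH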
Best regards,
Marcin
> library_dirs = ['/usr/lib64/atlas']
> define_macros = [('ATLAS_INFO', '"\\"3.8.4\\""')]
> language = c
> include_dirs = ['/usr/include']
>
> atlas_blas_threads_info:
> libraries = ['ptf77blas', 'ptcblas', 'atlas']
> library_dirs = ['/usr/lib64/atlas']
> language = c
> include_dirs = ['/usr/include']
>
> lapack_opt_info:
> libraries = ['lapack', 'ptf77blas', 'ptcblas', 'atlas']
> library_dirs = ['/usr/lib64/atlas']
> define_macros = [('ATLAS_INFO', '"\\"3.8.4\\""')]
> language = f77
> include_dirs = ['/usr/include']
>
> lapack_mkl_info:
> NOT AVAILABLE
>
> blas_mkl_info:
> NOT AVAILABLE
>
> mkl_info:
> NOT AVAILABLE
>
> None
> 1.4.1
>
> and for ldd `python -c "from numpy.core import _dotblas; print
> _dotblas.__file__"`
>
> I got
> linux-vdso.so.1 => (0x00007fff2afff000)
> libptf77blas.so.3 => /usr/lib64/atlas/libptf77blas.so.3
> (0x00002b104af64000)
> libptcblas.so.3 => /usr/lib64/atlas/libptcblas.so.3
> (0x00002b104b184000)
> libatlas.so.3 => /usr/lib64/atlas/libatlas.so.3 (0x00002b104b3a4000)
> libpython2.6.so.1.0 => /usr/lib64/libpython2.6.so.1.0
> (0x00002b104ba00000)
> libpthread.so.0 => /lib64/libpthread.so.0 (0x00002b104bda7000)
> libc.so.6 => /lib64/libc.so.6 (0x00002b104bfc4000)
> libgfortran.so.3 =>
> /home/chenjing/programs/MatlabR2011A/sys/os/glnxa64/libgfortran.so.3
> (0x00002b104c358000)
> libm.so.6 => /lib64/libm.so.6 (0x00002b104c631000)
> libdl.so.2 => /lib64/libdl.so.2 (0x00002b104c8b5000)
> libutil.so.1 => /lib64/libutil.so.1 (0x00002b104cab9000)
> /lib64/ld-linux-x86-64.so.2 (0x000000343da00000)
>
>
> Best.
> Jingzhe
>
>
>
> On 02/08/2015 18:33, Marcin Dulak wrote:
>> On 02/08/2015 01:42 AM, jingzhe Chen wrote:
>>> Hi all,
>>> On one cluster this error appears again and again, where gpaw is
>>> compiled with blas/lapack; even the forces are not the same after
>>> broadcasting. It disappears when I run the same script on another
>>> cluster (also blas/lapack).
>> Does the full gpaw-test pass in parallel?
>> How was numpy compiled on those clusters?
>> To get the full information, provide:
>> ldd `which gpaw-python`
>> python -c "import numpy; print numpy.__config__.show(); print numpy.__version__"
>> In addition, check the libraries linked by numpy's _dotblas.so
>> (_dotblas.so is most often the source of problems) with:
>> ldd `python -c "from numpy.core import _dotblas; print
>> _dotblas.__file__"`
>>
>> Best regards,
>>
>> Marcin
>>>
>>> Best.
>>> Jingzhe
>>>
>>> On Fri, Feb 6, 2015 at 12:41 PM, jingzhe <jingzhe.chen at gmail.com> wrote:
>>>
>>> Dear all,
>>>
>>> I ran again in debug mode; the atomic positions on different ranks can
>>> differ on the order of 0.01 A, and even the forces on different ranks can
>>> differ on the order of 1 eV/A, while each time only one rank behaves
>>> oddly. I have now swapped the two lines (broadcast and symmetric
>>> correction) in the force calculator to see what happens.
>>>
>>> Best.
>>>
>>> Jingzhe
>>>
>>>
>>> On 02/05/2015 15:53, Jens Jørgen Mortensen wrote:
>>>
>>> On 02/04/2015 05:12 PM, Ask Hjorth Larsen wrote:
>>>
>>> I committed something in r12401 which should make the
>>> check more
>>> reliable. It does not use hashing because the atoms
>>> object is sent
>>> anyway.
>>>
>>>
>>> Thanks a lot for fixing this! Should there also be some
>>> tolerance for the unit cell?
>>>
>>> Jens Jørgen
>>>
>>> Best regards
>>> Ask
>>>
>>> 2015-02-04 14:47 GMT+01:00 Ask Hjorth Larsen <asklarsen at gmail.com>:
>>>
>>> Well, to clarify a bit.
>>>
>>> The hashing is useful if we don't want to send stuff
>>> around.
>>>
>>> If we are actually sending the positions now (by
>>> broadcast; I am only
>>> strictly aware that the forces are broadcast), then
>>> each core can
>>> compare locally without the need for hashing, to see
>>> if it wants to
>>> raise an error. (Raising errors on some cores but
>>> not all is
>>> sometimes annoying though.)
>>>
>>> Best regards
>>> Ask
>>>
>>> 2015-02-04 12:57 GMT+01:00 Ask Hjorth Larsen <asklarsen at gmail.com>:
>>>
>>> Hello
>>>
>>> 2015-02-04 10:21 GMT+01:00 Torsten Hahn <torstenhahn at fastmail.fm>:
>>>
>>> Probably we could do this, but my feeling is that this would only cure
>>> the symptoms, not the real origin of this annoying bug.
>>>
>>>
>>> In fact there is code in
>>>
>>> mpi/__init__.py
>>>
>>> that says:
>>>
>>> # Construct fingerprint:
>>> # ASE may return slightly different atomic
>>> positions (e.g. due
>>> # to MKL) so compare only first 8 decimals
>>> of positions
>>>
>>>
>>> The code says that only 8 decimal places of the positions are used for
>>> the generation of atomic "fingerprints". This code relies on numpy and
>>> therefore on lapack/blas functions. However, I have no idea what that
>>> md5_array etc. stuff really does. But there is some debug code which
>>> should at least tell you which Atom(s) cause the problems.
>>>
>>> md5_array calculates the md5 sum of the data of
>>> an array. It is a
>>> kind of checksum.
>>>
>>> Rounding unfortunately does not solve the problem. For any epsilon,
>>> however small, there exist numbers that differ by epsilon but round to
>>> different values. So the check will not work the way it is implemented
>>> at the moment: positions that are "close enough" can currently generate
>>> an error. In other words, if you get this error, maybe there was no
>>> problem at all. Given the vast thousands of DFT calculations that are
>>> done, this may not be so unlikely.
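A made-up illustration of that boundary effect (the numbers are not from any
actual run): two positions differing by only 2e-10 can round to different
8-decimal values, and therefore to different md5 fingerprints:

  python -c "print round(1.0000000049, 8), round(1.0000000051, 8)"
  1.0 1.00000001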
>>>
>>> However, that error is *very* strange
>>> because mpi.broadcast(...) should result in
>>> *exactly* the same objects on all cores. No
>>> idea why there should be any difference at
>>> all and what was the intention behind the
>>> fancy fingerprint-generation stuff in the
>>> compare_atoms(atoms, comm=world) method.
>>>
>>> The check was introduced because there were
>>> (infrequent) situations
>>> where different cores had different positions,
>>> due e.g. to the finicky
>>> numerics elsewhere discussed. Later, I guess we
>>> have accepted the
>>> numerical issues and relaxed the check so it is
>>> no longer exact,
>>> preferring instead to broadcast. Evidently
>>> something else is
>>> happening aside from the broadcast, which allows
>>> things to go wrong.
>>> Perhaps the error in the rounding scheme
>>> mentioned above.
>>>
>>> To explain the hashing: We want to check that
>>> numbers on two different
>>> CPUs are equal. Either we have to send all the
>>> numbers, or hash them
>>> and send the hash. Hence hashing is much
>>> nicer. But maybe it would
>>> be better to hash them with a continuous
>>> function. For example adding
>>> all numbers with different (pseudorandom?)
>>> complex phase factors.
>>> Then one can compare the complex hashes and see
>>> if they are close
>>> enough to each other. There are probably better
>>> ways.
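A rough sketch of such a continuous "hash", just to illustrate the idea (the
function name, seed and tolerance are made up; this is not code from gpaw):

  import numpy as np

  def soft_hash(pos_av, seed=42):
      # project the flattened positions onto fixed pseudorandom complex
      # phases; nearby inputs then give nearby complex hash values
      rng = np.random.RandomState(seed)
      x = np.asarray(pos_av).ravel()
      phases = np.exp(2j * np.pi * rng.random_sample(x.size))
      return np.dot(x, phases)

  a = np.zeros((4, 3)) + 1.0000000049   # positions as seen on one rank
  b = a + 2e-10                         # slightly different copy on another rank
  print abs(soft_hash(a) - soft_hash(b)) < 1e-8   # True: close enough, no error

Only the scalar hash would have to be exchanged between cores and compared
with a tolerance, instead of requiring an exact md5 match.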
>>>
>>> Best regards
>>> Ask
>>>
>>> Best,
>>> Torsten.
>>>
>>> On 02/04/2015 10:00, jingzhe <jingzhe.chen at gmail.com> wrote:
>>>
>>> Hi Torsten,
>>>
>>> Thanks for the quick reply, but I use gcc and lapack/blas. I mean, if
>>> the positions of the atoms are slightly different on different ranks
>>> because of compiler/library stuff, can we just set a tolerance in
>>> check_atoms and skip the error?
>>>
>>> Best.
>>>
>>> Jingzhe
>>>
>>>
>>>
>>>
>>>
>>> On 02/04/2015 14:32, Torsten Hahn wrote:
>>>
>>> Dear Jingzhe,
>>>
>>> we have often seen this error when using GPAW together with Intel
>>> MKL <= 11.x on Intel CPUs. I never tracked down the error because it
>>> was gone after a compiler/library upgrade.
>>>
>>> Best,
>>> Torsten.
>>>
>>>
>>> --
>>> Dr. Torsten Hahn
>>> torstenhahn at fastmail.fm
>>>
>>> On 02/04/2015 07:27, jingzhe Chen <jingzhe.chen at gmail.com> wrote:
>>>
>>> Dear GPAW guys,
>>>
>>> I used the latest gpaw to run a relaxation job, and got the error
>>> message below.
>>>
>>> RuntimeError: Atoms objects on different processors are not identical!
>>>
>>> I found a line in the force calculator,
>>> 'wfs.world.broadcast(self.F_av, 0)', so all the forces on different
>>> ranks should be the same, which confuses me; I cannot think of any other
>>> reason that could lead to this error.
>>>
>>> Could anyone take a look at it?
>>>
>>> I attached the structure file and the running script here; I used 24 cores.
>>>
>>> Thanks in advance.
>>>
>>> Jingzhe
>>>
>>> <main.py><model.traj>