[gpaw-users] Error when relaxing atoms
jingzhe
jingzhe.chen at gmail.com
Mon Feb 9 05:29:59 CET 2015
Hi Marcin,
My bad, I did not try gpaw-test in parallel; I only tried relax.py and
transport.py. The gpaw-test in parallel failed with the following message:
--------------------------------------------------------------------------
An MPI process has executed an operation involving a call to the
"fork()" system call to create a child process. Open MPI is currently
operating in a condition that could result in memory corruption or
other system errors; your MPI job may hang, crash, or produce silent
data corruption. The use of fork() (or system() or other calls that
create child processes) is strongly discouraged.
The process that invoked fork was:
Local host: ip03 (PID 3374)
MPI_COMM_WORLD rank: 0
If you are *absolutely sure* that your application will successfully
and correctly survive a call to fork(), you may disable this warning
by setting the mpi_warn_on_fork MCA parameter to 0.
--------------------------------------------------------------------------
python 2.6.6 GCC 4.4.7 20120313 (Red Hat 4.4.7-4) 64bit ELF on Linux
x86_64 centos 6.5 Final
Running tests in /ltmp/chenjing/gpaw-test-L_VbEM
Jobs: 1, Cores: 4, debug-mode: False
=============================================================================
gemm_complex.py 0.027 OK
ase3k_version.py 0.022 OK
kpt.py 0.030 OK
mpicomm.py 0.022 OK
numpy_core_multiarray_dot.py 0.021 OK
maxrss.py 0.000 SKIPPED
fileio/hdf5_noncontiguous.py 0.002 SKIPPED
cg2.py 0.024 OK
laplace.py 0.023 OK
lapack.py 0.023 OK
eigh.py 0.022 OK
parallel/submatrix_redist.py 0.000 SKIPPED
second_derivative.py 0.035 OK
parallel/parallel_eigh.py 0.022 OK
gp2.py 0.023 OK
blas.py 0.164 OK
Gauss.py 0.045 OK
nabla.py 0.140 OK
dot.py 0.030 OK
mmm.py 0.028 OK
lxc_fxc.py 0.030 OK
pbe_pw91.py 0.029 OK
gradient.py 0.033 OK
erf.py 0.028 OK
lf.py 0.033 OK
fsbt.py 0.034 OK
parallel/compare.py 0.031 OK
integral4.py 0.069 OK
zher.py 0.149 OK
gd.py 0.032 OK
pw/interpol.py 0.025 OK
screened_poisson.py 0.461 OK
xc.py 0.064 OK
XC2.py 2.548 OK
yukawa_radial.py 0.024 OK
dump_chi0.py 0.045 OK
vdw/potential.py 0.026 OK
lebedev.py 0.053 OK
fileio/hdf5_simple.py 0.002 SKIPPED
occupations.py 0.080 OK
derivatives.py 0.034 OK
parallel/realspace_blacs.py 0.027 OK
pw/reallfc.py                            0.357  OK
[ip03:03367] 3 more processes have sent help message help-mpi-runtime.txt / mpi_init:warn-fork
[ip03:03367] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
parallel/pblas.py 0.048 OK
non_periodic.py 0.064 OK
spectrum.py 0.019 SKIPPED
pw/lfc.py 0.273 OK
gauss_func.py 1.032 OK
multipoletest.py 0.516 OK
noncollinear/xcgrid3d.py 6.207 OK
cluster.py 0.228 OK
poisson.py 0.095 OK
parallel/overlap.py 2.293 OK
parallel/scalapack.py 0.036 OK
gauss_wave.py 0.650 OK
transformations.py 0.047 OK
parallel/blacsdist.py 0.033 OK
ut_rsh.py 2.098 OK
pbc.py 0.822 OK
noncollinear/xccorr.py 0.587 OK
atoms_too_close.py 1.043 OK
harmonic.py 40.344 OK
proton.py 5.189 OK
atoms_mismatch.py 0.051 OK
timing.py 0.935 OK
parallel/ut_parallel.py 1.098 OK
ut_csh.py                                Test failed. Check ut_csh.log for details.
Test failed. Check ut_csh.log for details.
Test failed. Check ut_csh.log for details.
Test failed. Check ut_csh.log for details.
--------------------------------------------------------------------------
mpirun has exited due to process rank 2 with PID 3376 on
node ip03 exiting without calling "finalize". This may
have caused other processes in the application to be
terminated by signals sent by mpirun (as reported here).
--------------------------------------------------------------------------
Numpy was compiled on the cluster; I did not build it myself.
For
ldd `which gpaw-python`
I got
linux-vdso.so.1 => (0x00007fff2edff000)
libgfortran.so.3 => /home/chenjing/programs/MatlabR2011A/sys/os/glnxa64/libgfortran.so.3 (0x00002b41947c7000)
libxc.so.1 => /home/chenjing/Installation/libxc/lib/libxc.so.1 (0x00002b4194a9f000)
libpython2.6.so.1.0 => /usr/lib64/libpython2.6.so.1.0 (0x000000343ee00000)
libpthread.so.0 => /lib64/libpthread.so.0 (0x000000343e600000)
libdl.so.2 => /lib64/libdl.so.2 (0x000000343e200000)
libutil.so.1 => /lib64/libutil.so.1 (0x0000003440e00000)
libm.so.6 => /lib64/libm.so.6 (0x000000343ea00000)
libmpi.so.0 => /home/chenjing/openmpi-1.4.5/lib/libmpi.so.0 (0x00002b4194d52000)
libopen-rte.so.0 => /home/chenjing/openmpi-1.4.5/lib/libopen-rte.so.0 (0x00002b4195165000)
libopen-pal.so.0 => /home/chenjing/openmpi-1.4.5/lib/libopen-pal.so.0 (0x00002b41953ed000)
librdmacm.so.1 => /usr/lib64/librdmacm.so.1 (0x000000343f600000)
libibverbs.so.1 => /usr/lib64/libibverbs.so.1 (0x000000343f200000)
libtorque.so.2 => /opt/torque/lib/libtorque.so.2 (0x00002b419564f000)
libnuma.so.1 => /usr/lib64/libnuma.so.1 (0x00002b4195953000)
libnsl.so.1 => /lib64/libnsl.so.1 (0x00002b4195b5c000)
libc.so.6 => /lib64/libc.so.6 (0x000000343de00000)
libgcc_s.so.1 => /home/chenjing/programs/MatlabR2011A/sys/os/glnxa64/libgcc_s.so.1 (0x00002b4195d76000)
/lib64/ld-linux-x86-64.so.2 (0x000000343da00000)
libnl.so.1 => /lib64/libnl.so.1 (0x00002b4195f8c000)
For
python -c "import numpy; print numpy.__config__.show(); print numpy.__version__"
I got
atlas_threads_info:
libraries = ['lapack', 'ptf77blas', 'ptcblas', 'atlas']
library_dirs = ['/usr/lib64/atlas']
language = f77
include_dirs = ['/usr/include']
blas_opt_info:
libraries = ['ptf77blas', 'ptcblas', 'atlas']
library_dirs = ['/usr/lib64/atlas']
define_macros = [('ATLAS_INFO', '"\\"3.8.4\\""')]
language = c
include_dirs = ['/usr/include']
atlas_blas_threads_info:
libraries = ['ptf77blas', 'ptcblas', 'atlas']
library_dirs = ['/usr/lib64/atlas']
language = c
include_dirs = ['/usr/include']
lapack_opt_info:
libraries = ['lapack', 'ptf77blas', 'ptcblas', 'atlas']
library_dirs = ['/usr/lib64/atlas']
define_macros = [('ATLAS_INFO', '"\\"3.8.4\\""')]
language = f77
include_dirs = ['/usr/include']
lapack_mkl_info:
NOT AVAILABLE
blas_mkl_info:
NOT AVAILABLE
mkl_info:
NOT AVAILABLE
None
1.4.1
and for
ldd `python -c "from numpy.core import _dotblas; print _dotblas.__file__"`
I got
linux-vdso.so.1 => (0x00007fff2afff000)
libptf77blas.so.3 => /usr/lib64/atlas/libptf77blas.so.3 (0x00002b104af64000)
libptcblas.so.3 => /usr/lib64/atlas/libptcblas.so.3 (0x00002b104b184000)
libatlas.so.3 => /usr/lib64/atlas/libatlas.so.3 (0x00002b104b3a4000)
libpython2.6.so.1.0 => /usr/lib64/libpython2.6.so.1.0 (0x00002b104ba00000)
libpthread.so.0 => /lib64/libpthread.so.0 (0x00002b104bda7000)
libc.so.6 => /lib64/libc.so.6 (0x00002b104bfc4000)
libgfortran.so.3 => /home/chenjing/programs/MatlabR2011A/sys/os/glnxa64/libgfortran.so.3 (0x00002b104c358000)
libm.so.6 => /lib64/libm.so.6 (0x00002b104c631000)
libdl.so.2 => /lib64/libdl.so.2 (0x00002b104c8b5000)
libutil.so.1 => /lib64/libutil.so.1 (0x00002b104cab9000)
/lib64/ld-linux-x86-64.so.2 (0x000000343da00000)
Best.
Jingzhe
On 2015-02-08 18:33, Marcin Dulak wrote:
> On 02/08/2015 01:42 AM, jingzhe Chen wrote:
>> Hi all,
>> On one cluster this error appears again and again, where gpaw is
>> compiled with blas/lapack; even the forces are not the same after
>> broadcasting. It disappears when I try the same script on another
>> cluster (also blas/lapack).
> does the full gpaw-test pass in parallel?
> How was numpy compiled on those clusters?
> To have the full information, provide:
> ldd `which gpaw-python`
> python -c "import numpy; print numpy.__config__.show(); print numpy.__version__"
> In addition, check the libraries linked to numpy's _dotblas.so
> (_dotblas.so is most often the source of problems) with:
> ldd `python -c "from numpy.core import _dotblas; print _dotblas.__file__"`
>
> Best regards,
>
> Marcin
>>
>> Best.
>> Jingzhe
>>
>> On Fri, Feb 6, 2015 at 12:41 PM, jingzhe <jingzhe.chen at gmail.com
>> <mailto:jingzhe.chen at gmail.com>> wrote:
>>
>> Dear all,
>>
>> I ran again in debug mode. The atomic positions I got on
>> different ranks can differ on the order of 0.01 A, and even the
>> forces on different ranks can differ on the order of 1 eV/A, while
>> each time only one rank behaves oddly. I have now exchanged the two
>> lines (broadcast and symmetric correction) in the force calculator
>> to see what will happen.
>>
>> Best.
>>
>> Jingzhe
>>
>>
>> On 2015-02-05 15:53, Jens Jørgen Mortensen wrote:
>>
>> On 02/04/2015 05:12 PM, Ask Hjorth Larsen wrote:
>>
>> I committed something in r12401 which should make the
>> check more
>> reliable. It does not use hashing because the atoms
>> object is sent
>> anyway.
>>
>>
>> Thanks a lot for fixing this! Should there also be some
>> tolerance for the unit cell?
>>
>> Jens Jørgen
>>
>> Best regards
>> Ask
>>
>> 2015-02-04 14:47 GMT+01:00 Ask Hjorth Larsen
>> <asklarsen at gmail.com <mailto:asklarsen at gmail.com>>:
>>
>> Well, to clarify a bit.
>>
>> The hashing is useful if we don't want to send stuff
>> around.
>>
>> If we are actually sending the positions now (by
>> broadcast; I am only
>> strictly aware that the forces are broadcast), then
>> each core can
>> compare locally without the need for hashing, to see
>> if it wants to
>> raise an error. (Raising errors on some cores but
>> not all is
>> sometimes annoying though.)
>>
>> Best regards
>> Ask
>>
>> 2015-02-04 12:57 GMT+01:00 Ask Hjorth Larsen
>> <asklarsen at gmail.com <mailto:asklarsen at gmail.com>>:
>>
>> Hello
>>
>> 2015-02-04 10:21 GMT+01:00 Torsten Hahn
>> <torstenhahn at fastmail.fm
>> <mailto:torstenhahn at fastmail.fm>>:
>>
>> Probably we could do this but my feeling is,
>> that this would only cure the symptoms not
>> the real origin of this annoying bug.
>>
>>
>> In fact there is code in
>>
>> mpi/__init__.py
>>
>> that says:
>>
>> # Construct fingerprint:
>> # ASE may return slightly different atomic
>> positions (e.g. due
>> # to MKL) so compare only first 8 decimals of
>> positions
>>
>>
>> The code says that only the first 8 decimals of the
>> positions are used to generate the atomic
>> "fingerprints". This code relies on numpy
>> and therefore on lapack/blas functions. However,
>> I have no idea what that md5_array etc. stuff
>> really does. But there is some debug code
>> which should at least tell you which atom(s)
>> cause the problems.
>>
>> md5_array calculates the md5 sum of the data of
>> an array. It is a
>> kind of checksum.
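[Editor's note: a minimal sketch of what such a checksum helper does; this is an illustration, not GPAW's actual md5_array implementation.]

```python
import hashlib

import numpy as np

def md5_array(a):
    # Digest of the raw bytes of the array: any bit-level difference
    # in any element yields a completely different checksum, so there
    # is no notion of "close enough".
    return hashlib.md5(np.ascontiguousarray(a).tobytes()).hexdigest()

pos = np.arange(6, dtype=float).reshape(2, 3)
same = md5_array(pos) == md5_array(pos.copy())   # identical bytes match
perturbed = pos.copy()
perturbed[0, 0] += 1e-15                         # a tiny bit-level change
differs = md5_array(pos) != md5_array(perturbed)
```

This is why exact checksums are a brittle way to compare positions across ranks: a single-bit difference changes the digest entirely.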
>>
>> Rounding unfortunately does not solve the
>> problem. For any epsilon, however small, there
>> exist numbers that differ by epsilon but round
>> to different values. So the check will not work
>> the way it is
>> implemented at the moment: Positions that are
>> "close enough" can
>> currently generate an error. In other words if
>> you get this error,
>> maybe there was no problem at all. Given the
>> vast thousands of DFT
>> calculations that are done, this may not be so
>> unlikely.
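[Editor's note: the boundary problem described above can be seen directly in Python; the pair of numbers below is hand-picked near an 8-decimal rounding boundary, purely for illustration.]

```python
# Two "positions" that differ by ~2e-14 -- far below any physically
# meaningful tolerance -- yet sit on opposite sides of an 8-decimal
# rounding boundary, so their rounded fingerprints disagree.
a = 0.12345678499999
b = 0.12345678500001

close = abs(a - b) < 1e-13            # equal by any reasonable tolerance
mismatch = round(a, 8) != round(b, 8) # but the 8-decimal check fails
```

So two ranks whose positions agree to 13 decimals can still produce different rounded fingerprints and trigger a spurious error.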
>>
>> However, that error is *very* strange because
>> mpi.broadcast(...) should result in *exactly*
>> the same objects on all cores. No idea why
>> there should be any difference at all and
>> what was the intention behind the fancy
>> fingerprint-generation stuff in the
>> compare_atoms(atoms, comm=world) method.
>>
>> The check was introduced because there were
>> (infrequent) situations
>> where different cores had different positions,
>> due e.g. to the finicky
>> numerics elsewhere discussed. Later, I guess we
>> have accepted the
>> numerical issues and relaxed the check so it is
>> no longer exact,
>> preferring instead to broadcast. Evidently
>> something else is
>> happening aside from the broadcast, which allows
>> things to go wrong.
>> Perhaps the error in the rounding scheme
>> mentioned above.
>>
>> To explain the hashing: We want to check that
>> numbers on two different
>> CPUs are equal. Either we have to send all the
>> numbers, or hash them
>> and send the hash. Hence hashing is much nicer.
>> But maybe it would
>> be better to hash them with a continuous
>> function. For example adding
>> all numbers with different (pseudorandom?)
>> complex phase factors.
>> Then one can compare the complex hashes and see
>> if they are close
>> enough to each other. There are probably better
>> ways.
>>
>> Best regards
>> Ask
>>
>> Best,
>> Torsten.
>>
>> On 04.02.2015 10:00, jingzhe
>> <jingzhe.chen at gmail.com
>> <mailto:jingzhe.chen at gmail.com>> wrote:
>>
>> Hi Torsten,
>>
>> Thanks for the quick reply, but
>> I use gcc and lapack/blas. I mean, if the
>> positions of the atoms are slightly different on
>> different ranks because of compiler/library
>> issues, can we just set a tolerance in
>> check_atoms and skip the error?
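[Editor's note: such a tolerance check might look roughly like this; a sketch only — GPAW's check_atoms/compare_atoms does not work exactly this way, and master_pos stands for positions broadcast from rank 0.]

```python
import numpy as np

def check_positions(local_pos, master_pos, tol=1e-8):
    # Compare the local positions against the broadcast master copy
    # with an absolute tolerance instead of demanding exact equality.
    if not np.allclose(local_pos, master_pos, rtol=0.0, atol=tol):
        raise RuntimeError(
            'Atoms objects on different processors are not identical!')

master = np.zeros((4, 3))
check_positions(master + 1e-10, master)  # numerical noise: accepted
```

Note this only works if each rank actually holds the master copy (e.g. after a broadcast); otherwise a hash must still be exchanged.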
>>
>> Best.
>>
>> Jingzhe
>>
>>
>>
>>
>>
>> On 2015-02-04 14:32, Torsten Hahn wrote:
>>
>> Dear Jingzhe,
>>
>> we often saw this error when using GPAW
>> together with Intel MKL <= 11.x on
>> Intel CPUs. I never tracked down the
>> error because it was gone after a
>> compiler/library upgrade.
>>
>> Best,
>> Torsten.
>>
>>
>> --
>> Dr. Torsten Hahn
>> torstenhahn at fastmail.fm
>> <mailto:torstenhahn at fastmail.fm>
>>
>> On 04.02.2015 07:27,
>> jingzhe Chen
>> <jingzhe.chen at gmail.com
>> <mailto:jingzhe.chen at gmail.com>> wrote:
>>
>> Dear GPAW guys,
>>
>> I used the latest gpaw
>> to run a relaxation job and got the
>> error message below.
>>
>> RuntimeError: Atoms objects
>> on different processors are not
>> identical!
>>
>> I found a line in the
>> force calculator,
>> 'wfs.world.broadcast(self.F_av, 0)',
>> so all the forces on
>> different ranks should be the
>> same, which confuses me; I cannot
>> think of any other reason that could
>> lead to this error.
>>
>> Could anyone take a look
>> at it?
>>
>> I attached the structure
>> file and running script here; I
>> used 24 cores.
>>
>> Thanks in advance.
>>
>> Jingzhe
>>
>> <main.py><model.traj>_______________________________________________
>>
>> gpaw-users mailing list
>> gpaw-users at listserv.fysik.dtu.dk
>> <mailto:gpaw-users at listserv.fysik.dtu.dk>
>> https://listserv.fysik.dtu.dk/mailman/listinfo/gpaw-users
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>