[gpaw-users] Problem with compiling parallel GPAW
Michalsky Ronald
michalskyr at ethz.ch
Fri Jul 25 21:56:08 CEST 2014
Yes, “bse_MoS2_cut.py” hangs; it does not produce an actual error message.
I’m not sure how to properly test the OpenMPI installation itself (apart from the rank check and the small collective sketch further down), but I have tried switching to the OpenMPI build provided by the cluster:
Edits in .bashrc:
module load open_mpi/1.6.2
#export MPIDIR=/cluster/home03/mavt/ronaldm/openmpi-1.8.1.gfortran
export MPIDIR=/cluster/apps/openmpi/1.6.2/x86_64/gcc_4.7.2
#export PATH=/cluster/home03/mavt/ronaldm/openmpi-1.8.1.gfortran/bin:$PATH
export PATH=/cluster/apps/openmpi/1.6.2/x86_64/gcc_4.7.2/bin:$PATH # the bin directory (not the mpicc binary itself) goes on PATH
export LD_LIBRARY_PATH=/cluster/apps/gcc/gcc472/lib64:$LD_LIBRARY_PATH # for libquadmath.so.0
export LD_LIBRARY_PATH=/cluster/apps/openmpi/1.6.2/x86_64/gcc_4.7.2/lib/openmpi/mca_ess_lsf:$LD_LIBRARY_PATH # added because of a warning (which it did not prevent, though)
export LD_LIBRARY_PATH=/cluster/apps/openmpi/1.6.2/x86_64/gcc_4.7.2/lib/openmpi/mca_plm_lsf:$LD_LIBRARY_PATH # added because of a warning (which it did not prevent, though)
export LD_LIBRARY_PATH=/cluster/apps/openmpi/1.6.2/x86_64/gcc_4.7.2/lib/openmpi/mca_ras_lsf:$LD_LIBRARY_PATH # added because of a warning (which it did not prevent, though)
Edits in .bash_profile:
module load python/2.7.2 netcdf/4.1.3 #######
#module load python/2.7.2 netcdf/4.3.0 #######
module load gcc/4.7.2
module load open_mpi/1.6.2
export NETCDF=/cluster/apps/netcdf/4.1.3/x86_64/serial/gcc_4.7.2/lib #######
#export NETCDF=/cluster/apps/netcdf/4.3.0/x86_64/gcc_4.7.2/openmpi_1.6.2/lib #######
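In case the build configuration matters: for the parallel interpreter I just point customize.py at the module’s mpicc, roughly along these lines (a sketch, not my exact file; library paths omitted):
    # customize.py (sketch) -- MPI-related part only
    compiler = 'gcc'
    mpicompiler = 'mpicc'   # the mpicc from the open_mpi/1.6.2 module
    mpilinker = 'mpicc'
    scalapack = False       # no ScaLAPACK in this build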
>>> Still: gpaw-python `which gpaw-test` 2>&1 | tee test.log
>>> hangs at “bse_MoS2_cut.py”:
[brutus3.ethz.ch:35995] mca: base: component_find: unable to open /cluster/apps/openmpi/1.6.2/x86_64/gcc_4.7.2/lib/openmpi/mca_ess_lsf: libbat.so: cannot open shared object file: No such file or directory (ignored)
[brutus3.ethz.ch:36084] mca: base: component_find: unable to open /cluster/apps/openmpi/1.6.2/x86_64/gcc_4.7.2/lib/openmpi/mca_ess_lsf: libbat.so: cannot open shared object file: No such file or directory (ignored)
[brutus3.ethz.ch:36084] mca: base: component_find: unable to open /cluster/apps/openmpi/1.6.2/x86_64/gcc_4.7.2/lib/openmpi/mca_plm_lsf: libbat.so: cannot open shared object file: No such file or directory (ignored)
[brutus3.ethz.ch:36084] mca: base: component_find: unable to open /cluster/apps/openmpi/1.6.2/x86_64/gcc_4.7.2/lib/openmpi/mca_ras_lsf: libbat.so: cannot open shared object file: No such file or directory (ignored)
--------------------------------------------------------------------------
The OpenFabrics (openib) BTL failed to initialize while trying to
allocate some locked memory. This typically can indicate that the
memlock limits are set too low. For most HPC installations, the
memlock limits should be set to "unlimited". The failure occured
here:
Local host: brutus3.ethz.ch
OMPI source: btl_openib.c:190
Function: ibv_create_cq()
Device: mlx4_0
Memlock limit: 2097152
You may need to consult with your system administrator to get this
problem fixed. This FAQ entry on the Open MPI web site may also be
helpful:
http://www.open-mpi.org/faq/?category=openfabrics#ib-locked-pages
--------------------------------------------------------------------------
--------------------------------------------------------------------------
An MPI process has executed an operation involving a call to the
"fork()" system call to create a child process. Open MPI is currently
operating in a condition that could result in memory corruption or
other system errors; your MPI job may hang, crash, or produce silent
data corruption. The use of fork() (or system() or other calls that
create child processes) is strongly discouraged.
The process that invoked fork was:
Local host: brutus3.ethz.ch (PID 35995)
MPI_COMM_WORLD rank: 0
If you are *absolutely sure* that your application will successfully
and correctly survive a call to fork(), you may disable this warning
by setting the mpi_warn_on_fork MCA parameter to 0.
--------------------------------------------------------------------------
python 2.7.2 GCC 4.4.6 20110731 (Red Hat 4.4.6-3) 64bit ELF on Linux x86_64 centos 6.5 Final
Running tests in /tmp/gpaw-test-Bh4RnJ
Jobs: 1, Cores: 1, debug-mode: False
=============================================================================
gemm_complex.py 0.018 OK
[…]
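Two side notes on the warnings at the top of that output. The memlock limit can be checked from inside the job environment, e.g. with ulimit -l or with a one-off script like this sketch (-1 means unlimited):
    # check_memlock.py -- print the current memlock limits (sketch)
    import resource
    soft, hard = resource.getrlimit(resource.RLIMIT_MEMLOCK)
    print 'RLIMIT_MEMLOCK soft/hard:', soft, hard   # -1 means unlimited
The fork() warning could, I suppose, be silenced by exporting OMPI_MCA_mpi_warn_on_fork=0 in .bashrc, but I have left it on for now.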
>>> and: mpirun -np 2 gpaw-python -c "import gpaw.mpi as mpi; print mpi.rank"
>>> yields:
[brutus3.ethz.ch:35216] mca: base: component_find: unable to open /cluster/apps/openmpi/1.6.2/x86_64/gcc_4.7.2/lib/openmpi/mca_ess_lsf: libbat.so: cannot open shared object file: No such file or directory (ignored)
[brutus3.ethz.ch:35216] mca: base: component_find: unable to open /cluster/apps/openmpi/1.6.2/x86_64/gcc_4.7.2/lib/openmpi/mca_plm_lsf: libbat.so: cannot open shared object file: No such file or directory (ignored)
[brutus3.ethz.ch:35216] mca: base: component_find: unable to open /cluster/apps/openmpi/1.6.2/x86_64/gcc_4.7.2/lib/openmpi/mca_ras_lsf: libbat.so: cannot open shared object file: No such file or directory (ignored)
--------------------------------------------------------------------------
The OpenFabrics (openib) BTL failed to initialize while trying to
allocate some locked memory. This typically can indicate that the
memlock limits are set too low. For most HPC installations, the
memlock limits should be set to "unlimited". The failure occured
here:
Local host: brutus3.ethz.ch
OMPI source: btl_openib.c:190
Function: ibv_create_cq()
Device: mlx4_0
Memlock limit: 2097152
You may need to consult with your system administrator to get this
problem fixed. This FAQ entry on the Open MPI web site may also be
helpful:
http://www.open-mpi.org/faq/?category=openfabrics#ib-locked-pages
--------------------------------------------------------------------------
0
1
[brutus3.ethz.ch:35216] 1 more process has sent help message help-mpi-btl-openib.txt / init-fail-no-mem
[brutus3.ethz.ch:35216] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
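So both ranks do come up. As a slightly stronger check of the MPI communication than just printing ranks, I could also run something like this through gpaw-python (a sketch; on 4 processes I would expect “size 4 sum 4.0”):
    # check_mpi.py -- trivial collective test (sketch)
    import numpy as np
    from gpaw.mpi import world
    a = np.ones(1)
    world.sum(a)                    # in-place sum over all ranks
    if world.rank == 0:
        print 'size', world.size, 'sum', a[0]
i.e. mpirun -np 4 gpaw-python check_mpi.py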
>>> while: mpiexec -np 4 gpaw-python `which gpaw-test` 2>&1 | tee test4.log
>>> yields:
[brutus2.ethz.ch:48094] mca: base: component_find: unable to open /cluster/apps/openmpi/1.6.2/x86_64/gcc_4.7.2/lib/openmpi/mca_ess_lsf: libbat.so: cannot open shared object file: No such file or directory (ignored)
[brutus2.ethz.ch:48094] mca: base: component_find: unable to open /cluster/apps/openmpi/1.6.2/x86_64/gcc_4.7.2/lib/openmpi/mca_plm_lsf: libbat.so: cannot open shared object file: No such file or directory (ignored)
[brutus2.ethz.ch:48094] mca: base: component_find: unable to open /cluster/apps/openmpi/1.6.2/x86_64/gcc_4.7.2/lib/openmpi/mca_ras_lsf: libbat.so: cannot open shared object file: No such file or directory (ignored)
--------------------------------------------------------------------------
The OpenFabrics (openib) BTL failed to initialize while trying to
allocate some locked memory. This typically can indicate that the
memlock limits are set too low. For most HPC installations, the
memlock limits should be set to "unlimited". The failure occured
here:
Local host: brutus2.ethz.ch
OMPI source: btl_openib.c:190
Function: ibv_create_cq()
Device: mlx4_0
Memlock limit: 2097152
You may need to consult with your system administrator to get this
problem fixed. This FAQ entry on the Open MPI web site may also be
helpful:
http://www.open-mpi.org/faq/?category=openfabrics#ib-locked-pages
--------------------------------------------------------------------------
--------------------------------------------------------------------------
An MPI process has executed an operation involving a call to the
"fork()" system call to create a child process. Open MPI is currently
operating in a condition that could result in memory corruption or
other system errors; your MPI job may hang, crash, or produce silent
data corruption. The use of fork() (or system() or other calls that
create child processes) is strongly discouraged.
The process that invoked fork was:
Local host: brutus2.ethz.ch (PID 48100)
MPI_COMM_WORLD rank: 0
If you are *absolutely sure* that your application will successfully
and correctly survive a call to fork(), you may disable this warning
by setting the mpi_warn_on_fork MCA parameter to 0.
--------------------------------------------------------------------------
python 2.7.2 GCC 4.4.6 20110731 (Red Hat 4.4.6-3) 64bit ELF on Linux x86_64 centos 6.5 Final
Running tests in /tmp/gpaw-test-0Z2zMG
Jobs: 1, Cores: 4, debug-mode: False
=============================================================================
gemm_complex.py 0.034 OK
mpicomm.py 0.032 OK
ase3k_version.py 0.029 OK
numpy_core_multiarray_dot.py 0.029 OK
eigh.py 0.044 OK
lapack.py 0.026 OK
dot.py 0.022 OK
lxc_fxc.py 0.029 OK
blas.py 0.032 OK
erf.py 0.020 OK
gp2.py 0.022 OK
kptpar.py [brutus2.ethz.ch:48094] 3 more processes have sent help message help-mpi-btl-openib.txt / init-fail-no-mem
[brutus2.ethz.ch:48094] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
[brutus2.ethz.ch:48094] 3 more processes have sent help message help-mpi-runtime.txt / mpi_init:warn-fork
6.769 OK
non_periodic.py 0.026 OK
parallel/blacsdist.py 0.030 OK
gradient.py 0.038 OK
cg2.py 0.045 OK
kpt.py 0.024 OK
lf.py 0.022 OK
gd.py 0.017 OK
parallel/compare.py 0.242 OK
pbe_pw91.py 0.031 OK
fsbt.py 0.022 OK
derivatives.py 0.035 OK
Gauss.py 0.053 OK
second_derivative.py 0.041 OK
integral4.py 0.046 OK
parallel/ut_parallel.py 1.170 OK
transformations.py 0.046 OK
parallel/parallel_eigh.py 0.036 OK
spectrum.py 0.303 OK
xc.py 0.099 OK
zher.py 0.063 OK
pbc.py 0.061 OK
lebedev.py 0.051 OK
parallel/ut_hsblacs.py 0.346 OK
occupations.py 0.093 OK
dump_chi0.py 0.098 OK
cluster.py 0.271 OK
pw/interpol.py 0.032 OK
poisson.py 0.078 OK
pw/lfc.py 0.291 OK
pw/reallfc.py 0.405 OK
XC2.py 0.105 OK
multipoletest.py 0.406 OK
nabla.py 0.132 OK
noncollinear/xccorr.py 0.600 OK
gauss_wave.py 0.612 OK
harmonic.py 0.214 OK
atoms_too_close.py 0.256 OK
screened_poisson.py 0.206 OK
yukawa_radial.py 0.050 OK
noncollinear/xcgrid3d.py 0.197 OK
vdwradii.py 2.275 OK
lcao_restart.py 0.500 OK
ase3k.py 0.841 OK
parallel/ut_kptops.py 2.128 OK
fileio/idiotproof_setup.py 0.682 OK
fileio/hdf5_simple.py 0.037 SKIPPED
fileio/hdf5_noncontiguous.py 0.032 SKIPPED
fileio/parallel.py 6.487 OK
timing.py 1.019 OK
coulomb.py 0.572 OK
xcatom.py 1.585 OK
proton.py 0.492 OK
keep_htpsit.py 2.049 OK
pw/stresstest.py 2.313 OK
aeatom.py 4.282 OK
numpy_zdotc_graphite.py 1.822 OK
lcao_density.py 0.845 OK
parallel/overlap.py 0.978 OK
restart.py 1.973 OK
gemv.py 3.267 OK
ylexpand.py 1.659 OK
wfs_io.py 1.775 OK
fixocc.py 4.275 OK
nonselfconsistentLDA.py 2.343 OK
gga_atom.py 3.069 OK
ds_beta.py 2.794 OK
gauss_func.py 0.894 OK
noncollinear/h.py 1.997 OK
symmetry.py 3.726 OK
usesymm.py 2.058 OK
broydenmixer.py 4.086 OK
mixer.py 4.097 OK
wfs_auto.py 1.659 OK
ewald.py 5.067 OK
refine.py 1.503 OK
revPBE.py 2.956 OK
nonselfconsistent.py 3.267 OK
hydrogen.py 2.043 OK
fileio/file_reference.py 4.062 OK
fixdensity.py 3.996 OK
bee1.py 2.605 OK
spinFe3plus.py 5.160 OK
pw/h.py 6.014 OK
stdout.py 3.986 OK
parallel/lcao_complicated.py 10.440 OK
pw/slab.py 9.166 OK
spinpol.py 4.213 OK
plt.py 3.010 OK
eed.py 2.512 OK
lrtddft2.py 1.868 OK
parallel/hamiltonian.py 2.207 OK
ah.py 3.870 OK
laplace.py 0.034 OK
pw/mgo_hybrids.py 12.123 OK
lcao_largecellforce.py 3.160 OK
restart2.py 3.530 OK
Cl_minus.py 7.130 OK
fileio/restart_density.py 10.020 OK
external_potential.py 2.266 OK
pw/bulk.py 8.433 OK
pw/fftmixer.py 1.472 OK
mgga_restart.py 3.698 OK
vdw/quick.py 11.911 OK
partitioning.py 8.147 OK
bulk.py 12.549 OK
elf.py 6.431 OK
aluminum_EELS.py 5.776 OK
H_force.py 5.329 OK
parallel/lcao_hamiltonian.py 8.331 OK
fermisplit.py 5.849 OK
parallel/ut_redist.py 15.330 OK
lcao_h2o.py 2.746 OK
cmrtest/cmr_test2.py 5.257 OK
h2o_xas.py 5.350 OK
ne_gllb.py 10.771 OK
exx_acdf.py 6.887 OK
ut_rsh.py 2.145 OK
ut_csh.py 2.205 OK
spin_contamination.py 6.899 OK
davidson.py 9.190 OK
cg.py 7.258 OK
gllbatomic.py 22.057 OK
lcao_force.py 12.172 OK
fermilevel.py 12.780 OK
h2o_xas_recursion.py 7.836 OK
excited_state.py /cluster/home03/mavt/ronaldm/gpaw/gpaw/lrtddft/excited_state.py:176: RuntimeWarning: divide by zero encountered in remainder
if (i % ncalcs) == icalc:
/cluster/home03/mavt/ronaldm/gpaw/gpaw/lrtddft/excited_state.py:176: RuntimeWarning: divide by zero encountered in remainder
if (i % ncalcs) == icalc:
/cluster/home03/mavt/ronaldm/gpaw/gpaw/lrtddft/excited_state.py:176: RuntimeWarning: divide by zero encountered in remainder
if (i % ncalcs) == icalc:
/cluster/home03/mavt/ronaldm/gpaw/gpaw/lrtddft/excited_state.py:176: RuntimeWarning: divide by zero encountered in remainder
if (i % ncalcs) == icalc:
2.423 FAILED! (rank 0,1,2,3)
#############################################################################
RANK 0,1,2,3:
Traceback (most recent call last):
File "/cluster/home03/mavt/ronaldm/gpaw/gpaw/test/__init__.py", line 514, in run_one
execfile(filename, loc)
File "/cluster/home03/mavt/ronaldm/gpaw/gpaw/test/excited_state.py", line 39, in <module>
forces = exst.get_forces(H2)
File "/cluster/home03/mavt/ronaldm/gpaw/gpaw/lrtddft/excited_state.py", line 176, in get_forces
if (i % ncalcs) == icalc:
ValueError: The truth value of an array with more than one element is ambiguous. Use a.any() or a.all()
#############################################################################
gemm.py 18.840 OK
rpa_energy_Ni.py 15.612 OK
LDA_unstable.py 20.611 OK
si.py 7.939 OK
blocked_rmm_diis.py 5.108 OK
lxc_xcatom.py 22.004 OK
gw_planewave.py 9.671 OK
degeneracy.py 9.305 OK
apmb.py 9.745 OK
vdw/potential.py 0.041 OK
al_chain.py 13.172 OK
relax.py 12.406 OK
fixmom.py 10.716 OK
CH4.py 12.639 OK
diamond_absorption.py 12.034 OK
simple_stm.py 22.820 OK
gw_method.py 18.060 OK
lcao_bulk.py 14.978 OK
constant_electric_field.py 15.148 OK
parallel/ut_invops.py 14.622 OK
parallel/lcao_projections.py 7.506 OK
guc_force.py 24.147 OK
test_ibzqpt.py 16.916 OK
aedensity.py 16.705 OK
fd2lcao_restart.py 23.504 OK
lcao_bsse.py 9.762 OK
pplda.py 33.445 OK
revPBE_Li.py 35.922 OK
si_primitive.py 17.008 OK
complex.py 10.706 OK
Hubbard_U.py 30.096 OK
ldos.py 38.696 OK
parallel/ut_hsops.py 58.648 OK
pw/hyb.py 35.228 OK
hgh_h2o.py 13.953 OK
vdw/quick_spin.py 52.983 OK
scfsic_h2.py 16.912 OK
lrtddft.py 31.377 OK
dscf_lcao.py 24.880 OK
IP_oxygen.py 36.677 OK
Al2_lrtddft.py 22.078 OK
rpa_energy_Si.py 32.897 OK
2Al.py 29.983 OK
jstm.py 19.824 OK
tpss.py 42.180 OK
be_nltd_ip.py 17.267 OK
si_xas.py 30.602 OK
atomize.py 37.679 OK
chi0.py 162.184 OK
Cu.py 48.385 OK
restart_band_structure.py 41.443 OK
ne_disc.py 36.857 OK
exx_coarse.py 32.540 OK
exx_unocc.py 8.875 OK
Hubbard_U_Zn.py 41.355 OK
diamond_gllb.py 73.924 OK
h2o_dks.py 70.194 OK
aluminum_EELS_lcao.py 26.350 OK
gw_ppa.py 21.388 OK
gw_static.py 10.340 OK
exx.py 30.104 OK
pygga.py 89.949 OK
dipole.py 39.430 OK
nsc_MGGA.py 48.030 OK
mgga_sc.py 39.634 OK
MgO_exx_fd_vs_pw.py 79.017 OK
lb94.py 170.687 OK
8Si.py 42.790 OK
td_na2.py 33.487 OK
ehrenfest_nacl.py 19.010 OK
rpa_energy_N2.py 149.873 OK
beefvdw.py 114.234 OK
nonlocalset.py 108.888 OK
wannierk.py 68.999 OK
rpa_energy_Na.py 79.209 OK
pw/si_stress.py 139.633 OK
ut_tddft.py 69.103 OK
transport.py 163.823 OK
vdw/ar2.py 120.665 OK
aluminum_testcell.py 162.975 OK
au02_absorption.py 231.149 OK
lrtddft3.py 212.349 OK
scfsic_n2.py 146.630 OK
parallel/lcao_parallel.py 8.294 OK
parallel/fd_parallel.py 6.368 OK
bse_aluminum.py 16.704 OK
bse_diamond.py 47.535 OK
bse_vs_lrtddft.py 116.312 OK
parallel/pblas.py 0.051 OK
parallel/scalapack.py 0.041 OK
parallel/scalapack_diag_simple.py 0.055 OK
parallel/scalapack_mpirecv_crash.py 11.156 OK
parallel/realspace_blacs.py 0.050 OK
AA_exx_enthalpy.py 303.890 OK
cmrtest/cmr_test.py 0.383 SKIPPED
cmrtest/cmr_test3.py 0.025 SKIPPED
cmrtest/cmr_test4.py 0.031 SKIPPED
cmrtest/cmr_append.py 0.016 SKIPPED
cmrtest/Li2_atomize.py 0.017 SKIPPED
=============================================================================
Ran 230 tests out of 237 in 4885.5 seconds
Tests skipped: 7
Tests failed: 1
=============================================================================
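A note on the one failure above: the excited_state.py failure looks like a plain NumPy issue in excited_state.py (or in what the test passes to it) rather than an MPI problem. Both the divide-by-zero warning and the ambiguous-truth-value error can be reproduced when ncalcs ends up as, for example, an array containing a zero instead of a positive integer. A minimal illustration (values made up):
    import numpy as np
    i, icalc = 0, 0
    ncalcs = np.array([0, 4])        # hypothetical; a positive scalar is presumably what the code expects
    print (i % ncalcs) == icalc      # RuntimeWarning: divide by zero encountered in remainder
    if (i % ncalcs) == icalc:        # ValueError: truth value of an array ... is ambiguous
        pass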
>>> Trying additionally, in .bashrc & .bash_profile:
#module load python/2.7.2 netcdf/4.1.3 #######
module load python/2.7.2 netcdf/4.3.0 #######
#export NETCDF=/cluster/apps/netcdf/4.1.3/x86_64/serial/gcc_4.7.2/lib #######
export NETCDF=/cluster/apps/netcdf/4.3.0/x86_64/gcc_4.7.2/openmpi_1.6.2/lib #######
Serial: the run again hangs at “bse_MoS2_cut.py”.
Parallel: the same error messages as above.
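If it helps with debugging the hang, I can also try running that one test on its own, e.g. mpirun -np 1 gpaw-python /cluster/home03/mavt/ronaldm/gpaw/gpaw/test/bse_MoS2_cut.py (assuming the test runs as a standalone script), and report where it gets stuck. Any other suggestions would be very welcome.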