[gpaw-users] Trouble Compiling GPAW
Ask Hjorth Larsen
asklarsen at gmail.com
Tue Mar 22 19:19:57 CET 2016
Hi
2016-03-22 18:46 GMT+01:00 Tomlinson, Warren (CDR) <wwtomlin at nps.edu>:
> Ask-
> Thanks again for all the help. Sorry about the confusion and getting the terminology mixed up. I tried runs with changes to the density convergence as well as nbands. I’ll keep experimenting and I’ll get back to you with how it goes.
>
> With regard to the "Illegal Instruction" I’m afraid I’ve confused things again. Let me try to explain. I’m working on a genetic algorithm (GA) approach to find the global minimum of various Pd clusters inside UiO-67. This takes a fair amount of CPU time, which I have, but that time is spread out over four different systems. I have installed GPAW 0.11 on two of the systems (both work great, with no problems). I have installed GPAW 1.0 on the third system (works great, no problems). The fourth system is giving me trouble. The third system is a Cray XC30 cluster called “Lightning.” The fourth is a Cray XE6 cluster called “Garnet.” I compiled GPAW 1.0 on both systems within the GNU environment in virtually the same way. The only difference is that I was unable to get HDF5 to compile on Garnet, so I left it out. Also, the gcc version for Lightning was 5.1 and for Garnet was 4.9. For both I used the Cray libsci math libraries for BLAS, LAPACK and ScaLAPACK. Both are running Python 2.7.9. Both compilations went fine with just a few warnings for unused variables.
>
> I ran the tests on Lightning and all tests passed, and Lightning is now churning out GA candidate relaxations for Pd clusters no problem. For Garnet, however, I got “Illegal Instruction” when the test series got to fd_ops/laplace.py. I took a closer look and the Illegal Instruction comes when the apply method of the FDOperator class in fd_operators.py calls the line: self.operator.apply(in_xg, out_xg, phase_cd), where self.operator was assigned the following way:
> self.operator = _gpaw.Operator(coef_p, offset_p, n_c, mp,
>                                neighbor_cd, dtype == float,
>                                comm, cod)
>
> Just to be clear, nothing runs on Garnet right now. I tried to do a simple relaxation to see if it might work, but it doesn’t. The run fails with only 11 lines written to the log file, and the error returned from PBS was “Illegal Instruction.” So, there is definitely something wrong, but I have no idea what it is. I suspect there could be a configuration problem with the way Garnet is set up (it is not identical to Lightning), but I can’t figure out what the system doesn’t like. Do you have any insight as to what GPAW is trying to do at the time the Illegal Instruction occurs? Could I have a problem with the math libraries or C libraries, or could it be something else? If you can give me an idea of how the GPAW code is interacting with the cluster, I can bring this information to the help folks that support Garnet and we can try to get an update or patch for their system (I’m pretty sure there’s nothing wrong with the GPAW code).
> Thanks again for all your help
> Warren
My guess is that the compilers and libraries are somehow incompatible -
something very system-dependent. Was the same compiler used for all the
libraries? I will be hard pressed to help with that, though.
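One thing worth checking on a Cray: "Illegal Instruction" usually means the executable contains machine instructions the compute-node CPU does not support, e.g. code targeted at an XC30's Ivy Bridge nodes (via a craype-ivybridge-style module) running on an XE6's older AMD nodes. Comparing `module list | grep craype-` on both machines, and checking which instruction-set extensions the failing nodes actually report, can narrow this down. A hedged diagnostic sketch of the latter (plain Python, Linux-only):

```python
# Diagnostic sketch: list the instruction-set extension flags the CPU
# reports, so you can see whether e.g. AVX code could even run on it.
def cpu_flags(path='/proc/cpuinfo'):
    """Return the set of CPU feature flags, or an empty set if unavailable."""
    try:
        with open(path) as f:
            for line in f:
                if line.startswith('flags'):
                    return set(line.split(':', 1)[1].split())
    except OSError:
        pass
    return set()

flags = cpu_flags()
print('AVX supported' if 'avx' in flags else 'no AVX (or flags unavailable)')
```

If the build target requests extensions the compute nodes do not report, rebuilding GPAW with the matching craype-* target module loaded is the usual fix; the Garnet support staff can say which module that is.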
Best regards
Ask
>
>
>
>
>> On Mar 21, 2016, at 2:59 PM, Ask Hjorth Larsen <asklarsen at gmail.com> wrote:
>>
>> Hello
>>
>> (Please reply to list)
>>
>> 2016-03-21 21:04 GMT+01:00 Tomlinson, Warren (CDR) <wwtomlin at nps.edu>:
>>> Ask-
>>> Thanks for the help. Nothing has worked so far. I’ve tried adjusting the
>>> mixer as you suggested, as well as adjusting the mixer to (0.01, 3,
>>> weight=50), but still no convergence. I’ve also tried with no mixer, but
>>> the SCF cycles just keep going. I’ve attached one of my log files (only
>>> first 60 SCF steps… sending all 600+ made the file too big). I also took a look
>>> at the forces and found that convergence is definitely close, but the system
>>> never quite gets there. I’ve attached a file with this data as well. This
>>> first SCF step has 25 atoms out of limits, but the second has 503. For step
>>> three we’re back down to 15, then back up to 319 for step four. The system
>>> does get down to 1 atom out of limits (step 57), but there is no steady
>>> progression to that point. The number of atoms out of limits swings pretty
>>> wildly during the entire relaxation process. If I had set my limit to 0.1
>>> eV/Ang instead of the default I would have had convergence at step 11, but
>>> there were still pretty wild swings (from 3 to 362 then back to 2 and then
>>> up to 81… etc). Looking at some of the other compounds I’ve relaxed I think
>>> these swings are not unexpected, but for whatever reason this system never
>>> gets to zero. Should I try and write a routine to attach to the calculator
>>> that looks to potentially converge a system if only a handful of atoms are
>>> out of limits (and that they’re not THAT far out of limits)?
>>> Thank you for any help you can give.
>>> Warren
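The routine described above could start from something as simple as counting the atoms whose force magnitude exceeds the limit. A minimal, hypothetical sketch (plain Python; in a real run `forces` would come from `atoms.get_forces()`):

```python
def atoms_out_of_limits(forces, fmax=0.05):
    """Count atoms whose force magnitude exceeds fmax (eV/Ang).

    forces: iterable of (fx, fy, fz) tuples, one per atom.
    """
    def magnitude(f):
        return (f[0] ** 2 + f[1] ** 2 + f[2] ** 2) ** 0.5
    return sum(1 for f in forces if magnitude(f) > fmax)

# Three atoms: only the second is out of limits at fmax=0.05 eV/Ang.
forces = [(0.01, 0.0, 0.0), (0.2, 0.1, 0.0), (0.0, 0.0, 0.04)]
print(atoms_out_of_limits(forces))  # -> 1
```

Attached as an optimizer observer, such a check could stop the run once the count stays at a handful for several consecutive steps, though simply loosening fmax uniformly is usually the cleaner approach.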
>>
>> Oh, it would seem that we were talking about two entirely different
>> things. If the SCF cycle does not converge in N steps, it means (to
>> me) that it cannot converge a self-consistent calculation. In this
>> case, all self-consistent calculations converge, while it is the
>> structure optimization that never does, because something else is
>> wrong.
>>
>> The problem seems to be that the forces are not accurate, and this you
>> can improve by setting convergence={'density': 1e-5} (or try 1e-6 in
>> extreme cases). (There is also a 'force' convergence criterion, but
>> it is more expensive to compute in LCAO mode.)
>>
>> You can consider using another optimizer as well, such as BFGS.
>>
>> You can lower the number of bands to get slightly better performance,
>> and lower memory usage; try nbands='120%'.
>>
>> (The mixer has no effect on this. Set it to whatever works best.)
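Taken together, the suggestions above might look like this in the input script. This is a sketch only, with illustrative values and filenames, assuming the standard GPAW/ASE interfaces (`convergence`, `nbands`, `BFGS`):

```python
from gpaw import GPAW, FermiDirac
from ase.optimize import BFGS

calc = GPAW(mode='lcao',
            basis='szp',
            xc='PBE',
            convergence={'density': 1e-5},  # tighter density => more accurate forces
            nbands='120%',                  # fewer bands: faster, less memory
            occupations=FermiDirac(width=0.1),
            txt='relax.out')                # illustrative filename
cell.set_calculator(calc)                   # 'cell' as in the original script
opt = BFGS(cell)                            # alternative to QuasiNewton
opt.run(fmax=0.05)
```

The `force` convergence criterion mentioned above would go in the same `convergence` dictionary, at the cost of extra work per SCF cycle in LCAO mode.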
>>
>> As for the illegal instruction, I have no idea. Is it nevertheless
>> capable of running that many calculations?
>>
>> Best regards
>> Ask
>>
>>>
>>>
>>>
>>>
>>>
>>> On Mar 18, 2016, at 11:19 AM, Tomlinson, Warren (CDR) <wwtomlin at nps.edu>
>>> wrote:
>>>
>>> Ask-
>>> Thanks for the help. I’ll give it a try and let you know if I have any
>>> problems.
>>> Thanks
>>> Warren
>>>
>>> On Mar 18, 2016, at 3:52 AM, Ask Hjorth Larsen <asklarsen at gmail.com> wrote:
>>>
>>> Hello
>>>
>>> 2016-03-17 19:57 GMT+01:00 Tomlinson, Warren (CDR) <wwtomlin at nps.edu>:
>>>
>>> All-
>>> Please ignore my installation problem from my previous email (below).
>>> The problem has been solved.
>>>
>>> On an unrelated note I have a question about a large system I’m
>>> working with (656 atoms). It is the unit cell for the UiO-67 MOF (periodic
>>> in all three directions). I’m trying to relax the structure and I’ve gotten
>>> to a point where the energy is not changing much at all between SCF steps,
>>> but the relaxation keeps running. The system seemed to settle relatively
>>> quickly to -4163.05 eV (about 200 or so SCF steps), but since then there
>>> have been 1063 SCF steps and the energy went from -4163.026398 eV to
>>> -4163.069115 eV. The lowest energy over those 1063 SCF steps was
>>> -4163.074407. It seems to me that the system is just shifting around a
>>> minimum, but never quite settling there. Do you have any tips for
>>> addressing this?
>>>
>>> Below is my python file:
>>>
>>> from ase import Atoms
>>> from ase.io import read, write
>>> from ase.ga.relax_attaches import VariansBreak
>>> from gpaw import GPAW, Mixer, FermiDirac
>>> from ase.optimize import QuasiNewton
>>> from gpaw.poisson import PoissonSolver
>>> import numpy as np
>>>
>>> cell = read('Full_N_Pd_szp_ptl.pdb')
>>> cell.set_cell([26.52,26.52,26.52])
>>> cell.set_pbc(1)
>>>
>>> calc_LCAO = GPAW(mode='lcao',
>>>                  gpts=(144, 144, 144),
>>>                  xc='PBE',
>>>                  poissonsolver=PoissonSolver(relax='GS', eps=1e-10),
>>>                  parallel={'band': 2, 'sl_default': (24, 24, 64)},
>>>                  basis='szp',
>>>                  mixer=Mixer(0.1, 5, weight=100.0),
>>>
>>>
>>> Try lowering the mixer parameter from 0.1 to 0.05. Larger systems
>>> tend to require more conservative mixing.
>>>
>>> If it does not work, please attach the logfile.
>>>
>>> It looks like you are using a very large number of cores (576). On
>>> normal architectures this should probably run on 24-48 cores, or even
>>> less given that it is only szp. Adding more cores beyond that may
>>> make it faster, but it is very unlikely that it strong-scales well to
>>> 576.
>>>
>>> Best regards
>>> Ask
>>>
>>>                  occupations=FermiDirac(width=0.1),
>>>                  txt='Full_N_Pd_szp.out',
>>>                  maxiter=1000)
>>>
>>> cell.set_calculator(calc_LCAO)
>>> opt = QuasiNewton(cell)
>>> vb = VariansBreak(cell, opt)
>>> opt.attach(vb.write)
>>> opt.run()
>>>
>>>
>>> Thanks,
>>> Warren
>>>
>>>
>>> On Mar 15, 2016, at 12:16 PM, Tomlinson, Warren (CDR) <wwtomlin at nps.edu>
>>> wrote:
>>>
>>> Hello-
>>> I’m having an issue compiling GPAW on a cluster and I’m wondering if someone
>>> can help.
>>>
>>> The system is a Cray XC30 cluster (called Lightning)
>>> I am attempting to compile in the GNU environment (although I get the same
>>> error when trying with Intel)
>>>
>>> Relevant modules:
>>> gcc/5.1.0
>>> cray-mpich/7.2.6
>>> cray-libsci/13.2.0
>>> python/gnu/2.7.9
>>> numpy/gnu/1.9.2
>>> scipy/gnu/0.15.1
>>>
>>>
>>> Relevant excerpts from system user’s manual:
>>> 5.1.1. Message Passing Interface (MPI)
>>>
>>> This release of MPI-2 derives from Argonne National Laboratory MPICH-2 and
>>> implements the MPI-2.2 standard except for spawn support, as documented by
>>> the MPI Forum in "MPI: A Message Passing Interface Standard, Version 2.2."
>>>
>>> The Message Passing Interface (MPI) is part of the software support for
>>> parallel programming across a network of computer systems through a
>>> technique known as message passing. MPI establishes a practical, portable,
>>> efficient, and flexible standard for message passing that makes use of the
>>> most attractive features of a number of existing message-passing systems,
>>> rather than selecting one of them and adopting it as the standard. See "man
>>> intro_mpi" for additional information.
>>>
>>> When creating an MPI program on Lightning, ensure the following:
>>>
>>> • That the default MPI module (cray-mpich) has been loaded. To check this,
>>> run the "module list" command. If cray-mpich is not listed and a different
>>> MPI module is listed, use the following command:
>>> module swap other_mpi_module cray-mpich
>>>
>>> If no MPI module is loaded, load the cray-mpich module.
>>>
>>> module load cray-mpich
>>>
>>> • That the source code includes one of the following lines:
>>> INCLUDE "mpif.h" ## for Fortran, or
>>> #include <mpi.h> ## for C/C++
>>>
>>> To compile an MPI program, use the following examples:
>>>
>>> ftn -o pi_program mpi_program.f
>>> ## for Fortran, or
>>> cc -o mpi_program mpi_program.c ## for C/C++
>>>
>>> ———————————————————————
>>>
>>> 5.1.3. Open Multi-Processing (OpenMP)
>>>
>>> OpenMP is a portable, scalable model that gives programmers a simple and
>>> flexible interface for developing parallel applications. It supports
>>> shared-memory multiprocessing programming in C, C++ and Fortran, and
>>> consists of a set of compiler directives, library routines, and environment
>>> variables that influence compilation and run-time behavior.
>>>
>>> When creating an OpenMP program on Lightning, ensure the following:
>>>
>>> • That the default MPI module (cray-mpich) has been loaded. To check this,
>>> run the "module list" command. If cray-mpich is not listed and a different
>>> MPI module is listed, use the following command:
>>> module swap other_mpi_module cray-mpich
>>>
>>> If no MPI module is loaded, load the cray-mpich module.
>>>
>>> module load cray-mpich
>>>
>>> • That if using OpenMP functions (for example, omp_get_wtime), the source
>>> code includes one of the following lines:
>>>
>>> INCLUDE 'omp.h' ## for Fortran, or
>>> #include <omp.h> ## for C/C++
>>>
>>> Or, if the code is written in Fortran 90 or later, the following line may be
>>> used instead:
>>>
>>> USE omp_lib
>>>
>>> • That the compile command includes an option to reference the OpenMP
>>> library. The PGI, Cray, Intel, and GNU compilers support OpenMP, and each
>>> one uses a different option.
>>>
>>> To compile an OpenMP program, use the following examples:
>>>
>>> For C/C++ codes:
>>>
>>> cc -o OpenMP_program -mp=nonuma OpenMP_program.c ## PGI
>>> cc -o OpenMP_program -h omp OpenMP_program.c ## Cray
>>> cc -o OpenMP_program -openmp OpenMP_program.c ## Intel
>>> cc -o OpenMP_program -fopenmp OpenMP_program.c ## GNU
>>>
>>> ———————————————————————
>>>
>>> 5.2. Available Compilers
>>>
>>> Lightning has four programming environment suites.
>>>
>>> • Portland Group (PGI)
>>> • Cray Fortran and C/C++
>>> • Intel
>>> • GNU
>>> On Lightning, different sets of compilers are used to compile codes for
>>> serial vs. parallel execution.
>>>
>>> Compiling for the Compute Nodes
>>>
>>> Codes compiled to run on the compute nodes may be serial or parallel. The
>>> x86-64 instruction set for Intel Ivy Bridge E5-2697 processors has
>>> extensions for the Floating Point Unit (FPU) that require the module
>>> craype-ivybridge to be loaded. This module is loaded for you by default. To
>>> compile codes for execution on the compute nodes, the same compile commands
>>> are available in all programming environment suites as shown in the
>>> following table:
>>>
>>> Compute Node Compiler Commands
>>>
>>> Language     PGI   Cray   Intel   GNU   Serial/Parallel
>>> C            cc    cc     cc      cc    Serial/Parallel
>>> C++          CC    CC     CC      CC    Serial/Parallel
>>> Fortran 77   f77   f77    f77     f77   Serial/Parallel
>>> Fortran 90   ftn   ftn    ftn     ftn   Serial/Parallel
>>> ——————————————————————————————
>>>
>>>
>>> Building the serial version of GPAW gives me no trouble. I have compiled it
>>> and run all the tests and they all pass. When trying to build the custom
>>> interpreter, the compiling goes fine, but the linking fails.
>>>
>>> Below is the link line causing the issue:
>>> cc -o build/bin.linux-x86_64-2.7//gpaw-python
>>> build/temp.linux-x86_64-2.7/c/woperators.o
>>> build/temp.linux-x86_64-2.7/c/plt.o build/temp.linux-x86_64-2.7/c/lapack.o
>>> build/temp.linux-x86_64-2.7/c/symmetry.o
>>> build/temp.linux-x86_64-2.7/c/plane_wave.o
>>> build/temp.linux-x86_64-2.7/c/operators.o
>>> build/temp.linux-x86_64-2.7/c/mlsqr.o
>>> build/temp.linux-x86_64-2.7/c/transformers.o
>>> build/temp.linux-x86_64-2.7/c/utilities.o
>>> build/temp.linux-x86_64-2.7/c/spline.o build/temp.linux-x86_64-2.7/c/lfc2.o
>>> build/temp.linux-x86_64-2.7/c/localized_functions.o
>>> build/temp.linux-x86_64-2.7/c/wigner_seitz.o
>>> build/temp.linux-x86_64-2.7/c/mpi.o build/temp.linux-x86_64-2.7/c/lfc.o
>>> build/temp.linux-x86_64-2.7/c/bc.o build/temp.linux-x86_64-2.7/c/hdf5.o
>>> build/temp.linux-x86_64-2.7/c/blas.o build/temp.linux-x86_64-2.7/c/fftw.o
>>> build/temp.linux-x86_64-2.7/c/lcao.o
>>> build/temp.linux-x86_64-2.7/c/point_charges.o
>>> build/temp.linux-x86_64-2.7/c/_gpaw.o build/temp.linux-x86_64-2.7/c/cerf.o
>>> build/temp.linux-x86_64-2.7/c/blacs.o
>>> build/temp.linux-x86_64-2.7/c/bmgs/bmgs.o
>>> build/temp.linux-x86_64-2.7/c/xc/rpbe.o
>>> build/temp.linux-x86_64-2.7/c/xc/tpss.o
>>> build/temp.linux-x86_64-2.7/c/xc/xc.o
>>> build/temp.linux-x86_64-2.7/c/xc/revtpss_c_pbe.o
>>> build/temp.linux-x86_64-2.7/c/xc/pbe.o
>>> build/temp.linux-x86_64-2.7/c/xc/libxc.o
>>> build/temp.linux-x86_64-2.7/c/xc/m06l.o
>>> build/temp.linux-x86_64-2.7/c/xc/pw91.o
>>> build/temp.linux-x86_64-2.7/c/xc/revtpss.o
>>> build/temp.linux-x86_64-2.7/c/xc/ensemble_gga.o
>>> build/temp.linux-x86_64-2.7/c/xc/xc_mgga.o
>>> build/temp.linux-x86_64-2.7/c/xc/vdw.o -L/home/wwtomlin/xc/lib
>>> -L/opt/gcc/5.1.0/snos/lib64 -L/opt/cray/libsci/13.2.0/GNU/5.1/x86_64/lib
>>> -L/app/COST/python/2.7.9/gnu/lib -L/opt/cray/mpt/7.2.6/gni/mpich-gnu/51/lib
>>> -L/app/COST/python/2.7.9/gnu/lib/python2.7/config -lxc -lpython2.7 -lpthread
>>> -ldl -lutil -lm -pg -L. -L/app/COST/bzip2/1.0.6/gnu//lib
>>> -L/app/COST/tcltk/8.6.4/gnu//lib
>>> -L/app/COST/dependencies/sqlite/3081101/gnu//lib
>>> -L/app/COST/dependencies/readline/6.3/gnu//lib -Xlinker -export-dynamic
>>>
>>>
>>> Below are the warnings and error I get:
>>> /app/COST/python/2.7.9/gnu/lib/libpython2.7.a(dynload_shlib.o): In function
>>> `_PyImport_GetDynLoadFunc':
>>> /app/COST/source/Python-2.7.9/Python/dynload_shlib.c:130: warning: Using
>>> 'dlopen' in statically linked applications requires at runtime the shared
>>> libraries from the glibc version used for linking
>>> /app/COST/python/2.7.9/gnu/lib/libpython2.7.a(posixmodule.o): In function
>>> `posix_tmpnam':
>>> /app/COST/source/Python-2.7.9/./Modules/posixmodule.c:7575: warning: the use
>>> of `tmpnam_r' is dangerous, better use `mkstemp'
>>> /app/COST/python/2.7.9/gnu/lib/libpython2.7.a(posixmodule.o): In function
>>> `posix_tempnam':
>>> /app/COST/source/Python-2.7.9/./Modules/posixmodule.c:7522: warning: the use
>>> of `tempnam' is dangerous, better use `mkstemp'
>>> /app/COST/python/2.7.9/gnu/lib/libpython2.7.a(posixmodule.o): In function
>>> `posix_initgroups':
>>> /app/COST/source/Python-2.7.9/./Modules/posixmodule.c:4161: warning: Using
>>> 'initgroups' in statically linked applications requires at runtime the
>>> shared libraries from the glibc version used for linking
>>> /app/COST/python/2.7.9/gnu/lib/libpython2.7.a(pwdmodule.o): In function
>>> `pwd_getpwall':
>>> /app/COST/source/Python-2.7.9/./Modules/pwdmodule.c:165: warning: Using
>>> 'getpwent' in statically linked applications requires at runtime the shared
>>> libraries from the glibc version used for linking
>>> /app/COST/python/2.7.9/gnu/lib/libpython2.7.a(pwdmodule.o): In function
>>> `pwd_getpwnam':
>>> /app/COST/source/Python-2.7.9/./Modules/pwdmodule.c:139: warning: Using
>>> 'getpwnam' in statically linked applications requires at runtime the shared
>>> libraries from the glibc version used for linking
>>> /app/COST/python/2.7.9/gnu/lib/libpython2.7.a(pwdmodule.o): In function
>>> `pwd_getpwuid':
>>> /app/COST/source/Python-2.7.9/./Modules/pwdmodule.c:114: warning: Using
>>> 'getpwuid' in statically linked applications requires at runtime the shared
>>> libraries from the glibc version used for linking
>>> /app/COST/python/2.7.9/gnu/lib/libpython2.7.a(pwdmodule.o): In function
>>> `pwd_getpwall':
>>> /app/COST/source/Python-2.7.9/./Modules/pwdmodule.c:164: warning: Using
>>> 'setpwent' in statically linked applications requires at runtime the shared
>>> libraries from the glibc version used for linking
>>> /app/COST/source/Python-2.7.9/./Modules/pwdmodule.c:176: warning: Using
>>> 'endpwent' in statically linked applications requires at runtime the shared
>>> libraries from the glibc version used for linking
>>> /usr/bin/ld: dynamic STT_GNU_IFUNC symbol `strcmp' with pointer equality in
>>> `/usr/lib/../lib64/libc.a(strcmp.o)' can not be used when making an
>>> executable; recompile with -fPIE and relink with -pie
>>> collect2: error: ld returned 1 exit status
>>>
>>>
>>> Finally, here are the lines from my customize.py file:
>>> compiler = 'cc'
>>> define_macros += [('PARALLEL', '1')]
>>> mpicompiler = 'cc'
>>> mpilinker = mpicompiler
>>>
>>> libraries = ['xc']
>>> scalapack = True
>>>
>>> mpi_library_dirs += ['/opt/cray/mpt/7.2.6/gni/mpich-gnu/51/lib']
>>> library_dirs += ['/home/wwtomlin/xc/lib']
>>> library_dirs += ['/opt/gcc/5.1.0/snos/lib64']
>>> library_dirs += ['/opt/cray/libsci/13.2.0/GNU/5.1/x86_64/lib']
>>> library_dirs += ['/app/COST/python/2.7.9/gnu/lib']
>>>
>>> mpi_include_dirs += ['/opt/cray/mpt/7.2.6/gni/mpich-gnu/51/include']
>>> include_dirs += ['/home/wwtomlin/xc/include']
>>> include_dirs += ['/opt/gcc/5.1.0/snos/include']
>>> include_dirs += ['/opt/cray/libsci/13.2.0/GNU/5.1/x86_64/include']
>>> include_dirs += ['/app/COST/python/2.7.9/gnu/include']
>>>
>>> define_macros += [('GPAW_NO_UNDERSCORE_CBLACS', '1')]
>>> define_macros += [('GPAW_NO_UNDERSCORE_CSCALAPACK', '1')]
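For the -fPIE/-pie link failure reported above, the usual Cray workaround is to link dynamically instead of letting the cc wrapper combine the static libpython2.7.a with the static libc.a. A hedged customize.py sketch (flag and variable names follow the standard craype conventions; verify against the site documentation):

```python
# customize.py additions (sketch): ask the Cray compiler wrapper to link
# dynamically, avoiding the static-libc STT_GNU_IFUNC 'strcmp' conflict.
extra_link_args += ['-dynamic']
# Equivalently, export CRAYPE_LINK_TYPE=dynamic in the build environment.
```

A dynamically linked gpaw-python also sidesteps the dlopen/glibc warnings from the static libpython, at the cost of needing the same shared libraries available on the compute nodes.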
>>>
>>>
>>> My limited experience in this area tells me that there might be a problem
>>> with the way libc.a was compiled, but I’m not sure. I can’t do anything
>>> about that directly, but I could bring the problem to the attention of the
>>> system administrators; they’re usually pretty helpful about getting things
>>> updated. Is that what I need to do?
>>>
>>> Thank you for any help,
>>> Warren
>>> PhD Student
>>> Naval Postgraduate School
>>>
>>> _______________________________________________
>>> gpaw-users mailing list
>>> gpaw-users at listserv.fysik.dtu.dk
>>> https://listserv.fysik.dtu.dk/mailman/listinfo/gpaw-users
>>>
>>>
>>>
>>>
>>>
>