[gpaw-users] GPAW Help

Torsten Hahn torstenhahn at fastmail.fm
Tue Mar 1 08:57:01 CET 2016


Hey all,


Sometimes I experience similar errors. We once thought we had tracked it down to an erroneous MPI implementation (Intel MPI). However, some people in my group still see the same error with OpenMPI, and to be honest we have no idea where it comes from. It looks as if, on some CPUs, a very small numerical error occasionally appears in the atomic positions. This error never happens in non-MPI calculations.

Would be really nice to track that down.
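The comparison GPAW performs can be mimicked offline. Here is a minimal numpy sketch (the function name is mine, not GPAW's) of how one might compare the position arrays reported by two ranks, in the spirit of what `gpaw.mpi.synchronize_atoms` checks:

```python
import numpy as np

def mismatched_atoms(pos_ref, pos_local, tol=1e-12):
    """Return indices of atoms whose positions differ between two ranks
    by more than tol in any Cartesian component."""
    diff = np.abs(np.asarray(pos_ref) - np.asarray(pos_local))
    return np.flatnonzero((diff > tol).any(axis=1))

# Example: a rank-local position array that drifted by ~1e-10 on atom 2
ref = np.zeros((4, 3))
local = ref.copy()
local[2, 0] += 1e-10
print(mismatched_atoms(ref, local))  # -> [2]
```

Dumping the per-rank positions (as the .pckl files do) and diffing them this way would at least show whether the mismatch is a tiny rounding difference or complete garbage.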

Best,
Torsten.

> Am 29.02.2016 um 18:42 schrieb Tomlinson, Warren (CDR) <wwtomlin at nps.edu>:
> 
> Ask-
>    Thanks for the help.  I tried running with atoms.rattle as well as the hack you sent me.  The exact problem still persists: three SCF cycles complete and then the error pops up.  I have had success with LBFGS.  There's no reason why I shouldn't be OK using that optimizer, correct?  It is odd, though, that BFGS can't make it past three steps.
> Thanks
> Warren
> 
>> On Feb 26, 2016, at 11:47 AM, Ask Hjorth Larsen <asklarsen at gmail.com> wrote:
>> 
>> I realize that more symmetry breaking might be necessary depending on
>> how some things are implemented.  You can try with this slightly
>> symmetry-breaking hack:
>> 
>> http://dcwww.camd.dtu.dk/~askhl/files/bfgshack.py
>> 
>> If push comes to shove and we cannot guess what the problem is, try
>> reducing it in size as much as possible.  As few cores as possible,
>> and as rough parameters as possible.
>> 
>> Best regards
>> Ask
>> 
>> 2016-02-26 20:40 GMT+01:00 Ask Hjorth Larsen <asklarsen at gmail.com>:
>>> Hi Warren
>>> 
>>> 2016-02-26 19:40 GMT+01:00 Tomlinson, Warren (CDR) <wwtomlin at nps.edu>:
>>>> Ask-
>>>>       Thank you for your help.  I reran with the --debug option and also ran with 36 cores.  Both still failed with the same synchronization problem.  I have all 144 synchronize_atoms_r##.pckl files, but I'm not sure exactly what to do with them.
>>>> 
>>>>       On a related note, I ran the 680 atom structure with QuasiNewton instead of BFGS and it worked.  So I’m guessing that’s a big clue.
>>> 
>>> That's interesting.  BFGS calculates eigenvectors.  Sometimes in
>>> exactly symmetric systems, different cores can get different results
>>> even though they perform the same mathematical operation, typically
>>> due to aggressive BLAS stuff.  They will differ very little, but they
>>> can order eigenvalues/vectors differently and maybe end up doing
>>> different things.
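The eigenvector ambiguity described above is easy to demonstrate in isolation. A small numpy sketch (not GPAW code) of why exact degeneracy is dangerous across ranks:

```python
import numpy as np

# With exactly degenerate eigenvalues, the eigenvectors are arbitrary
# within the degenerate subspace.  Two ranks whose inputs differ by
# ~1e-15 can therefore obtain completely different (but equally valid)
# eigenvectors while agreeing on the eigenvalues to machine precision.
A = np.eye(2)                                       # exactly degenerate
eps = 1e-15
B = A + eps * np.array([[0.0, 1.0], [1.0, 0.0]])    # tiny symmetric perturbation

wa, va = np.linalg.eigh(A)
wb, vb = np.linalg.eigh(B)
print(np.abs(wa - wb).max())                 # eigenvalues: essentially identical
print(np.abs(np.abs(va) - np.abs(vb)).max()) # eigenvectors: typically differ a lot
```

If two ranks then base a decision (ordering, phase, step direction) on those eigenvectors, they can diverge even though each did a mathematically valid computation.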
>>> 
>>> Try doing atoms.rattle(stdev=1e-12) and see if it runs.  Of course,
>>> the optimization should be robust against that sort of problem, so we
>>> would have to look into it even if it runs.
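For reference, atoms.rattle simply adds a tiny Gaussian displacement to every position to break exact symmetry. A numpy-only sketch of the idea (the real call is just `atoms.rattle(stdev=1e-12)` on an ASE Atoms object):

```python
import numpy as np

def rattle(positions, stdev=1e-12, seed=42):
    """Add a tiny Gaussian displacement to each position; a fixed seed
    keeps the perturbation reproducible across runs."""
    rng = np.random.RandomState(seed)
    return positions + rng.normal(scale=stdev, size=positions.shape)

pos = np.zeros((2, 3))   # two atoms in perfectly symmetric positions
new = rattle(pos)
print(np.abs(new - pos).max() < 1e-10)  # -> True: displacement is negligible
```

A displacement of 1e-12 Å is far below any physically meaningful scale, so it should not change the result, only remove the exact degeneracies.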
>>> 
>>>> 
>>>>       On an unrelated note, I’m afraid I have very little experience doing this kind of thing, so I’m not surprised that I have not set the scalapack parameters correctly.  I simply used the defaults from the bottom of the GPAW "Parallel runs" page:
>>>>       mb = 64
>>>>       m = floor(sqrt(bands/mb))
>>>>       n = m
>>>>       There are 2360 bands in the calculation, so that’s where I came up with 'sl_default': (6, 6, 64).  I would appreciate any insight you can give me on how to set the scalapack options correctly.
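The quoted recipe, spelled out as code (a sketch; as the reply notes, in LCAO mode `n` should be the number of atomic orbitals rather than the number of bands):

```python
from math import floor, sqrt

def sl_grid(n, mb=64):
    """Square ScaLAPACK CPU grid (m, m) with block size mb for a
    problem of size n, following m = floor(sqrt(n / mb))."""
    m = floor(sqrt(n / mb))
    return (m, m, mb)

print(sl_grid(2360))  # -> (6, 6, 64)
```

With n = 2360 and mb = 64 this indeed yields the (6, 6, 64) setting used in the script below.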
>>> 
>>> It is the number of atomic orbitals, not the number of bands, that
>>> determines how big the ScaLAPACK problem is in LCAO mode.  (6, 6, 64)
>>> might be a good ScaLAPACK setting on 36 cores, because the 6 x 6 CPU
>>> grid uses exactly 36 cores (with one k-point/spin), but if you use
>>> more cores, you should increase the CPU grid to the maximum
>>> available.  You can also use sl_auto=True to choose something which
>>> is probably non-horrible.  For a system of this size there is no
>>> point in running an LCAO calculation without the maximum possible
>>> number of cores for ScaLAPACK, because the ScaLAPACK operations are
>>> by far the most expensive.
>>> 
>>> I will have a look at the documentation and maybe update it.
>>> 
>>> Best regards
>>> Ask
>>> 
>>>> 
>>>> Thanks
>>>> Warren
>>>> 
>>>>> On Feb 26, 2016, at 8:35 AM, Ask Hjorth Larsen <asklarsen at gmail.com> wrote:
>>>>> 
>>>>> Hello
>>>>> 
>>>>> This sounds strange.  Can you re-run it with --debug, please (e.g.,
>>>>> gpaw-python script.py --debug)?  Then we can see in which way they
>>>>> differ, whether it's due to slight numerical imprecision or position
>>>>> values that are complete garbage.
>>>>> 
>>>>> On an unrelated note, does it not run with just 36 cores for the 680
>>>>> atoms?  Also, the scalapack parameters should probably be set to use
>>>>> all available cores.
>>>>> 
>>>>> Best regards
>>>>> Ask
>>>>> 
>>>>> 2016-02-25 8:21 GMT+01:00 Tomlinson, Warren (CDR) <wwtomlin at nps.edu>:
>>>>>> Hello-
>>>>>>      I have been using GPAW for a couple of months now and have run into a persistent problem that I cannot figure out.  I’m using a cluster with 3,456 nodes, each with 36 cores (Intel Xeon E5-2699v3 Haswell).  I installed with MKL 11.2 and have Python 2.7.10 and numpy 1.9.2.  I used the following settings to relax a periodic cell containing 114 atoms (successfully, with 72 cores):
>>>>>> 
>>>>>> cell = read('small.pdb')
>>>>>> cell.set_pbc(1)
>>>>>> cell.set_cell([[18.752, 0., 0.], [9.376, 16.239708, 0.], [9.376, 5.413236, 15.310944]])
>>>>>> calc = GPAW(mode='lcao',
>>>>>>          gpts=(80,80,80),
>>>>>>          xc='PBE',
>>>>>>          poissonsolver=PoissonSolver(relax='GS', eps=1e-7),
>>>>>>          parallel={'band':2,'sl_default':(3,3,64)},
>>>>>>          basis='dzp',
>>>>>>          mixer=Mixer(0.1, 5, weight=100.0),
>>>>>>          occupations=FermiDirac(width=0.1),
>>>>>>          maxiter=1000,
>>>>>>          txt='67_sml_N_LCAO.out'
>>>>>>          )
>>>>>> cell.set_calculator(calc)
>>>>>> opt = BFGS(cell)
>>>>>> opt.run()
>>>>>> 
>>>>>> When I try virtually the exact same options on a larger (cubic) cell:
>>>>>> 
>>>>>> cell = read('big.pdb')
>>>>>> cell.set_cell([26.52,26.52,26.52])
>>>>>> cell.set_pbc(1)
>>>>>> calc_LCAO = GPAW(mode='lcao',
>>>>>>          gpts=(144,144,144),
>>>>>>          xc='PBE',
>>>>>>          poissonsolver=PoissonSolver(relax='GS', eps=1e-7),
>>>>>>          parallel={'band':2,'sl_default':(6,6,64)},
>>>>>>          basis = 'dzp',
>>>>>>          mixer=Mixer(0.1, 5, weight=100.0),
>>>>>>          occupations=FermiDirac(width=0.1),
>>>>>>          txt='67_Full.out',
>>>>>>          maxiter=1000
>>>>>>          )
>>>>>> etc…
>>>>>> 
>>>>>> 
>>>>>> I get an error after three SCF steps.  The larger cell has 680 atoms and I used 144 cores.  The error I get is below:
>>>>>> 
>>>>>> rank=036 L00: Traceback (most recent call last):
>>>>>> rank=036 L01:   File "/p/home/wwtomlin/PhaseII/Proj1/INITIAL/67_Full_C.py", line 44, in <module>
>>>>>> rank=036 L02:     opt.run()
>>>>>> rank=036 L03:   File "/p/home/wwtomlin/ase/ase/optimize/optimize.py", line 148, in run
>>>>>> rank=036 L04:     f = self.atoms.get_forces()
>>>>>> rank=036 L05:   File "/p/home/wwtomlin/ase/ase/atoms.py", line 688, in get_forces
>>>>>> rank=036 L06:     forces = self._calc.get_forces(self)
>>>>>> rank=036 L07:   File "/p/home/wwtomlin/gpaw/gpaw/aseinterface.py", line 78, in get_forces
>>>>>> rank=036 L08:     force_call_to_set_positions=force_call_to_set_positions)
>>>>>> rank=036 L09:   File "/p/home/wwtomlin/gpaw/gpaw/paw.py", line 272, in calculate
>>>>>> rank=036 L10:     self.set_positions(atoms)
>>>>>> rank=036 L11:   File "/p/home/wwtomlin/gpaw/gpaw/paw.py", line 328, in set_positions
>>>>>> rank=036 L12:     spos_ac = self.initialize_positions(atoms)
>>>>>> rank=036 L13:   File "/p/home/wwtomlin/gpaw/gpaw/paw.py", line 314, in initialize_positions
>>>>>> rank=036 L14:     self.synchronize_atoms()
>>>>>> rank=036 L15:   File "/p/home/wwtomlin/gpaw/gpaw/paw.py", line 1034, in synchronize_atoms
>>>>>> rank=036 L16:     mpi.synchronize_atoms(self.atoms, self.wfs.world)
>>>>>> rank=036 L17:   File "/p/home/wwtomlin/gpaw/gpaw/mpi/__init__.py", line 714, in synchronize_atoms
>>>>>> rank=036 L18:     err_ranks)
>>>>>> rank=036 L19: ValueError: ('Mismatch of Atoms objects.  In debug mode, atoms will be dumped to files.', array([  5,   9,  13,  17,  18,  19,  20,  21,  22,  23,  25,  27,  28,
>>>>>> rank=036 L20:         32,  33,  35,  37,  40,  41,  43,  44,  46,  50,  51,  52,  53,
>>>>>> rank=036 L21:         54,  60,  62,  63,  64,  65,  68,  71,  72,  74,  80,  82,  85,
>>>>>> rank=036 L22:         87,  90,  91,  94,  97,  98,  99, 100, 101, 104, 106, 107, 110,
>>>>>> rank=036 L23:        111, 115, 116, 118, 123, 125, 129, 130, 137, 138, 139, 142]))
>>>>>> GPAW CLEANUP (node 36): <type 'exceptions.ValueError'> occurred.  Calling MPI_Abort!
>>>>>> 
>>>>>> ————
>>>>>> 
>>>>>> I think this means my atoms are not in the same positions across different cores, but I can’t figure out how this happened.  Do you have any suggestions?
>>>>>> Thank you
>>>>>> Warren
>>>>>> PhD Student
>>>>>> Naval Postgraduate School
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> _______________________________________________
>>>>>> gpaw-users mailing list
>>>>>> gpaw-users at listserv.fysik.dtu.dk
>>>>>> https://listserv.fysik.dtu.dk/mailman/listinfo/gpaw-users
>>>> 
> 
> 


