[gpaw-users] GPAW Help

Tomlinson, Warren (CDR) wwtomlin at nps.edu
Wed Mar 2 20:20:05 CET 2016


All-
	Thanks for the help.  I tried setting the environment variable MKL_CBWR as suggested, but still got the error (after 3 SCF cycles).  I also made a number-by-number comparison of all positions on all CPUs against the reference CPU (rank 0), e.g.:
	For all the position vectors in reference and other CPUs:
		for i in range(3):
			if ref[i] != allpos[i]:
				same = False

	The same variable was never set to False.  I also compared chemical symbols, PBCs, cells and the compound symbol.  All were identical across all CPUs.  Is there perhaps something else I should check?
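For reference, the field-by-field check described above can be written as one small self-contained helper.  This is only a sketch: `atoms_state_equal` is a made-up name (not a GPAW function), and the default tolerance of 0.0 mirrors the strict bitwise `a == b` comparison that the synchronization check performs.

```python
import numpy as np

def atoms_state_equal(ref, other, tol=0.0):
    """Compare two snapshots of Atoms state field by field.

    Each snapshot is a dict with 'positions' (N x 3 floats),
    'cell' (3 x 3 floats), 'pbc' (3 bools) and 'symbols' (list of str).
    With tol=0.0 the numeric comparison is bitwise, matching the
    strict a == b checks done during atom synchronization.
    """
    if ref['symbols'] != other['symbols']:
        return False
    if not np.array_equal(ref['pbc'], other['pbc']):
        return False
    for key in ('positions', 'cell'):
        diff = np.abs(np.asarray(ref[key]) - np.asarray(other[key]))
        if diff.max() > tol:
            return False
    return True
```

In an MPI run one would gather such snapshots from every rank and compare each against rank 0's copy.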
Thanks for all the help.  I realize there might not be an answer to this right now, and that’s OK.  I’m moving forward with LBFGS and so far it’s working just fine.
Thanks
Warren

> On Mar 1, 2016, at 12:33 PM, Ask Hjorth Larsen <asklarsen at gmail.com> wrote:
> 
> As far as I can see, it should be printing 1.2345e-7 or something like
> that if the positions were in fact different.  The tolerance is 1e-8.
> Which other property could possibly be wrong?  Is the cell off by
> 1e-37 in one of them?  That would be enough to cause that error,
> because those are compared as a == b.  But then why would it happen at
> iteration three?  I find it highly unlikely that, say, the number of
> atoms suddenly differs by something :).
> 
> Anyway, Warren, if you could similarly compare the other properties of
> each dumped atoms object, then it would be very useful.  Also, printing
> repr(errs) might be more appropriate, since then you know you get full
> precision (but it should still have shown an actual error if there
> were one).
> 
> Another thing we can do is to make a wrapper for ASE optimizers which
> broadcasts the work of rank 0 to all ranks.  That should completely
> prevent such an error from arising from within ASE (at least for the
> positions), but it should be used with caution I guess.
> 
> Best regards
> Ask
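A rough sketch of the wrapper idea mentioned above, i.e. letting rank 0's positions win on every rank.  Everything here is illustrative: `SynchronizedAtoms` is a made-up name (not an ASE or GPAW class), and `world` is only assumed to expose `rank` and an in-place `broadcast(array, root)` in the style of gpaw.mpi.

```python
import numpy as np

class SynchronizedAtoms:
    """Sketch: after any step that may change positions, overwrite
    them with rank 0's copy so all ranks stay bitwise identical.

    `world` is assumed to provide an in-place broadcast(array, root);
    the names here are hypothetical, not GPAW's actual API.
    """
    def __init__(self, atoms, world):
        self.atoms = atoms
        self.world = world

    def sync_positions(self):
        # Contiguous buffer so an in-place MPI broadcast can fill it.
        pos = np.ascontiguousarray(self.atoms.get_positions())
        self.world.broadcast(pos, 0)   # rank 0's values win everywhere
        self.atoms.set_positions(pos)
```

An optimizer wrapper would call `sync_positions()` after each optimizer step, before forces are evaluated.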
> 
> 2016-03-01 9:57 GMT+01:00 Jussi Enkovaara <jussi.enkovaara at csc.fi>:
>> Hi all,
>> the problem is most likely related to the fact (as Ask already
>> mentioned) that modern optimized libraries (Intel MKL being a prime
>> example) do not necessarily produce the same output even for
>> bit-identical input; the result may depend, for example, on how
>> memory allocations are aligned.  See e.g.
>> https://software.intel.com/en-us/articles/getting-reproducible-results-with-intel-mkl/
>> 
>> Symmetry is not the key issue.  In systems with symmetry there can be
>> degenerate eigenvectors, and tiny numerical differences can produce
>> completely different linear combinations of those eigenvectors, which
>> amplifies the problem.  But problems can arise even without symmetry,
>> so rattle does not necessarily solve anything.
>> 
>> There has been some effort to address numerical-reproducibility
>> problems in GPAW (e.g. atomic positions returned from ASE are not
>> required to be bitwise identical), but apparently some bugs still
>> remain.
>> 
>> For MKL, one could try to enforce numerical reproducibility by setting
>> the environment variable MKL_CBWR.  Suitable values may depend on the
>> MKL version, but one could start with
>> 
>> export MKL_CBWR=AVX
>> 
>> This can lead to some performance degradation.
>> 
>> Best regards,
>> Jussi
>> 
>> 
>> 
>> On 2016-03-01 09:57, Torsten Hahn wrote:
>>> 
>>> Hey all,
>>> 
>>> 
>>> Sometimes I experience similar errors.  We once thought we had tracked
>>> it down to an erroneous MPI implementation (Intel MPI).  However, some
>>> people in my group still see the same error with OpenMPI, and to be
>>> honest we have no idea where it comes from.  It looks like on some
>>> CPUs there is sometimes a very small numerical error in the atomic
>>> positions.  This error never happens in non-MPI calculations.
>>> 
>>> Would be really nice to track that down.
>>> 
>>> Best,
>>> Torsten.
>>> 
>>>> Am 29.02.2016 um 18:42 schrieb Tomlinson, Warren (CDR)
>>>> <wwtomlin at nps.edu>:
>>>> 
>>>> Ask-
>>>>    Thanks for the help.  I tried running with atoms.rattle as well
>>>> as the hack you sent me.  The exact same problem persists: three SCF
>>>> cycles complete and then the error pops up.  I have had success with
>>>> LBFGS.  There’s no reason why I shouldn’t be OK using that optimizer,
>>>> correct?  It is odd, though, that BFGS can’t make it past three
>>>> steps.
>>>> Thanks
>>>> Warren
>>>> 
>>>>> On Feb 26, 2016, at 11:47 AM, Ask Hjorth Larsen <asklarsen at gmail.com>
>>>>> wrote:
>>>>> 
>>>>> I realize that more symmetry breaking might be necessary depending on
>>>>> how some things are implemented.  You can try with this slightly
>>>>> symmetry-breaking hack:
>>>>> 
>>>>> http://dcwww.camd.dtu.dk/~askhl/files/bfgshack.py
>>>>> 
>>>>> If push comes to shove and we cannot guess what the problem is, try
>>>>> reducing it in size as much as possible.  As few cores as possible,
>>>>> and as rough parameters as possible.
>>>>> 
>>>>> Best regards
>>>>> Ask
>>>>> 
>>>>> 2016-02-26 20:40 GMT+01:00 Ask Hjorth Larsen <asklarsen at gmail.com>:
>>>>>> 
>>>>>> Hi Warren
>>>>>> 
>>>>>> 2016-02-26 19:40 GMT+01:00 Tomlinson, Warren (CDR) <wwtomlin at nps.edu>:
>>>>>>> 
>>>>>>> Ask-
>>>>>>>       Thank you for your help.  I reran with the --debug option and
>>>>>>> also ran with 36 cores.  Both runs still failed with the same
>>>>>>> synchronization problem.  I have all 144 synchronize_atoms_r##.pckl
>>>>>>> files, but I’m not sure exactly what to do with them.
>>>>>>> 
>>>>>>>       On a related note, I ran the 680 atom structure with
>>>>>>> QuasiNewton instead of BFGS and it worked.  So I’m guessing that’s a big
>>>>>>> clue.
>>>>>> 
>>>>>> 
>>>>>> That's interesting.  BFGS calculates eigenvectors.  Sometimes in
>>>>>> exactly symmetric systems, different cores can get different results
>>>>>> even though they perform the same mathematical operation, typically
>>>>>> due to aggressive BLAS stuff.  They will differ very little, but they
>>>>>> can order eigenvalues/vectors differently and maybe end up doing
>>>>>> different things.
>>>>>> 
>>>>>> Try doing atoms.rattle(stdev=1e-12) and see if it runs.  Of course,
>>>>>> the optimization should be robust against that sort of problem, so we
>>>>>> would have to look into it even if it runs.
>>>>>> 
>>>>>>> 
>>>>>>>       On an unrelated note, I’m afraid I have very little experience
>>>>>>> doing this kind of thing and so I’m not surprised that I have not correctly
>>>>>>> set the scalapack parameters.  I simply set the defaults based on what
>>>>>>> I found at the bottom of the GPAW “Parallel runs” page:
>>>>>>>       mb = 64
>>>>>>>       m = floor(sqrt(bands/mb))
>>>>>>>       n = m
>>>>>>>       There are 2360 bands in the calculations, so that’s where I
>>>>>>> came up with ‘sl_default’:(6,6,64).  I would appreciate any insight you can
>>>>>>> give me on how to get the scalapack options set correctly.
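Taking the quoted heuristic at face value, a short plain-Python check (illustrative only; `sl_grid` is a made-up helper name) reproduces the (6, 6, 64) choice for 2360 bands:

```python
from math import floor, sqrt

def sl_grid(nbands, mb=64):
    """Square ScaLAPACK CPU grid from the heuristic quoted above:
    m = floor(sqrt(nbands / mb)), n = m, with block size mb."""
    m = int(floor(sqrt(nbands / mb)))
    return (m, m, mb)
```

For example, `sl_grid(2360)` gives a 6 x 6 grid with block size 64.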
>>>>>> 
>>>>>> 
>>>>>> It is the number of atomic orbitals, not the number of bands, that
>>>>>> determines how big the ScaLAPACK problem is in LCAO mode.  (6, 6, 64)
>>>>>> may well be a good ScaLAPACK setting on 36 cores, because the CPU
>>>>>> grid 6 x 6 = 36 uses all of them (with one k-point/spin), but if you
>>>>>> use more cores, you should increase the CPU grid to the maximum
>>>>>> available.  You can also use sl_auto=True to choose something which
>>>>>> is probably non-horrible.  For a system of this size there is no
>>>>>> point in running an LCAO calculation without using the maximum
>>>>>> possible number of cores for ScaLAPACK, because the ScaLAPACK
>>>>>> operations are by far the most expensive.
>>>>>> 
>>>>>> I will have a look at the documentation and maybe update it.
>>>>>> 
>>>>>> Best regards
>>>>>> Ask
>>>>>> 
>>>>>>> 
>>>>>>> Thanks
>>>>>>> Warren
>>>>>>> 
>>>>>>>> On Feb 26, 2016, at 8:35 AM, Ask Hjorth Larsen <asklarsen at gmail.com>
>>>>>>>> wrote:
>>>>>>>> 
>>>>>>>> Hello
>>>>>>>> 
>>>>>>>> This sounds strange.  Can you re-run it with --debug, please (e.g.,
>>>>>>>> gpaw-python script.py --debug)?  Then we can see in which way they
>>>>>>>> differ, whether it's due to slight numerical imprecision or position
>>>>>>>> values that are complete garbage.
>>>>>>>> 
>>>>>>>> On an unrelated note, does it not run with just 36 cores for the 680
>>>>>>>> atoms?  Also, the scalapack parameters should probably be set to use
>>>>>>>> all available cores.
>>>>>>>> 
>>>>>>>> Best regards
>>>>>>>> Ask
>>>>>>>> 
>>>>>>>> 2016-02-25 8:21 GMT+01:00 Tomlinson, Warren (CDR) <wwtomlin at nps.edu>:
>>>>>>>>> 
>>>>>>>>> Hello-
>>>>>>>>>      I have been using GPAW for a couple of months now and have run
>>>>>>>>> into a persistent problem that I cannot figure out.  I’m using a cluster
>>>>>>>>> with 3,456 nodes, each with 36 cores (Intel Xeon E5-2699v3 Haswell).  I
>>>>>>>>> installed using MKL 11.2 and have python 2.7.10 and numpy 1.9.2.  I used the
>>>>>>>>> following setting to relax a periodic cell containing 114 atoms
>>>>>>>>> (successfully, with 72 cores):
>>>>>>>>> 
>>>>>>>>> cell = read('small.pdb')
>>>>>>>>> cell.set_pbc(1)
>>>>>>>>> cell.set_cell([[18.752, 0., 0.], [9.376, 16.239708, 0.], [9.376,
>>>>>>>>> 5.413236, 15.310944]])
>>>>>>>>> calc = GPAW(mode='lcao',
>>>>>>>>>          gpts=(80,80,80),
>>>>>>>>>          xc='PBE',
>>>>>>>>>          poissonsolver=PoissonSolver(relax='GS', eps=1e-7),
>>>>>>>>>          parallel={'band':2,'sl_default':(3,3,64)},
>>>>>>>>>          basis='dzp',
>>>>>>>>>          mixer=Mixer(0.1, 5, weight=100.0),
>>>>>>>>>          occupations=FermiDirac(width=0.1),
>>>>>>>>>          maxiter=1000,
>>>>>>>>>          txt='67_sml_N_LCAO.out'
>>>>>>>>>          )
>>>>>>>>> cell.set_calculator(calc)
>>>>>>>>> opt = BFGS(cell)
>>>>>>>>> opt.run()
>>>>>>>>> 
>>>>>>>>> When I try virtually the exact same options on a larger (cubic)
>>>>>>>>> cell:
>>>>>>>>> 
>>>>>>>>> cell = read('big.pdb')
>>>>>>>>> cell.set_cell([26.52,26.52,26.52])
>>>>>>>>> cell.set_pbc(1)
>>>>>>>>> calc_LCAO = GPAW(mode='lcao',
>>>>>>>>>          gpts=(144,144,144),
>>>>>>>>>          xc='PBE',
>>>>>>>>>          poissonsolver=PoissonSolver(relax='GS', eps=1e-7),
>>>>>>>>>          parallel={'band':2,'sl_default':(6,6,64)},
>>>>>>>>>          basis = 'dzp',
>>>>>>>>>          mixer=Mixer(0.1, 5, weight=100.0),
>>>>>>>>>          occupations=FermiDirac(width=0.1),
>>>>>>>>>          txt='67_Full.out',
>>>>>>>>>          maxiter=1000
>>>>>>>>>          )
>>>>>>>>> etc…
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> I get an error after three SCF steps.  The larger cell has 680 atoms
>>>>>>>>> and I used 144 cores.  The error I get is below:
>>>>>>>>> 
>>>>>>>>> rank=036 L00: Traceback (most recent call last):
>>>>>>>>> rank=036 L01:   File
>>>>>>>>> "/p/home/wwtomlin/PhaseII/Proj1/INITIAL/67_Full_C.py", line 44, in <module>
>>>>>>>>> rank=036 L02:     opt.run()
>>>>>>>>> rank=036 L03:   File
>>>>>>>>> "/p/home/wwtomlin/ase/ase/optimize/optimize.py", line 148, in run
>>>>>>>>> rank=036 L04:     f = self.atoms.get_forces()
>>>>>>>>> rank=036 L05:   File "/p/home/wwtomlin/ase/ase/atoms.py", line 688,
>>>>>>>>> in get_forces
>>>>>>>>> rank=036 L06:     forces = self._calc.get_forces(self)
>>>>>>>>> rank=036 L07:   File "/p/home/wwtomlin/gpaw/gpaw/aseinterface.py",
>>>>>>>>> line 78, in get_forces
>>>>>>>>> rank=036 L08:
>>>>>>>>> force_call_to_set_positions=force_call_to_set_positions)
>>>>>>>>> rank=036 L09:   File "/p/home/wwtomlin/gpaw/gpaw/paw.py", line 272,
>>>>>>>>> in calculate
>>>>>>>>> rank=036 L10:     self.set_positions(atoms)
>>>>>>>>> rank=036 L11:   File "/p/home/wwtomlin/gpaw/gpaw/paw.py", line 328,
>>>>>>>>> in set_positions
>>>>>>>>> rank=036 L12:     spos_ac = self.initialize_positions(atoms)
>>>>>>>>> rank=036 L13:   File "/p/home/wwtomlin/gpaw/gpaw/paw.py", line 314,
>>>>>>>>> in initialize_positions
>>>>>>>>> rank=036 L14:     self.synchronize_atoms()
>>>>>>>>> rank=036 L15:   File "/p/home/wwtomlin/gpaw/gpaw/paw.py", line 1034,
>>>>>>>>> in synchronize_atoms
>>>>>>>>> rank=036 L16:     mpi.synchronize_atoms(self.atoms, self.wfs.world)
>>>>>>>>> rank=036 L17:   File "/p/home/wwtomlin/gpaw/gpaw/mpi/__init__.py",
>>>>>>>>> line 714, in synchronize_atoms
>>>>>>>>> rank=036 L18:     err_ranks)
>>>>>>>>> rank=036 L19: ValueError: ('Mismatch of Atoms objects.  In debug
>>>>>>>>> mode, atoms will be dumped to files.', array([  5,   9,  13,  17,  18,  19,
>>>>>>>>> 20,  21,  22,  23,  25,  27,  28,
>>>>>>>>> rank=036 L20:         32,  33,  35,  37,  40,  41,  43,  44,  46,
>>>>>>>>> 50,  51,  52,  53,
>>>>>>>>> rank=036 L21:         54,  60,  62,  63,  64,  65,  68,  71,  72,
>>>>>>>>> 74,  80,  82,  85,
>>>>>>>>> rank=036 L22:         87,  90,  91,  94,  97,  98,  99, 100, 101,
>>>>>>>>> 104, 106, 107, 110,
>>>>>>>>> rank=036 L23:        111, 115, 116, 118, 123, 125, 129, 130, 137,
>>>>>>>>> 138, 139, 142]))
>>>>>>>>> GPAW CLEANUP (node 36): <type 'exceptions.ValueError'> occurred.
>>>>>>>>> Calling MPI_Abort!
>>>>>>>>> 
>>>>>>>>> ————
>>>>>>>>> 
>>>>>>>>> I think this means my atoms are not in the same positions across
>>>>>>>>> different cores, but I can’t figure out how this happened.  Do you have any
>>>>>>>>> suggestions?
>>>>>>>>> Thank you
>>>>>>>>> Warren
>>>>>>>>> PhD Student
>>>>>>>>> Naval Postgraduate School
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> _______________________________________________
>>>>>>>>> gpaw-users mailing list
>>>>>>>>> gpaw-users at listserv.fysik.dtu.dk
>>>>>>>>> https://listserv.fysik.dtu.dk/mailman/listinfo/gpaw-users
>>>>>>> 
>>>>>>> 
>>>> 
>>>> 
>>> 
>>> 
>>> 
>> 
>> 



