[gpaw-users] GPAW Help
Tomlinson, Warren (CDR)
wwtomlin at nps.edu
Wed Mar 2 20:20:05 CET 2016
All-
Thanks for the help. I tried setting the environment variable MKL_CBWR as suggested, but still got the error (after three SCF cycles). I also made a number-by-number comparison of all positions on all CPUs against the reference CPU (rank 0), e.g.:
For every position vector, comparing the reference against the other CPUs:
    same = True
    for i in range(3):
        if ref[i] != allpos[i]:
            same = False
The same variable was never set to False. I also compared chemical symbols, PBCs, cells, and the compound symbol; all were identical across all CPUs. Is there perhaps something else I should check?
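For reference, a fuller field-by-field comparison along these lines might look like the sketch below. It assumes each rank's dump has been loaded into a plain dict of numpy arrays; the field names 'positions', 'numbers', 'cell' and 'pbc' are my own choice, mirroring what an ase.Atoms object carries.

```python
import numpy as np

def compare_atoms_data(ref, other, tol=1e-8):
    """Compare two per-rank dumps field by field; return mismatch messages.

    ref/other are plain dicts with 'positions' (N x 3), 'numbers' (N,),
    'cell' (3 x 3) and 'pbc' (3,) -- hypothetical field names mirroring
    what an ase.Atoms object carries.
    """
    errs = []
    # Positions are compared against a tolerance
    dpos = np.abs(ref['positions'] - other['positions']).max()
    if dpos > tol:
        errs.append('positions differ by %s' % repr(dpos))
    # Cell, atomic numbers and pbc are compared exactly (a == b)
    if not np.array_equal(ref['cell'], other['cell']):
        errs.append('cells differ by %s' % repr(ref['cell'] - other['cell']))
    if not np.array_equal(ref['numbers'], other['numbers']):
        errs.append('atomic numbers differ')
    if not np.array_equal(ref['pbc'], other['pbc']):
        errs.append('pbc differs')
    return errs
```

Using repr() in the messages keeps full floating-point precision, so even a 1e-37 difference in an exactly-compared field would show up.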
Thanks for all the help. I realize there might not be an answer to this right now, and that’s OK. I’m moving forward with LBFGS and so far it’s working just fine.
Thanks
Warren
> On Mar 1, 2016, at 12:33 PM, Ask Hjorth Larsen <asklarsen at gmail.com> wrote:
>
> As far as I can see, it should be printing 1.2345e-7 or something like
> that if the positions were in fact different. The tolerance is 1e-8.
> Which other property could possibly be wrong? Is the cell off by
> 1e-37 in one of them? That would be enough to cause that error,
> because those are compared as a == b. But then why would it happen at
> iteration three? I find it highly unlikely that, say, the number of
> atoms suddenly differs by something :).
>
> Anyway, Warren, if you could similarly compare the other properties of
> each dumped atoms object, then it would be very useful. Also, printing
> repr(errs) might be more appropriate, since that gives you full
> precision (but it should still have shown an actual error if there were
> one).
>
> Another thing we can do is to make a wrapper for ASE optimizers which
> broadcasts the work of rank 0 to all ranks. That should completely
> prevent such an error from arising from within ASE (at least for the
> positions), but it should be used with caution I guess.
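A minimal sketch of such a wrapper, under stated assumptions: `world` is any communicator with a `rank` attribute and an in-place `broadcast(array, root)` method (gpaw.mpi.world has this shape), and `opt` is an ASE-style optimizer exposing `atoms` and `step(forces)`; the class name is hypothetical.

```python
import numpy as np

class BroadcastOptimizer:
    """Run the real optimizer only on rank 0 and broadcast the resulting
    positions, so every rank ends up with bitwise-identical coordinates.

    Sketch only: `world` needs a `rank` attribute and an in-place
    `broadcast(array, root)` method; `opt` is an ASE-style optimizer
    exposing `atoms` and `step(forces)`.
    """

    def __init__(self, opt, world):
        self.opt = opt
        self.world = world

    def step(self, forces):
        if self.world.rank == 0:
            # Only the master actually moves the atoms
            self.opt.step(forces)
        pos = self.opt.atoms.get_positions()
        self.world.broadcast(pos, 0)  # overwrite with rank 0's positions
        self.opt.atoms.set_positions(pos)
```

This guards only the positions; any other state the optimizer keeps (e.g. its Hessian approximation) stays rank-local, which is fine as long as only rank 0's copy is ever used.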
>
> Best regards
> Ask
>
> 2016-03-01 9:57 GMT+01:00 Jussi Enkovaara <jussi.enkovaara at csc.fi>:
>> Hi all,
>> the problem is most likely related to the fact (as Ask already mentioned)
>> that modern optimized libraries (Intel MKL being a prime example) do not
>> necessarily produce the same output even with bit-identical input; the result
>> may depend, for example, on how memory allocations are aligned. See e.g.
>> https://software.intel.com/en-us/articles/getting-reproducible-results-with-intel-mkl/
>>
>> Symmetry is not the key issue. In systems with symmetry there can be
>> degenerate eigenvectors, and tiny numerical differences can produce
>> completely different linear combinations of eigenvectors, which can amplify
>> the problem; but problems can arise even without symmetry, and therefore
>> rattle does not necessarily solve anything.
>>
>> There has been some effort to solve the problems of numerical
>> reproducibility in GPAW (e.g. atomic positions returned from ASE are not
>> required to be bitwise identical), but apparently some bugs still
>> remain.
>>
>> For MKL, one could try to enforce numerical reproducibility by setting the
>> environment variable MKL_CBWR. Suitable values may depend on the MKL
>> version, but one could start with
>>
>> export MKL_CBWR=AVX
>>
>> This can lead to some performance degradation.
>>
>> Best regards,
>> Jussi
>>
>>
>>
>> On 2016-03-01 09:57, Torsten Hahn wrote:
>>>
>>> Hey all,
>>>
>>>
>>> Sometimes I experience similar errors. We once thought we had tracked it
>>> down to an erroneous MPI implementation (Intel MPI). However, some people in my
>>> group still see the same error with OpenMPI, and to be honest we have no
>>> idea where it comes from. It looks like on some CPUs there is sometimes a
>>> very small numerical error in the atomic positions. This error never
>>> happens in non-MPI calculations.
>>>
>>> Would be really nice to track that down.
>>>
>>> Best,
>>> Torsten.
>>>
>>>> Am 29.02.2016 um 18:42 schrieb Tomlinson, Warren (CDR)
>>>> <wwtomlin at nps.edu>:
>>>>
>>>> Ask-
>>>> Thanks for the help. I tried running with atoms.rattle as well
>>>> as the hack you sent me. The exact same problem persists: three SCF
>>>> cycles complete and then the error pops up. I have had success with
>>>> LBFGS. There's no reason why I shouldn't be OK using that optimizer,
>>>> correct? It is odd, though, that BFGS can't make it past three steps.
>>>> Thanks
>>>> Warren
>>>>
>>>>> On Feb 26, 2016, at 11:47 AM, Ask Hjorth Larsen <asklarsen at gmail.com>
>>>>> wrote:
>>>>>
>>>>> I realize that more symmetry breaking might be necessary depending on
>>>>> how some things are implemented. You can try with this slightly
>>>>> symmetry-breaking hack:
>>>>>
>>>>> http://dcwww.camd.dtu.dk/~askhl/files/bfgshack.py
>>>>>
>>>>> If push comes to shove and we cannot guess what the problem is, try
>>>>> reducing it in size as much as possible. As few cores as possible,
>>>>> and as rough parameters as possible.
>>>>>
>>>>> Best regards
>>>>> Ask
>>>>>
>>>>> 2016-02-26 20:40 GMT+01:00 Ask Hjorth Larsen <asklarsen at gmail.com>:
>>>>>>
>>>>>> Hi Warren
>>>>>>
>>>>>> 2016-02-26 19:40 GMT+01:00 Tomlinson, Warren (CDR) <wwtomlin at nps.edu>:
>>>>>>>
>>>>>>> Ask-
>>>>>>> Thank you for your help. I reran with the --debug option and
>>>>>>> also ran with 36 cores. Both still failed with the same synchronization
>>>>>>> problem. I have all 144 synchronize_atoms_r##.pckl files, but I’m not sure
>>>>>>> exactly what to do with them.
>>>>>>>
>>>>>>> On a related note, I ran the 680 atom structure with
>>>>>>> QuasiNewton instead of BFGS and it worked. So I’m guessing that’s a big
>>>>>>> clue.
>>>>>>
>>>>>>
>>>>>> That's interesting. BFGS calculates eigenvectors. Sometimes in
>>>>>> exactly symmetric systems, different cores can get different results
>>>>>> even though they perform the same mathematical operation, typically
>>>>>> due to aggressive BLAS stuff. They will differ very little, but they
>>>>>> can order eigenvalues/vectors differently and maybe end up doing
>>>>>> different things.
>>>>>>
>>>>>> Try doing atoms.rattle(stdev=1e-12) and see if it runs. Of course,
>>>>>> the optimization should be robust against that sort of problem, so we
>>>>>> would have to look into it even if it runs.
>>>>>>
>>>>>>>
>>>>>>> On an unrelated note, I’m afraid I have very little experience
>>>>>>> doing this kind of thing, so I’m not surprised that I have not correctly
>>>>>>> set the ScaLAPACK parameters. I simply used the defaults based on what I
>>>>>>> found at the bottom of the GPAW “Parallel runs” page:
>>>>>>> mb = 64
>>>>>>> m = floor(sqrt(bands/mb))
>>>>>>> n = m
>>>>>>> There are 2360 bands in the calculation, so that’s where I
>>>>>>> came up with 'sl_default': (6, 6, 64). I would appreciate any insight you can
>>>>>>> give me on how to get the ScaLAPACK options set correctly.
>>>>>>
>>>>>>
>>>>>> It is the number of atomic orbitals, not the number of bands, that
>>>>>> determines how big the ScaLAPACK problem is in LCAO mode. (6, 6, 64) might
>>>>>> be a good ScaLAPACK setting on 36 cores, because the 6 x 6 CPU grid
>>>>>> matches the 36 cores (with one k-point/spin), but if you use more cores,
>>>>>> you should increase the CPU grid to the maximum available. You can
>>>>>> also use sl_auto=True to choose something which is probably
>>>>>> non-horrible. For a system of this size, there is no point in doing an
>>>>>> LCAO calculation without using the maximum possible number of cores
>>>>>> for ScaLAPACK, because the ScaLAPACK operations are by far the most
>>>>>> expensive.
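As a rough illustration of the sizing rule discussed above, here is a sketch (not GPAW code; the helper name and the capping heuristic are my own) that applies m = n = floor(sqrt(size/mb)) and caps the grid at the available cores:

```python
from math import floor, sqrt

def scalapack_grid(norbitals, ncores, mb=64):
    """Rule-of-thumb (m, n, mb) ScaLAPACK descriptor.

    Follows m = n = floor(sqrt(size / mb)) from the GPAW 'Parallel runs'
    page, where size should be the number of atomic orbitals in LCAO
    mode (not the number of bands), capped so the m x n CPU grid never
    exceeds the cores actually available.  The cap is my own heuristic.
    """
    m = int(floor(sqrt(norbitals / mb)))
    m = min(m, int(floor(sqrt(ncores))))  # don't request cores we lack
    m = max(m, 1)                          # grid must be at least 1 x 1
    return (m, m, mb)
```

Plugging in 2360 (the band count used above, though the orbital count is what actually matters) on 144 cores reproduces the (6, 6, 64) setting; passing sl_auto=True to the parallel dict remains the simpler option.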
>>>>>>
>>>>>> I will have a look at the documentation and maybe update it.
>>>>>>
>>>>>> Best regards
>>>>>> Ask
>>>>>>
>>>>>>>
>>>>>>> Thanks
>>>>>>> Warren
>>>>>>>
>>>>>>>> On Feb 26, 2016, at 8:35 AM, Ask Hjorth Larsen <asklarsen at gmail.com>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>> Hello
>>>>>>>>
>>>>>>>> This sounds strange. Can you re-run it with --debug, please (e.g.,
>>>>>>>> gpaw-python script.py --debug)? Then we can see in which way they
>>>>>>>> differ, whether it's due to slight numerical imprecision or position
>>>>>>>> values that are complete garbage.
>>>>>>>>
>>>>>>>> On an unrelated note, does it not run with just 36 cores for the 680
>>>>>>>> atoms? Also, the scalapack parameters should probably be set to use
>>>>>>>> all available cores.
>>>>>>>>
>>>>>>>> Best regards
>>>>>>>> Ask
>>>>>>>>
>>>>>>>> 2016-02-25 8:21 GMT+01:00 Tomlinson, Warren (CDR) <wwtomlin at nps.edu>:
>>>>>>>>>
>>>>>>>>> Hello-
>>>>>>>>> I have been using GPAW for a couple of months now and have run
>>>>>>>>> into a persistent problem that I cannot figure out. I’m using a cluster
>>>>>>>>> with 3,456 nodes, each with 36 cores (Intel Xeon E5-2699v3 Haswell). I
>>>>>>>>> installed using MKL 11.2 and have Python 2.7.10 and NumPy 1.9.2. I used the
>>>>>>>>> following settings to relax a periodic cell containing 114 atoms
>>>>>>>>> (successfully, with 72 cores):
>>>>>>>>>
>>>>>>>>> cell = read('small.pdb')
>>>>>>>>> cell.set_pbc(1)
>>>>>>>>> cell.set_cell([[18.752, 0., 0.], [9.376, 16.239708, 0.],
>>>>>>>>>               [9.376, 5.413236, 15.310944]])
>>>>>>>>> calc = GPAW(mode='lcao',
>>>>>>>>>             gpts=(80, 80, 80),
>>>>>>>>>             xc='PBE',
>>>>>>>>>             poissonsolver=PoissonSolver(relax='GS', eps=1e-7),
>>>>>>>>>             parallel={'band': 2, 'sl_default': (3, 3, 64)},
>>>>>>>>>             basis='dzp',
>>>>>>>>>             mixer=Mixer(0.1, 5, weight=100.0),
>>>>>>>>>             occupations=FermiDirac(width=0.1),
>>>>>>>>>             maxiter=1000,
>>>>>>>>>             txt='67_sml_N_LCAO.out')
>>>>>>>>> cell.set_calculator(calc)
>>>>>>>>> opt = BFGS(cell)
>>>>>>>>> opt.run()
>>>>>>>>>
>>>>>>>>> When I try virtually the exact same options on a larger (cubic)
>>>>>>>>> cell:
>>>>>>>>>
>>>>>>>>> cell = read('big.pdb')
>>>>>>>>> cell.set_cell([26.52, 26.52, 26.52])
>>>>>>>>> cell.set_pbc(1)
>>>>>>>>> calc_LCAO = GPAW(mode='lcao',
>>>>>>>>>                  gpts=(144, 144, 144),
>>>>>>>>>                  xc='PBE',
>>>>>>>>>                  poissonsolver=PoissonSolver(relax='GS', eps=1e-7),
>>>>>>>>>                  parallel={'band': 2, 'sl_default': (6, 6, 64)},
>>>>>>>>>                  basis='dzp',
>>>>>>>>>                  mixer=Mixer(0.1, 5, weight=100.0),
>>>>>>>>>                  occupations=FermiDirac(width=0.1),
>>>>>>>>>                  txt='67_Full.out',
>>>>>>>>>                  maxiter=1000)
>>>>>>>>> etc…
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> I get an error after three SCF steps. The larger cell has 680 atoms
>>>>>>>>> and I used 144 cores. The error I get is below:
>>>>>>>>>
>>>>>>>>> rank=036 L00: Traceback (most recent call last):
>>>>>>>>> rank=036 L01: File
>>>>>>>>> "/p/home/wwtomlin/PhaseII/Proj1/INITIAL/67_Full_C.py", line 44, in <module>
>>>>>>>>> rank=036 L02: opt.run()
>>>>>>>>> rank=036 L03: File
>>>>>>>>> "/p/home/wwtomlin/ase/ase/optimize/optimize.py", line 148, in run
>>>>>>>>> rank=036 L04: f = self.atoms.get_forces()
>>>>>>>>> rank=036 L05: File "/p/home/wwtomlin/ase/ase/atoms.py", line 688,
>>>>>>>>> in get_forces
>>>>>>>>> rank=036 L06: forces = self._calc.get_forces(self)
>>>>>>>>> rank=036 L07: File "/p/home/wwtomlin/gpaw/gpaw/aseinterface.py",
>>>>>>>>> line 78, in get_forces
>>>>>>>>> rank=036 L08:
>>>>>>>>> force_call_to_set_positions=force_call_to_set_positions)
>>>>>>>>> rank=036 L09: File "/p/home/wwtomlin/gpaw/gpaw/paw.py", line 272,
>>>>>>>>> in calculate
>>>>>>>>> rank=036 L10: self.set_positions(atoms)
>>>>>>>>> rank=036 L11: File "/p/home/wwtomlin/gpaw/gpaw/paw.py", line 328,
>>>>>>>>> in set_positions
>>>>>>>>> rank=036 L12: spos_ac = self.initialize_positions(atoms)
>>>>>>>>> rank=036 L13: File "/p/home/wwtomlin/gpaw/gpaw/paw.py", line 314,
>>>>>>>>> in initialize_positions
>>>>>>>>> rank=036 L14: self.synchronize_atoms()
>>>>>>>>> rank=036 L15: File "/p/home/wwtomlin/gpaw/gpaw/paw.py", line 1034,
>>>>>>>>> in synchronize_atoms
>>>>>>>>> rank=036 L16: mpi.synchronize_atoms(self.atoms, self.wfs.world)
>>>>>>>>> rank=036 L17: File "/p/home/wwtomlin/gpaw/gpaw/mpi/__init__.py",
>>>>>>>>> line 714, in synchronize_atoms
>>>>>>>>> rank=036 L18: err_ranks)
>>>>>>>>> rank=036 L19: ValueError: ('Mismatch of Atoms objects. In debug
>>>>>>>>> mode, atoms will be dumped to files.', array([ 5, 9, 13, 17, 18, 19,
>>>>>>>>> 20, 21, 22, 23, 25, 27, 28,
>>>>>>>>> rank=036 L20: 32, 33, 35, 37, 40, 41, 43, 44, 46,
>>>>>>>>> 50, 51, 52, 53,
>>>>>>>>> rank=036 L21: 54, 60, 62, 63, 64, 65, 68, 71, 72,
>>>>>>>>> 74, 80, 82, 85,
>>>>>>>>> rank=036 L22: 87, 90, 91, 94, 97, 98, 99, 100, 101,
>>>>>>>>> 104, 106, 107, 110,
>>>>>>>>> rank=036 L23: 111, 115, 116, 118, 123, 125, 129, 130, 137,
>>>>>>>>> 138, 139, 142]))
>>>>>>>>> GPAW CLEANUP (node 36): <type 'exceptions.ValueError'> occurred.
>>>>>>>>> Calling MPI_Abort!
>>>>>>>>>
>>>>>>>>> ————
>>>>>>>>>
>>>>>>>>> I think this means my atoms are not in the same positions across
>>>>>>>>> different cores, but I can’t figure out how this happened. Do you have any
>>>>>>>>> suggestions?
>>>>>>>>> Thank you
>>>>>>>>> Warren
>>>>>>>>> PhD Student
>>>>>>>>> Naval Postgraduate School
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> _______________________________________________
>>>>>>>>> gpaw-users mailing list
>>>>>>>>> gpaw-users at listserv.fysik.dtu.dk
>>>>>>>>> https://listserv.fysik.dtu.dk/mailman/listinfo/gpaw-users
>>>>>>>
>>>>>>>
>>>>
>>>>
>>>
>>>
>>>
>>
>>