[gpaw-users] BSSE parallel issue

Glen Jenness glenjenness at gmail.com
Wed Sep 11 04:09:39 CEST 2013


Ask,
If you take a look at rhodium.py, I first optimize with LBFGS, and that
goes fine.  Once I hit the part with:

calc.set(setups={'Rh': 'ghost', 'C': 'paw', 'O': 'paw'})
rhodium.set_calculator(calc)
e_co = rhodium.get_potential_energy()
parprint('e_co = %s' % e_co)

That's where I hit trouble (as shown in 3nodes.out).  From 3nodes.txt we
see that it starts the memory estimate... then nothing.  No ghost atoms
are set up, nothing.  It will hang there until I kill it; I once had it
run overnight and it stayed at that point for ~20 hours.

I did what you suggested, and for the ghost-atom calculations I redirected
the output to a different .txt file.  A tarball of everything is attached.
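
Concretely, the splitting just amounts to pointing the calculator at a new
log file before the ghost step (the file name below is only illustrative):

# Send the output of the ghost-atom (BSSE) step to its own log file so it
# is not appended to the relaxation log; 'rhodium_ghost.txt' is just an
# example name.  The new file is then the one to grep for 'Ghost'.
calc.set(txt='rhodium_ghost.txt',
         setups={'Rh': 'ghost', 'C': 'paw', 'O': 'paw'})
rhodium.set_calculator(calc)
e_co = rhodium.get_potential_energy()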


On Sun, Sep 8, 2013 at 6:43 PM, Ask Hjorth Larsen <asklarsen at gmail.com> wrote:

> Hello Glen
>
> None of the calculations in 3nodes.txt contain any ghost atoms (try
> grepping the output files for 'Ghost').  Are you quite sure that the
> crash happens after the first relaxation is done?  You can try setting
> a new txt after the relaxation is done so it writes new stuff into a
> different file.
>
> While the segfault is nasty, I think we should solve one problem at a
> time - so let's forget scalapack for now and concentrate on the other
> stuff.
>
> Is the problem reproducible across multiple identical runs?
>
> Also: Parallelizing over 3 nodes with 5 k-points and nothing more than
> ~20 atoms is very inefficient.  For a system of this size you should
> not be using more than one node.  A single node should get you well
> beyond 200 atoms on most computers, even with 5 k-points (although at
> 200 atoms you could probably make effective use of approximately as
> many nodes as there are irreducible-BZ k-points).  But we should keep
> using 3 nodes for now in order to figure out what the problem is, of
> course.
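> 
> For later, once the bug is sorted out: on a single node the
> parallelization can be steered towards the k-points explicitly with
> GPAW's parallel keyword (the value of 'kpt' below is only an example):
> 
> from gpaw import GPAW
> 
> # Illustrative only: split the processes into groups over k-points/spins
> # and let GPAW handle the domain decomposition inside each group.
> PL = {'kpt': 4}
> calc = GPAW(mode='lcao', basis='dzp', kpts=(5, 5, 1), parallel=PL)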
>
> Regards
> Ask
>
> 2013/9/8 Glen Jenness <glenjenness at gmail.com>:
> > Ask,
> > Sorry it's a bit late (I moved from WI to DE in the past week), but here
> > is the information you requested.  rhodium.py is the actual script; it's
> > just CO on a Rh(111) surface with 4 layers.  For 2nodes and 3nodes, I had
> > PL = dict(), and then did a run with PL = {'sl_auto': True}.  2nodes was a
> > successful run; 3nodes stalled.  Once it got to that point I let it run
> > for ~5 hours, and it didn't move.
> >
> > Rhodium.out gives the full errors from having sl_auto set to True.
> >
> > Thanks!
> > Glen
> >
> >
> > On Sun, Sep 1, 2013 at 8:39 AM, Ask Hjorth Larsen <asklarsen at gmail.com>
> > wrote:
> >>
> >> Also: Please attach full scripts (written so as to demonstrate the
> >> error) and logfiles so I don't have to guess which parameters to
> >> change.  For example I don't know how many CPUs you were using.
> >>
> >> Regards
> >> Ask
> >>
> >> 2013/9/1 Ask Hjorth Larsen <asklarsen at gmail.com>:
> >> > Hello
> >> >
> >> > It works for me.
> >> >
> >> > Note that 17 atoms is not enough for scalapack to be a good idea.
> >> >
> >> > The first parameter in your mixer should be 0.04, not 0.4.
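> >> >
> >> > That is, the mixer line in your script would read something like
> >> > (keeping your other values):
> >> >
> >> > # beta (the damping, first parameter) lowered from 0.40 to 0.04
> >> > mix = Mixer(beta=0.04, nmaxold=45, weight=50.0)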
> >> >
> >> > Best regards
> >> > Ask
> >> >
> >> >
> >> > 2013/9/1 Glen Jenness <glenjenness at gmail.com>:
> >> >> Hi GPAW users!
> >> >> I ran into a curious problem when running GPAW in parallel while
> >> >> specifying
> >> >> ghost centers for a BSSE correction.
> >> >>
> >> >> I am able to run my dimer system (in this case a CO molecule on a
> >> >> Rh(111) surface), but when I then specify calc.set(setups={'Rh':
> >> >> 'ghost'} etc.), it enters the memory estimate part and then freezes
> >> >> if I run on more than 1 node.
> >> >>
> >> >> A colleague suggested setting the parallel option sl_auto to True,
> >> >> but doing so gives:
> >> >>
> >> >> ] [27] gpaw-python(PyObject_Call+0x5d) [0x49128d]
> >> >> [compute-0-6:29024] [28] gpaw-python(PyEval_EvalFrameEx+0x399d)
> >> >> [0x50dbfd]
> >> >> [compute-0-6:29024] [29] gpaw-python(PyEval_EvalCodeEx+0x89b)
> >> >> [0x511ffb]
> >> >> [compute-0-6:29024] *** End of error message ***
> >> >> [compute-0-9.local:17479] [0,0,0] ORTE_ERROR_LOG: Timeout in file
> >> >> base/pls_base_orted_cmds.c at line 275
> >> >> [compute-0-9.local:17479] [0,0,0] ORTE_ERROR_LOG: Timeout in file
> >> >> pls_tm_module.c at line 572
> >> >> [compute-0-9.local:17479] [0,0,0] ORTE_ERROR_LOG: Timeout in file
> >> >> errmgr_hnp.c at line 90
> >> >> mpirun noticed that job rank 0 with PID 17481 on node compute-0-9
> >> >> exited on
> >> >> signal 11 (Segmentation fault).
> >> >> [compute-0-9.local:17479] [0,0,0] ORTE_ERROR_LOG: Timeout in file
> >> >> base/pls_base_orted_cmds.c at line 188
> >> >> [compute-0-9.local:17479] [0,0,0] ORTE_ERROR_LOG: Timeout in file
> >> >> pls_tm_module.c at line 603
> >> >>
> >> >>
> >> >> Any idea what could cause either issue?
> >> >>
> >> >> Thanks!
> >> >>
> >> >> My Python input is:
> >> >>
> >> >> from ase.atoms import Atoms
> >> >> from ase.lattice.surface import fcc111, add_adsorbate
> >> >> from ase.constraints import FixAtoms
> >> >> from ase.optimize.lbfgs import LBFGS
> >> >> from ase.parallel import parprint
> >> >>
> >> >> from gpaw import GPAW, Mixer, FermiDirac
> >> >>
> >> >> PL = {'sl_auto': True}
> >> >>
> >> >> verb = False
> >> >> mix = Mixer(beta=0.40, nmaxold=45, weight=50.0)
> >> >> occ = FermiDirac(0.1)
> >> >>
> >> >> calc = GPAW(mode='lcao', basis='dzp', txt='rhodium.txt', kpts=(5,5,1),
> >> >> occupations=occ, xc='PBE', verbose=verb, mixer=mix, parallel=PL)
> >> >>
> >> >> rhodium = fcc111('Rh', (1,1,4), vacuum=8.0)
> >> >> constraint = FixAtoms([0, 1])
> >> >> rhodium.set_constraint(constraint)
> >> >> rhodium *= (2,2,1)
> >> >>
> >> >> co = Atoms('CO', positions=[(0,0,0), (0,0,1.14)])
> >> >> add_adsorbate(rhodium, co, 1.8, position='ontop')
> >> >>
> >> >> rhodium.set_calculator(calc)
> >> >>
> >> >> opt = LBFGS(rhodium, trajectory='co-rhodium.traj')
> >> >> opt.run(fmax=0.01)
> >> >> e_ads = rhodium.get_potential_energy()
> >> >> parprint('e_ads = %f' % e_ads)
> >> >>
> >> >> calc.set(setups={'Rh': 'ghost', 'C': 'paw', 'O': 'paw'})
> >> >> rhodium.set_calculator(calc)
> >> >> e_co = rhodium.get_potential_energy()
> >> >> parprint('e_co = %s' % e_co)
> >> >>
> >> >> calc.set(setups={'Rh': 'paw', 'C': 'ghost', 'O': 'ghost'})
> >> >> rhodium.set_calculator(calc)
> >> >> e_surf = rhodium.get_potential_energy()
> >> >> parprint('e_surf = %s' % e_surf)
> >> >>
> >> >> parprint('E_BE = %f' % ( e_ads - e_co - e_surf))
> >> >>
> >> >> --
> >> >> Dr. Glen Jenness
> >> >> Schmidt Group/Morgan Group
> >> >> Department of Chemistry/Materials Science and Engineering (MSAE)
> >> >> University of Wisconsin - Madison
> >> >>
> >
> >
> >
> >
> > --
> > Dr. Glen Jenness
> > Schmidt Group/Morgan Group
> > Department of Chemistry/Materials Science and Engineering (MSAE)
> > University of Wisconsin - Madison
>



-- 
Dr. Glen Jenness
Schmidt Group/Morgan Group
Department of Chemistry/Materials Science and Engineering (MSAE)
University of Wisconsin - Madison
[Attachment: for_ask.tar (application/x-tar, 133120 bytes)]
<http://listserv.fysik.dtu.dk/pipermail/gpaw-users/attachments/20130910/34a399f9/attachment-0001.tar>

