[gpaw-users] Calculation freezes after memory estimate

Juho Arjoranta juho.arjoranta at helsinki.fi
Wed Jul 4 08:37:15 CEST 2012


Lainaus "Marcin Dulak" <Marcin.Dulak at fysik.dtu.dk>:

> On 07/03/12 12:17, Juho Arjoranta wrote:
>> Lainaus "Marcin Dulak"<Marcin.Dulak at fysik.dtu.dk>:
>>
>>> On 06/26/12 08:41, Juho Arjoranta wrote:
>>>> Hello,
>>>>
>>>> I have been converging the vacuum around a copper surface with two
>>>> bottom layers fixed and with 4 and 6 layers of copper everything went
>>>> fine. Now I have an 8 layer surface and the calculation freezes after
>>>> the evaluation of memory usage is made. The calculation for the
>>>> surface was restarted from a bulk calculation where the initial state
>>>> for the surface was set. The bulk was relaxed to make sure that the
>>>> structure is really correct.
>>> could freezing be ralated to scalapack?
>>> I remember similar problems with scalapack on, for example
>>> this hangs:
>>> https://trac.fysik.dtu.dk/projects/gpaw/browser/trunk/gpaw/test/parallel/scalapack_pdlasrt_hang.py
>>>
>>> Marcin
>> It would seem that the freezing was related to scalapack. I tried to
>> run the same calculations with the default eigensolver and without
>> scalapack and it seems to be working now. To get the density and wave
>> functions to converge I also made some modifications to the calculator:
> did you get different energies without/with scalapack?
> Do you know which version of scalapack do you use (on our systems
> scalapack version 2.0.1 seems to behave better than 1.8.0)?
> Can you run, on 4 cores:
> https://trac.fysik.dtu.dk/projects/gpaw/browser/trunk/gpaw/test/parallel/scalapack_mpirecv_crash.py
> https://trac.fysik.dtu.dk/projects/gpaw/browser/trunk/gpaw/test/parallel/scalapack_pdlasrt_hang.py
>
> Marcin
>
I will run some test to see whether the energies are the same with and  
without scalapack. I'm not sure what version of scalapack is in use  
but I will find it out.

I scalapack_mpirecv_crash.py with 4 cores and it crashed just before  
the memory estimate with a good old segmentation fault:

_pmii_daemon(SIGCHLD): [NID 01792] [c7-0c0s0n0] [Tue Jul  3 16:36:39  
2012] PE 2 exit signal Segmentation fault
[NID 01792] 2012-07-03 16:36:39 Apid 4555952: initiated application  
termination
Application 4555952 exit codes: 139
Application 4555952 resources: utime 0, stime 0

The other one, scalapack_pdlasrt_hang.py, has been running over night  
(17 hours now) and I'm not sure if it is frozen or not? How long  
should it take to finish with 4 cores?

Juho

>
>>
>> surface, calc = restart('bulk-8-initial.gpw',
>>                          txt = name + '.txt',
>>                          xc = 'PBE',
>>                          basis = 'szp(dzp)',
>>                          mixer=Mixer(beta=0.1, nmaxold=2, weight=100.0),
>>                          poissonsolver=PoissonSolver(nn=3, relax='GS',
>> eps=1e-12),
>>                          parallel={'domain': (1,2,7)},
>>                          kpts = (8, 8, 1),
>>                          h = 0.12)
>>
>> And for a different (but similiar system) I also increased the maximum
>> number of iterations for the Poisson solver as explained in this
>> thread:
>> https://listserv.fysik.dtu.dk/pipermail/gpaw-users/2010-October/000408.html
>>
>> Juho
>>
>>>> Initially when I used the same calculator options as for the 6 layer
>>>> surface the calculation for 8 layers froze. After that I have tried
>>>> less agressive mixing, different Poisson solvers, extracting the
>>>> single-zeta polarization basis set from the double-zeta polarization
>>>> basis sets, and eventually even a different eigensolver.
>>>>
>>>> These calculation were run at Louhi and the last one I tried was this one:
>>>>
>>>> from ase.parallel import paropen
>>>> from gpaw import restart, Mixer
>>>> from gpaw.poisson import PoissonSolver
>>>> from ase.constraints import FixAtoms
>>>> from ase.optimize import QuasiNewton
>>>> from numpy import zeros
>>>>
>>>> resultfile = paropen('8-layer-cg-results.txt', 'w')
>>>>
>>>> for vac in range(5, 15):
>>>>
>>>>          name = '8-layer-cg-%.i-vacuum' % vac
>>>>
>>>>          k = 8
>>>>          a = 3.643       # Lattice constant from the convergence
>>>> calculations
>>>>
>>>>          surface, calc = restart('bulk-8-initial.gpw',
>>>>                          txt = name + '.txt',
>>>>                          eigensolver = 'cg',
>>>>                          mixer=Mixer(beta=0.1, nmaxold=2, weight=100.0),
>>>>                          poissonsolver=PoissonSolver(nn=3, relax='J'),
>>>>                          basis='szp(dzp)',
>>>>                          parallel={'sl_default': (5,1,64),
>>>>                                    'domain': (1,1,5)},
>>>>                          kpts = (k, k, 1))
>>>>
>>>>          # Fixing of the two bottom layers
>>>>
>>>>          fix = 2
>>>>
>>>>          positions = surface.get_positions()
>>>>          number = positions.size / 3
>>>>
>>>>          array = zeros([number])
>>>>          for i in range(0, number):
>>>>                  if (surface.positions[i][2]<   fix * a / 2):
>>>>                          array[i] = 1
>>>>                  surface.set_tags(array)
>>>>          c = FixAtoms(mask=[atom.tag == 1 for atom in surface])
>>>>          surface.set_constraint(c)
>>>>
>>>>          surface.pbc = (1, 1, 0)                         # Set PBC
>>>> false in z-direction
>>>>          surface.center(axis = 2, vacuum = vac)          # Center the
>>>> system in z-direction and add vacuum
>>>>
>>>>          surface.set_calculator(calc)
>>>>
>>>>          # Relaxation of the surface
>>>>
>>>>          relax = QuasiNewton(surface, trajectory = name + '.traj')
>>>>          relax.run(fmax = 0.05)
>>>>
>>>> Equivalent code worked with 4 and 6 layers just fine. Then of course
>>>> there were no need for different eigensolver or basis etc. For some
>>>> reason the only error message I get, is Louhi sending me this to my
>>>> email:
>>>>
>>>> PBS Job Id: 1098436.sdb
>>>> Job Name:   8-layer-cg
>>>> Post job file processing error; job 1098436.sdb on host nid00143
>>>>
>>>>
>>>> Unable to copy file /var/spool/PBS/spool/1098436.sdb.OU to
>>>> nid00139://wrk/arjoran/8-layer-cg.o1098436
>>>> error from copy
>>>> nid00139: Connection refused
>>>> end error output
>>>> Output retained on that host in: /var/spool/PBS/undelivered/1098436.sdb.OU
>>>>
>>>> Any ideas how the get that running?
>>>>
>>>> Juho Arjoranta
>>>>
>>>>
>>>> _______________________________________________
>>>> gpaw-users mailing list
>>>> gpaw-users at listserv.fysik.dtu.dk
>>>> https://listserv.fysik.dtu.dk/mailman/listinfo/gpaw-users
>>>>
>>>
>>>
>>
>
>
> -- 
> ***********************************
>
> Marcin Dulak
> Technical University of Denmark
> Department of Physics
> Building 307, Room 229
> DK-2800 Kongens Lyngby
> Denmark
> Tel.: (+45) 4525 3157
> Fax.: (+45) 4593 2399
> email: Marcin.Dulak at fysik.dtu.dk
>
> ***********************************
>
>





More information about the gpaw-users mailing list