[gpaw-users] general comment on memory leaks.

Ask Hjorth Larsen asklarsen at gmail.com
Fri Jan 15 15:33:46 CET 2016


Hmmm.  (In this case, although the text file contains little
information, it does say something about *when* the calculation
crashes, which is still a piece of information.)

Previous large-scale FD calculations have been performed with the
RMM-DIIS solver.  The default Davidson solver is not as scalable
because it does not allow band parallelization.  This system looks
small enough that it should work without band parallelization.
Nevertheless, I suggest trying RMM-DIIS, since we have used it for
systems of 1000+ atoms and 10000+ electrons.

from gpaw import GPAW
from gpaw.eigensolvers import RMM_DIIS

calc = GPAW(eigensolver=RMM_DIIS(keep_htpsit=False), ...)

Setting keep_htpsit=False gives a performance and memory advantage for
large systems but is slower for 'normal'-sized systems.

If this works, then we should probably look into the scalability of
the Davidson solver.

For large systems, depending on your needs for numerical accuracy, you
may be able to benefit from LCAO mode (typically: mode='lcao',
basis='dzp').  With LCAO mode, this calculation would fit easily
within a single node of a normal supercomputer.
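
As a rough sketch of what that could look like (the xc, kpts and txt
values below are only illustrative placeholders, not taken from your
script):

from gpaw import GPAW

# LCAO mode with a double-zeta polarized basis set.
# xc, kpts and txt are placeholder choices for illustration only.
calc = GPAW(mode='lcao',
            basis='dzp',
            xc='PBE',
            kpts=(2, 2, 1),
            txt='relax_lcao.txt')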

Another thing is that you do not specify nbands.  According to a
dry-run (python <script.py> --dry-run), the calculation will have 1535
electrons and 1152 bands.  850 bands should be more than enough, even
with RMM-DIIS, which likes to have many extra bands.  This will reduce
memory consumption proportionally, although it would not be the root
cause of the problem.
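
A sketch of how the two suggestions combine (again, the txt filename
is just a placeholder):

from gpaw import GPAW
from gpaw.eigensolvers import RMM_DIIS

# 850 bands instead of the 1152 chosen by default for this system,
# together with the RMM-DIIS settings suggested above.
calc = GPAW(eigensolver=RMM_DIIS(keep_htpsit=False),
            nbands=850,
            txt='relax.txt')

You can then re-check the memory estimate with a dry-run
(python <script.py> --dry-run) before submitting the real job.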

Best regards
Ask

2016-01-15 14:58 GMT+01:00 abhishek khetan <askhetan at gmail.com>:
> So there are two kinds of output. Mostly, it's like this:
>   ___ ___ ___ _ _ _
>  |   |   |_  | | | |
>  | | | | | . | | | |
>  |__ |  _|___|_____|  0.12.0.13279
>  |___|_|
>
> (really, it doesn't print anything beyond this). And sometimes it goes a bit
> further:
>   ___ ___ ___ _ _ _
>  |   |   |_  | | | |
>  | | | | | . | | | |
>  |__ |  _|___|_____|  0.12.0.13279
>  |___|_|
>
> User:   ak498084 at linuxitvc08.rz.RWTH-Aachen.DE
> Date:   Sun Jan 10 22:28:09 2016
> Arch:   x86_64
> Pid:    24578
> Python: 2.7.9
> gpaw:   /home/ak498084/Utility/GPAW/gpaw_devel/gpaw-0.12/gpaw
> _gpaw:
> /home/ak498084/Utility/GPAW/gpaw_devel/gpaw-0.12/build/bin.linux-x86_64-2.7/gpaw-python
> ase:    /home/ak498084/Utility/GPAW/gpaw_devel/ase/ase (version 3.10.0)
> numpy:
> /usr/local_rwth/sw/python/2.7.9/x86_64/lib/python2.7/site-packages/numpy
> (version 1.9.1)
> scipy:
> /usr/local_rwth/sw/python/2.7.9/x86_64/lib/python2.7/site-packages/scipy
> (version 0.15.1)
> units:  Angstrom and eV
> cores:  84
>
> Memory estimate
> ---------------
> Process memory now: 77.73 MiB
> Calculator  1476.36 MiB
>     Density  21.41 MiB
>         Arrays  6.10 MiB
>         Localized functions  13.55 MiB
>         Mixer  1.76 MiB
>     Hamiltonian  8.87 MiB
>         Arrays  4.53 MiB
>         XC  0.00 MiB
>         Poisson  3.37 MiB
>         vbar  0.98 MiB
>     Wavefunctions  1446.08 MiB
>         Arrays psit_nG  406.12 MiB
>         Eigensolver  610.42 MiB
>         Projections  1.57 MiB
>         Projectors  1.59 MiB
>         Overlap op  426.38 MiB
>
> But this happens only when I give it something of the order of 12+ GB per
> core for 84 cores.
>
> As I had mentioned in my earlier posts, the memory requirement for a
> similar system, which I was somehow able to get to convergence after a lot
> of similar seg-fault difficulties, typically looks like this:
>
> top - 23:35:13 up 5 days, 12:23,  0 users,  load average: 6.03, 6.04, 5.93
> Tasks: 705 total,   7 running, 698 sleeping,   0 stopped,   0 zombie
> Cpu(s): 25.1%us,  0.1%sy,  0.0%ni, 74.7%id,  0.1%wa,  0.0%hi,  0.0%si,
> 0.0%st
> Mem:    47.127G total,   11.120G used,   36.007G free,   36.012M buffers
> Swap:    0.000k total,    0.000k used,    0.000k free,  357.062M cached
>
>   PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
> 24578 ak498084  20   0 2123m 1.7g  22m R 100.0  3.6  67:00.30 gpaw-python
> 24579 ak498084  20   0 2031m 1.6g  21m R 100.0  3.4  67:06.94 gpaw-python
> 24580 ak498084  20   0 2008m 1.6g  20m R 100.0  3.3  67:06.98 gpaw-python
> 24581 ak498084  20   0 2008m 1.6g  21m R 100.0  3.3  67:07.00 gpaw-python
> 24582 ak498084  20   0 2065m 1.6g  20m R 100.0  3.4  66:59.07 gpaw-python
> 24583 ak498084  20   0 2009m 1.6g  20m R 100.0  3.3  67:05.41 gpaw-python
> 23787 ak498084  20   0 88292 5504 1636 S  0.5  0.0   0:19.07 res
> 24390 ak498084  25   5 19676 1784  924 R  0.1  0.0   0:06.01 top
> 24452 ak498084  20   0 57300 7176 3432 S  0.1  0.0   0:00.35 mpiexec
> 23992 ak498084  20   0  9396 1304  940 S  0.0  0.0   0:00.00 1452461243.2555
> 23996 ak498084  20   0 11344 1320 1092 S  0.0  0.0   0:00.00 sh
> 24218 ak498084  20   0 18792 2184 1184 S  0.0  0.0   0:00.09 zsh
>
> What you see is the memory required by 6 (of 84) processes when the
> job is running properly and some ionic/electronic relaxations have been
> completed. This is when I provide about 8+ GB per core. I really don't
> understand why the seg fault arises when the actual requirement is so
> modest.
>
> Best,
>
>
>
> On Fri, Jan 15, 2016 at 2:26 PM, Ask Hjorth Larsen <asklarsen at gmail.com>
> wrote:
>>
>> The files are appreciated but the text output (stdout), being the most
>> important one, is still missing.
>>
>> Best regards
>> Ask
>>
>> 2016-01-15 14:09 GMT+01:00 abhishek khetan <askhetan at gmail.com>:
>> > Attached are the files. The cif file is actually a gpaw-converged output
>> > that I extracted using ase and then changed one atom in it. It is quite
>> > huge in size, though.
>> >
>> > Maybe such errors are related to my installation, although I cannot find
>> > anything wrong with it.
>> >
>> > Thanks and Best,
>> >
>> >
>> > On Thu, Jan 14, 2016 at 5:47 PM, Ask Hjorth Larsen <asklarsen at gmail.com>
>> > wrote:
>> >>
>> >> Please attach both input script and text output.
>> >>
>> >> Best regards
>> >> Ask
>> >>
>> >> 2016-01-14 17:38 GMT+01:00 abhishek khetan <askhetan at gmail.com>:
>> >> > Dear gpaw developers,
>> >> >
>> >> > I have found that, in general, for large systems (> 150 atoms) or for
>> >> > memory-intensive methods like GW, there are always segfault errors of a
>> >> > similar kind. I have a scalapack-compiled working version of gpaw-0.12
>> >> > which passes all tests in the suite. For a small system, the various
>> >> > methods in gpaw run properly, but for bigger systems of the desired
>> >> > sizes of the same kind, gpaw fails with the exact same kind of error.
>> >> >
>> >> > gpaw-python:18622 terminated with signal 11 at PC=3d8d6acba8
>> >> > SP=7ffe9b9d47b0.  Backtrace:
>> >> >
>> >> > I have posted about this in the context of the GW method on the gpaw
>> >> > forums a couple of dozen times before, but I haven't seen anyone else
>> >> > report similar errors. Now I am encountering the same unsolved errors
>> >> > in even simple relaxation problems where the unit cell happens to be
>> >> > quite large. For slightly smaller cases where the systems do converge,
>> >> > I see that the memory requirements are actually very modest (1-2 GB
>> >> > per core for 60 cores).
>> >> >
>> >> > Any ideas/methods/procedures by which I can resolve this error as a
>> >> > user? Am I allowed to make a ticket on this, or request one, on the
>> >> > Trac?
>> >> >
>> >> > Thanks and Best,
>> >> >
>> >> > askhetan
>> >> >
>> >
>> >
>> >
>> >
>> > --
>> > || radhe radhe ||
>> >
>> > abhishek
>> >
>
>
>
>
> --
> || radhe radhe ||
>
> abhishek
>
> _______________________________________________
> gpaw-users mailing list
> gpaw-users at listserv.fysik.dtu.dk
> https://listserv.fysik.dtu.dk/mailman/listinfo/gpaw-users


More information about the gpaw-users mailing list