[gpaw-users] general comment on memory leaks.
abhishek khetan
askhetan at gmail.com
Wed Jan 20 11:20:39 CET 2016
And by EVERY SINGLE TIME, I mean I have run the exact same jobs twice or
thrice, for all the cases mentioned above, to check whether they crash or run.
On Wed, Jan 20, 2016 at 11:19 AM, abhishek khetan <askhetan at gmail.com>
wrote:
> I think I have figured out exactly where the problem lies, but not what is
> causing it.
>
> First, just to give you what our two clusters here are like (in case they
> may be of help):
>
> Cluster1:
> Chassis: 14x Dell PowerEdge C6100 *(i.e. 14 nodes on this
> chassis/cluster)*
> Processor/Node: 2x Intel Xeon X5670 (6-core) *(i.e. a total of 2x6=12
> cores per node)*
> Memory/Node: 48 GByte (12x 4 GByte, 1333 MHz) *(i.e. at least 3.5 GB of
> actual resident memory available per core)*
> Interconnect: Infiniband QDR Dual Port 40Gb/s (non-blocking)
> <http://gottfried.itv.rwth-aachen.de/wiki/lib/exe/fetch.php?media=computing:is5030_35.pdf>
> File System: Lustre
> Operating System: Scientific Linux 6.4
>
> Cluster2:
> Blades: 6x Dell PowerEdge M620 *(i.e. 6 nodes per chassis/cluster)*
> Processor/Blade: 2x Intel Xeon E5-2660v2 (10-core) *(i.e. a total of
> 2x10=20 cores per node)*
> Memory/Blade: 256 GByte *(i.e. at least 12.5 GB of actual resident
> memory available per core)*
> Interconnect: Infiniband FDR-10
> <http://www.mellanox.com/related-docs/whitepapers/WP_FDR_InfiniBand_is_Here.pdf>
> File System: Lustre
> Operating System: Scientific Linux 6.4
>
> I ran an experiment: some low-memory jobs (with kpts=1x1x1) on 12 and 24
> processors on Cluster1, and some higher-memory jobs (with kpts=1x1x2) on
> 20 and 40 processors on Cluster2.
>
> In both cases, when the jobs did not span more than one node (i.e. 12
> procs on Cluster1 for the low-memory jobs and 20 procs on Cluster2 for the
> high-memory jobs), they ran perfectly well EVERY SINGLE TIME.
>
> However, as I increased the number of processors from 12 (1 node) to 24 (2
> nodes) for the low-memory jobs on Cluster1, and from 20 (1 node) to 40 (2
> nodes) for the higher-memory jobs on Cluster2, the behaviour became totally
> erratic. Sometimes the jobs start; other times they give the same segfault
> error I described previously in this thread. Another interesting feature:
> the more processors (and therefore nodes) I run the jobs on, the harder it
> is to get the jobs to start. Put simply, the number of crashes was an
> exponentially increasing function of the number of nodes involved. As
> pseudo-scientific as this sounds, it is actually what is happening, and I
> have no clue why.
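For what it's worth, such an "exponential" pattern is roughly what one would expect if each inter-node link fails independently with some small probability: with n nodes there are n(n-1)/2 node pairs, so the chance that at least one link misbehaves grows rapidly with n. A back-of-the-envelope sketch (the per-link failure probability here is a made-up illustrative number, not anything measured on these clusters):

```python
# Chance that a job spanning n_nodes hits at least one bad inter-node
# link, assuming each of the n*(n-1)/2 node pairs fails independently
# with probability p_link (0.05 is purely illustrative, not measured).
def p_job_fails(n_nodes, p_link=0.05):
    n_links = n_nodes * (n_nodes - 1) // 2
    return 1.0 - (1.0 - p_link) ** n_links

for n in (1, 2, 3, 4):
    print(n, round(p_job_fails(n), 3))
```

Of course this only says the observation is *consistent* with flaky inter-node communication; it does not identify which layer (HCA, OFED stack, or MPI) is at fault.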
>
> This clearly indicates a problem with the inter-node communication on the
> cluster, because on single nodes there is no problem at all. I have
> provided the exact technical details above so that you can perhaps tell me
> whether this is a known problem with Infiniband FDR or QDR interconnects.
> Could there be a problem with my compilation? It seems not, because even
> on 3 or 4 nodes the jobs do start sometimes, if I am lucky.
>
> Any help is greatly appreciated.
>
>
> On Mon, Jan 18, 2016 at 7:35 PM, abhishek khetan <askhetan at gmail.com>
> wrote:
>
>> You're right, "memory leak" is the wrong description. I made the mistake
>> of invariably associating it with the segfault error, which is what it
>> actually is. I will run these tests and get back.
>>
>>
>> On Mon, Jan 18, 2016 at 6:32 PM, Ask Hjorth Larsen <asklarsen at gmail.com>
>> wrote:
>>
>>> Why are you so sure that there are memory leaks? So far we have only
>>> seen indications that a lot of memory is allocated.
>>>
>>> You could for example lower the grid spacing until it runs, then check
>>> if memory usage increases linearly with subsequent identical
>>> calculations. That would indicate a memory leak. If you do not
>>> observe this behaviour, then I don't know what you are seeing, but it
>>> is certainly not a memory leak!
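The linear-growth test described above can be scripted. Here is a minimal sketch, with a dummy workload standing in for the repeated identical GPAW calculation (the deliberately "leaky" list is an assumption for illustration; note ru_maxrss is reported in KiB on Linux but bytes on macOS):

```python
import resource

def peak_rss():
    # Peak resident set size of this process so far (KiB on Linux).
    return resource.getrusage(resource.RUSAGE_SELF).ru_maxrss

retained = []  # simulates state that leaky code never releases

def identical_calculation(leaky):
    scratch = [0.0] * 1000000  # stand-in for one SCF calculation
    if leaky:
        retained.append([0.0] * 1000000)  # never freed -> a "leak"

peaks = []
for _ in range(5):
    identical_calculation(leaky=True)
    peaks.append(peak_rss())

# Roughly flat peaks across identical runs -> no leak;
# steady growth from run to run -> a genuine leak.
print(peaks)
```

With the real calculation substituted in, a flat sequence of peaks would rule out a leak, exactly as described above.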
>>>
>>> 2016-01-18 13:26 GMT+01:00 abhishek khetan <askhetan at gmail.com>:
>>> > I tried using the cluster interactively, and it gives me the output
>>> > below. I couldn't make the r_memusage function work, but it is easily
>>> > visible that the memory requirements are quite modest. I do not know
>>> > why there is a segfault when I submit it to the regular cluster for
>>> > production jobs.
>>> >
>>> > ___ ___ ___ _ _ _
>>> > | | |_ | | | |
>>> > | | | | | . | | | |
>>> > |__ | _|___|_____| 0.12.0.13279
>>> > |___|_|
>>> >
>>> > User: ak498084 at linuxbmc0002.rz.RWTH-Aachen.DE
>>> > Date: Mon Jan 18 13:22:24 2016
>>> > Arch: x86_64
>>> > Pid: 20443
>>> > Python: 2.7.9
>>> > gpaw: /home/ak498084/Utility/GPAW/gpaw_devel/gpaw-0.12/gpaw
>>> > _gpaw:
>>> >
>>> /home/ak498084/Utility/GPAW/gpaw_devel/gpaw-0.12/build/bin.linux-x86_64-2.7/gpaw-python
>>> > ase: /home/ak498084/Utility/GPAW/gpaw_devel/ase/ase (version 3.10.0)
>>> > numpy:
>>> >
>>> /usr/local_rwth/sw/python/2.7.9/x86_64/lib/python2.7/site-packages/numpy
>>> > (version 1.9.1)
>>> > scipy:
>>> >
>>> /usr/local_rwth/sw/python/2.7.9/x86_64/lib/python2.7/site-packages/scipy
>>> > (version 0.15.1)
>>> > units: Angstrom and eV
>>> > cores: 32
>>> >
>>> > Memory estimate
>>> > ---------------
>>> > Process memory now: 75.02 MiB
>>> > Calculator 1145.24 MiB
>>> > Density 56.04 MiB
>>> > Arrays 15.91 MiB
>>> > Localized functions 35.58 MiB
>>> > Mixer 4.55 MiB
>>> > Hamiltonian 23.19 MiB
>>> > Arrays 11.82 MiB
>>> > XC 0.00 MiB
>>> > Poisson 8.81 MiB
>>> > vbar 2.56 MiB
>>> > Wavefunctions 1066.01 MiB
>>> > Arrays psit_nG 523.69 MiB
>>> > Eigensolver 2.29 MiB
>>> > Projections 2.06 MiB
>>> > Projectors 4.17 MiB
>>> > Overlap op 533.81 MiB
>>> >
>>> >
>>> > On Mon, Jan 18, 2016 at 1:01 PM, abhishek khetan <askhetan at gmail.com>
>>> wrote:
>>> >>
>>> >> Dear Marcin, and Ask,
>>> >>
>>> >> I am indeed on this cluster, and I have already used both these
>>> >> tools. When I use r_memusage (to check the peak physical memory), the
>>> >> peak physical memory is of the order of a few MB and the process gets
>>> >> killed right at the beginning, with only this output:
>>> >>
>>> >> | | |_ | | | |
>>> >> | | | | | . | | | |
>>> >> |__ | _|___|_____| 0.12.0.13279
>>> >> |___|_|
>>> >>
>>> >>
>>> >> The same is not the case when I take a pre-converged system and run
>>> >> the r_memusage script. It shows me a good 2.5 GB (and rising) before
>>> >> I kill the process, so I can see it is running fine. This is what I
>>> >> mean by saying that the allocation doesn't even start for these
>>> >> unconverged cases. Using eigensolver=RMM_DIIS(keep_htpsit=False) has
>>> >> the exact same problems. Is there a way I can trick GPAW into
>>> >> requesting much less from the cluster? I want to try this because, as
>>> >> I have mentioned, at peak my jobs need no more than 2 GB per core,
>>> >> and I usually provide 8 GB (albeit to no use).
>>> >>
>>> >> Best,
>>> >>
>>> >>
>>> >> On Sat, Jan 16, 2016 at 1:10 PM, Marcin Dulak <mdul at dtu.dk> wrote:
>>> >>>
>>> >>> Hi,
>>> >>>
>>> >>> are you on this cluster?
>>> >>> https://doc.itc.rwth-aachen.de/display/CC/r_memusage
>>> >>>
>>> >>>
>>> https://doc.itc.rwth-aachen.de/display/CC/Resource+limitations+on+dialog+systems
>>> >>> It may be that the batch system (LSF) kills jobs that exceed the
>>> >>> given resident-memory limit. The two links above may help you
>>> >>> diagnose that. I recall that GPAW's memory estimate is fairly
>>> >>> accurate (~20%) for standard ground-state PW or grid-mode jobs, but
>>> >>> may be very inaccurate (an order of magnitude) for vdW or LCAO jobs
>>> >>> (Ask, correct me if this is not the case anymore).
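One quick way to check whether the batch system imposes such a limit is to print, from inside a small job script, the rlimits the job actually inherits; RLIM_INFINITY means no limit is set. A generic sketch, not LSF-specific:

```python
import resource

# Print the soft/hard resource limits this process inherited, e.g. from
# the batch system's job launcher. A finite RLIMIT_AS in particular will
# make large allocations fail even if the node has plenty of free RAM.
for name in ("RLIMIT_AS", "RLIMIT_DATA", "RLIMIT_RSS", "RLIMIT_STACK"):
    soft, hard = resource.getrlimit(getattr(resource, name))
    show = lambda v: "unlimited" if v == resource.RLIM_INFINITY else str(v)
    print("%s: soft=%s hard=%s" % (name, show(soft), show(hard)))
```

Comparing these numbers against GPAW's own memory estimate (keeping the ~20% inaccuracy above in mind) should show whether the batch limit is the killer.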
>>> >>>
>>> >>> Best regards,
>>> >>>
>>> >>> Marcin
>>> >>> _______________________________________________
>>> >>> gpaw-users mailing list
>>> >>> gpaw-users at listserv.fysik.dtu.dk
>>> >>> https://listserv.fysik.dtu.dk/mailman/listinfo/gpaw-users
>>> >>
>>> >>
>>> >>
>>> >>
>>> >> --
>>> >> || radhe radhe ||
>>> >>
>>> >> abhishek
>>> >
>>> >
>>> >
>>> >
>>> >
>>>
>>
>>
>>
>>
>
>
>
>
--
|| radhe radhe ||
abhishek