[gpaw-users] [CSC #180899] jobs hang

Ahmed Rizwan rizwan.ahmed at aalto.fi
Tue Jun 21 11:09:04 CEST 2016


Hi Ask and Martti,

Thanks a lot for your kind support and troubleshooting. It was indeed a complicated issue to troubleshoot, at least
at my level. :) The jobs now run normally after just exporting MKL_CBWR=SSE2.
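For anyone else running into the same hang, a minimal sketch of a batch script with the fix in place might look roughly like the one below. The module name, partition, resource values and the gpaw-python/input.py invocation are only placeholders for illustration; the essential part is exporting MKL_CBWR=SSE2 after loading GPAW and before launching the run.

    #!/bin/bash
    #SBATCH --ntasks=24             # placeholder: number of MPI tasks
    #SBATCH --time=02:00:00         # placeholder: walltime
    #SBATCH --partition=parallel    # placeholder: partition name

    module load gpaw                # placeholder module name; load GPAW first
    export MKL_CBWR=SSE2            # force MKL onto the same (SSE2) code path on
                                    # every core, so all ranks get identical results

    srun gpaw-python input.py       # assumed parallel GPAW launcher and input file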

Martti, as a newer version of GPAW (gpaw-1.0.0) is available, it would be very nice if you could compile it on Taito and Sisu.

Best regards,
Rizwan
________________________________________
From: Martti Louhivuori via RT [research-support at csc.fi]
Sent: Monday, June 20, 2016 2:22 PM
To: Ahmed Rizwan
Subject: [CSC #180899] jobs hang

Hi Ask,

Yes, that was indeed my intention... :)

No worries, though. Rizwan should have received both e-mails, but our ticketing system seems to separate the replies to each recipient... sorry for the confusion!

BR,
Martti


On Mon Jun 20 13:41:10 2016, asklarsen at gmail.com wrote:
> (Did you mean to CC Rizwan in the two e-mails here?  I will send a
> reply to both of you later.)
>
> 2016-06-20 12:32 GMT+02:00 Martti Louhivuori via RT <research-
> support at csc.fi>:
> > Hi!
> >
> > It seems that numerical precision is indeed the root of the problem.
> > By instructing MKL to make sure that all cores get exactly the same
> > results (so no rounding-off differences due to floating-point
> > arithmetic), the run finishes without a problem.
> >
> > So, the only thing one needs to change is to add the following line
> > into the SLURM job script (script.sh) after loading the GPAW module:
> >
> > export MKL_CBWR=SSE2
> >
> > Not an ideal solution (you'll lose some performance), but at least
> > it bypasses the issue.
> >
> > Ask: Could it be that the algorithms involved are so numerically
> > fragile (or that numerical imprecision builds up over time) that once
> > in a while one just hits a "bad spot" and GPAW dies silently? Not an
> > easy issue to debug, but it would be good for GPAW to gracefully
> > handle such a situation... :)
> >
> > Best regards,
> > Martti
> >
> >
> > On Mon Jun 20 11:46:12 2016, louhivuo at csc.fi wrote:
> >> Dear Rizwan and Ask,
> >>
> >> I sincerely doubt this is an MPI (/installation) issue... at least
> >> in
> >> any kind of straightforward way. :) No MPI errors that would explain
> >> the issue are seen even when turning on MPI debugging
> >> (I_MPI_DEBUG=5).
> >>
> >> It seems that it stalls on Taito when using multiple cores
> >> regardless
> >> of whether they are in one node or multiple nodes. Depending on the
> >> exact setup (no. of cores, whether I_MPI_DEBUG is set etc.), it
> >> hangs
> >> at different iterations, but always on the same spot (i.e. after
> >> reporting the Dipole Moment).
> >>
> >> On Sisu (our Cray XC30), the same run finishes without a problem.
> >> Since no other user has encountered any problems on Taito (as far as
> >> we can tell), I'm inclined to think that this may be a subtle
> >> numerical issue (Ask: akin to the Atom mismatch issue Jussi has been
> >> trying to address for some years now).
> >>
> >> For the record, on Taito the libraries used are IntelMPI 4.1.0 and
> >> IntelMKL 11.0.2, while on Sisu they are Cray-mpich 7.3.3 and Cray-
> >> libsci 16.03.1.
> >>
> >> I'll test my hypothesis about numerical stability and get back to
> >> you
> >> soonest.
> >>
> >> Best regards,
> >> Martti
> >>
> >>
> >> On Thu Jun 16 21:13:20 2016, asklarsen at gmail.com wrote:
> >> > Hi Ahmed
> >> >
> >> > If it occurs with multiple /nodes/ but not with multiple cores
> >> > within
> >> > only /one/ node, then we are practically certain that it is an
> >> > MPI/compiler problem.
> >> >
> >> > Right now I understand that it occurs in parallel (whether it runs
> >> > within a single or multiple nodes), but not in serial.  But that
> >> > could
> >> > still be a problem with the code and not MPI.  (Although the
> >> > problem
> >> > is so obscure and apparently platform specific that the first
> >> > explanation is much more likely.)
> >> >
> >> > Best regards
> >> > Ask
> >
> >
> >