[ase-users] Problems with dftd3 and multiple nodes.

Sascha Thinius sascha.thinius at thch.uni-bonn.de
Wed Nov 1 10:52:05 CET 2017


Yes, I can confirm that the runs differ only in the number of nodes. One node works fine; more than one does not.
The scripts are exactly the same. If I remove the D3 calculator, the calculation runs in parallel without problems.
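
For reference, the way I attach the D3 calculator is roughly the
following (a sketch with placeholder settings, like the script I
attached earlier; the file name and GPAW parameters stand in for my
real ones):

    from ase.io import read
    from ase.calculators.dftd3 import DFTD3
    from gpaw import GPAW, PW

    atoms = read('structure.xyz')       # placeholder input file
    dft = GPAW(mode=PW(400), xc='PBE',  # placeholder settings
               txt='gpaw.out')
    # DFTD3 wraps the DFT calculator and adds the dispersion
    # correction on top of the GPAW energy and forces.
    atoms.calc = DFTD3(dft=dft)
    print(atoms.get_potential_energy())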

The "d3-else" messages, I just added on my own, to see where the code goes along.
The .EDISP : I have no idea, maybe from the dftd3 executable.
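
For context, my rough mental model of the logic around the rank-0
block in calculate() (where the code got stuck before) is the sketch
below. This is my own paraphrase, not the actual ASE source, and the
function name run_dftd3 is made up:

    import subprocess
    from ase.parallel import world, broadcast

    def run_dftd3(command, outname):
        errorcode = None
        if world.rank == 0:
            # Only rank 0 runs the serial dftd3 binary...
            with open(outname, 'w') as fout:
                errorcode = subprocess.call(command, stdout=fout)
        # ...while every other rank waits here until rank 0 shares
        # the result.  If rank 0 never returns, the whole job hangs.
        errorcode = broadcast(errorcode, root=0)
        return errorcode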
I compiled gpaw-python using icc and mpicc. dftd3-lib-0.9 is compiled with ifort using the -O0 -g -traceback -debug all flags.
I did not make any modifications.

Removing OMP_NUM_THREADS also did not change anything.
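
For reference, one way to pin the thread count from inside the script
is the sketch below; as far as I know it should not matter whether
the variable is set this way or in the batch environment:

    import os
    # Pin OpenMP to one thread per MPI rank.  This must happen
    # before the OpenMP runtime (pulled in via GPAW/BLAS) is
    # initialized, i.e. before gpaw is imported.
    os.environ['OMP_NUM_THREADS'] = '1'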
I also contacted the people at the Leibniz-Rechenzentrum, where I am running the calculations, and they cannot backtrace the error either.
Is there a simple tool I can use to trace the error? Is it possible that python_anaconda is causing this problem?

Thanks for your help so far.
Sascha


On Tue, 31 Oct 2017 14:59:02 +0000
  Eric Hermes via ase-users <ase-users at listserv.fysik.dtu.dk> wrote:
> On Tue, 2017-10-31 at 10:29 +0100, Sascha Thinius wrote:
>> Hi Eric,
>> 
>> I've installed the ASE version that includes your changes, but it
>> did not change anything.
>> The ase_dftd3.xyz file was written, while the ase_dftd3.out file is
>> still empty.
>> 
>> Running the calculation on one node, the following (non-empty)
>> files were created:
>> .EDISP
>> dftd3_gradient
>> ase_dftd3.xyz
>> ase_dftd3.out
>> 
>> Thanks for help so far. I appreciate more suggestions to fix the
>> problem.
> 
> Please keep replies on-list.
> 
> Can you confirm that the tests you are running only differ in the
> number of nodes used? Are there any other differences between the
> calculation that works and the one that doesn't besides the number of
> nodes? Does this calculation work without the DFTD3 calculator?
> 
> I don't really understand the output in the files you attached... I
> don't know where the .EDISP file is coming from, and I don't know where
> the "d3-else" messages are coming from in your output. The DFTD3
> calculator does neither of those. Have you made any modifications to
> GPAW, the dftd3 program itself, or the DFTD3 calculator in ASE to
> produce this output? The DFTD3 calculator will only work with the
> unmodified reference implementation of dftd3 from Grimme's website.
> 
> Keep in mind that the dftd3 executable is serial; it is neither MPI nor
> OpenMP parallelized. That means for very large systems with three-body
> contributions, it can take a very long time. But your system is not
> that big and you are not using three-body corrections; I can run the
> dispersion calculation on my laptop in much less than a second.
> However, if you are testing larger systems with three-body corrections,
> it may look like the calculation is hanging just because it takes a
> long time.
> 
> Also, GPAW is only MPI parallelized, not OpenMP, so you should probably
> disable OpenMP on this calculation anyway with something like
> "OMP_NUM_THREADS=1" in your environment. The crash is coming from
> libpoe, which I take it is some sort of IBM parallel execution library,
> but I am not at all familiar with IBM hardware or software. It may be
> related to OpenMP, but I genuinely have no idea how to even start
> diagnosing that stack trace. It's certainly not crashing in the dftd3
> executable itself though.
> 
> Eric
> 
>> 
>> Sascha.
>> 
>> 
>> On Thu, 26 Oct 2017 15:48:21 +0000
>>   Eric Hermes via ase-users <ase-users at listserv.fysik.dtu.dk> wrote:
>> > On Thu, 2017-10-26 at 11:11 +0200, Sascha Thinius via ase-users
>> > wrote:
>> > > Good morning,
>> > > 
>> > > I am happy that dftd3 is available in ASE as of version
>> > > 3.15.0b1.
>> > > Using a single node, everything works fine for me.
>> > > Using two or more nodes, the code gets stuck.
>> > > Attached you will find the out file, err file, structure file,
>> > > Python script and submission script. Ignore any bad settings in
>> > > the Python script.
>> > > The code gets stuck in the calculate() function at line 228 (in
>> > > the if world.rank == 0 statement).
>> > > ase_dftd3.xyz is written; ase_dftd3.out is written but empty.
>> > > 
>> > > Thanks for any advice.
>> > 
>> > Hm, it's hard to tell what's going wrong based on the files you
>> > shared.
>> > I am the one who wrote this module, but I never tested it with
>> > gpaw-
>> > python or mpi4py, so it's not terribly surprising that it's not
>> > working
>> > across multiple hosts. It looks like Alexander Tygesen did some
>> > work on
>> > the code to make it more compatible with parallel calculations, so
>> > he
>> > might have some insight into what's going wrong.
>> > 
>> > I've committed some additional changes to the code that reorganize
>> > some of the parallel logic and get rid of the assumption that the
>> > dftd3 files are readable by all MPI processes (for example, if you
>> > are running the calculation on local storage across multiple
>> > hosts...).
>> > There's a chance this will fix it for you; just pull from the git
>> > head and try again. If this doesn't solve your issue, please share
>> > any files created by the DFTD3 calculator (i.e.
>> > ase_dftd3.{out,POSCAR,xyz}, dftd3_cellgradient, dftd3_gradient,
>> > .dftd3par.local).
>> > 
>> > Eric
>> > 
>> > > 
>> > > All the best,
>> > > Sascha.
>> 
>> 
> 


