[gpaw-users] Different MPI worlds ASE vs. GPAW: small fix, big fix or migrating to mpi4py?
Gaël Donval
G.Donval at bath.ac.uk
Thu Aug 9 18:01:48 CEST 2018
On Thu, 2018-08-09 at 10:06 -0500, Ask Hjorth Larsen wrote:
> Hello,
>
> 2018-08-09 5:31 GMT-05:00 Gaël Donval <G.Donval at bath.ac.uk>:
> > Hi both,
> > > On 08/09/2018 05:46 AM, Ask Hjorth Larsen via gpaw-users wrote:
> > > > Hello,
> > > >
> > > > 2018-08-08 13:27 GMT-05:00 Gaël Donval <G.Donval at bath.ac.uk>:
>
> (....)
> > > I think switching to mpi4py would be difficult and, as Ask
> > > mentioned, not so nice for users. And our C-extension still
> > > needs to call MPI functions.
> >
> > Let's forget about the switch then: I get your point.
> >
> > Ask, I also get that you don't see any compelling reasons to
> > separate those things, so let's put that on hold for now.
> >
> > Let's assume I'm restricting myself to having a fully working
> > parallel _gpaw.so version. Nothing more (I don't plan to touch
> > `gpaw-python` at all). For that I need a single point of entry to
> > MPI from Python. Why? Because it is simpler for both users and
> > developers (i.e. single way of doing things, single place to look
> > at, single place to update) AND because that would allow us to
> > provide the same guarantees as `gpaw-python`.
> >
>
> +1, I hope it is not too difficult
So do I...
>
> > That point of entry could check whether MPI is already initialized
> > and raise a suitable exception if that's the case: that way, if no
> > exception is raised, we know we are in control, just like in
> > `gpaw-python`. The user could still load mpi4py after the fact and
> > meddle with MPI, but they can do that within `gpaw-python` too...
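(Expanding on my own point quoted above, here is roughly what I have
in mind. Only a sketch: the real check would be MPI_Initialized()
inside _gpaw.so; mpi4py is used here purely to illustrate the logic,
and `acquire_world` is a made-up name.)

    # Sketch: a single entry point to MPI that fails loudly if
    # someone else initialized MPI first (mpi4py for illustration).
    import mpi4py
    mpi4py.rc.initialize = False  # don't initialize MPI at import

    from mpi4py import MPI

    def acquire_world():
        if MPI.Is_initialized():
            # We are no longer in control, so we cannot provide the
            # gpaw-python guarantees: refuse to continue.
            raise RuntimeError('MPI was already initialized by '
                               'another module.')
        MPI.Init()
        return MPI.COMM_WORLD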
>
> What do you think about having a Python-level 'gpaw' subcommand in
> which we manage our parallelization?
Something along the lines of the following?

    mpiexec python -m gpaw blah.py

A "compatibility" `gpaw-python` script doing just that could also be
provided. If so, I don't see why not.
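Roughly, the subcommand could be implemented like this (a sketch
only; apart from `gpaw.mpi.world`, none of these details are
settled):

    # Hypothetical gpaw/__main__.py: freeze the parallel world
    # before any line of user code runs, then run the user's script.
    import sys
    import runpy

    def main():
        import gpaw.mpi  # importing sets up and freezes the world
        assert gpaw.mpi.world is not None
        script, sys.argv = sys.argv[1], sys.argv[1:]
        runpy.run_path(script, run_name='__main__')

    if __name__ == '__main__':
        main()

The compatibility `gpaw-python` script would then be a one-liner,
`exec python -m gpaw "$@"`, with mpiexec left to the user as before.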
>
> What matters is that we control the preemptive imports and code
> initialization. When we don't, the user may do things in any order
> and it becomes very difficult.
>
> >
> >
> > I get that ASE needs to know whether it's running in parallel but
> > does not know what program it's going to use. There are 3 obvious
> > solutions to that:
> >   * make a separate MPI communicator subproject that implements
> >     the required interface (that would be reimplementing mpi4py)
> >     or alternatively, migrate mpi.c to ase since ase needs to know
> >     about MPI! (I know this is not really a solution but this is
> >     what makes sense)
> >   * try to load a working communicator implementation from
> >     well-known compiled modules such as gpaw, asap, etc. (as long
> >     as the interface is identical, it wouldn't change anything...)
>
> I prefer/suggest that the program (gpaw, asap) knows what it wants
> and tells ASE what it knows about the runtime. It is more explicit.
I agree. How would you handle scripts using both GPAW and ASAP?
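For what it's worth, this is the failure mode I would want any
registration mechanism to catch. A made-up sketch (none of these
names exist in ase.parallel today):

    # Each compiled package hands ASE its world once; a second,
    # different registration is an error instead of silent confusion.
    _world = None

    def register_world(world):
        global _world
        if _world is None:
            _world = world
        elif _world is not world:
            raise RuntimeError('Conflicting MPI worlds registered, '
                               'e.g. GPAW and ASAP both claiming '
                               'ase.parallel.world.')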
>
> >   * Make a modifiable ase.parallel.world and add a registration
> >     mechanism for gpaw to declare the existence of its MPI
> >     implementation (Jens Jørgen's suggestion).
> >
> >
> > Assuming I follow the last route, what exactly would pose a
> > problem in ASE?
> >
> > The static rank numbers could become a Rank object instead, with
> > is_master() and is_slave() methods: that would seem to solve ~95%
> > of the use cases in ASE (from a quick grep).
> >
> > There doesn't seem to be any static construct whose construction
> > we can't postpone until just before the calculation. Actually,
> > ase.parallel.world could itself be a smart object so that local
> > `self.world` copies stay up to date.
>
> We can capture all we need at a single point: startup. This is
> fundamentally simpler than needing things to initialize correctly at
> different points in the code. The user can always somehow resolve a
> rank into an integer or boolean, which could become out-of-sync if
> parallel initialization happens afterwards.
I agree: it could be done in gpaw.__main__ then, guaranteeing a frozen
world from the very start. It's also very explicit and versatile: if
you want to do something else, then don't use `python -m gpaw`... I
quite like that approach.
I need to think about how to implement it. If you also have ideas
about how to get rid of ranks altogether, I'm all for it.
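For reference, the Rank object and the "smart" world from my quoted
text above could be as small as this (a sketch, nothing settled):

    # A Rank object lets code ask questions instead of comparing
    # integers; a proxy world keeps stale `self.world` copies in
    # sync by forwarding to whatever world is currently registered.
    _current_world = None  # set once, at startup, by the entry point

    class Rank:
        def __init__(self, world):
            self._world = world

        def is_master(self):
            return self._world.rank == 0

        def is_slave(self):
            return not self.is_master()

    class WorldProxy:
        def __getattr__(self, name):
            if _current_world is None:
                raise RuntimeError('Parallel world not set up yet.')
            return getattr(_current_world, name)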
>
> I have fought circular communicator imports and speculative attempts
> at initializing things on a few different occasions. That's why I am
> so much in favour of initializing at startup level and ensuring a
> layer (gpaw-python or gpaw subcommand) that *we* control.
I'll work with that in mind.
Thanks both for your input.
Gaël
>
> Best regards
> Ask
>
> >
> > Gaël
> >
> > >
> > > Jens Jørgen
> > >
> > > > So I'd say the (MPI-based) parallelism must be completely
> > > > determined when the program starts, and definitely before any
> > > > line written by the user is executed.
> > > >
> > > > Best regards
> > > > Ask
> > > >
> > > > > Gaël
> > > > >
> > > > > > Best regards
> > > > > > Ask
> > > > > >
> > > > > > > Gaël
> > > > > > >