[gpaw-users] questions about GPAW installation on a supercomputer

Ask Hjorth Larsen asklarsen at gmail.com
Wed Jul 11 19:19:41 CEST 2012


Hi Vladislav

2012/7/11 Vladislav Ivanistsev <vladislav.ivanistsev at ut.ee>:
>>> then you can use for example this scalability test
>>> https://wiki.fysik.dtu.dk/gpaw/devel/benchmarks.html#medium-size-system
>>> Make sure to run each test a few times (take the average, or the fastest run)
>>> on the nodes exclusively reserved for you when collecting timing results.
>
>
>  Could it be that domain decomposition slows down a calculation? I've
> got a strange result using the most simple GPAW configuration. At 2 CPUs the
> computation is faster than at 4 CPUs (both on one node). What could be the
> reason?
>
> ///////////////////////////////////////////////
>
> Total number of cores used: 2
> Domain Decomposition: 2 x 1 x 1
> Diagonalizer layout: Serial LAPACK
>
> Symmetries present: 1
> 1 k-point (Gamma)
> 1 k-point in the Irreducible Part of the Brillouin Zone
> Linear Mixing Parameter:           0.1
> Pulay Mixing with 3 Old Densities
> Damping of Long Wave Oscillations: 50
>
> Convergence Criteria:
> Total Energy Change:           0.0005 eV / electron
> Integral of Absolute Density Change:    0.0001 electrons
> Integral of Absolute Eigenstate Change: 4e-08 eV^2
> Number of Atoms: 768
> Number of Atomic Orbitals: 5888
> Number of Bands in Calculation:         1056
> Bands to Converge:                      Occupied States Only
> Number of Valence Electrons:            2048
>                      log10-error:    Total        Iterations:
>            Time      WFS    Density  Energy       Fermi  Poisson
> iter:   1  15:38:04                  -4005.626127  0      19
> iter:   2  15:58:27         -0.9     -3942.630366  0      10
> iter:   3  16:18:57         -1.0     -3865.223847  0      12
> iter:   4  16:39:25         -1.3     -3865.418828  0      12
> iter:   5  16:59:53         -1.8     -3862.644668  0      10
> Memory usage: 2.48 GB
>
> ///////////////////////////////////////////////
>
> Total number of cores used: 4
> Domain Decomposition: 2 x 2 x 1
> Diagonalizer layout: Serial LAPACK
>
> Symmetries present: 1
> 1 k-point (Gamma)
> 1 k-point in the Irreducible Part of the Brillouin Zone
> Linear Mixing Parameter:           0.1
> Pulay Mixing with 3 Old Densities
> Damping of Long Wave Oscillations: 50
>
> Convergence Criteria:
> Total Energy Change:           0.0005 eV / electron
> Integral of Absolute Density Change:    0.0001 electrons
> Integral of Absolute Eigenstate Change: 4e-08 eV^2
> Number of Atoms: 768
> Number of Atomic Orbitals: 5888
> Number of Bands in Calculation:         1056
> Bands to Converge:                      Occupied States Only
> Number of Valence Electrons:            2048
>                      log10-error:    Total        Iterations:
>            Time      WFS    Density  Energy       Fermi  Poisson
> iter:   1  15:46:32                  -4005.626127  0      19
> iter:   2  16:13:38         -0.9     -3942.630365  0      10
> iter:   3  16:40:53         -1.0     -3865.223847  0      12
> iter:   4  17:07:51         -1.3     -3865.418828  0      12
> iter:   5  17:34:59         -1.8     -3862.644668  0      10
> Memory usage: 1.83 GB

The obvious way in which this *could* happen is if the 4 processes
run on only 2 cores, which would point to a problem with the MPI setup.
The second-most obvious way is if the job really does run on 4 cores,
but the cores are split into groups of 2 with a very slow interconnect
between them.
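
To check the placement, a small sketch like the one below (launched
through the same mpirun/job script as your GPAW job) prints which host
each rank ends up on; gpaw.mpi.world is GPAW's own communicator, the
rest is plain Python.  It only reveals the node placement, so
oversubscribed cores would still have to be spotted with e.g. top on
the node.

    import socket
    from gpaw.mpi import world  # GPAW's MPI communicator

    # Each rank reports itself; with 4 processes you should see four
    # distinct ranks, and the hostnames show how they are spread over nodes.
    print('rank %d of %d on %s' % (world.rank, world.size,
                                   socket.gethostname()))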

But it can also happen for another reason.  I see that this is an LCAO
calculation.  With roughly 6000 atomic orbitals, the diagonalization
dominates the time taken by all grid-based operations, so increasing
the domain decomposition should not really help much, particularly
since you use the serial LAPACK.  Meanwhile the amount of communication
needed to distribute the Hamiltonian, which happens on every iteration,
grows with the number of CPUs: the Hamiltonian itself is not
parallelized, but every CPU must know it.  The operation effectively
amounts to synchronizing the ~260 MiB Hamiltonian matrix on all CPUs a
couple of times per iteration.  I don't see exactly how this effect
could cost *that* much, though.
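
(For reference, that estimate just follows from the matrix dimensions
in your log; a quick back-of-the-envelope check:)

    # Dense LCAO Hamiltonian size, from the numbers in the log above
    nao = 5888                      # Number of Atomic Orbitals
    bytes_per_element = 8           # double precision
    print(nao**2 * bytes_per_element / 1024.0**2)  # ~264 MiB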

Conclusion: Use ScaLAPACK and/or think about the interconnect.
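
If your GPAW build is compiled with ScaLAPACK, the parallel keyword is
where this goes.  A minimal sketch (the 2x2 BLACS grid and block size
64 are only example values, not a recommendation for your machine):

    from gpaw import GPAW

    calc = GPAW(mode='lcao',
                basis='dzp',  # whatever basis you actually use
                parallel={'sl_default': (2, 2, 64),  # ScaLAPACK 2x2 grid, blocksize 64
                          'domain': (2, 2, 1)})      # domain decomposition as in your 4-core run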

You might need a few more bands to converge the calculation well
unless there's a gap or something.
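
In practice that would just be the nbands keyword, something like this
(1100 is a made-up number, a few dozen above your 1024 occupied bands):

    calc = GPAW(mode='lcao',
                basis='dzp',
                nbands=1100)  # a bit more than the occupied bands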

Regards
Ask

