[gpaw-users] questions about GPAW installation on a supercomputer
Marcin Dulak
Marcin.Dulak at fysik.dtu.dk
Wed Jul 11 19:46:32 CEST 2012
On 07/11/12 19:19, Ask Hjorth Larsen wrote:
> Hi Vladislav
>
> 2012/7/11 Vladislav Ivanistsev<vladislav.ivanistsev at ut.ee>:
>>>> then you can use for example this scalability test
>>>> https://wiki.fysik.dtu.dk/gpaw/devel/benchmarks.html#medium-size-system
>>>> Make sure to run each test a few times (take the average, or the fastest run)
>>>> on nodes exclusively reserved for you when collecting timing results.
>>
>> Could it be that domain decomposition slows down a calculation? I've
>> got a strange result using the simplest GPAW configuration. With 2 CPUs
>> the computation is faster than with 4 CPUs (both on one node). What
>> could be the reason?
>>
>> ///////////////////////////////////////////////
>>
>> Total number of cores used: 2
>> Domain Decomposition: 2 x 1 x 1
>> Diagonalizer layout: Serial LAPACK
>>
>> Symmetries present: 1
>> 1 k-point (Gamma)
>> 1 k-point in the Irreducible Part of the Brillouin Zone
>> Linear Mixing Parameter: 0.1
>> Pulay Mixing with 3 Old Densities
>> Damping of Long Wave Oscillations: 50
>>
>> Convergence Criteria:
>> Total Energy Change: 0.0005 eV / electron
>> Integral of Absolute Density Change: 0.0001 electrons
>> Integral of Absolute Eigenstate Change: 4e-08 eV^2
>> Number of Atoms: 768
>> Number of Atomic Orbitals: 5888
>> Number of Bands in Calculation: 1056
>> Bands to Converge: Occupied States Only
>> Number of Valence Electrons: 2048
>>                       log10-error:     Total         Iterations:
>>             Time      WFS   Density    Energy        Fermi  Poisson
>> iter:   1  15:38:04                    -4005.626127   0      19
>> iter:   2  15:58:27          -0.9      -3942.630366   0      10
>> iter:   3  16:18:57          -1.0      -3865.223847   0      12
>> iter:   4  16:39:25          -1.3      -3865.418828   0      12
>> iter:   5  16:59:53          -1.8      -3862.644668   0      10
>> Memory usage: 2.48 GB
>>
>> ///////////////////////////////////////////////
>>
>> Total number of cores used: 4
>> Domain Decomposition: 2 x 2 x 1
>> Diagonalizer layout: Serial LAPACK
>>
>> Symmetries present: 1
>> 1 k-point (Gamma)
>> 1 k-point in the Irreducible Part of the Brillouin Zone
>> Linear Mixing Parameter: 0.1
>> Pulay Mixing with 3 Old Densities
>> Damping of Long Wave Oscillations: 50
>>
>> Convergence Criteria:
>> Total Energy Change: 0.0005 eV / electron
>> Integral of Absolute Density Change: 0.0001 electrons
>> Integral of Absolute Eigenstate Change: 4e-08 eV^2
>> Number of Atoms: 768
>> Number of Atomic Orbitals: 5888
>> Number of Bands in Calculation: 1056
>> Bands to Converge: Occupied States Only
>> Number of Valence Electrons: 2048
>>                       log10-error:     Total         Iterations:
>>             Time      WFS   Density    Energy        Fermi  Poisson
>> iter:   1  15:46:32                    -4005.626127   0      19
>> iter:   2  16:13:38          -0.9      -3942.630365   0      10
>> iter:   3  16:40:53          -1.0      -3865.223847   0      12
>> iter:   4  17:07:51          -1.3      -3865.418828   0      12
>> iter:   5  17:34:59          -1.8      -3862.644668   0      10
>> Memory usage: 1.83 GB
> The obvious way in which this *could* happen would be if 4 processes
> run on only 2 cores, which could be a problem with the MPI setup. The
> second-most obvious way is if it indeed runs on 4 cores, but the cores
> are separated in groups of 2 with horrible interconnect.
>
> But it can also happen for another reason. I see that this is an LCAO
> calculation. With ~6000 atomic orbitals the diagonalization dominates
> over all the grid-based operations, so a finer domain decomposition
> should not really help much, particularly since you use serial LAPACK.
> Since the Hamiltonian is not parallelized, the communication needed to
> distribute it, which happens on every iteration, grows with the number
> of CPUs that must know it: effectively the ~260 MiB Hamiltonian matrix
> has to be synchronized on all CPUs a couple of times per iteration. I
> don't see exactly how this effect could mean *that* much, though.
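(As a quick check of the ~260 MiB figure, here is a back-of-the-envelope
estimate of my own, assuming a dense real matrix over the 5888 atomic
orbitals reported in the log:)

# dense real Hamiltonian over 5888 atomic orbitals, 8 bytes per element
nao = 5888
print('%.1f MiB' % (nao * nao * 8 / 1024.0**2))  # -> 264.5 MiB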
this also looks like a memory bandwidth problem
(as more and more cores on a node are used, the individual cores effectively
become slower), maybe combined with other jobs running at the same time on
the node? 80/110 ~ 0.73 (the ratio of the 2-core to the 4-core wall time) is
a very bad number; even on a misused memory system the corresponding ratio
when going from 2 to 4 cores can be ~0.95:
https://wiki.fysik.dtu.dk/gpaw/devel/benchmarks.html#dual-socket-quad-core-64-bit-intel-westmere-xeon-x5667-quad-core-3-07-ghz-3-gb-ram-per-core-el5
(numactl was misconfigured on a borrowed Westmere machine and I had no chance
to repeat the test).
I would also suggest disabling NUMA (or MPI processor affinity, if you use it).
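To check whether the 4 MPI ranks really end up on 4 distinct cores (and on
which node), you could run a small script like the one below with the same
mpirun/gpaw-python command as the real job. This is only a sketch of mine,
not part of GPAW; it assumes Linux, where /proc/self/status lists the cores
a process is allowed to run on.

import os
from gpaw.mpi import world

def allowed_cpus():
    # Linux only: the list of CPU cores this process may run on.
    with open('/proc/self/status') as f:
        for line in f:
            if line.startswith('Cpus_allowed_list'):
                return line.split(':')[1].strip()
    return 'unknown'

for rank in range(world.size):
    if world.rank == rank:
        print('rank %d of %d on host %s, allowed cpus: %s'
              % (world.rank, world.size, os.uname()[1], allowed_cpus()))
    world.barrier()  # keep the output ordered by rank

If two ranks report the same single core, the slowdown is explained by the
process placement rather than by GPAW itself.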
When you compare timings you should start from one full node (all the
cores) and add more nodes, or always use 1 core per node.
The nodes must be used exclusively by your program
(see https://wiki.fysik.dtu.dk/niflheim/Batch_jobs#more-memory-needed).
Always repeat runs a few times, as runtimes on Linux easily vary by 5%.
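As an illustration of the 80/110 ~ 0.73 ratio above, here is the arithmetic
from the timestamps of the two log excerpts (a rough sketch of mine, iteration
1 start to iteration 5):

# wall time from iter 1 to iter 5, read off the timestamps above
t2 = (16 - 15) * 3600 + (59 - 38) * 60 + (53 - 4)   # 2 cores: 4909 s
t4 = (17 - 15) * 3600 + (34 - 46) * 60 + (59 - 32)  # 4 cores: 6507 s

speedup = t2 / float(t4)    # ideally 2.0 when going from 2 to 4 cores
efficiency = speedup / 2.0  # ideally 1.0; here ~0.38
print('speedup %.2f, efficiency %.2f' % (speedup, efficiency))

A healthy run should show a speedup well above 1 when doubling the number
of cores on a properly configured node.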
You could also run
https://wiki.fysik.dtu.dk/gpaw/devel/benchmarks.html#goal in order to
see the expected performance drop on your system.
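Regarding the ScaLAPACK suggestion below, a minimal sketch of the
parallelization settings one could try is given here. The exact keywords
('sl_auto', 'domain') and the basis set are assumptions to be checked against
the parallelization documentation of your GPAW version, and GPAW must be
compiled with ScaLAPACK support.

from gpaw import GPAW

# Sketch only: the basis and the output file name are placeholders.
# 'sl_auto' asks GPAW to distribute the dense LCAO matrices over a
# BLACS/ScaLAPACK grid instead of using serial LAPACK; 'domain' sets
# the real-space domain decomposition explicitly.
calc = GPAW(mode='lcao',
            basis='dzp',
            parallel={'sl_auto': True,
                      'domain': (2, 2, 1)},
            txt='lcao_scalapack.txt')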
Marcin
>
> Conclusion: Use ScaLAPACK and/or think about the interconnect.
>
> You might need a few more bands to converge the calculation well
> unless there's a gap or something.
>
> Regards
> Ask
--
***********************************
Marcin Dulak
Technical University of Denmark
Department of Physics
Building 307, Room 229
DK-2800 Kongens Lyngby
Denmark
Tel.: (+45) 4525 3157
Fax.: (+45) 4593 2399
email: Marcin.Dulak at fysik.dtu.dk
***********************************