.. _bgp_performance:

==============================
Maximizing performance on BG/P
==============================

Begin by reading up on the GPAW parallelization strategies
(:ref:`parallel_runs`) and the `BG/P architecture
<https://wiki.alcf.anl.gov/index.php/References>`_. In particular,
:ref:`band_parallelization` will be needed to scale your calculation
to a large number of cores. The BG/P systems at the `Argonne Leadership
Computing Facility <http://www.alcf.anl.gov>`_ use Cobalt for
scheduling, which will be referred to frequently below. Other
schedulers should have similar functionality.

There are four key aspects that require careful consideration:

1) Choosing a parallelization strategy.
#) Selecting the correct partition size (number of nodes) and mapping.
#) Choosing an appropriate value of ``buffer_size``. The use of
   ``nblocks`` is no longer recommended.
#) Setting the appropriate DCMF environment variables.

In the sections that follow, we aim to cultivate an understanding of
how to choose these parameters.

Parallelization Strategy
========================

Parallelization options are specified at the ``gpaw-python`` command
line: domain decomposition with ``--domain-decomposition=Nx,Ny,Nz``
and band parallelization with ``--state-parallelization=B``.
Additionally, the ``parallel`` keyword is also available.
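
The same choices can also be made with the ``parallel`` keyword of the
calculator. The snippet below is only a sketch: the domain shape, band
group count and number of bands are hypothetical placeholders, not
recommendations::

  from gpaw import GPAW

  # Hypothetical values: an 8x8x8 domain decomposition with 8 band
  # groups, i.e. 8*8*8*8 = 4096 MPI tasks in total.
  calc = GPAW(nbands=2048,
              parallel={'domain': (8, 8, 8),  # Nx, Ny, Nz
                        'band': 8})           # B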

The smallest calculation that can benefit from band/state
parallelization is *nbands = 1000*. If you are using fewer bands, you
are possibly *not* in need of a leadership class computing facility.
Note that only the :ref:`RMM-DIIS` eigensolver is compatible with band
parallelization. Furthermore, the RMM-DIIS eigensolver requires
some unoccupied bands in order to converge properly. The recommended
range is::

  spinpol=False
  nbands = valence electrons/2*[1.0 - 1.2]

  spinpol=True
  nbands = max(up valence electrons, down valence electrons)*[1.0 - 1.2]
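
As a rough illustration of this rule (a sketch only; the electron count
is made up)::

  # Spin-paired case: aim for 10-20% unoccupied bands on top of the
  # valence_electrons/2 occupied bands.
  valence_electrons = 800                    # hypothetical system
  nbands = int(valence_electrons / 2 * 1.1)  # 440 bands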

It was empirically determined that you need to have *nbands/B > 256*
for reasonable performance. It is also possible to use smaller groups,
*nbands/B < 256*, but this may require large domains. It is *required*
that *nbands/B* be an integer. The best values are *B* = 2, 4, 8, 16,
32, 64, and 128.

Obviously, the total number of cores must equal::

  Nx*Ny*Nz*B
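
For example, on a hypothetical 512-node partition run in vn mode
(2048 cores, see below), choosing *B = 8* leaves ``Nx*Ny*Nz = 256`` for
the domain decomposition, e.g. 8x8x4 domains. The sketch below simply
checks this bookkeeping::

  nodes, tasks_per_node = 512, 4    # vn mode on a hypothetical partition
  B = 8
  Nx, Ny, Nz = 8, 8, 4
  assert Nx * Ny * Nz * B == nodes * tasks_per_node  # 2048 MPI tasks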

The parallelization strategy will require careful consideration of the
partition size and mapping. And, obviously, also memory!

Partition size and Mapping
==========================

The BG/P partition dimensions (Px, Py, Pz, T) for Surveyor and Intrepid at the
Argonne Leadership Computing Facility are `available here
<https://wiki.alcf.anl.gov/index.php/Running#What_are_the_sizes_and_dimensions_of_the_partitions_on_the_system.3F>`_,
where T represents the number of MPI tasks per node (not whether a
torus network is available). The number of cores per node which
execute MPI tasks is specified by the Cobalt flag::

  --mode={smp,dual,vn}

Hence, the possible values of T are::

  T = 1 for smp
  T = 2 for dual
  T = 4 for vn

Note that there are 4 cores and 2 GB of memory per node on BG/P. As GPAW is
presently an MPI-only code, vn mode is preferred since all cores will
perform computational work.
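
The mode also fixes the memory available per MPI task, which is worth
keeping in mind when a calculation approaches the 2 GB/node limit. A
small sketch of the bookkeeping (the node count is made up)::

  cores_per_node, mem_per_node_mb = 4, 2048   # BG/P hardware
  tasks_per_node = {'smp': 1, 'dual': 2, 'vn': 4}

  nodes, mode = 512, 'vn'                     # hypothetical job
  total_tasks = nodes * tasks_per_node[mode]  # 2048 MPI tasks
  mem_per_task = mem_per_node_mb // tasks_per_node[mode]  # 512 MB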

It is essential to think of the BG/P network as a 4-dimensional object with
3 spatial dimensions and a T-dimension. For optimum scalability it
would seem necessary to maximize the locality of two distinct
communication patterns arising in the canonical O(N^3) DFT algorithm:

1) H*Psi products
#) parallel matrix multiplies.

However, it turns out that this is *not* necessary. The mesh network can
handle small messages rather efficiently, such that the time to send a
small message to a nearest-neighbor node versus a node half-way across
the machine is comparable. Hence, it is only necessary to optimize the
mapping for the communication arising from the parallel matrix
multiply, which is a simple 1D systolic communication pattern.

Here we show examples of different mappings on a 512-node BG/P
partition. Band groups are color coded. *(Left)* Inefficient mapping
for four groups of bands (B = 4). This mapping leads to contention on
network links in the z-direction. *(Right)* Efficient mapping for eight
groups of bands (B = 8). Correct mapping maximizes scalability and
single-core peak performance.

|mapping1| |mapping2|

.. |mapping1| image:: bgp_mapping1.png
   :width: 40 %

.. |mapping2| image:: bgp_mapping2.png
   :width: 40 %

For the mapping in the *(Right)* image above, there are
two communication patterns (and hence mappings) that are worth
distinguishing.

|intranode|

.. |intranode| image:: bgp_mapping_intranode.png
   :width: 60 %

The boxes in these images represent nodes and the numbers inside
each box represent the distinct cores in the node (four for BG/P).
Intuitively, the communication pattern of the *(Left)* image should
lead to less network contention than that of the *(Right)*. However, this is
not the case, due to a lack of optimization in the intranode
implementation of MPI. The performance of these communication
patterns is presently identical, though this may change in a future
version of the BG/P implementation of MPI.

Mapping is accomplished by the Cobalt flag::

  --env=BG_MAPPING=<mapping>

where *<mapping>* can be one of the canonical BG/P mappings
(permutations of XYZT with T at the beginning or end) or a mapfile.
Lastly, it is important to note that GPAW orders the MPI tasks as
follows::

  Z, Y, X, bands, kpoints, and spins
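
If that listing is read as fastest-varying index first (an assumption
made for illustration only; this is not code taken from GPAW), the rank
assignment can be pictured roughly as::

  # Illustrative only: map (spin, kpoint, band group, x, y, z) indices
  # to an MPI rank, assuming Z varies fastest as listed above.
  def rank(s, k, b, x, y, z, nkpts, B, Nx, Ny, Nz):
      return ((((s * nkpts + k) * B + b) * Nx + x) * Ny + y) * Nz + z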

A list of mappings is provided below. Note that this list is not
exhaustive. The constraint on the mapping comes from the value
of *B*; only *one* of these constraints must be true:

1) The last dimension in the canonical BG/P mapping equals the value of *B*.
#) For canonical BG/P mappings which end in T, the product of T and the
   last cartesian dimension in the mapping equals *B*.
#) If a canonical mapping is not immediately suitable, the keyword
   ``order`` in the ``parallel`` dictionary can be used to rectify the
   problem, as sketched below. See the documentation on :ref:`parallel_runs`.
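
For the third option, the snippet below indicates the general shape of
such a setup; the ordering string is only a placeholder, and
:ref:`parallel_runs` should be consulted for the values that are
actually allowed::

  from gpaw import GPAW

  # Hypothetical example: adjust the rank ordering so that a canonical
  # BG/P mapping fits; 'kdb' is a placeholder value.
  calc = GPAW(parallel={'domain': (8, 8, 4),
                        'band': 8,
                        'order': 'kdb'})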

B = 2
-----

Simply set the following variables in your submission script::

  mode = dual
  mapping = any canonical mapping ending with a T

The constraint on the domain decomposition is simply::

  Nx*Ny*Nz = Px*Py*Pz
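
As a concrete (hypothetical) illustration: on a 512-node partition with
torus dimensions 8x8x8 run in dual mode, *B = 2* gives 1024 MPI tasks
and the domains simply follow the partition dimensions::

  Px, Py, Pz, T = 8, 8, 8, 2   # hypothetical 512-node partition, dual mode
  B = 2
  Nx, Ny, Nz = Px, Py, Pz
  assert Nx * Ny * Nz * B == Px * Py * Pz * T  # 1024 MPI tasks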

B = 4
-----

Similar to the *B = 2* case, but with::

  mode = vn

B = 8, 16, 32, 64, or 128
-------------------------

This is left as an exercise to the user.

Setting the value of buffer_size
================================

Use ``buffer_size=2048``. Refer to :ref:`manual_parallel` for more
information about the ``buffer_size`` keyword. Larger values require
increasing the default value of DCMF_RECFIFO.
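
In practice this means passing the value through the ``parallel``
dictionary, for example (the domain and band values below are
placeholders, and the value is assumed to be interpreted in KiB, so
that 2048 corresponds to the 2 MB blocks discussed below)::

  from gpaw import GPAW

  calc = GPAW(parallel={'domain': (8, 8, 4),   # placeholder
                        'band': 8,             # placeholder
                        'buffer_size': 2048})  # assumed KiB, i.e. ~2 MB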

For those interested in more technical details, continue reading this section.
The computation of the hamiltonian and overlap matrix elements, as well as
the computation of the new wavefunctions, is accomplished by a hand-coded
parallel matrix multiply (``hs_operators.py``) employing a 1D systolic
ring algorithm.

Under the *original* implementation of the matrix-multiply algorithm,
it was necessary to select an appropriate value for the number of blocks
``nblocks``::

  from gpaw.hs_operators import MatrixOperator
  MatrixOperator.nblocks = K
  MatrixOperator.async = True (default)

where the ``B`` groups of bands are further divided into ``K``
blocks. It was also required that *nbands/B/K* be an integer.
The value of ``K`` should be chosen so that 2 MB of wavefunctions are
interchanged. The special cases of B = 2, 4 as described
above permit blocks of wavefunctions larger than 2 MB to be
interchanged, since there is only intranode communication.

The size of the wavefunction block being interchanged is given by::

  gpts = (Gx, Gy, Gz)
  size of wavefunction block in MB = (Gx/Nx)*(Gy/Ny)*(Gz/Nz)*(nbands/B/K)*8/1024^2
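
Written out as a small calculation (all numbers below are purely
illustrative)::

  Gx, Gy, Gz = 128, 128, 128   # hypothetical grid
  Nx, Ny, Nz = 8, 8, 4         # domain decomposition
  nbands, B, K = 2048, 8, 1

  block_mb = (Gx/Nx) * (Gy/Ny) * (Gz/Nz) * (nbands/B/K) * 8 / 1024**2
  # 16 MB with K=1, so K would need to be ~8 to reach the 2 MB target.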

The constraints on the value of nbands are:

1) ``nbands/B`` must be an integer
#) ``nbands/B/K`` must be an integer
#) the size of the wavefunction block should be ~2 MB
#) ``nbands`` must be sufficiently large that the RMM-DIIS eigensolver converges

The second constraint above is no longer applicable as of SVN version 7520.

Important DCMF environment variables
====================================

`DCMF <http://dcmf.anl-external.org/wiki/index.php/Main_Page>`_ is one
of the lower layers in the BG/P implementation of the MPI software stack.
To understand the DCMF environment variables in greater detail, please read the
appropriate sections of the IBM System Blue Gene Solution:
`Blue Gene/P Application Development <http://www.redbooks.ibm.com/abstracts/sg247287.html?Open>`_.

DCMF_EAGER and DCMF_RECFIFO
---------------------------

Communication and computation are overlapped to the extent allowed by the
hardware by using non-blocking sends (Isend) and receives (Irecv). It will
also be necessary to pass to Cobalt::

  --env=DCMF_EAGER=8388608

which corresponds to the largest message size that can be overlapped
(8 MB). Note that the number is specified in bytes and not
megabytes. This is larger than the target 2 MB size, but we keep this
for historical reasons since it is possible to use larger blocks of
wavefunctions in the case of *smp* or *dual* mode. This is also
equal to the default size of the DCMF_RECFIFO. If the following
warning is obtained::

  A DMA unit reception FIFO is full. Automatic recovery occurs
  for this event, but performance might be improved by increasing the FIFO size

the default value of DCMF_RECFIFO should be increased::

  --env=DCMF_RECFIFO=<size in bytes>

DCMF_REUSE_STORAGE
------------------

If you receive an allocation error from MPI_Allreduce, please add the following
environment variables::

  --env=DCMF_REDUCE_REUSE_STORAGE=N:DCMF_ALLREDUCE_REUSE_STORAGE=N:DCMF_REDUCE=RECT

It is very likely that your calculation is low on memory. Simply try using more nodes.

DCMF_ALLTOALL_PREMALLOC
-----------------------

HDF5 uses MPI_Alltoall, which can consume a significant amount of
memory. The default behavior for MPI collectives on Blue Gene/P is to
not release memory between calls for performance reasons. We recommend
setting this environment variable to override the default behavior::

  --env DCMF_ALLTOALL_PREMALLOC=N: