.. _bgp_performance:

==============================
Maximizing performance on BG/P
==============================

Begin by reading up on the GPAW parallelization strategies
(:ref:`parallel_runs`) and the `BG/P architecture
<https://wiki.alcf.anl.gov/index.php/References>`_. In particular,
:ref:`band_parallelization` will be needed to scale your calculation
to a large number of cores. The BG/P systems at the `Argonne Leadership
Computing Facility <http://www.alcf.anl.gov>`_ use Cobalt for
scheduling, which will be referred to frequently below. Other
schedulers should have similar functionality.

There are four key aspects that require careful consideration:

1) Choosing a parallelization strategy.
#) Selecting the correct partition size (number of nodes) and mapping.
#) Choosing an appropriate value of ``buffer_size``. The use of
   ``nblocks`` is no longer recommended.
#) Setting the appropriate DCMF environment variables.

In the sections that follow, we aim to cultivate an understanding of
how to choose these parameters.

Parallelization Strategy
========================

Parallelization options are specified at the ``gpaw-python`` command
line: domain decomposition with ``--domain-decomposition=Nx,Ny,Nz``
and band parallelization with ``--state-parallelization=B``.
Additionally, the ``parallel`` keyword is also available.
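
The same choices can also be made with the ``parallel`` keyword of the
calculator. The snippet below is only a sketch: the domain shape, band
group count and number of bands are hypothetical placeholders, not
recommendations::

  from gpaw import GPAW

  # Hypothetical values: an 8x8x8 domain decomposition with 8 band
  # groups, i.e. 8*8*8*8 = 4096 MPI tasks in total.
  calc = GPAW(nbands=2048,
              parallel={'domain': (8, 8, 8),  # Nx, Ny, Nz
                        'band': 8})           # B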

The smallest calculation that can benefit from band/state
parallelization is *nbands = 1000*. If you are using fewer bands, you
are possibly *not* in need of a leadership class computing facility.
Note that only the :ref:`RMM-DIIS` eigensolver is compatible with band
parallelization. Furthermore, the RMM-DIIS eigensolver requires
some unoccupied bands in order to converge properly. The recommended
range is::

  spinpol=False
  nbands = valence electrons/2*[1.0 - 1.2]

  spinpol=True
  nbands = max(up valence electrons, down valence electrons)*[1.0 - 1.2]
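
As a rough illustration of this rule (a sketch only; the electron count
is made up)::

  # Spin-paired case: aim for 10-20% unoccupied bands on top of the
  # valence_electrons/2 occupied bands.
  valence_electrons = 800                    # hypothetical system
  nbands = int(valence_electrons / 2 * 1.1)  # 440 bands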

It was empirically determined that you need to have *nbands/B > 256*
for reasonable performance. It is also possible to use smaller groups,
*nbands/B < 256*, but this may require large domains. It is *required*
that *nbands/B* be an integer. The best values are *B* = 2, 4, 8, 16,
32, 64, and 128.

Obviously, the total number of cores must equal::

  Nx*Ny*Nz*B
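
For example, on a hypothetical 512-node partition run in vn mode
(2048 cores, see below), choosing *B = 8* leaves ``Nx*Ny*Nz = 256`` for
the domain decomposition, e.g. 8x8x4 domains. The sketch below simply
checks this bookkeeping::

  nodes, tasks_per_node = 512, 4    # vn mode on a hypothetical partition
  B = 8
  Nx, Ny, Nz = 8, 8, 4
  assert Nx * Ny * Nz * B == nodes * tasks_per_node  # 2048 MPI tasks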

The parallelization strategy will require careful consideration of the
partition size and mapping. And, obviously, also memory!

Partition size and Mapping
==========================

The BG/P partition dimensions (Px, Py, Pz, T) for Surveyor and Intrepid at the
Argonne Leadership Computing Facility are `available here
<https://wiki.alcf.anl.gov/index.php/Running#What_are_the_sizes_and_dimensions_of_the_partitions_on_the_system.3F>`_,
where T represents the number of MPI tasks per node (not whether a
torus network is available). The number of cores per node which
execute MPI tasks is specified by the Cobalt flag::

  --mode={smp,dual,vn}

Hence, the possible values of T are::

  T = 1 for smp
  T = 2 for dual
  T = 4 for vn

Note that there are 4 cores and 2 GB of memory per node on BG/P. As GPAW is
presently an MPI-only code, vn mode is preferred since all cores will
perform computational work.
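
The mode also fixes the memory available per MPI task, which is worth
keeping in mind when a calculation approaches the 2 GB/node limit. A
small sketch of the bookkeeping (the node count is made up)::

  cores_per_node, mem_per_node_mb = 4, 2048   # BG/P hardware
  tasks_per_node = {'smp': 1, 'dual': 2, 'vn': 4}

  nodes, mode = 512, 'vn'                     # hypothetical job
  total_tasks = nodes * tasks_per_node[mode]  # 2048 MPI tasks
  mem_per_task = mem_per_node_mb // tasks_per_node[mode]  # 512 MB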

It is essential to think of the BG/P network as a 4-dimensional object with
3 spatial dimensions and a T-dimension. For optimum scalability it
would seem necessary to maximize the locality of two distinct
communication patterns arising in the canonical O(N^3) DFT algorithm:

1) H*Psi products
#) parallel matrix multiplies.

However, it turns out that this is *not* necessary. The mesh network can
handle small messages rather efficiently, such that the time to send a
small message to a nearest-neighbor node versus a node half-way across
the machine is comparable. Hence, it is only necessary to optimize the
mapping for the communication arising from the parallel matrix
multiply, which is a simple 1D systolic communication pattern.

Here we show examples of different mappings on a 512-node BG/P
partition. Band groups are color coded. *(Left)* Inefficient mapping
for four groups of bands (B = 4). This mapping leads to contention on
network links in the z-direction. *(Right)* Efficient mapping for eight
groups of bands (B = 8). Correct mapping maximizes scalability and
single-core peak performance.

|mapping1| |mapping2|

.. |mapping1| image:: bgp_mapping1.png
   :width: 40 %

.. |mapping2| image:: bgp_mapping2.png
   :width: 40 %

For the mapping in the *(Right)* image above, there are
two communication patterns (and hence mappings) that are worth
distinguishing.

|intranode|

.. |intranode| image:: bgp_mapping_intranode.png
   :width: 60 %

The boxes in these images represent nodes and the numbers inside
each box represent the distinct cores in the node (four for BG/P).
Intuitively, the communication pattern of the *(Left)* image should
lead to less network contention than that of the *(Right)*. However, this is
not the case, due to a lack of optimization in the intranode
implementation of MPI. The performance of these communication
patterns is presently identical, though this may change in a future
version of the BG/P implementation of MPI.

Mapping is accomplished by the Cobalt flag::

  --env=BG_MAPPING=<mapping>

where *<mapping>* can be one of the canonical BG/P mappings
(permutations of XYZT with T at the beginning or end) or a mapfile.
Lastly, it is important to note that GPAW orders the MPI tasks as
follows::

  Z, Y, X, bands, kpoints, and spins
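
If that listing is read as fastest-varying index first (an assumption
made for illustration only; this is not code taken from GPAW), the rank
assignment can be pictured roughly as::

  # Illustrative only: map (spin, kpoint, band group, x, y, z) indices
  # to an MPI rank, assuming Z varies fastest as listed above.
  def rank(s, k, b, x, y, z, nkpts, B, Nx, Ny, Nz):
      return ((((s * nkpts + k) * B + b) * Nx + x) * Ny + y) * Nz + z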

A list of mappings is provided below. Note that this list is not
exhaustive. The constraint on the mapping comes from the value
of *B*; only *one* of these constraints must be true:

1) The last dimension in the canonical BG/P mapping equals the value of *B*.
#) For canonical BG/P mappings which end in T, the product of T and the
   last cartesian dimension in the mapping equals *B*.
#) If a canonical mapping is not immediately suitable, the keyword
   ``order`` in the ``parallel`` dictionary can be used to rectify the
   problem, as sketched below. See the documentation on :ref:`parallel_runs`.
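
For the third option, the snippet below indicates the general shape of
such a setup; the ordering string is only a placeholder, and
:ref:`parallel_runs` should be consulted for the values that are
actually allowed::

  from gpaw import GPAW

  # Hypothetical example: adjust the rank ordering so that a canonical
  # BG/P mapping fits; 'kdb' is a placeholder value.
  calc = GPAW(parallel={'domain': (8, 8, 4),
                        'band': 8,
                        'order': 'kdb'})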

B = 2
-----

Simply set the following variables in your submission script::

  mode = dual
  mapping = any canonical mapping ending with a T

The constraint on the domain decomposition is simply::

  Nx*Ny*Nz = Px*Py*Pz
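
As a concrete (hypothetical) illustration: on a 512-node partition with
torus dimensions 8x8x8 run in dual mode, *B = 2* gives 1024 MPI tasks
and the domains simply follow the partition dimensions::

  Px, Py, Pz, T = 8, 8, 8, 2   # hypothetical 512-node partition, dual mode
  B = 2
  Nx, Ny, Nz = Px, Py, Pz
  assert Nx * Ny * Nz * B == Px * Py * Pz * T  # 1024 MPI tasks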

B = 4
-----

Similar to the *B = 2* case, but with::

  mode = vn

B = 8, 16, 32, 64, or 128
-------------------------

This is left as an exercise to the user.

Setting the value of buffer_size
================================

Use ``buffer_size=2048``. Refer to :ref:`manual_parallel` for more
information about the ``buffer_size`` keyword. Larger values require
increasing the default value of DCMF_RECFIFO.
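
In practice this means passing the value through the ``parallel``
dictionary, for example (the domain and band values below are
placeholders, and the value is assumed to be interpreted in KiB, so
that 2048 corresponds to the 2 MB blocks discussed below)::

  from gpaw import GPAW

  calc = GPAW(parallel={'domain': (8, 8, 4),   # placeholder
                        'band': 8,             # placeholder
                        'buffer_size': 2048})  # assumed KiB, i.e. ~2 MB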

For those interested in more technical details, continue reading this section.
The computation of the hamiltonian and overlap matrix elements, as well as
the computation of the new wavefunctions, is accomplished by a hand-coded
parallel matrix multiply (``hs_operators.py``) employing a 1D systolic
ring algorithm.

Under the *original* implementation of the matrix-multiply algorithm,
it was necessary to select an appropriate value for the number of blocks
``nblocks``::

  from gpaw.hs_operators import MatrixOperator
  MatrixOperator.nblocks = K
  MatrixOperator.async = True (default)

where the ``B`` groups of bands are further divided into ``K``
blocks. It was also required that *nbands/B/K* be an integer.
The value of ``K`` should be chosen so that 2 MB of wavefunctions are
interchanged. The special cases of B = 2, 4 as described
above permit blocks of wavefunctions larger than 2 MB to be
interchanged, since there is only intranode communication.

The size of the wavefunction block being interchanged is given by::

  gpts = (Gx, Gy, Gz)
  size of wavefunction block in MB = (Gx/Nx)*(Gy/Ny)*(Gz/Nz)*(nbands/B/K)*8/1024^2
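
Written out as a small calculation (all numbers below are purely
illustrative)::

  Gx, Gy, Gz = 128, 128, 128   # hypothetical grid
  Nx, Ny, Nz = 8, 8, 4         # domain decomposition
  nbands, B, K = 2048, 8, 1

  block_mb = (Gx/Nx) * (Gy/Ny) * (Gz/Nz) * (nbands/B/K) * 8 / 1024**2
  # 16 MB with K=1, so K would need to be ~8 to reach the 2 MB target.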

The constraints on the value of nbands are:

1) ``nbands/B`` must be an integer
#) ``nbands/B/K`` must be an integer
#) the size of the wavefunction block should be ~2 MB
#) ``nbands`` must be sufficiently large that the RMM-DIIS eigensolver converges

The second constraint above is no longer applicable as of SVN version 7520.

Important DCMF environment variables
====================================

`DCMF <http://dcmf.anl-external.org/wiki/index.php/Main_Page>`_ is one
of the lower layers in the BG/P implementation of the MPI software stack.
To understand the DCMF environment variables in greater detail, please read the
appropriate sections of the IBM System Blue Gene Solution:
`Blue Gene/P Application Development <http://www.redbooks.ibm.com/abstracts/sg247287.html?Open>`_.

DCMF_EAGER and DCMF_RECFIFO
---------------------------

Communication and computation are overlapped to the extent allowed by the
hardware by using non-blocking sends (Isend) and receives (Irecv). It will
also be necessary to pass to Cobalt::

  --env=DCMF_EAGER=8388608

which corresponds to the largest message size that can be overlapped
(8 MB). Note that the number is specified in bytes and not
megabytes. This is larger than the target 2 MB size, but we keep this
for historical reasons since it is possible to use larger blocks of
wavefunctions in the case of *smp* or *dual* mode. This is also
equal to the default size of the DCMF_RECFIFO. If the following
warning is obtained::

  A DMA unit reception FIFO is full. Automatic recovery occurs
  for this event, but performance might be improved by increasing the FIFO size

the default value of DCMF_RECFIFO should be increased::

  --env=DCMF_RECFIFO=<size in bytes>

DCMF_REUSE_STORAGE
------------------

If you receive an allocation error from MPI_Allreduce, please add the following
environment variables::

  --env=DCMF_REDUCE_REUSE_STORAGE=N:DCMF_ALLREDUCE_REUSE_STORAGE=N:DCMF_REDUCE=RECT

It is very likely that your calculation is low on memory. Simply try using more nodes.

DCMF_ALLTOALL_PREMALLOC
-----------------------

HDF5 uses MPI_Alltoall, which can consume a significant amount of
memory. The default behavior for MPI collectives on Blue Gene/P is to
not release memory between calls for performance reasons. We recommend
setting this environment variable to override the default behavior::

  --env DCMF_ALLTOALL_PREMALLOC=N: