
/labs/04_MatrixMul/instructions.tex

https://bitbucket.org/aterrel/2012summerssc395
\documentclass{article}
\usepackage{fullpage}
\usepackage{url}
\usepackage{listings}
\begin{document}
\begin{center}
{\LARGE Lab 04: Matrix Matrix Multiply}\\
{\large SSC 375/395 Summer 2012}
\end{center}
\section{Setting up the environment}
\begin{description}
\item {\bf Download the class git repo}\\
Open a terminal and execute {\tt git clone
git@bitbucket.org:aterrel/2012summerssc395.git}.
\item {\bf Find the lab 4 files}\\
Go to the directory 2012summerssc395/labs/04\_MatrixMul/. Inside you will
find the following files:
\begin{itemize}
\item cuda\_helper.cu/.h -- a set of helper functions for interacting with
the device
\item matrixMul.h -- a set of program parameters
\item matrixMul\_gold.cpp -- a CPU version to check against
\item matrixMul\_kernel.cu -- the GPU matrix-matrix multiply kernel
\item matmult\_runner.cu -- an executable to test matrixMul\_kernel
\end{itemize}
\end{description}
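For reference, the CPU check computes the same product on the host with a
plain triple loop. A minimal sketch of what such a check might look like is
below; the actual function name and signature in {\tt matrixMul\_gold.cpp}
may differ.

```cpp
#include <cstddef>

// Hypothetical sketch of a CPU reference multiply: C = A * B, where A is
// hA x wA and B is wA x wB, all stored in row-major order. The real
// routine in matrixMul_gold.cpp may use a different name and signature.
void computeGold(float* C, const float* A, const float* B,
                 std::size_t hA, std::size_t wA, std::size_t wB)
{
    for (std::size_t i = 0; i < hA; ++i)
        for (std::size_t j = 0; j < wB; ++j) {
            double sum = 0;  // accumulate in double for a tighter check
            for (std::size_t k = 0; k < wA; ++k)
                sum += A[i * wA + k] * B[k * wB + j];
            C[i * wB + j] = static_cast<float>(sum);
        }
}
```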
\section{Writing non-shared memory matrix multiply}
The first task is to write the matrix-matrix multiply ($C = AB$) in a naive
manner. In this version each thread computes a single entry of matrix $C$.
See the stub below (also in {\tt matrixMul\_kernel.cu}):
\begin{lstlisting}
template <int BLOCK_SIZE> __global__ void
matrixMul( float* C, float* A, float* B, int wA, int wB)
{
    // Calculate the row index of the C element
    int Row = XXX;
    // Calculate the column index of the C element
    int Col = XXX;
    float Cvalue = 0;
    // Each thread computes one element of C
    for (int k = 0; k < XXX; ++k)
        Cvalue += A[XXX] * B[XXX];
    C[XXX] = Cvalue;
}
\end{lstlisting}
Fill in {\tt matrixMul\_kernel.cu}; run make; run matmult\_runner. What is
the peak performance of the card, and what performance do you achieve?
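Recall that a matrix-matrix multiply of an $m \times k$ matrix by a
$k \times n$ matrix performs $2mnk$ floating-point operations (one multiply
and one add per inner-product term), so the achieved rate follows from the
measured kernel time. A sketch of that calculation, assuming the elapsed
time in seconds comes from the runner's timing code:

```cpp
#include <cstddef>

// Sketch of the usual performance calculation for C = A * B with A being
// m x k and B being k x n: 2*m*n*k floating-point operations divided by
// the measured time. "seconds" is a placeholder for whatever timing the
// runner reports (e.g. from CUDA events).
double gigaflops(std::size_t m, std::size_t n, std::size_t k, double seconds)
{
    double ops = 2.0 * m * n * k;
    return ops / seconds / 1e9;
}
```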
\section{Writing shared memory matrix multiply}
As discussed in class, one major bottleneck is global memory access, which
can be reduced by loading tiles into shared memory. In this version of the
code, each block loads a tile of $A$ and $B$ into shared memory, computes the
corresponding block of $C$, and then writes it back to the resulting matrix.
Move {\tt matrixMul\_kernel.cu} to {\tt matrixMul\_kernel\_simple.cu} and
write a tiled version in {\tt matrixMul\_kernel.cu}.
Below is a stub to get started:
\begin{lstlisting}
#define AS(i, j) As[i][j]
#define BS(i, j) Bs[i][j]

template <int BLOCK_SIZE> __global__ void
matrixMul( float* C, float* A, float* B, int wA, int wB)
{
    // Block index
    int bx = blockIdx.x;
    int by = blockIdx.y;
    // Thread index
    int tx = threadIdx.x;
    int ty = threadIdx.y;
    // Index of the first sub-matrix of A processed by the block
    int aBegin = XXX;
    // Index of the last sub-matrix of A processed by the block
    int aEnd = XXX;
    // Step size used to iterate through the sub-matrices of A
    int aStep = BLOCK_SIZE;
    // Index of the first sub-matrix of B processed by the block
    int bBegin = BLOCK_SIZE * bx;
    // Step size used to iterate through the sub-matrices of B
    int bStep = BLOCK_SIZE * wB;
    // Csub is used to store the element of the block sub-matrix
    // that is computed by the thread
    float Csub = 0;
    // Loop over all the sub-matrices of A and B
    // required to compute the block sub-matrix
    for (int a = aBegin, b = bBegin;
         a <= aEnd;
         a += aStep, b += bStep) {
        // Declaration of the shared memory array As used to
        // store the sub-matrix of A
        __shared__ float As[XXX][XXX];
        // Declaration of the shared memory array Bs used to
        // store the sub-matrix of B
        __shared__ float Bs[XXX][XXX];
        // Load the matrices from device memory
        // to shared memory; each thread loads
        // one element of each matrix
        AS(ty, tx) = A[XXX];
        BS(ty, tx) = B[XXX];
        // Synchronize to make sure the matrices are loaded
        __syncthreads();
        // Multiply the two matrices together;
        // each thread computes one element
        // of the block sub-matrix
#pragma unroll
        for (int k = 0; k < BLOCK_SIZE; ++k)
            Csub += AS(XXX, XXX) * BS(XXX, XXX);
        // Synchronize to make sure that the preceding
        // computation is done before loading two new
        // sub-matrices of A and B in the next iteration
        __syncthreads();
    }
    // Write the block sub-matrix to device memory;
    // each thread writes one element
    int c = XXX;
    C[XXX] = Csub;
}
\end{lstlisting}
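One way to see why tiling helps: in the naive kernel each thread reads a full
row of $A$ and a full column of $B$ from global memory, roughly $2N^3$ loads
for $N \times N$ matrices, while the tiled kernel stages each element through
shared memory once per tile, cutting global traffic by a factor of the block
size. A small sketch of that back-of-the-envelope count (a simplified model
that ignores caching):

```cpp
#include <cstddef>

// Rough global-memory load counts for an N x N multiply: the naive kernel
// reads 2*N elements per output entry (2*N^3 total), while the tiled
// kernel reads 2*N/blockSize per entry, a factor-of-blockSize reduction.
std::size_t naiveLoads(std::size_t n) { return 2 * n * n * n; }

std::size_t tiledLoads(std::size_t n, std::size_t blockSize)
{
    return 2 * n * n * n / blockSize;
}
```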
Run make; run matmult\_runner. What is the peak performance of the card, and
what performance do you achieve?
\end{document}