Many parallel algorithms require, at some stage, variables distributed across multiple processing units to be reduced to a single value by a binary operation. This reduced value must then be made accessible to all processing units. For instance, in a series of vector operations, the result of the scalar product of two vectors may need to be made available to all processing units for the next stage of the calculation.
(a) For shared memory systems, where the processing units are threads, all threads can read the memory location containing the reduced value. However, they should not begin subsequent calculations until the reduction calculation has been completed.
(i) For a GPU, suppose the reduction had been completed by threads within a single work group. Why is it beneficial to use local, rather than global, memory for intermediate calculations in this situation?
(ii) Still for a GPU, how would you ensure that the reduction performed by a single work group, with its result in local memory, has completed and can be read by all threads for the subsequent calculations? Explain your answer.
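For reference only (not a model answer), the sketch below shows one common pattern relevant to parts (i) and (ii): a work-group sum reduction in OpenCL C that stages data in local memory and synchronizes with a local barrier. The kernel name, argument names, and the fixed work-group size WGSIZE are assumptions made for this illustration.

    #define WGSIZE 256

    __kernel void wg_reduce(__global const float *in, __global float *out)
    {
        __local float scratch[WGSIZE];        /* fast on-chip local memory */

        size_t lid = get_local_id(0);
        scratch[lid] = in[get_global_id(0)];  /* one read from slow global memory */
        barrier(CLK_LOCAL_MEM_FENCE);         /* all loads visible to the group */

        /* Tree reduction carried out entirely in local memory. */
        for (size_t s = WGSIZE / 2; s > 0; s >>= 1) {
            if (lid < s)
                scratch[lid] += scratch[lid + s];
            barrier(CLK_LOCAL_MEM_FENCE);     /* synchronize after each level */
        }

        /* After the final barrier, scratch[0] holds the group's sum and can
         * safely be read by every thread in the work group. */
        if (lid == 0)
            out[get_group_id(0)] = scratch[0];
    }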
(b) For distributed memory systems, where processing units are processes, the issue becomes communicating the result of the reduction to all processes. Suppose that, after the reduction, the reduced value is known only to one process, e.g. rank 0 for MPI.
(i) What form of collective communication should be used to send the reduced value to all processes? You do not need to give the actual MPI function name, but may do so if you like.
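For illustration, a minimal MPI sketch in C (the variable names are assumptions for this example): once the reduction has left the value on rank 0, a single collective call distributes it to every process.

    #include <mpi.h>

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);

        double reduced = 0.0;  /* assume the reduction has left the value on rank 0 */

        /* Collective broadcast from root rank 0: after this call returns,
         * every process holds the reduced value. */
        MPI_Bcast(&reduced, 1, MPI_DOUBLE, 0, MPI_COMM_WORLD);

        MPI_Finalize();
        return 0;
    }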
(ii) Someone suggests using point-to-point instead of collective communication, and you rightly point out that this will likely be slower than using collective communication. Justify this claim by estimating how the communication time tcomm varies with the total number of processes numProcs for both methods. You should assume that the collective communication uses a binary tree.
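As a rough guide to the scaling in question, suppose each message costs a fixed time t_s (a symbol introduced here for illustration; it is not part of the question). Then, as a sketch:

    % Point-to-point: the root sends to each of the other processes in turn.
    t_{\mathrm{comm}}^{\mathrm{p2p}} \approx (\mathrm{numProcs} - 1)\, t_s
    % Binary tree: the number of levels grows only logarithmically.
    t_{\mathrm{comm}}^{\mathrm{tree}} \approx \lceil \log_2 \mathrm{numProcs} \rceil \, t_s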
(iii) Given that barriers are not used in the binary tree, how might the necessary synchronization be achieved?
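One illustrative possibility, sketched in C with MPI (the function and variable names are assumptions for this example): a blocking receive returns only once the matching message has arrived, so it doubles as the synchronization point.

    #include <mpi.h>

    /* A parent node in the reduction tree receives a child's partial sum.
     * Because MPI_Recv blocks until the message has been delivered, the
     * parent cannot proceed early, so no explicit barrier is needed. */
    double receive_partial(int childRank)
    {
        double partial;
        MPI_Recv(&partial, 1, MPI_DOUBLE, childRank, 0,
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        return partial;  /* safe to use: the data is guaranteed to be here */
    }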
(iv) In fact, MPI already provides a function MPI_Allreduce that both reduces and distributes the final answer to all processes. One possible implementation is essentially a combination of binary trees. An example is given in Fig. 1 for numProcs=4. Redraw Fig. 1 for the case numProcs=2, for which there will be 2 levels rather than 3, and therefore 4 nodes in total.
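For concreteness, a minimal usage sketch of this call in C (the variable names and the choice of MPI_SUM and MPI_DOUBLE are illustrative):

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);

        int rank;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        double local = (double)rank;  /* this rank's contribution (illustrative) */
        double global;

        /* Reduce and redistribute in one collective call: every rank
         * receives the same global sum. */
        MPI_Allreduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);

        printf("rank %d: global sum = %f\n", rank, global);
        MPI_Finalize();
        return 0;
    }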
(v) How many communications are there in total in your redrawn diagram?
(vi) Returning to Fig. 1, note that in the final row of communications, some processes send two partial sums whereas others send none. How would you alter this final exchange of partial sums to make the communication better balanced, i.e. so that each process sends at most one partial sum? Use the given rank numbers in your answer.
(c) Notice that Fig. 1 is a task graph. Assume that each task (node) corresponds to the same amount of time, including those on the top and bottom rows.
(i) What are the work and span of the task graph given in Fig. 1? What is the maximum performance as predicted by the work-span model?
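For reference, the standard work-span relations (a general fact about the model, not the specific answer for Fig. 1), writing T_1 for the work and T_infinity for the span:

    % Work T_1: total time of all tasks; span T_\infty: time along the
    % longest (critical) path. The model's predicted maximum speedup is
    S_{\max} \;=\; \frac{T_1}{T_\infty} \;=\; \frac{\text{work}}{\text{span}}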
(ii) Suppose there are p = 2^m processes. What are the work, span, and prediction of the work-span model now, for arbitrary m?
(iii) It has been assumed that each task takes the same time to execute. Suppose each task now takes a different, but known, time to execute. Describe in general terms how you would modify the definitions of work and span, and the prediction of the work-span model, for this situation. You do not need to derive expressions or perform actual calculations, but should explain your answer.
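For reference, with a known time t_i for each task i, the natural generalization of these quantities (stated here as a sketch, not a worked answer) is:

    \text{work} = \sum_{i} t_i,
    \qquad
    \text{span} = \max_{\text{paths } P} \;\sum_{i \in P} t_i,
    \qquad
    S_{\max} = \frac{\text{work}}{\text{span}}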