Parallel Computation Exam COMP322101

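In the simplest sense, parallel computing is the simultaneous use of multiple compute resources to solve a computational problem:

  • A problem is broken into discrete parts that can be solved concurrently
  • Each part is further broken down into a series of instructions
  • Instructions from each part execute simultaneously on different processors
  • An overall control/coordination mechanism is employed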
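
The four steps above can be sketched in a few lines of Python. This is only an illustration under assumptions: the names `split`, `sum_of_squares` and `parallel_sum_of_squares` are hypothetical, and the sum of squares is an arbitrary stand-in for "a computational problem". The executor plays the role of the control/coordination mechanism.

```python
from concurrent.futures import ThreadPoolExecutor


def split(data, parts):
    """Break the problem into `parts` contiguous chunks of near-equal size."""
    size, extra = divmod(len(data), parts)
    chunks, start = [], 0
    for i in range(parts):
        end = start + size + (1 if i < extra else 0)
        chunks.append(data[start:end])
        start = end
    return chunks


def sum_of_squares(chunk):
    """The series of instructions applied to one part of the problem."""
    return sum(x * x for x in chunk)


def parallel_sum_of_squares(data, workers=4):
    chunks = split(data, workers)
    # The executor hands each chunk to a worker and collects the partial
    # results; the final sum combines them into one answer.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = list(pool.map(sum_of_squares, chunks))
    return sum(partials)
```

The sketch shows the structure rather than a speed-up: in a standard CPython build with the global interpreter lock, threads do not execute Python bytecode in parallel, so a process pool would be the usual choice for CPU-bound work.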

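The computational problem should be able to:

  • Be broken into discrete pieces of work that can be solved simultaneously.
  • Execute multiple program instructions at any moment in time.
  • Be solved in less time with multiple compute resources than with a single compute resource.

The compute resources are typically:

  • A single computer with multiple processors/cores
  • An arbitrary number of such computers connected by a network

Main reasons for using parallel programming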

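Save time and/or money

In theory, throwing more resources at a task shortens its time to completion, with potential cost savings. Parallel computers can be built from cheap, commodity components.

Solve larger/more complex problems

Many problems are so large and/or complex that it is impractical or impossible to solve them with a serial program, especially given limited computer memory.

Provide concurrency

A single compute resource can only do one thing at a time. Multiple compute resources can do many things simultaneously.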

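Example: collaborative networks provide a global venue where people from around the world can meet and conduct work "virtually".

Take advantage of non-local resources

Use compute resources on a wide area network, or even the Internet, when local compute resources are scarce or insufficient.

Example: SETI@home (setiathome.berkeley.edu) had over 1.7 million users in nearly every country in the world (May 2018).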

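Make better use of the underlying parallel hardware

Modern computers, even laptops, are parallel in architecture, with multiple processors/cores. Parallel software is written specifically for parallel hardware with multiple cores, threads, etc. In most cases, serial programs run on modern computers "waste" potential computing power.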

The following is a sample exam question on parallel computation:

Many parallel algorithms require, at some stage, variables distributed across multiple processing units to be reduced to a single value by a binary operation. This reduced value must then be made accessible to all processing units. For instance, in a series of vector operations, it may happen that the result of the scalar product of two vectors must then be made available to all processing units for the next stage in the calculations.

(a) For shared memory systems, where the processing units are threads, all threads can read the memory location containing the reduced value. However, they should not begin subsequent calculations until the reduction calculation has been completed.

(i) For a GPU, suppose the reduction is performed by threads within a single work group. Why is it beneficial to use local, rather than global, memory for intermediate calculations in this situation?

(ii) Still for a GPU, how would you ensure the result of the reduction performed by a single work group, in local memory, has been completed, and can be read by all threads for the subsequent calculations? Explain your answer.
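
The synchronisation requirement in part (a) can be illustrated with ordinary CPU threads. The sketch below is an analogy in Python, not GPU code, and the function name `reduce_then_read` is hypothetical. It uses `threading.Barrier` so that no thread reads the shared result before the reduction has finished. In an OpenCL kernel the comparable tool within a work group would be a work-group barrier such as `barrier(CLK_LOCAL_MEM_FENCE)`.

```python
import threading


def reduce_then_read(values):
    """Each thread contributes one value; every thread then reads the sum."""
    n = len(values)
    barrier = threading.Barrier(n)
    partial = [0] * n            # one slot per thread, like a local array
    shared = {"total": None}     # stands in for the shared memory location
    seen = [None] * n            # what each thread read after the barrier

    def worker(tid):
        partial[tid] = values[tid] * values[tid]  # this thread's contribution
        barrier.wait()           # all contributions have now been written
        if tid == 0:
            shared["total"] = sum(partial)        # one thread reduces
        barrier.wait()           # nobody reads until the reduction is done
        seen[tid] = shared["total"]

    threads = [threading.Thread(target=worker, args=(t,)) for t in range(n)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return seen
```

Without the second barrier, a thread other than thread 0 could read `shared["total"]` while it is still `None`, which is the hazard the question describes.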

(b) For distributed memory systems, where processing units are processes, the issue becomes communicating the result of the reduction to all processes. Suppose that, after the reduction, the reduced value is known only to one process, e.g. rank 0 for MPI.

(i) What form of collective communication should be used to send the reduced value to all processes? You do not need to give the actual MPI function name, but may do so if you like.

(ii) Someone suggests using point-to-point instead of collective communication, and you rightly point out that this will likely be slower than using collective communication. Justify this claim by estimating how the communication time t_comm varies with the total number of processes numProcs for both methods. You should assume that the collective communication uses a binary tree.
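
One way to see the difference in scaling is to count communication steps in a small simulation. The sketch below rests on simplifying assumptions: every message costs the same time, messages within one round of the tree proceed concurrently, and rank 0 issues its point-to-point sends one after another. The function names are hypothetical and no MPI library is involved.

```python
def point_to_point_steps(num_procs):
    """Rank 0 sends to every other rank in turn: numProcs - 1 steps."""
    return num_procs - 1


def tree_broadcast_schedule(num_procs):
    """Return the rounds of a binary-tree broadcast from rank 0.

    Each round is a list of (sender, receiver) pairs. Every rank that
    already holds the value forwards it to one new rank per round.
    """
    have = [0]                   # ranks that currently hold the value
    rounds = []
    while len(have) < num_procs:
        this_round, new = [], []
        for sender in have:
            receiver = sender + len(have)
            if receiver < num_procs:
                this_round.append((sender, receiver))
                new.append(receiver)
        have = have + new        # the set of holders roughly doubles
        rounds.append(this_round)
    return rounds
```

Under these assumptions the sequential approach needs numProcs - 1 steps, while the number of tree rounds grows as the base-2 logarithm of numProcs, rounded up. The total number of messages is numProcs - 1 in both cases; only the degree of overlap differs.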

(iii) Given barriers are not used in the binary tree, how might the necessary synchronisation be achieved?

(iv) In fact, MPI already provides a function MPI_Allreduce that both reduces and distributes the final answer to all processes. One possible implementation is essentially a combination of binary trees. An example is given in Fig. 1 for numProcs=4. Redraw Fig. 1 for the case numProcs=2, for which there will be 2 levels rather than 3, and therefore 4 nodes in total.

(v) How many communications are there in total?

(vi) Returning to Fig. 1, note that in the final row of communications, some processes send two partial sums whereas others send none. How would you alter this final exchange of partial sums to make the communication better balanced, i.e. so processes send at most one partial sum? Use the given rank numbers in your answer.
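
Fig. 1 is not reproduced here, so the following is not necessarily the exchange the question has in mind. It is a sketch of one well-known balanced pattern, often called recursive doubling or a butterfly, in which every rank sends exactly one partial sum per stage to the partner whose rank differs in a single bit. The simulation is plain Python with a hypothetical function name and assumes the number of ranks is a power of two.

```python
def allreduce_recursive_doubling(values):
    """Simulate a sum all-reduce; values[r] is the initial value on rank r.

    Returns (final value on each rank, number of sends made by each rank).
    """
    p = len(values)
    if p == 0 or p & (p - 1):
        raise ValueError("number of ranks must be a power of two")
    current = list(values)
    sends_per_rank = [0] * p
    stride = 1
    while stride < p:
        # Rank r exchanges its partial sum with rank r XOR stride, then
        # both add what they received to what they already hold.
        current = [current[r] + current[r ^ stride] for r in range(p)]
        for r in range(p):
            sends_per_rank[r] += 1
        stride *= 2
    return current, sends_per_rank
```

For four ranks holding 1, 2, 3 and 4, the first stage pairs ranks (0, 1) and (2, 3), and the second pairs (0, 2) and (1, 3), after which every rank holds 10 and has sent two messages.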

(c) Notice that Fig. 1 is a task graph. Assume that each task (node) corresponds to the same amount of time, including those on the top and bottom rows.

(i) What is the work and span of the task graph given in Fig. 1? What is the maximum performance as predicted by the work-span model?

(ii) Suppose there are p = 2^m processes. What is the work, span, and prediction of the work-span model now, for arbitrary m?

(iii) It has been assumed that each task takes the same time to execute. Suppose each task now takes a different, but known, time to execute. Describe in general terms how you would modify the definition of work and span, and the prediction of the work–span model, for this situation. You do not need to derive expressions or perform actual calculations, but should explain your answer.
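
For tasks of unequal duration, one common generalisation is to take the work as the sum of all task times and the span as the largest total time along any dependency path. The sketch below computes both for a small, made-up task graph. The function name, the task names and the durations are all hypothetical and are not taken from Fig. 1.

```python
def work_and_span(durations, deps):
    """Compute (work, span) for a task graph with per-task durations.

    durations: {task: time}
    deps:      {task: [tasks that must finish before it starts]}
    """
    finish = {}

    def finish_time(task):
        # Earliest finish = latest finish among prerequisites + own time.
        if task not in finish:
            start = max((finish_time(d) for d in deps.get(task, [])), default=0)
            finish[task] = start + durations[task]
        return finish[task]

    work = sum(durations.values())                   # time on one processor
    span = max(finish_time(t) for t in durations)    # critical path length
    return work, span


# Hypothetical graph: a and b run independently, c needs both, d needs c.
durations = {"a": 2, "b": 3, "c": 1, "d": 4}
deps = {"c": ["a", "b"], "d": ["c"]}
work, span = work_and_span(durations, deps)
max_speedup = work / span
```

In this example the work is 10 and the span is 8 (the path b, c, d), so the work-span bound on the speed-up is 10 / 8 = 1.25, however many processes are available.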
