An Introduction to Parallel Algorithms, by Joseph JaJa (Addison-Wesley). Related text: Tom Leighton, Introduction to Parallel Algorithms and Architectures: Arrays, Trees, Hypercubes, Morgan Kaufmann, San Mateo, CA. The central question is how to solve given problems efficiently on parallel computers, and what "efficient" means. Answering it involves the design and analysis of algorithms and data structures, and the comparison of algorithms and data structures.
We show several parallel algorithms on these models. List ranking: Joseph JaJa, An Introduction to Parallel Algorithms, Chap. 3. It is important to view the introduced techniques as general guidelines. These parallel algorithms are used frequently in the remainder of this book.
Directed acyclic graphs can be used to represent certain parallel computations in a natural way, and can provide a simple parallel model that does not include any architecture-related features. A node whose outdegree is equal to zero represents an output.
Such material is usually covered at the undergraduate level in computer science and computer engineering curricula.
Algorithms are expressed in a high-level language in common use. We next give a list of the statements most frequently used in our algorithms: the assignment statement, the conditional statement, the for loop, the while loop, and the exit statement. In a conditional statement, the else part is optional. In a for loop, the body is executed if the initial value is less than or equal to the final value, and the same process is repeated with the new value of the variable. The while loop is similar.
We are primarily interested in the worst-case analysis of algorithms. The bounds on the resources are expressed asymptotically using the standard O, Ω, and Θ notation. There exists a limiting bound on the running time.
The model should be simple enough to allow us to describe parallel algorithms easily.
The running time of a sequential algorithm is estimated by the number of basic operations required by the algorithm as a function of the input size. This definition naturally leads to questions such as what constitutes a basic operation. These issues depend on the specific problem at hand and the model of computation used. The cost of an operation does not depend on the word size. A logarithm used in an asymptotic expression will always have a minimum value of 1. A knowledge of this model beyond our informal description is not necessary for understanding the material covered in this book.
Modeling parallel computation is considerably more challenging given the new dimension introduced by the presence of many interconnected processors. The literature contains an abundant number of parallel algorithms for specific architectures and specific parallel machines.
The parallel algorithms developed for the model should be easily implementable on parallel computers. We also state our choice of the parallel model used in this book and provide justification for this choice. A directed acyclic graph (dag) with n input nodes represents a computation that has no branching instructions and that has an input of size n. This representation is completely architecture-independent.
A dag specifies the operations performed by the algorithm. A node whose outdegree is equal to zero represents an output. Each input is represented by a node that has no incoming arcs. We assume the unit cost criterion.
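The dag model can be made concrete with a small sketch. Under the unit cost criterion, and assuming enough processors are available, the earliest time at which a node can execute is the length of the longest path reaching it from an input node. The Python code below is an illustration only: the function names (dag_levels, chain_sum_arcs, tree_sum_arcs) and the node numbering are assumptions made for this example and do not come from the book. It compares two dags for summing n numbers, a sequential chain and a balanced binary tree.

```python
from collections import defaultdict


def dag_levels(num_nodes, arcs):
    """Earliest execution time of every node, assuming unit cost per operation.

    Nodes are numbered 0..num_nodes-1 (hypothetical numbering). An arc (i, j)
    means that the value produced at node i is an operand of node j.
    Input nodes (no incoming arcs) get time 0.
    """
    succs = defaultdict(list)
    indeg = [0] * num_nodes
    for i, j in arcs:
        succs[i].append(j)
        indeg[j] += 1
    t = [0] * num_nodes
    ready = [v for v in range(num_nodes) if indeg[v] == 0]  # input nodes
    while ready:
        v = ready.pop()
        for w in succs[v]:
            t[w] = max(t[w], t[v] + 1)  # w must wait for its slowest operand
            indeg[w] -= 1
            if indeg[w] == 0:
                ready.append(w)
    return t


def chain_sum_arcs(n):
    """Dag of the sequential sum: ((a0 + a1) + a2) + ... ; inputs are 0..n-1."""
    arcs = [(0, n), (1, n)]
    for k in range(1, n - 1):
        arcs += [(n + k - 1, n + k), (k + 1, n + k)]
    return arcs


def tree_sum_arcs(n):
    """Dag of the balanced binary tree sum; n is assumed to be a power of 2."""
    arcs, level, next_id = [], list(range(n)), n
    while len(level) > 1:
        nxt = []
        for a, b in zip(level[0::2], level[1::2]):
            arcs += [(a, next_id), (b, next_id)]
            nxt.append(next_id)
            next_id += 1
        level = nxt
    return arcs


if __name__ == "__main__":
    n = 8
    print(max(dag_levels(2 * n - 1, chain_sum_arcs(n))))  # depth of the chain
    print(max(dag_levels(2 * n - 1, tree_sum_arcs(n))))   # depth of the tree
```

Both dags perform n - 1 additions, but the chain has depth n - 1 while the tree has depth log n, which is why the tree is the natural starting point for a parallel algorithm.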
We can specify a particular implementation of the algorithm by scheduling each node for execution on a particular processor. The time t_i of an input node i is assumed to be 0.
Matrix Multiplication: Let A and B be two n x n matrices, and suppose that n^3 processors are given.
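One way to picture a schedule for matrix multiplication with n^3 processors is sketched below. It assumes the usual approach of computing all n^3 products in a single time unit and then adding the n products for each entry with a balanced binary tree; the surviving text does not spell out the schedule, so this is an assumption, and the function name pram_style_matmul is hypothetical. The inner Python loops stand in for operations that would run concurrently.

```python
def pram_style_matmul(A, B):
    """Multiply two n x n matrices level by level and count the time units.

    Time unit 1 forms the n^3 products A[i][l] * B[l][j]; every later time
    unit halves the number of partial sums kept for each entry (i, j).
    """
    n = len(A)
    # One conceptual processor per triple (i, j, l).
    prod = [[[A[i][l] * B[l][j] for l in range(n)] for j in range(n)]
            for i in range(n)]
    steps = 1
    width = n
    while width > 1:
        half = (width + 1) // 2
        for i in range(n):
            for j in range(n):
                # Reads come from positions >= half and writes go to positions
                # < width // 2, so operations within a time unit do not conflict.
                for l in range(width // 2):
                    prod[i][j][l] += prod[i][j][l + half]
        width = half
        steps += 1
    C = [[prod[i][j][0] for j in range(n)] for i in range(n)]
    return C, steps


if __name__ == "__main__":
    C, steps = pram_style_matmul([[1, 2], [3, 4]], [[5, 6], [7, 8]])
    print(C, steps)
```

The number of time units grows like log n rather than n, at the price of using n^3 processors in the first time unit.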
Shared memory is also referred to as global memory. Each processor is uniquely identified by an index. The shared-memory model has two basic modes of operation, synchronous and asynchronous. A standard name for the synchronous shared-memory model is the parallel random-access machine (PRAM) model.
The effect of the global read instruction is to move the block of data X stored in the shared memory into the local variable Y. We can estimate the amount of computation and communication performed by a given algorithm.
Input: An n x n matrix A and a vector x of order n residing in the shared memory. The initialized local variables include the order n. The same notation can be used to indicate a subvector of a given vector. Therefore, an explicit synchronization primitive must be placed in each processor's program after the computation of z.
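A matrix-vector product on the shared-memory model can be sketched as follows. The sketch assumes a partition of A into p blocks of r = n/p consecutive rows, with processor i responsible for block i; the function name matvec_shared_memory and the local variable names z, B, and w are choices made for this illustration. Copying from the shared arrays stands in for a global read, slice assignment stands in for a global write, and the loop over i stands in for p processors running the same program.

```python
def matvec_shared_memory(A, x, p):
    """Compute y = A x with p simulated processors, one row block each."""
    n = len(A)
    assert n % p == 0, "this sketch assumes that p divides n evenly"
    r = n // p
    y = [0] * n  # shared output vector
    for i in range(1, p + 1):  # each iteration plays the role of processor i
        z = list(x)                                     # global read of x
        B = [row[:] for row in A[(i - 1) * r:i * r]]    # global read of the row block
        w = [sum(B[k][j] * z[j] for j in range(n)) for k in range(r)]  # local work
        y[(i - 1) * r:i * r] = w                        # global write of r components
    return y


if __name__ == "__main__":
    A = [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12], [13, 14, 15, 16]]
    print(matvec_shared_memory(A, [1, 0, -1, 2], p=2))
```

In this row-block variant each processor writes a disjoint part of y, so the processors never wait for one another. A variant that splits A by columns would have each processor produce a partial vector, and the partial vectors could only be added after all of them are ready, which is the situation that calls for an explicit synchronization primitive.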
Each processor executes the same algorithm. The initialized local variables are n and the processor number i. Output: The sum of the entries of A stored in the shared location S. The array A holds its initial value.
The computation scheme (step 3) is based on a balanced binary tree whose leaves correspond to the elements of A. Each internal node represents a sum operation. The arrays A and B are stored in the global memory and can be accessed by any processor. Each of the remaining processors stays idle during that time.
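The balanced binary tree scheme for the sum can be sketched as a simulation. The code assumes that n is a power of 2 and that a working array B (indexed from 1) receives a copy of A, so that A holds its initial value; the function name pram_sum is hypothetical. Each pass of the while loop corresponds to one level of the tree, and all reads of a level are completed before any write, which mimics a synchronous step.

```python
def pram_sum(A):
    """Sum the entries of A level by level; return (sum, number of tree levels)."""
    n = len(A)
    assert n > 0 and n & (n - 1) == 0, "this sketch assumes n is a power of 2"
    B = [None] + list(A)  # B(1..n) is a working copy, A is left untouched
    active = n            # number of partial sums still alive
    levels = 0
    while active > 1:
        active //= 2
        # Processor i (1 <= i <= active) adds B(2i-1) and B(2i); the other
        # processors stay idle at this level.
        new_vals = [B[2 * i - 1] + B[2 * i] for i in range(1, active + 1)]
        B[1:active + 1] = new_vals
        levels += 1
    return B[1], levels


if __name__ == "__main__":
    print(pram_sum([1, 2, 3, 4, 5, 6, 7, 8]))
```

The number of active processors halves at every level, so most processors are idle for most of the computation even though the total time is only log n levels.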
The specific processor executing the operation is indicated below each node.
It turns out that the three models EREW, CREW, and CRCW do not differ substantially in their computational powers. We discuss their relative powers in a later chapter.
The algorithm for processor P(i, j, l) is stated next. Input: Two n x n matrices A and B stored in the shared memory. The initialized local variables are n and the triple of indices (i, j, l) identifying the processor.
Output: The product C = AB stored in the shared memory.
The network model incorporates the topology of the interconnection between the processors into the model itself. In describing algorithms for the network model, we use two constructs for communication between processors. A processor P executing the receive instruction suspends the execution of its program until the data from processor Pj are received. It then stores the data in Y and resumes the execution of its program. The study of routing algorithms is outside the scope of this book.
There are several parameters used to evaluate the topology of a network G. We shall introduce the following representative topologies: the linear array, the two-dimensional mesh, and the hypercube. Many other networks have been studied extensively for their suitability for parallel processing. Several books describing them are given in the list of references at the end of this chapter.
The Linear Processor Array and the Ring. A linear processor array consists of p processors P1, P2, ..., Pp connected in a linear array. The diameter of the linear array is p - 1. A ring is a linear array of processors with an end-around connection.
Assume that p divides n evenly. Then each processor can compute locally the product Bw. The computation performed by each processor consists of the two operations in steps 1 and 3. When the algorithm terminates, the output vector will be stored in P1.
The two-dimensional mesh is a two-dimensional version of the linear array. This topology has several attractive features.
Many of the algorithms developed for the mesh have been asynchronous. Systolic algorithms operate in a fully synchronous fashion. Rows of A move synchronously into the left side of the mesh.
The Hypercube. Two processors are connected if and only if their indices differ in exactly one bit position.
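The connection rule of the hypercube translates directly into bit operations. The sketch below assumes that the 2^d processors of a d-dimensional hypercube are indexed 0 to 2^d - 1; the function names are hypothetical. Flipping bit b of an index gives the neighbor across dimension b, and the number of differing bits between two indices gives the length of a shortest path between them.

```python
def hypercube_neighbors(i, d):
    """Indices of the d neighbors of processor i in a d-dimensional hypercube."""
    return [i ^ (1 << b) for b in range(d)]


def are_connected(i, j):
    """True if i and j differ in exactly one bit position."""
    diff = i ^ j
    return diff != 0 and diff & (diff - 1) == 0


def hypercube_distance(i, j):
    """Number of bit positions in which i and j differ."""
    return bin(i ^ j).count("1")


if __name__ == "__main__":
    print(hypercube_neighbors(5, 3))
    print(are_connected(5, 7), are_connected(5, 6))
    print(max(hypercube_distance(0, j) for j in range(8)))
```

Every processor has exactly d neighbors, and no two processors are more than d links apart, which is part of what makes the topology attractive.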
Notice that our processors are indexed from 0 to p - 1. The hypercube is popular because of its regularity. The hypercube has a recursive structure: a d-dimensional cube can be split into two (d - 1)-dimensional cubes. One cube has the most significant address bit equal to 0, and the other has it equal to 1. We next develop synchronous hypercube algorithms for several simple problems.
Sum on the Hypercube: Each entry A(i) of an array A of size n is stored initially in the local memory of processor Pi.
The algorithm to compute S is straightforward. Notice that the indices of the array elements begin with 0. For n = 8, the third iteration clearly sets A(0) to the sum S. For larger n, the remaining iterations continue in a similar fashion.
Broadcasting on the Hypercube: A simple strategy can be used to solve this problem. We proceed from the lowest-order dimension to the highest dimension consecutively. During the first iteration, P0 sends a copy of X to P1. During the second iteration, P0 and P1 send copies of X to P2 and P3. The hypercube algorithms in this class use one dimension at each time unit such that consecutive dimensions are used at consecutive time units.
Matrix Multiplication on the Hypercube: We index the processors by triples of indices. The goal is to compute the entries C(i, j) of the product C = AB. The overall algorithm consists of three stages. The implementation of the first stage consists of two substages, which rely on our broadcasting algorithm.
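The sum and broadcast procedures on the hypercube can be sketched as simulations over a list that holds one local value per processor. The sketch assumes n = 2^d processors indexed from 0. The sum uses the dimensions from the highest to the lowest, so that the partial sums collapse toward P0; that order is an assumption chosen to match the remark that the third iteration for n = 8 leaves the sum in A(0). The broadcast uses the dimensions from the lowest to the highest, as described above. The function names are hypothetical.

```python
def hypercube_sum(A):
    """Return the sum of A, accumulated at processor 0."""
    n = len(A)
    d = n.bit_length() - 1
    assert n > 0 and 1 << d == n, "this sketch assumes n is a power of 2"
    local = list(A)  # local[i] is the value held by processor i
    for l in range(d - 1, -1, -1):
        # Processors 0 .. 2^l - 1 each receive from their neighbor across
        # dimension l and add the received value to their own.
        for i in range(1 << l):
            local[i] += local[i ^ (1 << l)]
    return local[0]


def hypercube_broadcast(x, d):
    """Broadcast x from processor 0; return (local values, sends per iteration)."""
    local = [None] * (1 << d)
    local[0] = x
    schedule = []
    for l in range(d):
        # Every processor that already holds x sends it across dimension l.
        sends = [(i, i ^ (1 << l)) for i in range(1 << l)]
        for src, dst in sends:
            local[dst] = local[src]
        schedule.append(sends)
    return local, schedule


if __name__ == "__main__":
    print(hypercube_sum([1, 2, 3, 4, 5, 6, 7, 8]))
    values, schedule = hypercube_broadcast("X", 3)
    print(values)
    print(schedule)
```

Both procedures take d = log n iterations and use exactly one dimension per iteration, so they belong to the class of algorithms that use consecutive dimensions at consecutive time units.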
The dag model presents only partial information about a parallel algorithm. Although the network model seems to be considerably better suited to resolving both computation and communication issues than is the dag model, our choice for the remainder of this book is the PRAM model. A PRAM algorithm includes an explicit understanding of the operations to be performed at each time unit. Many of the network algorithms can be directly derived from PRAM algorithms.
PRAM algorithms can be analyzed within this more general framework. This framework is closely related to what is commonly referred to as data-parallel algorithms.
We have just explained why four ways of measuring the performance of parallel algorithms are asymptotically equivalent.
The work-time (WT) paradigm provides informal guidelines for a two-level top-down description of parallel algorithms. The upper level suppresses specific details of the algorithm. The lower level follows a general scheduling principle. The description paradigm we shall outline will help us further.
Upper Level: Describe the algorithm in terms of a sequence of time units. We define the work performed by a parallel algorithm to be the total number of operations used. The presentation is stated only in terms of time units. The presence of p processors would have bounded the number of operations to at most p in each unit of time.
Lower Level: Suppose that the WT presentation of algorithms results in a parallel algorithm that runs in T(n) time units while performing W(n) work. The general WT scheduling principle given next adapts such an algorithm to a PRAM with p processors.
The WT Scheduling Principle: Let W_i(n) be the number of operations performed in time unit i. The success of this principle depends on two implementation issues. One of them is the allocation of work to processors: each processor Pk must know the instruction it has to execute and the corresponding operands.
We now address the implementation details of the WT scheduling principle as related to the PRAM algorithm for computing the sum of n numbers. Output: The sum of the elements of A stored in the shared variable S.
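The effect of the WT scheduling principle can be checked numerically. The sketch below assumes that the W_i(n) operations of time unit i are simulated by p processors in ceil(W_i(n)/p) parallel steps, which gives a total of at most floor(W(n)/p) + T(n) steps; this bound is the standard consequence of the principle and is stated here as an assumption rather than quoted from the text. The work profile used for the sum (n copy operations, then n/2, n/4, ..., 1 additions, then one final assignment) and the function names are also assumptions.

```python
def wt_schedule_steps(work_per_unit, p):
    """Parallel steps needed when each time unit is simulated by p processors."""
    return sum((w + p - 1) // p for w in work_per_unit)  # ceil(w / p) per unit


def wt_upper_bound(work_per_unit, p):
    """The bound floor(W / p) + T, with W the total work and T the time units."""
    return sum(work_per_unit) // p + len(work_per_unit)


def sum_work_profile(n):
    """Assumed W_i(n) for the tree sum of n numbers (n a power of 2)."""
    profile = [n]  # copy A into the working array
    m = n
    while m > 1:
        m //= 2
        profile.append(m)  # one addition per surviving partial sum
    profile.append(1)  # store the result in S
    return profile


if __name__ == "__main__":
    profile = sum_work_profile(8)
    print(profile, sum(profile), len(profile))
    print(wt_schedule_steps(profile, 3), wt_upper_bound(profile, 3))
```

With p = 1 the schedule takes exactly W(n) steps, and with p at least as large as the widest time unit it takes exactly T(n) steps, so the bound interpolates between the sequential and the fully parallel case.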
The operation represented by a node is executed by the processor indicated below the node. The array A retains its original value. Step 3 takes O(1) time.
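A p-processor version of the sum can be sketched by letting the processors share the operations of each time unit in batches of at most p. This is only one possible allocation, chosen for simplicity, and it is not claimed to be the allocation used in the book; the function name scheduled_sum and the helper run_time_unit are hypothetical. The working array B is indexed from 1, n is assumed to be a power of 2, and A is never modified.

```python
def scheduled_sum(A, p):
    """Sum A with p simulated processors; return (sum, parallel steps used)."""
    n = len(A)
    assert n > 0 and n & (n - 1) == 0, "this sketch assumes n is a power of 2"
    assert p >= 1
    B = [0] * (n + 1)
    steps = 0

    def run_time_unit(indices, op):
        """Execute op(i) for every index, at most p operations per step."""
        nonlocal steps
        for start in range(0, len(indices), p):
            batch = indices[start:start + p]
            results = [op(i) for i in batch]  # all reads of the step first
            for i, value in zip(batch, results):
                B[i] = value                  # then all writes
            steps += 1

    # Time unit 1: copy the input so that A retains its original value.
    run_time_unit(list(range(1, n + 1)), lambda i: A[i - 1])
    # One time unit per level of the balanced binary tree.
    m = n
    while m > 1:
        m //= 2
        run_time_unit(list(range(1, m + 1)), lambda i: B[2 * i - 1] + B[2 * i])
    steps += 1  # final time unit: a single processor stores B(1) in S
    return B[1], steps


if __name__ == "__main__":
    print(scheduled_sum([1, 2, 3, 4, 5, 6, 7, 8], p=3))
    print(scheduled_sum([1, 2, 3, 4, 5, 6, 7, 8], p=8))
```

Processing the indices of a level in increasing order is safe here because an addition at index i only reads positions 2i - 1 and 2i, which no earlier batch of the same level has written.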
Work Versus Cost: The notion of cost introduced in Section 1.