### Parallel Algorithms Prof. Phalguni Gupta Department of Computer Science and Engineering Indian Institute of Technology, Kanpur

#### Lecture - 4

We have been discussing the different paradigms that can be used to solve problems on a sequential machine. Today we would like to discuss parallel algorithms and parallel machines, and then we have to define the different models available for parallel machines. Finally, if time permits, we will discuss one example of how to solve a problem on these parallel machines. The problem we will be considering for our study is, for example, finding the sum of n numbers, or finding the minimum or maximum of n numbers.

Now, what has happened? The demand for computing power is increasing day by day; we must agree on that. The designers are trying their best to improve the speed of the computer. If you observe the last thirty years, they have been trying their best to increase the computing power, and roughly every five years you will observe that the speed of the machine has increased manyfold. But then limitations are coming out, because the components that are used in the computer cannot be faster than the speed of light.

So you may not be able to achieve anything beyond a certain speed. Now, what are the alternatives? The demand is increasing day by day. For example, if you remember, initially our monsoon prediction model was based on 13 parameters, and these thirteen parameters do not necessarily give the best prediction. Even with 13 parameters, if I have to predict something, I need around 24 hours to solve the problem and to predict. But by that time the monsoon cloud will have covered, or will be crossing, the designated area, so you may not be able to predict anything.

Now, if I have to give you a better prediction model, what do we have to do? You have to predict the monsoon condition based on the cloud while it is still far away from (( )). Whether the cloud is far away or nearby, I must be able to say where and when the cloud will reach this area. To do that you need a larger number of parameters. But if you increase the number of parameters, the computation increases manyfold. What we want now is to increase the number of parameters for prediction, and at the same time the operating speed should be such that you are able to predict the monsoon in time.

So, this is a very difficult task. The demand is increasing, and at the same time the cost of hardware is coming down. So, one solution could be this: instead of using one machine, why not use several machines to solve a problem, because the cost of the machine is decreasing while the demand for computing power is increasing. So, what we are suggesting is that we use several machines to solve a single problem.

(Refer Slide Time: 04:53)



So, what you have to do is this. Suppose you have a problem $P$ and you have $n$ machines. Then you divide the problem $P$ into $n$ subproblems. Subproblem $P_i$ is given to machine $M_i$, and machine $M_i$ solves the subproblem $P_i$, for all $i$. So $M_1$ solves the subproblem $P_1$, and so on. Let $S_1$ be the solution of subproblem $P_1$, $S_2$ the solution of subproblem $P_2$, and $S_i$ the solution of subproblem $P_i$.

Finally, these machines are used together to combine the results to find the solution of $P$. So, the idea is that you want to use several machines to solve the problem (()): different machines are given different subproblems, they solve the subproblems simultaneously, and then the results are combined to get the final solution. This idea basically gives us the idea of parallel machines.
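The divide, solve and combine idea can be sketched for the sum of n numbers, the example problem mentioned at the start of the lecture. This is only an illustration: the function name `parallel_sum`, the contiguous chunking rule, and the use of a thread pool from Python's `concurrent.futures` to stand in for the n machines are assumptions, not something given in the lecture.

```python
from concurrent.futures import ThreadPoolExecutor

def parallel_sum(values, n_machines):
    """Split problem P into subproblems P_i, solve each, then combine the S_i."""
    # Assumed chunking rule: contiguous pieces of (almost) equal size.
    size = max(1, -(-len(values) // n_machines))  # ceiling division
    subproblems = [values[s:s + size] for s in range(0, len(values), size)]
    # Each worker plays the role of machine M_i solving subproblem P_i.
    with ThreadPoolExecutor(max_workers=n_machines) as pool:
        partial = list(pool.map(sum, subproblems))  # S_1, S_2, ...
    # Combine the partial solutions into the solution of P.
    return sum(partial)
```

For instance, `parallel_sum(list(range(1, 101)), 4)` splits the hundred numbers into four pieces of 25 and combines four partial sums.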

Now, an algorithm you design for such a parallel machine is known as a parallel algorithm. Now, what happens is that some sequential algorithms are inherently parallel. In that case the problem is not that difficult: you can easily divide the problem into subproblems, and the solutions of the subproblems are combined to get the final solution.

(Refer Slide Time: 07:02)



Say, for example, I have to do a two-vector addition: I have two vectors and I want to add them, and there are $n$ machines. What I can do is divide each vector into $n$ equal parts. Say the vectors are $v_1$ and $v_2$. The addition of the first parts is done by processor $p_1$, or machine $M_1$, the addition of the next parts is done by machine $M_2$, and the addition of the last parts is done by machine $M_n$.
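The vector addition example can be written out as a small sketch. The helper names `chunk_bounds` and `vector_add` are hypothetical, and the parts are processed in an ordinary loop here; the point is only that each part is independent of the others, so on a parallel machine each iteration of the outer loop could be handed to a different machine.

```python
def chunk_bounds(length, n_machines):
    """Return (start, end) pairs splitting range(length) into n nearly equal parts."""
    base, extra = divmod(length, n_machines)
    bounds, start = [], 0
    for m in range(n_machines):
        end = start + base + (1 if m < extra else 0)
        bounds.append((start, end))
        start = end
    return bounds

def vector_add(v1, v2, n_machines):
    """Add two vectors part by part; no part needs data from another part."""
    assert len(v1) == len(v2)
    result = [0] * len(v1)
    for start, end in chunk_bounds(len(v1), n_machines):
        # On a real parallel machine this block would run on its own machine.
        for i in range(start, end):
            result[i] = v1[i] + v2[i]
    return result
```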

So, this is inherent parallelism, and you can easily obtain a solution to this problem. But in reality this may not be the case: you may not be able to use the available sequential algorithm to solve the problem on parallel machines.

(Refer Slide Time: 11:53)

So, in order to do that you may have to redesign the whole algorithm for your parallel machines. That is the goal, or aim, of this course: how to design parallel algorithms for different problems. Let me recall Flynn's classification. According to Flynn, the whole class of computers can be divided into four classes. These classes are based on two things: one is the instruction stream and the other is the data stream.

By instruction stream we mean the set of instructions to be performed by the different machines, and by data stream we mean the set of data to be used during the computation. So, Flynn's classification tells us that machines can be divided into four classes. One is known as single instruction stream, single data stream. The next one is multiple instruction stream, single data stream. Then you have single instruction stream, multiple data stream. Finally, you have multiple instruction stream, multiple data stream. In short, we call them SISD, MISD, SIMD and MIMD.

So, Flynn classifies the whole class of computers into four categories: one is SISD, another is MISD, then SIMD, and finally you have MIMD.

(Refer Slide Time: 12:54)



Let us first discuss SISD, which is basically the usual sequential machine. Here you have one control unit, one processor and a memory. The control unit sends a stream of instructions to the processor; the processor gets the data from the memory, performs the operations and sends the results back to the memory. This is the simple structure of SISD, and a sequential machine works on it.

(Refer Slide Time: 14:53)

Now, let us discuss MISD. Here you have several control units $c_1, c_2, \ldots, c_n$. You have processor $p_1$ attached to control unit $c_1$, $p_2$ attached to control unit $c_2$, then $p_3$, and so on up to $p_n$, and you have a memory. Control unit $c_1$ sends one instruction to $p_1$; say this one is an add instruction, another one may be a multiply, and all the processors get the data from the same memory location.

That is, all the control units send instructions to the processors for different types of operations, but the operations are to be performed on the same data. In reality, an application for this type of model does not exist, and as a result this model died on the spot.

(Refer Slide Time: 15:08)



Now, let us consider the third model, which is SIMD. Here we have one control unit and processors $p_0, p_1, p_2, \ldots, p_n$. The control unit broadcasts a single stream of instructions to all the processors. Each processor fetches its data either from its local memory or from the common memory, depending on the model of the machine, or it can get the data from another processor, and then performs the operation. So, here the data may be different for different processors: while $p_0$ is getting its data from location $x$, $p_1$ may be getting it from another location $y$, another processor from $z$, and another from $a$. They perform the operation on their data and send the result either to the common memory or to the local memory.

So, the control unit broadcasts the same set of instructions to the different processors. Each processor gets its data either from the local memory or from the common memory, or, depending on the network model, from another processor it is connected with (( )). The processor performs the operation and sends the result to the appropriate location. So, this is the SIMD model; we will discuss SIMD in detail later on. And then we have multiple instruction stream, multiple data stream (MIMD).

(Refer Slide Time: 17:22)

Here you have, basically, control unit 1, control unit 2, up to control unit n, and you have processor $p_1$, processor $p_2$, up to processor $p_n$. These processors are connected either through a processor interconnection network or through a common memory. Control unit 1 sends instructions to processor $p_1$ to perform one set of instructions, and $p_1$ fetches its data either from its local memory, or from the common memory, or from the neighboring processors.

Similarly, control unit 2 sends another set of instructions to $p_2$; $p_2$ receives that set of instructions and performs the operations, taking the data from the common memory, from a neighboring processor, or from its local memory, and so on. So, that is the idea of MIMD.

(Refer Slide Time: 18:31)



So, we will be discussing SIMD and MIMD in detail, and for most of the algorithms in this course we will be considering SIMD for our study. Here you have one control unit and $n$ processors. Either the processors communicate among themselves through an interconnection network, or all processors read and write their data through a common memory.

So, you have one control unit and several processors. The control unit broadcasts the instructions to the different processors. All the processors which are active take the data from the common memory, or get the data from a neighboring processor through the interconnection network, perform the operations, and store the result in the desired area.

The model which is based on the common memory is known as the shared memory model. In the other kind of machine, you have a control unit along with $n$ processors, and the processors communicate among themselves through an interconnection network; that is the interconnection-based model. Now, based on the different types of interconnection, you get different types of parallel machine.

(Refer Slide Time: 20:57)



Now, one model is known as the PE-to-PE model. You have a control unit; you have PE 0 with a memory attached to it, PE 1 with memory ME 1, which is its local memory, and so on up to PE n-1 with ME n-1; and finally you have the interconnection network.

So, this is the PE-to-PE model. The control unit sends the instructions to all the processors. Each processor gets the data either from its own local memory or, if it wants data held by another processor, it sends the request through the interconnection network.

(Refer Slide Time: 22:07).



It gets the data of the other processor and uses that for its work. There may be another model, which is known as the PE-to-ME model. Here you have the control unit, and you have PE 0, PE 1, up to PE n-1 on one side, then the interconnection network, and on the other side ME 0, ME 1, up to ME n-1. So, when PE 1 executes an instruction, it can get the data from a memory module through the interconnection network, and, say, PE n-1 can get its data from memory module ME 1.

(Refer Slide Time: 23:26)

Concurrent Read Exclusive Write Model (CREW)

So, these are the three ways you can think of: one is the shared memory model, one is the PE-to-PE model, and another is the PE-to-ME model. Now let us first consider the shared memory model in detail in this session. Here you have $n$ processors, numbered $p_0, p_1, p_2, \ldots, p_{n-1}$. The control unit controls these $n$ processors, and each processor can be made active or inactive by setting a mask. All the active processors are allowed to perform the operations, taking the data from the common memory and writing the results back to the common memory.
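One SIMD step under a mask can be pictured with a few lines of code. This is a simplified sketch under stated assumptions: the function name `simd_step` is hypothetical, each processor is assumed to read and write its own cell of a list standing in for the common memory, and the "broadcast instruction" is modelled as an ordinary Python function.

```python
def simd_step(operation, memory, mask):
    """Apply one broadcast instruction on every active processor.

    memory[i] is the datum used by processor p_i; mask[i] says whether p_i
    is active. Inactive processors leave their datum unchanged.
    """
    return [operation(x) if active else x for x, active in zip(memory, mask)]

# Only p_0 and p_2 are active, so only their cells are doubled.
doubled = simd_step(lambda x: 2 * x, [1, 2, 3, 4], [True, False, True, False])
```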

Now, based on how reading and writing are allowed, we can classify the shared memory model into four groups. One is known as the concurrent read concurrent write model, which is the most powerful model; then there is the concurrent read exclusive write model; the next one is the exclusive read concurrent write model; and then the exclusive read exclusive write model.

Now, what do we mean by concurrent read? By concurrent read we mean that two or more processors are allowed to read a particular memory location simultaneously. By exclusive read I mean that no two processors are allowed to read a particular memory location simultaneously (( )). Similarly, concurrent write means that two or more processors are allowed to write data into the same memory location simultaneously, and by exclusive write we mean that no two processors are allowed to write simultaneously to a particular memory location at any instant of time.

Now, in the concurrent read concurrent write model, two or more processors are allowed to read simultaneously from the same memory location, and are also allowed to write into the same memory location simultaneously at any instant of time.

Now, think about concurrent write. You will observe that, in reality, allowing different processors to write into the same memory location does not have much meaning, and it is not that easy to handle. However, we can think of situations (()) where concurrent write can be allowed; we will discuss this part shortly.

So, the next model is the concurrent read exclusive write model. That means two or more processors are allowed to read the same memory location simultaneously, but no two processors are allowed to write to the same memory location simultaneously at any instant of time. Now, the exclusive read concurrent write model does not have much meaning, because there you are not allowing concurrent read, which is much simpler to provide than concurrent write.

So, that model did not last for long, and we will not discuss it. Finally, in the exclusive read exclusive write model, no two processors are allowed to read the same location simultaneously, and no two processors are allowed to write simultaneously, at any instant of time. Now, how do we handle the concurrent read part in reality? Suppose there are $n$ processors, and all $n$ processors want to read the same memory location simultaneously; then this can be done through broadcasting.

(Refer Slide Time: 28:48)



Broadcasting: that is, suppose $p_0, p_1, p_2, p_3, p_4, p_5, p_6, p_7$ simultaneously want to read the location $l$. What we do is this: $p_0$ reads $l$ and writes it into a location, say $b_1$. Now $p_1$ reads $l$ and $p_2$ reads $b_1$ simultaneously, and they write into the locations $b_2$ and $b_3$ simultaneously. Now $p_3$ reads the data from $l$, $p_4$ reads the data from $b_1$, $p_5$ reads from $b_2$ and $p_6$ reads from $b_3$, and they write into $b_4$, $b_5$, $b_6$ and $b_7$, and so on.

So, can you tell me how much time you need for $n$ processors to read the content of location $l$? The time will be of order $\log n$: the first time only 1 processor reads the location, the second time 2 processors, the third time 4 processors, the fourth time 8 processors, and so on.

So, basically, you can find that after about $\log n$ iterations all the $n$ processors will be able to read the location $l$. So, even if you do not have the facility of concurrent read in the model, we can handle the situation.
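The doubling pattern of the broadcast (1 reader, then 2, then 4, and so on) can be checked with a small counting sketch. It is an illustration only: the function name `broadcast_rounds` is hypothetical, and it assumes, as in the scheme described above, that in each round every existing copy of the value is read by at most one new processor, which then writes one fresh copy.

```python
def broadcast_rounds(n):
    """Count the rounds needed until all n processors have read the value."""
    copies = 1   # cells holding the value; initially only location l
    served = 0   # processors that have already read the value
    rounds = 0
    while served < n:
        readers = min(copies, n - served)  # exclusive read: one reader per copy
        served += readers
        copies += readers                  # every reader writes a new copy
        rounds += 1
    return rounds
```

Under these assumptions `broadcast_rounds(8)` gives 4 rounds (1 + 2 + 4 + 1 readers), so the count grows like $\log_2 n$ rather than like $n$.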

Now, how do we handle the concurrent write part? There are different models for concurrent write. One possible thing is this: suppose $n$ processors want to write simultaneously at a particular location; one way is to write the sum of the values they want to write.

(Refer Slide Time: 31:52)



That is, suppose $p_1$ wants to write $x_1$, $p_2$ wants to write $x_2$, and $p_n$ wants to write $x_n$. One possible thing is that you take the sum of the $x_i$ and write it into the location $l$. Another way could be that the processor with the smallest index is allowed to write, or one of them chosen at random is allowed to write, or you can define some other rule, so that you can handle the problem of concurrent write.
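The three ways of resolving a concurrent write mentioned here (write the sum, let the smallest index win, let a random processor win) can be put side by side in a short sketch. The function name `resolve_concurrent_write`, the policy strings, and the dictionary representation of the write requests are assumptions made for illustration.

```python
import random

def resolve_concurrent_write(requests, policy, rng=None):
    """Decide what ends up in the shared location.

    requests maps a processor index to the value that processor wants to write.
    """
    if not requests:
        raise ValueError("no processor is writing")
    if policy == "sum":
        return sum(requests.values())
    if policy == "smallest_index":
        return requests[min(requests)]
    if policy == "random":
        rng = rng or random.Random()
        return requests[rng.choice(sorted(requests))]
    raise ValueError(f"unknown policy: {policy}")
```

With requests `{3: 10, 1: 7, 2: 5}`, the sum rule stores 22 and the smallest-index rule stores 7, the value of processor 1.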

(Refer Slide Time: 32:32)



Now, let us discuss some of the models based on the interconnection network. The first model is known as the mesh connected computer. Suppose you have $n$ processors $p_0, p_1, \ldots, p_{n-1}$. These processors are arranged in the form of a $q$-dimensional array of size $n_1 \times n_2 \times \cdots \times n_q$, where $n_1 \cdot n_2 \cdots n_q = n$, and $p_{i_1 i_2 \ldots i_q}$ is the processor at the $i_1$-th position in the first dimension, the $i_2$-th position in the second dimension, and the $i_q$-th position in the $q$-th dimension.

(Refer Slide Time: 34:47).



Now, each processor has at most $2q$ connections; that is, $p_{i_1 i_2 \ldots i_q}$ is connected with $p_{i_1 i_2 \ldots (i_j \pm 1) \ldots i_q}$ for all $j = 1$ to $q$.

So, $i_1 i_2 \ldots i_q$ is the index of the processor, and it is connected with $p_{i_1 i_2 \ldots (i_j \pm 1) \ldots i_q}$, provided they exist. Let us consider a two-dimensional mesh connected computer, and also consider that we have 16 processors arranged as $4 \times 4$.
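The connection rule (change one coordinate by plus or minus one, provided the neighbour exists) is easy to state in code. The sketch below is an illustration with a hypothetical function name, `mesh_neighbors`; it assumes positions are 1-based in every dimension.

```python
def mesh_neighbors(index, dims):
    """Neighbours of processor `index` in a q-dimensional mesh of shape `dims`.

    index and dims are tuples of length q; positions are assumed 1-based.
    """
    neighbors = []
    for j in range(len(dims)):
        for step in (-1, +1):
            candidate = list(index)
            candidate[j] += step
            if 1 <= candidate[j] <= dims[j]:  # "provided they exist"
                neighbors.append(tuple(candidate))
    return neighbors
```

On a 4 × 4 mesh a corner such as (1, 1) gets 2 neighbours and an interior processor gets 4, which is the bound 2q for q = 2.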

(Refer Slide Time: 36:20).



This is the $4 \times 4$ arrangement of processors: you have $p_{11}, p_{12}, p_{13}, p_{14}$, then $p_{21}, p_{22}, p_{23}, p_{24}$, and so on. So, these are the 16 processors. Now, as we have said, $p_{11}$ is connected with $p_{12}$ and $p_{21}$, and each connection is bidirectional; that is why we are taking the plus-minus sign. So, this is the two-dimensional mesh connected computer.

Now, here we observe that we have written the processors in the form $p_{jk}$, while in reality the processors are numbered $0, 1, \ldots, n-1$. So, there is a need to introduce some indexing scheme, so that $p_i$ can be mapped onto $p_{jk}$.

(Refer Slide Time: 38:26)

So, several schemes exist for this type of indexing. One is known as the row major indexing scheme. Say $p_i$ is mapped to $p_{jk}$, where $p_i$ is the processor with index $i$ and $p_{jk}$ is the processor in the $j$-th row and the $k$-th column. Since it is the row major indexing scheme, the numbering looks like this:

(Refer Slide Time: 39:14).



The processors are numbered $p_0, p_1, p_2, p_3$ in the first row, $p_4, p_5, p_6, p_7$ in the second row, $p_8, p_9, p_{10}, p_{11}$ in the third row, and $p_{12}, p_{13}, p_{14}, p_{15}$ in the fourth row. So, we have to find the relationship; say, $p_{32}$ should point to $p_9$, that is, $p_9$ should occupy the position of the processor at the third row and the second column.

(Refer Slide Time: 40:13)

So, what is the relationship between $i$, $j$ and $k$? It is $i = (j - 1)n + (k - 1)$, where $n$ here denotes the number of processors in a row (4 in this example). When $j = 1$ and $k = 1$, $i$ becomes 0; when $j = 1$ and $k = 2$, $i = 1$; when $j = 2$ and $k = 1$, $i = 4$; and so on. So, this is the relationship between $i$, $j$ and $k$.
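The worked values above (row 1, column 1 gives 0; row 2, column 1 gives 4; row 3, column 2 gives 9) agree with the usual row major formula $i = (j-1)n + (k-1)$ for a mesh with $n$ columns. A minimal sketch, with hypothetical function names and 1-based rows and columns as in the lecture:

```python
def row_major_index(j, k, n):
    """Index i of the processor in row j, column k (both 1-based), n columns."""
    return (j - 1) * n + (k - 1)

def row_major_position(i, n):
    """Inverse mapping: (row, column), both 1-based, of processor p_i."""
    return i // n + 1, i % n + 1
```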

(Refer Slide Time: 41:25).

Similarly, we can have a numbering scheme known as the column major indexing scheme. Then there is another indexing scheme we can define, which is known as the snake-like row major indexing scheme. In the snake-like row major indexing scheme, the numbering looks like this; it is in the form of a snake or a rope.

In that case, what should be the relationship between $p_i$ and $p_{jk}$? Now, you observe that when the row number is odd, it is the same as row major, that is, $i = (j - 1)n + (k - 1)$ if $j$ is odd. If $j$ is even, then $i = (j - 1)n + (n - k)$. Let us see: for $(2, 1)$, that is $4 + (4 - 1) = 4 + 3 = 7$; and for $(2, 4)$, $n - k$ becomes 0 and $i$ is 4. So, this is the snake-like row major indexing scheme, and similarly you can define the snake-like column major indexing scheme.

(Refer Slide Time: 43:54).



Another scheme, which looks like this, is known as the shuffled row major indexing scheme. Suppose the processor $p_i$ occupies the location $p_{jk}$, that is, the $j$-th row and the $k$-th column, under the row major indexing scheme, and the binary representation of $i$ is $b_1 b_2 b_3 \ldots b_q$. Then the shuffle of $i$ is defined as $b_1 b_{q/2+1} b_2 b_{q/2+2} \ldots b_{q/2} b_q$.

Then, if this is $i'$, $p_{i'}$ occupies the location of $p_{jk}$ in the shuffled row major indexing scheme. The idea is this: suppose $p_i$ is at $p_{jk}$, that is, the $j$-th row and the $k$-th column of the two-dimensional array under the row major indexing scheme, and the binary representation of $i$ is $b_1 b_2 b_3 \ldots b_q$. Then we define the shuffle of $i$ as $b_1 b_{q/2+1} b_2 b_{q/2+2} \ldots b_{q/2} b_q$, and this is $i'$. So, $p_{i'}$ will be occupying the position of the $j$-th row and the $k$-th column.
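Both the snake-like and the shuffled row major schemes can be tried out on the 4 × 4 example. The sketch below uses hypothetical function names; `snake_index` assumes 1-based rows and columns with $n$ columns, and `shuffle_bits` assumes $q$ is even and interleaves the first $q/2$ bits of the row major index with the last $q/2$ bits, which is how the definition $b_1 b_{q/2+1} b_2 b_{q/2+2} \ldots b_{q/2} b_q$ is read here.

```python
def snake_index(j, k, n):
    """Snake-like row major index of row j, column k (1-based), n columns."""
    if j % 2 == 1:                      # odd rows run left to right
        return (j - 1) * n + (k - 1)
    return (j - 1) * n + (n - k)        # even rows run right to left

def shuffle_bits(i, q):
    """Interleave the first q/2 bits of i with the last q/2 bits (q even)."""
    bits = format(i, f"0{q}b")          # b1 b2 ... bq, most significant first
    half = q // 2
    mixed = "".join(a + b for a, b in zip(bits[:half], bits[half:]))
    return int(mixed, 2)

def shuffled_grid(side):
    """Shuffled row major numbering of a side x side mesh (side = 2, 4, 8, ...)."""
    q = 2 * (side.bit_length() - 1)     # bits needed for side * side indices
    return [[shuffle_bits(r * side + c, q) for c in range(side)]
            for r in range(side)]
```

For `side = 4` the rows of `shuffled_grid` come out as 0 1 4 5, 2 3 6 7, 8 9 12 13 and 10 11 14 15, so each 2 × 2 block holds four consecutive indices.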

(Refer Slide Time: 46:12)



(Refer Slide Time: 48:55)



So, in that case the shuffled row major indexing scheme becomes this: $p_0$, $p_1$, then 2 is mapped to 4, so this becomes 4 and this becomes 5; in the second row this becomes 2, this becomes 3, this becomes 6 and this becomes 7. So the first row reads 0, 1, 4, 5, the second row 2, 3, 6, 7, the third row 8, 9, 12, 13 and the fourth row 10, 11, 14, 15. So, basically you can think about it this way.



(Refer Slide Time: 49:38)

So, in generalized form, you can define it as follows. Suppose you have 16 processors on this side and another 16 on that side, that is, $32 \times 32$; then each $16 \times 16$ block can be defined like this, and so on. So, here you have $p_0, p_1, \ldots, p_{15}$, here you will have $p_{16}, p_{17}, \ldots, p_{31}$, and so on. You observe that in the 2-D mesh connected computer each processor has at most 4 connections.

(Refer Slide Time: 50:29)



Now, on the boundary you may have 2 or 3 connections. For example, you have $p_0, p_1, p_2, \ldots, p_{15}$. You observe that $p_9$ has 4 connections, but $p_0$ has only 2 connections, while $p_8$ has 3 connections, and so on. Now, there is another model, which is known as the wrap-around mesh connected computer. Here $p_0$ is connected with $p_{12}$, $p_{13}$ is connected with $p_1$, and so on, and similarly for the other boundary processors. So every processor will have 4 connections.

So, the ILLIAC IV machine is of this type. This model may look a little complex, but it has homogeneity, so it is easy to understand and to implement algorithms on this model. Remember here that the $q$-dimensional mesh connected computer needs $2q$ connections.
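The wrap-around connections can be checked against the examples given ($p_0$ with $p_{12}$, $p_{13}$ with $p_1$). The sketch below is an illustration with a hypothetical function name; it assumes row major numbering and that every row and every column is closed into a ring, and it does not claim to reproduce the actual wiring of ILLIAC IV.

```python
def wraparound_neighbors(i, side):
    """Row major indices of the four neighbours of p_i on a side x side mesh
    whose rows and columns wrap around (side >= 3 so that all four differ)."""
    r, c = divmod(i, side)
    return sorted({
        ((r - 1) % side) * side + c,   # up, wrapping from the first row to the last
        ((r + 1) % side) * side + c,   # down
        r * side + (c - 1) % side,     # left, wrapping from the first column to the last
        r * side + (c + 1) % side,     # right
    })
```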

(Refer Slide Time: 52:29)



Now, the next model is known as the perfect shuffle computer. Let us assume there are $n$ processors $p_0, p_1, p_2, \ldots, p_{n-1}$, and consider the binary representation of $i$. For simplicity, let us assume $n = 2^q$ (()) for some $q$. Can you tell me what the number of bits should be? There will be $q$ bits. Let us denote them $i_{q-1}, i_{q-2}, \ldots, i_0$; the binary representation of $i$ can be expressed in the form $i_{q-1} i_{q-2} \ldots i_0$.

Then $p_i$ is connected with three processors, provided they exist: $p_j$, $p_k$ and $p_l$. Here $j$ is obtained by the operation known as exchange. The exchange of $i$ is nothing but $i_{q-1} i_{q-2} \ldots i_1 \bar{i}_0$, that is, the last bit is complemented. Now, $k$ is known as the shuffle of $i$, which is defined as $i_0 i_{q-1} i_{q-2} \ldots i_1$, and $l$ is the unshuffle of $i$, which is defined as $i_{q-2} i_{q-3} \ldots i_0 i_{q-1}$. So, basically, every processor is connected with at most three processors. Remember, in the case of the two-dimensional mesh every processor is connected with at most four processors; that means here we need a smaller number of connections for any processor. Now, to illustrate this, let us consider $n = 8$.

(Refer Slide Time: 56:05)



See how it looks. You have the processor index $i$: 0, 1, 2, 3, 4, 5, 6, 7; these are the 8 processors we have taken. Then you have $j$, which is the exchange of $i$; you have $k$, which is nothing but the shuffle of $i$; and you have $l$, which is the unshuffle of $i$.

So, the binary representations are 000, 001, 010, 011, 100, 101, 110 and all 1s. The exchange is 1, 0, 3, 2, 5, 4, 7 and 6. The shuffle of $i$ is 0, 4, 1, 5, 2, 6, 3, 7. The unshuffle is 0, 2, 4, 6, 1, 3, 5, 7.

| $i$ | binary | exchange $j$ | shuffle $k$ | unshuffle $l$ |
|---|---|---|---|---|
| 0 | 000 | 1 | 0 | 0 |
| 1 | 001 | 0 | 4 | 2 |
| 2 | 010 | 3 | 1 | 4 |
| 3 | 011 | 2 | 5 | 6 |
| 4 | 100 | 5 | 2 | 1 |
| 5 | 101 | 4 | 6 | 3 |
| 6 | 110 | 7 | 3 | 5 |
| 7 | 111 | 6 | 7 | 7 |

(Refer Slide Time: 58:34)



So, $i$ is connected with its exchange, shuffle and unshuffle. Now, if I have to draw it, let us do it for $p_0, p_1, p_2, p_3, p_4, p_5, p_6, p_7$. $p_0$ is connected with $p_1$ and with $p_0$ itself; $p_1$ is connected with $p_0$, $p_4$ and $p_2$; $p_2$ is connected with $p_3$, $p_1$ and $p_4$; $p_3$ is connected with $p_2$, $p_5$ and $p_6$; $p_4$ is connected with $p_5$, $p_2$ and $p_1$; $p_5$ is connected with $p_4$, $p_6$ and $p_3$; $p_6$ is connected with $p_7$, $p_3$ and $p_5$; and $p_7$ is connected with $p_6$ and with $p_7$ itself.
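The exchange, shuffle and unshuffle operations can be written with bit operations and used to regenerate the connection list for $n = 8$. The sketch follows the convention used in this lecture's example, where the shuffle moves the last bit $i_0$ to the front (so that 1 goes to 4); the function names are hypothetical, and `connections` is assumed to drop the links from a processor to itself.

```python
def exchange(i, q):
    """Complement the last bit i_0 (q is unused, kept for a uniform signature)."""
    return i ^ 1

def shuffle(i, q):
    """Move the last bit i_0 to the front: i_0 i_{q-1} ... i_1."""
    return (i >> 1) | ((i & 1) << (q - 1))

def unshuffle(i, q):
    """Move the first bit i_{q-1} to the end: i_{q-2} ... i_0 i_{q-1}."""
    return ((i << 1) & ((1 << q) - 1)) | (i >> (q - 1))

def connections(i, q):
    """Processors linked to p_i, leaving out links from p_i to itself."""
    return sorted({exchange(i, q), shuffle(i, q), unshuffle(i, q)} - {i})
```

For `q = 3`, `connections(1, 3)` gives `[0, 2, 4]` and `connections(5, 3)` gives `[3, 4, 6]`, matching the list above.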

So, this is the structure of the perfect shuffle computer. Now, one thing I want to tell you: the perfect shuffle computer not only has a smaller number of connections, it is also based on two important properties. The first property concerns where the data $d_i$ held by processor $p_i$ goes after repeated shuffles.

(Refer Slide Time: 60:42).



When $n = 2^q$, where $n$ is the number of processors, after $q$ shuffles the data will come back to its original position. So, property 1 is that if there are $2^q$ processors, then after $q$ shuffles the data of each processor comes back to its original position.

(Refer Slide Time: 61:22).

This is because the index has $q$ bits; if you shuffle it $q$ times, it goes back to the original one. The second property is this: suppose data $x$ is in $p_i$ and data $y$ is in $p_j$, and the binary representation of $i$ and the binary representation of $j$ differ only in the $(q-k)$-th bit; then after $k$ shuffles the data will come to adjacent locations.



(Refer Slide Time: 62:19)

Because, say, you have $i_{q-1}, i_{q-2}$, and so on, then the bit in question, and then $i_0$; in $j$ everything is the same except that this bit is the complement. After the shuffles this bit will come here, to the last position. So, they will be in adjacent locations.

So, most of the algorithms on the perfect shuffle computer depend on these two properties. One property is that if data $x$ is in $p_i$ and $y$ is in $p_j$, and the binary representations of $i$ and $j$ differ in only one bit, the $(q-k)$-th bit, then after $k$ shuffles the data will come to adjacent locations. The other one is that if you have $2^q$ processors, then after $q$ shuffles the data will come back to its original position.
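Both properties can be checked by brute force for small $q$. The sketch below uses the same shuffle convention as the $n = 8$ example (the last bit moves to the front), and all function names are hypothetical. One point is an assumption: the transcript does not make clear how the "$(q-k)$-th bit" is counted, so the second check counts the differing bit position $p$ from the least significant bit (position 0) and verifies that $p$ shuffles bring the two indices to adjacent locations, meaning indices that differ only in the last bit.

```python
def shuffle(i, q):
    """Shuffle of a q-bit index: the last bit moves to the front."""
    return (i >> 1) | ((i & 1) << (q - 1))

def shuffle_times(i, q, times):
    """Apply the shuffle `times` times to index i."""
    for _ in range(times):
        i = shuffle(i, q)
    return i

def returns_home(q):
    """Property 1: q shuffles bring every index back to where it started."""
    return all(shuffle_times(i, q, q) == i for i in range(1 << q))

def adjacent_after(i, p, q):
    """Property 2 (assumed bit numbering): if i and j differ only in bit p,
    counted from the least significant bit, then after p shuffles the two
    indices differ only in the last bit."""
    j = i ^ (1 << p)
    return shuffle_times(i, q, p) ^ shuffle_times(j, q, p) == 1
```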

Now, in the next class, the models we would like to consider are these: the first one is the butterfly model, the second one is the hypercube, the third one is the cube connected cycles, and the next one is the tree model.

(Refer Slide Time: 64:06)

One-Dimensional Pyramid

Then there are the one-dimensional pyramid and two-dimensional pyramid models. So, the models we have to consider are models of SIMD machines. We have already covered the mesh connected computer and the perfect shuffle computer; we will be doing the butterfly, the hypercube, the cube connected cycles, the tree model, the linear array, the one-dimensional pyramid model and then the two-dimensional pyramid model. So, these are the well-known models which we would like to consider. We have already discussed the mesh connected and perfect shuffle computers, and the remaining models we will be discussing tomorrow.