An In-Place Merge Algorithm for Parallel Systems

Denham Coates-Evelyn
Department of Computer Science, King's College, The Strand, London WC2R 2LS, U.K.
denham@dcs.kcl.ac.uk
April 2001
Abstract. An algorithm that does O(n^2) comparisons and data moves to sort a list of n random data values on a uniprocessor system is presented. We show a multiprocessor implementation that has complexity of O(sqrt(n)) using the MIMD model on a mesh-connected system. The AT^2 complexity on this model is O(n^(5/2) lg(n)), which is suboptimum by a factor of O(sqrt(n)/lg(n)).
Key Terms: Parallel Processing, Inversion, Merge, Uniprocessor/Multiprocessor, AT^2 Complexity, Efficiency.
INTRODUCTION

There are well-known algorithms that sort a sequence of N data values in-place and in optimum time by doing a sequence of merge operations iteratively or recursively. In this case, the algorithm is described as a merge-sort algorithm. Ideally we would like our merge algorithm to do the minimum possible compare and exchange operations for each element that is selected into its final sorted location. The algorithm presented here on average does O(m + n) compare/exchange operations for each element selected into its final merged location when merging two sorted lists of m and n data values each in-place. This gives us an O(N^2) merge-sort algorithm on a
uniprocessor system. However, the algorithm gives AT^2 suboptimum performance when implemented on a MIMD mesh-connected system. We begin our discussion with its implementation on a uniprocessor system.
1 Even Merge Algorithm
We define the two-dimensional (2D) array of data values in which each data location is referenced by a row and a column subscript, A = (aij), where i = 0, 1, ..., m - 1 and j = 0, 1, ..., n - 1. Generally, let m <= n.
Denote a sort on all the rows of A and a sort on all the columns of A respectively as:

SORT(Ri)A = B, where Ri = (ai1, ai2, ..., ain) and i = 0, 1, ..., m - 1
SORT(Cj)A = D, where Cj = (a1j, a2j, ..., amj) and j = 0, 1, ..., n - 1

Theorem 1  SORT(Ri)A = B followed by SORT(Cj)B = D ... (1), or SORT(Cj)A = D followed by SORT(Ri)D = B ... (2), retains the previous ordering of the row values created in B (or the previous ordering of the column values created in D). In other words, the rows (or columns) will remain sorted when a row sort (or a column sort) is followed by a column (or a row) sort on A.

Proof  Now B = (bij), where i = 0, 1, ..., m - 1, j = 0, 1, ..., n - 1 and bi1 < bi2 < ... < bin. Define the macro SORT() as a comparing linear sort. We use this macro to sort the columns and rows in the 2D array. Consider the element bij in B and the inversions in the neighbourhood of bij. If bij > b(i+1)j then SORT(Cj)B will initiate Swap(bij, b(i+1)j). If bij > b(i+1)(j+1) < b(i+1)(j+2) < ... < b(i+1)n, then since bij < bi(j+2) < ... < bin, we have bi(j+1) > b(i+1)(j+1); therefore SORT(C(j+1))B initiates Swap(bi(j+1), b(i+1)(j+1)).
On the other hand, if b(i+1)j < bi(j-1) > b(i+1)(j-2) > ... > b(i+1)1, then since bij > bi(j-1) > bi(j-2) > ... > bi1, we have bi(j-1) < b(i+1)(j-1); therefore SORT(C(j-1))B initiates Swap(bi(j-1), b(i+1)(j-1)). The sequence of sorts performed on each column or row in A, B or D are independent of each other. Therefore, the order in which these columns (or rows) are sorted does not affect the final
ordering created by the sort. A sort on all columns to the right and to the left of column j removes any inversion in the rows that intersect column j due to a sort on column j. Hence, the theorem is proven for the case of a row sort followed by a column sort on A.
It is not difficult to see the proof for the case of a column sort followed by a row sort. The proof for this case follows by considering D = (dij) instead of B, with the subscript values used in the above proof interchanged. The column sorts are independent of each other and therefore the order in which they are performed does not affect the final outcome. The same is true of the row sorts. Hence, sorting on all m rows (or n columns) can be done concurrently. The reader should not misunderstand this result to mean that row sorting followed by column sorting creates the same permutation on A as column sorting followed by row sorting. It can be shown that the two sequences of sort operations give distinct permutations on A.
1.1 Sorting the 2 x p Row Vector
In this case, we set m = 2 and n = p to represent two presorted sequences. We then implement repetitive column and row operations on this 2 x p array until a single sorted sequence is obtained. We represent the two sorted sequences to be merged as F and G as defined below. The algorithm takes the two sorted sequences F = {F[0], F[1], ..., F[p - 1]}, where |F| = p, and G = {G[0], G[1], ..., G[q - 1]}, where |G| = q, and then merges them by iterative selection of the next maximum and minimum values among the two sublists. For simplicity we let |F| = |G| = p. We do the following pair of comparisons concurrently, starting with the index variables i = 0, j = p - 1, h = 0, k = p - 1:

(a) F[i] < G[h]?  and  (b) F[k] < G[j]?
We remove inversions between the pairs (F [i], G[h]) and (F [k], G[j]) by swapping where necessary
and then increment i, or decrement j, concurrently. The following algorithm selects F[0] as Min(F, G) and G[p - 1] as Max(F, G).

Even_Merge(F, G, p)
START
   FOR i = 0 TO p/2
      IF F[i] > G[i] SWAP(F[i], G[i]);
   NEXT i
   FOR j = p - 1 TO p/2
      IF F[j] > G[j] SWAP(F[j], G[j]);
   NEXT j
END

Note that the processing done by the two FOR-NEXT loops is independent, so they can be implemented as two concurrent processes on a multiprocessor system. F and G together are equivalent to a two-row vector, and the algorithm does the equivalent of a column sort on this row vector. It is not difficult to see that Theorem 1 also applies to any m x n array. Hence, whenever we apply this algorithm to F and G, the 2 x p array created from F and G remains column and row sorted. The algorithm below is a second but more general implementation of Even_Merge on a uniprocessor system.

Even_Merge(F, G, p)
START
   FOR j = 0 TO p
      FOR i = 1 TO p - j
         IF F[i+j] > G[i] SWAP(F[i+j], G[i]);
      NEXT i
   NEXT j
END

The number of comparisons done by the algorithm to create a row-major sort of the 2 x p array created from F and G is
Sum_{j=1}^{p} j = p(p + 1)/2
Theorem 2  Define the sequence S = {(F[0], G[0]), (F[1], G[1]), ..., (F[p - 1], G[p - 1])}. Even_Merge selects (F[r], G[p - r]) as the next (r - 1)th minimum and maximum in S at the end of the (r - 1)th iteration of the outermost FOR-NEXT loop.
Proof  Since F and G have the ordering < defined on them, we have min(F) = F[0], max(F) = F[p - 1], min(G) = G[0], max(G) = G[p - 1]. When execution of the inner FOR-NEXT loop is terminated we have min(S) = F[0] and max(S) = G[p - 1]. Since F[0] < G[0], G[p - 1] > F[p - 1] and the 2 x p array remains column and row sorted, the next time the outer FOR-NEXT loop is executed we have min(S - F[0]) = F[1] and max(S - G[p - 1]) = G[p - 2]. Assuming the theorem to be true for r = k, we know that the algorithm does comparisons and data exchanges so as to create a column and row sort between the subsets of data values {F[k], F[k + 1], ..., F[p - (k + 1)]} and {G[k], G[k + 1], ..., G[p - (k + 1)]}. The algorithm does not compare paired values (F[j - 1], G[p - (j - 1)]) for which j <= k or j >= p - k. Hence, we see that at the end of the (k + 1)th iteration the outermost FOR-NEXT loop selects F[k] and G[p - k] as min({F[k], ..., F[p - k], G[k], ..., G[p - k]}) and max({F[k], ..., F[p - k], G[k], ..., G[p - k]}) respectively. Hence, the theorem is proven by induction on r. The following sequence of permutations on F and G illustrates an instance of Even_Merge.
1.2 Example Instance of Even_Merge
The original permutations on F and G are as follows. Let F = {00, 06, 15, 26, 31, 37, 43, 48, 54, 60} and G = {05, 11, 18, 20, 27, 29, 33, 45, 50, 58}.
(1)  F = {00), 05, 06, 15, 26, 31, 37, 43, 48, 54}, G = {11, 18, 20, 27, 29, 33, 45, 50, 58, (60}   F[0] = min(L), G[p - 1] = max(L)
(2)  F = {00, 05), 06, 15, 26, 29, 37, 43, 48, 54}, G = {11, 18, 20, 27, 31, 33, 45, 50, (58, 60}
(3)  F = {00, 05, 06), 15, 20, 27, 31, 33, 45, 50}, G = {11, 18, 26, 29, 37, 43, 48, (54, 58, 60}
(4)  F = {00, 05, 06, 11), 18, 26, 29, 33, 43, 48}, G = {15, 20, 27, 31, 37, 45, (50, 54, 58, 60}
(5)  F = {00, 05, 06, 11, 15), 20, 27, 31, 37, 45}, G = {18, 26, 29, 33, 43, (48, 50, 54, 58, 60}
(6)  F = {00, 05, 06, 11, 15, 18), 26, 29, 33, 43}, G = {20, 27, 31, 37, (45, 48, 50, 54, 58, 60}
(7)  F = {00, 05, 06, 11, 15, 18, 20), 27, 31, 37}, G = {26, 29, 33, (43, 45, 48, 50, 54, 58, 60}
(8)  F = {00, 05, 06, 11, 15, 18, 20, 26), 29, 33}, G = {27, 31, (37, 43, 45, 48, 50, 54, 58, 60}
(9)  F = {00, 05, 06, 11, 15, 18, 20, 26, 27), 31}, G = {29, (33, 37, 43, 45, 48, 50, 54, 58, 60}
(10) F = {00, 05, 06, 11, 15, 18, 20, 26, 27, 29}, G = {31, 33, 37, 43, 45, 48, 50, 54, 58, 60}
The algorithm requires F and G to be presorted. It would therefore, on this point, be suitable for creating well-sorted submatrices formed from adjacent pairs of rows or columns in a column, row and Right-to-Left (RL) diagonal sorted matrix. However, the time complexity of the algorithm on the number of comparisons is O(n^2) (see the section on analysis).
1.3 Bottom-up Even Merge-Sort
We use the algorithm iteratively or recursively to sort a list of N data values by doing lg(N) splits of the list of data values. The following algorithm uses Even_Merge to sort a linear list of N data values.
First, adjacent pairs of data values in the list to be sorted are sorted by Pair_Sort. Therefore, the first phase does N/2 comparisons. We define Pair_Sort as given below, where H is the array that contains the N data values to be sorted.
Pair_Sort(H, N)
START
   FOR j = 0 TO N - 2 STEP 2
      IF H[j] > H[j+1] SWAP(H[j], H[j+1]);
   NEXT j
END
We define the function Linear_Sort used for creating the set of partitions to be merged by Even_Merge.
Linear_Sort(H, N)
START
   FOR x = 2 TO lg(N)
      Even_Merge all pairs of adjacent sublists
   NEXT x
END
On the pass with counter value x, the list to be sorted is segmented into N/2^x sublists, and the FOR-NEXT loop implements Even_Merge on each adjacent pair of segments; the number of data values in each merged segment is 2^x. The algorithm appears to be of little value because of its O(N^2) time complexity. However, it is implemented on the n x n connected mesh to give a time complexity of O(sqrt(n)), which is optimum for the n x n connected mesh [17].
1.4 Complexity on the Uniprocessor System
For the purpose of simplicity, assume that N is a power of 2. During the rth pass of the outer FOR-NEXT loop in Linear_Sort, H is divided into N/2^r partitions, where 1 <= r <= x = log2(N). Each partition consists of 2 x 2^(r-1) presorted 1D arrays at the start of a sequence of merge operations on a partition. The inner FOR-NEXT loop makes Yr = N/2^(r+1) calls of Even_Merge for the pairs of sublists in the Yr partitions. Since there are 2^(r-1) elements in each row vector of a partition, Even_Merge does 2^r(2^(r-1) + 1)/2 comparisons on each partition. The total number of comparisons during the rth pass is Yr x 2^r x (2^(r-1) + 1)/2. The number of passes made by Linear_Sort on H is x. Therefore, the number of comparisons performed during an instance of Linear_Sort is
Sum_{r=1}^{x} Yr x 2^r (2^(r-1) + 1)/2
   = Sum_{r=1}^{x} (N/2^(r+1)) x 2^r (2^(r-1) + 1)/2
   = (N/4) Sum_{r=1}^{x} (2^(r-1) + 1)
   = (N/4)(2^x - 1 + x)
   = (N/4)(N + lg(N) - 1)
The number of comparisons is always the same regardless of the nature of the input permutation. The algorithm does the smallest number of data moves when the 2 × p data array is already sorted.
The algorithm was implemented in C/C++, compiled and run on a PC. Results from random integers are consistent with our analysis of the number of data moves and comparisons.
2 Sorting on Parallel Machines
We can implement Even_Merge as a set of primitive processes on a parallel system to give optimum performance on that system. As can be seen, the algorithm has a high degree of concurrency and is easily mapped onto existing parallel machines such as the MIMD mesh-connected system, the SIMD mesh-connected system or the multiprocessor orthogonal-access machine [15, 2]. Figure 1 illustrates the implementation of Even_Merge on a 2 x p mesh and, more generally, on the n x n mesh using the MIMD model of computation.
[Figure 1: Mesh-interconnected 2 x p processors that map the data structure used by Even_Merge. The top row holds F[0], ..., F[p - 1] and the bottom row holds G[0], ..., G[p - 1].]
Consider the 2 x p interconnection of processors illustrated in figure 1. We use F and G in figure 1 to represent subfiles F and G in Even_Merge as on the uniprocessor system. Each processor has no fewer than 3 registers and can execute the following instructions:

(i) Transmit a data value from a given register to any of its nearest neighbours.
(ii) Receive a data value from any of its nearest neighbours into a given register.
(iii) Compare the contents of any two registers and place the maximum and the minimum values into selected registers.

Figure 2 illustrates paths in the flow of data through any of the processors.
[Figure 2: Possible data flow paths through each processor in the 2 x p array of interconnected processors; all data paths are bidirectional.]
To see how we implement the equivalent of Even_Merge on the 2 x p mesh, consider the section of the 2 x p mesh illustrated in figure 1. We refer to the contents of register R1 in processors fi and gi as fi and gi respectively. Pairs of adjacent values in rows F and G are sorted as follows:
To describe how the interconnection is used to implement a sort in O(sqrt(n)) time, we define the following function on processors F[i] and G[i]. Transmit(j, k): transfer the contents of selected register j in a given processor to selected register k of a nearest-neighbour processor. In figure 3 we use the symbol Ri to mean a register, with the subscript indicating the actual register number in a processor.
Pair_Merge(F, G)
START
1: DO CONCURRENTLY
      Transmit fi (and gi) to R2 of fi+1 (gi+1)
      Transmit fi+1 (and gi+1) to R2 of fi (gi)
      Transmit fi+2 (and gi+2) to R2 of fi+3 (gi+3)
      Transmit fi+3 (and gi+3) to R2 of fi+2 (gi+2)
2: DO CONCURRENTLY
      R1 <- Min(R1, R2), R2 <- Max(R1, R2)   for fx, gx, where x = i, i + 2
      R1 <- Max(R1, R2), R2 <- Min(R1, R2)   for fy, gy, where y = i + 1, i + 3
END

Figure 3: Sequence of parallel operations that implements on the 2 x p mesh the equivalent of Pair_Sort on a uniprocessor system.
This point in the algorithm is equivalent to a sort on adjacent pairs of data values in a linear list, where (fi, fi+1), (gi, gi+1) and so on correspond to adjacent pairs in the linear list.
Even_Merge(F, G)
START
   DO CONCURRENTLY
1:    Transmit fx to R2 of gx, x = i, i + 1
      Transmit gx to R2 of fx, x = i, i + 1
2:    R1 <- Min(R1, R2), for fx, x = i, i + 1, ..., i + 3
      R2 <- Max(R1, R2), for fx, x = i, i + 1, ..., i + 3
      R1 <- Min(R1, R2), for gy, y = i, i + 1, ..., i + 3
      R2 <- Max(R1, R2), for gy, y = i, i + 1, ..., i + 3
3:    Transmit content of R2 in all fx to R2 of fx+1, x = i, i + 1, i + 2
      Transmit content of R2 in all gy to R2 of gy-1, y = i + 3, i + 2, i + 1
4:    Repeat 2: and 3: for x = i + 1, i + 3 and y = i, i + 2
   END DO CONCURRENTLY
5: Exchange (R1, R2) in fx for x = i + 2, i + 3; Exchange (R1, R2) in gy for y = i, i + 1
6: Repeat 1: and
   DO WHILE (x < i + 3)
      DO 2: and DO 3: for x = x to i + 2, y = y to i + 1; x = x + 1, y = y - 1
   END DO WHILE
END

Figure 4: Sequence of parallel operations that implements the equivalent of Even_Merge as implemented on a uniprocessor system.

At the end of 6:, the system of data values in R1 of the 2 x p array of processors forms a sorted 2 x p array of data values. Note that the values in R1 of fi and gi that form the adjacent pairs (fi, fi+1) and (gi, gi+1) contain Min(fi, fi+1) and Min(gi, gi+1) respectively at the end of Pair_Merge. Similarly, R1 of fi+1 and gi+1 stores Max(fi, fi+1) and Max(gi, gi+1) respectively at the end of Pair_Merge. The example instance given below clearly illustrates this.
If we do a row-major traversal through register R1 of each 2 x 2 array at the end of 4: in Even_Merge, we see that we have the equivalent of two row-major sorted 2 x 2 arrays. The next stage in the algorithm merges adjacent pairs of these 2 x 2 arrays to create pairs of row-major sorted 2 x 4 arrays. The example given below clearly illustrates this. Note also that at the end of 4: the pair that is adjacent to (fi, fi+1), i.e. (fi+2, fi+3), has an exact copy of the value in register R1 of the corresponding pair in G, i.e. (gi, gi+1). Therefore, to create a linear sorted list from the 2 x 2 array consisting of (fi, fi+1) and (gi, gi+1), we simply flick (exchange) the contents of registers (R1, R2) for the adjacent pairs (fi+2, fi+3) and (gi, gi+1). This is repeated during the instance of the 2 x 4 array. The example instance below clearly illustrates this.
Example Instance of Parallel Even_Merge Illustrated
Consider the following permutation that is to be sorted, where p = 4: F = (8, 5, 4, 2) in the top row and G = (6, 1, 7, 3) in the bottom row of the 2 x 4 mesh. Steps (1)-(11) of the original figure trace the (R1, R2) register contents of each processor through Pair_Merge, the successive Even_Merge stages and the intervening flicks of (R1, R2), ending with R1 holding the row-major sorted array F = (1, 2, 3, 4), G = (5, 6, 7, 8).

Figure 5: Example instance of Linear_Sort on a 2 x 4 connected mesh of processors that maps the data structure used by Even_Merge.

The following algorithm implements sorting of 2p data values on the 2 x p MIMD mesh-connected system. We set the variable k to represent the partitions of the p processors during an instance of Linear_Sort1; we suffix Linear_Sort with a 1 to distinguish it from the uniprocessor implementation. We also define the set of concurrent variables {imin} = {1, k + 1, ..., p - k}. To simplify our explanation, we assume p to be an exact power of 2.
Linear_Sort
START
   k <- 2; imin <- {1, k + 1, ..., p - k}
   Pair_Sort(F, G)   // all adjacent pairs in F and G are now sorted
1: DO CONCURRENTLY ON ALL i's IN ALL PARTITIONS
      Transmit fi to R2 of gi
      Transmit gi to R2 of fi
2: DO CONCURRENTLY ON ALL i's IN ALL PARTITIONS
      R1 <- Min(R1, R2), R2 <- Max(R1, R2) in fi
      R1 <- Min(R1, R2), R2 <- Max(R1, R2) in gi
3: DO CONCURRENTLY ON ALL PARTITIONS
      Transmit content of R2 in all fi to R2 of fi+1, for all i >= imin
      Transmit content of R2 in all gi to R2 of gi-1, for all i <= imin
4: DO CONCURRENTLY ON ALL i's IN ALL PARTITIONS
      x <- 1
      WHILE (x < k) DO
         REPEAT 2: and 3: in F for (i = imin + x; i < imin + k; i := i + 1)
         REPEAT 2: and 3: in G for (i = (imin + k) - x; i > imin; i := i - 1)
         x := x + 1
      END DO
5: k <- 2k; imin <- {1, k + 1, ..., p - k}
6: DO CONCURRENTLY ON ALL PARTITIONS
      WHILE (k <= p) DO 1:, 2:, 3:, 4:, 5: END DO
END

Figure 6: Implementation of Linear_Sort on a multiprocessor system.
Theorem 3  Linear_Sort1() requires time proportional to n/2 + lg(n/2) to sort n data values on the 2 x p mesh, where n = 2p.

Proof  Denote the time delay to transmit a data value between any two adjacent processors as Tt and the time used to exchange the contents of registers R1 and R2 as Te. Figure 7 illustrates the binary tree that partitions the data set during the merge-sort operation. We use this to illustrate the following proof of the theorem.
The data set is partitioned and merged through lg(p) steps. The sequence of partitions created at each step during the merge maps onto a binary tree of height x = log2(p): at level r of this binary tree there are 2^r data values in each partition to be sorted, and each partition is sorted in parallel.
Figure 7: The sequence of parallel partitions during the sort maps onto a binary tree.

The number of times the Transmit instruction is executed concurrently at level r is 2^(r-1) + 2. The extra 2 instructions are executed whenever a copy of the value in gi is sent to R2 of fi and a copy of the value in fi is sent to R2 of gi. Each Transmit instruction is followed by a compare and possibly an exchange operation. Therefore, the total time required to sort a partition at the rth level is (2^(r-1) + 2)Tt + 2^(r-1)Te. The height of the binary tree is x = lg(p). Therefore, summing over all r gives the total time:

T_Total = Sum_{r=1}^{x} {(2^(r-1) + 2)Tt + 2^(r-1)Te}
        = {(p - 1) + 2 lg(p)}Tt + (p - 1)Te
        = (1/2)n(Tt + Te) + 2Tt lg(n) - (3Tt + Te)

This ends the proof.
We see that the complexity of Linear_Sort1 on the number of comparisons, data transmissions and data exchanges is O(n). Using efficiency metrics as defined in [2], we see that the cost of this
system is O(n^2) and the efficiency is O(lg(n)/n). We optimise these values by adapting Linear_Sort1 to do sorting on the m x m mesh, where m = n^(1/2). We illustrate an interconnection configuration in figure 8.

We can connect f1 and gp to an external memory element such as a shift register, so that the p data values in F and G are shifted out simultaneously to external memory during the last phase of the algorithm. Hence, we can modify the system so that the cost of external data storage and retrieval is minimal. Using word-parallel modules for each processor, we see that the area complexity of the system is O(n lg(n)). Hence, this gives us an AT^2 complexity of O(n^3 lg(n)). We further improve the complexity to O(n^(5/2) lg(n)) as follows. Consider the interconnection of processors in figure 8.
[Figure 8: Interconnection and communication of processors in the m x n mesh; processor Pij is connected to its nearest neighbours in row i and column j.]

We assume that each processor has at least 3 internal registers and can interact with any of its nearest neighbours as illustrated by the interconnections in figure 8. Assume that each processor interacts with its nearest neighbour within the fixed time interval Tt. We do the equivalent of a column sort followed by a row and RL diagonal sort as illustrated in figure 9. Note that the time complexity of steps 1: and 2: is O(m).
START
1: DO Linear_Sort1() on all pairs of adjacent columns
2: DO Linear_Sort1() on all pairs of adjacent rows
3: DO CONCURRENTLY ON ALL Pij in Dij
      Transmit content of R2 in all P(i-1)j to R2 of P(i-1)(j+1), for all i, j
      Transmit content of R2 in all Pi(j+1) to R2 of Pij, for all i, j
4:    R1 <- Min(R1, R2), R2 <- Max(R1, R2) in Pij
      R1 <- Max(R1, R2), R2 <- Min(R1, R2) in P(i+1)j
END

Figure 9: Sequence of concurrent operations that implements Linear_Sort1 on the m x m connected mesh.
The last two steps in the algorithm correspond to an RL diagonal sort. The complexity of this step is the same as that of the first two steps. From this, we see that the overall time complexity for sorting the n data values is O(sqrt(n)). This is within the theoretical limit for sorting on the m x m connected mesh.
3 Conclusions
From the discussion, we see that Even_Merge is non-optimum when implemented on a uniprocessor system. However, we see marked improvement in the performance of the algorithm when it is implemented on a parallel system. The performance of the algorithm on a uniprocessor system can be improved by taking advantage of presorted runs of data values between F and G. To do this, we insert the following piece of code just before the nested FOR-NEXT loops in Even_Merge.
START
   WHILE (F[p-1] < G[j]) j := j - 1;
   WHILE (F[i] < G[0]) i := i + 1;
END
In our analysis we assumed that the number of data values to be sorted, n, is an exact power of 2. As with many other merge-sort algorithms, the algorithm can easily be modified in a similar manner so that it sorts a list of n data values for which n is not an exact power of 2.
It should be noted that whenever the algorithm is implemented on the mesh-connected system to do the equivalent of a column and row sort, the efficiency of the algorithm is not affected by the order in which the column and row sorts are implemented. In fact, a column sort in this case creates a column-major sort on adjacent pairs of columns and a row-major sort on adjacent pairs of rows. However, because of the nature of this parallel system and its limitations, it is not possible to take advantage of these end results on this system. It is possible to do so in the case of the MIMD PRAM or the orthogonal multiprocessor parallel computer systems.
References
[1] A. V. Aho, J. E. Hopcroft and J. D. Ullman. The Design and Analysis of Computer Algorithms. Addison-Wesley, 1974.
[2] Selim G. Akl. Parallel Sorting Algorithms. Academic Press Inc., 1985.
[3] G. H. Barnes. The ILLIAC Computer. IEEE Trans. Comp. 17, 1979, 746-757.
[4] H. Lynn Beus. The Use of Information in Sorting. Jour. ACM 17 (1970), 482-.
[5] Francis Y. Chin and K. Samson Fok. Fast Sorting Algorithms on Uniform Ladders (Multiple Shift Register Loops). IEEE Trans. Comp. 29, Feb. 1980, 618-631.
[6] Thomas H. Cormen, Charles E. Leiserson and Ronald L. Rivest. Introduction to Algorithms. MIT Press / McGraw-Hill.
[7] R. B. Dewar. A Stable Minimum Storage Algorithm. Inform. Process. Lett. 2 (1974), 162-164.
[8] Kai Hwang, Ping-Sheng Tseng and Dongseung Kim. An Orthogonal Multiprocessor for Parallel Scientific Computation. IEEE Trans. Comp. 38, #1, Jan. 1989, 47-61.
[9] Rainer Kemp. Fundamentals of the Average Case Analysis of Particular Algorithms. B. G. Teubner / John Wiley & Sons, 1984.
[10] D. E. Knuth. The Art of Computer Programming, Vol. 3. Addison-Wesley, Reading, Massachusetts.
[11] Glenn K. Manacher. The Ford-Johnson Sorting Algorithm Is Not Optimal. Jour. ACM 26, 1979, 441-456.
[12] Fredrick Manne and Tor Sorevik. Optimal Partitioning of a Sequence. Journ. of Algorithms, 1995.
[13] Heikki Mannila. Measures of Presortedness and Optimal Algorithms. IEEE Trans. Comp. 34 (1985), 318-325.
[14] D. E. Muller and F. P. Preparata. Bounds to Complexities of Networks for Sorting and for Switching. Journ. Assoc. Comp. Mach. 22, Apr. 1975, 195-201.
[15] Gregory J. E. Rawlins. An Introduction to the Analysis of Algorithms, pp. 230-285.
[16] Kazuhiro Sado and Yoshide Igarashi. Some Parallel Sorts on a Mesh-Connected Processor Array and Their Time Efficiency. Journ. Parall. & Distrib. Comp. 3 (1986), 398-410.
[17] Isaac D. Scherson and Sandeep Sen. Parallel Sorting in Two-Dimensional VLSI Models of Computation. IEEE Comp. Soc. Press (1993), 159-170.
[18] D. L. Shell. A High-Speed Sorting Procedure. Comm. ACM (Jul. 1959), 30-32.
[19] H. J. Siegel. A Model of SIMD Machines and Comparisons of Various Interconnection Networks. IEEE Trans. Comp. 28, Dec. 1979, 907-917.
[20] Clark D. Thompson. The VLSI Complexity of Sorting. IEEE Trans. Comp. 32, Dec. 1983.
[21] C. D. Thompson and H. T. Kung. Sorting on a Mesh-Connected Parallel Computer. Comm. ACM 20, Apr. 1977, 263-271.
[22] L. G. Valiant. Parallelism in Comparison Problems. SIAM Journal of Computing 4, #3 (1977), 348-355.