# An In-Place Merge Algorithm for Parallel Systems

Denham Coates-Evelyn
Department of Computer Science, King's College, The Strand, London WC2R 2LS, U.K.
denham@dcs.kcl.ac.uk
April 2001

## Abstract

An algorithm that does O(n²) comparisons and data moves to sort a list of n random data values on a uni-processor system is presented. We show a multi-processor implementation that has complexity O(√n) using the MIMD model on a mesh-connected system. The AT² complexity on this model is O(n^(5/2) lg(n)), which is sub-optimal by a factor of O(√n / lg(n)).

Key Terms: Parallel Processing, Inversion, Merge, Uni-processor/Multi-processor, AT² Complexity, Efficiency.

## INTRODUCTION

There are well-known algorithms that sort a sequence of N data values in place and in optimum time by doing a sequence of merge operations iteratively or recursively. In this case, the algorithm is described as a merge-sort algorithm. Ideally, we would like our merge algorithm to do the minimum possible number of compare and exchange operations for each element that is selected into its final sorted location. When merging two sorted lists of m and n data values in place, the algorithm presented here does on average m + n compare/exchange operations for each element selected into its final merged location. This gives us an O(N²) merge-sort algorithm on a uni-processor system. However, the algorithm gives AT²-sub-optimal performance when implemented on a MIMD mesh-connected system. We begin our discussion with its implementation on a uni-processor system.

## 1 Even Merge Algorithm

We define the two-dimensional (2-D) array of data values, in which each data location is referenced by a row and a column subscript, as A = (a_ij) where i = 0, 1, ..., m−1 and j = 0, 1, ..., n−1. Generally, let m ≤ n.

Denote a sort on all the rows of A and a sort on all the columns of A respectively as:

SORT(R_i)A = B, where R_i = (a_i1, a_i2, ..., a_in) and i = 0, 1, ..., m−1
SORT(C_j)A = D, where C_j = (a_1j, a_2j, ..., a_mj) and j = 0, 1, ..., n−1

Theorem 1 SORT(R_i)A = B followed by SORT(C_j)B = D ... (1), or SORT(C_j)A = D followed by SORT(R_i)D = B ... (2), retains the previous ordering of the row values created in B (or the previous ordering of the column values created in D). In other words, the rows (or columns) remain sorted when a row sort (or a column sort) is followed by a column (or row) sort on A.

Proof Now B = (b_ij), where i = 0, 1, ..., m−1, j = 0, 1, ..., n−1, and b_i1 < b_i2 < ... < b_in. Define the macro SORT() as a comparison-based linear sort; we use this macro to sort the columns and rows of the 2-D array. Consider the element b_ij ∈ B and the inversions in the neighbourhood of b_ij. If b_ij > b_(i+1)j, then SORT(C_j)B will initiate Swap(b_ij, b_(i+1)j). But if b_ij > b_(i+1)(j+1), then since b_ij < b_i(j+1) < b_i(j+2) < ... < b_in, we have b_i(j+1) > b_(i+1)(j+1), and therefore SORT(C_(j+1))B initiates Swap(b_i(j+1), b_(i+1)(j+1)).

On the other hand, if b_(i+1)j < b_i(j−1), then since b_(i+1)(j−1) < b_(i+1)j < b_i(j−1), SORT(C_(j−1))B initiates Swap(b_i(j−1), b_(i+1)(j−1)). The sequences of sorts performed on the individual columns or rows of A, B or D are independent of each other; therefore, the order in which these columns (or rows) are sorted does not affect the final ordering created by the sort. A sort on all columns to the right and to the left of column j removes any inversion in the rows that intersect column j caused by a sort on column j. Hence, the theorem is proven for the case of a row sort followed by a column sort on A.

It is not difficult to see the proof for the case of a column sort followed by a row sort: it follows by considering D = (d_ij) instead of B, with the subscript values used in the above proof interchanged. The column sorts are independent of each other, and therefore the order in which they are performed does not affect the final outcome. The same is true of the row sorts. Hence, sorting on all m rows (or n columns) can be done concurrently. The reader should not misunderstand this result to mean that row sorting followed by column sorting creates the same permutation on A as column sorting followed by row sorting; it can be shown that the two sequences of sort operations give distinct permutations on A.


### 1.1 Sorting the 2 × p Row Vector

In this case, we set m = 2 and n = p to represent two pre-sorted sequences. We then implement repetitive column and row operations on this 2 × p array until a single sorted sequence is obtained. We represent the two sorted sequences to be merged as F and G, as defined below. The algorithm takes the two sorted sequences F = {F[0], F[1], ..., F[p−1]}, where |F| = p, and G = {G[0], G[1], ..., G[q−1]}, where |G| = q, and merges them by iterative selection of the next maximum and minimum values among the two sub-lists. For simplicity we let |F| = |G| = p. We do the following pair of comparisons concurrently, starting with the index variables i = 0, j = p−1, h = 0, k = p−1:

(a) F[i] < G[h]?  and  (b) F[k] < G[j]?

We remove inversions between the pairs (F[i], G[h]) and (F[k], G[j]) by swapping where necessary, and then increment i (and h) or decrement j (and k) concurrently. The following algorithm selects F[0] as Min(F, G) and G[p−1] as Max(F, G).

```
Even_Merge(F, G, p)
START
  FOR i = 0 TO p/2
    IF F[i] > G[i] THEN SWAP(F[i], G[i])
  NEXT i
  FOR j = p-1 TO p/2
    IF F[j] > G[j] THEN SWAP(F[j], G[j])
  NEXT j
END
```

Note that the processing done by the two FOR-NEXT loops is independent; they can therefore be implemented as two concurrent processes on a multi-processor system. F and G together are equivalent to a two-row vector, and the algorithm does the equivalent of a column sort on this row vector. It is not difficult to see that Theorem 1 also applies to any m × n array. Hence, whenever we apply this algorithm to F and G, the 2 × p array created from F and G remains column and row sorted. The algorithm below is a second, more general implementation of Even Merge on a uni-processor system.

```
Even_Merge(F, G, p)
START
  FOR j = 0 TO p
    FOR i = 1 TO p-j
      IF F[i+j] > G[i] THEN SWAP(F[i+j], G[i])
    NEXT i
  NEXT j
END
```

The number of comparisons done by the algorithm to create a row-major sort of the 2 × p array created from F and G is

    Σ_{j=1}^{p} j = (1/2) p(p + 1)

Theorem 2 Define the sequence S = {(F[0], G[0]), (F[1], G[1]), ..., (F[p−1], G[p−1])}. Even Merge selects (F[r], G[p−r]) as the next (r−1)-th minimum and maximum in S at the end of the (r−1)-th iteration of the outermost FOR-NEXT loop.

Proof Since F and G have the ordering < defined on them, we have min(F) = F[0], max(F) = F[p−1], min(G) = G[0], max(G) = G[p−1]. When execution of the inner FOR-NEXT loop terminates we have min(S) = F[0] and max(S) = G[p−1]. Since F[0] < G[0], G[p−1] > F[p−1], and the 2 × p array remains column and row sorted, the next time the outer FOR-NEXT loop is executed we have min(S − F[0]) = F[1] and max(S − G[p−1]) = G[p−2]. Assume the theorem to be true for r = k. We know that the algorithm does comparisons and data exchanges so as to create a column and row sort between the sub-sets of data values {F[k], F[k+1], ..., F[p−(k+1)]} and {G[k], G[k+1], ..., G[p−(k+1)]}. The algorithm does not compare paired values (F[j−1], G[p−(j−1)]) for which j ≤ k or j ≥ p−k. Hence, we see that at the end of the (k+1)-th iteration the outermost FOR-NEXT loop selects F[k] and G[p−k] as min({F[k], ..., F[p−k], G[k], ..., G[p−k]}) and max({F[k], ..., F[p−k], G[k], ..., G[p−k]}) respectively. Hence, the theorem is proven by induction on r. The following sequence of permutations on F and G illustrates an instance of Even Merge.

### 1.2 Example Instance of Even Merge

The original permutations of F and G are as follows:

F = {00, 06, 15, 26, 31, 37, 43, 48, 54, 60} and G = {05, 11, 18, 20, 27, 29, 33, 45, 50, 58}


(The marker ')' in F closes the prefix already selected into its final position; the marker '(' in G opens the finalised suffix.)

```
(1)  F = {00), 05, 06, 15, 26, 31, 37, 43, 48, 54}, G = {11, 18, 20, 27, 29, 33, 45, 50, 58, (60}   F[0] = min(L), G[p−1] = max(L)
(2)  F = {00, 05), 06, 15, 26, 29, 37, 43, 48, 54}, G = {11, 18, 20, 27, 31, 33, 45, 50, (58, 60}
(3)  F = {00, 05, 06), 15, 20, 27, 31, 33, 45, 50}, G = {11, 18, 26, 29, 37, 43, 48, (54, 58, 60}
(4)  F = {00, 05, 06, 11), 18, 26, 29, 33, 43, 48}, G = {15, 20, 27, 31, 37, 45, (50, 54, 58, 60}
(5)  F = {00, 05, 06, 11, 15), 20, 27, 31, 37, 45}, G = {18, 26, 29, 33, 43, (48, 50, 54, 58, 60}
(6)  F = {00, 05, 06, 11, 15, 18), 26, 29, 33, 43}, G = {20, 27, 31, 37, (45, 48, 50, 54, 58, 60}
(7)  F = {00, 05, 06, 11, 15, 18, 20), 27, 31, 37}, G = {26, 29, 33, (43, 45, 48, 50, 54, 58, 60}
(8)  F = {00, 05, 06, 11, 15, 18, 20, 26), 29, 33}, G = {27, 31, (37, 43, 45, 48, 50, 54, 58, 60}
(9)  F = {00, 05, 06, 11, 15, 18, 20, 26, 27), 31}, G = {29, (33, 37, 43, 45, 48, 50, 54, 58, 60}
(10) F = {00, 05, 06, 11, 15, 18, 20, 26, 27, 29}, G = {31, 33, 37, 43, 45, 48, 50, 54, 58, 60}
```

The algorithm requires F and G to be pre-sorted. It would therefore be suitable for creating well-sorted sub-matrices formed from adjacent pairs of rows or columns in a column-, row- and right-to-left (RL) diagonal-sorted matrix. However, the time complexity of the algorithm in the number of comparisons is O(n²) (see the section on analysis).

### 1.3 Bottom-up Even Merge-Sort

We use the algorithm iteratively or recursively to sort a list of N data values by doing lg(N) splits of the list. The following algorithm uses Even Merge to sort a linear list of N data values.

First, adjacent pairs of data values in the list to be sorted are sorted by Pair_Sort; the first phase therefore does N/2 comparisons. We define Pair_Sort as given below, where H is the array that contains the N data values to be sorted.

```
Pair_Sort(H, N)
Start
  FOR (j = 0; j < N; j += 2)
    IF H[j] > H[j+1] THEN Swap(H[j], H[j+1])
  NEXT j
End
```

We define the function Linear_Sort used for creating the set of partitions to be merged by Even Merge.

```
Linear_Sort(k, N)
  FOR x = 2 TO lg(N)
    Even_Merge all pairs of adjacent sub-lists
  NEXT x
```

The list to be sorted is segmented on each pass of the loop, and the FOR-NEXT loop implements Even Merge on each adjacent pair of these segments; after pass x the number of data values in each segment is 2^x. The algorithm appears to be of little value because of its O(N²) time complexity. However, it can be implemented on the n × n connected mesh to give a time complexity of O(√n), which is optimal for the n × n connected mesh [17].

### 1.4 Complexity on the Uni-processor System

For simplicity, assume that N is a power of 2. During the r-th pass of the outer FOR-NEXT loop in Linear_Sort, H is divided into N/2^r partitions, where 1 ≤ r ≤ x = log2(N). Each partition consists of a 2 × 2^(r−1) pair of pre-sorted 1-D arrays at the start of a sequence of merge operations on that partition. The inner FOR-NEXT loop makes Y_r = N/2^(r+1) calls of Even Merge, one for each pair of sub-lists in the Y_r partitions. Since there are 2^r elements in each row vector of a partition, Even Merge does 2^r(2^(r−1) + 1)/2 comparisons on each partition. The total number of comparisons during the r-th pass is therefore Y_r × 2^r(2^(r−1) + 1)/2, and the number of passes made by Linear_Sort on H is x. Hence the number of comparisons performed during an instance of Linear_Sort is

    Σ_{j=1}^{x} Y_j × 2^j (2^(j−1) + 1)/2
      = Σ_{j=1}^{x} (N/8)(2^j + 2)
      = (N/8)(2^(x+1) − 2 + 2x)
      = (N/4)(N + lg(N) − 1)

The number of comparisons is always the same regardless of the nature of the input permutation. The algorithm does the smallest number of data moves when the 2 × p data array is already sorted.

The algorithm was implemented in C/C++, compiled and run on a PC. Results for random integers are consistent with our analysis of the number of data moves and comparisons.

## 2 Sorting on Parallel Machines

We can implement Even Merge as a set of primitive processes on a parallel system to give optimum performance on that system. As can be seen, the algorithm has a high degree of concurrency and is easily mapped onto existing parallel machines such as the MIMD mesh-connected system, the SIMD mesh-connected system or the multiprocessor orthogonal-access machine [15, 2]. Figure 1 illustrates the implementation of Even Merge on a 2 × p mesh and, more generally, on the n × n mesh using the MIMD model of computation.

```
F:  F[0]  ...  F[i]  F[i+1]  F[i+2]  ...  F[p-1]
     |          |      |       |            |
G:  G[0]  ...  G[i]  G[i+1]  G[i+2]  ...  G[p-1]
```

Figure 1: Mesh-interconnected 2 × p processors that map the data structure used by Even Merge.

Consider the 2 × p interconnection of processors illustrated in figure 1. We use F and G in figure 1 to represent the sub-files F and G in Even Merge, as on the uni-processor system. Each processor has no fewer than 3 registers and can execute the following instructions:

(i) Transmit a data value from a given register to any of its nearest neighbours.
(ii) Receive a data value from any of its nearest neighbours into a given register.
(iii) Compare the contents of any two registers and place the maximum and the minimum values into selected registers.

Figure 2 illustrates the paths in the flow of data through any of the processors.

Figure 2: Possible data-flow paths through each processor in the 2 × p array of interconnected processors (all data paths are bi-directional).


To see how we implement the equivalent of Even Merge on the 2 × p mesh, consider the section of the 2 × p mesh illustrated in figure 1. We refer to the content of register R1 in processor f_i or g_i simply as f_i and g_i respectively. Pairs of adjacent values in rows F and G are sorted as follows.

To describe how the interconnection is used to implement a sort in O(√n) time, we define the following function on processors F[i] and G[i]. Transmit(j, k): transfer the content of selected register j in a given processor to selected register k of a nearest-neighbour processor. In figure 3 we use the symbol R_i to denote a register, with the subscript indicating the register number within a processor.

```
Pair_Merge(F, G)
Start
1: DO CONCURRENTLY
     Transmit f_i     (and g_i)     to R2 of f_(i+1) (g_(i+1))
     Transmit f_(i+1) (and g_(i+1)) to R2 of f_i     (g_i)
     Transmit f_(i+2) (and g_(i+2)) to R2 of f_(i+3) (g_(i+3))
     Transmit f_(i+3) (and g_(i+3)) to R2 of f_(i+2) (g_(i+2))
2: DO CONCURRENTLY
     R1 <- Min(R1, R2), R2 <- Max(R1, R2)   for f_x, g_x where x = i, i+2
     R1 <- Max(R1, R2), R2 <- Min(R1, R2)   for f_y, g_y where y = i+1, i+3
End
```

Figure 3: Sequence of parallel operations that implements the equivalent of Pair-Merge on a uni-processor system.

This point in the algorithm is equivalent to a sort on adjacent pairs of data values in a linear list, where (f_i, f_(i+1)), (g_i, g_(i+1)) and so on correspond to adjacent pairs in the linear list.


```
Even_Merge(F, G)
Start: DO CONCURRENTLY
1: Transmit f_x to R2 of g_x, x = i, i+1
   Transmit g_x to R2 of f_x, x = i, i+1
2: R1 <- Min(R1, R2), R2 <- Max(R1, R2)   for f_x, x = i, i+1, ..., i+3
   R1 <- Min(R1, R2), R2 <- Max(R1, R2)   for g_y, y = i, i+1, ..., i+3
3: Transmit content of R2 in all f_x to R2 of f_(x+1), x = i, i+1, i+2
   Transmit content of R2 in all g_y to R2 of g_(y-1), y = i+3, i+2, i+1
4: Repeat 2: and 3: for x = i+1, i+3 and y = i, i+2
END DO CONCURRENTLY
5: Exchange (R1, R2) in f_x for x = i+2, i+3; Exchange (R1, R2) in g_y for y = i, i+1
6: Repeat 1: and
   DO WHILE (x < i+3)
     DO 2: and DO 3: for x = x to i+2; y = y to i+1; x = x+1, y = y-1
   END DO WHILE
```

Figure 4: Sequence of parallel operations that implements the equivalent of Even Merge as implemented on a uni-processor system.

At the end of 6:, the system of data values in R1 of the 2 × p array of processors forms a sorted 2 × p array of data values. Note that the values in R1 of f_i and g_i that form the adjacent pairs (f_i, f_(i+1)) and (g_i, g_(i+1)) contain Min(f_i, f_(i+1)) and Min(g_i, g_(i+1)) respectively at the end of Pair_Merge. Similarly, R1 of f_(i+1) and g_(i+1) stores Max(f_i, f_(i+1)) and Max(g_i, g_(i+1)) respectively at the end of Pair_Merge. The example instance given below clearly illustrates this.

If we do a row-major traversal through register R1 of each 2 × 2 array at the end of 4: in Even_Merge, we see that we have the equivalent of two row-major sorted 2 × 2 arrays. The next stage in the algorithm merges adjacent pairs of these 2 × 2 arrays to create pairs of row-major sorted 2 × 4 arrays. The example given below illustrates this. Note also that at the end of 4: the pair that is adjacent to (f_i, f_(i+1)), i.e. (f_(i+2), f_(i+3)), has an exact copy of the values in register R1 of the corresponding pair in G, i.e. (g_i, g_(i+1)). Therefore, to create a linearly sorted list from the 2 × 2 array consisting of (f_i, f_(i+1)) and (g_i, g_(i+1)), we simply flick the contents of registers (R1, R2) for the adjacent pairs (f_(i+2), f_(i+3)) and (g_i, g_(i+1)). This is repeated during the instance of the 2 × 4 array. The example instance below illustrates this.

### Example Instance of Parallel Even-Merge Illustrated

Consider the following permutation that is to be sorted, where p = 4. The values are sorted in the 2 × 4 mesh as illustrated by the following:

[Figure 5, steps (1)–(11): the trace shows the (R1, R2) register contents of the 2 × 4 mesh after each Pair_Merge, Even_Merge, transmit and flick phase. The initial rows are F = {8, 5, 4, 2} and G = {6, 1, 7, 3}; the final R1 contents are the row-major sorted rows {1, 2, 3, 4} and {5, 6, 7, 8}.]

Figure 5: Example instance of Linear Sort on a 2 × 4 connected mesh of processors that maps the data structure used by Even Merge.

The following algorithm implements sorting of 2p data values on the 2 × p MIMD mesh-connected system in general. We set the variable k to represent the partitions of the p processors during an instance of Linear_Sort1; we suffix Linear Sort with a 1 to distinguish it from the uni-processor implementation. We also define the set of concurrent variables {i_min} = {1, k+1, ..., p−k}. To simplify our explanation, we assume p to be an exact power of 2.

```
Linear_Sort1(F, G, p)
    k <- 2; i_min <- {1, k+1, ..., p-k}
    Pair_Sort(F, G)        // all adjacent pairs in F and G are now sorted
1:  DO CONCURRENTLY ON ALL i's IN ALL PARTITIONS
        Transmit f_i to R2 of g_i
        Transmit g_i to R2 of f_i
2:  DO CONCURRENTLY ON ALL i's IN ALL PARTITIONS
        R1 <- Min(R1, R2), R2 <- Max(R1, R2)   in f_i
        R1 <- Min(R1, R2), R2 <- Max(R1, R2)   in g_i
3:  DO CONCURRENTLY ON ALL PARTITIONS
        Transmit content of R2 in all f_i to R2 of f_(i+1), for all i >= i_min
        Transmit content of R2 in all g_i to R2 of g_(i-1), for all i <= i_min
4:  DO CONCURRENTLY ON ALL i's IN ALL PARTITIONS
        x <- 1
        WHILE (x <= k) DO
            REPEAT 2: and 3: in F for (i = i_min + x; i < i_min + k; i := i + 1)
            REPEAT 2: and 3: in G for (i = (i_min + k) - x; i > i_min; i := i - 1)
            x := x + 1
        END DO
5:  k <- 2k; i_min <- {1, k+1, ..., p-k}
6:  WHILE (k <= p) DO CONCURRENTLY ON ALL PARTITIONS
        REPEAT 1:, 2:, 3:, 4:, 5:
    END DO
```

Figure 6: Implementation of Linear Sort on a multi-processor system.

Theorem 3 Linear_Sort1() requires time proportional to n/2 + lg(n/2) to sort n data values on the 2 × p mesh, where n = 2p.

Proof Denote the time delay to transmit a data value between any two adjacent processors as Tt, and the time used to exchange the contents of registers R1 and R2 as Te. Figure 7 illustrates the binary tree that partitions the data set during the merge-sort operation. We use this to illustrate the following proof of the theorem.

The data set is partitioned and merged through lg(p) steps. The sequence of partitions created at each step during the merge maps onto a binary tree. At level r of this binary tree, there are 2^r data values in each partition to be sorted. Each partition is sorted in parallel.

[Figure 7 shows this binary tree over the p processor columns 1, 2, ..., p, with levels r = 1, 2, ..., x = log2(p).]

Figure 7: The sequence of parallel partitions during the sort maps onto a binary tree.

The number of times the Transmit instruction is executed concurrently at level r is 2^(r−1) + 2. The extra 2 instructions are executed whenever a copy of the value in g_i is sent to R2 of f_i and a copy of the value in f_i is sent to R2 of g_i. Each Transmit instruction is followed by a compare and possibly an exchange operation. Therefore, the total time required to sort a partition at the r-th level is (2^(r−1) + 2)Tt + 2^(r−1)Te. The height of the binary tree is x = lg(p). Therefore, summing over all r gives the total time:

    T_Total = Σ_{r=1}^{x} {(2^(r−1) + 2)Tt + 2^(r−1)Te}
            = {(p − 1) + 2 lg(p)}Tt + (p − 1)Te
            = (1/2)n(Tt + Te) + 2Tt lg(n) − (3Tt + Te)

This ends the proof.

We see that the complexity of Linear_Sort1 in the number of comparisons, data transmissions and data exchanges is O(n). Using the efficiency metrics defined in [2], we see that the cost of this system is O(n²) and the efficiency is O(lg(n)/n). We optimise these values by adapting Linear_Sort1 to do sorting on the m × m mesh, where m = n^(1/2). We illustrate an interconnection configuration in figure 8.

We can connect f_1 and g_p to an external memory element such as a shift register so that the p data values in F and G are shifted out simultaneously to external memory during the last phase of the algorithm. Hence, we can modify the system so that the cost of external data storage and retrieval is minimal. Using word-parallel modules for each processor, we see that the area complexity of the system is O(n lg(n)). Hence, this gives us an AT² complexity of O(n³ lg(n)). We further improve the complexity to O(n^(5/2) lg(n)) as follows. Consider the interconnection of processors in figure 8.

```
P_11 ... P_1j ... P_1n
  .        .        .
P_i1 ... P_ij ... P_in
  .        .        .
P_m1 ... P_mj ... P_mn
```

Figure 8: Interconnection and communication of processors in the m × n mesh.

We assume that each processor has at least 3 internal registers and can interact with any of its nearest neighbours as illustrated by the interconnections in figure 8. Assume that each processor interacts with its nearest neighbour within the fixed time interval Tt. We do the equivalent of a column sort followed by a row and RL-diagonal sort as illustrated in figure 9. Note that the time complexity of steps 1: and 2: is O(m).

```
START
1: DO Linear_Sort1() on all pairs of adjacent columns
2: DO Linear_Sort1() on all pairs of adjacent rows
3: DO CONCURRENTLY ON ALL P_ij in D_ij
     Transmit content of R2 in all P_(i-1)j   to R2 of P_(i-1)(j+1), for all i, j
     Transmit content of R2 in all P_i(j+1)   to R2 of P_ij,         for all i, j
4: R1 <- Min(R1, R2), R2 <- Max(R1, R2)   in P_ij
   R1 <- Max(R1, R2), R2 <- Min(R1, R2)   in P_(i+1)j
END
```

Figure 9: Sequence of concurrent operations that implements Linear_Sort1 on the m × m connected mesh.

The last two steps in the algorithm correspond to an RL-diagonal sort. The complexity of this step is the same as that of the first two steps. From this, we see that the overall time complexity for sorting the n data values is O(√n). This is within the theoretical limit for sorting on the mesh.


## 3 Conclusions

From the discussion, we see that Even Merge is non-optimal when implemented on a uni-processor system. However, we see a marked improvement in the performance of the algorithm when it is implemented on a parallel system. The performance of the algorithm on a uni-processor system can be improved by taking advantage of pre-sorted runs of data values between F and G. To do this, we insert the following piece of code just before the nested FOR-NEXT loop in Even Merge.

```
Start
  While (F[p-1] < G[j]) j--;
  While (F[i] < G[0]) i++;
End
```

In our analysis we assume that the number of data values to be sorted, n, is an exact power of 2. As with many other merge-sort algorithms, the algorithm can easily be modified in a similar manner so that it sorts a list of n data values for which n is not an exact power of 2.

It should be noted that whenever the algorithm is implemented on the mesh-connected system to do the equivalent of a column and row sort, the efficiency of the algorithm is not affected by the order in which the column and row sorts are implemented. In fact, a column sort in this case creates a column-major sort on adjacent pairs of columns, and a row sort creates a row-major sort on adjacent pairs of rows. However, because of the nature of this parallel system and its limitations, it is not possible to take advantage of these end results on this system. It is possible to do so in the case of the MIMD PRAM or the orthogonal multi-processor parallel computer systems.


## References

[1] A. V. Aho, J. E. Hopcroft and J. D. Ullman. The Design and Analysis of Computer Algorithms. Addison-Wesley, 1974.
[2] Selim G. Akl. Parallel Sorting Algorithms. Academic Press, 1985.
[3] G. H. Barnes. The ILLIAC Computer. IEEE Trans. Comput. 17 (1979), 746–757.
[4] H. Lynn Beus. The Use of Information in Sorting. J. ACM 17 (1970), 482–.
[5] Francis Y. Chin and K. Samson Fok. Fast Sorting Algorithms on Uniform Ladders (Multiple Shift Register Loops). IEEE Trans. Comput. 29 (Feb. 1980), 618–631.
[6] Thomas H. Cormen, Charles E. Leiserson and Ronald L. Rivest. Introduction to Algorithms. MIT Press/McGraw-Hill.
[7] R. B. Dewar. A Stable Minimum Storage Sorting Algorithm. Inform. Process. Lett. 2 (1974), 162–164.
[8] Kai Hwang, Ping-Sheng Tseng and Dongseung Kim. An Orthogonal Multiprocessor for Parallel Scientific Computation. IEEE Trans. Comput. 38, no. 1 (Jan. 1989), 47–61.
[9] Rainer Kemp. Fundamentals of the Average Case Analysis of Particular Algorithms. B. G. Teubner/John Wiley & Sons, 1984.
[10] D. E. Knuth. The Art of Computer Programming, Vol. 3. Addison-Wesley, Reading, Massachusetts.
[11] Glenn K. Manacher. The Ford–Johnson Sorting Algorithm Is Not Optimal. J. ACM 26 (1979), 441–456.
[12] Fredrick Manne and Tor Sorevik. Optimal Partitioning of a Sequence. J. Algorithms (1995).
[13] Heikki Mannila. Measures of Presortedness and Optimal Sorting Algorithms. IEEE Trans. Comput. 34 (1985), 318–325.
[14] D. E. Muller and F. P. Preparata. Bounds to Complexities of Networks for Sorting and for Switching. J. ACM 22 (Apr. 1975), 195–201.
[15] Gregory J. E. Rawlins. An Introduction to the Analysis of Algorithms, pp. 230–285.
[16] Kazuhiro Sado and Yoshide Igarashi. Some Parallel Sorts on a Mesh-Connected Processor Array and Their Time Efficiency. J. Parallel Distrib. Comput. 3 (1986), 398–410.
[17] Isaac D. Scherson and Sandeep Sen. Parallel Sorting in Two-Dimensional VLSI Models of Computation. IEEE Comp. Soc. Press (1993), 159–170.
[18] D. L. Shell. A High-Speed Sorting Procedure. Comm. ACM 2 (Jul. 1959), 30–32.
[19] H. J. Siegel. A Model of SIMD Machines and a Comparison of Various Interconnection Networks. IEEE Trans. Comput. 28 (Dec. 1979), 907–917.
[20] Clark D. Thompson. The VLSI Complexity of Sorting. IEEE Trans. Comput. 32 (Dec. 1983).
[21] C. D. Thompson and H. T. Kung. Sorting on a Mesh-Connected Parallel Computer. Comm. ACM 20 (Apr. 1977), 263–271.
[22] L. G. Valiant. Parallelism in Comparison Problems. SIAM J. Comput. 4, no. 3 (1975), 348–355.
