Branch Anticipation using Loop Pipelining for Resource Constrained Systems
Ted Zhihong Yu, Nelson Luiz Passos, Edwin Hsing-Mean Sha

Dept. of Computer Science & Engineering, University of Notre Dame, Notre Dame, IN 46556

Abstract
Loop pipelining is an effective technique to explore the parallelism found in program loops when designing high-performance application-specific systems. However, branches within loops may degrade the performance of pipelined architectures. It is an open question how applications with multiple branches can be loop pipelined in the most efficient way. This paper presents properties, theorems, and experiments of a new loop pipelining algorithm, called Branch Anticipation. This new method, based on the rotation technique, carries out conditional resource sharing and reduces the additional hardware requirements incurred by branches within the loops. The optimization may require propagation of additional branch decision signals along the schedule. We show that hardware constraints for such propagation must be considered during the scheduling process. We further demonstrate that the method is practical and achieves the same schedule length as regular rotation scheduling with a smaller number of rotations and minimal additional hardware.


1 Introduction
Computation-intensive applications usually depend on time-critical sections consisting of a loop of instructions. The schedule of the operations comprising the loop can be optimized in order to improve its execution time, by exploring the parallelism embedded in the repetitive patterns of the loop. This new schedule is then applied in the design of application-specific integrated circuits (ASICs). However, the use of pipelined architectures may present a significant overhead if there are conditional statements in the loop. In this paper, we analyze the propagation of the outcome of a condition evaluation along the schedule. A method, called branch anticipation, which reduces additional hardware requirements by using loop pipelining, is then proposed. While considering resource constraints, our new technique also examines the consequences of propagating the branch decision such that no modifications in the hardware will be required. Within certain hardware constraints, it is expected that pipeline stalls caused by branches within loops can be eliminated by anticipating the next branch decision with 100% accuracy.

Studies concerning conditional branches have proposed several different solutions. In [20], a reordering of the instructions is suggested in order to fill the pipeline stall slots. Branch prediction is another way to eliminate pipeline stalls, by using branch history or just a guess to predict the outcome of a condition test. However, these methods do not consider the extra hardware requirement for the conditions under control. In Instruction Level Parallelism research, branch predication [11, 12, 13, 14, 15] has been proposed in order to completely remove conditional branches via conditional execution of individual instructions. This method usually requires a significant amount of effort in redesigning instruction formats and is restricted to small numbers of instructions following the branch. In this paper, the new schedule is generated based on the available hardware, without changes in the target architecture.
Loop pipelining can explore parallelism across iterations for cyclic applications [19, 21, 22, 23]. Two major algorithmic approaches for handling cyclic Data Flow Graphs (DFGs), where nodes represent computations and edges represent dependencies between the nodes, are discussed in the literature. One is to schedule the loop body represented by a Directed Acyclic Graph (DAG) [25]. However, this method cannot exploit pipelining across loop boundaries. The second method involves unrolling or unfolding the whole loop, resulting in a huge acyclic DFG. This approach, nevertheless, is applicable only to loops with a small number of iterations, which must be known in advance [26] to avoid an exponential solution time. Software pipelining [19] is an approach used to overlap instructions by moving them across iteration boundaries. This and most of the previous research in loop pipelining for loops with cyclic dependences produced methods which are only applicable to problems that do not include conditional statements [1, 4, 5, 7, 9]. A polynomial-time rate-optimal scheduling method that does not consider resource constraints or communication overhead for cyclic data-flow graphs was proposed by Chao and Sha [16]. This paper uses the retiming technique, a graph transformation in which delays are moved around the graph, to characterize the effect of loop pipelining [6]. The retiming technique is implicitly executed by the branch anticipation scheduling method. This algorithm improves an initial resource-constrained schedule by incrementally applying the rotation technique [1, 2], which explores an overlapping pattern of the loop. A rotation traditionally corresponds to incrementing by one the retiming function applied to a node, an equivalent way of moving the iteration boundaries in an unrolled schedule down by one control step, such that a node from the next iteration is brought into the current iteration, while trying to reschedule that node to the earliest possible control step. Previous research on resource sharing has also proposed scheduling techniques for loops containing conditional blocks [10, 17]. These results, however, do not consider the hardware requirements needed to propagate the branch decisions along the final schedule.

This paper focuses on the application of the rotation technique to eliminate control hazards, reduce the schedule length and satisfy resource constraints, including those imposed by the propagation of boolean conditions. Loops are represented by cyclic data flow graphs with nodes representing branch decisions or regular operations. Such graphs are called Conditional Data Flow Graphs (CDFGs). Figure 1 (a) presents a simple example of a Conditional Data Flow Graph. Nodes 1 and 5 represent conditions (also known as fork nodes [10]), while nodes 2, 3, 4, 6, 7, 8 and 11 are regular operations. Nodes 9 and 10 are dummy nodes, representing join nodes corresponding to nodes 5 and 1, respectively. An example of inter-iteration dependencies is represented by the edge coming from node 11 to node 1 in figure 1 (a). Two bar-lines, termed delays, over this edge indicate the number of iterations separating the two inter-iteration dependent nodes [24]. This feedback edge represents the loop-carried dependency between those two nodes. On the other hand, intra-iteration dependencies relate dependent nodes in the same iteration and are represented as edges with no delays in the graph. For simplicity, we assume that every node in this graph takes one time unit to finish its execution, as indicated by the parenthesized number beside each node. Figure 2 depicts the schedule table presented as multiple copies of an iteration in order to show its repetitive context. The initial static schedule for the execution of this CDFG is presented in figure 2 (a), where nodes 6 and 8 share the same ALU because they are mutually exclusive in the execution path. This initial schedule was computed by the conditional resource sharing algorithm proposed in [17]. Figure 1 (b) shows the retimed graph after pushing one delay through node 1. By applying a regular rotation to the initial schedule table, the result would be the schedule presented in figure 2 (b), where "Pro1" indexes the first control step in the prologue thus formed, and vertical arrows illustrate how long the condition results should be kept.

Figure 1: Example of a conditional data flow graph (a) before retiming (b) after retiming, with r(1) = 1 and r(u) = 0 for u ≠ 1

Figure 2: (a) Initial schedule (b) Schedule after rotating node 1 (c) Branch anticipation schedule

After node 1 is rescheduled, or re-mapped, to its new position in the table, it is represented by the notation 1' in a box. This is feasible because node 1' comes from the next iteration, which does not have any dependency with any node in the current iteration. Correspondingly, in figure 1 (b), one delay from the incoming edge of node 1 is removed and moved to each of its outgoing edges. In other words, the current iteration boundary is shifted by one control step so that node 1' from the next iteration participates with the other nodes in the current iteration. However, the new condition 1', introduced by the above rotation, might cause a conflict because it implies that extra hardware must be considered in the rescheduling process in order to store the decision information. Nonetheless, the application of Branch Anticipation Scheduling leads to the schedule in figure 2 (c), which does not demand extra hardware to memorize the condition decision. Using our branch anticipation algorithm, such potential conflicts are analyzed and conditions are overlapped only if the required hardware is available. Figure 7 (c) shows that our algorithm can achieve the same schedule length as the regular rotation scheduling with a smaller number of rotations and minimum additional hardware. Our approach is especially effective when there is a large number of nested conditional constructs in the CDFGs, because of the intensive utilization of conditional resource sharing. The capability of anticipating branch decision outcomes by rotating and rescheduling fork nodes may also reduce hardware pipeline stalls to a great extent.

In order to show how such results are obtained, the next section establishes some of the basic concepts used in this paper. Section 3 defines the branch anticipation scheduling technique and presents a heuristic method for solving the scheduling problem. Some examples based on existing benchmarks are discussed in section 4 to show the effectiveness of the algorithm. A final section summarizes the theory and experimental data presented.

2 Basic Principles
In this section we present some basic concepts and definitions for the interpretation and modeling of loops with conditional statements. We begin by defining a conditional data flow graph. A conditional data flow graph (CDFG) G = (V, E, d, t, k) is a node-weighted and edge-weighted directed graph, where V represents the set of operations, E ⊆ V × V represents the set of dependence edges, d is a function from E to Z, representing the number of delays between two adjacent nodes, t is a function from V to the positive integers, representing the computation time of each node, and k is a function from V to the set of types, e.g., {fork, join, alu, mult, div}, representing whether the node is a conditional statement, the join point for a corresponding fork, or just one of the regular operations. An example of a Conditional DFG was shown earlier in figure 1 (a). The parenthesized number beside a node represents the computation time t for that node. In this simple example, all nodes carry out unit-time computations. Join nodes 9 and 10 are used to build our model and do not perform any actual operation.

An iteration is the execution of each node in V exactly once. Inter-iteration dependences are represented by weighted edges. A control step consists of one time unit, which is equivalent to the synchronization interval of the system. A static schedule of a loop is repeatedly executed for every iteration. A static schedule must obey the precedence relations defined by the subgraph of the CDFG which consists only of edges without delays. A legal CDFG must have no zero- or negative-delay cycle, i.e., the summation of delays along any cycle must be greater than 0. For any iteration j, an edge e from u to v with delay d(e) conveys that the computation of node v at iteration j depends on the execution of node u at iteration j - d(e). An edge with zero delay represents a data dependence within the same iteration (computational block).
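For concreteness, the following is a minimal sketch of one way such a CDFG could be represented; the Python class and field names are our own illustration and are not prescribed by the model.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple

@dataclass
class CDFG:
    """G = (V, E, d, t, k): a node- and edge-weighted directed graph."""
    nodes: List[int] = field(default_factory=list)                   # V
    edges: List[Tuple[int, int]] = field(default_factory=list)       # E, a subset of V x V
    delay: Dict[Tuple[int, int], int] = field(default_factory=dict)  # d: E -> Z (number of delays)
    time: Dict[int, int] = field(default_factory=dict)               # t: V -> positive integers
    kind: Dict[int, str] = field(default_factory=dict)               # k: V -> {fork, join, alu, mult, div}

# A fragment of the graph in figure 1 (a): node 1 is a fork node and the
# feedback edge 11 -> 1 carries two delays.
g = CDFG()
for v, knd in [(1, "fork"), (2, "alu"), (3, "alu"), (11, "alu")]:
    g.nodes.append(v)
    g.time[v] = 1            # every node is unit-time in the example
    g.kind[v] = knd
g.edges += [(1, 2), (1, 3), (11, 1)]
g.delay.update({(1, 2): 0, (1, 3): 0, (11, 1): 2})
```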

The notation u → v, labeled by e, means that e is an edge from node u to node v; the notation u ⇝ v, labeled by p, means that p is a path from u to v. The delay of a path p = v0 → v1 → ... → v_k, with edges e0, e1, ..., e_{k-1}, is d(p) = Σ_{i=0}^{k-1} d(e_i), and the total computation time of the path is t(p) = Σ_{i=0}^{k} t(v_i).
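A small helper, reusing the illustrative CDFG sketch above, makes these two quantities explicit; with the full graph of figure 1 (a) encoded, the zero-delay path 1 → 3 → 5 → 7 → 8 → 11 would give d(p) = 0 and t(p) = 6, which matches the cycle period discussed in the next subsection.

```python
def path_weights(g, path):
    """Return (d(p), t(p)) for a path given as a list of node ids."""
    d = sum(g.delay[(u, v)] for u, v in zip(path, path[1:]))   # sum of edge delays along p
    t = sum(g.time[v] for v in path)                           # sum of node computation times
    return d, t
```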

2.1 Retiming a Data Flow Graph
The period during which all computation nodes in an iteration are executed, according to the existing data dependences and without resource constraints, is called the cycle period. The cycle period C(G) of a CDFG G = (V, E, d, t, k) is the maximum computational time among all paths that have no delays, each of which is called a critical path. For example, the CDFG in figure 1 (a) has C(G) = 6, which can be measured through the path p = 1 → 3 → 5 → 7 → 8 → 11. The cycle period of a CDFG is the length of its static schedule without resource constraints.

The retiming technique was first introduced to reduce the cycle period of circuits [6]. This method attempts to evenly rearrange the registers (delays) in a circuit so that the cycle period gets smaller. A retiming r is a function from V to Z that redistributes the nodes in the iteration space created by replication of the original CDFG G. A new CDFG G_r = (V, E, d_r, t, k) is generated, where d_r(e) = d(e) + r(u) - r(v) for every edge e from u to v, and each iteration still has one execution of each node in G. The retiming function r(u) of a node u ∈ V represents the offset between the original iteration containing u and the one after retiming. The delays d(e) are changed accordingly to preserve dependences, i.e., r(u) represents the number of delay components subtracted from each of the incoming edges w → u and pushed into each of the outgoing edges u → v, where u, v, w ∈ V. After retiming, the execution of node u in iteration i is moved to iteration i - r(u). For example, figure 1 (b) shows the CDFG from figure 1 (a) retimed by the function r whose value is shown at the bottom of the figure. The critical path of this graph includes the edges 3 → 5 → 7 → 8 → 11, with an execution time of 5 time units. In retiming, negative delays represent fetching data from future iterations and cannot be allowed. After applying a legal retiming r on a CDFG G, we obtain a nonnegative d_r(e) for every edge e ∈ E.

A prologue is the set of instructions that are moved to the beginning of the execution, and that must be executed to provide the initial data for the iterative process, just like the row labeled "Pro1" in figures 2 (b) and (c). An epilogue is a complementary set of instructions which must be executed to complete the iterative process. Considering that the entire problem consists of a large number of iterations, the time required to run the prologue and epilogue is negligible.
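The retiming transformation and its legality test can be stated compactly. The sketch below follows the definition d_r(e) = d(e) + r(u) - r(v) used above and continues the illustrative graph fragment built earlier; the function names are ours.

```python
def retime(g, r):
    """New delay function d_r(e) = d(e) + r(u) - r(v) for every edge e = (u, v);
    nodes missing from r are treated as r(u) = 0 (normalized retiming)."""
    return {(u, v): g.delay[(u, v)] + r.get(u, 0) - r.get(v, 0) for (u, v) in g.edges}

def is_legal_retiming(g, r):
    """A retiming is legal iff every retimed edge delay is nonnegative."""
    return all(d >= 0 for d in retime(g, r).values())

# The retiming of figure 1 (b): r(1) = 1 and r(u) = 0 otherwise.  On the graph
# fragment above, the feedback edge (11, 1) keeps 2 + 0 - 1 = 1 delay while the
# outgoing edges of node 1 each gain one delay, as described in the text.
r = {1: 1}
assert is_legal_retiming(g, r)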

2.2 Rotation Scheduling
The down-rotation of a set X ⊆ V pushes one delay from each of the incoming edges of X into each of its outgoing edges, transforming the CDFG G into G_X. As illustrated in figure 1 (b), the node set {1}, a source of the original DAG, becomes a sink in the new DAG; thus it is intuitively rotated down. The operation of up-rotation on a set X transforms G into G_{-X} by pushing one delay in the reverse direction. In this paper, we will only use down-rotations. We say that a set X is down-rotatable if retiming X is a legal retiming for G. In other words, a set X is down-rotatable only when every edge from V - X to X contains at least one delay. For example, in figure 1 (b), the sets {1}, {2, 4} and {3, 5} are down-rotatable sets, but {4}, {5} and {3, 4} are not. The composite of two retimings is (r1 ∘ r2)(v) = r1(v) + r2(v). The composite of a sequence of down-rotations is the composite of the retimings of the down-rotation sets. In the realm of our research, we consider only normalized retiming functions, i.e., those which have min_v r(v) = 0. This means that not every node has been rotated during the scheduling process. The advantage of associating retiming functions with rotations is that the precedence constraints of a retimed graph, captured by d_r, can be easily determined from the original CDFG and the retiming r. Therefore, after each rotation, the scheduler does not need to construct a new retimed graph in order to find a new schedule, and hence much computation time can be saved.

The loop, represented by a CDFG, is executed in pipeline if the execution periods of several iterations are overlapped. A static schedule describes a loop pipeline at its stable state. The minimum cycle period of a loop pipeline is also called the minimum initiation interval. If the nodes in a static schedule are from p different iterations, there are p pipeline stages. We call such a p the depth of the loop pipeline. Take the retimed CDFG in figure 1 (b) for example: there are two stages in the pipeline; the set of nodes with r(v) = 1, i.e. {1}, is in the first stage, and the set of nodes with r(v) = 0, i.e. {2, 3, 4, 5, 6, 7, 8, 11}, is in the second stage. The depth of a loop pipeline represented by a retiming function r is 1 + max_v r(v) - min_v r(v). Since in our experiments we only use normalized retiming functions, this formula becomes 1 + max_v r(v). Clearly, a retiming r with a smaller max_v r(v) is favorable because the loop pipeline has a shallower depth.

In figure 2 (b), we encounter the case of such a down-rotation. After the node set X1 = {1} has been rotated down, it is pushed up to its earliest possible control step, where it is represented by 1', according to the precedence constraints of the DAG of G_{X1}. Intuitively, when a down-rotatable set is rotated down, each copy of the set of nodes is pushed up by one iteration to a location in the previous iteration, and the first copy of the set is pushed into the prologue. We will use the notation u' to represent the first rotation of u, u'' for the second rotation of u, etc. Down-rotations can be implemented in an efficient way: only the nodes in X and the nodes connected to X are involved in computing new priority values in list scheduling and precedences. Assume s is a static schedule of length k with range [1, k]. Let S_i be the set of nodes scheduled in the first i control steps. We know that S_i is a down-rotatable set for every i in [1, k]. The number of control steps rotated down in one down-rotation is termed the size of the down-rotation.
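A sketch of the down-rotatability test and the pipeline-depth formula, under the same illustrative representation used above:

```python
def is_down_rotatable(g, X):
    """X is down-rotatable iff every edge entering X from V - X carries at least
    one delay, i.e. rotating X down is a legal retiming of G."""
    X = set(X)
    return all(g.delay[(u, v)] >= 1 for (u, v) in g.edges if u not in X and v in X)

def pipeline_depth(g, r):
    """Depth of the loop pipeline described by a retiming r: 1 + max r(v) - min r(v)."""
    vals = [r.get(v, 0) for v in g.nodes]
    return 1 + max(vals) - min(vals)

# In figure 1 (b), {1}, {2, 4} and {3, 5} are down-rotatable while {4}, {5} and
# {3, 4} are not; the retiming r(1) = 1, r(u) = 0 otherwise yields a depth of 2.
```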

2.3 Conditional branch decision propagation
For data flow graphs without conditional blocks, the execution of each control step depends solely on the available functional units, including their types, respective numbers, etc., and the computation time of each node. If we are given pipelined functional units, the respective depths of the pipelines and the number of machine cycles that would be wasted due to pipeline stalls must also be taken into account. With conditional data flow graphs, however, there is one more factor that we must consider, namely, the extra cost of propagating branch evaluation results along the schedule. Each such signal must, at least, cover the nodes enclosed within the conditional block led by the fork node where this conditional bit is derived. For example, in figure 2 (a), the conditional branch decision signal for node 1 in the prologue is used to drive the execution of nodes in control steps 1 through 4. As shown in figure 2 (b), the rescheduled node 1' introduces one more branch evaluation signal, which will be used to drive the execution of nodes in the next iteration. Therefore this signal must also be propagated in the current iteration, together with that for node 1, so that it can reach the nodes in the next iteration.

With the above consideration, we see that there is a cost associated with rotating fork nodes down and rescheduling them. Likewise, in high-level synthesis, after the allocation and binding steps are performed, this conditional signal is also used to enable the functional unit that performs the operation, the registers, the multiplexers, and the bus drivers. This implies a similar goal of satisfying a given set of hardware cost / execution time constraints. In our approach, we make use of the available hardware to resolve potential conflicts. This resolution demands anticipation in rescheduling that consumes hardware control signals, which we refer to as Branch Anticipation Bits (babits). In figure 2 (b) it is clear that three babits are necessary to accommodate the complete iteration, while in figure 2 (c) we only need two babits, since the evaluation of condition 1' emerges at the last control step of the current iteration. One important assumption we adopt here is that once a boolean result accomplishes its task, the babit occupied by it becomes reusable by another fork node.

3 Branch Anticipation
3.1 Foundations
In this subsection we lay the groundwork for our Branch Anticipation algorithm. We do not show the loop-carried dependencies in the CDFGs we investigate, but it should be kept in mind that such dependencies exist in those CDFGs. Considering that any computation node in the loop is repeated at every iteration, it is important to identify which instance of that operation is being handled by our algorithm. This instance will be characterized by the function rank defined below.

Definition 3.1 Given an initial, fully unrolled static schedule s for a CDFG G = (V, E, d, t, k), the rank of a node u ∈ V, denoted by rank(u), is the distance, in number of iterations, between the iteration in which u originally resides and iteration 0.

When using the rotation technique to optimize the schedule, this measure is quite straightforward: we can calculate rank(u) by adding the number of times that node u has been rotated to the index of the iteration in which node u currently resides. For example, in iteration 0, rank(u) = 0, rank(u') is 1, where u' represents the first rotation of u, rank(u'') is 2, etc. With this metric, we can easily determine the index of the iteration in which node u initially resides. From now on, we will denote a node u having rank r by u^(r), where r is greater than or equal to 0. If we use u alone, we are referring to node u in the corresponding DAG. Nodes that were not rotated are called native nodes, as defined next.

Definition 3.2 A native node within iteration r is a node that has rank r.

This definition describes the nodes within an iteration that have not been rotated in the current static schedule. The following proposition describes an invariant property of rotated schedules.

Proposition 3.1 Given a CDFG G = (V, E, d, t, k), for every edge e = (u, v) ∈ E with d(e) = 0, u^(r) must precede v^(r) after any number of rotations.

Proposition 3.1 holds because nodes with the same rank r must obey the precedence relations inherent in the DAG obtained from G, even if some of them were rotated and rescheduled upward to their current positions. This orderliness property is the basis for our choice of a rotation size larger than 1 in our branch anticipation algorithm. Since the consumption of the babits available in hardware is essential, we quantify the effective period of time during which an individual babit serves a specific fork node with the following measure.

Definition 3.3 Given a static schedule s for a CDFG G = (V, E, d, t, k) and a fork node u^(r), u ∈ V, the function Lifetime(u^(r)) is the interval [beg, end], where beg is the control step where the decision signal for u^(r) is first derived and end is the control step where the last node in u's scope with rank r resides.

The term u's scope refers, of course, to the set of nodes controlled by u, i.e. all nodes between u and u's join node, exclusive, in the corresponding DAG. As the name implies, Lifetime represents the shortest period of time for which the branch decision signal carried by a certain babit must be preserved. We can find an illustration of this concept in figures 2 (b) and (c). For iteration 0, the fork node with maximum rank is 1', whose Lifetime is depicted by the rightmost vertical arrow. Fork node 1 in the prologue extends its Lifetime into iteration 0 at control step 4, as shown by the leftmost vertical arrow. Thus, we further observe a property concerning "implicit" branch decision signals within an iteration for fork nodes not explicitly appearing in the iteration, just like the one for node 1.
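Assuming the unrolled schedule is available as a mapping from node instances (node, rank) to control steps and that the scope of each fork node is known, Lifetime can be read off directly; the sketch below is illustrative only.

```python
def lifetime(schedule, scope, fork, rank):
    """Lifetime(fork^(rank)) = [beg, end] of definition 3.3.

    schedule : dict mapping (node, rank) -> control step in the unrolled schedule
    scope    : dict mapping a fork node -> set of nodes between it and its join
               node (exclusive) in the corresponding DAG
    """
    beg = schedule[(fork, rank)]                          # step where the decision is derived
    end = max(schedule[(v, rank)] for v in scope[fork])   # last controlled node with the same rank
    return beg, end
```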

Theorem 3.1 Given a static schedule s for a CDFG G = (V, E, d, t, k) whose single-iteration length is l, and a fork node u^(r) (u ∈ V) in iteration i, such that r is greater than i, then all branch decision signals of u^(r'), for any i ≤ r' < r, must be propagated for the period during which Lifetime(u^(r')) spans over iteration i, i.e., [i*l+1, (i+1)*l] ∩ Lifetime(u^(r')).

Proof: It is obvious that the decision signal for u^(r) will be kept beyond iteration i toward its native iteration. We first investigate iteration i. We only need to show that the branch decision signal of a node u^(r') which does not appear in iteration i must also be sustained. This is true since u^(r') is not the final rotation for node u; therefore u^(r') must appear in a previous iteration or in the prologue, and its decision signal is carried through the period Lifetime(u^(r')) ∩ [i*l+1, (i+1)*l].

We complete the proof by considering the iterations successive to iteration i, up to iteration r. Without loss of generality, we may choose iteration i+t, where t is greater than 0 and i+t is no greater than r. The cell holding u^(r) in iteration i now holds u^(r+t) in iteration i+t, but since iteration i+t is at farthest the native iteration of u^(r), the branch decision signal for u^(r) must be propagated through its Lifetime in this iteration. By comparing the relationship between r and r+t in iteration i+t with that between r' and r in iteration i, we finish the proof. □

Theorem 3.1 shows that the demand for babits may be greater than the total number of fork nodes in a specific iteration, because the rotations applied may have introduced implicit branch decision signals. This prompts us to reduce the maximum value of the retiming function r of fork nodes in an iteration, in order to squeeze out the implicit branch decision signals by rotating more than one control step at a time, if possible. Another benefit of doing so is that we obtain a loop pipeline with shallow depth at the same time. To keep track of how many babits are being consumed, we associate a babit counter with each control step in the static schedule. The following concept of saturation of babits leads us to identify the critical control step that the next rotation of a fork node should consider.
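The babit counters mentioned above can be derived from the lifetimes of all fork-node instances. A minimal sketch, assuming lifetimes are given in unrolled control steps and that the static schedule simply repeats with period l, follows; the modulo folding is what accounts for the implicit signals of theorem 3.1.

```python
def babit_counts(l, lifetimes):
    """Steady-state babit demand at each of the l control steps of the static schedule.

    l         : length of one iteration in control steps
    lifetimes : iterable of (beg, end) intervals, one per fork-node instance,
                expressed in unrolled (1-based) control steps
    """
    counts = [0] * l
    for beg, end in lifetimes:
        for step in range(beg, end + 1):
            counts[(step - 1) % l] += 1   # fold the unrolled step back into the iteration
    return counts
```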

Definition 3.4 A saturated control step of a specific schedule is a control step with the number of consumed babits equal to the total number of babits available in hardware.

Definition 3.5 The critical control step is the last control step with the maximum value of the babit count in the current schedule.

Definition 3.6 A saturated schedule is one which contains at least one fork node in the first control step, and whose critical control step is saturated and located in the final control step of the schedule.

The length of a static schedule may be shortened by making use of more functional units, by increasing the number of babits available, or both. When the number of babits in hardware is fixed, saturation of babits is the main factor affecting the length of the final static schedule. This situation occurs if the total number of babits in hardware is not large enough to accommodate all possible rotations under conventional resource constraints. When we try to rotate down a control step containing at least one fork node, regardless of its rank, we can only reschedule this fork node after the existing saturated critical control step, because the total number of babits available would otherwise be exceeded. We show the importance of saturation of the critical control step by stating the following theorem.

Theorem 3.2 For a saturated static schedule s, the length of s cannot be shortened.
Proof: Let us prove this theorem by contradiction. If the length of s were to be shortened by one control step, then at least one fork node would have to be rotated down and rescheduled among unused resources. However, even if such a fork node is allocated in the last control step of the new schedule s', one extra babit would be required, since the last control step of s is already saturated. Thus the hardware provisions are exceeded. □

We would normally consider adding functional units as a method for relaxing resource constraints, which in the extreme might give us the shortest possible schedule length via rotation scheduling. However, theorem 3.1 runs counter to this expectation under additional control signal restrictions. The task before a system designer is therefore to strive for a balance between configurable functional units and the number of babits a compiler/synthesizer is allowed to make use of, so that an optimal rotation can still be carried out under babit constraints.
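Definitions 3.4–3.6 and theorem 3.2 translate into a simple test on the per-step babit counts; the sketch below is illustrative and assumes the counts were obtained as in the earlier sketch.

```python
def critical_control_step(counts):
    """Last control step carrying the maximum babit count (definition 3.5), 1-based."""
    best = max(counts)
    return max(i + 1 for i, c in enumerate(counts) if c == best)

def is_saturated_schedule(counts, babits_available, fork_in_first_step):
    """Definition 3.6: a fork node sits in the first control step, and the critical
    control step is saturated and is the final control step of the schedule."""
    cs = critical_control_step(counts)
    return (fork_in_first_step
            and counts[cs - 1] == babits_available
            and cs == len(counts))

# By theorem 3.2, once is_saturated_schedule(...) holds, no further rotation can
# shorten the static schedule, regardless of how many functional units are added.
```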

3.2 Algorithm
In this subsection we describe the main algorithm used to implement branch anticipation and point out its practical advantages. First, the priority-based conditional resource sharing algorithm is described for the initial scheduling of G [10, 17]. We say that the height of a node u is the length of the longest path from u to a leaf node residing in the same connected component as u. The conditional flag of u is a sequence of two-bit flags, one for each of the fork nodes in G, with (10)2 tagging a false condition, (11)2 tagging a true condition and (0X)2 tagging irrelevance with respect to the fork node, where X represents a "don't care" bit. The conditional depth of u is the depth of the nested conditional blocks in which node u resides. The height, conditional flag (cf, in binary form) and conditional depth (cdepth) values for some of the nodes in figure 1 are shown in table 1. As indicated in the table, a computation node whose descendant is a join node has the same height value as the join node, because the join node will not appear in the static schedule. It is obvious that the execution of two nodes with the same height is independent of each other, because they do not have any direct or indirect dependence relationship.

Table 1: Conditional resource sharing basics

Figure 3: Initial scheduling algorithm

Our initial scheduling algorithm is depicted in figure 3. A two-bit usage flag, the counterpart of the conditional flag of a node, is assigned to each functional unit in order to realize conditional resource sharing. As we can see in figure 2, nodes 6 and 8 share one ALU because they have complementary values of the conditional flag for fork node 5, namely (11)2 for node 6 and (10)2 for node 8. The ALAP step of u is computed with regard to u's height. For example, nodes 2 and 4 are placed near the bottom of the initial schedule because they have smaller height values than nodes 3 and 5. The intuition behind height-based scheduling is that more resources will be available in the beginning steps of the non-pipelined schedule. Therefore, the successive rotations, especially those with size larger than one, are likely to succeed for the pipelined schedule. In order to maximize the chance of conditional resource sharing, we carry out a global compaction operation such that the resultant schedule is expected to have more available resources than the one before compaction. This algorithm is illustrated in figure 4.
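The two-bit flag encoding makes the mutual-exclusion test used for conditional resource sharing a simple comparison; a minimal sketch, with illustrative names, follows.

```python
def mutually_exclusive(cf_a, cf_b):
    """Two nodes may share a functional unit in the same control step if, for some
    fork node, both flags have their high bit set (the node depends on that
    condition) but with complementary outcomes: (11) for true versus (10) for false.

    cf_a, cf_b : dict mapping a fork node -> '11', '10' or '0X'
    """
    for fork in set(cf_a) & set(cf_b):
        a, b = cf_a[fork], cf_b[fork]
        if a[0] == '1' and b[0] == '1' and a != b:
            return True
    return False

# Nodes 6 and 8 of figure 1 carry flags (11) and (10) for fork node 5, so they
# are mutually exclusive and may share one ALU, as in figure 2 (a).
flags_6 = {5: '11'}   # node 6: true branch of fork node 5 (other forks omitted)
flags_8 = {5: '10'}   # node 8: false branch of fork node 5
assert mutually_exclusive(flags_6, flags_8)
```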

Figure 4: Global Compaction algorithm

As we pointed out in the previous subsection, our branch anticipation scheduling is a saturation-directed rotation, in that we always check whether the critical control step is saturated. This leads to a deterministic scheduling of the rotated fork node, which greatly expedites the rotation process. The algorithm is formulated as shown in figure 5. In figure 5, we use the notation cs to represent a control step. When the length of the rescheduled schedule Snew is sz control steps shorter than the length of S, we say Snew is rescheduleable. This aggressive criterion aims at shortening the length of S in as few rotations as possible. The Compact Point here serves as a guide to the prospective location of the starting control step of the loop pipeline. However, if such an attempt fails, we backtrack and continue with a smaller number of control steps. When at last we cannot rotate any further control step down and shorten the length of S by one, we obtain the final static schedule. In the next section, we present more CDFGs and analyze the experimental results in detail.
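Since the algorithm figure is not reproduced here, the following pseudocode-style sketch merely paraphrases the saturation-directed rotation loop described in the text. The helper rotate_and_reschedule (rotate the first sz control steps down, reschedule them after the saturated critical control step, and return the new schedule or None on failure) is an assumption of this sketch, not a routine defined by the paper.

```python
def branch_anticipation(schedule, max_rotation_size, babits_available, rotate_and_reschedule):
    """Sketch of the saturation-directed rotation loop.

    schedule is assumed to be a list of control steps; an attempt of size sz is
    accepted only if it shortens the schedule by sz control steps (the
    'rescheduleable' criterion), otherwise we backtrack to a smaller size."""
    while True:
        improved = False
        for sz in range(max_rotation_size, 0, -1):          # aggressive sizes first, then backtrack
            candidate = rotate_and_reschedule(schedule, sz, babits_available)
            if candidate is not None and len(candidate) <= len(schedule) - sz:
                schedule, improved = candidate, True
                break
        if not improved:
            return schedule                                  # final static schedule
```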

4 Experiments
Figure 6 is adapted from [17] and serves as a template for explaining how the above algorithms work. The loop-carried dependencies are not shown in the figure; from the original example, we assume that such dependencies have at least 4 delays. The initial non-pipelined schedule is shown in figure 7 (a), where we see that at most three babits are required. The column headed "babits" shows the number of babits currently being consumed at each control step.

Figure 5: Branch Anticipation scheduling algorithm


Figure 6: A Conditional Data Flow Graph from [17]

The effect of Global Compaction is directly observed in figure 7 (b), where the multiplication nodes 8 and 11 share the multiplier at control step 5 because they are mutually exclusive. This compaction is especially beneficial to successive rotations in that it leaves more available functional units to hold prospective rotated nodes. Figure 7 (c) shows the final Branch Anticipation schedule under the babit constraint. Since the critical control step in figure 7 (b) is saturated, the rotated fork nodes, 14' and 3', can only be scheduled after this control step, namely at control step 5 in figure 7 (c). As the name implies, Branch Anticipation scheduling tries to generate branch decision signals for fork nodes in iterations prior to the iteration where most of the nodes in their scopes reside. In this situation, the control signals for fork nodes 14', 3' and 4' are available at least one control step earlier than most of the computation nodes controlled by them. In this way, we can apply this method to eliminate hardware pipeline stalls by anticipating branch decisions with 100% accuracy. We see that this schedule cannot be shortened even if we are allowed to consume more babits, as shown in figure 10, which is due to the lack of an unused multiplier. We have a loop pipeline of depth 2, compared with depth 4 in [17].

Figure 7: Schedules for figure 6 (a) Initial schedule (b) Schedule after Global Compaction (c) Branch Anticipation schedule

Figure 8: Schedule under 6 babits for figure 6

This shallow pipeline depth allows us more flexibility, because fewer delays are required to carry out rotations and fewer babits are used, according to theorem 3.1. Also noticeable is that we obtain the final schedule after just three down-rotations. Thus, we reach the same schedule length as conventional rotation scheduling, six control steps in this case, and consume the minimum branch decision propagation hardware, three babits. We can greatly reduce the length of the static schedule by adding more functional units and consuming more babits, as shown in figure 8, where the schedule length becomes 3 with a pipeline depth of only 4. This extreme instance demonstrates the power of Branch Anticipation and shows that the need for babits is not trivial. We present our scheduling results in figure 10. The third CDFG, shown in figure 9, is based on the example from [10] and distinguishes addition and multiplication operations.

Figure 9: A CDFG adapted from [10]

In our transformation, we use ALUs to perform addition, subtraction and comparison. The main difference between figure 9 and the other examples is that it has two independent execution threads. The "resource" category gives the numbers of available ALUs, multipliers and babits, respectively. The "schedule" category shows statistics for the schedules, i.e. the initial length, the final length, the improvement ratio ((initial length - final length) / initial length, in %), the depth of the loop pipeline, and the number of rotations needed to reach the final schedule. From these results, we can verify some of the basic properties of Branch Anticipation Scheduling. Fundamentally, the number of individual functional units and the number of babits available are equally important in determining the final length of the static schedule. Nevertheless, in most of the resulting schedules, we use relatively more babits than functional units, and this trend becomes more prominent when there are many fork nodes in the CDFG. The demand for babits would require chip designers to include abundant control signal carriers to support efficient static schedules derived from Branch Anticipation when dealing with complicated CDFGs that have more than 10 fork nodes. The last schedule for figure 9 is included as a reference for this notion, although it has a deep pipeline depth, which makes the short schedule length of two control steps less productive than it seems.


Figure 10: Schedules for various CDFGs

5 Conclusion
We have presented in this paper a new scheduling algorithm for data flow graphs with conditional branches. Our Branch Anticipation approach is computationally more efficient than approaches without such a methodology. Also, our approach is particularly effective when there is a large number of nested conditional constructs in the CDFGs, because we adopt conditional resource sharing intensively. The ability to anticipate branch decision outcomes, by rotating and rescheduling fork nodes into iterations earlier than their native ones, may reduce hardware pipeline stalls to a great extent. However, both the theory and our experiments show that such a transformation leads to more demand for available babits in hardware, resulting in a large number of conditional signals for the execution of the operations. To obtain a high-quality design, trade-offs between the scheduling result and the cost of additional hardware should be investigated. In this respect, the saturation-directed rotation provides a way to measure and to control such trade-offs. As subjects of future research, our algorithm should take into account the sharing of interconnections among mutually exclusive data transfers in conditional branches, and further quantify the impact of the distribution of babits on branch anticipation and on the utilization of existing functional units.


References
[1] L.-F. Chao, A. LaPaugh, and E. H.-M. Sha, "Rotation Scheduling: A Loop Pipelining Algorithm," Proc. 30th ACM/IEEE Design Automation Conference, Dallas, TX, June 1993, pp. 566-572.
[2] L.-F. Chao and E. H.-M. Sha, "Static Schedulings of Uniform Nested Loops," Proceedings of the 7th International Parallel Processing Symposium, Newport Beach, CA, April 1993, pp. 1421-1424.
[3] M. R. Garey and D. S. Johnson, Computers and Intractability: A Guide to the Theory of NP-Completeness, San Francisco, CA: W. H. Freeman, 1979.
[4] G. Goosens, J. Wandewalle, and H. de Man, "Loop Optimization in Register Transfer Scheduling for DSP Systems," Proc. ACM/IEEE Design Automation Conference, 1989, pp. 826-831.
[5] T.-F. Lee, A. C.-H. Wu, D. D. Gajski, and Y.-L. Lin, "An Effective Methodology for Functional Pipelining," Proc. of the International Conference on Computer Aided Design, December 1992, pp. 230-233.
[6] C. E. Leiserson and J. B. Saxe, "Retiming Synchronous Circuitry," Algorithmica, 6, 1991, pp. 5-35.
[7] R. Potasman, J. Lis, A. Nicolau, and D. Gajski, "Percolation Based Scheduling," Proc. ACM/IEEE Design Automation Conference, 1990, pp. 444-449.
[8] R. Tarjan, Data Structures and Network Algorithms, SIAM, Philadelphia, Pennsylvania, 1983.
[9] C.-Y. Wang and K. K. Parhi, "High Level DSP Synthesis Using the MARS Design System," Proc. of the International Symposium on Circuits and Systems, 1992, pp. 164-167.
[10] T. Kim, N. Yonezawa, W. S. Liu, and C. L. Liu, "A Scheduling Algorithm for Conditional Resource Sharing - A Hierarchical Reduction Approach," IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, Vol. 13, No. 4, April 1994, pp. 425-438.
[11] J. Vogel and B. Holmer, "Analysis of the Conditional Skip Instructions of the HP Precision Architecture," 27th International Symposium on Microarchitecture, Nov. 1994, pp. 207-216.
[12] M. Schlansker, V. Kathail, and S. Anik, "Height Reduction of Control Recurrences for ILP Processors," 27th International Symposium on Microarchitecture, Nov. 1994, pp. 40-51.
[13] S. Mahlke, R. E. Hank, R. A. Bringmann, and J. C. Gyllenhaal, "Characterizing the Impact of Predicated Execution on Branch Prediction," 27th International Symposium on Microarchitecture, Nov. 1994, pp. 217-227.
[14] T. M. Conte, B. A. Patel, and J. S. Cox, "Using Branch Handling Hardware to Support Profile-Driven Optimization," 27th International Symposium on Microarchitecture, Nov. 1994, pp. 12-21.
[15] G. S. Tyson, "The Effects of Predicated Execution on Branch Prediction," 27th International Symposium on Microarchitecture, Nov. 1994, pp. 196-206.
[16] L.-F. Chao and E. Sha, "Static Scheduling for Synthesis of DSP Algorithms on Various Models," Journal of VLSI Signal Processing, Vol. 10, 1995, pp. 207-223.
[17] J. Siddhiwala and L.-F. Chao, "Scheduling Conditional Data-Flow Graphs with Resource Sharing," Fifth Great Lakes Symposium on VLSI, 1995, pp. 94-97.
[18] D. J. Lilja, "Exploiting the Parallelism Available in Loops," IEEE Computer, Feb. 1994, pp. 13-26.
[19] M. Lam, "Software Pipelining: An Effective Scheduling Technique for VLIW Machines," ACM SIGPLAN '88 Conference on Programming Language Design and Implementation, June 1988, pp. 318-328.
[20] J. L. Hennessy and D. A. Patterson, Computer Architecture: A Quantitative Approach, 2nd ed., Morgan Kaufmann, San Mateo, California, 1995.
[21] A. Aiken and A. Nicolau, "Optimal Loop Parallelization," Proceedings of the ACM SIGPLAN '88 Conference on Programming Language Design and Implementation, June 1988, pp. 308-317.
[22] B. Su, S. Ding, J. Wang, and J. Xia, "GURPR - A Method for Global Software Pipelining," Proceedings of the 20th Annual Workshop on Microprogramming, Dec. 1987, pp. 88-96.
[23] B. Su, S. Ding, and J. Xia, "URPR - An Extension of URCR for Software Pipelining," Proceedings of the 19th Annual Workshop on Microprogramming, Oct. 1986, pp. 104-108.
[24] L.-F. Chao, "Scheduling and Behavioural Transformations for Parallel Systems," PhD Dissertation, Princeton University, Oct. 1993.
[25] R. Camposano, "Path-Based Scheduling for Synthesis," IEEE Transactions on Computer-Aided Design, Vol. 10, 1991, pp. 85-93.
[26] K. K. Parhi and D. G. Messerschmitt, "Static Rate-Optimal Scheduling of Iterative Data-Flow Programs via Optimum Unfolding," IEEE Transactions on Computers, Vol. 40, Feb. 1991, pp. 178-195.


