RICE UNIVERSITY
Register Allocation via Graph Coloring
by
Preston Briggs
A Thesis Submitted in Partial Fulfillment of the Requirements for the Degree
Doctor of Philosophy
Approved, Thesis Committee:
Keith D. Cooper, Associate Professor, Chair, Computer Science
Ken Kennedy, Noah Harding Professor, Computer Science
Linda Torczon, Senior Research Associate, Computer Science
John Bennett, Assistant Professor, Electrical and Computer Engineering
Robert Michael Lewis, Research Associate, Mathematical Sciences
Houston, Texas
April, 1992
Register Allocation via Graph Coloring
Preston Briggs
Abstract
Chaitin and his colleagues at IBM in Yorktown Heights built the first global register allocator based on graph coloring. This thesis describes a series of improvements and extensions to the Yorktown allocator. There are four primary results:

Optimistic coloring. Chaitin's coloring heuristic pessimistically assumes any node of high degree will not be colored and must therefore be spilled. By optimistically assuming that nodes of high degree will receive colors, I often achieve lower spill costs and faster code; my results are never worse.

Coloring pairs. The pessimism of Chaitin's coloring heuristic is emphasized when trying to color register pairs. My heuristic handles pairs as a natural consequence of its optimism.

Rematerialization. Chaitin et al. introduced the idea of rematerialization to avoid the expense of spilling and reloading certain simple values. By propagating rematerialization information around the SSA graph using a simple variation of Wegman and Zadeck's constant propagation techniques, I discover and isolate a larger class of such simple values.

Live range splitting. Chow and Hennessy's technique, priority-based coloring, includes a form of live range splitting. By aggressively splitting live ranges at selected points before coloring, I am able to incorporate live range splitting into the framework of Chaitin's allocator.

Additionally, I report the results of experimental studies measuring the effectiveness of each of my improvements. I also report the results of an experiment suggesting that priority-based coloring requires O(n^2) time and that the Yorktown allocator requires only O(n log n) time. Finally, I include a chapter describing many implementation details and including further measurements designed to provide an accurate intuition about the time and space requirements of coloring allocators.
Acknowledgments
I wish to thank my committee: Keith Cooper, John Bennett, Ken Kennedy, Michael Lewis, and Linda Torczon. It is exceptionally difficult to acknowledge the help of my principal advisor, Keith Cooper, without simultaneously acknowledging the help of Linda Torczon. Their advice, support, and friendship helped make my graduate school career the most productive and enjoyable period of my life.(1) Ken Kennedy helped directly, with advice and encouragement, and indirectly, with his role in the establishment and direction of the Department of Computer Science at Rice. John Bennett and Michael Lewis served as outside members of my committee; I thank them for their interest and their criticism.

Naturally, I often talked to other members of the faculty, and I wish to thank Bob Hood, John Mellor-Crummey, and Scott Warren for their help and advice. I also wish to thank my fellow students, especially Alan Carle, Steve Carr, Ben Chase, Mary Hall, Kathryn McKinley, and Chau-Wen Tseng, for interesting conversation of all kinds. Vernon Lee gets special recognition for his courageous role as my "user community." Furthermore, I thank all those who worked on the IRn programming environment; without the foundation provided by their efforts, none of my experiments would have been possible. I also thank the systems support group, especially Vicky Dean, for handling all my problems over the years. Finally, I thank Ivy Jorgensen for her careful proofreading and for all her help with the required administrative details.

There are many people outside Rice, in both academia and industry, who have taken time to read and comment on my work. I'd like to thank David Callahan, Greg Chaitin, Fred Chow, Marty Hopkins, Brian Koblenz, and Chuck Lins for their help and encouragement. I also thank Randy Scarborough for his attention over the course of many years; his questions and comments prompted much of the work in this thesis.

(1) Therefore, the use of "we" throughout the text should be recognized as a further acknowledgement of their contribution.
Unlike many of the graduate students at Rice, I grew up in Houston and have a number of friends in the community. I would like to thank Paul and Kathryn Baffes, Jerry and Theresa Fuqua, Dave and Renee Odom, and Jamie, Donna, and Catherine Rickenbacker. Their friendship has eased my graduate years.

Financially, I have been supported from several sources. Support for my first year was provided by a Presidential Fellowship, as part of a program initiated by Norman Hackerman during his service as President of Rice University. Thereafter, I was supported as part of projects funded by IBM and DARPA.

Finally, I gratefully acknowledge my family's love and encouragement. Surely I owe them more than I can say.
Contents

Abstract
Acknowledgments
List of Illustrations
List of Tables

1 Introduction
  1.1 Compilers and Optimization
  1.2 Optimization and Register Allocation
  1.3 Register Allocation and Graph Coloring
    1.3.1 Minimizing Register Usage
    1.3.2 Minimizing Spill Code
  1.4 Overview
    1.4.1 Improved Coloring and Spilling
    1.4.2 Coloring Pairs
    1.4.3 Rematerialization
    1.4.4 Live Range Splitting

2 Background
  2.1 Register Allocation via Graph Coloring
  2.2 The Yorktown Allocator
    2.2.1 Discovering Live Ranges
    2.2.2 Interference
    2.2.3 The Interference Graph
    2.2.4 Coalescing
    2.2.5 Spilling
    2.2.6 Coloring
  2.3 History
    2.3.1 Memory Allocation
    2.3.2 Register Allocation

3 Improved Coloring and Spilling
  3.1 Problems
    3.1.1 The Smallest Example
    3.1.2 A Large Example
  3.2 An Improvement
  3.3 Limited Backtracking
  3.4 Alternative Spill-Choice Metrics
  3.5 Summary

4 Coloring Register Pairs
  4.1 Why Are Pairs Hard to Color?
    4.1.1 Unconstrained Pairs
    4.1.2 Adjacent Pairs
  4.2 The Optimistic Coloring Heuristic
  4.3 Unaligned Pairs
  4.4 Using Pairs for Memory Access
  4.5 Summary

5 Rematerialization
  5.1 Introduction
  5.2 Rematerialization
    5.2.1 Discovering Values
    5.2.2 Propagating Rematerialization Information
    5.2.3 Inserting Splits
    5.2.4 Removing Unproductive Splits
  5.3 Implementation
    5.3.1 Renumber
    5.3.2 Conservative Coalescing
    5.3.3 Biased Coloring
  5.4 Results
  5.5 Summary

6 Aggressive Live Range Splitting
  6.1 Live Range Splitting
    6.1.1 Theoretical Difficulties
    6.1.2 Practical Difficulties
  6.2 Aggressive Live Range Splitting
    6.2.1 Splitting
    6.2.2 Spilling
    6.2.3 Cleanup
  6.3 Implementation
  6.4 Splitting
    6.4.1 Loop-Based Splitting
    6.4.2 Splitting Based on Dominance
    6.4.3 Mechanics
  6.5 Summary

7 Measurements and Comparisons
  7.1 Measuring Allocation Quality
    7.1.1 Methodology
    7.1.2 Results
  7.2 Priority-Based Coloring
    7.2.1 Allocation Quality
    7.2.2 Allocation Time

8 Engineering
  8.1 The Internal Representation
  8.2 Set Representations
  8.3 Liveness
  8.4 Live Ranges
  8.5 The Interference Graph
  8.6 Coalescing
  8.7 Spill Costs
  8.8 Coloring
  8.9 Compile-Time Characteristics
  8.10 Discussion

9 Conclusion
  9.1 Register Allocation and Optimization
  9.2 Register Allocation and Graph Coloring
  9.3 Contributions
  9.4 Future Work

A Detailed Results
  A.1 Forsythe, Malcolm, and Moler
    A.1.1 fmin
    A.1.2 rkf45
    A.1.3 seval
    A.1.4 solve
    A.1.5 svd
    A.1.6 urand
    A.1.7 zeroin
  A.2 SPEC
    A.2.1 doduc
    A.2.2 fpppp
    A.2.3 matrix300
    A.2.4 tomcatv

Bibliography
Illustrations

2.1 The Yorktown Allocator
2.2 Renumbering
2.3 Effects of Coalescing
2.4 Coloring a Simple Graph
3.1 A Simple Graph Requiring Two Colors
3.2 The Structure of SVD
3.3 The Optimistic Allocator
5.1 Rematerialization versus Spilling
5.2 Introducing Splits
6.1 Splitting
6.2 Spilling Partners
6.3 Splitting and Spilling
6.4 Globally Unnecessary Stores
6.5 The Splitting Allocator
6.6 Splitting Unmentioned Live Ranges
6.7 Splitting at Dominance Frontiers
7.1 ILOC and C
7.2 Allocation Times
7.3 Observed Complexity of Priority-Based Coloring
7.4 Observed Complexity of the Yorktown Allocator
8.1 Blocks and Edges
8.2 Instructions
8.3 Instruction Kinds
8.4 Set Representation
8.5 Passing Parameters in Registers
Tables

7.1 Effects of Optimistic Coloring
7.2 Effects of Limited Backtracking
7.3 Effects of Rematerialization
7.4 Effects of SSA-Based Splitting
7.5 Effects of Splitting at Dominance Frontiers
8.1 Compile-Time Characteristics
Chapter 1 Introduction
1.1 Compilers and Optimization
Classically, an optimizing compiler is divided into three stages:

The front end translates the source language into an intermediate form. This translation may be accomplished in one or more passes over the code, depending on the structure of the source language. Compile-time error checking is usually performed at this stage. Ideally, the front end is language dependent and machine independent.

The optimizer consists of several passes, each performing specific transformations on the intermediate form. While the transformations are intended to improve performance of the final code, there is no question of achieving any sort of real optimality. We are interested in global optimizations; that is, optimizations that use information gathered from an entire routine to guide transformations. Common global optimizations include strength reduction, loop-invariant code motion, and common subexpression elimination [4]. The optimizer is intended to be both language and machine independent.

The back end translates the intermediate form into a machine-specific form, usually object code. This translation, also called code generation, may require several passes, including instruction selection, instruction scheduling, and register allocation. The back end is largely language independent and machine dependent.

This division provides a useful separation of concerns, simplifying the development and maintenance of each stage. Additionally, there is the possibility of reusing each stage in several compilers. For example, a completely machine-independent front end for FORTRAN might be used in compilers for many different machines. Of course, this is an idealized view. In practice, each stage tends to exhibit both language and machine dependencies. These dependencies inhibit reuse and maintenance and are therefore a target of compiler designers.
Typically, we see such reuse only when compiling closely related languages (e.g., FORTRAN and C) for closely related machines (e.g., the common 32-bit RISC processors).
1.2 Optimization and Register Allocation
A register is one of a small number of high-speed memory locations in a computer's CPU. Registers differ from ordinary memory locations in several respects. The register set is small; a register may be directly addressed with a few bits. Memory can be quite large; a memory location is usually specified indirectly, using an "addressing mode" that includes one or more register references. Registers are fast; typically, two registers can be read and a third written, all in a single cycle. Memory is slower; a single access can require several cycles. The limited size and high speed of the register set make it one of the critical resources in most computer architectures. A register allocator, typically one phase of the back end, controls utilization of the register set by a compiled program.

Registers, and therefore register allocators, must serve many purposes. In the simplest case, operands for primitive machine instructions must appear in registers. Intermediate results, arising during the evaluation of complex expressions, are held in registers by even naive compilers. More sophisticated compilers attempt to place frequently used variables in registers to avoid repeated fetches and stores. For an optimizing compiler, registers are the ideal place to hold values for reuse after common subexpression elimination or loop-invariant code motion.

It is in connection with optimization that register allocation becomes crucially important. During the development of the first FORTRAN compiler, John Backus suggested that the optimization of subscript expressions should be considered separately from the question of allocating index registers [7]. This idea has since been extended beyond the problems of optimizing subscript expressions; our approach to the design of optimizing compilers says: During optimization, assume an infinite set of registers; treat register allocation as a separate problem.
This important, perhaps essential, separation of concerns enables optimization to proceed in a relatively simple fashion, optimistically avoiding difficult choices caused by limited resources. This point of view was promoted by John Cocke and led to the development of the influential PL.8 compiler and 801 computers [50, 6]. When there are enough registers, this separation of concerns looks like a good idea. When there are not enough registers, the assumption underlying the separation of concerns breaks down and we see cases where optimization causes degradation due to lack of registers.
The task of register allocation may be attacked at one of several levels:

Register allocation may be performed over expressions. This technique is a form of instruction scheduling, with the goal of reducing register requirements. Work by Aho, Johnson, Sethi, and Ullman considers how to minimize register requirements by careful ordering of expression evaluation [60, 2].

More aggressive allocators can manage registers over a complete basic block. Work by Freiburghouse suggests one practical approach [37]. Further work by Aho, Johnson, and Ullman proves the difficulty of generating optimal code in the presence of common subexpressions [3].

Global allocators work over an entire routine. Chaitin's allocator operates at this level. Other examples include work by Chow and Hennessy and work by Callahan and Koblenz [26, 16].

Interprocedural register allocation works over a collection of routines, usually an entire program. Examples include work by Wall and work by Santhanam and Odnert [63, 58].

We believe that global register allocation is required to support global optimization.
1.3 Register Allocation and Graph Coloring
Unfortunately, good register allocation is difficult. Idiosyncratic machine details complicate even the simplest allocators. Robust allocators must also deal gracefully with complex programs and inadequate numbers of registers. Furthermore, attempts to achieve optimal solutions for any of these problems invariably lead to combinatorial explosion.

Graph coloring offers a simplifying abstraction. By building and coloring an interference graph representing the constraints essential to register allocation, we are able to handle many apparently disparate details in a unified fashion. Nodes in the interference graph represent live ranges (e.g., variables and temporaries) and edges represent interferences between live ranges. Roughly, if two live ranges are both live at some point in the routine, they are said to interfere and cannot occupy the same register. (Several definitions of live and interfere are possible; see Section 2.2.2 for more discussion.) If the nodes in the graph can be colored with k or fewer colors, where any pair of nodes connected by an edge receive different colors and k is the number of registers available on the machine, then the coloring corresponds to an allocation. If a k-coloring cannot be discovered, then the code must be modified and a new coloring attempted. A global register allocator based on this approach was developed by Greg Chaitin and his colleagues at IBM [20].
1.3.1 Minimizing Register Usage
Informally, the goal of register allocation is to minimize the number of loads and stores that must be executed. Reducing the register allocation problem to the graph coloring problem subtly changes the goal; instead of minimizing memory traffic, the "reduced" goal is minimizing register usage. In other words, by shifting our attention to graph coloring, we are attacking a nearby problem. Note, though, that this new goal is well suited to the style of optimization advocated by Backus and Cocke (i.e., optimistically assuming an infinite register set during optimization). Many transformations are justified only if there is a register available to hold a temporary value. Proceeding optimistically, the optimizer will always assume such transformations are profitable. If the register allocator is able to map all of the registers used by the optimizer onto the finite set of machine registers, then all of the optimizer's assumptions will be correct. Therefore, the goal of minimal register usage is desirable, as is a large register set; each supports the optimizer.
1.3.2 Minimizing Spill Code
Even with optimal coloring and a large register set, it will sometimes be necessary to spill certain values to memory. Several difficult problems arise. Overall, we wish to minimize the dynamic cost of inserted spill instructions (loads and stores). We must somehow choose live ranges to spill that are cheap to spill and that relax pressure in the graph, allowing coloring to progress. Furthermore, we must choose where to place spill instructions. These problems are all complex and highly interrelated; nevertheless, good solutions are required.
1.4 Overview
This thesis presents a series of extensions to Chaitin's work on register allocation via graph coloring. Chapter 2 presents an introduction to Chaitin's work and a brief history of the field. The main results of the thesis are contained in Chapters 3 through 6; they are briefly introduced in the sections below. Chapter 7 describes the framework used to compare allocation techniques and summarizes the results of a series of experiments testing the efficacy of our improvements. Additionally, we briefly compare the Yorktown allocator with priority-based coloring, a competitive approach developed by Chow and Hennessy [26]. Chapter 8 describes many of the important details required for efficient implementation of a graph coloring allocator. Additionally, a variety of measurements are included to help provide intuition about the expected costs, in time and space, of a coloring allocator.
1.4.1 Improved Coloring and Spilling
Optimal graph coloring is unlikely to be practical. The problem of determining the minimal number of colors needed to color an arbitrary graph is NP-complete [48]. Furthermore, the problem of finding a k-coloring for some fixed k ≥ 3 is also NP-complete [39]. Finally, Garey and Johnson have shown that unless P = NP, no polynomial-time heuristic can guarantee using less than twice the minimal number of colors [38]. Due in part to these pessimistic results, there has been little study in the area of optimal coloring for global register allocation. Instead, researchers have concentrated on finding efficient heuristic approaches [18, 46, 9]. Useful heuristics extend naturally to help solve the spill problem; that is, they provide guidance when live ranges must be spilled.

In Chapter 3, we present a refinement to Chaitin's coloring heuristic. Our heuristic may be considered optimistic in contrast to Chaitin's pessimistic approach. The optimistic heuristic spills a subset of the live ranges spilled by the pessimistic heuristic.
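To make the contrast concrete, consider a four-node cycle with k = 2: every node has degree 2, so no node of degree < k can be found, and the pessimistic rule spills even though the graph is 2-colorable. The sketch below is my own illustrative code, not the thesis's implementation; in particular, choosing the spill candidate by highest degree is a stand-in for the cost-based metrics discussed in Chapter 3.

```python
# Sketch: pessimistic vs. optimistic coloring (illustrative only).
# The pessimistic rule spills a node as soon as simplify is blocked;
# the optimistic rule pushes the node anyway and spills only if no
# color remains free when the node is popped during select.

def color(graph, k, optimistic):
    g = {n: set(adj) for n, adj in graph.items()}   # mutable copy
    stack, spilled = [], []
    while g:
        node = next((n for n in g if len(g[n]) < k), None)
        if node is None:                            # simplify is blocked
            node = max(g, key=lambda n: len(g[n]))  # pick a spill candidate
            if not optimistic:
                spilled.append(node)                # pessimistic: spill now
                for m in g.pop(node):
                    g[m].discard(node)
                continue
        stack.append(node)                          # optimistic: defer decision
        for m in g.pop(node):
            g[m].discard(node)
    colors = {}
    while stack:                                    # select
        n = stack.pop()
        used = {colors[m] for m in graph[n] if m in colors}
        free = [c for c in range(k) if c not in used]
        if free:
            colors[n] = free[0]
        else:
            spilled.append(n)                       # only now does a spill occur
    return colors, spilled

square = {'a': ['b', 'd'], 'b': ['a', 'c'],
          'c': ['b', 'd'], 'd': ['a', 'c']}
pess_colors, pess_spills = color(square, 2, optimistic=False)
opt_colors, opt_spills = color(square, 2, optimistic=True)
print(len(pess_spills))   # 1  (the pessimistic rule spills a node)
print(opt_spills)         # []  (the optimistic rule 2-colors the cycle)
```

The two runs differ only in when the spill decision is made; deferring it lets select discover that the deferred node's neighbors happen to share a color, which is exactly the situation the pessimistic rule cannot exploit.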
1.4.2 Coloring Pairs
On many important architectures, a pair of single-precision floating-point registers may be treated as a double-precision register. Additionally, some machines provide instructions that load and store pairs and quadruples in a single instruction. Unfortunately, there is no adequate way to take advantage of these features using Chaitin's allocator. Chapter 4 discusses the problem in some detail. We show why Chaitin's coloring heuristic overspills in the presence of register pairs and why our optimistic coloring heuristic avoids overspilling.
1.4.3 Rematerialization
Many important details must be handled correctly for best results from a global allocator. Stated more strongly, they must be handled correctly to achieve even acceptable results. One example, mentioned briefly by Chaitin et al., is the idea of rematerialization. This is a technique required for acceptably clean spilling of live ranges defined by constants and other simple expressions. In Chapter 5, we introduce an extension to Chaitin's allocator allowing precise spilling and rematerialization of a wider class of live ranges.
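The intuition behind rematerialization can be sketched as a cost comparison: a value produced by a "never-killed" instruction, such as a load-immediate or an address computation, can simply be recomputed before each use, avoiding both the store after its definition and the load before each use. The per-instruction costs and function names below are my own illustrative assumptions, not the thesis's implementation.

```python
# Sketch of the cost comparison behind rematerialization (assumed costs,
# illustrative names).  Spilling inserts a store after each definition and
# a load before each use; rematerializing re-issues the cheap defining
# instruction before each use and needs no stores at all.

STORE, LOAD, RECOMPUTE = 2, 2, 1      # assumed per-instruction costs

def spill_cost(defs, uses):
    return STORE * defs + LOAD * uses

def remat_cost(uses):
    return RECOMPUTE * uses           # no stores; just recompute the value

def best_strategy(is_never_killed, defs, uses):
    # Only values that are never redefined along any path can safely be
    # recomputed at their uses.
    if is_never_killed and remat_cost(uses) < spill_cost(defs, uses):
        return "rematerialize"
    return "spill"

print(best_strategy(True, defs=1, uses=3))    # rematerialize
print(best_strategy(False, defs=1, uses=3))   # spill (value may be redefined)
```

The hard part, addressed in Chapter 5, is discovering which values are safely recomputable, which is where the SSA-based propagation comes in.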
1.4.4 Live Range Splitting
Fabri and Chow independently observed that splitting a single live range into several pieces and considering the new, smaller live ranges separately can produce an interference graph that colors with fewer colors [34, 25]. Chow and Hennessy used this idea, called live range splitting, as the basis for a new allocator that avoided spilling when splitting was possible. Live range splitting has several merits. If an entire live range is spilled, as in Chaitin's work, its value will reside in a register only for short periods around each definition and use. Splitting allows the value to stay in a register over longer intervals, often an entire block or several blocks. Unfortunately, live range splitting is difficult. There are two fundamental problems: picking live ranges to split and picking places to split them. While optimal solution of either of these problems is certainly NP-hard, Chapter 6 extends the ideas introduced in Chapter 5 to attack both of these problems.
Chapter 2 Background
The first implementation of a graph coloring register allocator was described by Chaitin et al. [20]. This chapter explains their allocator in some detail. The first section introduces the general concept of register allocation via graph coloring. The second concentrates on the Yorktown allocator, including explanations of the individual phases. The last section gives a brief history of the area.
2.1 Register Allocation via Graph Coloring
We assume that the allocator works on low-level intermediate code, similar to assembly. The code has been shaped by an optimizer, addressing modes have been determined, and an execution order has been fixed. Of course, these assumptions ignore the possibility of cooperation between allocation and other parts of the compiler; see Chapter 9 for a discussion of these opportunities. For simplicity when discussing the generation of spill code, we assume a load-store architecture; however, provisions can be made for more complex target architectures (see Chapter 8).

Before allocation, the intermediate code can reference an unlimited number of registers. We refer to this unrestricted set of "pre-allocation" registers as virtual registers. The goal of allocation is to rewrite the intermediate code so that it uses only the registers available on the target machine, the machine registers. Note that both virtual registers and machine registers serve simply as names, much like variables in C and FORTRAN. In a manner common to other portions of the compiler, we care little about names per se; instead, we care about the named objects. In the case of register allocation, we are concerned with values and live ranges. A value corresponds to a single definition. A live range is composed of one or more values, connected by common uses. On input to the allocator, all the values comprising a single live range will be named by the same virtual register. Furthermore, a single virtual register may also name several other live ranges. Similarly, any machine register will usually name several live ranges after allocation.
To model register allocation as a graph coloring problem, the compiler first constructs an interference graph G. The nodes in G represent live ranges and the edges represent interferences. Thus, there is an edge in G from node i to node j if and only if live range li interferes with live range lj; that is, they are simultaneously live at some point and cannot occupy the same register. The live ranges that interfere with a live range li are called neighbors of li in the graph; the number of neighbors is the degree of li, denoted li°. To find an allocation from G, the compiler looks for a k-coloring of G, an assignment of colors to the nodes of G such that neighboring nodes always have distinct colors. If we choose k to match the number of machine registers, then we can map a k-coloring into a feasible register assignment. Because finding a k-coloring of an arbitrary graph is NP-complete, the compiler uses a heuristic technique to search for a coloring; it is not guaranteed to find a k-coloring for all k-colorable graphs. Of course, some routines are sufficiently complex that no k-coloring is possible, even with an exhaustive coloring algorithm; their interference graphs are simply not k-colorable. If a k-coloring cannot be found, some live ranges are spilled; that is, kept in memory rather than registers.
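The construction of G from simultaneous liveness can be sketched as follows. This uses a deliberately simplified notion of interference, where each entry of the hypothetical `live_at` list is the set of live ranges live at one program point; Section 2.2.2 discusses the definitions actually used.

```python
# Sketch: building an interference graph from live sets (simplified
# definition of interference; the thesis refines it in Section 2.2.2).
from itertools import combinations

def build_graph(live_at):
    graph = {}
    for live_set in live_at:
        for lr in live_set:
            graph.setdefault(lr, set())
        for a, b in combinations(live_set, 2):
            graph[a].add(b)          # a and b are simultaneously live,
            graph[b].add(a)          # so they cannot share a register

    return graph

# x and y overlap at one point, y and z at another; x and z never overlap.
live_at = [{'x', 'y'}, {'y', 'z'}, {'z'}]
g = build_graph(live_at)
degree = {lr: len(neighbors) for lr, neighbors in g.items()}
print(degree)   # y interferes with both x and z, so y has degree 2
```

Note that x and z may safely share a register here, which is exactly the information the coloring phase exploits.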
2.2 The Yorktown Allocator
The first implementation of a global register allocator based on graph coloring was done by Chaitin and his colleagues as part of the PL.8 compiler at IBM Research in Yorktown Heights [6]. Further work by Chaitin yielded an improved coloring heuristic that attacked the problems of coloring and spilling in an integrated fashion [18]. This thesis builds directly upon the work of Chaitin and his colleagues; therefore, it is important to establish a clear understanding of (our interpretation of) their work. Figure 2.1 illustrates the overall flow of the Yorktown allocator.
Figure 2.1 The Yorktown Allocator
[The phases run in sequence: renumber, build, coalesce, spill costs, simplify, select. Build and coalesce repeat until no more copies can be removed; when spilling is needed, spill code is inserted and the allocator restarts at renumber.]
Renumber This phase finds all the live ranges in a routine and numbers them uniquely. In the papers on the PL.8 compiler, this type of analysis is referred to as getting "the right number of names."

Build The next step is to construct the interference graph G. For efficiency, G is simultaneously represented in two forms: a triangular bit matrix and a set of adjacency vectors.

Coalesce The allocator removes unneeded copies, eliminating the copy instruction itself and combining the source and target live ranges. A copy may be removed if the source and target live ranges do not interfere. We denote the coalesce of li and lj as lij. Since the removal of a copy instruction can change the interference graph, we repeat build and coalesce until no more copies can be removed. However, when the allocator combines li and lj, it can quickly construct a conservative approximation to the set of interferences for lij. The conservative update of G lets the allocator perform many combining steps before rebuilding the graph; in practice, we make a complete pass over the code before rebuilding.

Spill Costs In preparation for coloring, a spill cost estimate is computed for every live range l. The spill cost for l is an estimate of the cost of the load and store instructions that would be required to spill l. The cost of each instruction is weighted by 10^d, where d is the instruction's loop nesting depth, giving a simple approximation of the actual impact at runtime.

Simplify This phase and select cooperate to color the interference graph. Simplify repeatedly examines the nodes in G, removing all nodes with degree < k. As each node is removed, its edges are also removed (decrementing the degrees of its neighbors) and it is pushed on a stack s. If we reach a point where every node remaining in G has degree ≥ k, a node is chosen for spilling.
Rather than spilling its corresponding live range immediately (requiring updates of the code and recomputation of the interference graph), the chosen node is simply removed from G and marked for spilling. Eventually, G will be empty. If any nodes have been marked for spilling, they are spilled in the spill code phase and the entire allocation process is repeated. Otherwise, no spill code is required and s is passed on to select.

Select Colors are chosen for nodes in the order determined by simplify. In turn, each node is popped from s, reinserted in G, and given a color distinct from its neighbors. The success of simplify ensures that a color will be found for each node as it is inserted.
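The simplify/select discipline just described can be sketched as follows. This is an illustrative Python sketch, not the thesis's implementation: the graph is a dictionary of neighbor sets, costs are a dictionary, and the spill choice uses Chaitin's cost/degree ratio (Section 2.2.6).

```python
def chaitin_allocate(adj, k, cost):
    """One coloring pass of Chaitin's scheme: simplify, then select.
    adj maps each node to the set of its neighbors; cost maps each node
    to its estimated spill cost.  Returns (coloring, spilled)."""
    degree = {n: len(adj[n]) for n in adj}
    removed, stack, spilled = set(), [], set()

    def remove(n):
        removed.add(n)
        for m in adj[n]:
            if m not in removed:
                degree[m] -= 1

    # Simplify: repeatedly remove nodes of degree < k; when stuck,
    # mark the node with the smallest cost/degree ratio for spilling.
    while len(removed) < len(adj):
        low = [n for n in adj if n not in removed and degree[n] < k]
        if low:
            n = low[0]
            stack.append(n)
        else:
            n = min((n for n in adj if n not in removed),
                    key=lambda n: cost[n] / degree[n])
            spilled.add(n)
        remove(n)

    if spilled:
        return None, spilled    # caller inserts spill code and retries

    # Select: pop nodes and give each a color distinct from its neighbors.
    coloring = {}
    while stack:
        n = stack.pop()
        used = {coloring[m] for m in adj[n] if m in coloring}
        coloring[n] = next(c for c in range(k) if c not in used)
    return coloring, spilled
```

Note that a node reaches the stack only when its degree in the remaining graph is below k, which is exactly the invariant that makes select's color choice always succeed.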
Spill Code In a single pass over the routine, spill code is inserted for each spilled live range. Since we are assuming a load-store architecture, spilling requires (approximately) a load instruction before each reference to a spilled live range and a store after each definition of a spilled live range. Refinements to this simple policy are introduced below. Note that values are spilled to locations in the current stack frame. There are several reasons for this policy. First, many values have no natural location in memory; e.g., compiler-generated temporaries. Furthermore, by spilling to the stack, we are able to handle recursive and reentrant routines. Finally, locations in the stack frame can typically be accessed quickly. The following sections give further detail about the various phases of allocation.
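The basic insertion policy, before any refinements, can be sketched as follows. The three-address tuple form and the slot names are hypothetical, chosen only for illustration.

```python
def insert_spill_code(code, spilled):
    """Naive spill insertion for a load-store machine: a load before each
    use of a spilled live range and a store after each definition.
    Each instruction is (op, dst, src1, src2); spilled maps a spilled
    live range to its stack-frame slot."""
    out = []
    for op, dst, s1, s2 in code:
        loaded = set()
        for src in (s1, s2):
            # reload each spilled operand once before the instruction
            if src in spilled and src not in loaded:
                loaded.add(src)
                out.append(('load', src, spilled[src], None))
        out.append((op, dst, s1, s2))
        # store the result back to its slot after each definition
        if dst in spilled:
            out.append(('store', spilled[dst], dst, None))
    return out
```

The refinements described in Section 2.2.5 (recomputing cheap values, suppressing loads between close references) would prune many of these inserted instructions.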
2.2.1 Discovering Live Ranges
In a given routine, a variable i may be used many times, for many different tasks. Similarly, a routine expressed in intermediate form after optimization may use the same virtual register for several purposes. However, there is no need for the allocator to assign each disjoint use of some virtual register to the same machine register. In fact, such behavior is undesirable since it constrains the possible colorings. Each disjoint use of a virtual register is a unique live range, and it is the live ranges in a routine that are colored by the allocator. Therefore, the first task of the allocator is to discover the live ranges in a routine. This procedure is called getting the right number of names by Chaitin and web analysis by Johnson and Miller [46]. In our implementation, each live range is given a unique index and the intermediate code is rewritten in terms of live range indices instead of the original virtual register numbers; hence the term renumber. Conceptually, live ranges are discovered by finding connected groups of def-use chains. A single def-use chain connects the definition of a virtual register to all of its uses. When several def-use chains share a single use (in other words, when several definitions reach a single use), we say they are connected by the use. Of course, all the chains originating at a given definition are considered connected. Consider the example shown in Figure 2.2. The upper half is an abstract control-flow graph with a few low-level statements representing code before renumbering. The lower half illustrates the same code, but rewritten to illustrate the effect of renumbering. Notice that four different live ranges have been discovered, all originally represented by r50. The simple cases (r0 and r2) are restricted to a single basic block.
Figure 2.2 Renumbering
[The upper half shows a control-flow graph in which every definition and use names r50; the lower half shows the same code after renumbering, rewritten in terms of four distinct live ranges r0, r1, r2, and r3.]
The live range represented by r1 is more complex: three definitions in different basic blocks are connected by uses in three other basic blocks. It is precisely this sort of code that makes global allocators more powerful than allocators that are restricted to expressions or basic blocks. The efficient implementation of renumber is discussed in Section 8.4.
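The conceptual algorithm, unioning def-use chains that share a use, can be sketched with a disjoint-set structure. The chain representation below (a list of (definition, uses) pairs for one virtual register) is an assumption for illustration; the thesis's implementation is the subject of Section 8.4.

```python
class DisjointSet:
    """Fast disjoint-set union with path halving."""
    def __init__(self):
        self.parent = {}
    def find(self, x):
        self.parent.setdefault(x, x)
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]
            x = self.parent[x]
        return x
    def union(self, a, b):
        ra, rb = self.find(a), self.find(b)
        if ra != rb:
            self.parent[rb] = ra

def discover_live_ranges(chains):
    """chains: a list of (definition, uses) pairs for one virtual
    register.  Chains sharing a use belong to the same live range.
    Returns a map from definition to a live-range number."""
    ds = DisjointSet()
    by_use = {}
    for d, uses in chains:
        ds.find(d)                       # register the definition
        for u in uses:
            if u in by_use:
                ds.union(by_use[u], d)   # two definitions reach one use
            else:
                by_use[u] = d
    roots = {}
    return {d: roots.setdefault(ds.find(d), len(roots))
            for d, _ in chains}
```

In the spirit of Figure 2.2, two definitions that reach a common use fall into one live range, while a definition with disjoint uses gets its own number.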
2.2.2 Interference
The concept of interference is an important key to understanding graph coloring allocators. Intuitively, if the allocation of two live ranges to the same register changes the meaning of the program, they interfere. Chaitin et al. give a precise set of conditions for interference, noting that two live ranges interfere if there exists some point in the procedure and a possible execution of the procedure such that:

1. both live ranges have been defined,
2. both live ranges will be used, and
3. the live ranges have different values.

Each of these conditions is generally undecidable, as is their intersection. Chaitin's approach is to approximate interference by noting which live ranges are both live and available (in the dataflow sense) at each assignment. We say that a live range l for a variable v is live at some statement s if there exists a path from s to some use of v and there is no assignment to v on the path. Similarly, l is available at s if there is a path from a definition of v leading to s. Note that availability and liveness correspond to conditions 1 and 2 above; they are conservative approximations of the exact but undecidable conditions required for interference.2 By handling copy instructions specially, Chaitin is also able to achieve a conservative approximation of condition 3. Since the source and destination live ranges will certainly have the same value at a copy, they need not interfere. In fact, for there to be any possibility of coalescing, they must not interfere. Of course, if they interfere for other reasons (perhaps one is incremented in a loop), then an interference will be added at another point in the code and coalescing will be correctly inhibited.
2 Condition 2 specifies that a value will be used; liveness says that it may be used. It is the absolute guarantee of "will be used" that makes condition 2 undecidable in the general case. Similar arguments hold for conditions 1 and 3.
An alternative approach is to add interferences at each block by making all live ranges that are both live and available at the end of each basic block interfere with each other. However, this approach is less precise than Chaitin's idea of adding interferences at each assignment, due to Chaitin's careful handling of copy instructions. Furthermore, the block-level approach can require much more time, since it can require adding O(n²) interferences at each block versus O(n) at each assignment. The exact trade-off is difficult to determine, since it depends on the number of assignments and the average number of live ranges (n) alive across each point in the routine.
2.2.3 The Interference Graph
One of the central data structures in the Yorktown allocator is the interference graph. Viewed as an abstract data type, the interference graph must provide five operations:

new(n) Return a graph with n nodes, but no edges.
add(g, x, y) Return a graph including g with an edge between the nodes x and y.
interfere(g, x, y) Return true if there exists an edge between the nodes x and y in the graph g.
degree(g, x) Return the degree of the node x in the graph g.
neighbors(g, x, f) Apply the function f to each neighbor of the node x in the graph g.

In practice, the interference graph is implemented using two representations: a triangular bit matrix and a set of adjacency vectors. The bit matrix supports constant-time implementations of add and interfere, while the adjacency vectors support the efficient implementation of neighbors. While initialization of the bit matrix requires O(n²) time, the constant is small in practice (see Sections 7.2.2 and 8.5). Note that the dual representation is important for efficiency. Without the bit matrix, the speeds of interfere and add degrade sharply. Alternatively, without the adjacency vectors, the cost of visiting all the neighbors of a node increases, raising the cost of coloring from O(n + e) to O(n²). While e is theoretically bounded by n², in practice e ≪ n². An alternative implementation, based on a hash table of interfering pairs, offers the same asymptotic efficiencies and avoids the O(n²) space requirement of the bit matrix. In practice, the time considerations favor the bit-matrix representation. Space considerations also favor the bit-matrix representation for small graphs; for large graphs, the hash-table representation may become desirable.
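The five operations and the dual representation can be sketched as follows. This is a simplified illustration, not the thesis's implementation: Python sets stand in for the adjacency vectors, and a flat bytearray holds the triangular bit matrix.

```python
class InterferenceGraph:
    """Dual representation: a triangular bit matrix for constant-time
    add/interfere, plus per-node adjacency sets for fast neighbor
    traversal.  Nodes are the integers 0..n-1."""
    def __init__(self, n):                       # new(n)
        self.bits = bytearray((n * (n - 1) // 2 + 7) // 8)
        self.adj = [set() for _ in range(n)]

    def _index(self, x, y):
        # only the strict lower triangle is stored: entry (x, y), x > y
        if x < y:
            x, y = y, x
        return x * (x - 1) // 2 + y

    def add(self, x, y):                         # add(g, x, y)
        i = self._index(x, y)
        if not self.bits[i >> 3] & (1 << (i & 7)):
            self.bits[i >> 3] |= 1 << (i & 7)
            self.adj[x].add(y)
            self.adj[y].add(x)

    def interfere(self, x, y):                   # interfere(g, x, y)
        i = self._index(x, y)
        return bool(self.bits[i >> 3] & (1 << (i & 7)))

    def degree(self, x):                         # degree(g, x)
        return len(self.adj[x])

    def neighbors(self, x, f):                   # neighbors(g, x, f)
        for y in self.adj[x]:
            f(y)
```

The bit-matrix test in add also keeps the adjacency sets free of duplicates, mirroring the role the matrix plays in the real allocator when the same interference is discovered at several assignments.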
Early versions of the Yorktown allocator constructed the interference graph in a single pass, storing adjacencies in a linked list of short vectors. Later versions adopted a two-pass approach, storing adjacencies in a contiguous vector.3

1. Initially, the bit matrix is allocated and cleared. A pass is made over the code, filling in the bit matrix and accumulating each node's degree.
2. After the degree of every node is known, adjacency vectors are allocated and the bit matrix is reinitialized. In a second pass over the code, interferences are recorded in the bit matrix and the adjacency vectors.

After each pass of coalesce, the graph must be reconstructed. In practice, only step 2 must be repeated, since the degree of each node can be incrementally maintained while coalescing. In our implementation, each pass runs backward over each basic block in the control-flow graph, incrementally maintaining a set s of all live ranges that are currently live and available. At each definition, edges are added between the defined value and all members of s. See Section 8.5 for more details on the efficient construction of the interference graph. After completing the build-coalesce loop, the memory required for the bit matrix may be deallocated. The adjacency vectors are required for further use during coloring, both by simplify and select.
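A sketch of the backward pass over one basic block follows. The instruction form (op, destination, source list), the 'copy' opcode, and the live-out set are assumptions for illustration; copies receive Chaitin's special treatment by omitting the copy source from the set of interferences added at the definition.

```python
def build_block(block, liveout, add_edge):
    """Run backward over one basic block, maintaining the set of live
    ranges currently live; at each definition, add interference edges
    between the defined value and the members of the set.  Returns the
    live-in set of the block."""
    live = set(liveout)
    for op, dst, srcs in reversed(block):
        live.discard(dst)
        if op == 'copy':
            # source and destination hold the same value at the copy,
            # so no edge is added between them (condition 3)
            for l in live - {srcs[0]}:
                add_edge(dst, l)
        else:
            for l in live:
                add_edge(dst, l)
        live.update(srcs)
    return live
```

On the code of Figure 2.3, this pass adds the edge between y and z but not the edge between x and y, which is what makes the later coalesce of x and y possible.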
2.2.4 Coalescing
After building the interference graph, we are in a position to perform coalescing (also called subsumption and copy propagation). The code is traversed in any convenient order. At each copy instruction, we check to see if the source and target live ranges interfere; if not, they may be coalesced and the copy instruction may be deleted. To coalesce two live ranges lx and ly, forming a third lxy, we simply replace every mention of either live range with a reference to the result; that is, we replace every mention of lx and ly with lxy. To perform coalescing efficiently, we establish equivalence classes for the live ranges. As live ranges are coalesced, their equivalence classes are unioned, using a fast disjoint-set union algorithm [1, pages 129–139].
3 The reasoning was that the vectors offered quicker traversal and better locality.
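One pass of coalescing, with the conservative neighbor-union update, can be sketched as follows. The representations are illustrative assumptions: copies are (source, target) pairs, the graph is a dictionary of neighbor sets, and a small representative map stands in for the disjoint-set equivalence classes.

```python
def coalesce_pass(copies, adj):
    """One pass of coalescing: for each copy (src, dst), if the two live
    ranges do not interfere, merge dst into src and delete the copy.
    The merged node conservatively inherits the neighbors of both.
    Returns the copies that could not be removed."""
    find = {}                                # node -> representative

    def rep(x):
        while x in find:
            x = find[x]
        return x

    kept = []
    for src, dst in copies:
        a, b = rep(src), rep(dst)
        if a == b:
            continue                         # already coalesced
        if b in adj[a]:
            kept.append((src, dst))          # they interfere: keep the copy
            continue
        # conservative update: the union may overstate interference,
        # which is why build and coalesce are repeated to a fixed point
        adj[a] |= adj[b]
        for n in adj[b]:
            adj[n].discard(b)
            adj[n].add(a)
        del adj[b]
        find[b] = a
    return kept
```

As the text notes, the unioned neighbor set can retain edges (like the xy-z edge of Figure 2.3) that an exact rebuild would drop; the sketch preserves exactly that imprecision.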
Figure 2.3 Effects of Coalescing
[The left column shows code in which y is defined by a copy from x, and z is defined while y is live; the right column shows the same code after coalescing x and y into xy and deleting the copy.]
When two live ranges lx and ly are coalesced, we must update the interference graph so that lxy interferes with the neighbors of lx and with the neighbors of ly. While we are able to perform this update accurately, we cannot accurately reflect updates due to copy instructions being removed. Recall that interferences are added at each assignment. When we remove a copy instruction (which is an assignment), some interferences may also be removed. Figure 2.3 illustrates such a case. In the left column, y and z interfere since y is live across the definition of z. While x is live across the definition of y, we are certain that there is no interference since it is clear that x and y have the same value. The right column shows the result of coalescing x and y. We have rewritten the code in terms of xy and removed the now-useless copy. When the interference graph is updated, xy will interfere with z, since the result of a coalesce is made to interfere with all the neighbors of the coalesced ranges. However, it is clear that xy and z do not interfere, so the interference graph is imprecise. Therefore, we must rebuild the interference graph from scratch to ensure accuracy. In practice, coalesce makes a complete pass through the intermediate code, coalescing wherever possible and updating the interference graph in the conservative fashion described above. If any copies are removed, the interference graph is rebuilt and more coalescing is attempted. This cycle repeats until no more copies can be removed. While the number of build-coalesce cycles is bounded by the number of copies in the code, convergence is usually quite rapid: typically two or three iterations suffice. See Section 8.6 for measurements on real code.
Uses of Coalescing
Much of the power and generality of the Yorktown allocator is due to the wide applicability of coalescing. Uses suggested by Chaitin and our own experience include:

- Removing copies introduced during optimization allows use of simpler forms of some optimizations. For example, coalescing is extremely useful in cleaning up the copies resulting from the removal of φ-nodes after using SSA form [29].
- Coalescing can be used to achieve targeting, which attempts to compute arguments in the correct register for passing to a called procedure.
- In a called procedure, coalescing enables easy handling of incoming arguments passed in registers.
- Similarly, coalescing enables easy handling of the operands and results of machine instructions with special register requirements; e.g., a multiply instruction that requires its operands to be in a particular pair of registers.
- Coalescing enables natural handling of common 2-address instructions; e.g., instructions of the form rx ← rx + ry, where the destination is constrained to match the first operand.
2.2.5 Spilling
The roughest possible version of spill would spill a live range l by inserting a store after every definition of l and a load before every use of l. Chaitin et al. give two important refinements to this coarse approach. First, they note that certain live ranges are easy to recompute; for example, live ranges defined by constants. These live ranges should not be stored and reloaded; instead, they should be recomputed before each use. Of course, it is trivial to "recompute" a constant, but the technique also applies to certain expressions involving the frame pointer and the constant pool. Second, it is not necessary to spill around every mention of a live range. Chaitin describes several situations that should be handled using local analysis. If two uses of a spilled live range are close together, it is unnecessary to reload for the second use; simply use the same register for both uses. If a use follows closely behind the definition of a spilled live range, there is no need to reload before the use. Similarly, if all uses of a live range are close to the definition, the live range should not be spilled.
In Chaitin's work, two references are considered "close" if no live range goes dead between them. Alternatively, if the last use of any interfering live range occurs between two references, those references are considered distant. See Section 8.7 for details and the accurate computation of spill costs.
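The cost estimate from Section 2.2 (each load or store weighted by 10^d, where d is the loop nesting depth) can be sketched as follows. The reference representation, a list of (kind, depth) pairs, is a hypothetical simplification; the accurate computation is the subject of Section 8.7.

```python
def spill_cost(refs, load_cost=1, store_cost=1):
    """Estimate the cost of spilling a live range: one load per use and
    one store per definition, each weighted by 10**depth, where depth
    is the loop nesting depth of the reference."""
    total = 0
    for kind, depth in refs:
        unit = load_cost if kind == 'use' else store_cost
        total += unit * 10 ** depth
    return total
```

A use two loops deep thus counts one hundred times as much as a definition at the routine's top level, matching the 10^d weighting described earlier.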
2.2.6 Coloring
The core of the Yorktown allocator is the coloring algorithm. Since the problem of finding a k-coloring for an arbitrary graph is NP-complete, we rely on non-optimal heuristic techniques.4 When reading modern descriptions of graph coloring heuristics, it is easy to forget the difficulty of devising good heuristic approaches to difficult problems. Schwartz presents two algorithms (one attributed to Cocke, the other to Ershov) illustrating some of the early attempts [59, pages 327–329]. The heuristic employed in the Yorktown allocator was devised by Chaitin [19]. It requires O(n + e) time, where n is the number of live ranges to be colored and e is the number of edges in the interference graph. Chaitin also shows how his heuristic can be used to accomplish both coloring and spilling in an integrated fashion [18]. In our framework, Chaitin's coloring heuristic is distributed between simplify and select. Why does it work? Simplify repeatedly removes nodes from the graph and pushes them on a stack. In select, the nodes are popped from the stack and added back to the graph. A node li is only moved from the graph to the stack if degree(li) < k. Therefore, when select moves li from the stack back to the graph, li will still have fewer than k neighbors. Clearly there will be a color available for li in that graph. Simplify only removes a node when it can prove that the node will be assigned a color in the current graph. As each live range is removed, the degrees of its neighbors are lowered. This, in turn, may prove that they can be assigned colors. In select, the nodes are assigned colors in reverse order of removal. Thus, each node is colored in a graph where it is trivially colorable; simplify ordered the stack so that this must be true. In one sense, the ordering forces the allocator to color the most constrained nodes first: li gets colored before lj precisely because simplify proved that lj was colorable independent of the specific color chosen for li.
As an example, consider finding a three-coloring for the simple graph shown in Figure 2.4. The left column (working from the top down) illustrates one possible sequence of simplifications. In the initial graph, degree(l1) < 3, so we are assured that a
4 Chaitin et al. give a proof that any graph can be generated by some routine.
Figure 2.4 Coloring a Simple Graph
[The left column, read from the top down, shows a sequence of simplifications of a six-node graph; the right column, read from the bottom up, shows select rebuilding the graph, assigning each node one of the three colors r, g, or b as it is reinserted.]
color will be available during select. When l1 is removed, the degrees of l2 and l3 are lowered. Since the degree of l2 is now < 3, it is removed in turn. The process is repeated until the graph is completely empty. No spilling is required in this case, since there are nodes with degree < 3 available at every step. The right column (working back up) shows how select reconstructs the graph, coloring each node as it is added to the graph. If simplify encounters a graph containing only nodes of degree ≥ k, then a node is chosen for spilling (see the next section). The spill node is removed from the graph and marked for spilling. One alternative at this point is to immediately insert spill code for the spill node, rebuild the interference graph, and attempt a new coloring. This approach is precise but expensive, since some routines may require spilling many live ranges. Chaitin uses a less precise approach, continuing simplification after choosing a spill node, potentially marking many nodes for spilling in a single pass.
Choosing Spill Nodes
The metric for picking spill nodes is important. Chaitin suggests choosing the node with the smallest ratio of spill cost to current degree:

    m_n = cost_n / degree_n    (2.1)

Note that degree_n is the current degree of the node n; that is, the degree of n in the partial graph remaining after removing all nodes of low degree. This metric reflects a desire to minimize total spill costs, coupled with a desire to simplify the graph by lowering the degrees of many nodes (the neighbors of the node n). Later work by Bernstein et al. at Haifa explores other spill choice metrics [9]. They present three alternative metrics:

    m_n = cost_n / degree_n^2    (2.2)
    m_n = cost_n / (degree_n * area_n)    (2.3)
    m_n = cost_n / (degree_n^2 * area_n)    (2.4)

In equations 2.3 and 2.4, area_n represents an attempt to quantify the impact n has on live ranges throughout the routine:

    area_n = sum over the instructions i at which n is alive of 5^depth_i * width_i    (2.5)

Here depth_i is the number of loops containing the instruction i and width_i is the number of live ranges live across the instruction i. The experiments of Bernstein et al. suggest that no single spill metric completely dominates the others. Therefore, they propose using a "best of 3" technique. They repeat simplify three times, each time with a different spill metric, and choose the version giving the lowest total spill cost. The reason behind the practical success of the "best of 3" heuristic is perhaps subtle. Choosing the best node to spill is NP-complete; therefore, we expect counterexamples for any spill metric. By using a combination of three, we gain some measure of protection from the worst-case examples. To an extent, we view the "best of 3" heuristic as a filter that helps to smooth some of the NP-noise in our results.5
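A compressed sketch of the "best of 3" choice follows. It is only the spill-candidate selection; in the actual technique, all of simplify is rerun under each metric, which the caller-supplied total_spill function abstracts here. The cost, degree, and area dictionaries are assumed precomputed.

```python
def best_of_three(nodes, cost, degree, area, total_spill):
    """Choose a spill candidate under each of the three metrics
    (equations 2.2-2.4) and keep whichever choice leads to the lowest
    total spill cost, as reported by total_spill."""
    metrics = [
        lambda n: cost[n] / degree[n] ** 2,                # eq. 2.2
        lambda n: cost[n] / (degree[n] * area[n]),         # eq. 2.3
        lambda n: cost[n] / (degree[n] ** 2 * area[n]),    # eq. 2.4
    ]
    candidates = [min(nodes, key=m) for m in metrics]
    return min(candidates, key=total_spill)
```

When the metrics disagree, the final min acts as the filter described in the text, discarding the choice that a rerun of simplify reveals to be more expensive.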
2.3 History
The idea of solving allocation problems by abstracting to graph coloring has a surprisingly long history. Ershov notes that early interest in graph theory among programmers (at least in the Soviet Union) was due to the reduction of storage packing problems to the graph coloring problem [33, page 174]. Historically, we see two partially overlapping threads: work in memory allocation and work in register allocation.
2.3.1 Memory Allocation
By memory allocation, we mean the problem of laying out storage for variables (scalars and arrays) in main memory so that they require minimal space. On early machines, this was an important concern, given their small memories. Apparently, the earliest work on memory allocation and graph coloring was published by Lavrov in 1961 [52]. The work is difficult to understand, hampered somewhat by translation, but more significantly by the lack of common vocabulary (e.g., there are no instructions, basic blocks, live ranges, or control-flow graphs; instead we see operators, routes, carriers, data paths, and areas of effect). Nevertheless, it is clear that the definition of an incompatibility graph is key to his approach.6 Inspired by Lavrov, Ershov also explored the correspondence between memory allocation and graph coloring [30]. A coloring-based memory allocator was described
5 The term NP-noise was coined by Linda Torczon to describe the annoyingly large variations in spill costs that occur with even the smallest adjustments to the coloring algorithm.
6 Those interested in reading Lavrov should certainly consult Ershov as a guide [33, pages 170–173].
as part of the Alpha compiler [31, 32]. It is also interesting to read the Editor's Note appended to Ershov's JACM paper:

    The attainment of "global memory economy" by means of the "inconsistency matrix" is a novel scheme for minimizing the number of storage cells assigned to variables. An incidence matrix (inconsistency matrix) is constructed which shows which variables may not occupy the same cell. This permits extreme compression of the storage area for variables.
    (Ascher Opler)

An extensive description of Ershov's work on memory allocation and graph coloring is included in his book [33]. Extensions to Ershov's work in this area were reported by Fabri [34, 35]. However, this general approach to conserving memory seems to be of less interest recently. There are perhaps several reasons: the relatively large, cheap main memories now available; increased reliance on stack-based allocation, with its naturally conservative approach (however, see [56]); and the almost exclusive use of separate compilation, making the whole-program analysis required for memory allocation seem painfully awkward. Nevertheless, the increasing importance of memory locality, together with approaches to convenient whole-program analysis [27], may lead to a renewed interest in the problems of packing main memory. It is important to note that the work of Lavrov, Ershov, and Fabri attacked the problem of packing arrays in memory. This is not a trivial extension of the scalar packing problem, nor is it naturally expressed as graph coloring. On the other hand, register allocation is a much closer fit to coloring.
2.3.2 Register Allocation
The idea of managing global register allocation via graph coloring is apparently due to Cocke [49, 19]. We find some limited discussion of graph coloring in the early 1970's; however, it seems to concentrate more on the search for useful coloring heuristics than on the problems arising in register allocation [59]. Chaitin points out that early work was fatally hampered by the relatively small memories available at the time [19].
The first complete global register allocator based on graph coloring was built by Chaitin and his colleagues at IBM [20]. Chaitin later described an improved coloring heuristic that handled the problems of coloring and spilling in a natural and coordinated fashion [18]. There has been a fair amount of work building on the foundation provided by Chaitin. We have reported an improvement to Chaitin's coloring heuristic [12]. Other improvements were introduced by Bernstein et al. [9]. Extensions to enable allocation of register pairs have been discovered by Nickerson and as part of our own work [57, 13]. Recently, we have described a technique for improving the accuracy of spill code estimation and placement [14]. An alternative form of global register allocation via graph coloring is described by Chow and Hennessy [22, 25, 26]. Their work, while based on coloring, differs in many respects from the work of Chaitin and his successors. They introduce the concept of live range splitting as an alternative to the spilling techniques originally used by Chaitin et al. The idea of live range splitting was independently discovered by Fabri in connection with her work on memory allocation [34]. Extensions and improvements have been reported by Larus and Hilfinger, by Chow, and by Gupta, Soffa, and Steele [51, 23, 41]. A recent paper by Callahan and Koblenz describes a hierarchical approach to global register allocation via coloring [16]. Their approach decomposes the control-flow graph into a tree of tiles, colors each tile individually, and merges the results in two passes over the tree. It represents an attempt to combine the precision of Chaitin's approach to allocation with a structured approach to live range splitting. In this thesis, we have avoided extensive comparisons with the work of Callahan and Koblenz.
This is not because we are unaware of their work or because we do not appreciate its value; rather, it is because their work was done largely in parallel with ours and we have little perspective. As the community gains experience with their work and ours, we expect to be better able to understand how they compare.
Chapter 3 Improved Coloring and Spilling
At the heart of a graph coloring allocator is the algorithm used for coloring. Since the problem of finding a k-coloring is NP-complete, the coloring algorithm must employ a heuristic. One of the many strengths of the Yorktown allocator is Chaitin's coloring heuristic, which attacks the problems of coloring and spill choice in an integrated fashion. However, since it is a heuristic approach to an NP-complete problem, we are not surprised to find examples where its performance is not optimal. In the next section, we present two examples where Chaitin's coloring heuristic is not optimal. The examples inspired a variation to Chaitin's heuristic, reported in Section 3.2, which offers significant improvements. The final two sections describe two further variations enabled by our new heuristic.
3.1 Problems
As a part of the IRn programming environment, we built an optimizing FORTRAN compiler [27, 17]. When the project was begun (many years ago), Chaitin's approach was new and elegant, so we decided to use it in our new compiler. While discussing details of the allocator, Ken Kennedy constructed a small example showing how Chaitin's heuristic could be forced to spill when a k-coloring was actually possible. Later, when our allocator was working, we discovered a second interesting example, this time resulting from real code. In this case, spill code would always be required; however, Chaitin's heuristic obviously forced more spills than necessary.7 The two examples, one small and one large, demonstrate a single weakness in Chaitin's heuristic. The next two subsections present and explain both examples; Section 3.2 shows how to overcome the problem.
7 The fact that the extra spills were obvious made it possible to detect that there even was a problem. Usually the sheer bulk of assembly code makes it difficult to detect such mistakes. After all, the code is correct; it simply runs a little slower than it might.
3.1.1 The Smallest Example
Suppose we want to find a 2-coloring for the graph shown in Figure 3.1. Clearly one exists; for example, x and y could be colored red and w and z could be colored green. If we apply Chaitin's heuristic, though, simplify is immediately forced to spill: there are no nodes with degree < 2. If we assume for the example that all spill costs are equal, then some arbitrary node can be chosen for spilling, for example x. After x is removed from the graph and marked for spilling, the remaining nodes are removed by simplify. This example illustrates that Chaitin's heuristic does not always find a k-coloring, even when one exists. Of course, we are not surprised, since the problem is NP-complete. The small size of the example is perhaps surprising. Of course, we might wonder how often such examples arise in real code, given the relatively large register sets typically available on modern processors.
3.1.2 A Large Example
In the process of isolating a bug elsewhere in the compiler, we carefully examined the code generated for a routine named SVD from the software distributed with Forsythe, Malcolm, and Moler's book on numerical methods [36]. The routine implements the singular value decomposition of Golub and Reinsch. It has 214 non-comment lines of code and 37 DO-loops organized into five different loop nests. The first loop nest is a simple array copy, shown at the top of Figure 3.2. Close examination of the code generated for SVD revealed that the allocator was spilling a large number of short live ranges in deference to the larger, longer live
Figure 3.1 A Simple Graph Requiring Two Colors
[A diamond on four nodes: x and y each interfere with both w and z, but x and y do not interfere, nor do w and z.]
Figure 3.2 The Structure of SVD
[subroutine SVD(M, N, ...) begins with a simple array copy (do I = 1, N; do J = 1, M; A(I, J) = B(I, J)), followed by four further loop nests containing many deeply nested loops and many long live ranges.]
ranges. The loop indices and limits of the array-copy loop were spilled despite the fact that there were several unused registers at that point in the code. After some study, we were able to understand why the register allocator over-spilled so badly and what situations provoked this behavior. After optimization, about a dozen long live ranges (parameters specifying arrays and their sizes) extend from the initialization portion, through the array copy, and into the large loop nests. During coloring, these live ranges restrict the graph so much that some registers must be spilled. The estimated spill costs for I, J, M, and N (the indices and limits on the array-copy loops) are smaller than those for the longer live ranges, quite properly, since I, J, M, and N are only used over a small range that is less deeply nested than the rest of the routine. Unfortunately, spilling I, J, M, and N does not lower register pressure in the large loop nests, and more live ranges must be spilled. Eventually, most of the longer live ranges have been spilled and coloring proceeds. The final result: the code has almost no register utilization during the array copy.
3.2 An Improvement
The two examples highlight different problems:

1. The allocator fails to find a two-coloring for the simple diamond graph. By inspection, we can see that the graph is two-colorable. The problem here is fundamental: the allocator uses too weak an approximation to decide whether or not x will get a color. In looking for a k-coloring, the allocator approximates "x gets a color" by "x's degree < k." This is a sufficient condition for x to get a color, but by no means a necessary condition. For example, x may have k neighbors, but two of those neighbors may get assigned the same color. This is precisely what happens in the diamond graph.

2. In SVD, the allocator must spill some live ranges. The heuristic for picking a spill candidate selects the small live ranges used in shallow loop nests because they are less expensive to spill. Unfortunately, spilling them is not productive -- it does not alleviate register pressure in the major loop nests. When the spill decisions are made, the allocator cannot recognize that the spills do not help.8 Similarly, the allocator has no way to retract the decisions. Thus, these live ranges get spilled and stay spilled.

While discussing the simple diamond graph, Kennedy pointed out that the coloring heuristic proposed by Matula and Beck will find a two-coloring of the diamond graph [54]. Their algorithm differs only slightly from Chaitin's approach. To simplify the graph, they repeatedly remove the node of smallest current degree, versus Chaitin's approach of removing any node n whose degree is less than k. After all nodes have been removed, they select colors in the reverse of the order of deletion, in the same fashion as Chaitin. Applied to the diamond graph, this heuristic generates a two-coloring. Chaitin's heuristic fails because it pessimistically assumes that all the neighbors of a node will get different colors.
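Matula and Beck's heuristic can be sketched in a few lines. This is an illustration in Python (not the thesis's implementation), assuming the diamond graph is the four-cycle on w, x, y, and z:

```python
# A sketch of smallest-degree-last ordering followed by greedy color
# selection in the reverse of the removal order.

def smallest_last_order(graph):
    """Repeatedly remove a node of smallest current degree."""
    degrees = {n: len(nbrs) for n, nbrs in graph.items()}
    removed, order = set(), []
    while len(order) < len(graph):
        n = min((v for v in graph if v not in removed), key=lambda v: degrees[v])
        order.append(n)
        removed.add(n)
        for m in graph[n]:
            if m not in removed:
                degrees[m] -= 1
    return order

def greedy_color(graph, order):
    """Assign colors in the reverse of the removal order."""
    color = {}
    for n in reversed(order):
        used = {color[m] for m in graph[n] if m in color}
        color[n] = next(c for c in range(len(graph)) if c not in used)
    return color

# The diamond graph: a four-cycle, where every node has degree 2.
diamond = {'w': ['x', 'z'], 'x': ['w', 'y'], 'y': ['x', 'z'], 'z': ['w', 'y']}
coloring = greedy_color(diamond, smallest_last_order(diamond))
assert max(coloring.values()) <= 1   # a two-coloring, as Matula and Beck predict
```

Note that with k = 2, Chaitin's simplify would stall immediately: every node has degree 2, so no node satisfies "degree < k" and a spill is forced.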
Using Matula and Beck's heuristic, we have the opportunity to discover when some of the neighbors of a node n receive the same color, leaving a spare color for n itself. Unfortunately, this scheme simply finds a coloring; there is no notion of finding a k-coloring for some k, and therefore no mechanism for producing spill code. For a register allocator, this is a serious problem. Many procedures require spill code -- their interference graphs are simply not k-colorable.
8 Rather, it cannot without expensive lookahead.
Thus, what is needed is an algorithm that combines Matula and Beck's stronger coloring heuristic with Chaitin's mechanism for cost-guided spill selection. To achieve this effect, we made two modifications to Chaitin's original algorithm:

1. Simplify removes nodes with degree < k in an arbitrary order. Whenever it discovers that all remaining nodes have degree >= k, it chooses a spill candidate. That node is removed from the graph; but instead of marking it for spilling, simplify optimistically pushes it on the stack, hoping a color will be available in spite of its high degree. Thus, nodes are removed in the same order as Chaitin's heuristic, but spill candidates are included on the stack for coloring.

2. Select may discover that it has no color available for some node. In that case, it leaves the node uncolored and continues with the next node. Note that any uncolored node would also have been spilled by Chaitin's allocator.

If all nodes receive colors, the allocation has succeeded. If any nodes are uncolored, the allocator inserts spill code for the corresponding live ranges, rebuilds the interference graph, and tries again. The resulting allocator is shown in Figure 3.3. Spill decisions are now made by select rather than simplify. The rest of the allocator is unchanged from Chaitin's scheme. In this form, the allocator can address both of our example problems. Deferring the spill decision has two powerful consequences. First, it eliminates some non-productive spills. In Chaitin's scheme, spill decisions are made during simplify, before any nodes are assigned colors. When it selects a node to spill, the corresponding live range is spilled. In our scheme, these nodes are placed on the stack as spill candidates. Only when select discovers that no color is available is the live range actually spilled. This mechanism, in effect, allows the allocator to reconsider spill decisions.
[Figure: the phases renumber -> build -> coalesce -> spill costs -> simplify -> select, with spill code feeding back from select into build]
Figure 3.3 The Optimistic Allocator
Second, late spilling capitalizes on specific details of the color assignment to provide a stronger coloring heuristic. In selecting a color for node x, it examines the colors of all x's current neighbors. This provides a direct measure of "does x get a color?" rather than estimating the answer with "is x's degree < k?" If two or more of x's neighbors receive the same color, then x may receive a color even though x's degree >= k.9 On the diamond graph, this effect allows the allocator to generate a two-coloring. Recall SVD. The live ranges for I, J, M, and N are early spill candidates because their spill costs are small. However, spilling them doesn't alleviate register pressure inside the major loop nests. Thus, the allocator must spill some of the large live ranges; this happens after the small live ranges have been selected as spill candidates and placed on the stack. By the time the small live ranges come off the stack in select, some of these large live ranges have been spilled. The allocator can easily determine that colors are available for these small live ranges in the early array-copy loops; it simply looks at the colors used by their neighbors. Optimistic coloring is a simple improvement to Chaitin's pessimistic scheme. Assume that we have two allocators, one optimistic and one pessimistic, and that both use the same spill-choice metric -- for example, Chaitin's metric of cost/degree. The optimistic allocator has a stronger coloring heuristic, in the following sense: it will color any graph that the pessimistic allocator does and it will color some graphs that the pessimistic allocator will not. When spilling is necessary, both allocators will spill the same set of live ranges, except when optimistic coloring helps. In those cases, our allocator will spill a proper subset of the live ranges spilled by Chaitin's allocator.
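The two modified phases can be condensed into the following sketch (illustrative Python with placeholder costs, not the thesis's implementation):

```python
# Optimistic variant: simplify pushes even high-degree spill candidates
# onto the stack; select makes the actual spill decision only when no
# color is free for a node.

def simplify(graph, cost, k):
    """Return a coloring order; spill candidates are pushed, not spilled."""
    degrees = {n: len(nbrs) for n, nbrs in graph.items()}
    remaining, stack = set(graph), []
    while remaining:
        low = [n for n in remaining if degrees[n] < k]
        if low:
            n = low[0]                        # any node of low degree
        else:                                 # all nodes have degree >= k:
            n = min(remaining,                # pick a candidate by Chaitin's
                    key=lambda v: cost[v] / degrees[v])   # cost/degree metric
        stack.append(n)                       # ... but push it anyway
        remaining.remove(n)
        for m in graph[n]:
            if m in remaining:
                degrees[m] -= 1
    return stack

def select(graph, stack, k):
    """Pop nodes and color them; leave a node uncolored if no color is free."""
    color, spilled = {}, []
    for n in reversed(stack):
        used = {color[m] for m in graph[n] if m in color}
        free = [c for c in range(k) if c not in used]
        if free:
            color[n] = free[0]
        else:
            spilled.append(n)                 # spill only on demonstrated failure
    return color, spilled

# On the diamond graph with k = 2, nothing spills:
diamond = {'w': ['x', 'z'], 'x': ['w', 'y'], 'y': ['x', 'z'], 'z': ['w', 'y']}
color, spilled = select(diamond, simplify(diamond, {n: 1 for n in diamond}, 2), 2)
assert spilled == []
```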
Note that the comparisons between the optimistic heuristic and Chaitin's heuristic are predicated on both versions of simplify removing the same nodes in a given situation. This won't necessarily happen, but the assumption is necessary for comparison. For our experimental comparisons (see Chapter 7), we have been careful to implement both versions so they remove nodes in the same order.10
9 Early versions of priority-based coloring considered only degree when assigning colors, despite having a single-pass algorithm where the actual colorings are available [22, 25]. Later descriptions correct this mistake [26].
10 At least, the order will be identical on the first trip through the build-color-spill loop. Later iterations will present different graphs for coloring.
Results
Optimistic coloring helps generate better allocations. In a few cases, this eliminates all spilling; the diamond graph is one such example. In many cases, the total spill cost for the procedure is reduced. In Section 7.1, we report results of a study comparing our optimistic allocator with our implementation of the Yorktown allocator. In a test suite of 70 FORTRAN routines, we observed improvements in 27 cases and a single loss (an extra load and store were required). Improvements ranged from tiny to quite large, sometimes reducing spill costs by over 40%. The single loss was disappointing, since we have claimed that the optimistic coloring heuristic can never spill more than Chaitin's heuristic. However, we must recall the structure of the allocator. After each attempt to color, spill code is inserted and the entire build-coalesce-color process is repeated. The optimistic coloring heuristic will perform as well as Chaitin's heuristic on any graph; but after spilling, the two allocators will be facing different problems. At least one independent confirmation of our results exists. Addition of the optimistic coloring heuristic to the back end of the IBM XL compiler family for the RS/6000 machines resulted in a decrease of about 20% in estimated spill costs over the SPEC benchmark suite [44]. The optimistic heuristic is superior theoretically, since it can never spill more than Chaitin's heuristic on a given graph. It is no more complex asymptotically than Chaitin's heuristic. Furthermore, it is no more difficult to implement than Chaitin's heuristic. Finally, our experimental results show that the improvement is significant on a large number of routines.
3.3 Limited Backtracking
Once we have the optimistic coloring heuristic, another refinement is possible. Recall select. Given a stack of nodes created by simplify, each node n is removed from the stack and added to the graph. After adding n to the graph, select first examines n's neighbors, noting which colors have been used, then chooses a different color for n. If no color remains from the k-palette, then n is left uncolored and will be spilled. As an alternative to simply leaving n uncolored, we can sometimes attempt a limited form of backtracking, recoloring a neighbor of n and thus freeing a color for
n. We note that such backtracking must be carefully constrained to avoid the chance of combinatorial explosion. By limiting backtracking to a single level, we can maintain our linear time bound for coloring. While examining n's neighbors, we can accumulate the number of uses of each color (rather than simply noting when each color is used) and note which neighbor uses each color. If no colors remain for n, we check for colors that have been used by only a single neighbor. If m is the only neighbor of n using a color c, then we try recoloring m. If successful, we can then give n the color c. Another possibility is trying to recolor several neighbors that all use the same color c. This seems to have less potential. For instance, it would be annoying to successfully recolor three neighbors only to have the fourth block the use of c.
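One-level backtracking inside select can be sketched as follows (an illustration, not the thesis's code; the example graph and node names are ours):

```python
# When no color is free for n, look for a color blocked by exactly one
# neighbor m and try to recolor m, freeing that color for n.

def free_colors(graph, color, n, k):
    used = {color[m] for m in graph[n] if m in color}
    return [c for c in range(k) if c not in used]

def select_with_backtracking(graph, stack, k):
    color, spilled = {}, []
    for n in reversed(stack):
        free = free_colors(graph, color, n, k)
        if free:
            color[n] = free[0]
            continue
        users = {}                               # which neighbors use each color
        for m in graph[n]:
            if m in color:
                users.setdefault(color[m], []).append(m)
        for c, ms in users.items():
            if len(ms) == 1:                     # c is blocked by m alone
                m = ms[0]
                saved = color.pop(m)
                alt = [c2 for c2 in free_colors(graph, color, m, k) if c2 != c]
                if alt:                          # recolor m, freeing c for n
                    color[m] = alt[0]
                    color[n] = c
                    break
                color[m] = saved                 # failed; restore and try next
        else:
            spilled.append(n)                    # no recoloring freed a color
    return color, spilled

# Example with k = 2: p forces b to color 1, a greedily takes 0, and n
# is stuck until backtracking recolors a.
graph = {'p': ['b'], 'a': ['n'], 'b': ['p', 'n'], 'n': ['a', 'b']}
color, spilled = select_with_backtracking(graph, stack=['n', 'b', 'a', 'p'], k=2)
assert spilled == []
```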
Results
The advantages of limited backtracking are its low cost, easy implementation, and the fact that it rarely loses.11 On the other hand, the results have been disappointing; limited backtracking almost never pays off. In Section 7.1, we compare an allocator with limited backtracking to the optimistic allocator. In (only) three routines out of seventy, limited backtracking offers a slight improvement. In the best case, there is a 1.2% reduction in spill cost. Why so little improvement? When a node is selected as a spill candidate, it is selected from a graph where every node has at least k neighbors (otherwise, simplify would have continued removing them before being forced to choose a spill candidate). Therefore, when select is unable to color a node n (where n is always a spill candidate), n has at least k neighbors and those neighbors have at least k neighbors. Therefore, any neighbor of n is relatively constrained and there is only a small chance that limited backtracking will be able to free a color.
3.4 Alternative Spill-Choice Metrics
In Section 2.2.6, we introduced the "best of 3" heuristic originally suggested by Bernstein et al. [9]. In essence, they run simplify three times, each time using a different spill-choice metric, and choose the ordering that gives the lowest spill cost.
11 We have never seen it lose, though such situations are conceivable. Saving an expensive spill now versus possibly saving less-expensive spills in the future is usually a safe bet.
This idea extends naturally to our optimistic coloring heuristic. We simply run the combination of simplify and select three times, being careful with our accounting, and choose the best result. The three specific metrics used by Bernstein et al. (Equations 2.2 through 2.4) are not important; we can invent many similar metrics, perhaps using more extensive combinations to give a "best of 10" approach. Of course, there is a tradeoff of increased compile time against the diminishing returns offered by such an attack. An attractive idea is to experiment with different spill-cost metrics, attempting to take advantage of the optimistic coloring heuristic. Rather than trying to remove nodes with high degree (hoping to greatly simplify the graph), we can try removing nodes of low degree. The hope is that a node of low degree (but still greater than k) will be more likely to color, since only a few of its neighbors need to overlap (or spill) to create space in the k-palette. There seem to be several possibilities:

- Search for a node n such that n's degree < k + t, where t is some small constant. If many such nodes are found, choose the one with lowest spill cost. If no such nodes are found, use one of the traditional spill metrics.
- Search for a node n that minimizes the product cost_n * degree_n.
- Search for a node that minimizes cost_n * degree_n * area_n.

We have performed a few limited experiments with these ideas; however, the results have been unimpressive. Why? Each time simplify must choose a spill candidate, any one of the spill metrics might indicate the best choice. For an entire sequence of spill choices (that is, for an entire simplification), it is unlikely that any one spill metric would be best for every choice. Each run of simplify in a "best of 3" or "best of 10" sequence is therefore a compromise -- a series of acceptable choices that work out reasonably well together. Additionally, there is a problem of diminishing returns.
By adding a "best of 2" choice, we expect some amount of improvement. With "best of 3", we expect a further (though smaller) improvement. As we continue, the rate of improvement will continue to slow. Furthermore, the improvements will appear on fewer and fewer routines. Nevertheless, there seems to be some possibility of improvement in this area for those patient enough to explore it thoroughly.
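The driver loop for a "best of N" scheme can be sketched as follows; the metric formulas and the toy allocator are illustrative stand-ins (not necessarily Bernstein et al.'s exact equations, and not the thesis's code):

```python
# Illustrative "best of N" driver: run an allocator once per spill-choice
# metric and keep the result with the lowest total spill cost.

metrics = [
    lambda cost, deg, area: cost / deg,
    lambda cost, deg, area: cost / deg ** 2,
    lambda cost, deg, area: cost / (area * deg),
]

def best_of_n(allocate, graph, cost, area, k):
    """Return (total spill cost, coloring, spilled) for the best metric."""
    best = None
    for metric in metrics:
        coloring, spilled = allocate(graph, cost, area, k, metric)
        total = sum(cost[n] for n in spilled)
        if best is None or total < best[0]:
            best = (total, coloring, spilled)
    return best

def toy_allocate(graph, cost, area, k, metric):
    """Stand-in allocator: 'spill' the node minimizing the metric."""
    n = min(graph, key=lambda v: metric(cost[v], max(len(graph[v]), 1), area[v]))
    return {}, [n]

g = {'a': ['b'], 'b': ['a']}
cost, area = {'a': 4, 'b': 1}, {'a': 1, 'b': 1}
total, coloring, spilled = best_of_n(toy_allocate, g, cost, area, 2)
assert spilled == ['b'] and total == 1
```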
3.5 Summary
The primary contribution of Chaitin's second allocation paper was a coloring heuristic able to make spill decisions based upon the structure of the interference graph. In this chapter, we have presented an improvement to Chaitin's heuristic. The improved heuristic is able to color more graphs with no spilling and able to color many other graphs with less spilling. The key difference is that the new heuristic optimistically attempts to color, even when faced with apparently complex graphs. We also presented two additional improvements made possible by the same insight that motivates our optimistic coloring heuristic. The optimistic heuristic is valuable. We have reported experimental results showing that the optimistic heuristic pays off in many real routines and can reduce spill costs by up to 40%. In contrast, limited backtracking and the alternative spill-choice metrics were less valuable.
Chapter 4 Coloring Register Pairs
Register pairs are a fundamental part of many machine architectures. For example, many machines use pairs of single-precision registers to hold double-precision values. The two most common constraints imposed on register pairs are requiring that the registers be adjacent (named with consecutive integers) and that an adjacent pair be aligned (typically requiring the first register to have an even number). Previous descriptions of this problem in the literature have been rather limited. In 1986, Hopkins described a method to handle the register-pair constraints that arise in the shift instructions on the ROMP microprocessor -- the engine in RT/PC workstations [43]. In 1990, Nickerson published a method for allocating structures into an aggregate set of adjacent registers. He observed that an allocator based on our 1989 paper (see Chapter 3) produced good allocations under his scheme [57, 12]. This chapter describes work performed independently of Nickerson and completed at approximately the same time. We consider various ways to represent register pairs in the interference graph. We show why Chaitin's coloring heuristic overestimates demand for registers and how the heuristic introduced in Chapter 3 naturally avoids this problem. Finally, we provide a simple rationale for deciding how many edges are required to correctly represent an interference between a single register and an aggregate set of registers.
4.1 Why Are Pairs Hard to Color?
In Chapter 3, we showed examples where Chaitin's coloring heuristic overspilled. When the program includes operations requiring pairs of adjacent registers, this overspilling is exaggerated. The reason is simple -- introducing pairs of registers requires modifying the interference graph or changing the way the allocator interprets it. In Chaitin's scheme, such modification distorts the allocation -- the allocator consistently overestimates register demand. This causes it to spill values in many cases where registers are available to hold them.
To help in the discussion, we use the following simple example to illustrate our points. Imagine four live ranges, a, b, c, and d, where a and b are single-precision values and c and d are double-precision values. Assume that b interferes with a and c and that c also interferes with d. The interference graph for this example looks like:
a --- b --- c --- d
We have drawn the nodes for c and d larger than those for a and b as a reminder that they require pairs; this reflects a fundamental fact that the allocator must handle. Such a graph can be viewed as a weighted graph; each node has an integer weight associated with it. In our problem, the weights correspond to the number of colors (or registers) needed for each node. From a graph-coloring perspective, a weighted graph is fundamentally harder to color than an unweighted graph. For example, consider the interference graphs that result from straight-line code -- they are interval graphs. A minimal coloring for an unweighted interval graph can be found in linear time [40, page 14]. In contrast, the problem of finding a minimal coloring for a weighted interval graph (which is identical to the ship-building problem) has been shown to be NP-complete even when the weights are constrained to be either 1 or 2 [40, page 204]. This is exactly the situation arising when allocating register pairs. In any case, a global register allocator must be prepared to handle procedures having more complex control flow.12 Fabri explored variations of this problem in the context of packing arrays in memory [34, 35]. While the work is interesting, it is not directly applicable to our problem. For example, her problem has no analog for the problems of spill choice and spill placement. Thus, for the purposes of register allocation, we have continued to work with allocators styled after Chaitin's work.
4.1.1 Unconstrained Pairs
Initially, consider the simplest case. Assume that the target machine places no adjacency or alignment restrictions on pairs. In this case, the graph shown previously
12 Recall that Chaitin et al. showed that any arbitrary interference graph can be generated by some procedure.
overconstrains the coloring; instead, we can simply handle the two halves of each register pair separately. This yields the following graph:
[Graph: a --- b; b connected to both c and c'; c --- c'; each of c and c' connected to each of d and d'; d --- d']
This simple graph suffices because the machine places no restrictions on a register pair beyond the obvious requirement that the two halves of the pair occupy different registers (the edges from c to c' and from d to d' embody this constraint). Because the simple constraints can be encoded directly into the interference graph without any additional interpretation, this graph can be colored directly with Chaitin's algorithm. It may overspill, but the overspilling will be limited to the kind found in programs containing no register pairs. Chaitin's method always finds a four-coloring for the example graph.
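The splitting of unconstrained pairs can be sketched as a small graph transformation (an illustration; the node names and adjacency-list encoding are ours):

```python
# Split each width-2 node n into ordinary nodes n and n'; the two halves
# interfere with each other, and each half inherits all of n's interferences.

def split_pairs(graph, width):
    """Return a plain graph where each width-2 node becomes n and n'."""
    def halves(n):
        return [n, n + "'"] if width[n] == 2 else [n]
    g = {h: set() for n in graph for h in halves(n)}
    for n, nbrs in graph.items():
        for m in nbrs:
            for hn in halves(n):
                for hm in halves(m):
                    g[hn].add(hm)
        if width[n] == 2:                     # the two halves must differ
            g[n].add(n + "'")
            g[n + "'"].add(n)
    return g

# The chapter's example: a-b, b-c, c-d, with c and d as pairs.
g = split_pairs({'a': ['b'], 'b': ['a', 'c'], 'c': ['b', 'd'], 'd': ['c']},
                {'a': 1, 'b': 1, 'c': 2, 'd': 2})
assert g['b'] == {'a', 'c', "c'"}
assert g["c'"] == {'b', 'c', 'd', "d'"}
```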
4.1.2 Adjacent Pairs
Extensions to handle adjacent register pairs correctly are more difficult. For the moment, assume that the target machine requires that pairs be allocated to aligned adjacent registers. The problem arises during color selection; the allocator must coordinate the colors for two nodes that appear unrelated. If it assigns a color to one and cannot assign an adjacent color to the other node, it must either reconsider the colors that it has already assigned or report complete failure. Neither of these is a good alternative. For select to reconsider colors requires backtracking, which can require exponential time. Reporting failure seems unhelpful; it provides no clear direction for recovery. Changing our representation for pairs appears to be the best alternative. We should consider treating the pairs as indivisible units and assigning the pair two colors in select. This gives us the following graph (more accurately, multigraph):
[Multigraph: a --- b; two edges between b and c; two edges between c and d]
It resembles our original graph from Section 4.1, with additional edges to represent necessary interferences. The simplicity of this representation is appealing. Intuitively, the extra edge between b and c reflects the additional constraint placed on b. Similarly, the extra edge between c and d balances their extra width.
Multigraph Representation
So, why have we moved to a multigraph representation, besides the intuitive appeal of the pictures? For stronger justification, we must consider the role edges play in the coloring process. First, edges represent interferences -- they are critical to the correctness of the resulting allocation. Second, they trigger spilling in simplify. Recall that simplify examines the graph and repeatedly removes nodes with fewer than k neighbors, where k is the number of available colors. A node having fewer than k neighbors always receives a color, independent of context. If, during the simplification process, a node always has k or more neighbors, simplify marks it for spilling. The number of neighbors is the node's degree. Thus, for the allocator to work correctly, a node's degree should accurately reflect its colorability. For register pairs, we must add enough edges to ensure proper behavior. Too few edges lead to a situation where simplify fails to reserve enough registers; too many edges lead to excessive spilling. The graph shown above correctly models the colorability of each node. Any interference that involves a value stored in a pair of registers adds two edges to the graph. Thus, the interference between b and c creates a pair of edges, as does the interference between c and d. This rule makes sense. On a four-register machine, two single registers that interfere with a register pair raise the pair's degree to four. Placed correctly, the singles could block allocation of registers to the pair. Similarly, three register pairs that all mutually interfere create a situation where all three have a degree of four. This reflects the fact that three register pairs cannot fit into four registers. Sometimes it is convenient to introduce an interference between a single register and one half of a pair. Often, one half of a register pair may be used as the source or destination of an operation. For example, the real half of a complex pair might be copied to another register.
In this case, the target of the copy should interfere only with the imaginary half of the complex pair. An interference between the target register and the real half of the pair would prevent coalesce from combining them and eliminating the copy.
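The edge-count rule for aligned pairs can be sketched as a degree-update helper; the names and encoding here are ours, not the allocator's:

```python
# An interference involving a pair contributes two edges, so a node's
# degree reflects the number of registers its neighbors can occupy.

def add_interference(degree, a, b, width):
    """Add an interference; widths are 1 (single) or 2 (aligned pair)."""
    e = max(width[a], width[b])               # 1-1: one edge; otherwise two
    degree[a] += e
    degree[b] += e

# Two singles interfering with one pair raise the pair's degree to 4:
degree = {'b1': 0, 'b2': 0, 'c': 0}
width = {'b1': 1, 'b2': 1, 'c': 2}
add_interference(degree, 'b1', 'c', width)
add_interference(degree, 'b2', 'c', width)
assert degree['c'] == 4

# Three mutually interfering pairs all reach degree 4 -- three pairs
# cannot fit into four registers:
degree = {'p': 0, 'q': 0, 'r': 0}
width = {'p': 2, 'q': 2, 'r': 2}
for a, b in [('p', 'q'), ('p', 'r'), ('q', 'r')]:
    add_interference(degree, a, b, width)
assert all(d == 4 for d in degree.values())
```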
A Problem
Unfortunately, Chaitin's coloring heuristic performs poorly on graphs of this form. The graph is more constrained because of the pairs of edges between pairs and their neighbors; this triggers earlier application of the spill mechanism. To elucidate this problem requires a somewhat more complex example. Consider a machine with eight single-precision floating-point registers that requires double-precision values to be allocated to adjacent pairs of registers. If we have a double-precision value, four single-precision values are sufficient to force a spill with Chaitin's allocator. The following picture shows why.
[Figure: registers 0 through 7, with the single-precision values w, x, y, and z spaced so that one falls within each aligned pair; d is connected to each of w, x, y, and z by two edges]
Assume that w, x, y, and z each have at least six other interferences. Faced with this situation, Chaitin's algorithm invokes the spill metric to choose a value for spilling. It selects the node that minimizes cost/degree and spills it. Of course, many of the possible assignments would leave space for all five values; for example, placing w, x, y, and z in the first four registers leaves the last four to hold d. Unfortunately, the early decision on spilling prevents the allocator from finding such an allocation. Why does this happen? There exist colorings of w, x, y, and z, like the one suggested above, that preclude d's allocation. The "extra" edges in the interference graph, the second edge from d to each of the other nodes, account for this possibility. In Chaitin's scheme, simplify constructs an order in which select is guaranteed to succeed. Such an order does not exist for the example, so simplify spills one of the values. Thus, in any region where there is strong competition for registers and a mixture of single registers and register pairs, the allocator will consistently overestimate the demand for registers and spill values for which registers are available.
4.2 The Optimistic Coloring Heuristic
Originally, we thought that this problem was endemic to all coloring allocators. Fortunately, Randy Scarborough asked us a question that caused us to consider the problem of adjacent pairs in the context of our optimistic coloring heuristic. The optimistic allocator behaves differently than the pessimistic allocator with respect to spilling adjacent pairs. Because it defers spill decisions into select, it only spills a node when it discovers that it cannot color the node. With adjacent pairs, the behavior is the same; it only spills a pair when it discovers that no adjacent pair is available. This change eliminates the overspilling that arises with Chaitin's heuristic. To see this more clearly, reconsider the graph that caused problems in Section 4.1.2.
[Figure: as before -- registers 0 through 7, with w, x, y, and z spaced among the aligned pairs and d doubly connected to each]
The pessimistic allocator would spill because the single-precision values w, x, y, and z might be assigned to registers in a way that precludes successful coloring of d. The optimistic allocator would simply push a node, say d, on the stack, because the actual coloring of w, x, y, and z may leave a pair available for d. This can happen in many ways; for example, w and x might be assigned the same color, y might be spilled, or y and z might be assigned to consecutive registers. In fact, the only way that d could be blocked is by an even spacing of the sort suggested by the figure. The optimistic allocator often succeeds on graphs where the pessimistic allocator fails. Simplify determines an order for assigning colors; it treats single nodes and pairs identically. The difference between them is encoded in the number of edges. Select makes the actual spill decisions; it spills a node only after discovering that it cannot find the needed color(s).
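Select's search for an aligned pair can be sketched as follows (an illustration; the neighbor colors are given as a set, and the register numbers match the eight-register example):

```python
# Scan the even-aligned pairs and take the first whose two colors are
# both unused by already-colored neighbors.

def pick_aligned_pair(neighbor_colors, k):
    """Return the first free pair (even, even+1), or None if d must spill."""
    for even in range(0, k - 1, 2):
        if even not in neighbor_colors and even + 1 not in neighbor_colors:
            return (even, even + 1)
    return None

# w, x, y, z packed into registers 0-3 leave the aligned pair (4, 5) for d:
assert pick_aligned_pair({0, 1, 2, 3}, 8) == (4, 5)
# An even spacing -- one single within each aligned pair -- blocks d entirely:
assert pick_aligned_pair({1, 3, 4, 6}, 8) is None
```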
4.3 Unaligned Pairs
A few architectures allow the use of unaligned adjacent pairs of registers. The interference graph required for this situation is slightly more complex than for the aligned case. For our continuing example, the following interference graph captures all of the needed properties:
[Multigraph: a --- b; two edges between b and c; three edges between c and d]
The extra edge (between c and d) is required to correctly trigger the selection of a spill candidate in simplify. The next two graphs help show why this third edge is necessary between unaligned pairs. Note that three pairs (x, y, and z) can be colored so that there is no adjacent pair of colors for d. The fact that d has nine edges will trigger the spill heuristic in simplify, causing it to select a spill candidate. Of course, the candidate will not necessarily be spilled -- this is the key difference between the optimistic and pessimistic approaches.
[Figure: registers 0 through 7, with the pairs x, y, and z placed so that only isolated single registers remain; d is connected to each pair by three edges]
Even two pairs and a single may be placed so that d cannot be colored, as shown below. Again, d's eight edges are sufficient to warn that d may have to be spilled during select.
[Figure: registers 0 through 7, with two pairs and a single placed so that no adjacent pair of registers remains free for d]
Finally, we note that requiring four edges between each interfering pair would be too conservative. This would suggest that only two pairs (say x and y) would suffice to preclude coloring d. These examples show that the extra flexibility offered by eliminating alignment restrictions complicates the allocation process by enlarging the graph. Furthermore, because of the additional constraints that it adds, it can actually lead to worse allocations.
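For unaligned pairs, the corresponding selection test accepts any adjacent pair of colors. A sketch, with illustrative placements (the specific register positions below are our examples, not taken from the figures):

```python
# Scan all adjacent pairs (c, c+1), aligned or not, for two free colors.

def pick_adjacent_pair(neighbor_colors, k):
    """Return the first free adjacent pair (c, c+1), or None."""
    for c in range(k - 1):
        if c not in neighbor_colors and c + 1 not in neighbor_colors:
            return (c, c + 1)
    return None

# One possible placement of three pairs -- (0,1), (3,4), (6,7) -- leaves
# only the isolated registers 2 and 5, so no adjacent pair remains for d:
assert pick_adjacent_pair({0, 1, 3, 4, 6, 7}, 8) is None
# An aligned packing of the same three pairs leaves (6, 7) free:
assert pick_adjacent_pair({0, 1, 2, 3, 4, 5}, 8) == (6, 7)
```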
4.4 Using Pairs for Memory Access
In some cases, it is desirable to use adjacent registers for loads and stores. For example, complex numbers are often represented in storage as a pair of adjacent single-precision floating-point numbers. On some machines, it is advantageous to load these values into an adjacent register pair using a double-precision floating-point load. Unfortunately, tying all subsequent uses of the component parts of the complex number to the adjacent registers restricts the allocator's freedom. Our compiler handles this issue by carefully shaping the code before the allocator sees it. It generates a double-precision load into an adjacent pair of virtual registers, and then immediately copies the component values into single registers.13 This allows the allocator to keep the values in adjacent registers at points where they are loaded and stored, while offering it the chance to keep them in non-adjacent registers during the rest of their lifetimes. It decides between these choices during coalescing, based on the structure of the interference graph. The allocator is the proper place to make this decision -- it relies on information that cannot be made available earlier in compilation. These ideas may become more important in the future. Architects can make more memory bandwidth accessible through the use of wider load and store instructions. For example, on Intel's i860 XP, the quadword floating-point load fld.q loads four registers at a time, allowing a program to move twice as much data as the doubleword version fld.d and four times as much as the singleword version fld.l [45]. Naturally, any appreciable use of this feature ties down a large number of registers, and allocating them carefully becomes important.
13 These extra copies are only inserted when loading and storing complex numbers. Ordinary double-precision values are managed with no extra copies, since extra copies would provide no additional flexibility.
Previous work by Nickerson focuses on allocating aggregate data structures to adjacent sets of registers [57]. Nickerson assumes that the data items must remain adjacent throughout their lifetimes and that the components of the aggregate can have different lifetimes. In our work, we handle cases where the components of an aggregate have different lifetimes by enforcing adjacency only at those instructions where it is required. Pairs are used at instructions that require them; when a component is used for some different lifetime, we separate out that non-adjacent use. This can be accomplished by copying the component into another register; this lets coalesce, simplify, and select determine whether or not to preserve adjacency. Note that this is only important for values having components with different lifetimes.
4.5 Summary
This chapter examines the problem of dealing with register pairs in a graph-coloring register allocator. To work with register pairs, the allocator needs a way to represent them in the interference graph; this entails either changing the interference graph or its interpretation. We have shown how to represent different sets of constraints: unconstrained pairs, adjacent pairs, and unaligned adjacent pairs. The key issue is determining the number of edges to add between an aggregate node and each of its neighbors. Our scheme extends in a straightforward way to larger aggregate register groupings. Unfortunately, when presented with a multigraph, Chaitin's allocator consistently overestimates the demand for registers. This results in allocations that underuse the register set, spilling values even when registers are available to hold them. An optimistic allocator avoids such overspilling. This allows the compiler designer to use a simple representation for adjacent register pairs without provoking underuse of the register set. The previous chapter has shown that the optimistic coloring heuristic produces better allocations than Chaitin's pessimistic heuristic; this chapter shows that optimism also improves the treatment of register pairs.
Chapter 5 Rematerialization
This chapter examines a specific problem that arises in global register allocation: rematerialization. When a value must be spilled, the allocator should recognize those cases when it is cheaper to recompute the value than to store and retrieve it from memory. While our discussion is set in the context of the Yorktown allocator, the same questions arise in all global allocators. The next section introduces the problem and suggests why it is important. We give a high-level view of our approach in Section 5.2 and describe the necessary low-level modifications to the allocator in Section 5.3. Results are discussed in Section 5.4.
5.1 Introduction
Chaitin et al. discuss several ideas for improving the quality of spill code [20]. They point out that certain values can be recomputed in a single instruction and that the required operands will always be available for the recomputation. They call these exceptional values never-killed and note that such values should be recalculated instead of being spilled and reloaded. They further note that an uncoalesced copy of a never-killed value can be eliminated by recomputing it directly into the desired register. Together, these techniques are called rematerialization. Many opportunities for rematerialization arise in practice, including: immediate loads of integer constants and floating-point constants, computing a constant offset from the frame pointer or the static data pointer, loads from a constant location in the stack frame or the static data area, and loading non-local frame pointers from a display [4, Section 7.4]. The values must be cheaply computable from operands that are available throughout the procedure.
[Figure 5.1 shows four versions of a code fragment containing two loops over p: the Source, the Ideal spill code, the code produced by the Yorktown allocator (the entire live range of p spilled), and the code expected from a splitting allocator.]
Figure 5.1 Rematerialization versus Spilling
Consider the code fragments shown in Figure 5.1.¹⁴ Examining the Source column, we note that p is constant in the upper loop, but varying in the lower loop. The register allocator should take advantage of this situation. Imagine that high demand for registers in the upper loop forces p to be spilled; the Ideal column shows the desired result. In the upper loop, p is loaded just before it is needed (using some sort of "load-immediate" instruction). For the lower loop, p is loaded just before the loop, again using a load-immediate. The third column illustrates the code that would be produced by the Yorktown allocator. The entire live range of p has been spilled to memory, with loads inserted before the uses and stores inserted after the definitions. The final column shows code we would expect from a "splitting" allocator [26, 51, 41, 16]; the actual code might be worse.¹⁵ Unfortunately, examples of this sort are not discussed in the literature on splitting allocators and it is unclear how best to extend these techniques to achieve the Ideal solution.
14. The notation [p] means "the contents of the memory location addressed by p."
15. In fact, our work on rematerialization was motivated by problems observed during our experiments with splitting.
5.2 Rematerialization
It is important to understand the distinction between live ranges and values. A live range may comprise several values connected by common uses. In the Source column of Figure 5.1, p denotes a single live range composed from three values: the address Label, the result of the expression p + 1, and (more subtly) the merge of those two values at the head of the second loop. The Yorktown allocator correctly handles rematerialization when spilling live ranges with a single value, but cannot handle more complex cases; e.g., the variable p in Figure 5.1. Our task is to extend the Yorktown allocator to take advantage of rematerialization opportunities for complex, multivalued live ranges. Our approach is to tag each value with enough information to allow the allocator to handle it correctly. To achieve this, we
1. split each live range into its component values,
2. propagate rematerialization tags to each value, and
3. form new live ranges from connected values having identical tags.
This approach allows correct handling of rematerialization, but introduces the new problem of minimizing unnecessary splits. The following sections describe how to find values, how to propagate tags, how to split the live ranges, and how to remove unproductive splits.
5.2.1 Discovering Values
To find values, we construct the procedure's static single assignment (SSA) graph, a representation that transforms the code so that each use of a value references exactly one definition [29]. To achieve this goal, the construction technique inserts special definitions called φ-nodes at those points where control flow paths join and different values merge. We actually use the pruned SSA, with dead φ-nodes (φ-nodes with no uses) eliminated [21]. A natural way to view the SSA graph for a procedure is as a collection of values, each composed of a single definition and one or more uses. Each value's definition is either a single instruction or a φ-node that merges two or more values. By examining the defining instruction for each value, we can recognize never-killed values and propagate this information throughout the SSA graph.
5.2.2 Propagating Rematerialization Information
To propagate tags, we use an analog of Wegman and Zadeck's sparse simple constant algorithm [64].¹⁶ We modify their lattice slightly to represent the necessary rematerialization information. The new lattice elements may have one of three types:
⊤ Top means that no information is known. A value defined by a copy instruction or a φ-node has an initial tag of ⊤.
inst If a value is defined by an appropriate instruction (never-killed), it should be rematerialized. The value's tag is simply a pointer to the instruction.
⊥ Bottom means that the value cannot be rematerialized. Any value defined by an "inappropriate" instruction is immediately tagged with ⊥.
Additionally, their meet operation ⊓ is modified in an analogous fashion. The new definition is:
any ⊓ ⊤ = any
any ⊓ ⊥ = ⊥
insti ⊓ instj = insti if insti = instj
insti ⊓ instj = ⊥ if insti ≠ instj
Note that insti = instj compares the instructions on an operand-by-operand basis. Since our instructions have at most 2 operands, this modification does not affect the asymptotic complexity of propagation. During propagation, each value will be tagged with an inst or ⊥. Values defined by a copy instruction will have their tags lowered to inst or ⊥, depending on the value that flows into the copy. Values defined by φ-nodes will be lowered to inst if and only if all the values flowing into the φ-node have identical inst tags; otherwise, they are lowered to ⊥. This process tags each value in the SSA graph with either an instruction or ⊥. If a value's tag is ⊥, spilling that value requires a normal, heavyweight spill. If, however, its tag is an instruction, it can be rematerialized by inserting the instruction specified by the tag. The tags are used in two phases of the allocator: spill costs uses the tags to compute more accurate spill costs and spill code uses the tags to emit the desired code.
16. The more powerful sparse conditional constant algorithm is unnecessary; by this point in the compilation, all constant conditionals have been folded.
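The modified lattice and meet can be rendered as a short sketch. The tag representation (tuples compared operand-by-operand) and the round-robin propagation loop are illustrative assumptions, not the thesis's implementation:

```python
# Illustrative sketch of the rematerialization lattice. A tag is TOP,
# BOTTOM, or a hashable description of the defining instruction
# (compared operand-by-operand via tuple equality). Names are assumptions.

TOP = "top"        # no information yet (copies and phi-nodes start here)
BOTTOM = "bottom"  # the value cannot be rematerialized

def meet(a, b):
    """Meet of two rematerialization tags."""
    if a == TOP:
        return b
    if b == TOP:
        return a
    if a == BOTTOM or b == BOTTOM:
        return BOTTOM
    return a if a == b else BOTTOM  # identical instructions, else bottom

def propagate(values, phi_args):
    """Lower each phi-defined value to the meet of its incoming tags.
    values: name -> tag; phi_args: phi-defined name -> argument names."""
    changed = True
    while changed:
        changed = False
        for v, args in phi_args.items():
            new = TOP
            for a in args:
                new = meet(new, values[a])
            if new != values[v]:
                values[v] = new
                changed = True
    return values
```

On the example of Figure 5.1, the load of Label keeps its inst tag, the increment is tagged ⊥, and the φ-defined merge of the two is lowered to ⊥, which is exactly the separation exploited in Section 5.2.3.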
[Figure 5.2 shows four columns tracing the example of Figure 5.1: the Source; the SSA form, with a φ-node for p and the values renamed p0, p1, and p2; the Splits column, with a copy for every φ-operand; and the Minimal column, with the single copy needed to isolate p0 from p12.]
Figure 5.2 Introducing Splits
5.2.3 Inserting Splits
After propagation, the φ-nodes must be removed and values renamed to recreate an executable program. Consider the example in Figure 5.2. The Source column simply repeats the example introduced in Figure 5.1. The SSA column shows the effect of inserting a φ-node for p and renaming the different values comprising p's live range. The Splits column illustrates the copies necessary to distinguish the different values without φ-nodes. The final column (Minimal) shows the single copy required to isolate the never-killed value p0 from the other values comprising p. We avoid the extra copy by noting that p1 and p2 have identical tags after propagation (both are ⊥) and may be treated together as a single live range p12. Similarly, two connected values with the same inst tag would be combined into a single live range. For the purposes of rematerialization, the copies are placed perfectly: the never-killed value has been isolated and no further copies have been introduced. The algorithm for removing φ-nodes and inserting copies is described in Section 5.3.1. In Chapter 6, we discuss the possibility of including all the copies suggested in the Splits column.
5.2.4 Removing Unproductive Splits
Our approach inserts the minimal number of copies required to isolate the never-killed values. Nevertheless, coloring can make some of these copies superfluous. Recall the Minimal column in Figure 5.2. If neither p0 nor p12 is spilled and they both receive the same color, the copy connecting them is unnecessary. Because it has a real runtime cost, the copy should be eliminated whenever possible. Of course, coalesce would remove all of the copies, losing the desired separation between values with different tags. So, we use a pair of limited coalescing mechanisms to remove unproductive copies: Conservative coalescing is a straightforward modification of the standard coalesce phase. Conceptually, we add a single constraint to coalesce: only combine two live ranges if the resulting single live range will not be spilled. Biased coloring increases the likelihood that live ranges connected by a copy get assigned to the same register. Conceptually, select tries to assign the same color to two live ranges connected by a copy instruction. Taken together, these two mechanisms remove most of the unproductive copies.
5.3 Implementation
The Yorktown allocator can be extended naturally to accommodate our approach. The high-level structure depicted in Figure 3.3 is unchanged, but a number of low-level modifications are required. The next sections discuss the enhancements required in renumber, coalesce, and select.
5.3.1 Renumber
Chaitin's version of renumber (termed "getting the right number of names") was based on def-use chaining. Long before our interest in rematerialization, we adopted an implementation strategy for renumber based on the pruned SSA graph. The old implementation has four conceptual steps:
1. Determine liveness at each basic block using a sparse data flow evaluation graph [21].
2. Insert φ-nodes based on dominance frontiers. Avoid inserting dead φ-nodes.
3. Renumber the operands in every instruction to refer to values instead of the original virtual registers. At the same time, accumulate availability information for each block. The intersection of live and avail is needed at each block to allow construction of a precise interference graph.
4. Form live ranges by unioning together all the values reaching each φ-node using a fast disjoint-set union. The disjoint-set structure is maintained while building the interference graph and coalescing (where coalesces are further union operations).
In our implementation, steps 3 and 4 are performed during a single walk over the dominator tree. Using these techniques, renumber completely avoids the use of bit-vectored data flow analysis. Despite the apparent complexity of the algorithms involved, it is fast in practice and requires only a modest amount of space (see Section 8.4 for more details on the implementation of renumber as well as measurements and discussion of required compile time and space). Because renumber already uses the SSA graph, only modest changes are required to support rematerialization. The modified renumber has six steps:
1. Determine liveness at each basic block using a sparse data flow evaluation graph.
2. Insert φ-nodes based on dominance frontiers, still avoiding insertion of dead φ-nodes.
3. Renumber the operands in every instruction to refer to values. At the same time, initialize the rematerialization tags for all values.
4. Propagate rematerialization tags using the sparse simple constant algorithm as modified in Section 5.2.2.
5. Examine each copy instruction. If the source and destination values have identical inst tags, we can union them and remove the copy.
6. Examine the operands of each φ-node. If an operand value has the same tag as the result value, union the values; otherwise, insert a split (a distinguished copy instruction) connecting the values in the corresponding predecessor block.¹⁷
Steps 5 and 6 are performed in a single walk over the dominator tree.
17. During the initial construction of the control flow graph, we insert extra basic blocks to ensure a unique predecessor wherever splits may be required.
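The fast disjoint-set union that underlies live range formation (and that coalesce reuses) can be sketched as follows; a textbook structure, shown here only to make the unioning concrete:

```python
class DisjointSet:
    """Minimal disjoint-set union, of the kind used to union all the
    values reaching each phi-node into a single live range."""
    def __init__(self):
        self.parent = {}

    def find(self, x):
        # Path halving keeps the trees shallow.
        self.parent.setdefault(x, x)
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]
            x = self.parent[x]
        return x

    def union(self, a, b):
        ra, rb = self.find(a), self.find(b)
        if ra != rb:
            self.parent[rb] = ra
```

In step 5, the source and destination of a copy with identical inst tags would simply be union-ed; in step 6, a φ-operand with the same tag as the result is union-ed with it, and a split is inserted otherwise.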
5.3.2 Conservative Coalescing
To prevent coalescing from removing the splits that have been carefully introduced in renumber, we must limit its power. Specifically, it should never coalesce a split instruction if the live range that results may be spilled. In normal coalescing, two live ranges li and lj are combined if lj is defined by a copy from li and they do not otherwise interfere. In conservative coalescing, we add an additional constraint: combine two live ranges connected by a split if and only if lij has < k neighbors of "significant degree," where significant degree means a degree ≥ k. To understand why this restriction is safe (indeed, conservative), recall Chaitin's coloring heuristic. Before any spilling, nodes of degree < k are removed from the graph. When a node is removed, the degrees of its neighbors are reduced, perhaps allowing them to be removed. This process repeats until the graph is empty or all remaining nodes have degree ≥ k. Therefore, for a node to be spilled, it must have at least k neighbors with degree ≥ k in the initial graph. In practice, we perform two rounds of coalescing. Initially, all possible copies are coalesced (but not split instructions). The graph is rebuilt and coalescing is repeated until no more copies can be removed. Then, we begin conservatively coalescing split instructions. Again, we repeatedly build the interference graph and attempt further conservative coalescing until no more splits can be removed. In theory, we should not intermix conservative coalescing with unrestricted coalescing, since the result of an unrestricted coalesce may be spilled. For example, li and lj might be conservatively coalesced, only to have a later coalesce of lij with lk provoke the spilling of lijk (since the significant degree of lijk may be quite high). In practice, this may not prove to be a problem, permitting a slight simplification of the entire process. Conservative coalescing directly improves the allocation. Each coalesce removes an instruction from the resulting code, a split instruction that was introduced by the allocator. In regions where there is little competition for registers (a region of low register pressure), conservative coalescing undoes all splitting. It cannot, however, undo all of the nonproductive splits by itself.
5.3.3 Biased Coloring
The second mechanism for removing useless splits involves changing the order in which colors are considered for assignment. Before coloring, the allocator finds partners: values connected by splits. When select assigns a color to li, it first tries colors already assigned to one of li's partners. With a careful implementation, this is no more expensive than picking the first available color; it really amounts to biasing the spectrum of colors by previous assignments to li's partners. The biasing mechanism can combine live ranges that conservative coalescing cannot. For example, li might have 2k neighbors of significant degree; but these neighbors might not interfere with each other and thus might all be colored identically. Conservative coalescing cannot combine li with any of its partners; the resulting live range would have too many neighbors of significant degree. Biasing may be able to combine li and its partners because it is applied after the allocator has shown that both live ranges will receive colors. At that late point in allocation, combining them is a matter of choosing the right colors. By virtue of its late application, the biasing mechanism uses a detailed level of knowledge about the problem that is not available any earlier in the process: for example, when coalescing is performed.
Limited Lookahead and Backtracking
Of course, biased coloring will not always succeed in assigning adjacent partners to the same register. The vagaries of the coloring process ensure that cases will arise when li and lj will be adjacent partners and be assigned to different registers. To help cope with these cases, we can add a final improvement to select. When selecting a color for li, the allocator can try to select a color that is still available for each of its adjacent uncolored partners. This increases the likelihood that biased coloring will succeed. We call this technique limited lookahead. Similarly, when selecting a color for li, the allocator may discover that none of the available colors matches its already colored adjacent partners. In this case, the allocator can try to change the colors assigned to those partners. We call this technique limited backtracking. Notice that this form of backtracking differs from the style suggested in Section 3.3. In that case, we were attempting to avoid spills; in this case, we are attempting to remove splits. Unfortunately, neither form of backtracking cooperates well with biased coloring. Having carefully selected a color for a node, perhaps matching a partner's color, it would be disappointing if some sort of backtracking disturbed our artfully arranged coloring. While it is possible to account for the direct effects of recoloring, the extended case involves exponential exploration of the graph. Therefore, in the presence of biased coloring, we attempt no backtracking. In practice, we try each heuristic in succession. First, we try to find a color matching a colored partner. If that fails, we try limited lookahead, seeking to avoid colors that uncolored partners cannot use. If all else fails, we take any free color.
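The full selection order (partner match, then limited lookahead, then any free color, with no backtracking) can be sketched as follows; all data structures here are illustrative assumptions:

```python
def biased_select(node, colors, assignment, neighbors, partners):
    """Pick a color for node, biased toward its partners.
    assignment: node -> color for already-colored nodes."""
    forbidden = {assignment[n] for n in neighbors[node] if n in assignment}
    free = [c for c in colors if c not in forbidden]
    if not free:
        return None  # no color available: the node must be spilled
    # 1. Match a colored partner if one of its colors is free.
    for p in partners[node]:
        if p in assignment and assignment[p] in free:
            return assignment[p]
    # 2. Limited lookahead: prefer a color still usable by every
    #    uncolored adjacent partner.
    for c in free:
        ok = True
        for p in partners[node]:
            if p in assignment:
                continue
            p_forbidden = {assignment[n] for n in neighbors[p]
                           if n in assignment}
            if c in p_forbidden:
                ok = False
                break
        if ok:
            return c
    # 3. Fall back to any free color.
    return free[0]
```

Because the bias only reorders the search among already-legal colors, it never causes a spill that an unbiased select would have avoided.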
5.4 Results
Our recognition and exploration of this problem was prompted by poor spilling observed in our experimental splitting allocator (see Chapter 6). However, Randy Scarborough pointed out that our approach was really orthogonal to the splitting question. Therefore, it seems natural to compare our new approach to the simpler scheme used in the Yorktown allocator and our optimistic allocator. In Section 7.1, we present a comparison of the optimistic allocator with the enhanced allocator described here. In our test suite of 70 routines, we observed improvements in 28 cases and two cases of degradation. One loss was small (2 loads, 2 stores, and an extra copy); the other was somewhat larger. Improvements ranged from tiny (after all, some routines may offer no opportunities for rematerialization) to reasonably large (typically reducing spill costs by 10 to 20%). It is possible to see a pattern of trading loads for load-immediates: we often see a fairly large reduction in load instructions offset by an increased number of load-immediate instructions. Since loads are usually more expensive, we win on balance. Typically, the number of stores and copies is also reduced. The reduction in copy instructions suggests that our various heuristics for removing unhelpful splits are "good enough."
5.5 Summary
The primary contribution of this chapter is a natural extension of Chaitin's ideas on rematerialization. We show how to handle complex live ranges that may be completely or partially rematerialized. We describe a technique for tagging the component values of a live range with correct rematerialization information. We introduce heuristics, conservative coalescing and biased coloring, that are required for good results. Finally, we report experimental results showing the effectiveness of our extensions.
Our work extends the work described by Chaitin et al. and recalls an approach suggested by Cytron and Ferrante [28]. Chaitin et al. introduce the term rematerialization and discuss the problem briefly. Because their allocator cannot split live ranges, they handle only the simple case where all definitions contributing to a live range are identical. Our work is a direct extension and is able to handle each component of a complete live range separately and correctly. Cytron and Ferrante suggest splitting based on (the equivalent of) the SSA. Their goal is minimal coloring in polynomial time, achieved at the cost of introducing extra copies. There is no direct discussion of rematerialization; indeed, they do not consider the possibility of spilling. In contrast, we are concerned primarily with the quality of spill code. Nevertheless, their work might be considered a direct ancestor of our approach. It is also interesting to compare our approach to other published alternatives; for example, the splitting allocator of Chow and Hennessy and the hierarchical coloring allocator of Callahan and Koblenz [26, 16]. The published work does not indicate how they handle rematerialization. It is possible that they make no special provisions, trusting their splitting algorithm to do an adequate job.¹⁸ Some colleagues have suggested the possibility of more extensive rematerialization, perhaps recomputing entire expressions to avoid excess spilling. The difficulty is avoiding the introduction of additional register pressure in the attempt to save a spill (which was due to excess pressure in the first place).
18. Inspired by a draft of our paper [14], Brian Koblenz has added rematerialization to their allocator.
Chapter 6 Aggressive Live Range Splitting
Consider the example shown in Figure 6.1. The left side sketches a pair of loops, each updating a variable. If we assume that each loop has only one register available for either x or y, then the right column illustrates the ideal allocation. The Yorktown allocator can never produce this ideal allocation; a value is either held in a register for its entire lifetime or it is spilled for its entire lifetime, with appropriate loads and stores inserted immediately before and after each use and definition. Since neither x nor y can be held in a register across both loops, the Yorktown allocator will spill both variables and the resulting code will require many more loads and stores than the ideal allocation. This chapter explores ways of extending the Yorktown allocator to handle problems similar to those illustrated in Figure 6.1. We propose an aggressive approach to splitting and consider alternative implementations.
[Figure 6.1 sketches two loops, the first updating x and the second updating y, beside the ideal allocation, in which x is stored and y reloaded between the loops so that each loop keeps its variable in the single available register.]
Figure 6.1 Splitting
6.1 Live Range Splitting
As an alternative to the "spill everywhere" approach used in the Yorktown allocator, Chow and Hennessy describe a scheme called live range splitting [26]. They observe that breaking a live range into several pieces and considering the pieces separately can produce an interference graph that colors with less spilling.¹⁹ Thus, when their allocator cannot assign a color to some live range li, it splits li into smaller live ranges, one for each basic block in which li appears. These new, smaller live ranges become independent candidates for coloring; eventually, they will be colored or spilled. To decrease the amount of fragmentation introduced by splitting, Chow and Hennessy also included a method for combining some of these small live ranges. After splitting a live range, their allocator examines the resulting set of smaller live ranges. If it finds two adjacent live ranges that would have degree < k when combined, they are pasted together. Live range splitting has several merits. The splitting process often creates live ranges of lower degree and the limitation on combining keeps degrees low. If an entire live range is spilled, as in Chaitin's work, its value will reside in a register only for trivial periods around each definition or use. Splitting allows the live range to stay in a register over longer intervals, often an entire block or, if combinations are possible, over several blocks. With luck, the new live range can be large enough to extend over all of an important construct, like an inner loop.
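Chow and Hennessy's recombination step can be sketched as follows; the representation (an ordered list of block-sized fragments and a precomputed would-be degree for each candidate merge) is an assumption for illustration, not their implementation:

```python
def recombine(fragments, degree_if_combined, k):
    """Paste adjacent fragments back together while the combined
    fragment would still have interference degree < k.
    fragments: fragment names in block order.
    degree_if_combined: (left, right) -> degree of the pasted fragment;
    missing entries default to k (no merge)."""
    result = [fragments[0]]
    for f in fragments[1:]:
        if degree_if_combined.get((result[-1], f), k) < k:
            result[-1] = result[-1] + "+" + f  # paste the pair
        else:
            result.append(f)
    return result
```

The degree < k test is what keeps each pasted fragment trivially colorable, at the price of fixing split points by degree alone rather than by program structure.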
6.1.1 Theoretical Difficulties
Two theoretically hard problems arise in splitting: choosing the right live ranges to split and the right places to split them. Chow and Hennessy use simple heuristics to attack both problems. They choose live ranges to split based upon failure of their coloring heuristic. Unfortunately, there is no assurance that this scheme will select the best live ranges to split. For example, in the code from Figure 6.1, their technique will fail to produce the ideal allocation. One of x and y will be colored successfully and the other will be split. However, the ideal allocation requires that both live ranges be split. Their allocator never backtracks to consider splitting a successfully colored live range.
19. Fabri used this same observation to improve the coloring in her work on storage optimization [34].
Splits are recombined based solely on degree; this can lead to unfortunate locations for split points. For example, Chow might build a live range that extends into a loop, but does not encompass the whole loop, thus requiring a split on the loop's back edge. Note that split points tend to become spill points; therefore, the correct placement of split points is crucial.
6.1.2 Practical Difficulties
There is a third problem we must consider: efficiency. Chow and Hennessy are able to perform live range splitting relatively efficiently; but to do so, they must sacrifice many of the desirable features of the Yorktown allocator (see Section 7.2). Retaining the precision and algorithmic efficiency of the Yorktown allocator is a challenge. The key problem seems to be the difficulty and expense of maintaining the interference graph as live ranges are split.
6.2 Aggressive Live Range Splitting
Our approach to all these problems is to aggressively split live ranges before attempting to color. This idea, combined with our earlier ideas for undoing excess splits (recall Sections 5.3.2 and 5.3.3), seems to offer a useful tack. The following sections provide more detail on splitting and the complications it introduces for the rest of the allocator.
6.2.1 Splitting
In our search for a splitting technique that produced good results with a reasonable running time, we were forced to reconsider the fundamental basis of the coloring approach to register allocation. The key insight is that the interference graph captures none of the structure of the control flow graph. In reducing the allocation problem to a coloring problem, the compiler loses almost all information about the topography of the code. There is no representation for locality. Estimates of execution frequency get factored into estimated spill costs, but because the information is computed over the whole procedure, it gives equal weight to both near and distant references. Thus, a live range that is heavily used in some critical inner loop may get spilled in deference to a value that is live across the loop and used in one or more distant but deeply nested loops.
In an effort to recapture geographic locality, we advocate:
1. finding those points in the code where we would like to spill, if spilling is actually required, and
2. splitting every live range at those points.
Thus, we avoid the difficult problem of picking the optimal live ranges to split by splitting all the live ranges that cross a split point. We choose split points based on the structure of the control flow graph. Of course, such a general statement admits many specific interpretations. Possibilities include: splitting at every basic block,²⁰ splitting around high-level control structures, such as loops and if-then-else statements, or as suggested in Chapter 5, splitting based on the SSA graph. In Section 6.4, we consider several alternatives in detail.
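The aggressive strategy itself is simple; a sketch, assuming the split points have already been chosen and liveness across each point is known (all names are illustrative):

```python
def split_all(split_points, live_across, new_name):
    """At each chosen split point, split every live range live across
    it by recording a distinguished copy: new <- old at that point.
    live_across: point -> live ranges crossing it.
    new_name: function producing the post-split name."""
    splits = []
    for pt in split_points:
        for lr in live_across[pt]:
            splits.append((pt, lr, new_name(lr, pt)))
    return splits
```

Because every crossing live range is split, no per-range selection heuristic is needed; the later conservative coalescing and biased coloring phases are what undo the splits that turn out to be unproductive.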
6.2.2 Spilling
Our approach to splitting has some subtle consequences. In the original interference graph, all of the live ranges are independent. After splitting, some of the live ranges are related: they are partners. Recognition and proper handling of partners is critical if the allocator is to produce high-quality spill code. For example, each set of partners should spill to the same location. Consider the example in Figure 6.2. The single live range in (a) is split in (b) by the introduction of a copy. The resulting live ranges, lj and lk, are partners. If lj is spilled, we should get (c). Alternatively, (d) illustrates the result of spilling lk. Note that each partner spills to the same location. Finally, (e) shows the result of spilling both partners. Now consider the costs for the sequence from (b) through (c) to (e). Moving from (b) to (c) costs one store and one load, but saves one copy. The transition from (c) to (e) saves one load at the split point and costs one load at the use point. No new instructions are required; instead, the load is effectively moved. Therefore, the cost
of spilling lk at (c) is determined by the relative loop nesting depth of the split point and the use. If the split point is nested more deeply than the use, it will be profitable to spill lk. We account for these situations while computing spill costs and inserting spill code. Additionally, we update spill costs incrementally during simplify. In terms of the example in Figure 6.2, if lj cannot be colored and must be spilled, the cost of spilling its partner is immediately adjusted, increasing the probability of spilling lk. Furthermore, if any live range has a negative spill cost, it will be spilled immediately and its partners' costs updated appropriately. The incremental adjustment to spill costs during simplify is really just a simple heuristic that has proven effective in practice. Since a node chosen as a spill candidate may not actually be spilled (indeed, we hope it is not), the adjustments are in some sense premature. We have also experimented with adjusting spill costs during select, when we actually mark nodes for spilling. However, the results were nearly always disappointing. The effect of this careful handling of partners is important. Aggressive splitting can divide long live ranges into long chains of partners. If one partner is spilled, it tends to drag its immediate partners along. Conversely, when a partner is kept in a register, it tends to hold its immediate partners in registers. Together with register pressure from competing live ranges, this works to force spill points out of loop nests.
20. This can be even more effective if we artificially limit the size of basic blocks to some relatively small size [51].
[Figure 6.2 shows five fragments, (a) through (e): a single live range; the split into partners lj and lk; the result of spilling lj; the result of spilling lk; and the result of spilling both. Each partner spills to the same memory location.]
Figure 6.2 Spilling Partners
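The incremental cost adjustment during simplify can be sketched as follows; the adjustment table and its values are illustrative assumptions (the real amounts come from the relative loop nesting depths of split points and uses):

```python
def mark_spill_candidate(costs, partners, lr, adjustment):
    """When lr is chosen as a spill candidate, lower each partner's
    spill cost, since spilling one side of a split saves the copy and
    moves loads and stores. A partner whose cost goes negative should
    be spilled immediately; those partners are returned.
    adjustment: (lr, partner) -> cost decrease (illustrative)."""
    must_spill = []
    for p in partners.get(lr, []):
        costs[p] -= adjustment.get((lr, p), 0)
        if costs[p] < 0:
            must_spill.append(p)
    return must_spill
```

This is the mechanism by which a spilled partner "drags" its immediate partners along, and by which a colored partner holds its neighbors in registers.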
[Figure 6.3 shows three versions of a small DO-loop referencing x: before splitting; after splitting x into a, b, and c; and after spilling a and c, where the splits become a store and a reload placed outside the loop.]
Figure 6.3 Splitting and Spilling
6.2.3 Cleanup
Consider the example in Figure 6.3. The left illustrates the code for a small DO-loop, where all details except for references to x have been omitted. The center illustrates the effect of splitting x into three ranges labeled a, b, and c. The right illustrates the effect of spilling a and c outside the loop. Note that the splits (copy instructions) are converted into loads and stores, just outside the loop as desired. Unfortunately, the store of b at the loop exit is unnecessary. While the specific case shown in Figure 6.3 is easy to recognize, the general problem is global in nature. Consider the example sketched in Figure 6.4. On the left side, a single live range has been split into four components separated by copy instructions. On the right side, the fourth component lz has been spilled. The question arises: How do we handle the copy z ← y? One possibility is to convert it to a store instruction; certainly the spill location in memory must have the correct value when z is finally loaded. However, suppose lw has also been spilled. In this case, the value in memory would already be initialized and the copy instruction can simply be deleted. The difficulty is that the correct handling of z ← y doesn't depend on ly, lz, or even their immediate partners.
[Figure omitted: a live range split into components lw, lx, ly, and lz, with lz spilled and lw possibly spilled; the question is whether the copy z ← y must become a store.]

Figure 6.4 Globally Unnecessary Stores
For best results, we would like to see such facts reflected immediately in the spill costs for each component, similar to the heuristic used to maintain spill costs for immediate partners. However, it seems difficult to accomplish this updating on the fly. Therefore, we simply insert redundant stores when they may be necessary, accepting the imprecision. However, we are able to detect and eliminate redundant stores in a separate pass by solving a global dataflow problem for each set of partners (recall that all the partners split from a single live range share the same spill location). Redundant stores are discovered and removed immediately after spill code has been inserted, in a separate phase called cleanup. Partially redundant stores are still a problem; see Figure 6.6 for an example. Note that this problem is apparently shared by other allocators that attempt splitting [26, 16]. Actually, the other allocators seem to accept the extra stores, with no attempt at later cleanup.21 It seems to be a difficulty inherent in splitting.
21 In conversation, David Callahan mentioned that they were aware of the problem and were considering solutions, though he knew of nothing better than our batch approach.
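The cleanup pass described above can be sketched as a forward, all-paths dataflow problem solved once per set of partners (all partners share one spill location). In this hypothetical formulation, avail_out[b] means the spill location is up to date at the exit of block b; a store is redundant when memory is already current on every path into its block. The block summaries (gen_store, kill) and the one-store-per-block assumption are simplifications for illustration, not the thesis's actual equations.

```python
# Sketch only: a round-robin fixed-point solver with optimistic
# (all-True) initialization, as is usual for all-paths problems.

def redundant_stores(blocks, preds, gen_store, kill):
    avail_out = {b: True for b in blocks}     # optimistic initialization
    changed = True
    while changed:
        changed = False
        for b in blocks:
            ins = preds[b]
            avail_in = all(avail_out[p] for p in ins) if ins else False
            # a block's exit is "current" if memory was current on entry
            # or the block stores, and the block does not redefine the value
            out = (avail_in or gen_store[b]) and not kill[b]
            if out != avail_out[b]:
                avail_out[b] = out
                changed = True
    return {b for b in blocks
            if gen_store[b] and preds[b]
            and all(avail_out[p] for p in preds[b])}
```

On a diamond where the entry stores and the join stores again, the join's store is reported redundant; a store reachable along only some paths (the partially redundant case mentioned above) survives, matching the limitation noted in the text.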
[Figure omitted: a flowchart of the splitting allocator: renumber → build → coalesce → split, then renumber → build → conservative coalesce → spill costs → simplify → biased select, with spill code insertion looping back and cleanup run after spilling.]

Figure 6.5 The Splitting Allocator
6.3 Implementation
Integrating these ideas into our allocator was a major task. Figure 6.5 shows a high-level view of the resulting allocator. Several points have changed from the optimistic allocator depicted in Figure 3.3. While there are many components, they may be partitioned into three major phases:

before splitting: This portion is lifted directly from our earlier implementations of the Yorktown allocator. The intent here is to use the unrestricted coalesce to remove any extraneous copies from the code before introducing new splits. Thus, before the splitter is ever invoked, the allocator will reduce the number of live ranges to some canonical set. In the simplified code, any remaining copy instructions are meaningful.

splitting: This is a generic splitting phase. For our experiments, we can employ any one of several possible splitting heuristics. Each splitter does some combination of dataflow analysis and control-flow analysis and introduces splits (distinguished copies) to guide the remainder of the allocation.

after splitting: This portion of the allocator is nearly identical to the optimistic allocator shown in Figure 3.3. The major differences are the use of conservative coalesce and biased select (already introduced to support rematerialization) and global cleanup after spilling.
The final phase of the allocator must carefully maintain the distinction between live ranges (discovered by the first phase) and partners (determined by splitting). Furthermore, the allocator must maintain the relation between partners and their original live ranges, since all partners split from a single live range should spill to the same location. This relation is also required by cleanup.
6.4 Splitting
We have considered many possible approaches to splitting. The next two sections describe several heuristic approaches that appeared attractive. Section 6.4.3 describes some important details common to all our implementations.
6.4.1 Loop-Based Splitting
Our exploration of live range splitting was motivated by the intuition that we often want to split live ranges around loops.
Splitting Around All Loops
Our initial approach was simple and aggressive:

1. Find all loops in the code using Tarjan's algorithm for testing reducibility [62]. For our experimental purposes, this approach is adequate; production implementations will require extensions to handle irreducible control-flow graphs.

2. Edges leading in and out of loops are marked as split points. We insert empty basic blocks at each split point (we refine this notion in Section 6.4.3).

3. Split all the live ranges that cross each split point by inserting copies in the new basic block.

Our early implementations were exploratory; many of the heuristics we now employ were discovered in the course of our experiments. For instance, our work on rematerialization was prompted by examples observed in practice: for example, in handling the many COMMON blocks in the SPEC program doduc. Despite some encouraging results, our overall impression is that splitting around all loops gives unstable results; that is, it performs well in some cases and poorly in other cases. Even more disappointing were the compile-time costs. In some cases, space and time requirements were increased by a factor of ten. While some increase in
compile-time or space might have been acceptable if we could count on better results, more expensive allocations and inferior results were not attractive. We have tried other, more conservative, possibilities. For example, we can split around outermost loops only, approximately splitting the routine into manageable pieces. Alternatively, we can split only around innermost loops, attempting to isolate computationally intensive areas in the code. However, all these approaches share the common weakness of ignoring all information about the location of uses and definitions.
Splitting Based on Inactivity
An attractive alternative is to split only inactive live ranges around loops. The intuition here is that some live ranges extend across a loop without being mentioned (used or defined) in the body of the loop. It seems clear that they should be spilled first if there is excess pressure in the loop. Therefore, we can do a simple scan of the code in each loop, accumulating the set of live ranges mentioned in the loop. Given this set, we can split any unmentioned live range at the entrance and exit of the loop. This simple approach extends naturally to loop nests. For each loop, we accumulate the same information. Then, working from the outermost loop inward, we split unmentioned live ranges around a loop, but only if the live range was not split around an enclosing loop. Figure 6.6 illustrates the desired effect on a nest of two DO-loops. In this case, x is referenced in the innermost loop; therefore, it is not split at all. There is a use of y in the outermost loop, but it is unreferenced in the innermost loop; therefore, y is split on the edges leading in and out of the innermost loop. There are no mentions of z in either loop; therefore, z is split around the outermost loop (but not the innermost). We implemented a version of our allocator that split unmentioned live ranges around loops. During limited tests, the results were almost uniformly disappointing. Reconsidering Figure 6.6 with a more sceptical eye, we can see possible reasons for the poor results: Splitting z: There was little benefit gained by splitting z. If the unsplit z is spilled, the results will be nearly identical to the result of spilling z′. Of course, if the outer loop is never executed, the split is preferred; but these cases are perhaps uncommon in practice (especially with FORTRAN routines). Splitting y: If y′ is spilled, the results will actually be worse than simply spilling the unsplit y. The two split points will be converted into a store and a load inside
[Figure omitted: a nest of two DO-loops before and after splitting; x is left unsplit, y is split into y and y′ around the inner loop, and z is split into z and z′ around the outer loop.]

Figure 6.6 Splitting Unmentioned Live Ranges
the outer loop, whereas spilling the unsplit y would only require a load in the loop. This case is annoying since y is not modified inside the loop; all but the first store will be useless. Note that cleanup is unhelpful in this case, since the store is required on the initial iteration. Handling this case optimally is difficult: it requires cloning the entire loop nest.22 Again, if the inner loop is never executed, splitting y is helpful. Of course, one loop nest is not an entire routine. It is certainly easy to construct cases where splitting y or z as shown in Figure 6.6 can be profitable. These points are simply mentioned to help illustrate the difficulties of proper splitting.23
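The inactivity heuristic described in this section can be sketched for a single loop nest. The data layout is hypothetical: each loop carries the set of live ranges live across it and the set mentioned anywhere in its body (including nested loops); visiting loops outermost-first, a range is split around the shallowest loop in which it is inactive.

```python
# Sketch only; 'live_across' and 'mentioned' stand in for the results of
# the per-loop scan described in the text.

def inactive_split_plan(loops):
    """loops: iterable of dicts with 'id', 'depth', 'live_across',
    'mentioned'; returns {loop id: set of ranges to split around it}."""
    already_split = set()
    plan = {}
    for loop in sorted(loops, key=lambda l: l['depth']):
        inactive = loop['live_across'] - loop['mentioned'] - already_split
        if inactive:
            plan[loop['id']] = inactive
            already_split |= inactive     # don't re-split inside the nest
    return plan
```

On the nest of Figure 6.6 this reproduces the behavior in the text: x, mentioned in the inner loop (and hence in the outer loop's body), is never split; z is split around the outer loop only; y is split around the inner loop only.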
6.4.2 Splitting Based on Dominance
An alternative is to split live ranges based on the location of φ-nodes in the pruned SSA graph. This idea was suggested by several people in discussions about splitting (including Jeanne Ferrante and Mike Lake). Furthermore, it is a natural extension to our work with rematerialization. Exploration of this alternative leads to several approaches based on the fundamental idea of dominance.
Dominance and Dominance Frontiers
Cytron et al. give a fast method for building the SSA graph based on the idea of dominance frontiers [29]. Dominance frontiers, as the name suggests, are based on the idea of dominance. In a flow graph (a directed graph with a designated node start), a node x dominates a node y if all paths from start to y include x. Additionally, if x ≠ y, then x strictly dominates y. In the control-flow graph for a large routine, a given basic block often dominates many other blocks; for example, the header of a loop will dominate all members of the loop. The dominance frontier of a node x, written DF(x), is the set of nodes y such that x dominates a predecessor of y but does not strictly dominate y. Notice that the last clause allows x to be a member of its own dominance frontier. Intuitively, the dominance frontier of x does not include nodes dominated by x; rather, it includes the nodes just outside the dominion of x.
22 This discussion suggests the possibility of splitting only undefined live ranges around loops; that is, live ranges that are not defined inside the loop. This is a conservative approach that avoids some of the problems of cleaning up extra loads. We have not yet explored this avenue.

23 Certainly these difficulties were not obvious to us when we began work on this problem.
To insert φ-nodes for a variable v, we find basic blocks containing definitions of v. For each such basic block b, we insert a φ-node for v in each block of DF(b). Since we are building a pruned SSA graph, we actually insert a φ-node in a block d only if v is live on entrance to d. Of course, a φ-node represents a new definition of v; therefore, further φ-nodes are inserted in DF(d). Cytron et al. give an efficient worklist algorithm for placing φ-nodes in this iterated dominance frontier.
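The worklist algorithm can be sketched as follows, in the style of Cytron et al. Here DF maps each block to its dominance frontier (assumed precomputed), and live_in is the set of blocks where v is live on entry, which yields the pruning; the encoding of blocks as plain identifiers is a simplification for illustration.

```python
# Sketch of phi-node placement in the iterated dominance frontier for a
# single variable v; def_blocks holds the blocks that define v.

def place_phi_nodes(def_blocks, DF, live_in):
    has_phi = set()
    worklist = list(def_blocks)
    ever_on_list = set(def_blocks)
    while worklist:
        b = worklist.pop()
        for d in DF[b]:
            if d in has_phi or d not in live_in:
                continue                  # pruned: v is dead on entry to d
            has_phi.add(d)
            if d not in ever_on_list:     # a phi-node is a new definition,
                ever_on_list.add(d)       #   so its frontier must be
                worklist.append(d)        #   processed as well
    return has_phi
```

For an if-then-else where both arms define v, the single join block receives the only φ-node, and the iteration stops because the join's own frontier adds nothing new.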
Splitting at Dominance Frontiers
Recall the discussion in Section 5.2.3. In that case, we were attempting to isolate never-killed values by inserting a minimum number of splits. We inserted splits on edges leading into a φ-node if the incoming value had a different tag than the value defined by the φ-node. In the present case, we simply insert splits on all edges leading to a φ-node. The effect is to isolate all the values in each live range, allowing the allocator to handle each value individually. This approach is attractive for several reasons. Of course, we already have all the required machinery as a result of our work on rematerialization. Furthermore, we avoid the awkwardness of finding loops (for loop-based splitting) in irreducible control-flow graphs. Finally, this approach is intuitively appealing because it depends on the structure of the control-flow graph and the location of definitions in the routine. It seems to achieve much of the effect of loop-based splitting with much less actual splitting. Additionally, we can hope to obtain some benefit in other regions of the control-flow graph; for example, around IF and CASE constructs. We were able to build a splitting allocator that split live ranges based on the pruned SSA graph. In Section 7.1, we compare the SSA-based allocator with the rematerializing allocator discussed in Chapter 5. From a total of 70 routines, we found a difference in 35 cases, with 21 improvements and 14 degradations. These numbers are pessimistic in that the improvements appear larger in magnitude than the degradations. On the other hand, the relatively large losses on twldrv and tomcatv are disappointing. We also note that the number of copy instructions almost always increases when using SSA-based splitting. This suggests that our heuristics for removing excess copies, while adequate for rematerialization, are less satisfactory in this case.
Splitting at Reverse Dominance Frontiers
Splitting at φ-nodes (or dominance frontiers) is appealing since it accounts for both control structure and the location of definitions; however, there is no allowance for the location of uses. Recall the example loop nest shown on the left side of Figure 6.6. If we attempt to split based on the location of φ-nodes, we get no splits at all. Remember that φ-nodes are inserted based on the location of definitions. In this example, the definitions of x, y, and z are all located in the first block, a block that dominates every other block in the graph. Again, this is only a small example, but we can imagine realistic cases. For instance, this same situation arises when parameters are defined in the entrance to a subroutine and remain constant throughout the routine. Since the entrance block of a routine certainly dominates the entire routine, none of the constant parameters will be split. This seems unhelpful; an effective splitting allocator should have some provision for splitting these live ranges. These considerations suggest additional splitting based on the location of uses and reverse dominance frontiers. The reverse dominance frontier for a node is defined to be the dominance frontier, but computed on the reverse control-flow graph; that is, the control-flow graph with all edges reversed. The algorithm for placing reverse φ-nodes would be similarly analogous to the algorithm invented by Cytron et al. for placing φ-nodes. To achieve the desired effect (splitting at forward and reverse dominance frontiers), we maintain two worklists simultaneously, with each node placement contributing to both worklists. Figure 6.7 shows two rather simple examples that help illustrate the effect of splitting at both forward and reverse dominance frontiers. In the upper example, we show a loop containing a use (and no definitions) of a live range x that is live across the loop.
The lower example illustrates the effect if x is dead at the end of the loop (any further uses would be masked by the second definition of x). Note that we avoid splits when the result would be unused. In both cases we have completely separated the uses and definitions; if desired, it would be possible to allocate a, b, and c to different registers. The intuition behind splitting at dominance frontiers is not to achieve complete separation of all uses and definitions (indeed, it does not); instead, it relates to the traditional use of dominators in optimization. Consider the lower half of Figure 6.7. If a is spilled, the load of b occurs at the split point, at a point leading inevitably to a use of b.
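A reverse dominance frontier is simply an ordinary dominance frontier computed over the reversed graph, and it can be sketched directly. In the reversed graph a forward fork becomes a join, so (following the dominator-tree-walking style of Cooper, Harvey, and Kennedy) each fork node contributes to the frontiers of runners walked up the post-dominator tree. The immediate post-dominator map ipdom is assumed precomputed; that assumption, and the graph encoding, are mine for illustration.

```python
# Sketch: RDF(n) collects the fork nodes "just outside" the region that
# n post-dominates.

def reverse_dominance_frontiers(succ, ipdom):
    rdf = {b: set() for b in succ}
    for b, outs in succ.items():
        if len(outs) >= 2:               # b is a fork in the forward graph,
            for runner in outs:          #   hence a join in the reversed one
                while runner != ipdom[b]:
                    rdf[runner].add(b)
                    runner = ipdom[runner]
    return rdf
```

On a diamond 1 → {2, 3} → 4, both arms have the fork in their reverse dominance frontier, which is where use-directed splits would be considered.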
[Figure omitted: two examples of a live range x split at forward and reverse dominance frontiers into components a, b, and c; in the second example x is dead at the loop exit and the final split is omitted.]

Figure 6.7 Splitting at Dominance Frontiers
We built another version of our splitting allocator that split live ranges at both forward and reverse dominance frontiers. Section 7.1 contains a comparison with our rematerializing allocator. The results are mediocre. Out of 70 routines, there were differences in 42 cases: 13 improvements and 29 degradations. Furthermore, many of the degradations were quite large. On the other hand, there was a 22% reduction in spill cost for tomcatv; in raw cycles, this improvement probably dominates all other effects in the entire thesis. Once again, we see significant losses due to excess copies.
6.4.3 Mechanics
There are many mechanical details that must be handled correctly while splitting to achieve best results. They are largely independent of any specific splitting heuristic.
Inserting New Basic Blocks
Conceptually, we would split live ranges on edges in the control-flow graph. In practice, splits may require actual instructions: copies or perhaps loads and stores. To accommodate the split instructions, we must create basic blocks along some of the edges in the control-flow graph. While it may be possible to create the new blocks when and where desired, we simply create them when the control-flow graph is constructed. After allocation, empty blocks can be quickly deleted.24 Recall that splits are always inserted before a join or after a fork. If the edge leading into a join has a source with no other successor, then we simply insert the split at the end of the source block. Similarly, if the edge leading from a fork has a destination with a single predecessor, then we insert the split at the beginning of the successor block. Therefore, new blocks are only required on edges whose source has more than one successor and whose destination has more than one predecessor. It may be difficult (in a practical sense) to split some edges. For example, in a FORTRAN routine, it seems difficult to split edges leading from an ASSIGNed GOTO to a join point. Fortunately, the use of ASSIGN is apparently rare. FORTRAN's computed GOTO statement, Pascal's CASE statement, and C's switch statement are all manageable, given an adequate intermediate representation.
24 Deletion of empty blocks is required anyway, since coalesce is sometimes able to delete every instruction in a basic block.
69
Splits to Avoid
It is clear that we ought to avoid splitting a live range immediately after it is de ned or immediately before it is used. In fact, if we recall the details of spill code generation, it becomes clear that we ought not split a live range if there have been no deaths between its de nition and the end of the block. Similarly, we avoid splitting a live range if there are no deaths between the beginning of the block and its use. Finally, we must be careful not to split a live range at both the beginning and end of a block if there are no deaths within the block.
Ordering Splits
Many split instructions may be inserted in a single block. Before spilling, the ordering of individual splits is unimportant; however, after some live ranges have been spilled, the order of resulting loads, stores, and the remaining splits is signi cant { stores should be scheduled rst, then splits (copies), and nally loads. The insight here is that a store is the end of a live range and a load is the beginning of a live range { by placing loads after stores, we minimize interferences. Unfortunately, we discovered this idea too late. None of our experimental implementations take advantage of the insight; instead, splits are inserted in some arbitrary order and the order remains unchanged while spilling.
6.5 Summary
The approach to spilling used in the Yorktown allocator (and in the variations discussed in earlier chapters) is quite coarse. It is easy to give examples where some form of live range splitting is desirable: Figure 6.1 illustrates a typical arti cial example; a more realistic example is provided by SVD (Figure 3.2). In this chapter, we have described an approach that allows ner control over spilling within the general context of the Yorktown allocator. The fundamental problem with the Yorktown allocator is that the reduction of the register allocation problem to graph coloring throws away a large amount of information; for example, the structure of the control ow graph. Of course, an advantage of the reduction to coloring is the distillation of large amounts of programmatic detail to the essential concept of interference. When there are adequate registers, the loss
70 of structure information is unimportant. When there are insu cient registers, the information about control structure could have been used to help minimize spilling. Our general approach is to split live ranges into ner components before attempting to color. Thus, instead of being forced to spill an entire live range, the allocator can simply spill the troublesome components individually. The bulk of our work has been exploring the consequences of our aggressive approach to splitting. We were able to identify a number of important details that must be handled correctly for best results. These have never been published before and it seems likely that other splitting allocators may be improved by correctly handling these cases. Our approach is sensitive to the heuristic employed for splitting. In Section 6.4, we considered several alternative heuristics. Despite the intuitions and rationalizations supporting each heuristic, none were completely satisfactory. While there were some notable successes on individual routines, each splitting heuristic seemed too unstable for production use. Furthermore, our heuristics for removing excess splits seemed inadequate. Of course, our implementations also su er from uncontrolled placement of splits, as discussed in the previous section. Future work will certainly begin with new implementations, taking advantage of our new ideas for split placement. Additionally, there are many unexplored heuristics for splitting; given the di culty in removing excess copies, we plan to explore splitting heuristics that perform less unnecessary splitting. In our implementation, splitting is only performed once; therefore, we can a ord to spend a fair amount of time deciding which live ranges to split and where to split them. The sensitivity of our allocator to the splitting heuristic (as shown by the widely varying results) emphasizes the need for accurate splitting. There are other approaches to live range splitting. 
Chow and Hennessy are able to accomplish splitting by sacri cing much of the precision and e ciency o ered by Chaitin's approach (see Section 7.2). Callahan and Koblenz also describe an approach based on a hierarchical decomposition of the control ow graph 16]. However, our experience suggests that other approaches to splitting may not be achieving the highquality code they desire. While studying the (many) problems discovered during the course of our experiments, we have often referred to published descriptions of other splitting allocators { neither the problems nor their solutions are discussed.25
25
The problems solved by rematerialization and cleanup are two good examples.
71 One possibility is that the problems have not been noticed, perhaps due to lack of comparison with other allocators. In our work, we have been able to compare against a highquality allocator; therefore, cases of poor performance are immediately exposed. Without reference to the standard set by the Yorktown allocator and our improvements, we would be unable to evaluate the performance of our approach.
72
Chapter 7 Measurements and Comparisons
There are many qualities to consider when comparing register allocators { implementation di culty, compiletime and space requirements, and allocation quality. Since our work largely concentrates on improving the Yorktown allocator, we are interested in comparing allocation quality; the implementation di culty and compiletime and space requirements are relatively constant across our variations. The next section presents a series of experiments used to compare the Yorktown allocator and several of the improvements discussed in the previous chapters. A second wellknown approach to global register allocation, prioritybased coloring, was introduced by Chow and Hennessy 26]. The nal section compares the Yorktown allocator with prioritybased coloring and presents an experiment used to study their compiletime behavior.
7.1 Measuring Allocation Quality
When building a register allocator for an optimizing compiler, we are primarily concerned with the speed of the generated object code. A high quality allocation will result in faster code than a low quality allocation. Usually, slower code will be the result of extra instructions; e.g., load and store instructions required for spill code.26 In this section, we describe a series of experiments designed to compare the allocation quality of various versions of our allocator. We perform comparisons by measuring the number of instructions executed by routines compiled using each allocator. The following sections present a description of the experiments followed by a summary and discussion of the results. The complete results are given in Appendix A.
26
On pipelined machines, register allocation may sometimes have an adverse impact on instruction scheduling; however, these considerations are beyond the scope of this work.
73
7.1.1 Methodology
To support our research, we have written an optimizing compiler for FORTRAN. The compiler is part of the IRn programming environment and includes support for interprocedural analysis and a variety of traditional optimizations 27]. We currently generate code for the IBM RT/PC and have experimental code generators for the Sparc, i860 and RS/6000. To experiment with register allocation, we have written a series of allocators that are independent of any particular architecture. The optimizer is organized as a collection of independent programs, each accomplishing a distinct transformation. During optimization, a routine is represented in a lowlevel intermediate form, ILOC. Each pass of the optimizer reads in a routine via stdin, performs the necessary analysis and transformations, and writes the result to stdout. Since each pass consumes and produces ILOC, they may be organized in any order, with speci c passes added, repeated, or omitted as desired. After optimization, which includes passes to accomplish the e ects of instruction selection, we run register allocation. The register allocator is also organized as an independent pass, consuming and producing ILOC, but it is always run after the optimization passes and may not be omitted or repeated. ILOC is a lowlevel intermediate language designed to allow extensive optimization. It resembles the assembly language for a simple RISC machine, with the addition of hooks for interprocedural information and certain higherlevel operations representing FORTRAN intrinsics. During optimization, the highlevel operations may be expanded to a sequence of lowerlevel operations or subroutine calls if required. The design of ILOC and the entire optimizer was heavily in uenced by the PL.8 compiler 6]. Each register allocator is built around the same basic framework. An ILOC routine that assumes an in nite register set is rewritten in terms of the target machine's register set, with spill code added as necessary. 
The stack frame size may be adjusted to accommodate spilled values and copy instructions may be added and deleted. The target register set is speci ed in a small table and may be varied (within limits) to allow convenient experimentation with a wide variety of register sets.
74 After register allocation, each ILOC routine is translated into a complete C routine. Each C routine is compiled and the resulting object les are linked into a complete program. There are several advantages to this approach: By inserting appropriate instrumentation during the translation to C, we are able to collect accurate, dynamic measurements. Compilation to C allows us to test a single routine (perhaps from a library) in the context of a complete program running with real data. We are able to perform our tests in a machineindependent fashion, potentially using a variety of register sets. Simply timing actual machine code is inherently machinedependent and tends to bury the important issues (loads and stores) under the sometimes massive costs of incidental oatingpoint computation. The idea was originally suggested to us by Hans Boehm as a way to avoid having to retarget the compiler for every new architecture we encountered (this approach was used by researchers at Xerox PARC to aid in porting Cedar 5]). We were further in uenced by Mills et al., who describe a compiled instruction set simulator 55]. During the translation into C, we are able to add instrumentation to accumulate the dynamic occurrences of any class of ILOC instruction. For the purposes of comparing register allocators, we are concerned with the number of loads, stores, copies, loadimmediates, and addimmediates.27 At the entrance of an instrumented routine, ve integer variables are initialized. At each instruction, the appropriate counter is incremented. When the routine exits, the values of all ve counters, along with the name of the routine, are printed to stderr. Figure 7.1 shows a small sample of ILOC code and the corresponding C translation. Usually there is a onetoone mapping between the ILOC statements and the C translations, though some additional C is required for the function header and declarations of the \register" variables; e.g., r14 and f15. 
Also note the simple instrumentation appearing immediately after several of the statements. The number of loadimmediates is accumulated in the variable i, the number of copies in c, the number of loads in l, the number of addimmediates in a, and the number of stores in s (not shown in the example). Of course, this code is simple, but the majority of ILOC is no more complex. However, some care is required in handling procedure call and return.
27
The numbers of loadimmediates and addimmediates are a ected by rematerialization.
75
LLE3: LLA4: nop ldi add mvf bc LLA4: LLE3: r11
r14 8 r9 r15 f15 f0 L0023 f14 f14 f15 r14 r7 ge r14 f14 f15 r14 r10 r7
r14 = (int) (8); i++; r9 = r15 + r11; f15 = f0; c++; goto L0023;
L0023: lddrr dabs dadd addi sub br
r9 f14 8 r14 N6
N7
L0023: f14 = *((double *) (r14 + r9)); l++; f14 = fabs(f14); f15 = f15 + f14; r14 = r14 + (8); a++; r7 = r10  r14; if (r7 >= 0) goto N6; else goto N7;
Figure 7.1 ILOC and C The Target Machine
For the tests reported here, our target machine is de ned to have sixteen integer registers and sixteen oatingpoint registers. Each oatingpoint register is capable of holding a doubleprecision value, so no distinction is made between singleprecision and doubleprecision values once they are held in registers. Integer register 0 and oatingpoint register 0 are both de ned to be 0. Integer register 1 is reserved as the frame pointer. Up to four integer registers may be used to pass arguments (recall that arguments are passed by reference in FORTRAN; therefore, the argument registers hold pointers to the actual values); any remaining arguments are passed in the stack frame. Function values are returned in an integer or oatingpoint register, as appropriate. Ten of each register class are designated as calleesaves; the remaining six (including the argument registers) are not preserved by the callee. When reporting costs, we assume that each load and store requires two cycles; all other instructions are assumed to require one cycle. Of course, these are simply convenient assumptions that re ect no actual machine architecture. The raw data is given in Appendix A to allow the interested reader to recalculate results for di erent relative instruction weightings.
Spill Costs
Since our instrumentation reports dynamic counts of all loads, stores, etc., we need a mechanism for isolating the instructions that arise due to register allocation decisions { after all, a typical routine will perform some loads and stores even with an in nite
76 register set. A di culty is that some spills are pro table. In other cases, the allocator removes instructions; e.g., copy instructions. Therefore, we tested each routine on a hypothetical \huge" machine with 128 registers, assuming this would give a nearly perfect allocation. The di erence between the \huge" results and the results for one of the allocators targeted to our \standard" machine should therefore equal the number of cycles added by the allocator to cope with insu cient registers. In reality (or when dealing with NPcomplete problems), problems arise. A few routines from our test suite require more instructions with the \huge" machine than the \standard" machine. In these cases (notably yeh, from the program doduc), we simply present raw results, with no attempt to determine spill contributions.
The Test Suite
Our test suite is a collection of seventy routines contained in eleven programs. Eleven routines are from Forsythe, Malcolm, and Moler's book on numerical methods [36]. They are grouped into seven programs with simple drivers. The remaining fifty-nine routines are from the SPEC benchmark suite [61]. Four programs were used: doduc (41 routines), fpppp (12 routines), matrix300 (5 routines), and tomcatv (1 routine). The two other FORTRAN programs in the suite (spice and nasa7) require language extensions not yet supported by our front end.
7.1.2 Results
In earlier chapters, we have described several improvements to the basic Yorktown allocator, giving a sequence of allocators. Similarly, we have organized our results as a sequence of comparisons: each variation is compared with the previous approach. We present five tables comparing six allocators: Table 7.1 shows a comparison of test results for the Yorktown allocator (Chaitin) and the optimistic allocator described in Chapter 3 (Optimistic). Table 7.2 shows the results of adding limited backtracking (as described in Section 3.3) to the optimistic allocator. In this case, the improvements were small, so the percentages are reported with an extra digit of precision. Table 7.3 shows the results of our improved approach to rematerialization (see Chapter 5). In this case, we compare the optimistic allocator using the Yorktown allocator's limited rematerialization to a version of the optimistic allocator using our enhanced approach to rematerialization.
Experimental results for our splitting allocator (see Chapter 6) are shown in Table 7.4 and Table 7.5. In each case, we compare against our best previous allocator (the optimistic allocator with our approach to rematerialization). Table 7.4 shows the effect of splitting based on the SSA form. Table 7.5 shows the results of splitting at both forward and reverse dominance frontiers. In each case, the tables show only routines where a difference was found (some routines were so short as to require no spill code). Furthermore, we have omitted routines when the differences comprised less than one percent of the spill code (approximately ten cases total, over the five tables). Each table is organized identically. The first two columns give the program and subroutine name. The third and fourth columns give the observed spill costs for the two allocators being compared. These costs are calculated from dynamic counts of instructions as described earlier. The last column (total) gives the percent improvement in spill costs from the old allocator to the new one, where

    % improvement = 100 * (old - new) / old

Therefore, large positive numbers indicate significant improvements. Middle columns show the contribution of each instruction type to the total. All percentages have been rounded to the nearest integer. Insignificant results are reported as 0 (or, for an insignificant but negative result, -0). In cases where the result is zero, we simply show a blank. Since results are rounded, the total entry may not equal the sum of the individual instruction entries. For an example, consider the first row in Table 7.1. This row presents results for the routine fmin from the program fmin. The Yorktown allocator generated an allocation requiring 551 cycles of spill code; the optimistic allocator required only 370 cycles. 16% of the savings came from having to execute fewer loads and 16% arose from fewer stores. A further insignificant fraction came from fewer load immediates. The total improvement was 32%.
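The improvement formula and the rounding conventions used in the tables can be expressed compactly. The function names below are illustrative, not from the thesis; note that because each cell is rounded independently, a row's total need not equal the sum of its instruction entries.

```python
def percent_improvement(old, new):
    """Percent improvement in spill cost from the old allocator to the new:
    100 * (old - new) / old."""
    return 100.0 * (old - new) / old

def table_entry(old, new):
    """Format one table cell using the conventions described above:
    round to the nearest integer, report an insignificant positive
    result as '0', an insignificant negative result as '-0', and show
    a blank for an exact zero."""
    pct = percent_improvement(old, new)
    if pct == 0.0:
        return ""            # exact zero: blank cell
    r = round(pct)
    if r == 0:
        return "-0" if pct < 0 else "0"
    return str(r)

print(table_entry(200, 150))   # a 25% improvement prints as "25"
print(table_entry(999, 1000))  # a tiny regression prints as "-0"
```

A large positive entry therefore always indicates a significant saving, while "-0" flags a regression too small to matter.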
A few rows are distinguished by a ? in the total column. In these cases, we were unable to determine the spill contribution of the total cost, since the allocation for the "huge" machine required more cycles than the allocations for the "standard" machine. In these few cases, we simply show the total costs (all loads, stores, etc.) and report the actual cycles saved for each instruction type.
                                Cycles of Spill Code          Percentage Contribution
program     routine   Chaitin        Optimistic     load/store/copy/ldi/addi   total
fmin        fmin      551            370            16 16 0                    32
seval       spline    125            117            5 2                        6
solve       decomp    362            305            10 6 0                     16
svd         svd       2,509          1,977          16 7 -2                    21
doduc       colbur    25             19             8 8 8                      24
doduc       dcoera    29             15             21 21 7                    48
doduc       ddeflu    443            335            13 12 -0                   24
doduc       debflu    1,939          1,131          20 20                      42
doduc       debico    463            459            0 0                        1
doduc       deseco    5,500          4,957          4 2 -1 3 0                 10
doduc       drepvi    252            218            7 6                        14
doduc       ihbtr     452            400            6 6                        12
doduc       inithx    714            579            11 4 4                     19
doduc       integr    526            502            5                          5
doduc       lectur    257            221            11 2 2                     14
doduc       paroi     1,780          1,433          9 6 5                      20
doduc       prophy    1,954          1,531          13 9 0                     21
doduc       repvid    651            599            4 4                        8
doduc       supp      146            149            -1 -1 0                    -2
doduc       yeh       353            297            32 24                      ?
fpppp       d2esp     51             35             16 16 -2 2                 31
fpppp       efill     173            94             21 23 2                    46
fpppp       fpppp     1,472          1,444          1 1                        2
fpppp       twldrv    13,731,802     11,311,624     12 7 0                     18
matrix300   lbmk14    136            132            1 1                        3
matrix300   sgemm     12,321         9,905          10 10                      20
matrix300   sgemv     3,027          1,808          40 0 -0                    40
tomcatv     tomcat    394,397,732    367,995,733    3 3 -0                     7
Table 7.1 Effects of Optimistic Coloring
                                Cycles of Spill Code        Percentage Contribution
program     routine   Optimistic   Backtrack    load/store/copy/ldi/addi   total
doduc       ddeflu    335          327          1.2 1.2                    2.4
doduc       drepvi    218          214          0.9 0.9                    1.8
matrix300   sgemv     1,808        1,804        0.1 0.1                    0.2
Table 7.2 Effects of Limited Backtracking
                                Cycles of Spill Code         Percentage Contribution
program     routine   Optimistic   Rematerial   load/store/copy/ldi/addi   total
rkf45       fehl      68           50           26 7 -7                    27
seval       spline    117          102          10 2 2 -1                  13
solve       decomp    305          286          4 3 -1                     6
svd         svd       1,977        1,966        1 0 -0                     1
zeroin      zeroin    236          234          2 -1                       1
doduc       bilan     1,046        966          5 3                        8
doduc       bilsla    16           15           6                          6
doduc       colbur    19           24           -11 -11 -5                 -26
doduc       ddeflu    335          375          -5 -7 1 1                  -12
doduc       debico    459          418          6 0 1 2                    9
doduc       deseco    4,957        4,636        7 2 -2 0                   7
doduc       drepvi    218          175          4 14 0 2                   20
doduc       drigl     32           31           3                          3
doduc       heat      34           31           6 1                        9
doduc       ihbtr     400          395          1 0 -0                     1
doduc       inideb    50           48           4                          4
doduc       inisla    31           28           6 3                        10
doduc       inithx    579          437          17 10 -2                   25
doduc       integr    502          372          18 12 -3                   26
doduc       lectur    221          166          2 23                       25
doduc       orgpar    39           35           5 -3 8                     10
doduc       paroi     1,433        1,383        8 0 -1 -4                  4
doduc       pastem    289          220          20 10 13 -19               24
doduc       repvid    599          404          9 13 11                    33
doduc       yeh       297          290          4 4 2 -3                   ?
fpppp       d2esp     35           34           6 -3                       3
fpppp       main      210          199          0 5                        5
fpppp       twldrv    11,311,624   11,198,058   2 0 -1                     1
matrix300   sgemm     9,905        8,398        12 6 -3                    15
tomcatv     tomcat    367,995,733  355,039,258  4 0 -0                     4
Table 7.3 Effects of Rematerialization
                                Cycles of Spill Code         Percentage Contribution
program     routine   Rematerial   SSA          load/store/copy/ldi/addi   total
fmin        fmin      370          359          10 -7                      3
seval       spline    102          101          1 -0                       1
solve       decomp    286          278          3 3 -3 -1                  3
svd         svd       1,966        1,883        -0 4 1 -0                  4
zeroin      zeroin    234          143          39 12 -12                  39
doduc       cardeb    83           85           -2                         -2
doduc       colbur    24           25           -4                         -4
doduc       ddeflu    375          257          12 17 -2 4                 32
doduc       debflu    1,131        1,290        -3 -5 -7 1                 -14
doduc       debico    418          431          -3                         -3
doduc       deseco    4,636        4,363        2 3 0 0 -0                 6
doduc       drepvi    175          160          6 5 -2                     9
doduc       drigl     31           32           -3                         -3
doduc       heat      31           26           6 13 -3                    16
doduc       ihbtr     395          319          21 2 -3 -0                 19
doduc       inideb    48           25           48                         48
doduc       inithx    437          435          1 -0                       1
doduc       integr    372          383          1 -4                       -3
doduc       lectur    166          145          -2 14                      13
doduc       paroi     1,383        1,366        2 2 -0 -4                  1
doduc       pastem    220          218          12 1 -12                   1
doduc       prophy    1,525        1,678        -7 -3 -1 2                 -10
doduc       repvid    404          429          -6                         -6
doduc       saturr    106          114          -4 -4                      -8
doduc       sigma     12           13           -8                         -8
doduc       sortie    20           24           -10 -10                    -20
doduc       supp      149          137          4 4                        8
doduc       yeh       290          283          2 4 1                      ?
fpppp       d2esp     34           38           -6 -6                      -12
fpppp       efill     94           70           2 26 -2                    26
fpppp       main      199          198          1                          1
fpppp       twldrv    11,198,058   13,165,137   -11 -1 -5                  -18
matrix300   sgemm     8,398        6,308        21 7 -3                    25
matrix300   saxpy     46           38           9 9                        17
tomcatv     tomcat    355,039,258  406,856,259  -7 -7 -0                   -15
Table 7.4 Effects of SSA-Based Splitting
                                Cycles of Spill Code
program     routine   Rematerial    Dominance
fmin        fmin      370           427
rkf45       fehl      50            66
seval       spline    102           53
solve       solve     85            90
solve       decomp    286           308
svd         svd       1,966         2,991
zeroin      zeroin    234           128
doduc       bilan     966           1,038
doduc       cardeb    83            155
doduc       coeray    133           141
doduc       colbur    24            29
doduc       dcoera    15            11
doduc       ddeflu    375           298
doduc       debflu    1,131         1,273
doduc       debico    418           606
doduc       deseco    4,636         5,079
doduc       drepvi    175           194
doduc       drigl     131           134
doduc       dyeh      106           91
doduc       ihbtr     395           528
doduc       inideb    48            117
doduc       inithx    437           558
doduc       integr    372           447
doduc       inter     22            12
doduc       lectur    166           171
doduc       paroi     1,383         1,738
doduc       pastem    220           451
doduc       prophy    1,525         2,662
doduc       repvid    404           537
doduc       saturr    106           141
doduc       sortie    20            26
doduc       supp      149           110
doduc       yeh       290           269
fpppp       d2esp     34            28
fpppp       efill     94            133
fpppp       fmtgen    143           251
fpppp       gamgen    2,083         4,478
fpppp       twldrv    11,198,058    25,826,498
matrix300   sgemm     8,398         6,015
matrix300   sgemv     1,808         2,109
matrix300   saxpy     46            42
tomcatv     tomcat    355,039,258   276,432,455
Table 7.5 Effects of Splitting at Dominance Frontiers
7.2 Priority-Based Coloring
An important alternative approach to global register allocation via graph coloring was developed by Chow and Hennessy [25, 26]. Given the existence of competitive techniques, it is naturally interesting to compare them. Unfortunately, it is difficult to perform useful comparisons of global register allocators. Ideally, each would run on the same machine, target the same machine, use the same input language produced by the same optimizer, and be implemented and tuned with equal care. While we have performed a large number of comparisons under exactly these strictures (see Section 7.1), our comparisons are all between relatively similar allocators; adding a significantly different allocator (e.g., priority-based coloring) to the mix would require a large amount of time. In addition, it is difficult to achieve the best possible performance without extensive testing and tuning. Nevertheless, some comparisons are possible. In the next two sections, we present a brief discussion comparing priority-based coloring with the Yorktown allocator in terms of allocation quality, and the results of an experiment comparing the two allocators in terms of speed.
7.2.1 Allocation Quality
While both allocators are based on graph coloring, they differ in most respects. Chow and Hennessy have published another comparison of the Yorktown allocator and priority-based coloring; the interested reader should consult their paper for another viewpoint [26, pages 513-517].

Intermediate Representation   The Yorktown allocator works on code that has already been massaged into its final form; all optimization, address mode selection, and instruction scheduling has been completed (though instruction scheduling will be repeated after allocation) [6]. Chow's implementation of priority-based coloring runs much earlier in the compilation process, on a relatively high-level intermediate language [22]. This seems like a mere detail of implementation rather than a requirement of either allocation scheme. Certainly Larus and Hilfinger demonstrate that it is possible to do priority-based coloring on low-level code [51]. The advantage of using the low-level form is greater accuracy in allocation; the advantages of using the high-level form are allocation speed (typically many fewer live ranges must be handled) and greater machine independence.
The Yorktown allocator accepts code referencing an unlimited number of virtual registers. The virtual registers are used to hold temporaries generated by the optimizer and variables allocated by the front end. Note that the front end must load all values into registers before they may be referenced (recall our assumption of a load-store architecture). This exposes the load instructions to optimization; for example, common subexpression elimination and loop-invariant code motion. The result is that even a global variable may appear temporarily (perhaps across a loop) in a virtual register and therefore be subject to allocation. As a result of allocation, some of the virtual registers may be spilled to memory. Priority-based coloring (as described by Chow and Hennessy) takes a different approach. Before allocation, their intermediate code has all variables and temporaries in memory. As a result of allocation, some of these values are promoted to registers. The effect of these different viewpoints on allocation quality is unclear. From a practical standpoint, Chow and Hennessy note that their scheme produces working code without ever running the register allocator.28 Again, this is an implementation issue; Larus and Hilfinger's version of priority-based coloring is much closer to Chaitin's work in this respect.

Spilling and Splitting   When the Yorktown allocator must spill, it spills an entire live range at once (modulo the local spilling considerations described in Section 2.2.5). There is no possibility of splitting a live range into two or more pieces, some spilled and others kept in registers. The great power of priority-based coloring is exactly this ability to perform live range splitting. To support live range splitting, several tradeoffs seem to be required. The interference graph constructed by the Yorktown allocator is precise, whereas the interference graph used during priority-based coloring is relatively coarse. In priority-based coloring, two live ranges appear to interfere if they share a common basic block. This is conservative, since one might be live only at the beginning and the other live only at the end, with no actual overlap at the instruction level. As a result, the graphs built by the Yorktown allocator will tend to have a smaller coloring, leading to less spill code. On the other hand, the coarse graphs constructed by priority-based coloring are relatively easy to update during live range splitting.

28 On a load-store architecture, operands must appear in registers. In Chow's compiler, unallocated variables are loaded into temporary registers by the code generator whenever they are required as operands.
Coalescing   Given such a coarse interference graph, it seems impossible to incorporate coalescing into priority-based coloring. Recall that coalescing examines pairs of live ranges connected by a copy. If the live ranges do not interfere with each other, they can be coalesced and the copy removed. However, in priority-based coloring, two live ranges connected by a copy always interfere (since they share a copy instruction, they must share a basic block and therefore interfere in the coarse graph). We believe that coalescing is important. It is a powerful mechanism providing natural solutions to several real problems arising during code generation. Parameters passed in registers are naturally managed by coalescing. In priority-based coloring, these parameters are handled by a special-purpose mechanism called precoloring [26, Section 7.1]. Coalescing removes unnecessary copies; Chow relies on a separate optimization called copy propagation [22, pages 39-41]. We note that coalescing is copy propagation. Furthermore, coalescing is a particularly effective form of copy propagation because it is performed at a very late stage of compilation. Finally, we note that Chow's approach to global copy propagation uses iterative data-flow analysis to attack a problem that is inherently not rapid [47]. In this case, the problem is formulated so that a conservative solution is found quickly. A more precise solution could require longer analysis times. Chaitin et al. describe how to use coalescing to handle 2-address instructions. Similar extensions can be used to manage idiosyncratic instructions that require their operands in particular registers. Priority-based coloring seems less adaptable to these situations.

Coloring   Since the problem of finding a k-coloring for a graph is NP-complete, both allocators employ an efficient heuristic to determine an approximate solution. The Yorktown coloring heuristic is guided entirely by the structure of the graph; but spills are determined by a combination of graph structure and estimated spill cost. Priority-based coloring is guided by a combination of estimated profit and live range size. An attempt is made to color high-priority live ranges first, where priority is determined by the apparent profit of allocating a live range to a register, with preference given to shorter live ranges. This approach seems weaker because it ignores the structure of the graph. As a result, we would expect the Yorktown coloring heuristic to find better colorings with less spilling. Of course, this tendency is mitigated (or perhaps overwhelmed) by the benefits of splitting instead of spilling.
Coverage   Finally, we note that the Yorktown allocator controls the assignment of the entire register set. With priority-based coloring, the machine's register set is divided into three subsets:

locals   Live ranges that include only one basic block are allocated, before coloring, into the set of local registers.

globals   These are reserved for use by the coloring algorithm for live ranges that initially include more than one basic block.

temporaries   A final group of registers is reserved for use by the code generator, since a few temporaries may be required when expanding the high-level intermediate language statements into their low-level assembly language components.

The size of each subset is fixed, potentially forcing further inefficiencies in register usage.
7.2.2 Allocation Time
While it is certainly possible to determine the asymptotic complexity of many of the subcomponents of each allocator, it is difficult to draw useful conclusions about the relative speeds of each allocator in practice. In an effort to address this question, we have conducted an experiment attempting to determine the expected behavior of each allocator.
Methodology
The overall plan is to measure the time required to perform register allocation on a wide variety of routines, using both the Yorktown allocator and an allocator based on priority-based coloring. While we have written our own implementation of the Yorktown allocator, we rely upon Chow's own implementation of priority-based coloring that is embedded in the MIPS f77 compiler. This situation is not completely ideal. While we can give each compiler the same FORTRAN source, the code seen by the allocators is surely different in many respects. Perhaps more importantly, Chow's implementation allocates registers on fairly high-level intermediate code, whereas our allocator operates on code at the level of assembly language. The result is that our allocator must consider many more live ranges for a given routine. Note that this difference is a matter of implementation choice rather than an essential characteristic of the techniques.
Attempting to minimize unnecessary differences, we ported our code to a MIPS R3000, where all timings were performed. We configured our allocator to allocate registers for the MIPS register model (32 integer registers and 16 double-precision floating-point registers). Because our optimizer performs no loop unrolling, we suppressed loop unrolling in the MIPS optimizer to avoid a gross mismatch in the amount of work presented to each allocator. The command used for the MIPS compiler was

    f77 -c -O2 -Wo,-loopunroll,0 -Wo,-l,logfile sourcefile

where -Wo,-loopunroll controls loop unrolling and -Wo,-l enables logging of various optimizer statistics. We used the -Olimit flag when necessary to force optimization of exceptionally large routines. All tests used version 2.20 of the MIPS compiler. For test data, we collected 181 FORTRAN routines. The routines were drawn from several sources, including the SPEC benchmark suite, the LAPACK library, and the RiCEPS benchmark suite. The routines ranged in size from 48 bytes of object code to more than 41,000 bytes of object code.
Results
Figure 7.2 is a scatter plot summarizing the results of the experiment. Note immediately that both axes are log scales. The vertical axis represents the time required for allocation; the horizontal axis reflects the object code size (the size of the .text segment, as reported by the size command). The curves are created from the points of the scatter plot, smoothed by the lowess function provided as part of the S data analysis package [8, pages 497-498]. Each test was repeated ten times on an unloaded machine and the resulting times were averaged; each point in the scatter plot represents the average allocation time for a single routine. The timer granularity was 10 milliseconds. Reported times include cache miss times, but exclude paging [24]. The times shown for the Yorktown allocator include all control-flow and data-flow analysis, construction of the interference graph, coalescing, coloring, and spilling. The times shown for priority-based coloring are the sum of two times reported by the MIPS compiler:

reg alloc preparation   This is the time spent in data-flow analysis and building all the data structures for register allocation, including the interference graph.

global coloring   This is the time required for coloring and splitting.
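The repeat-and-average procedure described above can be sketched as a generic timing harness. This is an illustration of the methodology, not the thesis's actual instrumentation; the function name and repetition count are assumptions mirroring the text.

```python
import time

def average_time(fn, repetitions=10):
    """Time fn over several runs and return the mean wall-clock time in
    seconds. Averaging repeated runs helps smooth out a coarse timer
    (the experiment's clock had 10 ms granularity)."""
    elapsed = []
    for _ in range(repetitions):
        start = time.perf_counter()
        fn()
        elapsed.append(time.perf_counter() - start)
    return sum(elapsed) / len(elapsed)
```

Each point in the scatter plot corresponds to one such averaged measurement for a single routine.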
Note that we have not included the cost of copy propagation when reporting times for priority-based coloring, though the Yorktown allocator achieves much of the same effect via coalescing. There are two reasons. First, it was impossible to separate the cost of local copy propagation from other local optimizations. Second, Chow's copy propagator actually propagates expressions, achieving a larger effect than that achieved by simply coalescing [22, pages 39-41].
Discussion
While there are a variety of points to be made about the data shown in Figure 7.2, the speed comparisons are the prime concern. We see that priority-based coloring is far faster on the smallest routines, sometimes by as much as a factor of eight (remember we are using a logarithmic scale). At the other end of the scale, the Yorktown allocator is about six times faster than priority-based coloring on the largest routines. Of course, they meet in the middle, requiring similar time for routines with five to ten thousand bytes of object code (perhaps 300 lines of FORTRAN, depending on details of comments, declarations, and so forth).
Caveats   This data should be interpreted carefully, with due consideration for the following points. At the large end of the size scale, there are unfortunately few samples, primarily due to the difficulty of finding large routines. Furthermore, some of the largest routines (fpppp and twldrv) are somewhat artificial, with the bulk of their code contained in a few huge basic blocks. Nevertheless, we have specifically included them because of their notoriety among compiler writers. We must also remember that the two allocators implement different algorithms. Priority-based coloring performs live range splitting, which we expect to allow more precise control of spilling; i.e., better code quality. So, while priority-based coloring requires more time for large routines, the results may justify the effort.29

Complexity   On a log-log graph, a straight line is defined by the equation

    y = k * x^c

where c and k are constants defining the slope and position of the line, respectively.

29 Note that the superiority of priority-based coloring is not assured; recall the discussion in Section 7.2.1.
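The slope of such a line can be recovered from measured data by an ordinary least-squares fit of log(time) against log(size). The sketch below uses synthetic timings (the constant 3e-7 is arbitrary) to show how the fitted slope recovers the exponent c.

```python
import math

def fitted_exponent(sizes, times):
    """Least-squares slope of log(time) versus log(size). For data that
    follows t = k * n**c, the returned slope approximates the exponent c,
    i.e., the observed time complexity on a log-log plot."""
    xs = [math.log(s) for s in sizes]
    ys = [math.log(t) for t in times]
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    num = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    den = sum((x - mx) ** 2 for x in xs)
    return num / den

# Synthetic O(n^2) timings recover an exponent near 2.
sizes = [100, 500, 1000, 5000, 50000]
times = [3e-7 * n * n for n in sizes]
print(round(fitted_exponent(sizes, times), 2))  # 2.0
```

Fitting the measured allocation times this way is what lets the scatter plots below be read as estimates of each allocator's observed complexity.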
[Figure 7.2 here: a log-log scatter plot of allocation time in seconds (0.1 to 100.0) against object size in bytes (50 to 50,000); priority-based measurements are plotted as 'o' and Yorktown measurements as '+'.]
Figure 7.2 Allocation Times
The slope of the line is completely determined by the value of c. Abstracting to "big-O" notation, we obtain the equation y = O(x^c), where the constant c determines the slope. Therefore, by observing the slopes of lines representing the time required for allocation versus the size of the routine, we can determine each allocator's observed time complexity.

Priority-Based Coloring   Figure 7.3 repeats the graph shown earlier, but includes only the data for priority-based coloring. The dotted line illustrates the slope of a line parallel to y = x^2. The distribution of measurements suggests that the time required for priority-based coloring is, in practice, O(n^2), where n is the size of the routine, with a constant-time overhead dominating the smallest cases.

The Yorktown Allocator   Figure 7.4 shows only the data for our implementation of the Yorktown allocator. The dotted line illustrates the slope of a line parallel to y = x and the dashed line illustrates the slope of a line parallel to y = x log x. The plots suggest that, in practice, the time required by the Yorktown allocator is O(n log n), where n is the size of the routine. The flattening of the curve for the smallest routines is again explained by a constant-time overhead (a much greater overhead than required by priority-based coloring). The wide variation in times for a given size routine may be in part explained by recalling that the number of iterations of the two major loops in the allocator (the build-coalesce loop and the build-color-spill loop) will vary independently of the routine size. Chapter 8 explores these questions and additional performance details in greater depth.

Conclusions   Both techniques are fast, especially for small and medium-sized routines. Given the overall costs of compilation, there seems to be little excuse for not using one of the global allocation techniques in an optimizing compiler. Of course, as machines get faster, the time required for allocation will continue to shrink (consider the results in Chapter 8). Note also that we have considered only FORTRAN routines; presumably other languages with better facilities for expressing modularity will tend to have smaller routines, leading to even smaller allocation times in practice. For small routines, Chow's implementation of priority-based coloring shows that it can be very fast indeed.
[Figure 7.3 here: a log-log scatter plot of allocation time against object size for priority-based coloring alone, with a dotted reference line parallel to y = x^2.]
Figure 7.3 Observed Complexity of Priority-Based Coloring
[Figure 7.4 here: a log-log scatter plot of allocation time against object size for the Yorktown allocator alone, with reference lines parallel to y = x and y = x log x.]
Figure 7.4 Observed Complexity of the Yorktown Allocator
Unfortunately, our implementation of the Yorktown allocator seems slower. However, it is likely that this problem is susceptible to tuning.30 For large routines, the O(n^2) behavior of priority-based coloring makes its use problematic for production compilers. While tuning alone cannot improve its algorithmic complexity, it is possible that some form of "damage control" could be used to avoid explosive allocation times, trading away some allocation quality for improved allocation time.

30 To be fair, it should be noted that we worked hard to make our implementation fast on large routines. On the other hand, relatively little attention was paid to handling small cases.
Chapter 8 Engineering
To support our experiments, we have written an optimizing compiler for FORTRAN. The optimizer is organized as a collection of independent programs, each accomplishing a distinct transformation. Each routine is represented in a low-level intermediate form, ILOC. Each pass reads in a routine via stdin, performs the necessary analysis and trans