Register economy heuristic for a cycle driven multiple issue instruction scheduler6718541Abstract A method for scheduling operations utilized by an optimizing compiler to reduce register pressure on a target hardware platform assigns register economy priority (REP) values to each operation in a basic block. For each time slot, operations are scheduled in order of their lowest REP values. Claims What is claimed is: Description BACKGROUND OF THE INVENTION
{
g = (a + b) + (c * d) + (d - (e - f)); // evaluate some expression
}
This program has two specific features. The first is that among integer operations like "add", "sub" there is a multiply. For this example, it is assumed that "add", "sub" typically take 1 clock cycle to perform and that "mul" takes 2 clock cycles so that there are different operation latencies. The second feature is that variable "d" is used twice in the expression. This means that an optimizing compiler generates only one "Ld" for reading a value of variable "d" from memory and uses the resultant register as an operand both in "mul" and in "sub" operations. In this case, the Intermediate Representation (IR) of the expression in a compiler is not a balanced binary tree as it was in previous example, but a graph (FIGS. 2A and B). In the graph, operations are presented as nodes and dependencies between them as directed arcs in the form of a Data Flow Graph (DFG) which establishes a partial ordering on a set of operations of that expression. Partial ordering allows the expression of Instruction Level Parallelism (ILP) in the most convenient manner for operation scheduling. Any pair of operations which has no path between them is potentially parallel. For a given DFG and given operation latencies the hypothetical time to execute this program on a processor with unlimited parallel resources (ALCs) and unlimited register space can be calculated. FIG. 2A illustrates the calculation of the Earliest Start Time (EST) and Latest Start Time (LST) for each operation. At first the EST is calculated. For every operation having no predecessors the EST is assigned to zero (numbers near the top-right corner of each operation bar). Walking down through IR from each node to its successors an EST is assigned to every operation: EST(node)=max (EST(p1)+DEL(p1), . . . , EST(pm)+DEL(pm)). Here p1, . . . , pm are predecessors of the current operation, DEL(p1), . . . , and DEL(pm) is a delay (latency) of predecessor operation, i.e. time consumed to obtain its result. The next step is the calculation of the LST for each node as depicted in FIG. 2B. This calculation is done as follows. For last operation of the IR (add3) the LST is assigned the value of its EST. Walking up through the IR from each operation to its predecessors: LST(node)=min(LST(s1))-DEL(node), . . . , LST(sn))-DEL(node)). Here s1, . . . sn are successors of a current node. The LST is given in the figure near bottom-right corner of each operation bar. The EST and LST of the last operation of the IR of the expression is a theoretically minimum time (Tmin) to perform calculations of that expression in a processor with a given delay of operations. Operations having EST=LST are said to be critical, and are shaded in FIG. 2B, and other operations are noncritical. To keep the execution time equal to Tmin critical operations should be issued exactly at time slots corresponding to the EST=LST of that critical node. As to noncritical operations, each should be issued in some time slot between the EST and LST of that operation node. In general to keep the critical path close to Tmin the compiler (code generator) must schedule each operation not later than its LST. That's why the LST is taken as a priority parameter in well-known list scheduling algorithm. FIG. 3 is an example of the implementation of the Latest Start Time List Scheduling algorithm for a sample basic block being optimized. In FIG. 3 the earliest start time for each operation is indicated by an integer at the upper right-hand corner and the latest start time is indicated at the lower right-hand corner. The determination of earliest and latest start times is well-known in the art and is described in detail above. FIG. 3 includes a tree-like IR structure in the form of a data dependency graph (DFG), a ready list having ready instructions listed in time slots for each level of the DFG, a table indicating which ready ops are scheduled during the time slot, and a list of the number of registers allocated to scheduled instructions. As is well-known in the art, registers are released to the pool when the last use of the result of the scheduled instruction is resolved. The rightmost column in FIG. 3 shows how many registers are used (register pressure) in every time slot of the schedule (Table). Time is depicted at a leftmost column. Dashed arrows between operations in the schedule illustrate real result-operand value passing. Register pressure is a number of arrows crossing each boundary between adjacent time slots. Particularly for FIG. 3 two arrows from "Ld a", "Ld d" scheduled in time 0 cross the time boundary 0-1 (Regs=2), three arrows from "Ld d", "mul", "Ld a" cross 1-2 time boundary (Regs=3), and so on. Columns in the table marked ALC0 and ALC1 indicate two arithmetic channels of a hypothetical processor. For every clock cycle one operation of any kind (Ld, add, sub, etc.) may be issued to either ALC0 or ALC1. This is an illustration of hardware abilities similar for LST priority scheduling and REP scheduling. The Ready List is generated and modified during operation of the List Scheduling algorithm as depicted in the flow chart of FIG. 4. This algorithm is a time-driven way of placing operations into available operation slots of program code. The List scheduler takes into account properties of the Arithmetic-Logic Channels (ALCs) (how many channels are there, operation repertoire and latencies) and builds a timetable (tables in FIGS. 1 and 2) listing what operation is decoded every clock cycle and at which ALC. This process is performed for every Basic Block of the program being compiled. FIG. 3 illustrates scheduling of an example Basic Block using the Latest Start Time scheduling algorithm. At first the Ready list is initiated by operations having no predecessors (in this case Loads). The Ready List is ordered by some value, in FIG. 3 the latest start time, attributed for every operation which is given in parentheses at the right of the operation. At every time slot T the Ready List contains a set of operations which are ready to issue at this time T. This means that all the predecessors of the ready operation have been scheduled before and are finished (latency is over). Beginning from the head of the list ready operations are selected and ALC resources are allocated within current time slot T. The List Scheduling Algorithm implements a time-driven way of operation scheduling. Evolution of the IR scheduling is depicted in FIG. 3. At every current clock cycle beginning from 0 a set of ready operations is constructed (list of ready). The list is ordered in accordance with some priority which is shown in braces next to the operation in the list. Operations with smaller priority are located at the beginning of the list and will be selected to schedule first. In classical list scheduling algorithm this priority is equal to the LST. Once the ready list is established, the scheduler simply gets the first operation from the list and tries to allocate a free Arithmetic Logic Channel (ALC0 or ALC1) for it. If the operation is successfully scheduled, it's removed from the ready list. When all possible operations are scheduled at a current clock cycle, the algorithm goes to the next cycle. The ordered ready list is updated with regard to previously scheduled operations and their latencies. The process iterates until the ready list becomes empty. The resultant schedule takes 7 clock cycles and 5 registers to execute. For instance, in FIG. 3 at a time T=0 six operations are ready (Ld c, Ld d, Ld a, Ld b, Ld e, Ld f), but only two of them are scheduled: "Ld c".fwdarw.ALC0 "Ld d".fwdarw.ALC1. After scheduling, time is advanced to the next slot and the Ready List is updated for the next slot in accordance with results of scheduling. Successors of an operation scheduled in time T (or earlier) may become ready in time T+1. For instance (FIG. 3) in time T=1 operation "mul" becomes ready since both predecessors are scheduled at a time T=0 and have latencies=1. The process iterates until all operations of the Basic Block (BB) are scheduled. In FIG. 3, "Ld c" and "Ld d" have the lowest values of the latest starting time. These operations must be completed before "mul", "add2", "sub2", and "add3". Accordingly, "Ld d" is scheduled in time slot 0. Note that "Ld d" supplies data to "mul" and "sub2" as indicated by the dashed arrows in the chart. However, "sub2" is not scheduled by the latest start time listing algorithm until time slot 5. Accordingly, a register must be allocated for "Ld d" during time slots 0-5. As indicated by Regs column, this causes an extra register to be allocated at each time slot and increases register pressure. A preferred embodiment of the register economy heuristic relies upon the same scheduling algorithm, but a different priority (not an LST) for each operation. The new priority is calculated as shown in FIGS. 5A and B in two steps. At first, as depicted in FIG. 5A, for each node a number of dominating predecessors (NDPs) is calculated, including node itself: NDP=1+NDP(p1)+ . . . +NDP(pm). Here p1, . . . , pm are again predecessors. For nodes without predecessors NDP=1. This NDP value is calculated walking down through IR from each node to its successors. Values of NDP are given near top-right corner of each operation bar. Secondly, as depicted in FIG. 5B, a new Register Economy Priority (REP) is calculated for each operation. For last operation of the IR (add3) this value REP is assigned to 1. Also this value is assigned to a temporary counter CNT. Walking up through IR from each operation to its predecessors a temporary counter is incremented and assigned an REP(node)=CNT. For two or more predecessors the first predecessor having the greatest NDP is visited first. The REP value is given in the down side of FIG. 5B near the bottom-right corner of each operation bar. Note that operation "Ld d" is visited twice: once by path d3"-"sub2"-"Ld d" (and assigned a REP=8) and again by path "add3"-"add2"-"mul"-"Ld d" (and assigned a final REP=13). Actually, the value REP=8 is not assigned to "Ld d" since it wasn't the last visit to it from the remaining successor. FIG. 6 depicts a preferred embodiment of the invention which schedules a basic block according to a register economy priority list scheduling algorithm. Similar to the description of FIG. 3, at every time slot T the Ready List contains a set of operations which are ready to issue at this time T. This means that all the predecessors of the ready operation have been scheduled before and are finished (latency is over). However, in this case the priority of ready operations is determined by Register Economy Priority (REP) values, which are indicated at the lower right of each block. An NDP (number of predecessors) parameter is indicated at the upper right of each block. In FIG. 6, for time slot 0 all the load operations are listed in order of REP values (shown in parentheses after each operation in the ready list). These values indicate whether the operation must be held in a register for several time slots and are calculated according to an algorithm described above. "Ld a" and "Ld b" have the lowest REPs because these registers may be freed as soon as "add1" completes. "Ld d" has the highest REP because it cannot be cleared until "sub2" completes which in turn cannot be completed until "sub1" competes. Accordingly, "Ld d" is not scheduled until time slot 3, thereby reducing the register pressure in the previous time slots. For the example basic block (the same block depicted in FIGS. 2A and B ) the register pressure is reduced using the register economy priority listing algorithm. In this particular example, the reduced pressure is due to scheduling "Ld d" and "mul" later than in the example of FIGS. 2A and B. Thus, a register does not need to be allocated for "Ld d" during time slots 0-2 and fewer registers are required. On the other hand the overall execution time of the basic block has been increased due to the latency introduced by the "mul" operation. At time slot 0 all the loads are ready and "Ld a" and "Ld b" are scheduled in ALC0 and ALC1 because they have the lowest REPs. At time slot 1 the dependency of "add1" has been removed so that "add1" and "Ld c" have the lowest REPs and are scheduled next. A preferred embodiment of an Algorithm of REP value calculation is listed as two following recursive routines:
subroutine CalcNDP( operation cur_op )
{
operation cur_succ;
if( number_of_redecessors( cur_op ) == 0) {
NDP( cur_op ) = 1;
}
else {
NDP( cur_op ) = NDP( pred1 ) +...+ NDP( predn ) + 1;
}
endif
for( cur_succ in successors( cur_op )) {
CalcNDP( cur_succ );
}
}
integer cur_rep;
subroutine CalCREP( operation cur_op )
{
operation cur_pred;
NDP( cur_op) = cur_rep;
cur_rep = cur_rep + 1;
for( cur_pred in preds( cur_op )
from pred with max NDP
to pred with min NDP) {
CalcREP( cur_pred );
}
}
In the "for" loop some temporary variable `cur_pred` is declared to keep the current predecessor of current operation `cur_op`. Predecessors are assigned to `cur_pred` in order from the predecessor with the max NDP value (calculated previously in CalcNDP routine) to the predecessor with the min NDP value. If predecessors have equivalent values of NDP they are selected "from left to right" in the picture. When the current predecessor is assigned, the loop body is performed. It contains only the recursive invocation of CalcREP routine with a parameter `cur_pred` assigned. These two routines must be invoked once to calculate the NDP attribute for each operation of the BB in the following manner: CalcNDP(first_op_of_BB); cur.sub.13 ndp=1; CalcREP(last_op_of_BB); So the real NDP(Ld d)=13, not 8 since the second visit to this operation overwrites previously calculated value. A sequence of CalcREP invocations and FOR loop iterations for IR in FIGS. 4, 5 is given below.
Note two visits to "`Ld d`".
cur_ndp = 1
call CalcREP_1( add3 )
REP( add3 ) = 1
cur_rep = 2
for_1( cur_pred_1 in (add2(7), sub2(5)))
call CalcREP_2( add2 )
REP( add3 ) = 2
cur_rep = 3
for_2( cur_pred_2 in (add1(3), mul(3)))
call CalcREP_3( add1 )
REP( add1 ) = 3
cur_rep = 4
for_3( cur_pred_3 in (Ld a(1), Ld b(1)))
call CalcREP_4( Ld a)
REP( Ld a ) = 4
cur_rep = 5
exit CalcREP_4
call CalcREP_5( Ld b )
REP( Ld b ) = 5
cur_rep = 6
exit CalcREP_5
exit for_3
exit CalcREP_3
call CalcREP_6( mul )
REP( mul ) = 6
cur_rep = 7
for_4( cur_pred_4 in (Ld c(1), Ld d(1)))
call CalcREP_7( Ld c )
REP( Ld c ) = 7
cur_rep = 8
exit CalcREP_7
call CalcREP_8( Ld d ) // first visit to Ld d
REP( Ld c ) = 8
cur_rep = 9
exit CalcREP_8
exit for_4
exit CalcREP_6
exit for_2
exit CalcREP_2
call CalcREP_9( sub2 )
REP( sub2 ) = 9
cur_rep = 10
for_5( cur_pred_5 in (sub1(13), Ld d(1)))
call CalcREP_11( sub1 )
REP( sub1 ) = 11
cur_rep = 12
for_6( cur_pred_6 in (Ld e(1), Ld f(1)))
call CalcREP_12( Ld e)
REP( Ld e) = 12
cur_rep = 13
exit CalcREP_12
call CalcREP_13( Ld f)
REP( Ld f) = 13
cur_rep = 14
exit CalcREP_13
exit for_6
exit CalcREP_11
call CalcREP_13( Ld d) // Second visit to Ld d
REP( Ld d) = 13
cur_rep = 14
exit CalcREP_13
exit for_5
exit CalcREP_9
exit for_1
exit CalcREP_1.
In a preferred embodiment the invention may be implemented by software stored as program code on a computer readable media which may include magnetic storage media, optical storage media, signals modulated on electromagnetic waves, or other types of media. The program code is read from the media and executed by a digital processor. The invention has now been defined for the preferred embodiments. Alternatives and substitutions will now be apparent to persons of ordinary skill in the art. For example, different REP values can be assigned as long as the operations are ordered to reduce register pressure. Accordingly, it is not intended to limit the invention except as provided by the appended claims.
|
Same subclass Same class Consider this |
||||||||||
